SPAM EMAIL CLASSIFICATION USING SUPPORT VECTOR MACHINE (SVM) AND TF-IDF: A CASE STUDY WITH THE TREC 2007 AND ENRON-SPAM DATASETS

I Gusti Ngurah Darma Paramartha; I Made Ardi Sudestra; Adie Wahyudi Oktavia Gama; Gede Humaswara Prathama

doi:10.47111/jti.v19i2.22770

Authors

I Gusti Ngurah Darma Paramartha, Universitas Pendidikan Nasional
I Made Ardi Sudestra, Universitas Pendidikan Ganesha
Adie Wahyudi Oktavia Gama, Universitas Pendidikan Nasional
Gede Humaswara Prathama, Universitas Pendidikan Nasional

Keywords:

Spam Email Classification, Support Vector Machine (SVM), TF-IDF Vectorizer, Machine Learning, Classification

Abstract

Spam emails are a substantial problem in the digital landscape, burdening users with unsolicited messages. This study applies a Support Vector Machine (SVM) combined with a TF-IDF Vectorizer to classify emails as spam or non-spam. The model was developed on two publicly available, pre-processed datasets: the TREC 2007 Public Spam Corpus and the Enron-Spam Dataset. TF-IDF gives greater weight to terms that are infrequent yet pertinent, and SVM is well established as an effective method for text classification. Together they achieve an accuracy of 99.04%, a precision of 98.57%, and a recall of 99.62%. These results indicate that the model detects spam emails reliably while keeping false positives low, which is critical in real-world applications where legitimate emails must not be misclassified as spam. The study also explains the rationale for choosing TF-IDF and SVM for spam email classification and presents the model's evaluation results, which are consistent with existing literature in which the combination of SVM and TF-IDF has shown strong performance in spam detection.
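
The two ideas the abstract relies on, TF-IDF weighting and the accuracy, precision, and recall metrics, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `tfidf` and `metrics`, the toy sentences, and the toy labels are all hypothetical, and the SVM training step is left out. The weighting used here is the plain textbook form (relative term frequency times log(N / document frequency)), which is an assumption about the variant.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    Whitespace tokenization and this idf variant are illustrative assumptions.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def metrics(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision, recall), treating `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical toy corpus: "win" occurs in one document, "free" in two,
# so "win" receives the larger weight in the first document.
toy_docs = [
    "win free prize now",
    "free meeting agenda attached",
    "project meeting notes attached",
]
toy_weights = tfidf(toy_docs)

# Hypothetical labels and predictions, only to show how the metrics are derived.
toy_true = ["spam", "spam", "ham", "ham", "spam"]
toy_pred = ["spam", "ham", "ham", "spam", "spam"]
toy_accuracy, toy_precision, toy_recall = metrics(toy_true, toy_pred)
```

Precision here is the share of emails flagged as spam that really are spam, so a high value corresponds to few legitimate emails being misclassified; recall is the share of actual spam that is caught. In practice a library would normally be used for the full pipeline, for example scikit-learn's `TfidfVectorizer` followed by `LinearSVC`; note that `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its values differ from this sketch.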

