A Clustering Based Feature Selection Method in Spam Detection

Nosrati, Vahid; Rahmani, Mohsen

doi:10.22034/aimj.2022.170976

Article type: Research article

Authors

Vahid Nosrati ¹

Mohsen Rahmani ²

¹ Ph.D. Candidate, Computer Engineering, Faculty of Engineering, Arak University, Arak, Iran

² Associate Professor, Computer Engineering, Faculty of Engineering, Arak University, Arak, Iran

Abstract

One way to detect spam is to classify emails into two categories, spam and non-spam. The high performance of machine learning methods on a variety of problems has led to their wide development in text classification. In content-based machine learning algorithms, an efficient feature reduction mechanism plays an important role in extracting an effective feature vector from a very large number of emails. Unlike previous methods, which select only the top features and ignore the rest, the method proposed in this article tries to make use of the unselected features as well. The method works as follows: first, an initial feature selection is applied and a number of features are selected. Then the unselected features are clustered and each cluster is mapped to a new feature, so that the final feature vector consists of the selected features and the features mapped from each cluster. In this study, applying two initial feature selection methods and two cluster-feature mapping functions gave four methods in total, and the results were analyzed using the two datasets PU2 and PU3. The analysis showed that the method based on DF initial feature selection and the advanced mapping function has the highest performance among all the proposed methods. In addition, the proposed methods perform better than initial feature selection alone (without clustering).

Keywords

Feature selection

Email

Clustering

Classification

Feature reduction

Spam

Article Title (English)

A Clustering Based Feature Selection Method in Spam Detection

Authors (English)

Vahid Nosrati ¹

Mohsen Rahmani ²

¹ Ph.D. Candidate, Computer Engineering, Faculty of Engineering, Arak University, Arak, Iran

² Associate Prof., Computer Engineering, Faculty of Engineering, Arak University, Arak, Iran

Abstract (English)

One way to detect spam is to classify emails into two categories: spam and non-spam. The strong performance of machine learning methods in many fields has led to their wide adoption in text classification problems. Machine learning classifiers that label emails according to their content rely on a set of features, and because of the high volume of emails, an efficient feature reduction algorithm plays an important role. Unlike previous methods, which select only the top-ranked features and ignore the rest, the method proposed in this article also makes use of the unselected features. After an initial feature selection step, the unselected features are clustered, each cluster is mapped to a new feature, and the final feature vector is formed from the selected features together with those mapped from the clusters. In this study, combining two initial feature selection methods with two mapping functions yielded four methods, which were analyzed using two datasets, PU2 and PU3. The results showed that the method based on DF feature selection and the advanced mapping function has the highest efficiency among all the proposed methods. The proposed methods are also more efficient than the base feature selection methods (without clustering).
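The pipeline described in the abstract (initial selection, clustering of the unselected features, one mapped feature per cluster) can be illustrated with a minimal Python sketch. Several choices here are assumptions made only for illustration, not details taken from the article: DF is read as document frequency, the clustering is a plain k-means written with NumPy, and the mapping function is a simple per-document mean over the cluster's features. The function names `df_select`, `kmeans_labels`, and `cluster_augmented_features` are hypothetical.

```python
import numpy as np


def df_select(X, k):
    """Split feature indices into the k with highest document frequency and the rest.

    Assumption: DF means the number of documents in which a term occurs.
    """
    df = (X > 0).sum(axis=0)
    order = np.argsort(-df, kind="stable")  # highest DF first, ties keep column order
    return np.sort(order[:k]), np.sort(order[k:])


def kmeans_labels(points, n_clusters, n_iter=20, seed=0):
    """Plain k-means (an illustrative stand-in for the article's clustering step)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), n_clusters, replace=False)
    centers = points[start].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = points[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels


def cluster_augmented_features(X, k, n_clusters):
    """Build [selected features | one mapped feature per cluster of unselected features]."""
    selected, unselected = df_select(X, k)
    # Each unselected feature is a point whose coordinates are its values over documents.
    labels = kmeans_labels(X[:, unselected].T, n_clusters)
    mapped = np.zeros((X.shape[0], n_clusters))
    for c in range(n_clusters):
        cols = unselected[labels == c]
        if len(cols):
            # Assumed mapping function: mean of the cluster's features per document.
            mapped[:, c] = X[:, cols].mean(axis=1)
    return np.hstack([X[:, selected], mapped])


# Toy binary document-term matrix: 6 emails, 8 terms (synthetic data).
X = np.random.default_rng(1).integers(0, 2, size=(6, 8))
Z = cluster_augmented_features(X, k=3, n_clusters=2)
print(Z.shape)  # (6, 5): 3 selected features plus 2 cluster features
```

The resulting matrix `Z` could then be passed to any classifier in place of the reduced matrix that the initial selection alone would produce; the extra columns are what carry information from the features that selection would otherwise discard.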

Keywords (English)

Classification

Clustering

Feature Reduction

Feature Selection

Spam