Ensemble Bayesian Classification Using Genetic Algorithm Wrapper Feature Selection in Spam Detection

Document Type: Original Article

Authors

1 Ph.D. Candidate, Computer Department, Faculty of Engineering, Arak University, Arak, Iran

2 Associate Prof., Computer Department, Faculty of Engineering, Arak University, Arak, Iran

10.22034/aimj.2021.135034

Abstract

The role of email in communication is seriously threatened by spam. Many methods have been proposed to counter it, and one of the most important is to classify emails by their content into two categories: spam and non-spam. Content-based classification mechanisms use words as features, so an efficient feature selection mechanism is critical given the large number of features. The main focus of this paper is therefore to select useful features by proposing a wrapper feature selection approach based on a genetic algorithm. We then apply a Bayesian classifier, which has demonstrated high efficiency in text classification. The main steps of the proposed method are as follows: first, an initial feature vector is chosen; then it is optimized by multiplying it by a transformation matrix produced by the genetic algorithm; and finally, a set of k feature vectors is generated. An ensemble of k Bayesian classifiers is applied to the feature vectors, and the final class label is determined by voting among the ensemble members. The proposed method is implemented on two datasets, PU1 and PU2. The results show that the classification accuracy of the proposed method with k=7 reaches 87.86 and 87.91 on PU1 and PU2, respectively. The results also indicate the efficiency of the proposed method compared to naïve Bayes and two well-known classifiers, SVM and KNN.
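The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the transformation matrix, so a plain binary feature mask (equivalent to multiplying by a diagonal 0/1 matrix) stands in for it here. The function names (`train_nb`, `predict_nb`, `ga_select_masks`, `ensemble_predict`), the Bernoulli naïve Bayes model, the GA settings, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def train_nb(X, y, mask):
    """Fit a Bernoulli naive Bayes model on the features kept by `mask`."""
    feats = [j for j, keep in enumerate(mask) if keep]
    model = {"feats": feats, "prior": {}, "p": {}}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        model["prior"][c] = math.log(len(rows) / len(X))
        # Laplace smoothing keeps every probability strictly inside (0, 1).
        model["p"][c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                         for j in feats]
    return model

def predict_nb(model, x):
    """Return the class with the highest log-posterior for example x."""
    best, best_score = None, -math.inf
    for c in sorted(model["prior"]):
        score = model["prior"][c]
        for j, p in zip(model["feats"], model["p"][c]):
            score += math.log(p if x[j] else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

def accuracy(model, X, y):
    return sum(predict_nb(model, x) == t for x, t in zip(X, y)) / len(y)

def ga_select_masks(X, y, k, pop_size=20, generations=15, seed=0):
    """Toy GA wrapper: evolve binary masks, scoring each by classifier accuracy.

    Assumption: fitness is measured on the training data for brevity; a real
    wrapper would score each mask on held-out data or by cross-validation.
    """
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        return accuracy(train_nb(X, y, mask), X, y)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]           # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)
            child[i] = 1 - child[i]             # single bit-flip mutation
            children.append(child)
        pop = parents + children
    pop.sort(key=fitness, reverse=True)
    return pop[:k]                              # the k fittest masks

def ensemble_predict(models, x):
    """Majority vote among the ensemble members."""
    votes = Counter(predict_nb(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus: columns are word-presence flags for four words.
X = [[1, 1, 0, 1], [1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 0, 1],
     [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
y = ["spam"] * 4 + ["ham"] * 4

masks = ga_select_masks(X, y, k=3)
models = [train_nb(X, y, m) for m in masks]
print(masks, ensemble_predict(models, [1, 1, 0, 0]))
```

An odd k, such as the k=7 reported above, avoids tied votes in a two-class problem. The sketch uses k=3 only to keep the toy run small.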

Keywords

Volume 6, Issue 2 - Serial Number 11
February 2021
Pages 250-277
  • Receive Date: 28 February 2021
  • Revise Date: 21 June 2021
  • Accept Date: 07 August 2021