Cybercriminal Personality Detection through Machine Learning

Dr Nurul Hashimah Ahamed Hassain Malim1, Saravanan Sagadevan1, Muhd Baqir Hakim1, Nurul Izzati Ridzuwan1

1 Universiti Sains Malaysia, Penang, Malaysia



The development of sophisticated communication technologies such as social networks has sharply increased the number of users who participate in online activities. Although this development brings many positive social benefits, the dark side of online communication remains a major concern in virtual interactions. This dark side is commonly referred to as cyber threats or cybercrime. Over the past two decades, cybercrime cases have increased sharply and threatened the privacy and lives of online users. Severe kinds of cybercriminal activity such as cyberbullying and cyber harassment are often carried out by exploiting text messages and the anonymity offered by social network platforms such as Facebook and Twitter. However, linguistic clues such as patterns of writing and expression in text messages often act as fingerprints that reveal the personality traits of the culprits who hide behind that anonymity [1]. Personality traits are a hidden abstraction that combines a person's emotions, behavior, motivation and thinking patterns, and they often mirror the person's true characteristics through activities conducted intentionally or unintentionally [2]. Individuals differ in their styles of talking and writing, although those differences are hard to observe directly. Because these styles are unique from person to person, a writer's identity can potentially be inferred by observing the patterns of the writing, especially the formation of words, phrases and clauses. Sir Francis Galton has been identified as the first person to hypothesize that natural language terms might reflect personality differences among people [3]. Furthermore, Hofstee suggested that nouns, sentences and actions might carry some connotation of personality [4].
On the other hand, for several decades researchers in forensic psychology and the behavioral sciences have worked with law enforcement agencies to study and integrate the science of psychology into criminal profiling [5]. The literature on psychology, linguistics and behavior indicates a strong relationship between personality traits, especially those associated with criminals, and writing or language skills. This raises the question of whether the writing patterns of cybercriminals in social networks can be detected by automatic classifiers. If so, how well do the classifiers perform, and which words or combinations of words are frequently used by cyber predators? To answer these questions, we conducted an empirical investigation [9] (the main study) and two small-scale studies [10,11] that extend it, using textual sources from Facebook and Twitter and exploiting the trait descriptions of the Three Factor Personality Model together with sentiment valences. The main study used open-source Facebook [6] and Twitter [7] data as text input, while the data for the two small-scale studies were harvested from Twitter using Tweepy, a Python library for accessing the Twitter API. The main study and the second study used only English-language text, while the third study used tweets in the Malay language (Bahasa Malaysia). In these studies, we employed four main classifiers, namely Sequential Minimal Optimization (SMO), Naive Bayes (NB), K-Nearest Neighbor (KNN) and J48, with ZeroR as the baseline, all from the Waikato Environment for Knowledge Analysis (WEKA) machine learning tool.
The traits of the Three Factor Model were used in these studies for three reasons: the model is widely used in criminology, its small number of traits eases the categorization of characteristics, and a large body of empirical work associates the Psychoticism trait with criminal characteristics. Sentiment valences were used to measure the polarity of sentiment terms. The major traits of the Three Factor Model and their associated characteristics are listed in Table 1.
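For orientation, the ZeroR baseline ignores the text entirely and always predicts the most frequent class in the training data, so a useful classifier should at least beat it. The following is a minimal Python sketch of that idea, not the WEKA implementation; the function names and the toy labels are invented for illustration and are not taken from the studies.

```python
from collections import Counter

def zero_r_fit(train_labels):
    """Return the majority class of the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that equal the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical toy labels (assumption, not data from the studies).
train = ["psychoticism", "other", "other", "psychoticism", "other"]
test = ["other", "psychoticism", "other", "other"]

majority = zero_r_fit(train)              # most frequent training class
predictions = [majority] * len(test)      # ZeroR predicts it for every instance
baseline_accuracy = accuracy(predictions, test)
```

Any accuracy reported for SMO, NB, KNN or J48 is meaningful only relative to this kind of majority-class baseline.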

Table 1: Three Factor Model traits and their characteristics [8].

Traits Specific Characteristics
Extraversion Sociable, lively, active, assertive, sensation seeking, carefree, and dominant.
Neuroticism Anxious, depressed, guilt feelings, low self-esteem, tense, irrational, and moody.
Psychoticism Aggressive, egocentric, impersonal, impulsive, antisocial, creative and tough-minded.

The three studies used a similar research framework. Step 1: data collection and preprocessing (data cleansing, stemming, part-of-speech tagging). Step 2: data annotation. Step 3: automatic classification by the four classifiers. Step 4: performance analysis and identification of criminal-related terms (using the Chi-Square method). The following tables show the performance of the machine learning classifiers in these studies and the list of terms identified as associated with criminal behavior. The class balancing method called Synthetic Minority Over-sampling Technique (SMOTE) was used to overcome the imbalance in the number of instances per class.

Table 2: Accuracy of classifiers with and without the SMOTE class balancing method [10].

Performance measurement based on True Positive (TP) and False Positive (FP) rates (%)

Type/Classifier ZeroR           NB              KNN             SMO             J48
                TP      FP      TP      FP      TP      FP      TP      FP      TP      FP
Without SMOTE   47.2    52.73   58.18   41.82   47.27   52.73   72.73   27.27   78.18   21.82
With SMOTE      40.6    59.38   68.75   31.25   53.13   46.88   73.44   26.56   75.00   25
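To make the class balancing step concrete, the core idea of SMOTE is to create synthetic minority-class points by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified Python illustration under stated assumptions: it works on numeric feature vectors (text would first have to be converted to such vectors), the function name, parameters and sample points are hypothetical, and it is not the WEKA filter used in the studies.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical 2-D feature vectors for an under-represented class.
minority_class = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote(minority_class, n_new=4)
balanced_minority = minority_class + new_points
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by that class rather than duplicating existing instances.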


Table 3: Accuracy of classifiers when measuring the effect of cross-validation [11].

Performance measurement based on True Positive (TP) and False Positive (FP) rates (%)

Folds           ZeroR           NB              KNN             SMO             J48
                TP      FP      TP      FP      TP      FP      TP      FP      TP      FP
3               53.3    46.7    80.0    20.0    63.3    36.7    73.3    26.7    50.0    50.0
5               53.3    46.7    90.0    10.0    56.7    43.3    70.0    30.0    63.3    36.7
10              53.3    46.7    90.0    10.0    56.7    43.3    86.3    16.7    70.0    30.0
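For readers unfamiliar with cross-validation, each setting splits the data into k folds, trains on k-1 of them and tests on the remaining one, repeating until every fold has served as the test set once. The following Python sketch shows only the splitting step; the function name and the sample count of 30 are arbitrary assumptions for illustration, and the sketch does not shuffle or stratify the data, which a real evaluation normally would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds whose
    sizes differ by at most one."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Splits for 3, 5 and 10 folds over a hypothetical set of 30 instances.
splits = {k: list(k_fold_indices(30, k)) for k in (3, 5, 10)}
```

With more folds, each model is trained on a larger share of the data and tested on a smaller one, which is one reason accuracy can change with the number of folds.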


Table 4: Terms that are highly associated with criminal behavior [9].

Facebook                              Twitter
Unigram   Bigram      Trigram         Unigram   Bigram      Trigram
Damn      The hell    I want to       Suck      Damn it     A big ass
Shit      Damn it     Damn it I       Adore     The hell    A bit more
Fuck      Hell I      Is a bitch      Annoy     A bitch     A bitch and
Hell      My fuck     What the fuck   Asshole   A damn      A damn good
Ass       The shit    What the hell   Shit      A fuck      All fuck up
Suck      Damn you    I feel like     Fuck      A hell      A great fuck
Bad       The fuck    The hell I      Hell      Damn you    A great night
Feel      A bitch     -               Cute      A shit      A pain in
Hate      Fuck yeah   -               Damn      My ass      A fuck off
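Term lists of this kind can be obtained by extracting word n-grams from each message and ranking them by the Chi-Square statistic of the 2x2 table of term presence against class membership. The following Python sketch illustrates that calculation under simplifying assumptions: tokenization is a plain whitespace split with no stemming or part-of-speech tagging, and the function names, messages and labels are invented for illustration rather than taken from the studies.

```python
def ngrams(text, n):
    """Return the word n-grams of a lower-cased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def chi_square(term, docs, labels, target):
    """Chi-Square statistic for term presence versus membership in target."""
    n = len(term.split())
    a = b = c = d = 0  # cells of the 2x2 contingency table
    for text, label in zip(docs, labels):
        present = term in ngrams(text, n)
        in_class = label == target
        if present and in_class:
            a += 1          # term present, target class
        elif present:
            b += 1          # term present, other class
        elif in_class:
            c += 1          # term absent, target class
        else:
            d += 1          # term absent, other class
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return total * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical toy corpus (assumption, not data from the studies).
docs = ["what the hell is this", "i hate this so much", "what a great night",
        "i feel like dancing", "the hell with you", "a great day out"]
labels = ["psychoticism", "psychoticism", "other",
          "other", "psychoticism", "other"]

scores = {t: chi_square(t, docs, labels, "psychoticism")
          for t in ("the hell", "hate", "i")}
```

A higher score indicates a stronger association between the term and the class split. The statistic alone does not say which class the term favours, so the direction has to be read from the contingency counts.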


In conclusion, our investigation showed that J48 performed better than the other classifiers both with and without the SMOTE class balancing technique, and that the effect of cross-validation varies from classifier to classifier. Overall, however, Naive Bayes performed best in each cross-validation experiment. The investigation also produced a list of words that may be used by cybercriminals, organized by language model (unigram, bigram and trigram). In future work, we plan to use deep learning methods to analyze content related to cyber terrorism, and we welcome collaboration on collecting textual data on cyber terrorism from social networks.


  1. Olivia Goldhill. Digital detectives: solving crimes through Twitter. The Telegraph, 2013.
  2. Navonil Majumder, Soujanya Poria, Alexander Gelbukh, and Erik Cambria. 2017. Deep learning based document modeling for personality detection from text. IEEE Intelligent Systems 32(2):74–79.
  3. Sapir, Edward. Language: An Introduction to the Study of Speech. New York: Harcourt, Brace, 1921.
  4. Matthews, G., Deary, I. J., & Whiteman, M. C. Personality Traits (2nd edition). Cambridge University Press, 2003.
  5. Gierowski, J. K. Podstawowa problematyka psychologiczna w procesie karnym. Psychologia w postępowaniu karnym, Lexis Nexis, Warszawa 2010.
  6. Celli, F., Pianesi, F., Stillwell, D., & Kosinski, M. Workshop on Computational Personality Recognition (Shared Task). In Proceedings of WCPR13, in conjunction with ICWSM-2013.
  7. Go, A., Bhayani, R., & Huang, L. Twitter Sentiment Classification using Distant Supervision, 2009.
  8. van Dam, C., Janssens, J. M. A. M., & De Bruyn, E. E. J. PEN, Big Five, juvenile delinquency and criminal recidivism. Personality and Individual Differences, 39 (2005), 7–19. DOI:10.1016/j.paid.2004.06.016.
  9. Saravanan Sagadevan. Thesis: Comparison of Machine Learning Algorithms for Personality Detection in Online Social Networking, 2017.
  10. Muhd Baqir Hakim. Profiling Online Social Network (OSN) User Using PEN Model and Dark Triad Based on English Text Using Machine Learning Algorithm, 2017 (In Review).
  11. Nurul Izzati Binti Ridzuwan. Online Social Network User-Level Personality Profiling Using PEN Model Based on Malay Text, 2017 (In Review).



Nurul Hashimah Ahamed Hassain Malim (Nurul Malim) received her B.Sc (Hons) in computer science and M.Sc in computer science from Universiti Sains Malaysia, Malaysia. She completed her PhD in 2011 from The University of Sheffield, United Kingdom. Her current research interests include chemoinformatics, bioinformatics, data analytics, sentiment analysis and high-performance computing. She is currently a Senior Lecturer in the School of Computer Sciences, Universiti Sains Malaysia, Malaysia.