Dr Nurul Hashimah Ahamed Hassain Malim1, Saravanan Sagadevan1, Muhd Baqir Hakim1, Nurul Izzati Ridzuwan1
1 Universiti Sains Malaysia, Penang, Malaysia, firstname.lastname@example.org
The development of sophisticated communication technologies such as social networks has sharply increased the number of users who participate in online activities. Although this development brings many social benefits, the dark side of online communication remains a major concern in virtual interactions. This dark side is often referred to as cyber threats or cybercrime. Over the past two decades, cybercrime cases have risen sharply and have threatened the privacy and lives of online users. Severe kinds of cybercriminal activity, such as cyberbullying and cyber harassment, are often carried out by exploiting text messages and the anonymity offered by social network platforms such as Facebook and Twitter. However, linguistic clues such as patterns of writing and expression in text messages often act as fingerprints that reveal the personality traits of culprits who hide behind that anonymity. Personality traits are hidden abstractions that combine a person's emotions, behavior, motivation and thinking patterns, and they often mirror the person's true characteristics through activities conducted intentionally or unintentionally. Individuals differ in their speaking and writing styles, although those differences are hard to observe directly. Because these styles are unique to each person, it is often possible to infer a writer's identity from the pattern of the writing, especially the formation of words, phrases and clauses. Sir Francis Galton is credited as the first person to hypothesize that natural language terms might reflect personality differences among people. Furthermore, Hofstee suggested that nouns, sentences and actions may carry connotations related to personality.
On the other hand, for several decades researchers in forensic psychology and the behavioral sciences have worked with law enforcement agencies to study and integrate the science of psychology into criminal profiling. The literature on psychology, linguistics and behavior indicates a strong relationship between personality traits, especially those associated with criminals, and writing or language skills. This raises the question of whether the writing patterns of cybercriminals in social networks can be identified or detected by automatic classifiers. If so, how well do the classifiers perform, and which words or combinations of words are frequently used by cyber predators? To answer these questions, we conducted an empirical investigation (the main study) and two smaller studies [10,11] that extend it, using textual sources from Facebook and Twitter and exploiting the trait descriptions of the Three Factor Personality Model together with sentiment valences. For the main study, open-source Facebook and Twitter data were used as text input, while the data for the two smaller studies were harvested from Twitter using Tweepy, a Python library for accessing the Twitter API. The main study and the second study used only English-language data, while the third study used tweets in the Malay language (Bahasa Malaysia). In these studies, we employed four main classifiers, namely Sequential Minimal Optimization (SMO), Naive Bayes (NB), K-Nearest Neighbor (KNN) and J48, with ZeroR as the baseline, all from the Waikato Environment for Knowledge Analysis (WEKA) machine learning tool.
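To illustrate the role of the ZeroR baseline mentioned above, the following is a minimal sketch in plain Python rather than WEKA. It assumes the standard behavior of ZeroR, which ignores all features and always predicts the most frequent class in the training data. The function names and the toy labels are hypothetical and are not taken from the studies.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the majority class of the training labels (ZeroR ignores features)."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical toy labels, for illustration only.
train_labels = ["extraversion", "neuroticism", "extraversion",
                "psychoticism", "extraversion"]
test_labels = ["extraversion", "psychoticism", "neuroticism", "extraversion"]

majority_class = zero_r_fit(train_labels)
predictions = [majority_class] * len(test_labels)
baseline_accuracy = accuracy(predictions, test_labels)
print(majority_class, baseline_accuracy)
```

A classifier such as SMO, NB, KNN or J48 is only informative to the extent that it exceeds this majority-class accuracy on the same data.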
The traits of the Three Factor Model were used in this study for three reasons: the model is widely used in criminology, its small number of traits eases the categorization of characteristics, and a large body of empirical evidence associates the Psychoticism trait with criminal characteristics. Sentiment valences were used to measure the polarity of sentiment terms. The major traits of the Three Factor Model and their associated characteristics are listed in Table 1.
Table 1: Three Factor Model traits and their characteristics.
|Extraversion||Sociable, lively, active, assertive, sensation seeking, carefree, and dominant.|
|Neuroticism||Anxious, depressed, guilt feelings, low self-esteem, tense, irrational, and moody.|
|Psychoticism||Aggressive, egocentric, impersonal, impulsive, antisocial, creative and tough-minded.|
The three studies used a similar research framework, as follows. Step 1: data collection and preprocessing (data cleansing, stemming, part-of-speech tagging). Step 2: data annotation. Step 3: automatic classification by the four classifiers. Step 4: performance analysis and identification of criminal-related terms (using the chi-square method). The Synthetic Minority Over-sampling Technique (SMOTE), a class balancing method, was used to overcome the imbalance in the number of instances per class. The following tables show the performance of the machine learning classifiers in the studies and the list of terms identified as associated with criminal behavior.
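The chi-square term identification in Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the studies: it assumes a binary split (one target class versus all other messages), document-level term presence, and the standard 2x2 chi-square statistic N(ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d)). The function names, labels and toy messages are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: target-class documents containing the term
    b: other documents containing the term
    c: target-class documents without the term
    d: other documents without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # term or class is constant, so no association is measurable
    return n * (a * d - b * c) ** 2 / denominator


def score_terms(documents, target_label):
    """Score every term by its chi-square association with target_label."""
    vocabulary = {token for tokens, _ in documents for token in tokens}
    scores = {}
    for term in vocabulary:
        a = sum(1 for t, y in documents if term in t and y == target_label)
        b = sum(1 for t, y in documents if term in t and y != target_label)
        c = sum(1 for t, y in documents if term not in t and y == target_label)
        d = sum(1 for t, y in documents if term not in t and y != target_label)
        scores[term] = chi_square(a, b, c, d)
    return scores


# Hypothetical annotated messages: (tokens, label).
documents = [
    (["damn", "hate"], "psychoticism"),
    (["damn", "bad"], "psychoticism"),
    (["happy", "night"], "other"),
    (["happy", "bad"], "other"),
]

scores = score_terms(documents, "psychoticism")
for term, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(f"{term}: {score:.3f}")
```

Note that the statistic is symmetric: a term that appears only outside the target class scores as high as one that appears only inside it, so the direction of the association has to be checked separately.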
Table 2: Accuracy of classifiers with and without the SMOTE class balancing method.
|Performance measurement based on True Positive (TP) and False Positive (FP)|
Table 3 : Accuracy of classifiers based on measuring the effect of class measuring.
|Performance measurement based on True Positive (TP) and False Positive (FP)|
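Tables 2 and 3 report performance in terms of True Positive (TP) and False Positive (FP). As a minimal sketch of how these measures can be computed for one class, the code below assumes the usual definitions, TP rate = TP / (TP + FN) and FP rate = FP / (FP + TN). The function name, labels and predictions are hypothetical and do not reproduce any result from the studies.

```python
def tp_fp_rates(actual, predicted, positive_label):
    """Return (TP rate, FP rate) for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive_label and a == positive_label:
            tp += 1  # correctly flagged as positive
        elif p == positive_label and a != positive_label:
            fp += 1  # wrongly flagged as positive
        elif p != positive_label and a == positive_label:
            fn += 1  # missed positive
        else:
            tn += 1  # correctly left as negative
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return tp_rate, fp_rate


# Hypothetical labels: "P" marks the class of interest, "N" everything else.
actual = ["P", "P", "N", "N", "N", "P"]
predicted = ["P", "N", "N", "P", "N", "P"]

tp_rate, fp_rate = tp_fp_rates(actual, predicted, "P")
print(f"TP rate = {tp_rate:.3f}, FP rate = {fp_rate:.3f}")
```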
Table 4: Terms that are highly associated with criminal behavior.
|Damn||The hell||I want to||Suck||Damn It||A big ass|
|Shit||Damn it||Damn it I||Adore||The hell||A bit more|
|Fuck||Hell i||Is a bitch||Annoy||A bitch||A bitch and|
|Hell||My Fuck||What the Fuck||Asshole||A damn||A damn good|
|Ass||The shit||What the hell||Shit||A fuck||All fuck up|
|Suck||Damn you||I feel like||Fuck||A hell||A great fuck|
|Bad||The fuck||The hell I||Hell||Damn you||A great night|
|Feel||A bitch||Cute||A shit||A pain in|
|Hate||Fuck yeah||Damn||My ass||A fuck off|
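The entries in Table 4 appear to be single words, two-word phrases and three-word phrases, that is, word unigrams, bigrams and trigrams. The following is a minimal sketch of how such word n-grams could be extracted and counted. It assumes simple lowercase whitespace tokenization and does not reproduce the stemming or part-of-speech tagging used in the studies; the function name and sample messages are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical messages, for illustration only.
messages = ["What the hell", "what the hell is this"]

trigram_counts = Counter()
for message in messages:
    trigram_counts.update(word_ngrams(message, 3))

print(trigram_counts.most_common())
```

The resulting counts per class could then feed a term association measure such as chi-square.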
In conclusion, our investigation showed that J48 performed better than the other classifiers both with and without the SMOTE class balancing technique, and that the effect of cross validation varied for each classifier. Overall, however, Naïve Bayes performed better in each cross-validation experiment. This investigation also produced a list of words that may be used by cybercriminals, based on language model specifications. For future work, we plan to use deep learning methods to analyze content related to cyber terrorism, and we welcome collaboration on collecting textual data on cyber terrorism in social networks.
- Olivia Goldhill. Digital detectives: solving crimes through Twitter. The Telegraph, 2013.
- Navonil Majumder, Soujanya Poria, Alexander Gelbukh, and Erik Cambria. 2017. Deep learning based document modeling for personality detection from text. IEEE Intelligent Systems 32(2):74–79.
- Sapir, Edward. Language: An Introduction to the Study of Speech. New York: Harcourt, Brace, 1921.
- Matthews, G., Deary, I. J., & Whiteman, M. C. Personality Traits (2nd edition). Cambridge University Press, 2003.
- Gierowski, J. K. Podstawowa problematyka psychologiczna w procesie karnym. Psychologia w postępowaniu karnym, Lexis Nexis, Warszawa 2010.
- Celli, F., Pianesi, F., Stillwell, D., & Kosinski, M. Workshop on Computational Personality Recognition (Shared Task). In Proceedings of WCPR13, in conjunction with ICWSM-2013.
- Go, A., Bhayani, R., & Huang, L. Twitter Sentiment Classification using Distant Supervision, 2009.
- Coleta, V. D, Jan, M. A., Janssens, M., & Eric E. J. PEN, Big Five, juvenile delinquency and criminal recidivism. Personality and Individual Differences, 39, (2005) 7–19. DOI:10.1016/j.paid.2004.06.016.
- Saravanan Sagadevan. Comparison of Machine Learning Algorithms for Personality Detection in Online Social Networking. Thesis, 2017.
- Muhd Baqir Hakim. Profiling Online Social Network (OSN) User Using PEN Model and Dark Triad Based on English Text Using Machine Learning Algorithm, 2017 (In Review).
- Nurul Izzati Binti Ridzuwan. Online Social Network User-Level Personality Profiling Using PEN Model Based on Malay Text, 2017 (In Review).
Nurul Hashimah Ahamed Hassain Malim (Nurul Malim) received her B.Sc (Hons) in computer science and M.Sc in computer science from Universiti Sains Malaysia, Malaysia. She completed her PhD in 2011 from The University of Sheffield, United Kingdom. Her current research interests include chemoinformatics, bioinformatics, data analytics, sentiment analysis and high-performance computing. She is currently a Senior Lecturer in the School of Computer Sciences, Universiti Sains Malaysia, Malaysia.