Prediction of Drug Target Interaction Using Association Rule Mining (ARM)

Dr Nurul Hashimah Ahamed Hassain Malim1, Mr Muhammad Jaziem Mohamed Javed1

1Universiti Sains Malaysia, USM, Penang, Malaysia



Drug repositioning identifies new indications (i.e. new diseases to treat) for known drugs [1]. It is an innovative stream of pharmaceutical development that benefits both drug developers and patients, since the medicines are already known to be safe to use. The approach is regarded as a successful alternative in the drug discovery process because several drugs have been repositioned to new indications in the past, the most prominent being Viagra and Thalidomide, which in turn brought higher revenue [2]. The main reason drug repositioning is possible is the accepted concept of ‘polypharmacology’ [3]. In general, polypharmacology has transformed the idea of drug development from “one drug, one target” to “one drug, multiple targets” [4]. Polypharmacology appears in drug discovery when (a) a single drug acts on multiple targets of a single disease pathway, or (b) a single drug acts on multiple targets across multiple disease pathways. The polypharmacological property of a drug helps to identify more than one target on which it can act, and hence new uses of the drug can be discovered [4]. Using in silico methods to predict the interactions between drugs and target proteins is a crucial step forward for drug repositioning, as it can markedly reduce wet-laboratory work and lower the cost of the experimental discovery of new drug-target interactions (DTIs) [5].


Similarity searching, a technique in the ligand-based category, can be classified as a well-established method, since many researchers have used it to predict DTIs [6]. Its adoption has been driven by the desire to find patentable, more suitable lead compounds and to reduce the high failure rates of compounds in the drug discovery and development pipeline [7]. As shown in Figure 1.0, a new DTI is predicted when a single ligand with known biological activity (the active query) is used to search for another reference ligand (its nearest neighbour) [8]. This reference ligand, found by screening the query against a large database of compounds, is assumed to bind to the same target as the query compound and is therefore treated as a potential drug [8]. The rationale of this screening method is that true binders/drugs share similar functional groups and/or geometric shapes, given the interacting hot spots within the binding site of the respective protein [9]. Despite its strengths in identifying new drugs, similarity searching has several disadvantages. First, it depends on the availability of known ligands, which may not be available in the earlier stages of the drug discovery process; in other words, it needs at least one ligand compound to initiate the search [8]. Second, because it relies on ligand similarity, it has difficulty identifying drugs with novel scaffolds that differ from those of the query compounds [10]. Finally, it determines neither the binding position of the ligand within the binding site nor the binding score between the ligand and the protein [11]. The binding mode within the binding site is crucial for exploring the response mechanism between the protein and the ligand and for the accuracy of the identified drug lead. The binding energy score, which relies on the prediction of correct binding modes, also plays an important role when optimizing drug leads.
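The nearest-neighbour search described above can be illustrated with a minimal sketch. It assumes that each compound is represented as a set of fingerprint bit positions and that similarity is measured with the Tanimoto coefficient; the abstract does not name a similarity measure, so this choice, the function names and the toy compounds are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is the set of "on" bit positions of a compound.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either compound."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_neighbours(
    query: Fingerprint, database: Dict[str, Fingerprint], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Rank database compounds by similarity to the active query."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data, invented for illustration only.
query = frozenset({1, 4, 7, 9})
database = {
    "cmpd_A": frozenset({1, 4, 7, 8}),
    "cmpd_B": frozenset({2, 3, 5}),
    "cmpd_C": frozenset({1, 4, 7, 9, 12}),
}

ranking = nearest_neighbours(query, database, top_k=2)
print(ranking)  # [('cmpd_C', 0.8), ('cmpd_A', 0.6)]
```

The top-ranked compounds would be the candidates assumed to bind to the same target as the query. The sketch also makes the stated limitations visible: it cannot start without a query ligand, and it returns only a similarity score, not a binding mode or binding energy.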

Knowledge Discovery in Databases (KDD) can be defined as the use of methods from domains such as machine learning, pattern recognition, statistics and other related fields to deduce knowledge from huge collections of data, where that knowledge is absent from the database structure [12]. The databases of pharmaceutical companies are also characterised by very large amounts of data, which has led to the growing use of KDD methods within the drug discovery process. Lately, however, researchers have turned their attention to other methodologies that can explain molecular activity in greater depth [12]. These methods are not expected to improve prediction accuracy, but they can still assist medicinal chemists in developing the next marketable drugs [12]. This has prompted the introduction of various related techniques from the KDD field into chemoinformatics, one of which is Association Rule Mining (ARM) [12]. ARM shares properties with machine learning classification methods but differs slightly in its primary aim, as it focuses on explanation rather than classification: it concentrates on the features, or groups of features, that may determine a particular classification for a set of objects [12]. The promising performance of ARM in several instances of target prediction makes it attractive for predicting DTIs.


The information contained in the activity classes of the ChEMBL database, which range from heterogeneous to homogeneous, is important because it can be used to build the classification model. In our experiment, we use that information to generate rules that determine the protein targets for a particular ligand. Each rule is generated on the basis of its associated support and confidence levels. Support indicates how frequently the items appear in the database, while confidence specifies how often the if/then statement is found to be true. From the support and confidence scores obtained, we select the best rules for target prediction, and these rules are then used to predict protein targets for future ligands. However, the biggest challenge of ARM is that frequent itemset generation is a compute-intensive procedure. Hence, it is crucial that it is executed on a high-performance machine. At the moment we lack high-performance computing resources, and this limits our ability to fully explore the capability of ARM in relation to our objectives. Nevertheless, we have obtained results for certain parameter ranges, which will be presented on the poster.
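The roles of support and confidence can be sketched as follows. This is not the authors' pipeline: it is a brute-force enumeration (not an optimised algorithm such as Apriori) over invented toy data, where each transaction is assumed to hold a ligand's structural features plus one target label. The feature names, target labels, thresholds and function names are all hypothetical.

```python
from itertools import combinations

# Toy transactions, invented for illustration: features of a ligand plus its target.
transactions = [
    frozenset({"ring_N", "amide", "target:T1"}),
    frozenset({"ring_N", "amide", "target:T1"}),
    frozenset({"ring_N", "halogen", "target:T1"}),
    frozenset({"amide", "halogen", "target:T2"}),
    frozenset({"halogen", "target:T2"}),
]
targets = {"target:T1", "target:T2"}


def count(itemset, data):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in data if itemset <= t)


def mine_rules(data, target_labels, min_support, min_confidence, max_len=2):
    """Enumerate rules 'features -> target' that pass both thresholds."""
    features = sorted(set().union(*data) - target_labels)
    rules = []
    for size in range(1, max_len + 1):
        # The number of candidate itemsets grows combinatorially with size,
        # which is why this step dominates the running time.
        for antecedent in combinations(features, size):
            ante = frozenset(antecedent)
            ante_count = count(ante, data)
            if ante_count == 0:
                continue
            for target in sorted(target_labels):
                rule_count = count(ante | {target}, data)
                support = rule_count / len(data)    # how often the rule occurs
                confidence = rule_count / ante_count  # how often "if" implies "then"
                if support >= min_support and confidence >= min_confidence:
                    rules.append((ante, target, support, confidence))
    # Best rules first: highest confidence, then highest support.
    rules.sort(key=lambda r: (r[3], r[2]), reverse=True)
    return rules


def predict(ligand_features, rules):
    """Return the target of the best rule whose antecedent the ligand satisfies."""
    for ante, target, _, _ in rules:
        if ante <= ligand_features:
            return target
    return None


rules = mine_rules(transactions, targets, min_support=0.4, min_confidence=0.8)
for ante, target, support, confidence in rules:
    print(sorted(ante), "->", target, support, confidence)

print(predict(frozenset({"ring_N", "halogen"}), rules))  # target:T1
```

With these toy thresholds, two rules survive, both pointing to the hypothetical target T1. Capping the antecedent length with `max_len` is one simple way to keep the candidate space manageable on a machine without high-performance resources, at the cost of missing longer rules.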


Figure 1.0: Conventional similarity searching method used to predict a new ligand that will interact with a particular target [8].



[1] L. Yu, X. Ma, L. Zhang, J. Zhang and L. Gao, “Prediction of new drug indications based on clinical data and network modularity”, Scientific Reports, vol. 6, no. 1, 2016.

[2] T. Ashburn and K. B. Thor, “Drug Repositioning: Identifying and Developing New Uses for Existing Drugs”, Nature Reviews Drug Discovery, vol. 3, pp. 673-683, 2004.

[3] J. C. Nacher and J. M. Schwartz, “Modularity in Protein Complex and Drug Interactions Reveals New Polypharmacological Properties”, PLoS ONE, vol. 7, e30028, 2012.

[4] J. Peters, “Polypharmacology – Foe or Friend?”, Journal of Medicinal Chemistry, vol. 56, no. 22, pp. 8955-8971, 2013.

[5] “computational drug discovery: Topics by”,, 2017. [Online]. Available: [Accessed: 06-Sep-2017].

[6] T. Katsila, G. Spyroulias, G. Patrinos and M. Matsoukas, “Computational approaches in target identification and drug discovery”, Computational and Structural Biotechnology Journal, vol. 14, pp. 177-184, 2016.

[7] J. Auer and J. Bajorath, in J. Keith (ed.), Bioinformatics, Humana Press, pp. 327-347, 2008.

[8] P. Willett, J. M. Barnard and G. M. Downs, “Chemical Similarity Searching”, Journal of Chemical Information and Computer Sciences, vol. 38, pp. 983-996, 1998.

[9] S. Huang, M. Li, J. Wang and Y. Pan, “HybridDock: A Hybrid Protein–Ligand Docking Protocol Integrating Protein- and Ligand-Based Approaches”, Journal of Chemical Information and Modeling, vol. 56, no. 6, pp. 1078-1087, 2016.

[10] N. Wale, I. Watson and G. Karypis, “Indirect Similarity Based Methods for Effective Scaffold-Hopping in Chemical Compounds”, Journal of Chemical Information and Modeling, vol. 48, no. 4, pp. 730-741, 2008.

[11] D. Mobley and K. Dill, “Binding of Small-Molecule Ligands to Proteins: “What You See” Is Not Always “What You Get””, Structure, vol. 17, no. 4, pp. 489-498, 2009.

[12] E. Gardiner and V. Gillet, “Perspectives on Knowledge Discovery Algorithms Recently Introduced in Chemoinformatics: Rough Set Theory, Association Rule Mining, Emerging Patterns, and Formal Concept Analysis”, Journal of Chemical Information and Modeling, vol. 55, no. 9, pp. 1781-1803, 2015.



Nurul Hashimah Ahamed Hassain Malim (Nurul Malim) received her B.Sc (Hons) in computer science and M.Sc in computer science from Universiti Sains Malaysia, Malaysia. She completed her PhD in 2011 from The University of Sheffield, United Kingdom. Her current research interests include chemoinformatics, bioinformatics, data analytics, sentiment analysis and high-performance computing. She is currently a Senior Lecturer in the School of Computer Sciences, Universiti Sains Malaysia, Malaysia.
