Pawsey Supercomputing Centre – Engaging for the Future

Dr Neil Stringfellow1, Dr Daniel Grimwood1

1Pawsey Supercomputing Centre, Kensington, Australia

 

ABSTRACT

The Pawsey Supercomputing Centre continues to evolve and grow. Recent developments at Pawsey include the new advanced technology testbed called Athena, as well as the expansion of the Zeus commodity Linux cluster. The Athena testbed includes Intel Xeon Phi “Knights Landing” processors as well as Nvidia “Pascal” GPUs.

I will also touch on the longer term vision for Pawsey and the Federal infrastructure roadmap.

 


Biography:

Dr Neil Stringfellow is the Executive Director of the Pawsey Supercomputing Centre.

Neil has led the Pawsey Supercomputing Centre since 2013, overseeing the operational launch of the Magnus Cray supercomputer. Neil joined Pawsey from the Swiss National Supercomputing Centre (CSCS), where he was involved in application and science support, the management of strategic partnerships and Switzerland’s flagship supercomputing systems.

Hivebench helps life scientists unlock the full potential of their research data

Mrs Elena Zudilova-Seinstra1, Mr Julien Therier1

1Elsevier, Amsterdam, The Netherlands

 

Title Hivebench helps life scientists unlock the full potential of their research data
Synopsis By integrating Hivebench ELN with an institutional repository, or the free data repository Mendeley Data, you can maximize the potential of your research data and secure its long-term archiving. Hivebench supports compliance with data mandates and the storage of research process details, making results more transparent, reproducible and easier to store and share.

Indeed storing information in private files or paper notebooks poses challenges, not only for individual life scientists, but for their lab as a whole. An Electronic Lab Notebook stores research data in a well-structured format for ease of reuse, and simplifies the process of sharing and preserving information. It also structures workflows and protocols to improve experiment reproducibility.

Format of demonstration Live Demonstration
Presenter(s) Elena Zudilova-Seinstra, PhD

Sr. Product Manager Research Data

Elsevier RDM, Research Products

Target research community Anyone in the lab – researchers, PIs and lab managers.
Statement of Research Impact Hivebench’s comprehensive, consistent and structured data capture provides a simple and safe way to manage and preserve protocols and research data.
Any special requirements Access to an Internet connection.

 


Biography:

I’m a Senior Product Manager for Research Data at Elsevier. In my current role I focus on delivering tools for sharing and reuse of research data. Since 2014 I have been responsible for Elsevier’s Research Elements Program, which focuses on innovative article formats for publishing data, software and other elements of the research cycle. Before joining Elsevier, I worked at the University of Amsterdam, SARA Computing and Networking Services and Corning Inc.

Immersive Visualization Technologies for eResearch

Dr Jason Haga1, Dr David Barnes2, Dr Maxine Brown3, Dr Jason Leigh4

1Cyber-Physical Cloud Research Group, AIST, Tsukuba, Japan,

2Monash Immersive Visualisation Platform, Monash Univ., Melbourne, Australia,

3Electronic Visualization Laboratory, Univ. of Illinois at Chicago, Chicago, USA,

4Univ. of Hawaii, Manoa, Honolulu, USA

 

Title Immersive Visualization Technologies for eResearch
Synopsis


It is well known that data is accumulating at an unprecedented rate. These troves of big data are invaluable to all sectors of society, especially eResearch activities. However, the amount of data poses significant challenges to data-intensive science. The visualization and analysis of data require an interdisciplinary effort and next generation technologies, specifically interactive environments that can immerse the user in data and provide tools for data analytics. One such immersive technology is virtual reality (VR), which together with the Unity development platform is becoming a viable, innovative solution for a wide variety of applications. To highlight this concept, we showcase two prototype VR applications: 1) a river disaster management application using over 17,000 different sensors deployed throughout Japan, and 2) a “multi-player” collaborative virtual environment for scientific data that works across the scale of displays from desktop through to ultra-scale CAVE2-like systems, demonstrated for two datasets: a segmented Kookaburra anatomy and an archaeological dig in Laos. These applications explore how combinations of 2D and 3D representations of data can support and enhance eResearch efforts using these new VR platforms. This presentation can also generate interest in the live demonstrations at the PRAGMA33 booth at eResearch.
Format of demonstration


Video and Slide Show with reference to PRAGMA booth live demonstrations
Presenter(s)


Jason H. Haga, Senior Researcher, Cyber-Physical Cloud Research Group, AIST, Japan

David G. Barnes, Associate Professor and Director, Monash Immersive Visualisation Platform, Monash Univ., Melbourne, Australia

Maxine Brown, Director, Electronic Visualization Laboratory, Univ. of Illinois at Chicago

Jason Leigh, Professor, Univ. of Hawaii, Manoa

Target research community


Any research community looking for novel data visualization solutions.
Statement of Research Impact


Virtual reality and the Unity development platform are becoming a viable, innovative solution for eResearch. This Showcase presentation highlights two data visualization applications for which virtual reality is having a significant impact.
Request to schedule alongside particular conference session


Request to have our Showcase presentation early in the conference to provide sufficient time for people to visit the PRAGMA booth and experience live demos.

 

 


 

Biography:

I am currently a senior researcher in the Cyber-Physical Cloud Research Group at the Information Technology Research Institute of AIST. My past research work involved the design and implementation of applications for grid computing environments and tiled display walls. I also work with cultural heritage institutions to deploy novel interactive exhibits to engage public learning. My research interests are in immersive visualization and analytic environments for large datasets. I have over 13 years of collaborative efforts with members of the PRAGMA community and continue to look for interdisciplinary collaboration opportunities.

orcid.org/0000-0002-6407-0003

Hacky Hour for eResearch Training, Engagement and Community Development

Dr Nick Hamilton1, Ms Belinda Weaver2

1University Of Queensland, St Lucia, Australia,

2Software and Data Carpentry, Brisbane, Australia

 

In the biosciences, as well as many other fields of research, there are often relatively low levels of mathematical and statistical expertise, as well as a lack of basic computing, eResearch and data skills. This is despite such skills becoming increasingly important across many fields for creating cutting-edge research. To help fill this knowledge gap, we have been experimenting with running “Hacky Hours” at The University of Queensland for the last 18 months. These are weekly events, held in an outdoor cafe, where researchers who would like assistance with their research IT can come along and ask questions, or just work on whatever they are working on in the company of other researchers who are into computing. A strong community of both helpers and researchers with questions has built up around this event, with many returning regularly. Often, a researcher with a question one week will come back as a helper another week.

During this period, we have been collecting data on all of our attendees and the types of problems they bring, as well as on our helpers and their interests. Typical questions include:

  • getting started with Python and R
  • software tools
  • how to access high performance computing
  • cloud data storage
  • tools for data cleaning and data visualisation.

The disciplines attendees belong to are very diverse and include: the biosciences, economics, psychology, humanities, languages, chemistry, mechanical engineering, nanotechnology, biomedical engineering, and ecology. There are also attendees from the library. Interestingly, a significantly larger number of women than men come to ask questions at Hacky Hour, though the helpers are approximately gender-balanced.

The Hacky Hour model of training and engagement offers a number of benefits.

The friendly, non-judgmental and informal environment encourages a greater diversity of participants than is often associated with research IT. For the university organisations that allow or encourage their employees to participate as helpers, there is the benefit of presenting a friendly face to IT and eResearch facilities. As the time commitment is limited to an hour a week, helpers are more willing to donate their time without fear of problems blowing out and of being stuck working on them. Hacky Hour often works as a referral service: while the Hacky Hour helpers may not have a solution, they may well know a person or organisation who could help. Similarly, the Hacky Hour helpers build up a knowledge base of common problems and their solutions, as well as resources such as R cheat sheets, short training courses, or good web sites on how to get started with Python. The helpers also gain valuable skills in helping the problem owners understand and define their problems and thus how to develop a solution. Often the problem may not be what the problem owner thinks it is, or the solution may be completely different from what the problem owner thought they needed. While occasionally a helper will solve a problem directly, the Hacky Hour ethos is much more about helping the problem owner develop the skills to find a path to the solution themselves. For the helpers, the informal discussions about solving problems are a good way to share high-level expertise with each other and keep up with the latest technical developments. More broadly, the helpers and problem owners have now become a community that can be drawn upon to help at or participate in other training or community events such as Software Carpentry bootcamps, HealthHack or Research Bazaar (ResBaz) events.

In this poster, we will outline our experiences with Hacky Hour, the strategies that we have taken to develop and maintain a community of helpers and attract a diverse range of problem owners, as well as the outcomes and benefits we have seen.


 

Biographies

Dr Nick Hamilton is the Institute Bio-Mathematician at the Institute for Molecular Bioscience (IMB), The University of Queensland, and holds a co-appointment with the Research Computing Centre at UQ. He gained a PhD in Pure Mathematics from the University of Western Australia in 1996 and was subsequently awarded Fellowships in Australia and Belgium. In 2002, Nick made the decision to change fields into the exciting new areas of computational biology and bioinformatics, returned to Australia, and subsequently took up a position within the ARC Centre of Excellence in Bioinformatics at The University of Queensland. In 2008 he was appointed as a Laboratory Head at IMB, and as Institute Bio-Mathematician in 2014, where he continues to lead a group in bio-image informatics, mathematical modelling and data visualisation, developing methodologies to deal with the current deluge of data that new microscopy imaging technologies have enabled. He has interests in, and has participated in, many training and engagement models such as Hacky Hour, HealthHack, ResBaz and Software Carpentry, and has chaired the Winter School in Mathematics and Computational Biology for the last 6 years.
https://orcid.org/0000-0003-0331-3427

Belinda Weaver is the Community Development Lead for Software and Data Carpentry, global organisations that aim to make researchers more productive and their research more reliable by teaching them computational and data skills. She was formerly the eResearch Analyst Team Leader for the Queensland Cyber Infrastructure Foundation, where she helped deliver cloud storage and solutions to Australian researchers. She was a key organiser of the very successful Brisbane Research Bazaar events in 2016 and 2017 – cross-institutional, community-building events that taught a range of digital skills to postgraduate students and early career researchers.

She helped inaugurate the weekly Hacky Hour drop-in research IT advice sessions at The University of Queensland. She is a certified Software Carpentry instructor and instructor trainer and has taught at many Software Carpentry workshops. She organised the two very successful Library Carpentry sprints in 2016 and 2017 which updated and extended the basic lessons. The 2017 hackathon pulled in more than 100 people across 13 sites in seven countries, including the British Library and the National Library of the Netherlands. She will take a Library Carpentry roadshow to staff at the national and state libraries of Australasia during July and August 2017.

Belinda has formerly worked as a librarian, repository manager, project manager, newspaper columnist, Internet trainer and in research data management. She tweets as @cloudaus (https://twitter.com/cloudaus).

The value of an Integrated eResearch Service Catalogue: a La Trobe University case study

Dr Ghulam Murtaza1, Ms Sheila Mukerjee2

1Intersect Australia Ltd, Sydney, Australia, ghulam@intersect.org.au

2La Trobe University, Bundoora, Australia, sheila.mukerjee@latrobe.edu.au

 

ABSTRACT

The provision of information on the full range of services available to researchers within a university is always challenging. IT groups, research offices and libraries within universities use websites or similar tools to communicate service offerings with varying degrees of effectiveness. Common problems include the large amount of information that needs to be communicated, the use of different terminology by different departments of the university, different methods for requesting services from different departments, and difficulty knowing where to find the most relevant and up-to-date information.

Universities, like other organisations, commonly adopt service desk approaches for their internal service delivery in areas such as IT, HR, facilities, etc. These involve the use of process-oriented tools and aim to achieve economies of scale within an organisation. Increasingly, universities are sourcing some services from external providers and the mix of internal/external service delivery is therefore changing over time. The key objective is to provide information to end users (researchers) that is comprehensive (incorporating both internal and external service delivery), user-friendly, and enables easy access. The researcher should not need to know about the “back office” arrangements used to provide the services.

A number of Intersect member universities are now attempting to improve the quality and effectiveness of the eResearch support information and access they provide to researchers through more integrated approaches. In particular, the Intersect eResearch Analysts in several member organisations are working with the local Research Office, IT and Library groups to create an “integrated service catalogue” for researchers. This involves using the local internal communications channel (e.g. an intranet) to inform researchers and give them an opportunity to request the full range of eResearch support services, regardless of where the service elements are sourced. The consolidated list of services includes:

  • Training relevant to researchers, consultation and advice for researchers, grant assistance, research IT planning, research software development and research data management services
  • HPC, storage and compute services offered by all providers

This presentation shares the experience of collaboratively developing an integrated eResearch services catalogue for La Trobe University and Intersect services. The presentation will cover elements such as the architecture of the integrated services, how external services were embedded within the university, and the behind-the-scenes, IT-led triage process and service delivery model for eResearch services. We present an analysis of the metrics built around this integrated service catalogue to provide insights into the response of the research community and into opportunities for new and improved services.


 

Biographies:

Dr Ghulam Murtaza is currently Intersect Digital Research Analyst for La Trobe University. During his time at Intersect, Ghulam has worked with Australian Catholic University and La Trobe University, where he has led multiple eResearch initiatives, including efforts to embed Intersect services within local eResearch offerings. Ghulam is a published researcher and has previously held research and academic positions at many different reputable universities including UNSW, MAARCS institute of WSU, NEWT and Microsoft Research. Ghulam holds a Bachelor of Science (Honours) and a Master of Science in Computer Science from LUMS, Pakistan. He completed his PhD in Computer Science at the University of New South Wales (UNSW).

Sheila Mukerjee is Manager of Business Engagement for ICT at La Trobe University with a portfolio covering Research and Library. Her role of strategic partner and advisor covers strategy and business plans, future direction, major capital projects, business improvements and the sourcing and building of specialist technology capability for researchers. She has a keen interest in the way universities operate and strategise in the changing landscape of education with particular emphasis on technology and agility. She has published in the areas of data warehousing, student systems and agility in the education sector.

Prediction of Drug Target Interaction Using Association Rule Mining (ARM)

Dr Nurul Hashimah  Ahamed Hassain Malim1, Mr Muhammad Jaziem  Mohamed Javed1

1Universiti Sains Malaysia, Penang, Malaysia, nurulhashimah@usm.my

 

INTRODUCTION

Drug repositioning identifies new indications (i.e. new diseases) for known drugs [1]. It is an innovative stream of pharmaceutical development that offers an edge to both drug developers and patients, since the medicines are already known to be safe to use. This approach is regarded as a successful alternative in the drug discovery process because several drugs have been successfully repositioned to new indications in the past, the most prominent of them being Viagra and Thalidomide, which in turn brought higher revenues [2]. The main reason that makes drug repositioning possible is the accepted concept of ‘polypharmacology’ [3]. In general, polypharmacology transformed the idea of drug development from “one drug, one target” to “one drug, multiple targets” [4]. Polypharmacology appears in drug discovery when (a) a single drug acts on multiple targets of a unique disease pathway, or (b) a single drug acts on multiple targets across multiple disease pathways; the polypharmacological property of a drug helps us to identify more than one target that it can act on, and hence new uses of the respective drug can be discovered [4]. The use of in silico methods to predict the interactions between drugs and target proteins provides a crucial leap for drug repositioning, as it can remarkably reduce wet-laboratory work and lower the cost of the experimental discovery of new drug-target interactions (DTIs) [5].

IN SILICO APPROACHES USED FOR DRUG REPOSITIONING

The Similarity Searching technique, which falls under the ligand-based category, can be classified as a well-established method, having been used by many researchers to predict DTIs [6]. Driving the introduction of these new applications is the desire to find patentable, more suitable lead compounds, as well as to reduce the high failure rates of compounds in the drug discovery and development pipeline [7]. As shown in Figure 1.0 below, new DTIs are predicted when this method finds another reference ligand (a nearest neighbour) for a single ligand (the active query) with known biological activity used to initiate the search [8]. The reference ligand, discovered after the query is screened against a large number of database compounds, is then assumed to bind to the same target as the query compound and is treated as a potential drug [8]. The rationale of this screening method is that true binders/drugs share similar functional groups and/or geometric shapes, given the interacting hot spots within the binding site of the respective protein [9]. Despite its edge in identifying new drugs, similarity searching does have several disadvantages. First, the method depends on the availability of known ligands, which may be limited in the earlier stages of the drug discovery process; in other words, it needs at least one ligand compound in order to initiate the search [8]. Second, because it is based on ligand similarity, similarity searching has difficulty identifying drugs with novel scaffolds that differ from those of the query compounds [10]. The last limitation we identified is that the technique does not determine the binding pose of the ligand within the binding site, nor the associated binding score between the ligand and the protein [11]. The binding mode within the binding site is crucial for exploring the response mechanism between the protein and the ligand and for the accuracy of the identified drug lead. The binding energy score, which relies on the prediction of correct binding modes, also plays an important role when optimizing drug leads.
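To make the ligand-based similarity searching idea concrete, the following minimal Python sketch ranks database compounds by the Tanimoto similarity of their Morgan fingerprints to a single active query. It assumes the open-source RDKit toolkit; the SMILES strings are generic illustrative examples, not compounds from our study.

```python
# Minimal ligand-based similarity searching sketch (illustrative only).
# Assumes the RDKit cheminformatics toolkit; the SMILES strings are hypothetical examples.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query_smiles = "CC(=O)Oc1ccccc1C(=O)O"            # the active query ligand
database_smiles = [
    "CC(=O)Nc1ccc(O)cc1",
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
    "O=C(O)c1ccccc1O",
]

def fingerprint(smiles):
    """Encode a molecule as a 2048-bit Morgan (circular) fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

query_fp = fingerprint(query_smiles)

# Rank the database compounds by Tanimoto similarity to the query; the top-ranked
# "nearest neighbours" are assumed to bind the same target as the query compound.
scores = [(smi, DataStructs.TanimotoSimilarity(query_fp, fingerprint(smi)))
          for smi in database_smiles]
for smi, score in sorted(scores, key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {smi}")
```

In a real screen the database would hold thousands of compounds, and the similarity cut-off (or rank) used to nominate potential drugs would be chosen empirically.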

Knowledge Discovery in Databases (KDD) can be defined as the use of methods from domains such as machine learning, pattern recognition, statistics and other related fields to deduce knowledge from huge collections of data, where the respective knowledge is absent from the database structure [12]. Very large amounts of data are also characteristic of the databases of pharmaceutical companies, which has led to the growing use of KDD methods within the drug discovery process. Lately, however, researchers have turned their interest to other methodologies and ideas that can explain molecular activity in more depth [12]. It is believed that those methods will not necessarily improve prediction accuracy, but they can still assist medicinal chemists in developing the next marketable drugs [12]. This situation has prompted several related techniques from the KDD field to be introduced to chemoinformatics, one of them being Association Rule Mining (ARM) [12]. ARM is a type of classification method that shares properties with machine learning methods but differs slightly in its primary aim, as it focuses on explanation rather than classification [12]. It focuses on the features, or groups of features, that may determine a particular classification for a set of objects [12]. The promising performance of ARM in several instances of target prediction has made it favourable for predicting DTIs.

METHODOLOGIES

The information contained within activity classes, spanning both heterogeneous and homogeneous categories, from the ChEMBL database is important, as it can be used to build the classification model. In our experiment, we use this information to generate appropriate rules that determine protein targets for a particular ligand. Each generated rule is based on the support and confidence levels associated with it. Support indicates how frequently the items appear in the database, while confidence specifies how often the if/then statements have been found to be true. From the support and confidence scores obtained, we select the best rules for target prediction, and these rules are then used to predict protein targets for future ligands. However, the biggest challenge of ARM is that frequent itemset generation is a compute-intensive procedure. Hence, it is crucial that the execution is done on a high-performance machine. At the moment we lack high-performance computing resources, and this limits our ability to fully explore the capability of ARM in relation to our objectives. Nevertheless, we have obtained results for certain parameter ranges, which will be presented on the poster.
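To illustrate the rule-generation step, the sketch below mines frequent itemsets with the Apriori algorithm and derives rules filtered by support and confidence. It assumes the mlxtend Python library and uses a tiny, made-up ligand–feature transaction table; it is a sketch of the technique only, not our ChEMBL workflow.

```python
# Association Rule Mining sketch using Apriori (illustrative only; toy data, not ChEMBL).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is one ligand; columns are binary structural features and target annotations
# (all names below are hypothetical placeholders).
transactions = pd.DataFrame(
    [
        {"feat_A": 1, "feat_B": 1, "target_X": 1, "target_Y": 0},
        {"feat_A": 1, "feat_B": 0, "target_X": 1, "target_Y": 0},
        {"feat_A": 0, "feat_B": 1, "target_X": 0, "target_Y": 1},
        {"feat_A": 1, "feat_B": 1, "target_X": 1, "target_Y": 1},
    ]
).astype(bool)

# Frequent itemset generation: this is the compute-intensive step mentioned above.
itemsets = apriori(transactions, min_support=0.5, use_colnames=True)

# Keep only rules whose confidence clears a chosen threshold; support is reported per rule.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

Rules of the form {feat_A, feat_B} → {target_X} that survive the support and confidence filters would then be used to predict protein targets for new ligands.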

FIGURES

Figure 1.0: Conventional similarity searching method used to predict new ligand that will interact with a particular target [8].

 

REFERENCES

[1] L. Yu, X. Ma, L. Zhang, J. Zhang and L. Gao, “Prediction of new drug indications based on clinical data and network modularity”, Scientific Reports, vol. 6, no. 1, 2016.

[2] T. Ashburn and K. B. Thor, Drug Repositioning: Identifying and Developing New Uses for Existing Drugs. Nat. Rev. Drug Discovery, vol. 3, pp. 673–683, 2004.

[3] J.C. Nacher and J.M. Schwartz, Modularity in Protein Complex and Drug Interactions Reveals New Polypharmacological Properties. PLoS One, vol. 7, e30028, 2012.

[4] J. Peters, “Polypharmacology – Foe or Friend?”, Journal of Medicinal Chemistry, vol. 56, no. 22, pp. 8955-8971, 2013.

[5] “Computational drug discovery: Topics by Science.gov”, Science.gov, 2017. [Online]. Available: https://www.science.gov/topicpages/c/computational+drug+discovery.html. [Accessed: 06-Sep-2017].

[6] T. Katsila, G. Spyroulias, G. Patrinos and M. Matsoukas, “Computational approaches in target identification and drug discovery”, Computational and Structural Biotechnology Journal, vol. 14, pp. 177-184, 2016.

[7] J. Auer, J. Bajorath. In: Keith J, editor. Bioinformatics. Humana Press, pp. 327–47,2008.

[8] P. Willett, J.M. Barnard, G.M. Downs. Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, vol. 38, pp. 983 – 996. 1998.

[9] S. Huang, M. Li, J. Wang and Y. Pan, “HybridDock: A Hybrid Protein–Ligand Docking Protocol Integrating Protein- and Ligand-Based Approaches”, Journal of Chemical Information and Modeling, vol. 56, no. 6, pp. 1078- 1087, 2016.

[10] N. Wale, I. Watson and G. Karypis, “Indirect Similarity Based Methods for Effective Scaffold-Hopping in Chemical Compounds”, Journal of Chemical Information and Modeling, vol. 48, no. 4, pp. 730-741, 2008.

[11] D. Mobley and K. Dill, “Binding of Small-Molecule Ligands to Proteins: “What You See” Is Not Always “What You Get””, Structure, vol. 17, no. 4, pp. 489-498, 2009.

[12] E. Gardiner and V. Gillet, “Perspectives on Knowledge Discovery Algorithms Recently Introduced in Chemoinformatics: Rough Set Theory, Association Rule Mining, Emerging Patterns, and Formal Concept Analysis”, Journal of Chemical Information and Modeling, vol. 55, no. 9, pp. 1781-1803, 2015.

 


Biography:

Nurul Hashimah Ahamed Hassain Malim (Nurul Malim) received her B.Sc (Hons) in computer science and M.Sc in computer science from Universiti Sains Malaysia, Malaysia. She completed her PhD in 2011 from The University of Sheffield, United Kingdom. Her current research interests include chemoinformatics, bioinformatics, data analytics, sentiment analysis and high-performance computing. She is currently a Senior Lecturer in the School of Computer Sciences, Universiti Sains Malaysia, Malaysia.

Victorian Marine Data Portal (VMDP) – Leveraging the IMOS AODN Portal work

Dr Christopher McAvaney1, Dr Alex Rattray2, Ms Michelle Watson3

1 Deakin University, Waurn Ponds, Australia, christopher.mcavaney@deakin.edu.au
2 Deakin University, Warrnambool, Australia, alex.r@deakin.edu.au
3 Deakin University, Geelong, Australia, michelle.watson@deakin.edu.au

 

DESCRIPTION

Supported by the High Value Collection (HVC) program of the Australian National Data Service (ANDS), Deakin University has collaborated with the University of Tasmania (via the Institute of Marine and Antarctic Studies – IMAS) and the Integrated Marine Observing System (IMOS) NCRIS capability to implement an instance of the Australian Ocean Data Network (AODN) portal.

The newly launched Victorian Marine Data Portal (VMDP) provides access to marine data collected by Deakin researchers, and brings together data collected by various research organisations including DELWP, Parks Victoria, and the CSIRO. All the data is openly accessible, supporting the search and discovery of Victorian marine spatial data by researchers and governments, as well as community groups and the general public. The portal complements regional, national and global knowledge databases including Seamap Australia.

The poster will provide an overview of the project, highlighting benefits and value including:

  • The collection, collation and preservation of Victorian marine habitat data from various agencies
  • The provision of access to important research data via a single portal
  • The implementation of a classification scheme to describe the data in a uniform way, and to facilitate discovery
  • Support for ongoing research in this area by simplifying the discovery process and encouraging serendipitous discovery of research data
  • Recognising and re-using the work undertaken by IMOS in support of the AODN Portal software stack (Java Tomcat, GeoNetwork and GeoServer on a PostgreSQL database with GIS extensions)
  • Ability to use ArcMap/ArcGIS or QGIS to interact with the portal (a minimal scripted access sketch follows this list)
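Because the portal is built on the GeoServer-based AODN stack noted above, the same layers that ArcMap/ArcGIS or QGIS consume can also be queried from a script through standard OGC web services. The sketch below is a minimal example assuming the OWSLib Python library; the endpoint URL and layer name are hypothetical placeholders, not the actual VMDP addresses.

```python
# Minimal sketch of querying a GeoServer-backed portal via OGC WFS (illustrative only).
# The endpoint URL and layer name are placeholders, not the real VMDP addresses.
from owslib.wfs import WebFeatureService

WFS_URL = "https://example-vmdp.deakin.edu.au/geoserver/wfs"   # placeholder endpoint

wfs = WebFeatureService(url=WFS_URL, version="1.1.0")

# Discover the layers published by the portal.
for name, layer in wfs.contents.items():
    print(name, "-", layer.title)

# Fetch one layer as GeoJSON for use in a notebook, script or desktop GIS.
response = wfs.getfeature(typename=["vmdp:marine_habitat"],     # placeholder layer name
                          outputFormat="application/json")
with open("marine_habitat.geojson", "wb") as fh:
    fh.write(response.read())
```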

The portal is built to support the ongoing ingestion of new marine datasets, and the poster will include a detailed view of the VMDP marine research data lifecycle, from the collection of data via instruments through to ingestion into the portal.

The poster will also touch on future work to support marine research expected to be undertaken over the coming months, including:

  • Enhancing collection information through the digitisation of analogue video
  • Collaboration with UTAS/IMAS for Seamap Connections (aggregation of aggregations)

 

Biography:

Christopher McAvaney is Services Manager, eResearch at Deakin University, responsible for establishing and implementing an eResearch program of work from eSolutions at Deakin University. A key deliverable of Christopher’s work is articulating a range of research services within the ICT service catalogue of the university. Christopher’s role involves working with eSolutions (central ICT), Deakin University Library and Deakin Research (the research office) to ensure that a coherent and consistent approach is followed. An important aspect of his role is collaborating with external partners at the state and national level to build local and global collaboration opportunities. Christopher’s research background is in parallel and distributed systems, in particular applied research around an automated parallelisation tool.

http://orcid.org/0000-0002-8130-0309

Cybercriminal Personality Detection through Machine Learning

Dr Nurul Hashimah Ahamed Hassain Malim1, Saravanan Sagadevan1, Muhd Baqir Hakim1, Nurul Izzati Ridzuwan1

1 Universiti Sains Malaysia, Penang, Malaysia, nurulhashimah@usm.my

 

ABSTRACT

The development of sophisticated forms of communication technology such as social networks has exponentially raised the number of users who participate in online activities. Although this development brings many positive social benefits, the dark side of online communication remains a major concern in virtual interactions. The dark side of online activity is often referred to as cyber threats or cyber crime. Over the past two decades, cybercrime cases have increased exponentially and threatened the privacy and lives of online users. Severe kinds of cyber-criminal activity, such as cyber bullying and cyber harassment, are often executed by exploiting text messages and the anonymity offered by social network platforms such as Facebook and Twitter. However, linguistic clues such as patterns of writing and expression in text messages often act as fingerprints, revealing the personality traits of the culprits who hide behind the anonymity provided by social networks [1]. Personality traits are hidden abstractions that combine the emotions, behavior, motivation and thinking patterns of humans, and they often mirror a person’s true characteristics through activities conducted intentionally or unintentionally [2]. Individuals naturally differ in their talking and writing styles, and although those differences are hard to observe, the distinct styles are unique from person to person, so there is a tendency for the identity of writers to be decipherable simply by observing the patterns of their writing, especially the formation of words, phrases and clauses. Sir Francis Galton was the first person to hypothesize that natural language terms might reflect the personality differences of humankind [3]. Furthermore, Hofstee suggested that nouns, sentences and actions might have some kind of connotation towards personality [4]. On the other hand, for several decades people from forensic psychology, the behavioral sciences and law enforcement agencies have been working together to study and integrate the science of psychology into criminal profiling [5]. From a review of the literature on psychology, linguistics and behavior, it can be affirmed that a strong relationship exists between personality traits, especially those related to criminality, and writing/language skills. This raises the question of whether the writing patterns of cyber criminals in social networks can be identified or detected using automatic classifiers; if so, how well do the classifiers perform, and which words or combinations of words are frequently used by cyber predators? To answer these questions, we conducted an empirical investigation [9] (the main study) together with two other small-scale studies [10,11] that extend the main study, using textual sources from Facebook and Twitter and exploiting the trait descriptions of the Three Factor Personality Model together with sentiment valences. For the main study, open-source Facebook [6] and Twitter [7] data were used as text input, while the data for the other two small-scale studies were harvested from Twitter using Tweepy, a Python library for accessing the Twitter API. The main study and the second study used data written only in English, while the third study used tweets in the Malay language (Bahasa Malaysia).
In these studies, we employed four main classifiers, namely Sequential Minimal Optimization (SMO), Naive Bayes (NB), K-Nearest Neighbor (KNN) and J48, with ZeroR as a baseline, from the Waikato Environment for Knowledge Analysis (WEKA) machine learning tool. The traits from the Three Factor Model were used in this study because of the model’s widespread use in criminology, because its small number of traits eases the characteristic categorization process, and because a large body of empirical evidence associates the Psychoticism trait with criminal characteristics; sentiment valences were used to measure the polarity of sentiment terms. The major traits of the Three Factor Model and their associated characteristics are listed in Table 1.

Table 1: Three Factor Model traits and their associated characteristics [8].

Trait          Specific characteristics
Extraversion   Sociable, lively, active, assertive, sensation seeking, carefree, and dominant.
Neuroticism    Anxious, depressed, guilt feelings, low self-esteem, tense, irrational, and moody.
Psychoticism   Aggressive, egocentric, impersonal, impulsive, antisocial, creative and tough-minded.

The three studies used a similar research framework, as follows. Step 1: data collection and preprocessing (data cleansing, stemming, part-of-speech tagging); Step 2: data annotation; Step 3: automatic classification by the four classifiers; Step 4: performance analysis and identification of criminal-related terms (using the chi-square method). The tables below present the performance of the machine learning classifiers in these studies and list the terms identified as being associated with criminal behavior. The class balancing method called Synthetic Minority Over-sampling Technique (SMOTE) was used to overcome the unbalanced volume of class instances.
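Before the result tables, the following sketch illustrates Steps 3 and 4 in outline. It assumes the scikit-learn and imbalanced-learn Python libraries as stand-ins for the WEKA classifiers named above (MultinomialNB for NB, a decision tree for J48), and uses a made-up term-frequency matrix rather than our annotated Facebook/Twitter data.

```python
# Sketch of the classification and analysis steps (illustrative only; toy data,
# scikit-learn stand-ins for the WEKA classifiers).
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import chi2
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 50)).astype(float)   # hypothetical term-frequency features
y = np.array([0] * 150 + [1] * 50)                     # unbalanced trait labels

# SMOTE synthesises minority-class samples to balance the class instances.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# Step 3: cross-validated classification with two of the classifiers.
for name, clf in [("NB", MultinomialNB()), ("J48-like tree", DecisionTreeClassifier())]:
    scores = cross_val_score(clf, X_bal, y_bal, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Step 4: chi-square scores highlight the terms most associated with the class labels.
chi2_scores, _ = chi2(X_bal, y_bal)
print("top term indices:", np.argsort(chi2_scores)[::-1][:10])
```

The accuracies printed by such a sketch will of course differ from Tables 2 and 3, which report the WEKA implementations run on the real annotated data.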

Table 2: Accuracy of classifiers with and without the SMOTE class balancing method [10]. Performance is measured as True Positive (TP) and False Positive (FP) percentages.

Type/Classifier   ZeroR (TP / FP)   NB (TP / FP)    KNN (TP / FP)   SMO (TP / FP)   J48 (TP / FP)
Without SMOTE     47.27 / 52.73     58.18 / 41.82   47.27 / 52.73   72.73 / 27.27   78.18 / 21.82
With SMOTE        40.63 / 59.38     68.75 / 31.25   53.13 / 46.88   73.44 / 26.56   75.00 / 25.00

 

Table 3: Accuracy of classifiers showing the effect of the number of cross-validation folds [11]. Performance is measured as True Positive (TP) and False Positive (FP) percentages.

Folds/Classifier   ZeroR (TP / FP)   NB (TP / FP)   KNN (TP / FP)   SMO (TP / FP)   J48 (TP / FP)
3                  53.3 / 46.7       80.0 / 20.0    63.3 / 36.7     73.3 / 26.7     50.0 / 50.0
5                  53.3 / 46.7       90.0 / 10.0    56.7 / 43.3     70.0 / 30.0     63.3 / 36.7
10                 53.3 / 46.7       90.0 / 10.0    56.7 / 43.3     86.3 / 16.7     70.0 / 30.0

 

Table 4: Terms highly associated with criminal behavior [9].

Facebook                                    Twitter
Unigram   Bigram      Trigram               Unigram   Bigram     Trigram
Damn      The hell    I want to             Suck      Damn It    A big ass
Shit      Damn it     Damn it I             Adore     The hell   A bit more
Fuck      Hell i      Is a bitch            Annoy     A bitch    A bitch and
Hell      My Fuck     What the Fuck         Asshole   A damn     A damn good
Ass       The shit    What the hell         Shit      A fuck     All fuck up
Suck      Damn you    I feel like           Fuck      A hell     A great fuck
Bad       The fuck    The hell I            Hell      Damn you   A great night
Feel      A bitch     -                     Cute      A shit     A pain in
Hate      Fuck yeah   -                     Damn      My ass     A fuck off

 

In conclusion, our investigation showed that J48 performed better than the other classifiers both with and without the SMOTE class balancing technique, while the effect of cross validation varied for each classifier. Overall, however, Naïve Bayes performed better across the cross validation experiments. This investigation also produced a list of words that may be used by cyber criminals, based on the language model specifications. For future work, we plan to use deep learning methods to analyze content related to cyber terrorism, and we welcome any collaboration on collecting social network cyber terrorism textual data.

REFERENCES

  1. Olivia Goldhill.. Digital detectives: solving crimes through Twitter, 2013. The Telegraph.
  2. Navonil  Majumder,  Soujanya  Poria,  Alexander  Gelbukh,  and  Erik  Cambria.  2017.  Deep  learning  based document modeling for personality detection from text. IEEE Intelligent Systems 32(2):74–79.
  3. Sapir, Edward. Language: An Introduction to the Study of Speech. New York: Harcourt, Brace, 1921.
  4. Matthews, G, Ian, J. D., & Martha, C. W. Personality Traits (2nd edition). Cambridge University Press, 2003.
  5. Gierowski, J. K. Podstawowa problematyka psychologiczna w procesie karnym. Psychologia w postępowaniu karnym, Lexis Nexis, Warszawa 2010.
  6. Celli, F., Pianesi, F., Stillwell, D., & Kosinski, M. Workshop on Computational Personality Recognition (Shared Task). In Proceedings of WCPR13, in conjunction with ICWSM-2013.
  7. Alec, G., Richa, B, & Lei, H.. Twitter Sentiment Classification using Distant Supervision, 2009.
  8. Coleta, V. D, Jan, M. A., Janssens, M., & Eric E. J. PEN, Big Five, juvenile delinquency and criminal recidivism. Personality and Individual Differences, 39, (2005) 7–19. DOI:10.1016/j.paid.2004.06.016.
  9. Saravanan Sagadevan. Thesis : Comparison Of Machine Learning Algorithms For Personality Detection In Online Social Networking, 2017.
  10. Muhd, Baqir Hakim. Profiling Online Social Network (OSN) User Using PEN Model and Dark Triad Based on English Text Using Machine Learning Algorithm, 2017 (In Review).
  11. Nurul Izzati Binti Ridzuwan. Online Social Network User-Level Personality Profiling Using Pen Model Based On Malay Text (In Review), 2017.

 


Biography:

Nurul Hashimah Ahamed Hassain Malim (Nurul Malim) received her B.Sc (Hons) in computer science and M.Sc in computer science from Universiti Sains Malaysia, Malaysia. She completed her PhD in 2011 from The University of Sheffield, United Kingdom. Her current research interests include chemoinformatics, bioinformatics, data analytics, sentiment analysis and high-performance computing. She is currently a Senior Lecturer in the School of Computer Sciences, Universiti Sains Malaysia, Malaysia.

How to Improve Fault Tolerance on HPC Systems?

Dr Ahmed Shamsul Arefin1

1Scientific Computing, Information Management & Technology, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Canberra, Australia

INTRODUCTION

Today’s HPC systems are complex systems built from hardware and software that were not necessarily designed to work together as one complete system. Therefore, in addition to regular maintenance-related downtime, hardware errors such as voltage fluctuation, temperature variation, electrical breakdown and manufacturing defects, as well as software malfunctions, are common in supercomputers. There are several ways to achieve some fault tolerance, which include (but are not limited to) checkpoint/restart, programming-model-based fault tolerance and algorithm-based theoretical solutions, but in real life none of these covers all the situations mentioned above [1]. In this work, we have reviewed a few of the checkpoint/restart methods and report our experience with them, including two practical solutions to the problem.

TEST DESCRIPTION

On a small cluster consisting of only two nodes (master and compute), constructed using Bright Cluster Manager (BCM) ver. 7.3, we deployed a couple of existing methods (see below) that can support operating-system and software-level fault tolerance. Our experience is described below. In our model cluster, each node had 16 CPU cores in 2 x Intel Xeon CPU E5-2650 0 @ 2.00GHz, a minimum of 64GB RAM, a 500GB local HDD, and Ethernet/InfiniBand connections for networking purposes. Slurm ver. 16.05 was installed as part of BCM’s provisioning on both nodes, along with SLES 12 SP1 as the operating system. The master was set to act as both the head and login node, as well as the job scheduler node.

OUTCOMES

Our first test candidate was BLCR (Berkeley Lab Checkpoint/Restart). It allows programs running on Linux to be checkpointed, i.e. written entirely to a file, and then later restarted. BLCR can be useful for jobs killed unexpectedly due to power outages or exceeding run limits. It requires neither instrumentation nor modification of user source code, nor recompilation. We were able to install it on our test cluster and successfully checkpoint a basic Slurm job consisting of only one statement: sleep 99. However, we faced the following difficulties during its pre- and post-installation:

  • The BLCR website [2] lists an obsolete version of the software (ver. 0.8.5, last updated on 29 Jan 2013), which does not work with newer kernels such as the one we get on SLES 12 SP1. We had to obtain a newer version, along with a software patch, from the BLCR team to get it installed on the test cluster. Slurm had to be recompiled and reconfigured several times following the BLCR installation, requiring ticket interactions with the Slurm, BCM and BLCR teams. Unfortunately, in the end the latest BLCR installation process did not work on SLES 12 SP2, again due to a newer kernel version (>4.4.21).
  • MPI incompatibility: when installed along with Slurm, BLCR can only work with MVAPICH2 and does not support the InfiniBand network [2]. This MPI variant was not available in our HPC applications directory and therefore could not be tested.
  • Not available to interactive jobs: The BLCR + Slurm combination could not checkpoint/restart interactive jobs properly.
  • Not available to GPU jobs: The BLCR did not support GPU or Xeon Phi jobs.
  • Not available to licensed software: talking to a license server after a checkpoint/restart did not work.

Our next candidate was DMTCP (Distributed MultiThreaded CheckPointing). This application-level tool claims to transparently checkpoint single-host or distributed MPI computations in user space (application level), with no modifications to user code or to the OS. As of today, Slurm cannot be integrated with DMTCP [2,4]. It was also noted that, after a checkpoint/restart, computation results can become inconsistent when compared against the non-checkpointed output, as validated by a different CSIRO-IMT (User Services) team member.

Our next candidate was CRIU (Checkpoint/Restore In Userspace). It claims to freeze a running application, or at least a part of it, and checkpoint it to persistent storage as a collection of files. It works in user space (application level), rather than in the kernel. However, it provides no Slurm, MPI or InfiniBand support as of today [2,5]. We therefore have not attempted it on the test cluster.

PRACTICAL SOLUTIONS

After reviewing some of the publicly available tools that could potentially support HPC fault tolerance, we decided to propose the following practical solutions to CSIRO HPC users.

CHECKPOINT AT THE APPLICATION LEVEL

Application-level checkpointing: in this method, the user’s application (based on [1]) is set to explicitly read and write its own checkpoints. Only the data needed for recovery is written out, and checkpoints need to be taken at “good” times. However, this can incur a higher time overhead, such as several minutes to perform a single checkpoint on larger jobs (increasing program execution time), as well as user-level development time. Users may therefore need further support when applying this solution.
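A minimal sketch of this idea is shown below for a simple iterative job whose only recovery state is a loop counter and a partial result; the file name and checkpoint interval are arbitrary illustrative choices, not a prescribed CSIRO interface.

```python
# Minimal application-level checkpoint/restart sketch (illustrative only).
# The application itself decides what to save and when ("good" times).
import json
import os

CHECKPOINT = "checkpoint.json"     # arbitrary file name on persistent storage
SAVE_EVERY = 100                   # arbitrary checkpoint interval, in iterations

# Restart path: resume from the last checkpoint if one exists, otherwise start fresh.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as fh:
        state = json.load(fh)
else:
    state = {"step": 0, "partial_sum": 0.0}

total_steps = 10_000
for step in range(state["step"], total_steps):
    state["partial_sum"] += step * 0.5          # stand-in for the real computation
    state["step"] = step + 1

    # Write out only the data needed for recovery, at chosen intervals.
    if state["step"] % SAVE_EVERY == 0:
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp, CHECKPOINT)             # atomic rename avoids half-written checkpoints

print("result:", state["partial_sum"])
```

If the job is killed and later requeued, re-running the same script resumes from the last completed checkpoint rather than from the beginning.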

NO-CHECKPOINT, KILL-REQUEUE THE JOB

Slurm supports job preemption: the act of stopping one or more “low-priority” jobs to let a “high-priority” job run. When a high-priority job has been allocated resources that have already been allocated to one or more low-priority jobs, the low-priority job(s) are pre-empted (suspended). The low-priority job(s) can resume once the high-priority job completes. Alternatively, the low-priority job(s) are killed, requeued and restarted using other resources. To validate this idea, we deployed a new “killable job” partition implementing a “kill and requeue” policy. This solution was successfully tested by killing low-priority jobs, and it can be further tuned to automatically requeue lost jobs following a disaster, resulting in improved fault tolerance on our HPC systems.
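In outline, such a policy can be expressed through Slurm’s preemption settings. The fragment below is an illustrative slurm.conf sketch only; partition names, node lists and priority values are placeholders rather than the CSIRO production configuration.

```
# Illustrative slurm.conf fragment for a preempt-and-requeue ("killable job") policy.
# All names and values are placeholders.
PreemptType=preempt/partition_prio     # preemption decided by partition priority
PreemptMode=REQUEUE                    # preempted jobs are killed and requeued
JobRequeue=1                           # allow batch jobs to be requeued by default

# Low-priority partition: jobs here may be killed and requeued by higher-priority work
PartitionName=killable Nodes=node[01-02] PriorityTier=1 PreemptMode=REQUEUE Default=NO

# Higher-priority partition whose jobs can preempt work in the killable partition
PartitionName=priority Nodes=node[01-02] PriorityTier=10 Default=YES
```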

CONCLUSION AND FUTURE WORKS

Using a test cluster, we investigated and evaluated some of the existing methods for providing better fault tolerance on our HPC systems. Our observations suggest that program-level checkpointing would be the best way to safeguard a running code/job, but at the cost of development time. However, we have not yet attempted a few other methods that could potentially provide an alternative solution, such as SCR (Scalable Checkpoint/Restart for MPI), FTI (Fault Tolerance Interface for hybrid systems), Docker, Charm++ and so on; this remains future work.

REFERENCES

[1] DeBardeleben et al., SC 15 Tutorial: Practical Fault Tolerance on Today’s HPC Systems SC 2015

[2] Hargrove, P. H., & Duell, J. C. (2006). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.

[3] Rodríguez-Pascual, M. et al. Checkpoint/restart in Slurm: current status and new developments, SLUG 2016

[4] Ansel, J., Arya, K., & Cooperman, G. (2009). DMTCP: Transparent checkpointing for cluster computations and the desktop. In IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.

[5] Emelyanov, P. CRIU: Checkpoint/Restore In Userspace, July 2011.


Biography:

Dr Ahmed Arefin works as an IT Advisor for HPC Systems at Scientific Computing, IM&T, CSIRO. In the past, he was a Specialist Engineer for the HPC systems at the University of Southern Queensland. He completed his PhD and a postdoc in the area of HPC and parallel data mining at the University of Newcastle, and has published articles in PLOS ONE and Springer journals and in IEEE-sponsored conference proceedings. His primary research interest is the application of high performance computing to data mining, graphs/networks and visualization.

https://orcid.org/0000-0002-9290-3551

Stemformatics Live Demonstration

Mr Rowland Mosbergen1

1University of Melbourne, Parkville, Australia

 

Title Stemformatics Live Demo
Synopsis Stemformatics is primarily a web-based pocket dictionary for stem cell biologists, running on the NeCTAR cloud. It has been part of the stem cell community for over 6 years and allows biologists to quickly and easily visualise their private datasets. They can also benchmark their datasets against 330+ high-quality, preprocessed public datasets.
Format of demonstration Live Demonstration
Presenter(s) Rowland Mosbergen, Stemformatics Project manager, University of Melbourne
Target research community Biologists and individual bioinformaticians who want their users to look at their data interactively
Statement of Research Impact Stemformatics allows biologists to benchmark their dataset against 350+ public, manually curated and high quality datasets that include stem cells, leukaemia and infection and immunity samples. This has contributed to the recent influx of biologists wanting to identify potential cells of origin from a particular tissue for expression of some of the genes that they are interested in.
Request to schedule alongside particular conference session I’m giving a talk on Thursday afternoon
Any special requirements Monitor to display Stemformatics

 


Biography:

Rowland Mosbergen is the Project Manager and Lead Developer for the Stemformatics.org collaboration resource. Rowland has 17 years’ experience in IT, gained while working in research, corporate financial software and small business. He graduated from QUT in 1997 with a Bachelor of Engineering in Aerospace Avionics, then worked for GBST, a software company servicing the financial industry, where he worked with National Australia Bank and Merrill Lynch on their Margin Lending products for over 4 years. Rowland owned and ran a computer support business for over 5 years, then worked as a web developer for 2 years before joining the Wells laboratory as part of the Stemformatics team in 2010.

Rowland’s experience in the commercial and private sectors gives him a solid understanding of customer requirements when designing and implementing web resources. He has implemented scalable design solutions for database querying and data visualisation that service a growing research community. He is a key member of a diverse academic team that is product-focused with an emphasis on quality, responsiveness and customer usefulness. He prides himself on developing web environments that are fast, useful and intuitive.
