‘Cursed Forest’ – a random forest implementation for big, extremely high-dimensional data

Piotr Szul1, Aidan O’Brien2, Robert Dunne3, Denis C. Bauer

1Data61, CSIRO, Brisbane, QLD, Australia, piotr.szul@data61.csiro.au

2Health & Biosecurity, CSIRO, Sydney, NSW, Australia

3Data61, CSIRO, Sydney, NSW, Australia

 

INTRODUCTION

Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at the population level rather than for a small number of individuals. This provides new power to whole-genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases such as diabetes or cancer.

As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the 1000 Genomes Project dataset [6], with the genomes of 2,504 individuals, includes nearly 85 million genomic variants and has a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches, especially if potential interactions between variants need to be considered.

Random forest [1] is one of the methods found to be useful in this context [3], because of its amenability to parallelization, its robustness, and its inherent ability to model interactions [2]. The variable importance measures extracted from random forest models can be used to identify variants associated with traits of interest.
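To illustrate this general approach (not CursedForest itself), the sketch below uses Spark MLlib's standard random forest API to train a model on a toy genotype matrix and read off the per-feature importances; the data and column names are illustrative only.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rf-importance").getOrCreate()
import spark.implicits._

// Toy dataset: each row is a sample; 'features' holds genotype calls
// encoded as alternate-allele counts (0, 1 or 2) for three variants.
val training = Seq(
  (0.0, Vectors.dense(0.0, 1.0, 2.0)),
  (1.0, Vectors.dense(2.0, 0.0, 1.0)),
  (0.0, Vectors.dense(0.0, 2.0, 2.0)),
  (1.0, Vectors.dense(2.0, 1.0, 0.0))
).toDF("label", "features")

val model = new RandomForestClassifier()
  .setNumTrees(100)
  .fit(training)

// featureImportances ranks the variants by their contribution to the
// forest's splits; high-scoring variants are candidate trait associations.
println(model.featureImportances)
```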

A number of random forest implementations are available for single-machine computing, with interfaces for both R and Python. Some of them, such as Ranger [4], are specifically designed to process high-dimensional WGAS-like datasets and boast significant performance improvements over the more generic implementations.

These implementations, however, are limited by the memory of a single machine, and larger datasets require distributed solutions in which the data can be partitioned among multiple machines. This approach underpins the computational model of many leading big data technologies, including Apache Hadoop and, more recently, Apache Spark [7], a fast and general engine for big data processing with built-in modules for streaming, SQL, machine learning and graph processing.

Notably, the Spark machine learning library (Spark MLlib [8]) comes with a random forest implementation capable of dealing with huge datasets. This implementation, however, is tuned to work on typical datasets with a large number of samples and a relatively small number of variables, and it either fails or is inefficient on high-dimensional data [5].

To address these problems, we have developed CursedForest, a Spark-based, distributed implementation of random forest optimized for large, high-dimensional datasets. We have successfully applied CursedForest to datasets beyond the reach of other tools, and found its performance superior on smaller datasets. We are currently applying CursedForest, released as part of the VariantSpark [9] toolkit, to a number of WGAS studies.

CURSED FOREST

Typically, random forest implementations operate on the traditional matrix-like data representation with samples in rows and features in columns. In a distributed scenario this leads to ‘by row’ partitioning, which for building decision trees on high-dimensional data has been shown to be significantly less efficient than the alternative ‘by column’ layout and partitioning [5].
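The difference between the two layouts can be sketched with plain Spark primitives. The transformation below (a conceptual sketch, not CursedForest's actual code) transposes a row-partitioned matrix into one record per feature, so that evaluating candidate splits for a feature becomes a local operation on a single partition:

```scala
import org.apache.spark.rdd.RDD

// Conceptual sketch: transpose a 'by row' RDD (one record per sample)
// into a 'by column' RDD (one record per feature).
def transpose(rows: RDD[Array[Double]]): RDD[(Int, Array[Double])] = {
  val indexedRows = rows.zipWithIndex()           // (sampleValues, sampleIndex)
  indexedRows
    .flatMap { case (values, sampleIdx) =>
      values.zipWithIndex.map { case (v, featureIdx) =>
        (featureIdx, (sampleIdx, v))              // emit one matrix cell per feature
      }
    }
    .groupByKey()                                 // gather each feature's full column
    .mapValues { cells =>
      cells.toArray.sortBy(_._1).map(_._2)        // order the column by sample index
    }
}
```

Once the data is in this layout, each worker can evaluate splits for its features over all samples without shuffling rows; only the small per-node split statistics need to be communicated between machines.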

CursedForest uses the basic principle of ‘by column’ partitioning but extends it to ensembles of trees and introduces a number of optimizations aimed at the efficient processing of extremely high-dimensional genomic datasets, including a memory-efficient representation of genomic data, optimized communication patterns and computation batching.
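As an illustration of the memory savings available for genomic data (our back-of-the-envelope sketch; the exact encoding CursedForest uses is not detailed here), biallelic genotype calls take only the values 0, 1 or 2, so each variant's column can be stored as one byte per sample rather than one eight-byte double:

```scala
// Illustrative sketch of a compact 'by column' record: one entry per
// variant, with genotype calls (0, 1 or 2 alternate alleles) packed into
// a byte array indexed by sample. The encoding shown is an assumption
// for illustration, not necessarily CursedForest's exact format.
final case class VariantColumn(variantId: String, genotypes: Array[Byte])

def encode(variantId: String, calls: Seq[Int]): VariantColumn =
  VariantColumn(variantId, calls.map(_.toByte).toArray)
```

At one byte per call, a variant column for 2,504 samples occupies about 2.5 KB, so the ~85M variants of the 1000 Genomes dataset fit in roughly 200 GB across a cluster, compared with roughly 1.7 TB if each call were held as a double.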

RESULTS

We have compared the performance of CursedForest on genomic datasets against other leading implementations and tested its ability to scale to very large datasets. The results demonstrate not only that CursedForest is the only implementation able to process WGAS-size datasets, but also that it significantly outperforms other methods on smaller datasets.

Figure 1: Performance of CursedForest compared to MLlib and Ranger (left); scaling of CursedForest to WGAS-sized datasets (right).

We have also demonstrated that CursedForest can accurately predict ethnicity from the whole-genome sequencing profiles of 2,504 individuals, being the only implementation that scales to the full dataset. We trained CursedForest on the 1000 Genomes dataset, which consists of 2,504 samples with 81,047,467 features each, to predict ethnicity from the genomic profiles. CursedForest achieves an out-of-bag error of OOB = 0.01 and completes in 36 minutes 54 seconds, demonstrating its capability to run on the population-scale cohorts of real-world applications.
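The out-of-bag (OOB) error quoted above is the standard internal error estimate for random forests: each tree is trained on a bootstrap sample, and each individual is scored only by the trees that never saw it during training. A minimal sketch of the computation (our illustration, independent of CursedForest's code):

```scala
// Sketch of the standard out-of-bag error estimate: each sample is
// classified by majority vote over only those trees whose bootstrap
// sample did not contain it; the OOB error is the fraction misclassified.
def oobError(
    labels: Array[Int],          // true class label per sample
    votes: Array[Map[Int, Int]]  // per sample: OOB vote count per class
): Double = {
  val scored = labels.zip(votes).filter(_._2.nonEmpty) // samples with OOB votes
  val wrong = scored.count { case (label, v) =>
    v.maxBy(_._2)._1 != label                          // majority vote vs truth
  }
  wrong.toDouble / scored.length
}
```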

CONCLUSIONS

We have developed CursedForest, a Spark-based random forest implementation optimized for high-dimensional datasets, and demonstrated that it outperforms other tools on genomic datasets, both in terms of computation time and dataset size limits, and that it is capable of running on the population-scale cohorts of real-world applications.

REFERENCES

1. Breiman L. Random Forests. Machine Learning. 2001 October;45(1):5–32.
2. Lunetta K.L. et al. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics. 2004;5:32.
3. Díaz-Uriarte R., Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3.
4. Wright M.N., Ziegler A. “ranger: A fast implementation of random forests for high dimensional data in C++ and R.” arXiv preprint arXiv:1508.04409 (2015).
5. Abuzaid F. et al. Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale. Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain.
6. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012 Nov;491(7422):56–65.
7. Apache Spark, available from: https://spark.apache.org/
8. Spark MLlib, available from: https://spark.apache.org/mllib/
9. VariantSpark, available from: https://github.com/aehrc/VariantSpark

Biography:

Piotr Szul is a Senior Engineer at CSIRO’s Data61. He holds an MSc degree in Computer Science and has over fifteen years of experience in developing various types of commercial and scientific software applications.

Since joining CSIRO in 2012, Mr Szul has been involved in a number of projects developing research platforms on clouds and applying emerging big data technologies to research problems, in application domains including image processing, social media, genomics and materials science.

In addition to his core software engineering activities, Mr Szul has been involved in building experimental cloud and big data processing infrastructure, and has authored and co-authored a number of papers and delivered presentations on various topics related to big data technologies and their applications.
