Mr Piotr Szul1, Mr Aidan O’Brien2, Dr Rob Dunne3, Dr Denis Bauer2
1Data61, CSIRO, Brisbane, QLD, Australia, firstname.lastname@example.org
2Health & Biosecurity, CSIRO, Sydney, NSW, Australia
3Data61, CSIRO, Sydney, NSW, Australia
Recent advances in genome sequencing technologies and bioinformatics have enabled whole-genomes to be studied at population-level rather then for small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant, for example the dataset from 1000 Genomes project  with genomes of 2504 individuals includes nearly 85M genomic variants with raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches especially if potential interaction between variants need to be considered.
Random forest  is one of the methods that was found to be useful in this context , both because its propensity for parallelization and robustness and the inherent ability to model interaction . The variable importance measure extracted from random forest models can be used to identify variants associated with traits of interests.
There is a number of random forest implementation available for single machine computing with interfaces both for R and Python. Some of them, such as Ranger , are specifically designed to process highly dimensional WGAS-like data sets and boast significant performance improvements over the more generic implementations.
These implementations however are limited by the size of a single machine memory, and for larger datasets distributed solutions, in which data can be partitioned among multiple machine are needed. This approach underpins the computational model of many leading big data technologies including Apache Hadoop and more recently Apache Spark  – a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Notably Spark machine learning library (Spark MLlib ) comes with a random forest implementation capable of dealing with huge datasets. This implementation however is tuned to work on typical dataset with large number of samples and relatively small number of variable and either fails or is inefficient for highly dimensional data .
To address these problems, we have developed the CursedForest – a Spark based, distributed implementation of random forest optimized for large, highly dimensional data sets. We have successfully CursedForest applied it to datasets beyond the reach of other tools and for smaller datasets found its performance superior. We are currently applying CursedForest, released as part of the VariantSpark  toolkit, to a number of WGAS studies.
Typically, random forest implementations operate on the traditional matrix-like data representation with samples in rows and features in columns. In distribute scenario that leads to ‘by row’ partitioning, which for building decision trees on highly dimensional data has been proven to be significantly less efficient that the alternative ‘by column’ layout and partitioning .
CursedForests uses the basic principle of ‘by column’ partitioning but extends it to ensembles of trees and introduces a number of optimizations aimed at efficient processing of extremely highly dimensional genomics dataset which include memory efficient representation of genomics data, optimized communication patterns and computation batching.
We have compared the performance of CursedForset on genomics datasets against other leading implementation and tested its ability to scale on very large datasets. The results demonstrate not only that CursedForset is the single implementation able to process the WGAS-size datasets but also that it significantly outperforms other methods on smaller datasets.
Figure 1: Peformance of CursedForest compared to MLlib and Ranger (left); Scaling of CursedForest to WGAS-sized datasets (right).
We have also demonstrated that CursedForest can accurately predict ethnicity from whole genome sequencing profiles of 2,500 individuals being the only implementation scaling to the full dataset. We have trained CursedForest on the 1000 genomes dataset, which consists of 2,504 samples with 81,047,467 features each to predict the ethnicity from genomic profiles. CursedForest achieves an out of bag error of OOB=0.01 and completes in 36 min 54 seconds, demonstrating its capability to run on population-scale cohorts of real world applications.
We we have developed the `CursedForest` – a Spark based random forest implementation optimized for highly dimensional data sets – and demonstrated that it outperforms other tools on genomics datasets both in terms computation time and data-set size limits and is capable of running on population-scale cohorts of real world applications.
- Breiman L., Random Forests. Machine Learning. 2001 October;45(1):5–32.
- Lunetta et al., Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics 2004 5:32
- Díaz-Uriarte R, Alvaréz de Andres S., Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3
- Wright, Marvin N., and Andreas Ziegler., “ranger: A fast implementation of random forests for high dimensional data in C++ and R.” arXiv preprint arXiv:1508.04409 (2015)
- Firas Abuzaid et al., Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale. Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain
- 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012 Nov;491(7422):56–65.
- Apache Spark, available from https://spark.apache.org/
- Spark MLLib, available from: https://spark.apache.org/mllib/
- VariantSpark available from: https://github.com/csirobigdata/variant-spark
Piotr Szul is a Senior Engineer in CSIRO Data61. He holds a MSc degree in Computer Science and has over fifteen year of experience in developing various types of commercial and scientific software applications.
Since his joining of CSIRO in 2012 Mr Szul has been involved in a number of project of developing research platforms and applying emerging big data technologies to research problems in domains of social media, genomics and material sciences. In addition to his core software engineering activities Mr Szul has also actively involved in building experimental cloud and big data processing infrastructure.