Pipelines and tools to analyse big datasets using machine learning methods
Big datasets in healthcare have complex structures and particular characteristics that make them difficult to analyse with standard methods. We develop open tools and pipelines, based on modern machine learning and prediction modelling methods, to facilitate their analysis.
A pipeline based on topological machine learning to identify homogeneous patient subgroups and relevant features
Dr Raquel Iniesta and Dr Ewan Carr developed a novel pipeline built on recent advances in topological data analysis (TDA) to identify homogeneous clusters of patients with respect to a characteristic of interest. TDA is a growing field that provides tools to infer, analyse, and exploit the shape of data, and it holds particular promise for precision medicine, where we often want to identify groups of patients with a similar response to treatment or a similar prognosis. The pipeline focuses on Mapper, a TDA clustering algorithm that identifies topological features in complex data and has shown great potential for uncovering homogeneous subgroups that share common characteristics. The analytical tool combines and extends existing software implementations of the Mapper algorithm and offers several unique strengths: the integration of prior knowledge to inform the clustering process, the restriction of the cluster search to significant topological features, the use of the multivariable machine learning method XGBoost to describe cluster composition, and the ability to incorporate mixed data types. Details of the methodological aspects and implementation, together with an application clustering patients with major depression according to their chances of remission, are published in this paper (2021).
Two videos introducing TDA and explaining the tool are available on our BRC Prediction Modelling Presentation page and on YouTube: Introduction to TDA and Mapper pipeline presentation.
The pipeline can be downloaded at: https://github.com/kcl-bhi/mapper-pipeline
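For readers new to Mapper, the minimal sketch below illustrates the algorithm's core ingredients: a distance matrix, a filter (or "lens") function, an overlapping cover of the filter range, and clustering within each interval. It uses the third-party TDAmapper R package on toy data; it is a conceptual illustration only, not our pipeline, whose additional features are described above.

```r
# Minimal Mapper sketch on toy data, using the third-party TDAmapper
# package (install.packages("TDAmapper")). Illustrates the algorithm
# only; it is not the pipeline linked above.
library(TDAmapper)

set.seed(1)
theta <- runif(200, 0, 2 * pi)
toy <- data.frame(x = cos(theta) + rnorm(200, sd = 0.05),
                  y = sin(theta) + rnorm(200, sd = 0.05))

# Mapper: cover the range of the filter values with overlapping
# intervals, cluster the observations falling in each interval, and
# link clusters that share observations to form the Mapper graph.
m <- mapper1D(distance_matrix = dist(toy),
              filter_values   = toy$x,   # filter ("lens"): first coordinate
              num_intervals   = 10,
              percent_overlap = 50,
              num_bins_when_clustering = 10)

m$num_vertices           # number of nodes (candidate subgroups)
m$points_in_vertex[[1]]  # observations assigned to the first node
```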
“dCVnet”: a user-friendly tool to develop regularized regression prediction models
Dr Andrew Lawrence developed “dCVnet”, a software tool (an R wrapper for the glmnet package) that implements regularized logistic regression with double (nested) cross-validation for internal validation, and has made this easy-to-use tool available to the scientific and clinical community as an R package.
In contrast to traditional statistical methods, regularized regression allows the analysis of a large number of predictors relative to the sample size. Regularization reduces overfitting by constraining the magnitude of the regression coefficients through the introduction of a penalty. dCVnet provides a documented and standardized implementation of this particular machine learning pipeline, making it accessible to researchers who lack the programming experience required for more general machine learning software environments. Details about the methodology and an application predicting recurrence of depression are published in Lawrence, A., Stahl, D. et al. (2022).
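As a brief illustration of the penalty idea (standard glmnet notation, not reproduced from the paper): the elastic-net objective adds a shrinkage term to the logistic log-likelihood, where λ controls the overall penalty strength and α ∈ [0, 1] mixes the lasso (L1) and ridge (L2) components.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[ y_i(\beta_0 + x_i^{\top}\beta)
    - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
  + \lambda\Big( \alpha \lVert \beta \rVert_1
    + \tfrac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \Big)
```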
A video explaining the tool is available on our BRC Prediction Modelling Presentation page and on YouTube.
The package can be downloaded at: github.com/AndrewLawrence/dCVnet
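To make double (nested) cross-validation concrete, here is a minimal sketch using glmnet directly on simulated data. dCVnet automates and extends this procedure and handles many practical details the sketch omits, so consult the package documentation for actual usage.

```r
# Sketch of double (nested) cross-validation with glmnet on simulated
# data. dCVnet automates and extends this; see its documentation for
# real usage. Requires the glmnet package.
library(glmnet)

set.seed(42)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))  # outcome driven by 2 predictors

# Rank-based (Mann-Whitney) AUC, to keep the sketch self-contained
auc <- function(y, p) {
  r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

k_outer <- 5
fold <- sample(rep(1:k_outer, length.out = n))
auc_outer <- numeric(k_outer)

for (k in 1:k_outer) {
  train <- fold != k
  # Inner CV: tune the penalty strength lambda on the training folds only
  inner <- cv.glmnet(x[train, ], y[train], family = "binomial",
                     alpha = 1)                  # alpha = 1: lasso penalty
  # Outer CV: evaluate the tuned model on the untouched held-out fold
  pred <- predict(inner, newx = x[!train, ],
                  s = "lambda.min", type = "response")
  auc_outer[k] <- auc(y[!train], as.vector(pred))
}

mean(auc_outer)  # internally validated estimate of discrimination
```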
“survcompare”: do I need a simple Cox Proportional Hazards model or a more flexible (but less transparent) machine learning method? An R package to investigate the complexity of survival data.
The primary goal of the package is to help researchers make an informed decision about whether to choose a flexible yet less transparent machine learning approach or a traditional linear method.
The package performs repeated nested cross-validation to estimate the predictive performance of the Cox Proportional Hazards (CoxPH) model (or its LASSO-regularised extension) and of the Survival Random Forest, and tests whether the ensemble method outperforms the baseline Cox model. If it does not, the result justifies using the CoxPH model and indicates a negligible advantage from a more flexible model such as the Survival Random Forest. If it does, the researcher can 1) opt for the more complex model, 2) look for interaction and non-linear terms that could be added to the baseline Cox model and re-run the test, or 3) still use the Cox model if the difference is small in the context of the task at hand, or not large enough to justify sacrificing model interpretability.
The package was developed by Dr Diana Shamsutdinova and Professor Daniel Stahl, based on a collaboration with Dr Daniel Stamate and Dr Angus Roberts (see the Conference paper). It can be installed in R with install.packages("survcompare") or downloaded from github.com/dianashams/survcompare.
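As a rough illustration of the kind of comparison the package automates, the sketch below fits a Cox model and a Survival Random Forest on a training split and compares Harrell's concordance on held-out data. It assumes the survival and randomForestSRC packages and a single split; survcompare itself performs repeated nested cross-validation and a formal test of outperformance, so refer to its documentation for actual usage.

```r
# Rough illustration of the Cox-vs-Survival-Random-Forest comparison
# that survcompare automates (the package uses repeated nested
# cross-validation; a single train/test split is shown for brevity).
# Requires the survival and randomForestSRC packages.
library(survival)
library(randomForestSRC)

set.seed(7)
data(pbc, package = "survival")
d <- na.omit(pbc[, c("time", "status", "age", "bili", "albumin", "protime")])
d$status <- as.integer(d$status == 2)    # 1 = death, 0 = censored

idx   <- sample(nrow(d), 0.7 * nrow(d))
train <- d[idx, ];  test <- d[-idx, ]

# Baseline: Cox Proportional Hazards model
cox <- coxph(Surv(time, status) ~ age + bili + albumin + protime,
             data = train)
risk_cox <- predict(cox, newdata = test, type = "lp")  # higher = riskier

# Flexible alternative: Survival Random Forest
srf <- rfsrc(Surv(time, status) ~ ., data = train)
risk_srf <- predict(srf, newdata = test)$predicted     # predicted mortality

# Harrell's concordance on the held-out data (higher is better)
c_cox <- concordance(Surv(time, status) ~ risk_cox, data = test,
                     reverse = TRUE)$concordance
c_srf <- concordance(Surv(time, status) ~ risk_srf, data = test,
                     reverse = TRUE)$concordance
c(cox = c_cox, srf = c_srf)  # similar values favour keeping the Cox model
```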