Pipelines and tools to analyse big data sets using machine learning methods
Big datasets in healthcare have a complex structure and distinctive characteristics. We develop open tools and pipelines based on modern machine learning and prediction modelling methods to facilitate their analysis.
A pipeline based on topological machine learning to identify homogeneous patient subgroups and relevant features
Dr Raquel Iniesta and Dr Ewan Carr developed a novel pipeline built on recent advances in topological data analysis (TDA) to identify homogeneous clusters of patients with respect to a characteristic of interest. The pipeline focuses on Mapper, a clustering algorithm that identifies topological features in complex data and has shown great potential for uncovering homogeneous subgroups that share common characteristics. TDA is a growing field, increasingly adopted in recent years, that provides tools to infer, analyse, and exploit the shape of data. It holds particular promise for advancing precision medicine, where we often want to identify groups of patients with similar treatment or prognostic outcomes.
The analytical tool combines and extends existing software implementations of the Mapper algorithm to provide several unique strengths, such as the integration of prior knowledge to inform the clustering process, the restriction of the cluster search to significant topological features, the use of the multivariable machine learning method XGBoost to describe cluster composition, and the ability to incorporate mixed data types. Details of the methodology and implementation, together with an application to clustering patients with major depression according to their chance of remission, are published in this paper (2021).
The pipeline can be downloaded at: https://github.com/kcl-bhi/mapper-pipeline
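For readers unfamiliar with Mapper, the following is a minimal sketch of the basic algorithm in plain Python. It is not the code of the pipeline above and omits its extensions (prior knowledge, significance filtering, XGBoost, mixed data types). The function names, the toy data, the single-linkage clustering rule, and all parameter values are illustrative assumptions. The sketch projects the data through a filter function (the "lens"), covers the lens values with overlapping intervals, clusters the points inside each interval, and links clusters that share points.

```python
from itertools import combinations
from math import dist


def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals overlapping by the given fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(indices, points, eps):
    """Group indices into connected components of the 'closer than eps' relation."""
    clusters, unseen = [], set(indices)
    while unseen:
        stack = [unseen.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unseen if dist(points[i], points[j]) <= eps}
            unseen -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, lens, n_intervals, overlap, eps, tol=1e-9):
    """Return (nodes, edges): nodes are clusters of point indices, edges join overlapping clusters."""
    values = [lens(p) for p in points]
    nodes = []
    for a, b in interval_cover(min(values), max(values), n_intervals, overlap):
        members = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        nodes.extend(single_linkage(members, points, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2) if nodes[u] & nodes[v]]
    return nodes, edges


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(float(x), 0.0) for x in range(6)] + [(float(x), 1.0) for x in range(8, 13)]

# Lens = first coordinate; three intervals with 50% overlap; neighbours within 1.5 units.
nodes, edges = mapper_graph(points, lens=lambda p: p[0], n_intervals=3, overlap=0.5, eps=1.5)
print(f"{len(nodes)} nodes, {len(edges)} edges")  # prints "4 nodes, 2 edges"
```

In this toy run the graph falls into two disconnected pieces, one per group of points, which is the sense in which Mapper can surface homogeneous subgroups. In practice the choice of lens, cover, and clustering method strongly affects the result.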
“dCVnet”: a user-friendly tool to develop regularized regression prediction models
Dr Andrew Lawrence developed “dCVnet”, an R wrapper for the glmnet package that implements regularized logistic regression with double (nested) cross-validation for internal validation, and made this easy-to-use tool available to the scientific and clinical community as an R package.
In contrast to traditional statistical methods, regularized regression allows the analysis of a large number of predictors relative to sample size. Regularization reduces overfitting by adding a penalty that constrains the magnitude of the regression coefficients. dCVnet provides a documented and standardized implementation of this particular machine learning pipeline, making it accessible to researchers who lack the programming experience required for more general machine learning software environments. Details of the methodology and an application to predicting the recurrence of depression are published in Lawrence, A., Stahl, D. et al. (2022).
The toolbox can be downloaded at: github.com/AndrewLawrence/dCVnet
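To illustrate the idea of penalized regression with double (nested) cross-validation, here is a minimal sketch in plain Python. It is not dCVnet and does not use glmnet: it swaps in a ridge (L2) penalty fitted by simple gradient descent, uses accuracy as the selection criterion, and runs on simulated data. All function names, the penalty grid, and the fold counts are assumptions made for illustration. The inner loop chooses the penalty strength using training rows only, and the outer loop estimates how well the whole procedure performs on rows it never saw.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(X, y, lam, steps=100, lr=0.5):
    """Gradient descent on mean log-loss + (lam / 2) * sum(w_j ** 2); intercept unpenalized."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def accuracy(model, X, y):
    """Share of rows whose predicted class (linear score > 0) matches the label."""
    w, b = model
    hits = sum((b + sum(wj * xj for wj, xj in zip(w, xi)) > 0) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(y)


def folds(n, k):
    """Partition row indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def split(X, y, held_out):
    """Return (X_train, y_train, X_test, y_test) for the held-out row indices."""
    held = set(held_out)
    train = [i for i in range(len(X)) if i not in held]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in held_out], [y[i] for i in held_out])


def cv_accuracy(X, y, lam, k):
    """Mean held-out accuracy of one penalty value across k folds."""
    scores = []
    for held_out in folds(len(X), k):
        X_tr, y_tr, X_te, y_te = split(X, y, held_out)
        scores.append(accuracy(fit_ridge_logistic(X_tr, y_tr, lam), X_te, y_te))
    return sum(scores) / k


def nested_cv(X, y, lambdas, outer_k=5, inner_k=3):
    """Outer loop estimates performance; inner loop tunes the penalty on training rows only."""
    outer_scores = []
    for held_out in folds(len(X), outer_k):
        X_tr, y_tr, X_te, y_te = split(X, y, held_out)
        best_lam = max(lambdas, key=lambda lam: cv_accuracy(X_tr, y_tr, lam, inner_k))
        outer_scores.append(accuracy(fit_ridge_logistic(X_tr, y_tr, best_lam), X_te, y_te))
    return outer_scores


# Simulated data (hypothetical): 40 rows, 5 predictors, only the first two carry signal.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [1 if rng.random() < sigmoid(2 * xi[0] - 2 * xi[1]) else 0 for xi in X]

outer_scores = nested_cv(X, y, lambdas=[0.01, 0.1, 1.0])
print("outer-fold accuracy:", [round(s, 2) for s in outer_scores])
```

Because the penalty is tuned inside each outer training set, the outer-fold scores are not contaminated by the tuning step, which is the point of the nested design. Fitting the same data with a larger `lam` yields smaller coefficients, which is the shrinkage effect described above.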