tdarec
Modular, interoperable, and extensible topological data analysis in R
Aim
This project was born from a collaboration between:
- Jason Cory Brunson, Research Assistant Professor at University of Florida, Laboratory for Systems Medicine, Division of Pulmonary, Critical Care and Sleep Medicine; and,
- Aymeric Stamm, Research Engineer in Statistics at the French National Centre for Scientific Research (CNRS), Nantes University.
It was funded by the R Consortium through an ISC grant titled "Modular, interoperable, and extensible topological data analysis in R", starting in early 2024.
The goal of the project is to seamlessly integrate popular techniques from topological data analysis (TDA) into common statistical workflows in R. The expected benefit is that these extensions will be more widely used by non-specialist researchers and analysts, which will create sufficient awareness and interest in the community to extend the individual packages and the collection.
Rationale
Topological data analysis (TDA) is a rapidly growing field that uses techniques from algebraic topology to analyze the shape and structure of data. At its core, TDA provides tools to understand the geometric and topological features of datasets across multiple scales, with persistent homology (PH) being one of its fundamental techniques.
Several R packages have emerged to provide TDA capabilities to the R community, including:
- {TDA} which focuses on statistical analysis of PH and density clustering by providing an R interface for the efficient algorithms of the C++ libraries GUDHI, Dionysus and PHAT. This package also implements methods from Fasy et al. (2014) and Chazal et al. (2014) for analyzing the statistical significance of PH features.
- {TDAstats} which provides a comprehensive toolset for conducting TDA, specifically via the calculation of PH in a Vietoris-Rips complex (Wadhwa et al. 2018).
- {ripserr} which provides an R interface to the Ripser and Cubical Ripser C++ libraries (Bauer 2021b; Kaji et al. 2020a).
- {TDAkit} which provides a variety of algorithms to learn with PH of the data based on functional summaries for clustering, hypothesis testing, and visualization (Wasserman 2018).
- Other packages that have been archived due to lack of maintenance (e.g. {kernelTDA}).
While these packages have made TDA more accessible, they’ve also introduced different data structures for representing persistence data, creating challenges for interoperability and workflow consistency. Moreover, workflows using tools from different packages will rely on some of the same low-level operations, like computing PH and calculating distances between persistence diagrams. When each package is built for purpose, this will lead to either duplication or cascading dependencies.
The ultimate goal of this project is to create a collection of interoperable and extensible R packages for TDA that can be easily integrated into common statistical workflows. By providing a unified toolbox for handling persistence data and integrating TDA techniques into the Tidymodels framework, we aim to make TDA more accessible and usable for a wider range of researchers and analysts in R.
The idea of a TDAverse originated in a collaboration between Jason Cory Brunson, Raoul Wadhwa, Matt Piekenbrock and James Otto. This R Consortium ISC grant led to the development and release of four packages, whose scope is briefly discussed below. The R Consortium hosted a concluding webinar about the project, which Cory and I co-hosted. The recording is available on YouTube.
Released packages
{phutil} [1]
The {phutil} package addresses such fragmentation by providing a unified toolbox for handling persistence data. It offers consistent data structures and methods that work seamlessly with outputs from various TDA packages. As part of the TDAverse initiative, {phutil} contributes to creating a coherent ecosystem for topological data analysis in R. Currently, it also includes functions to compute distances between persistence diagrams.
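To make the idea of a common representation concrete, here is a minimal base R sketch. It assumes the three-column (dimension, birth, death) matrix layout that packages such as {TDA} and {TDAstats} return, and regroups it into one (birth, death) matrix per homological degree. The helper `split_by_degree()` is hypothetical and written only for illustration; it is not the {phutil} API, whose own classes and coercion functions should be preferred in practice.

```r
# Hypothetical helper, NOT part of {phutil}: regroup a three-column
# (dimension, birth, death) matrix into a named list holding one
# two-column (birth, death) matrix per homological degree.
split_by_degree <- function(pd3) {
  degrees <- sort(unique(pd3[, 1]))
  out <- lapply(degrees, function(d) {
    # drop = FALSE keeps a matrix even when a degree has a single feature
    pd3[pd3[, 1] == d, 2:3, drop = FALSE]
  })
  names(out) <- paste0("H", degrees)
  out
}

# Toy diagram: two degree-0 features and one degree-1 feature
pd3 <- rbind(
  c(0, 0.0, 1.5),
  c(0, 0.0, 0.7),
  c(1, 0.9, 1.2)
)
pd_list <- split_by_degree(pd3)
str(pd_list)
```

Once every diagram is stored in one agreed-upon layout like this, downstream operations such as distances or summaries only need to be written once.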
{ripserr}
This is a lightweight wrapper around Ulrich Bauer’s Ripser library (Bauer 2021a), still perhaps the fastest implementation of PH using the Vietoris–Rips (VR) filtration. {ripserr} uses {Rcpp} to bind this C++ engine, as well as the cubical PH engine Cubical Ripser employing the same design principles (Kaji et al. 2020b), to R.
{inphr} [2]
The goal of {inphr} is to compare populations of persistence diagrams, possibly arising from data sets of different types. This is the classical task of statistical inference: drawing conclusions about underlying distributions from collected data. Several R packages include inferential capabilities for persistence diagrams, including:
- {TDA} which focuses on statistical analysis of PH and density clustering by providing an R interface for the efficient algorithms of the C++ libraries GUDHI, Dionysus and PHAT. This package also implements methods from Fasy et al. (2014) and Chazal et al. (2014) for analyzing the statistical significance of PH features.
- {TDAstats} which provides a comprehensive toolset for conducting TDA, specifically via the calculation of PH in a Vietoris-Rips complex (Wadhwa et al. 2018).
- {TDAkit} which provides a variety of algorithms to learn with PH of the data based on functional summaries for clustering, hypothesis testing, and visualization (Wasserman 2018).
While these packages have made inference on persistence diagrams available to the R community, each covers only part of the problem: {TDA} handles inference on a single diagram using bootstrap resampling, {TDAstats} compares two diagrams by permutations, and {TDAkit} compares functional summaries of groups of persistence diagrams to test whether they come from the same underlying distribution.
The {inphr} package aims to go one step further by offering two sets of functions for making inference:
- in the space of persistence diagrams for testing whether multiple collections of persistence diagrams come from the same underlying distribution;
- in functional spaces for localizing differences between multiple collections of persistence diagrams on the domain of some functional summary of them.
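The first kind of test, carried out directly in the space of diagrams, can be illustrated with a generic two-sample permutation test built on pairwise distances. The sketch below is written in base R under several assumptions: the function names are hypothetical, the statistic (mean between-group distance minus mean within-group distance) is one simple choice among many, and `toy_distance()` is a crude stand-in based on total persistence, not a bottleneck or Wasserstein distance. It does not reproduce the {inphr} interface or its test statistics.

```r
# Hypothetical stand-in for a diagram distance (NOT bottleneck/Wasserstein):
# absolute difference in total persistence of two (birth, death) matrices.
total_persistence <- function(pd) sum(pd[, 2] - pd[, 1])
toy_distance <- function(pd1, pd2) {
  abs(total_persistence(pd1) - total_persistence(pd2))
}

# Two-sample permutation test on two lists of diagrams `x` and `y`.
permutation_test_sketch <- function(x, y, dist_fun = toy_distance, B = 999L) {
  pooled <- c(x, y)
  n <- length(pooled)
  nx <- length(x)
  # Pairwise distances are computed once and reused for every permutation
  D <- matrix(0, n, n)
  for (i in seq_len(n - 1L)) {
    for (j in (i + 1L):n) {
      D[i, j] <- D[j, i] <- dist_fun(pooled[[i]], pooled[[j]])
    }
  }
  # Statistic: mean between-group distance minus mean within-group distance
  stat <- function(idx) {
    within_x <- D[idx, idx, drop = FALSE]
    within_y <- D[-idx, -idx, drop = FALSE]
    within <- mean(c(within_x[upper.tri(within_x)],
                     within_y[upper.tri(within_y)]))
    mean(D[idx, -idx]) - within
  }
  observed <- stat(seq_len(nx))
  # Relabel the pooled sample at random B times
  permuted <- replicate(B, stat(sample.int(n, nx)))
  # The +1 correction keeps the p-value strictly positive
  (1 + sum(permuted >= observed)) / (B + 1)
}

# Simulated example: the second group has longer-lived features on average
set.seed(1)
make_pd <- function(scale) {
  b <- runif(5)
  cbind(birth = b, death = b + scale * runif(5))
}
x <- replicate(10, make_pd(1), simplify = FALSE)
y <- replicate(10, make_pd(3), simplify = FALSE)
p_value <- permutation_test_sketch(x, y)
p_value
```

Because the procedure only needs a distance matrix, swapping in a proper diagram distance changes nothing else in the code.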
{tdarec} [3]
Topological data analysis (TDA) is a pretty mature discipline at this point, but in the last several years its assimilation into machine learning (ML) has really taken off. Based on our experience, the plurality of experimental TDA tools are written in Python, and naturally Python is home to most of these applications.
That’s not to say that there are no R packages for TDA-ML. {TDAkit}, {TDApplied}, and others provide tools for specific self-contained analyses and could be used together in larger projects. As with the broader R ecosystem, though, their integration can require some additional work. The combination of several low-level libraries and compounding package dependencies has also made this toolkit fragile, with several packages temporarily or permanently archived.
Meanwhile, the Tidymodels package collection has enabled a new generation of users to build familiarity and proficiency with conventional ML. By harmonizing syntax and smoothing pipelines, Tidymodels makes it quick and easy to adapt usable code to new data types, pre-processing steps, and model families. By using wrappers and extractors, it also allows seasoned users to extend their work beyond its sphere of convenience.
We therefore think that Tidymodels is an ideal starting point for a more sustained and interoperable collection for TDA-ML in R. Since much of the role of TDA in ML has been to extract and vectorize features from spatial, image, and other high-dimensional data, we present an extension to {recipes} for just this purpose. Assembling a comprehensive, general-purpose toolkit is a long-term project. Our contribution is meant to spur that project on.
The {tdarec} package, a {recipes} + {dials} extension, provides a collection of recipes for extracting and vectorizing features from persistence diagrams, which can then be used in any Tidymodels workflow. The package is designed to be modular and extensible, allowing users to easily add new feature extraction methods as they are developed in the TDA community. The package currently includes recipes for several popular feature vectorizations, such as persistence landscapes, persistence images, and persistence silhouettes, among others. The support for vectorization methods is built on top of the {TDAvec} package, which provides efficient implementations of these methods.
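To give a sense of what vectorizing a persistence diagram means, here is a minimal base R sketch of the persistence landscape, one of the summaries mentioned above. Each feature (b, d) contributes a tent function max(0, min(t - b, d - t)), and the k-th landscape at t is the k-th largest tent value at t. The function `landscape_sketch()` and its arguments are hypothetical names chosen for illustration; this is not how {tdarec} or {TDAvec} implement or expose the computation.

```r
# Hypothetical helper, NOT part of {tdarec} or {TDAvec}: evaluate the first
# `k_max` persistence landscapes of one diagram on a grid of scale values.
# `pd` is a two-column numeric matrix of (birth, death) pairs.
landscape_sketch <- function(pd, grid, k_max = 2L) {
  # Rising and falling sides of each tent: rows = features, columns = grid
  up <- outer(pd[, 1], grid, function(b, t) t - b)
  down <- outer(pd[, 2], grid, function(d, t) d - t)
  tents <- matrix(pmax(0, pmin(up, down)), nrow = nrow(pd))
  # Sort tent values in decreasing order at each grid point
  sorted <- matrix(apply(tents, 2, sort, decreasing = TRUE), nrow = nrow(pd))
  # Landscapes beyond the number of features are identically zero
  out <- matrix(0, nrow = k_max, ncol = length(grid))
  k <- min(k_max, nrow(pd))
  out[seq_len(k), ] <- sorted[seq_len(k), , drop = FALSE]
  out
}

# Toy diagram with two nested features
pd <- rbind(c(0, 4), c(1, 3))
L <- landscape_sketch(pd, grid = 0:4, k_max = 2L)
L[1, ]  # first landscape:  0 1 2 1 0
L[2, ]  # second landscape: 0 0 1 0 0

# Flattening the matrix row by row gives a fixed-length feature vector
features <- as.vector(t(L))
```

The flattened vector has the same length for every diagram evaluated on the same grid, which is what allows such summaries to serve as predictor columns in a modeling workflow.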
References
Footnotes
[1] Dedicated blog post: https://r-consortium.org/posts/unifying-toolbox-for-handling-persistence-data/
[2] Dedicated blog post: https://r-consortium.org/posts/statistical-inference-for-persistence-diagrams/
[3] Dedicated blog post: https://r-consortium.org/posts/tidy-topological-machine-learning-with-tdavec-and-tdarec/