tdarec

Modular, interoperable, and extensible topological data analysis in R

Aim

This project was born from a collaboration between:

  • Jason Cory Brunson, Research Assistant Professor at University of Florida, Laboratory for Systems Medicine, Division of Pulmonary, Critical Care and Sleep Medicine; and,
  • Aymeric Stamm, Research Engineer in Statistics at the French National Centre for Scientific Research (CNRS), Nantes University.

It was funded by the R Consortium through an ISC grant titled "Modular, interoperable, and extensible topological data analysis in R", starting in early 2024.

The goal of the project is to seamlessly integrate popular techniques from topological data analysis (TDA) into common statistical workflows in R. The expected benefit is that these extensions will be more widely used by non-specialist researchers and analysts, which will create sufficient awareness and interest in the community to extend the individual packages and the collection.

Rationale

Topological data analysis (TDA) is a rapidly growing field that uses techniques from algebraic topology to analyze the shape and structure of data. At its core, TDA provides tools to understand the geometric and topological features of datasets across multiple scales, with persistent homology (PH) being one of its fundamental techniques.

Several R packages have emerged to provide TDA capabilities to the R community, including:

  • {TDA} which focuses on statistical analysis of PH and density clustering by providing an R interface for the efficient algorithms of the C++ libraries GUDHI, Dionysus and PHAT. This package also implements methods from Fasy et al. (2014) and Chazal et al. (2014) for analyzing the statistical significance of PH features.
  • {TDAstats} which provides a comprehensive toolset for conducting TDA, specifically via the calculation of PH in a Vietoris-Rips complex (Wadhwa et al. 2018).
  • {ripserr} which provides an R interface to the Ripser and Cubical Ripser C++ libraries (Bauer 2021b; Kaji et al. 2020a).
  • {TDAkit} which provides a variety of algorithms to learn with PH of the data based on functional summaries for clustering, hypothesis testing, and visualization (Wasserman 2018).
  • Other packages that have been archived due to lack of maintenance (e.g. {kernelTDA}).

While these packages have made TDA more accessible, they’ve also introduced different data structures for representing persistence data, creating challenges for interoperability and workflow consistency. Moreover, workflows using tools from different packages will rely on some of the same low-level operations, like computing PH and calculating distances between persistence diagrams. When each package is built for purpose, this will lead to either duplication or cascading dependencies.

The ultimate goal of this project is to create a collection of interoperable and extensible R packages for TDA that can be easily integrated into common statistical workflows. By providing a unified toolbox for handling persistence data and integrating TDA techniques into the Tidymodels framework, we aim to make TDA more accessible and usable for a wider range of researchers and analysts in R.

The idea of a TDAverse originated in a collaboration between Jason Cory Brunson, Raoul Wadhwa, Matt Piekenbrock and James Otto. This R Consortium ISC grant led to the development and release of four packages, whose scopes are briefly discussed below. The R Consortium hosted a concluding webinar about the project that Cory and I co-presented. The recording is available on YouTube.

Released packages

{phutil} [1]

The {phutil} package addresses such fragmentation by providing a unified toolbox for handling persistence data. It offers consistent data structures and methods that work seamlessly with outputs from various TDA packages. As part of the TDAverse initiative, {phutil} contributes to creating a coherent ecosystem for topological data analysis in R. Currently, it also includes functions to compute distances between persistence diagrams.

{ripserr}

This is a lightweight wrapper around Ulrich Bauer’s Ripser library (Bauer 2021a), still perhaps the fastest implementation of PH using the Vietoris–Rips (VR) filtration. {ripserr} uses {Rcpp} to bind this C++ engine, as well as the cubical PH engine Cubical Ripser employing the same design principles (Kaji et al. 2020b), to R.

{inphr} [2]

The goal of {inphr} is to compare populations of persistence diagrams computed from data sets of different types, that is, to make statistical inference on the basis of collected data. Several R packages include inferential capabilities for persistence diagrams, including:

  • {TDA} which focuses on statistical analysis of PH and density clustering by providing an R interface for the efficient algorithms of the C++ libraries GUDHI, Dionysus and PHAT. This package also implements methods from Fasy et al. (2014) and Chazal et al. (2014) for analyzing the statistical significance of PH features.
  • {TDAstats} which provides a comprehensive toolset for conducting TDA, specifically via the calculation of PH in a Vietoris-Rips complex (Wadhwa et al. 2018).
  • {TDAkit} which provides a variety of algorithms to learn with PH of the data based on functional summaries for clustering, hypothesis testing, and visualization (Wasserman 2018).

While these packages have made inference on persistence diagrams available to the R community, each covers only one setting: {TDA} handles inference on a single diagram using bootstrap resampling, {TDAstats} compares two diagrams by permutation testing, and {TDAkit} compares functional summaries of groups of persistence diagrams to test whether they come from the same underlying distribution.

The {inphr} package aims to go one step further by offering two sets of functions for making inference:

  1. in the space of persistence diagrams for testing whether multiple collections of persistence diagrams come from the same underlying distribution;
  2. in functional spaces for localizing differences between multiple collections of persistence diagrams on the domain of some functional summary of them.
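To make the first kind of inference concrete, here is a minimal base R sketch of a permutation test that works only from pairwise distances between diagrams. It is not the {inphr} API: the function name `perm_test`, the test statistic (mean between-group distance minus mean within-group distance), and the toy data are all assumptions chosen for illustration. In particular, scalar values stand in for persistence diagrams, and their absolute differences stand in for a diagram distance such as bottleneck or Wasserstein.

```r
# Permutation p-value for the null hypothesis that all groups share one
# distribution, given a symmetric matrix `D` of pairwise distances and a
# vector of group labels. Hypothetical helper, not part of any package.
perm_test <- function(D, groups, n_perm = 999) {
  pairs <- upper.tri(D)  # visit each unordered pair of diagrams once
  stat <- function(g) {
    same <- outer(g, g, "==")
    # Mean between-group distance minus mean within-group distance.
    mean(D[pairs & !same]) - mean(D[pairs & same])
  }
  observed <- stat(groups)
  # Shuffle the labels to simulate the statistic under the null hypothesis.
  permuted <- replicate(n_perm, stat(sample(groups)))
  (1 + sum(permuted >= observed)) / (n_perm + 1)
}

# Stand-in data (assumption): scalars replace diagrams, and absolute
# differences replace a distance between persistence diagrams.
set.seed(1)
x <- c(0, 0.1, 0.2, 0.3, 5, 5.1, 5.2, 5.3)
D <- as.matrix(dist(x))
groups <- rep(c("a", "b"), each = 4)
p_value <- perm_test(D, groups)
```

Because the statistic depends on the data only through `D`, the same sketch applies to any collection of objects for which a distance can be computed, which is what makes the approach suitable for the space of persistence diagrams.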

{tdarec} [3]

Topological data analysis (TDA) is a pretty mature discipline at this point, but in the last several years its assimilation into machine learning (ML) has really taken off. Based on our experience, the plurality of experimental TDA tools are written in Python, and naturally Python is home to most of these applications.

That’s not to say that there are no R packages for TDA-ML. {TDAkit}, {TDApplied}, and others provide tools for specific self-contained analyses and could be used together in larger projects. As with the broader R ecosystem, though, their integration can require some additional work. The combination of several low-level libraries and compounding package dependencies has also made this toolkit fragile, with several packages temporarily or permanently archived.

Meanwhile, the Tidymodels package collection has enabled a new generation of users to build familiarity and proficiency with conventional ML. By harmonizing syntax and smoothing pipelines, Tidymodels makes it quick and easy to adapt usable code to new data types, pre-processing steps, and model families. By using wrappers and extractors, it also allows seasoned users to extend their work beyond its sphere of convenience.

We therefore think that Tidymodels is an ideal starting point for a more sustained and interoperable collection for TDA-ML in R. Since much of the role of TDA in ML has been to extract and vectorize features from spatial, image, and other high-dimensional data, we present an extension to {recipes} for just this purpose. Assembling a comprehensive, general-purpose toolkit is a long-term project. Our contribution is meant to spur that project on.

The {tdarec} package, a {recipes} + {dials} extension, provides a collection of recipes for extracting and vectorizing features from persistence diagrams, which can then be used in any Tidymodels workflow. The package is designed to be modular and extensible, allowing users to easily add new feature extraction methods as they are developed in the TDA community. The package currently includes recipes for several popular feature vectorizations, such as persistence landscapes, persistence images, and persistence silhouettes, among others. The support for vectorization methods is built on top of the {TDAvec} package, which provides efficient implementations of these methods.
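As an illustration of what vectorizing a persistence diagram means, the following base R sketch evaluates a persistence landscape on a fixed grid and flattens it into a feature vector. It does not use {tdarec} or {TDAvec}; the function name `landscape`, its arguments, and the toy diagram are assumptions made for this example, and the packages' own implementations should be preferred in practice.

```r
# Toy persistence diagram for one homological degree (hypothetical values):
# one row per feature, with its birth and death filtration values.
pd <- cbind(birth = c(0.0, 0.2, 0.5), death = c(1.0, 0.6, 0.9))

# Evaluate the first `k_max` landscape levels on the grid `t_grid`.
# Returns a k_max x length(t_grid) matrix.
landscape <- function(pd, t_grid, k_max = 2) {
  # Tent function of each feature at each grid point:
  # max(0, min(t - birth, death - t)).
  tents <- vapply(
    t_grid,
    function(t) pmax(0, pmin(t - pd[, "birth"], pd[, "death"] - t)),
    numeric(nrow(pd))
  )
  tents <- matrix(tents, nrow = nrow(pd))
  # Level k at a grid point is the k-th largest tent value there;
  # levels beyond the number of features are zero.
  levels <- apply(tents, 2, function(v) {
    c(sort(v, decreasing = TRUE), rep(0, k_max))[seq_len(k_max)]
  })
  matrix(levels, nrow = k_max)
}

t_grid <- c(0.25, 0.5, 0.75)
lambda <- landscape(pd, t_grid)
# Flatten level by level into one fixed-length feature vector per diagram.
features <- as.vector(t(lambda))
```

Every diagram evaluated on the same grid yields a vector of the same length, so the result can be used as ordinary numeric predictors in a modeling workflow.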

References

Abramowicz, Konrad, Alessia Pini, Lina Schelin, Sara Sjöstedt de Luna, Aymeric Stamm, and Simone Vantini. 2023. “Domain Selection and Familywise Error Rate for Functional Data: A Unified Framework.” Biometrics 79 (2): 1119–32.
Ali, Dashti, Aras Asaad, Maria-Jose Jimenez, Vidit Nanda, Eduardo Paluzo-Hidalgo, and Manuel Soriano-Trigueros. 2022. A Survey of Vectorization Methods in Topological Data Analysis. arXiv:2212.09703. arXiv. https://doi.org/10.48550/arXiv.2212.09703.
Atienza, Nieves, Rocío González-Díaz, and Manuel Soriano-Trigueros. 2020. “On the Stability of Persistent Entropy and New Summary Functions for Topological Data Analysis.” Pattern Recognition 107: 107509.
Bauer, Ulrich. 2021a. “Ripser: Efficient Computation of Vietoris–Rips Persistence Barcodes.” Journal of Applied and Computational Topology 5 (3): 391–423. https://doi.org/10.1007/s41468-021-00071-5.
Bauer, Ulrich. 2021b. “Ripser: Efficient Computation of Vietoris–Rips Persistence Barcodes.” Journal of Applied and Computational Topology 5 (3): 391–423.
Bauer, Ulrich, Michael Kerber, Jan Reininghaus, and Hubert Wagner. 2017. “Phat – Persistent Homology Algorithms Toolbox.” Journal of Symbolic Computation, Algorithms and software for computational topology, vol. 78 (January): 76–90. https://doi.org/10.1016/j.jsc.2016.03.008.
Bauer, Ulrich, Talha Bin Masood, Barbara Giunti, Guillaume Houry, Michael Kerber, and Abhishek Rathod. 2022. Keeping It Sparse: Computing Persistent Homology Revised. arXiv:2211.09075. arXiv. https://doi.org/10.48550/arXiv.2211.09075.
Brown, Shael, and Reza Farivar. 2024. TDApplied: Machine Learning and Inference for Topological Data Analysis. https://cran.r-project.org/package=TDApplied.
Bubenik, Peter, Jonathan Scott, and Donald Stanley. 2023. “Exact Weights, Path Metrics, and Algebraic Wasserstein Distances.” Journal of Applied and Computational Topology 7 (2): 185–219.
Chachólski, Wojciech, Barbara Giunti, Alvin Jin, and Claudia Landi. 2023. “Decomposing Filtered Chain Complexes: Geometry Behind Barcoding Algorithms.” Computational Geometry 109 (February): 101938. https://doi.org/10.1016/j.comgeo.2022.101938.
Chan, Kit C., Umar Islambekov, Alexey Luchinsky, and Rebecca Sanders. 2022. “A Computationally Efficient Framework for Vector Representation of Persistence Diagrams.” Journal of Machine Learning Research 23 (268): 1–33. http://jmlr.org/papers/v23/21-1129.html.
Chazal, Frédéric, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, and Larry Wasserman. 2014. “Stochastic Convergence of Persistence Landscapes and Silhouettes.” Proceedings of the Thirtieth Annual Symposium on Computational Geometry, 474–83.
Chazal, Frédéric, and Bertrand Michel. 2021. “An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists.” Frontiers in Artificial Intelligence 4: 667963.
Chen, Chao, and Michael Kerber. 2011. “Persistent Homology Computation with a Twist.” 27th European Workshop on Computational Geometry (EuroCG 2011) (Morschach, Switzerland), March, 197–200.
Chung, Yu-Min, and Austin Lawson. 2022. “Persistence Curves: A Canonical Framework for Summarizing Persistence Diagrams.” Advances in Computational Mathematics 48 (1): 6.
Cohen-Steiner, David, Herbert Edelsbrunner, John Harer, and Yuriy Mileyko. 2010. “Lipschitz Functions Have Lp-Stable Persistence.” Foundations of Computational Mathematics 10 (2): 127–39.
Čufar, Matija, and Žiga Virk. 2023. Fast Computation of Persistent Homology Representatives with Involuted Persistent Homology. arXiv:2105.03629. arXiv. https://doi.org/10.48550/arXiv.2105.03629.
Čufar, Matija, and Žiga Virk. 2023. “Fast Computation of Persistent Homology Representatives with Involuted Persistent Homology.” Foundations of Data Science 5 (4): 466–79. https://doi.org/10.3934/fods.2023006.
de Silva, Vin, Dmitriy Morozov, and Mikael Vejdemo-Johansson. 2011. “Dualities in Persistent (Co)homology.” Inverse Problems 27 (12): 124003. https://doi.org/10.1088/0266-5611/27/12/124003.
Edelsbrunner, Herbert, David Letscher, and Afra Zomorodian. 2002. “Topological Persistence and Simplification.” Discrete Comput Geom 28 (4): 511–33. https://doi.org/10.1007/s00454-002-2885-2.
Fasy, Brittany Terese, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, and Aarti Singh. 2014. Confidence Sets for Persistence Diagrams.
Fasy, Brittany T., Jisu Kim, Fabrizio Lecci, Clement Maria, David L. Millman, and Vincent Rouvreau. 2022. TDA: Statistical Tools for Topological Data Analysis. https://CRAN.R-project.org/package=TDA.
Kaji, Shizuo, Takeki Sudo, and Kazushi Ahara. 2020a. “Cubical Ripser: Software for Computing Persistent Homology of Image and Volume Data.” arXiv Preprint arXiv:2005.12692.
Kaji, Shizuo, Takeki Sudo, and Kazushi Ahara. 2020b. Cubical Ripser: Software for Computing Persistent Homology of Image and Volume Data. arXiv:2005.12692. arXiv. https://doi.org/10.48550/arXiv.2005.12692.
Kališnik, Sara. 2019. “Tropical Coordinates on the Space of Persistence Barcodes.” Foundations of Computational Mathematics 19 (1): 101–29. https://doi.org/10.1007/s10208-018-9379-y.
Li, Lu, Connor Thompson, Gregory Henselman-Petrusek, Chad Giusti, and Lori Ziegelmeier. 2021. “Minimal Cycle Representatives in Persistent Homology Using Linear Programming: An Empirical Study with User’s Guide.” Frontiers in Artificial Intelligence 4.
Lovato, Ilenia, Alessia Pini, Aymeric Stamm, Maxime Taquet, and Simone Vantini. 2021. “Multiscale Null Hypothesis Testing for Network-Valued Data: Analysis of Brain Networks of Patients with Autism.” Journal of the Royal Statistical Society Series C: Applied Statistics 70 (2): 372–97.
Luchinsky, Aleksei, and Umar Islambekov. 2025. TDAvec: Computing Vector Summaries of Persistence Diagrams for Topological Data Analysis in R and Python. arXiv:2411.17340. arXiv. https://doi.org/10.48550/arXiv.2411.17340.
Mémoli, Facundo, and Kritika Singhal. 2019. “A Primer on Persistent Homology of Finite Metric Spaces.” Bull Math Biol 81 (7): 2074–116. https://doi.org/10.1007/s11538-019-00614-z.
Perea, Jose A., and John Harer. 2015. “Sliding Windows and Persistence: An Application of Topological Methods to Signal Analysis.” Found Comput Math 15 (3): 799–838. https://doi.org/10.1007/s10208-014-9206-z.
Pesarin, Fortunato, and Luigi Salmaso. 2010. Permutation Tests for Complex Data: Theory, Applications and Software. John Wiley & Sons.
Pham, Tuyen, and Hubert Wagner. 2023. “Computing Representatives of Persistent Homology Generators with a Double Twist.” Proceedings of the 35th Canadian Conference on Computational Geometry (CCCG 2023) (Montreal, QC, Canada), August, 283–90.
Pham, Tuyen, and Hubert Wagner. 2024. Computing Representatives of Persistent Homology Generators with a Double Twist. arXiv:2403.04100. arXiv. https://doi.org/10.48550/arXiv.2403.04100.
Pini, Alessia, and Simone Vantini. 2017. “Interval-Wise Testing for Functional Data.” Journal of Nonparametric Statistics 29 (2): 407–24.
Ravishanker, Nalini, and Renjie Chen. 2021. “An Introduction to Persistent Homology for Time Series.” WIREs Computational Statistics 13 (3): e1548. https://doi.org/10.1002/wics.1548.
Richardson, Eitan, and Michael Werman. 2014. “Efficient Classification Using the Euler Characteristic.” Pattern Recognition Letters 49: 99–106.
Stamm, Aymeric. 2023. rgudhi: An Interface to the GUDHI Library for Topological Data Analysis. https://lmjl-alea.github.io/rgudhi/.
Umeda, Yuhei. 2017. “Time Series Classification via Topological Data Analysis.” Transactions of the Japanese Society for Artificial Intelligence 32 (3): D–G72_1. https://doi.org/10.1527/tjsai.D-G72.
Virk, Žiga. 2022. Introduction to Persistent Homology. Založba UL FRI, Ljubljana. https://doi.org/10.51939/0002.
Wadhwa, Raoul R, Drew FK Williamson, Andrew Dhawan, and Jacob G Scott. 2018. “TDAstats: R Pipeline for Computing Persistent Homology in Topological Data Analysis.” Journal of Open Source Software 3 (28): 860.
Wadhwa, Raoul, Andrew Dhawan, Drew Williamson, Jacob Scott, Jason Cory Brunson, and Shota Ochi. 2019. TDAstats: Pipeline for Topological Data Analysis.
Wadhwa, Raoul, Matt Piekenbrock, Jacob Scott, Jason Cory Brunson, Emily Noble, and Xinyi Zhang. 2025. ripserr: Calculate Persistent Homology with Ripser-Based Engines. https://cran.r-project.org/package=ripserr.
Wasserman, Larry. 2018. “Topological Data Analysis.” Annual Review of Statistics and Its Application 5 (2018): 501–32.
Zhang, Simon, Mengbai Xiao, Chengxin Guo, Liang Geng, Hao Wang, and Xiaodong Zhang. 2019. “HYPHA: A Framework Based on Separation of Parallelisms to Accelerate Persistent Homology Matrix Reduction.” Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA), ICS ’19, June, 69–81. https://doi.org/10.1145/3330345.3332147.
Zhang, Simon, Mengbai Xiao, and Hao Wang. 2020. “GPU-Accelerated Computation of Vietoris-Rips Persistence Barcodes.” 36th International Symposium on Computational Geometry (SoCG 2020). https://doi.org/10.4230/LIPIcs.SoCG.2020.70.
Zomorodian, Afra, and Gunnar Carlsson. 2005. “Computing Persistent Homology.” Discrete Comput Geom 33 (2): 249–74. https://doi.org/10.1007/s00454-004-1146-y.

Footnotes

  1. Dedicated blog post: https://r-consortium.org/posts/unifying-toolbox-for-handling-persistence-data/
  2. Dedicated blog post: https://r-consortium.org/posts/statistical-inference-for-persistence-diagrams/
  3. Dedicated blog post: https://r-consortium.org/posts/tidy-topological-machine-learning-with-tdavec-and-tdarec/