Persistent obstruction theory for a model category of measures with applications to data merging
HTML articles powered by AMS MathViewer
- by Abraham D. Smith, Paul Bendich and John Harer;
- Trans. Amer. Math. Soc. Ser. B 8 (2021), 1-38
- DOI: https://doi.org/10.1090/btran/56
- Published electronically: February 2, 2021
- HTML | PDF
Abstract:
Collections of measures on compact metric spaces form a model category (“data complexes”), whose morphisms are marginalization integrals. The fibrant objects in this category represent collections of measures in which there is a measure on a product space that marginalizes to any measures on pairs of its factors. The homotopy and homology for this category allow measurement of obstructions to finding measures on larger and larger product spaces. The obstruction theory is compatible with a fibrant filtration built from the Wasserstein distance on measures.
Despite the abstract tools, this is motivated by a widespread problem in data science. Data complexes provide a mathematical foundation for semi-automated data-alignment tools that are common in commercial database software. Practically speaking, the theory shows that database JOIN operations are subject to genuine topological obstructions. Those obstructions can be detected by an obstruction cocycle and can be resolved by moving through a filtration. Thus, any collection of databases has a persistence level, which measures the difficulty of JOINing those databases. Because of its general formulation, this persistent obstruction theory also encompasses multi-modal data fusion problems, some forms of Bayesian inference, and probability couplings.
References
- Samson Abramsky, Relational databases and Bell’s theorem, In search of elegance in the theory and practice of computation, Lecture Notes in Comput. Sci., vol. 8000, Springer, Heidelberg, 2013, pp. 13–35. MR 3124039, DOI 10.1007/978-3-642-41660-6_{2}
- Samson Abramsky, Rui Soares Barbosa, Kohei Kishida, Raymond Lal, and Shane Mansfield, Contextuality, cohomology and paradox, 24th EACSL Annual Conference on Computer Science Logic, LIPIcs. Leibniz Int. Proc. Inform., vol. 41, Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern, 2015, pp. 211–228. MR 3441764
- Torgny Lindvall, Lectures on the coupling method, Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York, 1992. A Wiley-Interscience Publication. MR 1180522
- Herbert Edelsbrunner and John L. Harer, Computational topology, American Mathematical Society, Providence, RI, 2010. An introduction. MR 2572029, DOI 10.1090/mbk/069
- P. J. Ehlers and T. Porter, Joins for (augmented) simplicial sets, J. Pure Appl. Algebra 145 (2000), no. 1, 37–44. MR 1732286, DOI 10.1016/S0022-4049(98)00065-6
- Brendan Fong and David I. Spivak, An invitation to applied category theory, Cambridge University Press, Cambridge, 2019. Seven sketches in compositionality. MR 3966447, DOI 10.1017/9781108668804
- Greg Friedman, Survey article: An elementary illustrated introduction to simplicial sets, Rocky Mountain J. Math. 42 (2012), no. 2, 353–423. MR 2915498, DOI 10.1216/RMJ-2012-42-2-353
- Eli Glasner, Ergodic theory via joinings, Mathematical Surveys and Monographs, vol. 101, American Mathematical Society, Providence, RI, 2003. MR 1958753, DOI 10.1090/surv/101
- Paul G. Goerss and John F. Jardine, Simplicial homotopy theory, Modern Birkhäuser Classics, Birkhäuser Verlag, Basel, 2009. Reprint of the 1999 edition [MR1711612]. MR 2840650, DOI 10.1007/978-3-0346-0189-4
- Daniel M. Kan, A combinatorial definition of homotopy groups, Ann. of Math. (2) 67 (1958), 282–312. MR 111032, DOI 10.2307/1970006
- Albert T. Lundell, Obstruction theory of principal fibre bundles, Trans. Amer. Math. Soc. 97 (1960), 161–192. MR 130694, DOI 10.1090/S0002-9947-1960-0130694-9
- J. Peter May, Simplicial objects in algebraic topology, Chicago Lectures in Mathematics, University of Chicago Press, Chicago, IL, 1992. Reprint of the 1967 original. MR 1206474
- Jason Morton, Contextuality from missing and versioned data, 1–21, arXiv:1708.03264 2017.
- Felix Naumann, Alexander Bilke, Jens Bleiholder, and Melanie Weis, Data fusion in three steps: resolving inconsistencies at schema-, tuple-, and value-level, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 29 (2006), no. 2, 21–31.
- Paul Olum, Obstructions to extensions and homotopies, Ann. of Math. (2) 52 (1950), 1–50. MR 36507, DOI 10.2307/1969510
- Nina Otter, Magnitude meets persistence. homology theories for filtered simplicial sets, 1–21, arXiv:1807.01540 2018.
- Gabriel Peyré and Marco Cuturi, Computational optimal transport, arXiv:1803.00567 2018.
- Daniel G. Quillen, Homotopical algebra, Lecture Notes in Mathematics, vol. 43, Springer Berlin Heidelberg, Berlin, Heidelberg, 1967, doi:10.1007/BFb0097438.
- Egbert Rijke, The join construction, (2017), arXiv:1701.07538.
- Patrick Schultz, David I. Spivak, Christina Vasilakopoulou, and Ryan Wisnesky, Algebraic databases, Theory Appl. Categ. 32 (2017), Paper No. 16, 547–619. MR 3641249
- Patrick Schultz and Ryan Wisnesky, Algebraic data integration, J. Funct. Programming 27 (2017), e24, 51. MR 3720789, DOI 10.1017/S0956796817000168
- N. E. Steenrod, Homology with local coefficients, Ann. of Math. (2) 44 (1943), 610–627. MR 9114, DOI 10.2307/1969099
- Norman Steenrod, The Topology of Fibre Bundles, Princeton Mathematical Series, vol. 14, Princeton University Press, Princeton, NJ, 1951. MR 39258, DOI 10.1515/9781400883875
- Phillip B. Thurber, Semi-localization of a one pointed Kan complex, Pacific J. Math. 178 (1997), no. 1, 147–184. MR 1447409, DOI 10.2140/pjm.1997.178.147
Bibliographic Information
- Abraham D. Smith
- Affiliation: Department of Mathematics, Statistics, and Computer Science, University of Wisconsin-Stout, Menomonie, Wisconsin 54751; and Geometric Data Analytics, Inc., Durham, North Carolina 27707
- MR Author ID: 924528
- ORCID: 0000-0002-6875-3290
- Email: smithabr@uwstout.edu
- Paul Bendich
- Affiliation: Department of Mathematics, Duke University, Durham, North Carolina 27708; and Geometric Data Analytics, Inc., Durham, North Carolina 27707
- MR Author ID: 936459
- Email: bendich@math.duke.edu
- John Harer
- Affiliation: Department of Mathematics, Duke University, Durham, North Carolina 27707; and Geometric Data Analytics, Inc., Durham, North Carolina 27707
- MR Author ID: 81320
- Email: harer@math.duke.edu
- Received by editor(s): December 4, 2019
- Received by editor(s) in revised form: August 7, 2020
- Published electronically: February 2, 2021
- Additional Notes: Work by all three authors was partially supported by the DARPA Simplex Program, under contract # HR001118C0070. The last two authors were also partially supported by the Air Force Office of Scientific Research under grant AFOSR FA9550-18-1-0266.
- © Copyright 2021 by the authors under Creative Commons Attribution-Noncommercial 3.0 License (CC BY NC 3.0)
- Journal: Trans. Amer. Math. Soc. Ser. B 8 (2021), 1-38
- MSC (2020): Primary 55U10; Secondary 55S35
- DOI: https://doi.org/10.1090/btran/56
- MathSciNet review: 4207891