# Persistent obstruction theory for a model category of measures with applications to data merging

## Abstract

Collections of measures on compact metric spaces form a model category (“data complexes”), whose morphisms are marginalization integrals. The fibrant objects in this category represent collections of measures in which there is a measure on a product space that marginalizes to any measures on pairs of its factors. The homotopy and homology for this category allow measurement of obstructions to finding measures on larger and larger product spaces. The obstruction theory is compatible with a fibrant filtration built from the Wasserstein distance on measures.

Despite the abstract tools, this is motivated by a widespread problem in data science. Data complexes provide a mathematical foundation for semi-automated data-alignment tools that are common in commercial database software. Practically speaking, the theory shows that database JOIN operations are subject to genuine topological obstructions. Those obstructions can be detected by an obstruction cocycle and can be resolved by moving through a filtration. Thus, any collection of databases has a persistence level, which measures the difficulty of JOINing those databases. Because of its general formulation, this persistent obstruction theory also encompasses multi-modal data fusion problems, some forms of Bayesian inference, and probability couplings.

## 1. Introduction

We begin this paper with an abstraction of a problem familiar to any large enterprise. Imagine that the branch offices within the enterprise have access to many data sources. The data sources exposed to each office are related and overlapping but non-identical. Each office attempts to merge its own data sources into a cohesive whole, and reports its findings to the head office. This is done by humans, often aided by ad-hoc data-merging software solutions. Presumably, each office does a good job of merging the data that they see. Now, the head office receives these cohesive reports, and must combine them into an overall understanding.

This paper provides a mathematical foundation combining methods from measure theory, simplicial homotopy, obstruction theory, and persistent cohomology (Section 1(a) gives an overview) for semi-automated data-table-alignment tools (e.g., HumMer, Reference 14) that are common in commercial database software. Data tables are abstracted as measures over value spaces. The problem of merging tables, or indeed merging previously-merged tables, is recast as the search for a measure that marginalizes correctly.

For example, one data table might record the ages and heights and weights of patients in a hospital, abstracted as a formal sum of point-mass atoms. Another data table might be an actuarial table giving ages and heights and weights for an entire population, abstracted as a smooth distribution where the heights and weights form 2-dimensional elliptical Gaussian distributions for each age, the means and singular values varying with age. Both examples would be data tables on the same age-by-height-by-weight space. A third data table might be a simulated probabilistic model of injury severity during vehicle collisions based on height and weight of the passenger. This data table on height-by-weight-by-severity space may or may not merge with each of the previous examples over height-by-weight, within some error. One can imagine many other data tables collected from myriad sources (motor-vehicle records, longitudinal studies, clinical trials) related to this example that may be of interest.
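To make this example concrete, both tables can be encoded as finite measures on a common (discretized) value space. The following Python sketch is purely illustrative; the grids, units, and parameter values are assumptions, not data from the paper:

```python
import math

# Two "data tables" as finite measures on the same discretized
# age-by-height-by-weight grid (all names and values are illustrative).
AGES    = range(20, 81, 10)      # years
HEIGHTS = range(150, 201, 10)    # cm
WEIGHTS = range(50, 101, 10)     # kg

# Hospital records: a formal sum of point-mass atoms, one per patient.
hospital = {(20, 180, 75): 1.0, (40, 170, 60): 1.0, (40, 180, 75): 1.0}

def gauss(x, mean, sigma):
    """Unnormalized Gaussian bump, used to discretize a smooth model."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

# Actuarial table: a smooth model, discretized; for each age, an
# elliptical Gaussian over (height, weight) with an age-dependent mean.
actuarial = {
    (a, h, w): gauss(h, 175 - 0.1 * (a - 20), 8.0) * gauss(w, 70.0, 10.0)
    for a in AGES for h in HEIGHTS for w in WEIGHTS
}

# Both are finite measures on the same value space, hence data tables
# over the same attribute list, and so candidates for comparison/merging.
```

The point of the encoding is only that atoms and discretized smooth models live in the same space of finite measures, so the merging question is well posed for both.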

Our first fundamental result (Theorem 3.11) uses this measure-theoretic lens to draw a surprising correspondence between the process of JOIN in database engineering and the Kan extension property for simplicial sets.

This abstraction, and the model-categorical tools that come with it, permits several advances over the current state of the art, which are collected in our second fundamental result (Theorem 4.13). First, inconsistencies in table collections are automatically detected as *obstructions* in the sense of Steenrod (i.e., a certain cocycle is nonzero). Second, when inconsistencies are detected, the obstruction-theoretic tools, combined with persistent cohomology, provide two potential remedies: a) if algebraically permitted (i.e., a certain cohomology class is zero), the user may retreat back one level of merging, repair, and proceed; b) else, the user may settle for a measure that only marginalizes approximately correctly, with the degree of possible correctness computed automatically by persistent obstruction theory.

More broadly, we are interested in three meta-problems, Problems 1.1, 1.2, and 1.3, concerning the merging of sources of partial information.

By “sources of partial information” we mean, roughly, collected data (databases, spreadsheets, pictures), statistical models, established theories, simulations, and general population trends. In this article, we define a formal mathematical structure—a *Data Complex* (Section 2)—that can encapsulate a wide range of problems like 1.1, 1.2, and 1.3. A data complex is meant to encode each source of information as a finite measure on an appropriate value space. Whether these measures arise from collected data as in Problem 1.2 or some model/theory/simulation/trend/merger derived from previous work as in Problem 1.3, we call them *data tables*. By using measures, we combine Problems 1.2 and 1.3 into a single problem.

### 1(a). Overview of technical approach

Often, formal mathematics looks very different from its intuitive purpose, so we want to make sure the reader understands our intent, absent its formal notation.

We want a mathematically rigorous way to encode and solve Problems 1.1, 1.2, and 1.3. When translated to the language of data complexes, a physically reasonable process for “merge [data tables] into a cohesive whole” can be expressed in terms of four key mathematical ingredients: homological algebra for simplicial sets, simplicial homotopy (Reference 10, Reference 18), obstruction theory (Reference 23), and persistent (co)homology across a filtration (Reference 4).

The first ingredient (homological algebra) is used because data tables may overlap partially, meaning that we need a formal simplicial complex to encode all possible intersections of all possible data tables. Moreover, simplicial sets allow data tables with repeated columns. The marginalization integral provides a face map and a boundary operator, and an analogue of the diagonal measure within a product provides the degeneracy map. The face and boundary operators tell us whether one “narrow” data table reproduces the data of another “wider” data table via marginalization integrals. Thus, the question of whether several “narrow” data tables can be merged into a few “wider” data tables becomes the question of whether a given chain is the boundary of a chain one dimension higher. That is, the ability to merge overlapping partial information sources as in Problem 1.2 is encoded in the homology of a data complex.

The second ingredient (simplicial homotopy) arises because Problems 1.1, 1.2, 1.3 suggest that we want “simple” solutions. Even when partial merging is possible in the sense of homology, it may be that the result is too complicated to be merged further. In the study of geometric bundles, the fundamental group and higher homotopy groups of the fiber play a key role, and we use simplicial homotopy in a similar way here. A simple solution to Problem 1.2/1.3 or a simple hypothesis in Problem 1.1 corresponds to a single data table (as opposed to a complicated chain of many data tables), which is indicated by triviality in the simplicial homotopy group.

An important side effect of introducing simplicial homotopy (via model categories) is that we see that the Kan extension condition means “merging operations are possible.” The process we call *merging* is similar to *JOIN’ing* in database software, to *fusion* in multi-modal data analysis, and to *coupling* in probability theory. This link reinforces the intuition that data complexes are a good way to encode Problems 1.1/1.2/1.3 for modern data mining when using spreadsheets, DataFrames, and SQL databases. Indeed, our first fundamental result (Theorem 3.11) explicitly formalizes this correspondence.
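Concretely, the merge of two overlapping data tables along a shared attribute can be sketched as a horn-filling operation on discrete measures. The following toy Python sketch (all names are illustrative assumptions, not the paper's notation) builds the conditional-independence coupling, the canonical “filler” familiar from probability theory:

```python
from collections import defaultdict

def join(ab, bc):
    """Given finite discrete measures ab on X-by-Y and bc on Y-by-Z whose
    Y-marginals agree, build a measure on X-by-Y-by-Z that marginalizes
    back to both: the conditional-independence coupling
        m(x, y, z) = ab(x, y) * bc(y, z) / w(y),
    where w is the shared Y-marginal.  A toy sketch of horn filling."""
    w = defaultdict(float)                 # shared Y-marginal
    for (_, y), mass in ab.items():
        w[y] += mass
    out = {}
    for (x, y), m1 in ab.items():
        for (y2, z), m2 in bc.items():
            if y2 == y and w[y] > 0:
                out[(x, y, z)] = out.get((x, y, z), 0.0) + m1 * m2 / w[y]
    return out

# Two tables overlapping on the middle attribute.
ab = {("a", 0): 1.0, ("b", 1): 1.0}
bc = {(0, "u"): 1.0, (1, "v"): 1.0}
abc = join(ab, bc)
```

When the shared marginals of the two inputs agree, the output marginalizes back to both inputs; when they disagree, no exact filler exists, which is the situation the obstruction theory of Sections 3 and 4 is designed to quantify.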

The reader may be wondering why we introduce something as abstract as simplicial homotopy into something so concrete and common as data merging. Consider the typical database operation

```sql
SELECT * FROM table1 INNER JOIN table2 ON table1.column1 = table2.column2 WHERE condition;
```

When issuing such a command, the administrator *must* designate two tables to be JOINed and choose specific columns from the two tables to be identified via the ON clause. The SELECT * …; command returns a table, whose columns must appear in some order that is determined by the ordering of attributes in table1 and table2, by their placement in the command, and by the columns in the ON clause. Thus, in the language of Section 2, the database software and the working administrator must agree on a total set of attributes, the attributes in each table, and an ordered attribute inclusion to be used for the ON clause.

This command also indicates why we formalize “data tables” as measures over products of attribute value spaces. Replacing SELECT * with a SELECT columnList corresponds to the ability to re-order the attribute list and to marginalize the output to a sublist of attributes; hence, arbitrary finite products are possible. The optional WHERE condition clause allows one to impose additional restrictions on the values to be considered by imposing logical conditions on the entries, such as WHERE (age > 18 AND height > 200). These conditions allow one to restrict the data table to any measurable subset of the value space.Footnote^{1} The entries of a WHERE-restricted data table constitute the mass of this measurable set with respect to the data table. (Finally, for those fluent in SQL subtleties, note that the ability to perform LEFT, RIGHT, and OUTER JOIN instead of INNER JOIN will be encompassed by approximate join and face operations in Section 4.)
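To connect the SQL operations to the measure-theoretic ones, here is a toy Python sketch in which a data table is a finite measure with point-mass atoms; SELECT columnList becomes marginalization and WHERE becomes restriction to a measurable subset. All names and data are illustrative assumptions, not the paper's notation:

```python
from collections import defaultdict

# A data table as a finite measure: each key is a tuple of attribute
# values, each value is a nonnegative mass (illustrative names only).
COLUMNS = ("age", "height", "weight")

def select(measure, columns, keep=COLUMNS):
    """SELECT columnList: marginalize onto `columns` in the requested
    order.  Rows that become identical after dropping columns have their
    masses added, i.e., the marginalization integral for atomic measures."""
    idx = [keep.index(c) for c in columns]
    out = defaultdict(float)
    for row, mass in measure.items():
        out[tuple(row[i] for i in idx)] += mass
    return dict(out)

def where(measure, predicate, keep=COLUMNS):
    """WHERE condition: restrict the measure to a measurable subset."""
    return {row: m for row, m in measure.items()
            if predicate(dict(zip(keep, row)))}

# Three patients as point-mass atoms on age-by-height-by-weight.
table = {(25, 180, 75): 1.0, (40, 165, 60): 1.0, (40, 180, 75): 1.0}

adults_tall = where(table, lambda r: r["age"] > 30 and r["height"] > 170)
hw = select(table, ("height", "weight"))   # marginalize away age
```

Note that `select` conserves total mass while `where` discards the mass outside the chosen subset, mirroring the distinction between marginalization and restriction.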

The third ingredient (Steenrod’s obstruction theory as in Reference 22) provides guidance on how to combine homological algebra and homotopy theory to detect and describe any obstructions to an iterative merging process. In its original formulation, obstruction theory asks whether a section of a fiber bundle defined over a subspace of a topological space can be extended to a section defined over a larger subspace. The most famous example is the smooth category, where one computes characteristic cohomology classes to indicate whether sections of a bundle can be extended globally. Steenrod studied this problem in the case of fibrations over general topological spaces. Typically, assuming one has some sort of CW structure on the base, one tries to extend the section first over the 0-skeleton, then over the 1-skeleton, and so forth. Assuming one has already extended the section over the n-skeleton, Steenrod’s obstruction cocycle is a cochain in degree n+1 with coefficients in the n-th homotopy group of the fiber, twisted by the fundamental group of the base; loosely, the cochain is defined on each (n+1)-cell by restricting the section to the boundary of that cell, but there is some nuance coming from the twisting needed to turn this restriction into a homotopy class in the fiber. If this cocycle is trivial in homotopy, then the section can be extended over the (n+1)-skeleton, and otherwise it cannot. However, if this cocycle is merely a coboundary, then there is another section, agreeing with the original on the (n-1)-skeleton, whose obstruction cocycle is trivial in homotopy. Hence, obstructions are discovered dimension-by-dimension via homotopy-valued cohomology, and the obstruction computation can often permit the “correction” of initial extensions of the section to avoid higher-dimensional obstructions.
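In standard notation (assumed here for illustration; these symbols are not the paper's), for a fibration with fiber F over a CW base B and a section f already defined on the n-skeleton, Steenrod's cocycle can be written as

```latex
c(f) \;\in\; C^{n+1}\!\bigl(B;\ \pi_n(F)\bigr),
\qquad
c(f)(\sigma) \;=\; \bigl[\, f|_{\partial\sigma} \,\bigr] \;\in\; \pi_n(F)
\quad \text{for each } (n+1)\text{-cell } \sigma,
```

so that f extends over the (n+1)-skeleton exactly when c(f) = 0, and f can be corrected on the n-skeleton so as to extend exactly when the class of c(f) vanishes in H^{n+1}(B; \pi_n(F)) (with coefficients twisted by the fundamental group where necessary).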

This concept of an obstruction cocycle was introduced by Steenrod in Reference 22 and revisited many times, such as Reference 15 and Reference 11. Its importance motivated early work in category theory. The entire *raison d’être* of defining fibrant objects and model categories in Reference 10 and Reference 18 was to establish the most general context in which these (co)homology and homotopy calculations remain sensible for more general notions of “weak equivalence.” In particular, one does not require actual topological spaces to perform obstruction theory, merely fibrant objects in a model category.

Here, we establish homology and homotopy theory for data tables by relying on these categorical foundations, giving us an obstruction theory directly analogous to Steenrod’s. When sequential merging is impossible, the obstruction cochain can compute specific data tables that obstruct the process. That is, obstruction theory determines when Problem 1.3 is solvable locally but not globally.

The fourth ingredient, persistent (co)homology, provides a mathematically robust way to measure how much the underlying original data tables would have to be altered in order to overcome an obstruction. This is a key feature of the theory, because from a practical perspective, multiple information sources are *never* perfectly consistent. Typos and transpositions and omissions and error bars always exist, and must be accounted for. We use a filtration built from the Wasserstein distance on measures to ensure that the desired simplicial homotopy is possible throughout all levels of the filtration. This allows for a well-defined notion of *persistent obstruction theory.* Our second fundamental result (Theorem 4.13) formalizes the idea that when inconsistencies are identified, one of two remedies may be availableFootnote^{2}:

- the head office should retreat back one merging level, repair (with the repair suggested by algebra), then again seek consensus; or

- the head office should settle for only approximate consensus, where the desired measure only approximately marginalizes correctly, with the degree of approximation computed via persistence.
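For intuition about the filtration parameter, here is a minimal sketch (assuming discrete measures on the real line; the paper works with measures on products of compact metric spaces) of the first Wasserstein distance, computed as the integral of the gap between cumulative distribution functions:

```python
def wasserstein_1d(mu, nu):
    """First Wasserstein (earth-mover) distance between two finite
    measures on the real line with equal total mass, each given as a
    {value: mass} dict.  On the line, W1 equals the integral of
    |F_mu - F_nu| over the line, computed exactly on the finite union
    of supports.  A toy sketch, not the paper's construction."""
    pts = sorted(set(mu) | set(nu))
    assert abs(sum(mu.values()) - sum(nu.values())) < 1e-12, "masses must match"
    dist = 0.0
    cdf_gap = 0.0   # running value of F_mu - F_nu
    for left, right in zip(pts, pts[1:]):
        cdf_gap += mu.get(left, 0.0) - nu.get(left, 0.0)
        dist += abs(cdf_gap) * (right - left)
    return dist
```

For example, two height measurements that disagree by a 2 cm transcription error sit at Wasserstein distance 2, so such a pair of tables would merge at filtration level 2 but at no earlier level.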

The ultimate result of this article is a mathematically robust framework for data merging that is reasonably applicable to real-world data. In this framework, Problems 1.1 and 1.2/1.3 become Problems 2.38 and 2.39, which are answered by Theorem 4.13 and Definition 4.14.

### 1(b). Related work

To the best of our knowledge, this is the first work to combine all of the tools above to build a robust obstruction theory for databases. Other authors have used different aspects of these tools to address databases. For example, recent work by Fong and Spivak uses database schemas and type/value relationships as a motivational example to introduce functors and (co)limits (Reference 6, Chapter 3). Specifically, Example 3.99 and the chapter’s final remark are somewhat in the same spirit as the approach taken here. Our category of data complexes in Section 2 is similar to the categorical presentation in Reference 20, Reference 21, but our data tables are built from measures (not sets) in order to flexibly address the errors that are inevitable in applications. Finally, other recent work by Abramsky, Morton, and collaborators uses obstruction theory (in the sheaf-theoretic context) to detect non-contextuality in quantum theory, with an application to the non-existence of a universal data table that contains a set of given tables (Reference 1, Reference 2, Reference 13). We expect that further interweaving of these measure-theoretic, sheaf-theoretic, and simplicial/categorical perspectives will be fruitful in the future.

### 1(c). Outline

The rest of this paper is organized as follows. Section 2 defines the basic object of study, a *data complex*, and draws a mapping between its simplicial set structure and the choices that must be made by any database administrator. Categorical language is alluded to in this section, but a full categorical treatment of data complexes is confined to the Appendix. Section 3 connects simplicial homotopy to the notion of JOIN, and shows how obstruction theory detects the impossibility of merging. Section 4 describes our notion of persistent obstruction theory and its application to the idea of fuzziness of consensus. The paper concludes with discussion of practical considerations for applications in Section 5.

## 2. Attributes and data tables

This section provides a practical, developmental discussion of data complexes that should be accessible to a fairly wide mathematical audience, with full categorical language found in the Appendix. The basic definitions appear in Sections 2(a) and 2(b), culminating in Theorem 2.14, which shows that we have indeed defined a simplicial set. Operations that are specifically useful to standard database operations (inclusion/merge/join) are defined in Section 2(c). Then Section 2(d) makes plain the analogue of “section of a bundle,” which permits the rephrasing of our fundamental problems in mathematical language, and Section 2(e) defines the (co)homology of data complexes needed for obstruction theory.

### 2(a). Data subcomplex as a simplicial set

Our definitions are aimed at making precise the following real-life scenario in data administration.

- (1)
The administrator chooses a set of all attributes (column names and variable types) of interest.

- (2)
For each attribute in the list, the administrator chooses a space of possible values and a “reasonable” metric that can provide the distance between any two values in that space. Our notion of “reasonable” includes compactness, which is typically guaranteed by boundedness of realistic integer or vector-valued entries.

- (3)
The administrator acquires “data” for some lists of attributes, and attempts to reconcile these into a joint view across all attributes in the chosen set. The reconciliation process involves “join” operations that could be represented by SQL commands such as:

```sql
SELECT * FROM table1 INNER JOIN table2 ON table1.column1 = table2.column2 WHERE condition;
```

- (4)
When reconciling, the administrator may choose to alter the data, as long as the alterations are “small” with respect to both the individual values (via the chosen metrics) and the overall information-theoretic content of the data.

The former two items are choices that must be made. The latter two items are a process to be accomplished. The mathematical structure developed here is informed deeply by the example SQL command, as discussed in Section 1(a).

Let us define our objects. It is convenient to use language of category theory; see Appendix A for our conventions.

Consider a finite set, whose elements are called *attributes*. For each attribute, there is a compact metric space called its *value space*.Footnote^{3} These assumed objects (the finite set of attributes and a compact metric space assigned to each attribute) are user-supplied by a data administrator; after these choices are made, everything else proceeds as defined.

^{3}

These assumptions imply that each value space is complete and separable, and is a Radon space. In many applications, the space is finite or a closed interval in the real line, so one needn’t imagine esoteric spaces to grasp the theory.

Each value space is a Radon space (in particular, a measurable space) using the usual Borel algebra from its metric. These metrics will be used in Section 4 to quantify levels of acceptable imprecision when marginalizing measures.

An *attribute list* is a finite sequence of attributes; that is, an attribute list of length n is a function from {1, …, n} to the set of attributes. The *length* of an attribute list is the number of its entries. An attribute list is called *nondegenerate* if it contains no repetitions; that is, if the function is one-to-one. The longest nondegenerate attribute lists are the permutations of the full attribute set.

For any attribute list, the product of the corresponding value spaces is well-defined. This product space admits a product metric and is measurable via the corresponding tensor-product algebra.Footnote^{4} For any list representing a permutation of the full attribute set, the product space is the correspondingly ordered total product of the measurable spaces of all attributes. At the other extreme, we equip the empty attribute list, of length 0, with the *trivial value space*, a singleton set.

^{4}

We use this choice of product metric for ease of proof when studying filtrations. Other product metrics, or more general constructions, might carry the whole theory, too, but we have not yet verified this.

In Section 2(b), it is shown that the collection of attribute lists is equipped with face maps (by omission of an element, as in Definition 2.7) and degeneracy maps (by repetition of an element, as in Definition 2.11). When omitting the trivial level, the attribute lists form the simplicial set whose elements are generated by the permutations of the attribute set via the face and degeneracy maps. Including the trivial level, they form the augmented simplicial set generated this way. See Appendix A for a summary of the standard definition of (augmented) simplicial sets.

For any attribute list, consider the space of finite measures on its value space. A *data table* is a pair consisting of an attribute list and a finite measure on the corresponding value space. Note that a measure on the singleton set is determined by its total mass. A *trivial data table* is any data table whose attribute list is empty, so that its measure is a measure on the singleton set. We sometimes abbreviate our notation for data tables from the pair to the measure alone, because any measure is equipped with a domain (the measurable sets in the value space), so the attribute list is understood in context.

In the first example alluded to in the introduction, we could have the attribute list [age, height, weight], with age in integer years, height in centimeters, and weight in kilograms, each with the standard metric. The corresponding space of measures would be the set of finite measures on the compact set given by the product of the three value spaces. An attribute list with a repeated attribute, such as [age, height, height], is also permissible, and might arise for example if heights were compared from two different sources (driver’s license versus medical chart).

For practical purposes, because each value space is a compact metric space, one might use the Radon–Nikodym theorem to write a data table using a density function with respect to a uniformFootnote^{5} probability measure on the compact set; however, for simplicity we use the language and notation of measures instead of the language of functions and integrals.

Theorem 2.14 shows that the ambient data complex is a simplicial set (augmented when including the trivial level), with faces given by the marginalization integrals (Definition 2.9) and degeneracies given by Dirac diagonalizations or intersections (Definition 2.13). The ambient data complex is a small category, whose morphisms are generated by faces and degeneracies. The forgetful functor from data tables to attribute lists is a simplicial map between these small categories.

### 2(b). Morphisms of data tables

This section establishes notation for common operations and proves that the ambient data complex and the complex of attribute lists are simplicial sets, establishing that they are small categories with morphisms that are well understood in the language of measures.

In our earlier example with individual patients as atomic point-masses, the face map from age-by-height-by-weight to height-by-weight represents deleting the age column of the spreadsheet, and allowing new duplicate entries to add (that is, integrate) measure.

Face maps can be applied multiple times, and the following lemma provides the desired re-ordered “commutation” property. For attribute lists the proof is immediate; for data tables it is the Fubini–Tonelli Theorem applied to the measures.
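The face map and its commutation property can be checked directly on discrete data tables. In this toy Python sketch (index-based columns are an assumption of the illustration; the paper's face map is the marginalization integral of Definition 2.9):

```python
from collections import defaultdict

def face(measure, i):
    """The i-th face map on a discrete data table: marginalize away the
    i-th column by summing the masses of rows that become identical,
    i.e., the marginalization integral for atomic measures."""
    out = defaultdict(float)
    for row, mass in measure.items():
        out[row[:i] + row[i + 1:]] += mass
    return dict(out)

# Patients as point-mass atoms on age-by-height-by-weight.
table = {(20, 180, 75): 1.0, (40, 170, 60): 1.0, (40, 180, 75): 1.0}

hw = face(table, 0)   # delete "age"; duplicate rows add their measure

# The commutation property (via Fubini-Tonelli): marginalizing away two
# columns gives the same measure regardless of the order of operations.
assert face(face(table, 2), 0) == face(face(table, 0), 1)
```

The final assertion is the discrete shadow of the simplicial identity for face maps; for general measures the same computation is the Fubini–Tonelli theorem mentioned above.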

If the measure is expressed as a density function via the Radon–Nikodym theorem, then this degeneracy corresponds to a Dirac delta supported on the diagonal.

### 2(c). Inclusions, merges, and joins

We now establishFootnote^{7} additional operations (inclusion, sum, merge, join) that are special to data complexes and do not apply to general simplicial sets.

^{7}

We are particularly indebted to Tony Falcone for technical discussions that motivated the formalism in this subsection.

The next lemma and corollary make clear that face maps and attribute inclusions are related tightly.