Persistent Obstruction Theory for a Model Category of Measures with Applications to Data Merging

Collections of measures on compact metric spaces form a model category ("data complexes"), whose morphisms are marginalization integrals. The fibrant objects in this category represent collections of measures in which there is a measure on a product space that marginalizes to any measures on pairs of its factors. The homotopy and homology for this category allow measurement of obstructions to finding measures on larger and larger product spaces. The obstruction theory is compatible with a fibrant filtration built from the Wasserstein distance on measures. Despite the abstract tools, this is motivated by a widespread problem in data science. Data complexes provide a mathematical foundation for semi-automated data-alignment tools that are common in commercial database software. Practically speaking, the theory shows that database JOIN operations are subject to genuine topological obstructions. Those obstructions can be detected by an obstruction cocycle and can be resolved by moving through a filtration. Thus, any collection of databases has a persistence level, which measures the difficulty of JOINing those databases. Because of its general formulation, this persistent obstruction theory also encompasses multi-modal data fusion problems, some forms of Bayesian inference, and probability couplings.


Introduction
We begin this paper with an abstraction of a problem familiar to any large enterprise. Imagine that the branch offices within the enterprise have access to many data sources. The data sources exposed to each office are related and overlapping but non-identical. Each office attempts to merge its own data sources into a cohesive whole, and reports its findings to the head office. This is done by humans, often aided by ad-hoc data-merging software solutions. Presumably, each office does a good job of merging the data that they see. Now, the head office receives these cohesive reports, and must combine them into an overall understanding.
This paper provides a mathematical foundation combining methods from measure theory, simplicial homotopy, obstruction theory, and persistent cohomology (Section 1(a) gives an overview) for semi-automated data-table-alignment tools (e.g., HumMer [13]) that are common in commercial database software. Data tables are abstracted as measures over value spaces, and the problem of merging tables, or indeed merging previously-merged tables, is recast as the search for a measure that marginalizes correctly. Our first fundamental result (Theorem 3.11) uses this measure-theoretic lens to draw a surprising correspondence between the process of JOIN in database engineering and the Kan extension property for simplicial sets.
This abstraction, and the model-theoretic tools that come with it, permits several advances over the current state of the art, which are collected in our second fundamental result (Theorem 4.13). First, inconsistencies in table collections are automatically detected as obstructions in the sense of Steenrod (i.e., a certain cocycle is not zero). Recent work by Abramsky,
The second ingredient (simplicial homotopy) arises because Problems 1.1, 1.2, 1.3 suggest that we want "simple" solutions. Even when partial merging is possible in the sense of homology, it may be that the result is too complicated to be merged further. In the study of geometric bundles, the fundamental group and higher homotopy groups of the fiber play a key role, and we use simplicial homotopy in a similar way here. A simple solution to Problem 1.2/1.3 or a simple hypothesis in Problem 1.1 corresponds to a single data table (as opposed to a complicated chain of many data tables), which is indicated by triviality in the simplicial homotopy group.
An important side effect of introducing simplicial homotopy (via model categories) is that we see that the Kan extension condition means "merging operations are possible." The process we call merging is similar to JOIN'ing in database software, to fusion in multi-modal data analysis, and to coupling in probability theory. This link reinforces the intuition that data complexes are a good way to encode Problems 1.1/1.2/1.3 for modern data mining when using spreadsheets, DataFrames, and SQL databases. Indeed, our first fundamental result (Theorem 3.11) explicitly formalizes this correspondence.
The reader may be wondering why we introduce something as abstract as simplicial homotopy into something so concrete and common as data merging. Consider the typical database operation
    SELECT * FROM table1 INNER JOIN table2 ON table1.column1 = table2.column2 WHERE condition;
When issuing such a command, the administrator must designate two tables to be JOINed and choose specific columns from the two tables to be identified via the ON clause. The SELECT * ...; command returns a table, whose columns must appear in some order that is determined by the ordering of attributes in table1 and table2, by their placement in the command, and by the columns in the ON clause. Thus, in the language of Section 2, the database software and the working administrator must agree on a total set of attributes, the attributes in each table, and an ordered attribute inclusion to be used for the ON clause.
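As a concrete instance of the JOIN command above, the following is a minimal sketch using Python's built-in sqlite3 module. The table contents, the extra columns name and city, and the particular WHERE condition are invented for illustration; only the shape of the query follows the text. Note how the returned column order is fixed by the order of table1 and table2 in the command.

```python
import sqlite3

def inner_join_demo():
    """Build two tiny in-memory tables and JOIN them on one shared attribute."""
    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    # Hypothetical schemas: column1 and column2 play the role of the ON clause.
    cur.execute("CREATE TABLE table1 (name TEXT, column1 INTEGER)")
    cur.execute("CREATE TABLE table2 (column2 INTEGER, city TEXT)")
    cur.executemany("INSERT INTO table1 VALUES (?, ?)",
                    [("ann", 1), ("bob", 2), ("cy", 3)])
    cur.executemany("INSERT INTO table2 VALUES (?, ?)",
                    [(1, "Oslo"), (2, "Lima"), (4, "Rome")])
    rows = cur.execute(
        "SELECT * FROM table1 INNER JOIN table2 "
        "ON table1.column1 = table2.column2 "
        "WHERE table2.city != 'Lima' ORDER BY table1.name"
    ).fetchall()
    # Column names of the result, in the order the engine returns them.
    columns = [d[0] for d in cur.description]
    con.close()
    return columns, rows
```

Calling inner_join_demo() returns the ordered attribute list of the joined table together with its rows, which is the pair of choices (ordering and content) that Section 2 formalizes.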
This command also indicates why we formalize "data tables" as measures over products of attribute value spaces. Replacing SELECT * with a SELECT columnList corresponds to the ability to re-order the attribute list and to marginalize the output to a sublist of attributes. The WHERE condition clause allows one to impose additional restrictions on the values to be considered by imposing logical conditions on the entries, which is analogous to evaluating a measure on a measurable set. (Finally, for those fluent in SQL subtleties, note that the ability to perform LEFT, RIGHT, and OUTER JOIN instead of INNER JOIN will be encompassed by approximate joins in Section 4.)
The third ingredient (Steenrod's obstruction theory as in [18]) provides guidance on how to combine homological algebra and homotopy theory to detect and describe any obstructions to an iterative merging process. While Steenrod's theory as originally written applies only to fibrations of topological spaces, we see that a similar formulation is effective in our case when using formal homology and simplicial homotopy. When sequential merging is impossible, the obstruction cochain can compute specific data tables that obstruct the process. That is, obstruction theory determines when Problem 1.3 is solvable locally but not globally.
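The correspondence between SELECT columnList, WHERE, and measures described above can be sketched for finite tables, where a table becomes a counting measure on tuples. This is only an illustration under that finiteness assumption; the rows and the helper names as_measure, select_columns, and where are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical toy table over the attribute list [name, dept, city].
ROWS = [
    ("ann", "ops", "Oslo"),
    ("bob", "ops", "Lima"),
    ("cy", "dev", "Oslo"),
    ("cy", "dev", "Oslo"),   # a repeated row simply adds mass
]

def as_measure(rows):
    """Counting measure on the product of value spaces: tuple -> mass."""
    return Counter(rows)

def select_columns(measure, columns):
    """SELECT columnList: reorder and marginalize onto the given column indices."""
    out = Counter()
    for point, mass in measure.items():
        out[tuple(point[c] for c in columns)] += mass
    return out

def where(measure, predicate):
    """WHERE condition: mass of the measurable set cut out by the predicate."""
    return sum(mass for point, mass in measure.items() if predicate(point))

tau = as_measure(ROWS)
city_dept = select_columns(tau, [2, 1])           # reordered to [city, dept]
oslo_mass = where(tau, lambda p: p[2] == "Oslo")  # tau evaluated on {city = Oslo}
```

Marginalization preserves total mass, so city_dept and tau have the same total, while where returns the mass of a single measurable set.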
The fourth ingredient (persistence of filtrations) provides a mathematically robust way to measure how much the underlying original data tables would have to be altered in order to overcome an obstruction. This is a key feature of the theory, because from a practical perspective, multiple information sources are never perfectly consistent. Typos and transpositions and omissions and error bars always exist, and must be accounted for. We use a filtration built from the Wasserstein distance on measures to ensure that the desired simplicial homotopy is possible throughout all levels of the filtration. This allows for a well-defined notion of persistent obstruction theory. Our second fundamental result (Theorem 4.13) formalizes the idea that when inconsistencies are identified, one of two remedies may be available (footnote 1):
• the head office should retreat back one merging level, repair (with repair suggested by algebra), then again seek consensus;
• the head office should settle for only approximate consensus, where the desired measure only approximately marginalizes correctly, with the degree of approximation computed via persistence.
The ultimate result of this article is a mathematically robust framework for data merging that is reasonably applicable to real-world data. In this framework, Problems 1.1 and 1.2/1.3 become Problems 2.37 and 2.38, which are answered by Theorem 4.13 and Definition 4.14.
1(b). Outline. The rest of this paper is organized as follows. Section 2 defines the basic object of study, a data complex, and draws a mapping between its simplicial set structure and the choices that must be made by any database administrator. Categorical language is alluded to in this section, but a full categorical treatment of data complexes is confined to the Appendix. Section 3 connects simplicial homotopy to the notion of JOIN, and shows how obstruction theory detects the impossibility of merging. Section 4 describes our notion of persistent obstruction theory and its application to the idea of fuzziness of consensus. The paper concludes with discussion of practical considerations for applications in Section 5.

Acknowledgments
Work by all three authors was partially supported by the DARPA Simplex Program, under contract # HR001118C0070. The last two authors were also partially supported by the Air Force Office of Scientific Research under grant AFOSR FA9550-18-1-0266.
We are very grateful to John Paschewitz and Tony Falcone for project guidance and technical direction, and to Greg Friedman, Justin Curry, Jose Perea, and Jonathan Mattingly for helpful discussions at various stages of the theoretical development.

Attributes and Data Tables
This section provides a practical developmental discussion of a Data Subcomplex that should be accessible to a fairly wide mathematical audience, with full categorical language found in the Appendix. The basic definitions appear in Sections 2(a) and 2(b), culminating in Theorem 2.14 which shows that we have indeed defined a simplicial set. Operations that are specifically useful to standard database operations (inclusion/merge/join) are defined in Section 2(c). Then Section 2(d) makes plain the analogue of "section of a bundle," which permits the rephrasing of our fundamental problems in mathematical language, and Section 2(e) defines the (co)homology of data complexes needed for obstruction theory.
Footnote 1. In fact, the second is always available, but may be less desirable!
2(a). Data Subcomplex as a Simplicial Set. Our definitions are aimed at making precise the following real-life scenario in data administration.
(1) The administrator chooses a set $A$ of all attributes (column names and variable types) of interest.
(2) For each attribute $a$ in the list $A$, the administrator chooses a space of possible values, and a "reasonable" metric $\rho_a$ that can provide the distance between any two values in that space. Our notion of "reasonable" includes compactness, which is typically guaranteed by boundedness of realistic integer or vector-valued entries.
(3) The administrator acquires "data" for some lists of attributes, and attempts to reconcile these into a joint view across all attributes in $A$. The reconciliation process involves "join" operations that could be represented by SQL commands such as the example command discussed in Section 1(a).
(4) When reconciling, the administrator may choose to alter the data, as long as the alterations are "small" with respect to both the individual values via $\rho_a$ and with respect to the overall information-theoretic content of the data.
The first two items are choices that must be made. The last two items are a process to be accomplished. The mathematical structure developed here is informed deeply by the example SQL command, as discussed in Section 1(a).
Let us define our objects. It is convenient to use language of category theory; see Appendix A for our conventions.
Consider a finite set $A$. The elements are called attributes. For each attribute $a \in A$, there is a compact metric space $(V(a), \rho_a)$, called the value space (footnote 2). These assumed objects (the finite set of attributes and a compact metric space assigned to each attribute) are user-supplied by a data administrator; after these choices are made, everything else proceeds as defined.
Each $V(a)$ is a Radon space (in particular, a measurable space) using the usual Borel algebra from the metric $\rho_a$. These metrics will be used in Section 4 to quantify levels of acceptable imprecision when marginalizing measures.
An attribute list $T = [a_0, a_1, \ldots, a_n]$ is a finite sequence of attributes; that is, an attribute list is a function $T : \{0, \ldots, n\} \to A$. The length of an attribute list is $\mathrm{len}(T) := n + 1$. An attribute list $T$ is called nondegenerate if it contains no repetitions; that is, if the function $T$ is one-to-one. The longest nondegenerate attribute lists are permutations of $A$.
For any attribute list $T$, the product space $V(T) := \prod_{i=0}^{n} V(a_i)$ is well-defined. The product space $V(T)$ admits the $L^\infty$ metric $\rho_T = \max_{a \in T} \rho_a$ and is measurable via the corresponding tensor-product algebra (footnote 3). For any list $\tilde{A}$ representing a permutation of the set $A$, the space $V(\tilde{A})$ is the correspondingly ordered total product of all the measurable spaces of all attributes. At the other extreme, we equip the empty attribute list $[]$, of length 0, with the trivial value space $V([]) = \{*\}$, a singleton set.
Definition 2.1 (Set of Attribute Lists). Let A denote the set of all attribute lists in A. For each $n \geq -1$, let $A_n \subset A$ denote the set of all attribute lists of length $n+1$. A is a small category. Using the notation from Appendix A, an object in A is a function $T : n \to A$. The case $n = -1$, giving the empty list $T = []$, is allowed. A morphism of attribute lists $T \to T'$ is given by an order-preserving function $n' \to n$ (a morphism of $\Delta_a$ as in Appendix A) whose composition with $T$ equals $T'$, which is natural for the commutative diagram (2.1).
In Section 2(b) it is shown that for n ≥ 0, each A n is equipped with face maps d i : A n → A n−1 (by omission of the ith element as in Defn 2.7) and degeneracy maps s i : A n → A n+1 (by repetition of the ith element as in Defn 2.11). When omitting the trivial −1-level, A is the simplicial set whose elements are generated by the permutations of A via the face and degeneracy maps. Including the trivial −1-level, A is the augmented simplicial set generated this way. See Appendix A for a summary of the standard definition of (augmented) simplicial sets.
For any attribute list $T$, let $M(T)$ denote the space of finite measures on $V(T)$. A data table is a pair $(T, \tau)$ for $\tau \in M(T)$ for any $T \in A$. Note that $M([]) \cong \mathbb{R}_{\geq 0}$, as a measure on the singleton set $V([])$ is determined by the mass $M \geq 0$ of $\{*\}$. A trivial data table is any data table of the form $(T, \tau)$ where $T = []$ and $\tau = M \geq 0$ is a measure on the singleton set $V([]) = \{*\}$. We sometimes abbreviate our notation for data tables from $(T, \tau)$ to $\tau$, because any $\tau \in M(T)$ is equipped with a domain (the measurable sets in $V(T)$), so $T$ is understood in context.
For practical purposes, because $V(T)$ is a compact metric space, one might use the Radon-Nikodym theorem to write any $\tau \in M(T)$ using a density function with respect to the uniform (footnote 4) probability measure on the compact set; however, for simplicity we use the language and notation of measures instead of the language of functions and integrals.
Definition 2.2 (Ambient Data Complex). Given $A$, the ambient data complex over $A$ is the set of all data tables,
$$X := \{(T, \tau) : T \in A,\ \tau \in M(T)\}, \qquad X_n := \{(T, \tau) \in X : T \in A_n\} \text{ for } n \geq -1,$$
equipped with the forgetful map $p : X \to A$ given by $p(T, \tau) := T$.
Theorem 2.14 shows that the ambient data complex is a simplicial set (augmented when including $X_{-1}$) with faces given by the marginalization integrals (Definition 2.9) and degeneracies given by Dirac diagonalizations or intersections (Definition 2.13). The ambient data complex $X$ is a small category, whose morphisms are generated by faces and degeneracies. The forgetful functor $p$ is a simplicial map between these small categories.

Definition 2.3 (Data Subcomplex).
Given an ambient data complex $p : X \to A$, a Data Subcomplex is a subset/subcategory $X \subseteq X$ that is closed under the face and degeneracy maps defined in Definitions 2.9 and 2.13. Because $p$ is a simplicial map, the attribute base $A = p(X) = \{T \in A_n : \exists n \geq -1,\ \exists (T, \tau) \in X_n\}$
Footnote 4. That is, the measure depends only on $r$, for metric balls $B_r(x)$ of sufficiently small radius.
Remark 2.6. Actual database merging problems encountered in real-life situations such as Problems 1.1-1.3 always present themselves as Finitely Generated Data Subcomplexes, because there is some finite set of database tables or spreadsheets under consideration. The face and degeneracy maps provide the logical relations between these tables that allow or prevent joining. Real-life situations are also closed under permutation, because the "SELECT * FROM ..." clause in SQL allows the database engineer to re-order the columns of any table.
Notational Note! We always use p : X → A to refer to an ambient data complex. We use either p : X → A or p : S → B to refer to a data subcomplex of p : X → A. We tend to use p : S → B when we imagine that this data subcomplex came from an actual data merging problem (so it is likely to be finitely generated and closed under permutation); however, we state explicitly these conditions when they are required for a result. When the projection p and the attribute simplicial sets A, B are not used in a statement, we omit them and write "a data subcomplex S of an ambient X."
2(b). Morphisms of Data Tables. This section establishes notation for common operations and proves that A and X are simplicial sets, establishing that they are small categories with morphisms that are well-understood in the language of measures.
Definition 2.7 (Face of Attribute List). The face map on attribute lists, $d_i : A_n \to A_{n-1}$, is defined as omission of the $i$th entry $a_i$ in $T = [a_0, \ldots, a_i, \ldots, a_n]$, so
$$d_i [a_0, \ldots, a_{i-1}, a_i, a_{i+1}, \ldots, a_n] = [a_0, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n].$$
Definition 2.9 (Face of Data Table). For a data table $(T, \tau) \in X_n$ with $T = [a_0, \ldots, a_i, \ldots, a_n]$, let $d_i(\tau) \in M(d_i(T))$ be the measure evaluated on the basis sets $U_0 \times \cdots \times U_{i-1} \times U_{i+1} \times \cdots \times U_n$ of the Borel algebra on $V([a_0, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n]) = V(d_i T)$ as
$$(d_i \tau)(U_0 \times \cdots \times U_{i-1} \times U_{i+1} \times \cdots \times U_n) := \tau(U_0 \times \cdots \times U_{i-1} \times V(a_i) \times U_{i+1} \times \cdots \times U_n).$$
This is the measure obtained by marginalization to omit the $i$th factor, that is, by integrating $\tau$ over $V(a_i)$, and the resulting data table $(d_i T, d_i \tau)$ is well-defined in $X_{n-1}$.
Face maps can be applied multiple times, and the following lemma provides the desired re-ordered "commutation" property. For attribute lists the proof is immediate; for data tables it is the Fubini-Tonelli Theorem applied to the measures.

Lemma 2.10 (Fubini-Tonelli Theorem). For any
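The commutation property for repeated face maps can be checked by hand on finite value spaces. The sketch below assumes the property in question is the standard simplicial identity $d_i d_j = d_{j-1} d_i$ for $i < j$, and it represents a measure as a dictionary from points to masses; the measure tau and the helper names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def face(measure, i):
    """d_i: marginalize a discrete measure (dict: tuple -> mass) over factor i."""
    out = defaultdict(Fraction)
    for point, mass in measure.items():
        out[point[:i] + point[i + 1:]] += mass
    return dict(out)

# Hypothetical measure on V([a0, a1, a2]) with small finite value spaces.
tau = {
    (0, "x", True): Fraction(1, 2),
    (0, "y", False): Fraction(1, 4),
    (1, "x", False): Fraction(1, 8),
    (1, "y", False): Fraction(1, 8),
}

def commutes(measure, n):
    """Check d_i d_j = d_{j-1} d_i for all 0 <= i < j <= n."""
    return all(
        face(face(measure, j), i) == face(face(measure, i), j - 1)
        for j in range(n + 1)
        for i in range(j)
    )
```

For finite sums the identity is just a reordering of summation, which is the discrete shadow of the Fubini-Tonelli argument.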
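Definition 2.11 (Degeneracy of Attribute List). The degeneracy map on attribute lists, $s_i : A_n \to A_{n+1}$, is defined as repetition of the $i$th entry $a_i$ in $T = [a_0, \ldots, a_i, \ldots, a_n]$, so $s_i T = [a_0, \ldots, a_i, a_i, \ldots, a_n]$.
Definition 2.13 (Degeneracy of Data Table). For a data table $(T, \tau) \in X_n$, let $s_i(\tau) \in M(s_i(T))$ be the measure evaluated on the basis sets $U_0 \times \cdots \times U_i \times U_i' \times \cdots \times U_n$ of the Borel algebra on $V([a_0, \ldots, a_i, a_i, \ldots, a_n]) = V(s_i T)$ as
$$(s_i \tau)(U_0 \times \cdots \times U_i \times U_i' \times \cdots \times U_n) := \tau(U_0 \times \cdots \times (U_i \cap U_i') \times \cdots \times U_n).$$
If the measure is expressed as a density function via the Radon-Nikodym theorem, then this corresponds to a Dirac delta concentrated on the diagonal of the repeated factor.
Theorem 2.14 (Simplicial Sets). Let $X$ be the ambient data complex over an attribute set $A$. For any $(T, \tau) \in X_n$, consider the face maps $d_i(T, \tau)$ and degeneracy maps $s_i(T, \tau)$ as in the definitions above. Then the simplicial identities hold:
$$d_i d_j = d_{j-1} d_i \ (i < j), \qquad s_i s_j = s_{j+1} s_i \ (i \leq j),$$
$$d_i s_j = s_{j-1} d_i \ (i < j), \qquad d_j s_j = \mathrm{id} = d_{j+1} s_j, \qquad d_i s_j = s_j d_{i-1} \ (i > j+1).$$
Proof. This is direct with no surprises, by working on the Borel basis sets $U_0 \times \cdots \times U_n$ for $V([a_0, \ldots, a_n])$. The $d_i d_j$ condition was already seen as Fubini-Tonelli.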
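The face and degeneracy maps on attribute lists are simple enough to test exhaustively. The following sketch assumes the identities in question are the standard simplicial identities and checks them on one attribute list; it also checks, for a finite measure, that repeating a factor along the diagonal and then marginalizing either copy returns the original measure. All names and the sample data are hypothetical.

```python
def d(i, T):
    """Face map: omit the i-th attribute."""
    return T[:i] + T[i + 1:]

def s(i, T):
    """Degeneracy map: repeat the i-th attribute."""
    return T[:i + 1] + T[i:]

def degeneracy(measure, i):
    """s_i on a discrete measure: mass moves to the diagonal of the repeated factor."""
    return {p[:i + 1] + p[i:]: m for p, m in measure.items()}

def face(measure, i):
    """d_i on a discrete measure: marginalize over the i-th factor."""
    out = {}
    for p, m in measure.items():
        key = p[:i] + p[i + 1:]
        out[key] = out.get(key, 0) + m
    return out

def identities_hold(T):
    """Check the mixed (d, s) and (s, s) simplicial identities on one list T."""
    n = len(T) - 1
    ok = True
    for j in range(n + 1):
        for i in range(n + 2):
            lhs = d(i, s(j, T))
            if i < j:
                ok &= lhs == s(j - 1, d(i, T))
            elif i in (j, j + 1):
                ok &= lhs == T
            else:
                ok &= lhs == s(j, d(i - 1, T))
    for j in range(n + 1):
        for i in range(j + 1):
            ok &= s(i, s(j, T)) == s(j + 1, s(i, T))
    return ok

T = ("a0", "a1", "a2")
tau = {("x", 0): 2, ("y", 1): 3}   # hypothetical measure on V([a0, a1])
```

The check on tau mirrors the identity $d_j s_j = \mathrm{id} = d_{j+1} s_j$ at the level of measures.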
2(c). Inclusions, Merges, and Joins. We now establish additional operations (inclusion, sum, merge, join) that are special to A and X and do not apply to general simplicial sets.
$\iota$ is one-to-one (implying $n' \leq n$). Although $\iota$ itself is a map of index sets, we use the compatibility property to overload notation and write $\iota : [a_0, \ldots, a_{n'}] \to [b_0, \ldots, b_n]$.

Remark 2.16 (Categorical Interpretation). An attribute inclusion is a morphism $T \to T'$ in the category A such that $T' = T \circ \iota = \iota^* T$ where $\iota : n' \to n$ is a monomorphism in $\Delta_a$. We overload notation (that is, omit the upper-star) and write $\iota : T' \to T$. The functor from $\Delta_a$ to A is contravariant, so attribute inclusions are actually epimorphisms in A; however, it is reasonable to call them "inclusions" because the $n'$-ordered multiset $T'(n')$ is an ordered subset of the $n$-ordered multiset $T(n)$. One could avoid this overloaded notation by working in the opposite category, but we decline to add another layer of notation since the meaning is always clear in context.
which can be summarized as ι = [0, 1, 3, 7, 8]. We can abbreviate this by decorating the entries in T that are included from T , The next lemma and corollary make clear that face maps and attribute inclusions are related tightly.
Corollary 2.21. For any attribute inclusion $\iota : T' \to T$, there is a sequence (footnote 7) of face maps $d_{j_0}, \ldots, d_{j_k}$ such that $d_{j_0} \cdots d_{j_k} T = T'$ and such that the attribute inclusion induced by the sequence of face maps is $\iota$. Moreover, any permutation of this sequence obtained by re-indexing the face maps according to Lemma 2.10 is equivalent. If $j_0 \leq \cdots \leq j_k$, then $j$ is the index function for $\iota^c$, the quotient inclusion.
Remark 2.22. In light of Theorem 2.14, Corollary 2.21 is a partial version of Lemma A.4, which says that face maps and degeneracy maps generate all the morphisms in a simplicial set. This is because the co-face and co-degeneracy maps in $\Delta_a$ generate all order-preserving maps.
Attribute inclusions provide surjections on value spaces and measures, according to the following "contravariant" definition.

Definition 2.23 (Reduction). Consider an attribute inclusion
Similarly, define the surjective function $\downarrow_\iota : M(T) \twoheadrightarrow M(T')$ by sequential application of face maps according to the previous corollary: for any $\tau \in M(T)$, let
$$\downarrow_\iota \tau := d_{j_0} \cdots d_{j_k} \tau.$$
That is, $\downarrow_\iota \tau$ is the measure on $V(T')$ obtained by marginalizing $\tau$ to remove the factors specified by $\iota^c$.
When the attribute inclusion ι : T → T is understood from context, we abuse notation τ , so we use this notation as shorthand for "the total integral of a measure." Definition 2.24 (Sum of Attribute Lists). Given attribute lists T 1 and T 2 in A, define T 1 ⊕ T 2 as the attribute list obtained by concatenating T 1 and T 2 .
Note that $T_1 \oplus T_2$ and $T_2 \oplus T_1$ are related by a permutation, which (excepting the identity permutation) does not correspond to a morphism in the categories A or $\Delta_a$. The concatenation process provides specific attribute inclusions $T_1 \to T_1 \oplus T_2$ and $T_2 \to T_1 \oplus T_2$. More generally, for an attribute inclusion $\iota : T' \to T$ as in Lemma 2.18, it is true that $T$ and $T' \oplus (T/\iota)$ are related by a permutation, because the concatenation provides inclusions $T' \to T$ and $(T/\iota) \to T$ that may not be the original $\iota$ and $\iota^c$. On the other hand, for any sum $T = T_1 \oplus T_2$, it is true that $T_2$ is the quotient of $T$ by the concatenation-induced inclusion of $T_1$, and vice versa.
, as not every measure on a product space is an elementary product of measures!
The concatenation is equipped with inclusions (2.5)
Definition 2.26 (Permutation Notation). Suppose that $T_{12}$, $T_1$, and $T_2$ are attribute lists such that $\varsigma(T_1 \oplus T_2) = T_{12}$ for a permutation $\varsigma$. Then $\iota = \varsigma|_{T_1} : T_1 \to T_{12}$ and $\iota^c = \varsigma|_{T_2} : T_2 \to T_{12}$ are complementary attribute inclusions. If the permutation or attribute inclusions are well known in context, then for any subsets correspond with respect to the $\varsigma$-permuted indices.
Because Lemma A.2 provides an ordered form of the inclusion-exclusion principle, we can define an indexed form of the inclusion-exclusion principle.
Definition 2.27 (Merge of Attribute Lists). Suppose $T_0, T_{01}, T_{02} \in X$, and that $\iota_{01} : T_0 \to T_{01}$ and $\iota_{02} : T_0 \to T_{02}$ are attribute inclusions. Define $\mathrm{Merge}(T_{01}, T_{02}, \iota_{01} \sim \iota_{02})$ as the attribute list obtained by performing the index merge specified by Figure A(a) as in Lemma A.2; this merge concatenates sublists spliced between the entries aligned by $\iota_{01} \sim \iota_{02}$. Writing $T_{012}$ for $\mathrm{Merge}(T_{01}, T_{02}, \iota_{01} \sim \iota_{02})$, Diagram A.1 becomes a diagram of attribute inclusions. (2.6)
Note that the choice of ordering in Definition 2.27 and Figure A(a) is partially arbitrary. In particular, one may draw an equivalent diagram with any choice of interleaving pattern, as long as the $T_0$ entries remain fixed. However, this choice is irrelevant, as the theory developed in Section 3 will encompass all allowable permutations. Regarding the permutation notation introduced earlier, for any Borel sets , since the definition and algorithm give a well-defined permutation. This $\times$ notation is required in Theorem 3.11 and elsewhere.
Our choice of ordering in Merge() provides that the trivial merge Merge(T 01 , T 02 , []) = T 01 ⊕ T 02 is the sum from Definition 2.24.
As with Definition 2.24, the attribute list Merge(T 01 , T 02 , ι 01 ∼ ι 02 ) is well-defined regardless of the preference of T 01 versus T 02 and regardless of the indices specified by ι 01 and ι 02 . This list is identical to the list obtained by constructing the sum T 01 ⊕ T 02 then applying face maps to remove the image of d ι 02 (i) for each i indexing T 0 . But, again beware that the partitioned merge-sort construction equips T 012 with specific attribute inclusions T 01 → T 012 and T 02 → T 012 such that the composed attribute inclusion T 0 → T 012 is well-defined through both compositions. In general, these inclusions are not the same as the inclusions obtained through the "sum and face" construction. Lemma 2.29 (Decomposition of Merged Lists). Suppose T 0 , T 01 , T 02 ∈ X , and that ι 01 : T 0 → T 01 and ι 02 : T 0 → T 02 are attribute inclusions. Let T 012 denote Merge(T 01 , T 02 , ι 01 ∼ ι 02 ). Let ι c 01 : T 1 → T 01 and ι c 02 : T 2 → T 02 denote the complements of these inclusions, so T 1 := T 01 /ι 01 and T 2 := T 02 /ι 02 .
2(d). Data Sections. Because the forgetful map p : X → A acts like a projection, it allows a notion of section.
Remark 2.31. In Section 4, data sections will be specified as σ : A n → X n , on a single level of the simplicial-set grading, where the other levels are inferred by the face and degeneracy maps. This omits all nondegenerate elements of level n + 1, so is interpreted as a section on the n-skeleton.
The following definition captures a condition describing data subcomplexes that are "as compatible as possible."
Definition 2.32 (Well-Aligned). A data subcomplex $X$ of an ambient $X$ is called well-aligned if: for all $(T_{01}, \tau_{01}), (T_{02}, \tau_{02}) \in X$ and all $T_0$ with attribute inclusions $\iota_{01} : T_0 \to T_{01}$ and $\iota_{02} : T_0 \to T_{02}$, the reductions agree, $\downarrow_{\iota_{01}} \tau_{01} = \downarrow_{\iota_{02}} \tau_{02}$.
The next lemma shows that well-aligned data subcomplexes in this theory play the role of "submanifolds transverse to the fiber" from classical bundle theory and of "holonomic submanifolds" in geometric PDE theory. That is, they represent local sections.
Lemma 2.33. Suppose that p : X → A is a data subcomplex of an ambient data complex p : X → A such that X contains a nontrivial data table. The following are equivalent: (1) X is well-aligned.
(2) There is a data section σ : A → X such that σ(A ) = X .
(3) p : X → A is a simple cover via the isomorphism p.
Proof. (2) implies (1): Note that well-alignedness is implied by the commutation of σ with the face maps.
(1) implies (2): The case of T 0 = [] implies that all data tables in X have the same mass, M , which is non-zero since X contains at least one non-trivial table. The case of T 01 = T 02 = T 0 = T implies that each T ∈ A admits exactly one (T, τ ) ∈ X .
It is immediate that (2) and (3) are equivalent.
Remark 2.34. A database engineer would appreciate a database system that could be described as a well-aligned data subcomplex, because for each list of columns present within any combination of the given tables, there is only one possible table; that is, for each T there is exactly one (T, τ ). Compare the well-aligned condition to the space of joins, Defn 3.5.
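For finitely many finite tables, the condition that each list of columns admits only one possible table can be tested directly on the generating tables. The sketch below is an illustration under simplifying assumptions that are not from the paper: attribute lists are nondegenerate, attributes appear in one fixed global order in every list, measures are finite dictionaries, and only the generating tables (not the full subcomplex) are compared. The names reduce_to and is_well_aligned and the sample tables are hypothetical.

```python
from itertools import combinations

def reduce_to(attrs, measure, keep):
    """Marginalize a discrete measure onto the attributes in `keep`,
    preserving the order in which they appear in `attrs`."""
    idx = [k for k, a in enumerate(attrs) if a in keep]
    out = {}
    for point, mass in measure.items():
        key = tuple(point[k] for k in idx)
        out[key] = out.get(key, 0) + mass
    return out

def is_well_aligned(tables):
    """tables: list of (attrs, measure) pairs. Every pair must agree after
    reduction to its shared attributes (an empty overlap compares total mass)."""
    for (a1, m1), (a2, m2) in combinations(tables, 2):
        shared = set(a1) & set(a2)
        if reduce_to(a1, m1, shared) != reduce_to(a2, m2, shared):
            return False
    return True

# Hypothetical tables sharing the attribute "id".
t01 = (("id", "dept"), {(1, "ops"): 1, (2, "dev"): 1})
t02 = (("id", "city"), {(1, "Oslo"): 1, (2, "Lima"): 1})
t02_bad = (("id", "city"), {(1, "Oslo"): 2})
```

Here t01 and t02 agree on the shared column, while t02_bad has the same total mass but a different marginal on "id", so the pair (t01, t02_bad) fails the test.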
8 Natural means that it respects the face and degeneracy maps, as in (A.2) and Lemma A.4.
Note too that well-aligned implies finitely generated. (Not every finitely-generated data subcomplex is well-aligned, as it could have multiple data tables over the same attribute lists.) Moreover, if X is well-aligned (and contains a nontrivial data table), then all data tables can be re-scaled by their shared mass M to yield probability measures.
Lemma 2.36. If X is well-aligned and A is connected, then X is path-connected.
With the language of simplicial sets, we can now re-state our original motivating questions. The remaining sections of this document construct a precise way to answer these questions, and ensure that the notion of "distance" is well-defined. An appropriate notion of distance appears in Defn 4.1. When all the definitions and lemmas are in place, these problems are answered by the Obstruction Cocycle in Defn 4.8.
Problem 2.37 (Testing Problem, bis). Consider a data subcomplex $p : S \to B$ of an ambient $p : X \to A$. Given a data section $\sigma^+ : A_{n+1} \to X_{n+1}$ of the form $\sigma^+ :$
Problem 2.38 (Merging and Meta-Merging Problems, bis). Consider a data subcomplex $p : S \to B$ of an ambient $p : X \to A$. Suppose that there is a simplicial map $\sigma : B_n \to S_n$ of the form $\sigma : T \mapsto (T, \sigma)$. Does there exist an extension $\sigma^+ : A_{n+1} \to X_{n+1}$ of $\sigma$, meaning $\partial \sigma^+(T^+) = \sigma(\partial T^+)$ for all $T^+ \in A_{n+1}$ such that $\partial T^+ \in B_n$? If not, what is the minimal distance that would allow an approximate extension?
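A small example shows how an extension can fail even when all pairwise tables agree on their single-attribute marginals. The sketch below is not the obstruction cocycle of the paper; it is only a necessary support test, assuming three binary attributes and hypothetical pairwise tables. Any measure on the triple product that marginalizes to the given pairs must be supported on points whose three projections each carry positive mass, so an empty allowed support rules out an exact extension.

```python
from itertools import product

def allowed_support(values, pairwise):
    """Points of values^3 whose projection to each pair (i, j) has positive mass."""
    return [
        x for x in product(values, repeat=3)
        if all(table.get((x[i], x[j]), 0) > 0
               for (i, j), table in pairwise.items())
    ]

anti = {(0, 1): 0.5, (1, 0): 0.5}    # perfectly anti-correlated pair
equal = {(0, 0): 0.5, (1, 1): 0.5}   # perfectly correlated pair

# Keys (i, j) name the two attribute positions covered by each table.
inconsistent = {(0, 1): anti, (1, 2): anti, (0, 2): anti}
consistent = {(0, 1): anti, (1, 2): anti, (0, 2): equal}
```

In the inconsistent case every single-attribute marginal is uniform, so the tables look compatible locally, yet no joint measure exists: three binary values cannot be pairwise distinct. This is the kind of situation in which one would ask for the minimal distance to an approximate extension.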
2(e). Homology. We use the traditional definition of chains, summarized here to fix notation.
where negative coefficients indicate formally reversed orientation. We define $(-1)$-graded chains as elements of the 1-dimensional $R$-module, where negative coefficients indicate formally reversed orientation. We can also define $(-1)$-graded chains as $C_{-1}(X, R)$, the $R$-module generated by $M([]) = \mathbb{R}_{\geq 0}$. Moreover, for any $(T, \tau) \in X_k$, note that $r(T, \tau)$ and $(T, r\tau)$ are formally distinct unless $r = 1_R$; hence, the graded module $C_\bullet(S, R)$ is very large, especially if $V(a)$ is infinite for any $a \in A$. For $k \geq 0$, define the usual simplicial boundary operator $\partial : C_k(X, R) \to C_{k-1}(X, R)$ on generators as
$$\partial (T, \tau) := \sum_{i=0}^{k} (-1)^i \, d_i(T, \tau), \qquad (2.10)$$
extended by linearity.
The next lemma is easy, but important; it means the usual notions of cycle/closed and boundary/exact apply to chains in $X$.
Proof. Suppose that $T = [a_0, a_1, \ldots, a_k]$ and that $\tau \in M(T)$. Recall that $V(T)$ is a product of the attributes' measurable spaces, and by our definitions, the measure $(V(T), \tau)$ is finite, therefore $\sigma$-finite, so the Fubini-Tonelli theorem holds. In particular, Corollary 2.21 shows that the reduction $\downarrow_{[a_0, \ldots, \hat{a}_i, \ldots, \hat{a}_j, \ldots, a_k]}$ is symmetric in $i$ and $j$, so because the double sum is alternating, all terms cancel.
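The cancellation in this proof can be observed on a finite example. The sketch below assumes the boundary operator is the usual alternating sum of faces and represents a chain as a dictionary from data tables to integer coefficients, with each measure frozen into a hashable set of (point, mass) pairs. The attribute names and the measure are hypothetical.

```python
from collections import defaultdict

def face(T, tau, i):
    """d_i(T, tau): omit attribute i and marginalize tau over that factor."""
    out = defaultdict(int)
    for point, mass in tau:
        out[point[:i] + point[i + 1:]] += mass
    return T[:i] + T[i + 1:], frozenset(out.items())

def boundary(chain):
    """Alternating sum of faces, extended linearly; zero coefficients are dropped."""
    out = defaultdict(int)
    for (T, tau), coeff in chain.items():
        for i in range(len(T)):
            out[face(T, tau, i)] += (-1) ** i * coeff
    return {cell: c for cell, c in out.items() if c != 0}

# Hypothetical data table on [a0, a1, a2] with integer masses.
tau = frozenset({(0, 0, 1): 2, (0, 1, 1): 1, (1, 0, 0): 3}.items())
chain = {(("a0", "a1", "a2"), tau): 1}
```

Applying boundary twice returns the empty chain, because each doubly marginalized table appears twice with opposite signs.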
Define the projection p : C k (X , R) → C k (A, R) by p(T, τ ) := T and extending by linearity.
Suppose that T = [a 0 , a 1 , . . . , a k ] and that τ ∈ M(T ). Then Similarly, the chain modules and homology are well-defined for any data subcomplex p : X → A of an ambient p : X → A. We are particularly interested in the case R = Z/2Z, so that a chain C • (A, Z/2Z) (respectively C • (X , Z/2Z)) is interpreted as a set of attribute lists (respectively, data tables), without any consideration for multiplicity or orientation. It is therefore sensible to apply the condition well-aligned to a chain (Y, ψ) ∈ C n (X , Z/2Z), so that a well-aligned chain in (Y, ψ) ∈ C n (X , Z/2Z) can be interpreted equivalently to a section σ : p(X ) → X , where X is the data subcomplex generated by the elements of ψ.

Homotopy as Joins
In the previous sections, we established that a data complex is equipped with simplicial homology, and framed data complexes as simplicial sets. This section contains several payoffs for that effort. First, Section 3(a) builds to Theorem 3.11, our first key result, which shows that the simplicial set language enables a connection between our framework and standard database engineering; later results show that the framework enables further insights into data merging problems that transcend standard database engineering. Then, Section 3(b) explores the simplicial homotopy of data complexes and reframes Problems 2.37 and 2.38 in the language of obstruction theory for simplicial sets, as in [9,11,16,8,6].
3(a). Database Joins and the Kan Condition. Recall these three standard definitions from the well-established theory of simplicial sets, as in Appendix A and [9,11,16,8,6].
Like everyone else, we abuse notation slightly by referring to both $\Delta^n \to X$ (which is an infinite collection of sets) and $n \to x \in X_n$ (which is the generator of that collection) as "an n-simplex in X."
Note! The (categorical) n-simplex $\Delta^n$ is not the same as the (topological) n-simplex $|\Delta^n|$. The former is an infinite set of formal objects in the simplex category; it has no notion of "interior" or "continuity." The latter is a compact topological space defined via convex linear combinations in $\mathbb{R}^{n+1}$. There is a relationship between their respective categories called realization, as discussed in [
(Tables as Simplices). A data subcomplex $X$ in an ambient $X$ is an (augmented) simplicial set by Theorem 2.14. Thus, a data table $(T, \tau)$ with $T = [a_0, \ldots, a_n]$ can be seen as (the generator of) an n-simplex, which includes its faces $d_0(T, \tau), \ldots, d_n(T, \tau)$ and degeneracies $s_0(T, \tau), \ldots, s_n(T, \tau)$, and so on. The $n+1$ "vertices" are (generated by) the single-attribute data tables $(T_0, \tau_0), \ldots, (T_n, \tau_n)$ obtained by applying sequences of $n$ face maps. For $m \leq n$, the m-simplices within $(T, \tau)$ are (generated by) the data tables obtained by applying sequences of face maps and degeneracy maps until the result has $m+1$ attributes. The picture of "two simplices that share a boundary component" is realized in $X$ as a pair of data tables $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$ and attribute inclusions $\iota_{01} : T_0 \to T_{01}$ and $\iota_{02} : T_0 \to T_{02}$ such that there is a data table $(T_0, \tau_0)$ with $\downarrow_{\iota_{01}} \tau_{01} = \tau_0 = \downarrow_{\iota_{02}} \tau_{02}$. If $\mathrm{len}(T_{01}) = \mathrm{len}(T_{02}) = 2$ and $\mathrm{len}(T_0) = 1$, then this information generates a 2-horn. A completion of the 2-horn to a 2-simplex would be (generated by) a data table $(T_{012}, \tau_{012})$ that has $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$ as two of its three faces.
Depending on the available simplices in X , it may or may not be possible to find such (T 012 , τ 012 ).

Definition 3.4 (Kan Condition).
A simplicial set X is said to satisfy the Kan condition iff any map from a horn $\Lambda^n_k \to X$ extends to a compatible map from the simplex, $\Lambda^n_k \to \Delta^n \to X$.

The Kan condition means that the simplicial set is closed under simplicial deformation, so it has a well-defined homotopy group. The Kan condition is not specific to data complexes; it is a definition for general simplicial sets, and gives the appropriate notion of fibrant for many model categories. For our purposes, we require a slight variation on the Kan condition to provide an adequate notion of fibrant data contexts, which we now develop as Defn 3.13.
Similarly to Defn 2.27, we write this set as $\mathrm{Joins}(\tau_{01}, \tau_{02}, T_0)$ for notational convenience when the attribute inclusions are understood from context.

Note! This is not the same notion of "join" that one sees in traditional topology, or in categorical references such as [5,17] and https://ncatlab.org/nlab/show/join+of+simplicial+sets. It is not yet clear whether there is a useful relationship to joins in ergodic theory [7]. We choose the term "join" to mimic the terminology in database engineering discussed in Section 1. Definition 3.5 reminds one of couplings from statistics, as in [3]; however, the generality here allows repetition and distinct measures and overlaps.

Definition 3.6 (Join Conditions). A data subcomplex p : X → A of an ambient p : X → A is said to satisfy the weak join condition iff, for any $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02}) \in X$ with attribute inclusions $T_0 \to T_{01}$ and $T_0 \to T_{02}$ and $\downarrow_{T_0} \tau_{01} = \downarrow_{T_0} \tau_{02}$, the set $\mathrm{Joins}(\tau_{01}, \tau_{02}, T_0) \cap X$ is nonempty. It satisfies the strong join condition iff $\mathrm{Joins}(\tau_{01}, \tau_{02}, T_0) \subset X$.
The weak join condition means that the simplicial set admits some database JOIN operation between any well-aligned pair of data tables. The strong join condition requires that all possible joins exist in X .
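To make the join conditions concrete, the following is a minimal sketch for finite value spaces, where a measure is stored as a dictionary from value tuples to masses. It builds one particular element of $\mathrm{Joins}(\tau_{01}, \tau_{02}, T_0)$ for $T_{01} = [a_0, a_1]$, $T_{02} = [a_0, a_2]$, $T_0 = [a_0]$ by making $a_1$ and $a_2$ conditionally independent given $a_0$. The function names `marginal` and `conditional_join` and the dictionary representation are illustrative assumptions, not constructions from this paper, and this join is only one of typically many.

```python
from collections import defaultdict
from fractions import Fraction


def marginal(table, keep):
    """Marginalize a finite measure {value tuple: mass} onto the positions in `keep`."""
    out = defaultdict(Fraction)
    for point, mass in table.items():
        out[tuple(point[i] for i in keep)] += mass
    return dict(out)


def conditional_join(tau01, tau02):
    """Return one join of tau01 on [a0, a1] and tau02 on [a0, a2] over T0 = [a0].

    Illustrative choice: tau012(x0, x1, x2) = tau01(x0, x1) * tau02(x0, x2) / tau0(x0).
    """
    tau0 = marginal(tau01, [0])
    if tau0 != marginal(tau02, [0]):
        raise ValueError("tables are not well-aligned on the shared attribute")
    joined = {}
    for (x0, x1), m1 in tau01.items():
        for (y0, x2), m2 in tau02.items():
            if x0 == y0 and tau0[(x0,)] > 0:
                joined[(x0, x1, x2)] = m1 * m2 / tau0[(x0,)]
    return joined


# Two small tables that agree on the shared attribute a0.
tau01 = {("u", 0): Fraction(1, 4), ("u", 1): Fraction(1, 4), ("v", 0): Fraction(1, 2)}
tau02 = {("u", "x"): Fraction(1, 2), ("v", "x"): Fraction(1, 4), ("v", "y"): Fraction(1, 4)}
tau012 = conditional_join(tau01, tau02)

# The join marginalizes back to both original tables.
print(marginal(tau012, [0, 1]) == tau01, marginal(tau012, [0, 2]) == tau02)
```

In this language, a subcomplex satisfying the weak join condition must contain at least one such table for every well-aligned pair, while the strong join condition demands every table with these two marginals.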
Definition 3.8 (Closure under Independent Products). A data subcomplex p : X → A of an ambient p : X → A is said to be closed under independent products iff, for every $(T_1, \tau_1), (T_2, \tau_2) \in X$ and […]

Remark 3.9. The independent product is an example of a trivial join. If a data subcomplex is closed under independent products, then it also includes all IID measures built from its various data tables; this property is important for applications to statistics.
Lemma 3.10. If a data subcomplex p : X → A of an ambient p : X → A satisfies the strong join condition, then p : X → A is closed under permutations.
Proof. Because A is finite, it suffices to prove that X is closed under permutations that are swaps (that is, transpositions or 2-cycles). Moreover, it suffices to consider only swaps of adjacent entries, as any swap i ↔ j can be written by migration of j past i, then i to the original position of j.
(We would not be surprised if the Kan condition and the strong join condition are equivalent, under some reasonable assumptions, but we have not pursued that claim.)

Proof of (2). Suppose that X satisfies the Kan condition and admits trivial joins. Admission of trivial joins provides the weak join condition in the case $T_0 = []$. Suppose that $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$ are elements of X. Suppose that there are inclusions $\iota_{01} : T_0 \to T_{01}$ and $\iota_{02} : T_0 \to T_{02}$ for some $T_0$, and suppose that $\downarrow_{T_0} \tau_{01} = \downarrow_{T_0} \tau_{02}$. Each of $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$ and $(T_0, \tau_0)$ provides all faces of all lower dimensions.

We prove the existence of $(T_{012}, \tau_{012}) \in \mathrm{Joins}(\tau_{01}, \tau_{02}, T_0)$ by induction on the dimension. For simplicity, we use the language of simplicial sets, instead of the language of measures. Recall that a "vertex" is a data table obtained by marginalizing to a single attribute, and an n-simplex is a data table obtained by marginalizing to n+1 attributes, as in Remark 3.3. Fix a preferred vertex k in $(T_0, \tau_0)$. For any vertex i in $(T_{01}, \tau_{01})$ and any vertex j in $(T_{02}, \tau_{02})$, the 1-simplices [i, k] and [k, j] exist a priori (up to notational ordering). This is an example of a horn $\Lambda^2_k$. Therefore, by the Kan condition, the 2-simplex [i, k, j] exists in X. Hence, every 2-face including k and vertices in $(T_{01}, \tau_{01})$ or $(T_{02}, \tau_{02})$ exists in X. Assume for induction that every n-face containing vertex k exists. Any n of those n-faces form a horn $\Lambda^{n+1}_k$, so their (n+1)-face exists in X. So, every (n+1)-face containing vertex k exists in X. Therefore, there is a data table $(T_{012}, \tau_{012})$ that involves all vertices in $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$.
Proof of (1). We prove part (1) under the notable assumption that $V(a)$ is a compact metric space for all $a \in A$. Hence, for any attribute list T, the space of measures $M(T)$ includes a uniform probability measure $\kappa_T$ (see footnote 10).
Suppose that X satisfies the strong join condition. The case T 0 = [] implies admission of trivial joins.
In this proof, we assume that k = 0 is the common vertex in a horn $\Lambda^n_k$, but that is only for notational simplicity; the proof certainly applies for any other specified vertex k, by appropriate re-ordering. Consider data tables giving a horn $\Lambda^n_0$. These data tables are of the form $(T_m, \tau_m)$ for $1 \le m \le n$, where $T_m = [a_0, \dots, a_{m-1}, a_{m+1}, \dots, a_n]$. Let $T = [a_0, \dots, a_n]$. As a horn, these data tables are well-aligned; that is, they match on all corresponding faces according to $d_i \tau_j = d_{j-1} \tau_i$ for $i < j$, as noted after Definition A.7. In particular, all these data tables share the same total mass, M. To establish the Kan condition, we construct a compatible n-simplex; that is, a data table $(T, \tau)$ such that $d_m (T, \tau) = (T_m, \tau_m)$ for $1 \le m \le n$.
For $K = V(T_{0 \cdots (n-2)})$, a compact set, consider the measures […] Consider any trivial join $t_K \in \mathrm{Joins}(\mu_{K,n-1}, \mu_{K,n}, []) \subset M(T_{n-1} \oplus T_n) = M(\mathrm{Merge}(T_{n-1}, T_n, []))$; that is, $\downarrow_{T_{n-1}} t_K (U_{n-1}) = \tau_{n-1}(K \times U_{n-1})$ and $\downarrow_{T_n} t_K (U_n) = \tau_{n-1}(K \times U_n)$. Note that […]

Fix any open $W \subset K$ such that $\kappa_{T_{0 \cdots (n-2)}}(W) = \tfrac{1}{2} \kappa_{T_{0 \cdots (n-2)}}(K)$ (see footnote 11). Note that the measures […] That is, $\downarrow_{T_{n-1}} t_W (U_{n-1}) = \tau_{n-1}(W \times U_{n-1})$ and $\downarrow_{T_n} t_W (U_n) = \tau_{n-1}(W \times U_n)$. Note that […] Further, by their definitions via trivial joins from $W \subset K$, one can choose $t_W$ to guarantee […]

Likewise, for the closed set $K - W$, define $t_{K-W} := t_K - t_W$, which is also a measure in $M(T_{n-1} \oplus T_n)$ by construction. Note that both the closure $\overline{W}$ and the complement $K - W$ are closed in K, therefore both are compact. Replacing K with $\overline{W}$ or $K - W$ in (3.2) means that we can establish measures $\{t_{W_\lambda}\}_{\lambda \in \Lambda}$ for a countable bifurcating […]

Footnote 10: That is, $\kappa_T(B_r(x))$ depends only on r, for metric balls $B_r(x)$ of sufficiently small radius.
Footnote 11: Of course the value of $\tfrac{1}{2}$ is not special, but aesthetic. Any $0 < \kappa(W) < \kappa(K)$ will do.
Assume for induction that for some m satisfying $1 < m \le n - 1$ there exists a data table $(T, \tau^{(m)}) \in X$ such that $d_k \tau^{(m)} = \downarrow_{T_k} \tau^{(m)} = \tau_k \in X$ for all $m \le k \le n$. Denote the "error" of the $d_{m-1}$ face as […] The error $\varepsilon_{m-1}$ is a signed measure, not a measure, on $V(T_{m-1})$, but the face operation of marginalization is still sensible. Then for $m \le k \le n - 1$, […]

Also, for application below, consider the pre-measure f on Borel sets […] Observe the inequality […] which follows because for all Borel sets $W \times Z \subset V(T_{n-1})$ satisfying $\varepsilon_{m-1}(W \times Z) > 0$, we have […]

Let $\rho_{m-1} \in M(T_{m-1})$ be a probability measure satisfying the condition […] for all Borel $U_{m-1} \subseteq V(T_{m-1})$. Such probability measures are guaranteed to exist by (3.10). Then, define […] for any Borel $W \times U_{m-1} \times Z \subseteq V(T)$ (see footnote 12), and extend by additivity. By construction, $\tau^{(m-1)}$ is additive and zero-null. Non-negativity follows from (3.12) and the definition of f in (3.9); therefore, $\tau^{(m-1)}$ is a measure on $V(T)$.
Moreover, $\tau^{(m-1)}$ satisfies the desired marginalizations, shown here: […] (3.14)

And, for $m \le k \le n$, the properties (3.8) apply to give […] Therefore, $(T, \tau^{(m-1)})$ is a data table that has the desired faces $d_{m-1}$ through $d_n$. The inductive step is established. The ultimate data table $(T, \tau^{(1)})$ provides the n-simplex $\Delta^n$ completing $\Lambda^n_0$.

[…]

See Appendix A for a categorical version of this definition. The entire raison d'être of fibrant objects is that they admit homotopy, as proven by [9] and [16]. In the category of simplicial sets, the term fibrant refers only to the Kan extension condition. Our practical desire to use joins as a weak-equivalence compels us to require the strong join condition. By Theorem 3.11(1) the traditional definition and all of its consequences are implied.
Corollary 3.14. Suppose a data subcomplex X of an ambient X is fibrant, and fix a basepoint data table $(T_0, \tau_0)$. The homotopy group $\pi_n(X, \tau_0)$ is well-defined for all n, and satisfies the typical properties of homotopy categories over model categories.

[…]

Proof. The very definition of an ambient X is that it includes all finite measures over the relevant metric spaces, so it includes the set Joins() in particular.
We now want to explore how a data subcomplex $p : \mathcal{S} \to B$ of an ambient p : X → A interacts with any other attribute list $T \in A$. The following sets are of interest. […]

A data subcomplex $p : \mathcal{S} \to B$ may not be fibrant, so we define a convenient fibrant space that contains it. The notation $F^0$ is meant to be suggestive; in Section 4, a larger filtration of simplicial sets is created (Definition 4.5) by turning the equality in the definition below into an inequality involving Wasserstein distance.

Definition 3.17 (Complex of Perfect Joins). For any data subcomplex $\mathcal{S}$ of an ambient X, let $F^0$ denote the subset of X defined by $(T, \tau) \in F^0$ if and only if $\forall a \in T$, $\exists (S, \sigma) \in \mathcal{S}|_T$ such that $a \in S$ and $\downarrow_S \tau = \sigma$.
Note: the quantifier "∀a ∈ T " refers to each entry in the attribute list, which means repeated entries must have corresponding measures.
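The membership test in Definition 3.17 can be sketched directly for finite value spaces. In the sketch below, a data table is a pair `(attrs, measure)` where `attrs` is a tuple of attribute names and `measure` maps value tuples to masses; the names `restrict` and `in_perfect_joins` are hypothetical. As a simplifying assumption (not made in the paper), attribute lists contain no repeated entries, so an attribute inclusion is determined by attribute names.

```python
from collections import defaultdict
from fractions import Fraction


def restrict(attrs, measure, sub_attrs):
    """Marginalize `measure` over `attrs` down to the attributes in `sub_attrs`."""
    positions = [attrs.index(a) for a in sub_attrs]
    out = defaultdict(Fraction)
    for point, mass in measure.items():
        out[tuple(point[i] for i in positions)] += mass
    return dict(out)


def in_perfect_joins(table, sources):
    """Return True iff every attribute of `table` lies in some source table
    whose measure equals the corresponding marginal of `table`."""
    attrs, measure = table
    for a in attrs:
        covered = any(
            a in s_attrs
            and set(s_attrs) <= set(attrs)
            and restrict(attrs, measure, s_attrs) == s_measure
            for s_attrs, s_measure in sources
        )
        if not covered:
            return False
    return True


half = Fraction(1, 2)
sources = [
    (("a", "b"), {(0, 0): half, (1, 1): half}),
    (("c",), {(0,): half, (1,): half}),
]
good = (("a", "b", "c"), {(0, 0, 0): half, (1, 1, 1): half})
bad = (("a", "b", "c"), {(0, 1, 0): half, (1, 0, 1): half})

print(in_perfect_joins(good, sources), in_perfect_joins(bad, sources))
```

Here `good` reproduces both source tables under marginalization, while `bad` has the right marginal on `c` but no source table matching its marginal on `a` or `b`.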
The definition of $F^0$ is a convenient way to say "consider everything that can be generated from $\mathcal{S}$ using Joins()," as justified by the following lemma. Similarly, the upcoming Definition 4.5 of $F^t$ gives a convenient way of saying "consider everything that can be approximated to an acceptable level of uncertainty from $\mathcal{S}$ using Joins()."

[…]

Proof. Suppose $(T, \tau) \in F^0$. Let $a_0 \in T$ denote the first attribute of T. By the definition of $F^0$, there exists $(S_0, \sigma_0) \in \mathcal{S}$ with an attribute inclusion $\iota_0 : S_0 \to T$ such that $\downarrow_{\iota_0} \tau = \sigma_0$ and such that $a_0$ is in the image of $\iota_0$. Let $(T_0, \tau_0) = (S_0, \sigma_0)$. By reducing $T_0$ if necessary, we may ensure that $\iota_0(T_0)$ is contiguous within T. If $T_0 = T$, then the sequence is complete. Otherwise, there exists some first attribute $a_1$ in $T/\iota_0$. By the definition of $F^0$, there exists $(S_1, \sigma_1) \in \mathcal{S}$ with an attribute inclusion $\iota_1 : S_1 \to T$ such that $\downarrow_{\iota_1} \tau = \sigma_1$ and such that $a_1$ is in the image of $\iota_1$. By reducing $S_1$ if necessary, we may ensure that $\iota_1(S_1)$ is contiguous within T, and that $T_0 \cap S_1$ is also contiguous. With these reductions, the orderings are consistent such that $T_1 := \mathrm{Merge}(T_0, S_1, T_0 \cap S_1)$ is equipped with a list inclusion $T_1 \to T$. Because $\tau$ is given, let $\tau_1 = \downarrow_{T_1} \tau$, which by construction is an element of $\mathrm{Joins}(\tau_0, \sigma_1, T_0 \cap S_1)$. Repeat this process until all elements $a_i$ of T are in the image of some inclusion $T_i \to T$.
For the converse, note that each a ∈ T is included in some S i , which is sufficient.
Corollary 3.19. F 0 includes all independent products formed from data tables in S.
Lemma 3.20. For any data subcomplex S of an ambient X , the complex of perfect joins F 0 is fibrant.
Proof. The data subcomplex $\mathcal{S}$ is closed under face maps and degeneracy maps, so application of those maps to all $(S, \sigma)$ in the definition shows that $F^0$ is closed under the face maps and degeneracy maps as well. To verify that $F^0$ is fibrant, suppose that $(T_{012}, \tau_{012}) \in X$ is any join of $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$ in $F^0$. Because every $a \in T_{012}$ appears in $T_{01}$ or $T_{02}$, the existence of $(S, \sigma) \in \mathcal{S}$ is inherited from $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$.
We conclude this section by tying simplicial homotopy theory to Problem 2.38.
Lemma 3.21. Suppose X is a fibrant data subcomplex of an ambient X. A basepoint-preserving simplicial map $f : \partial\Delta^n \to X$ defines a class $\alpha(f) \in \pi_{n-1}(X)$. Moreover, $\alpha(f) = e$ if and only if f admits an extension $f^+ : \Delta^n \to X$.
Proof. The first claim reduces to Lemma 9.6 in [6]. The second claim reduces to Lemma 7.4 in [8]. Our definition of fibrant implies path-connectedness, so a spanning tree can be used for locality such as in [9].
Corollary 3.22. Suppose that p : S → B is a data subcomplex of an ambient p : X → A such that $B_{n-1} = A_{n-1}$ for some $n \ge 1$. Fix a simplicial section $\sigma : B_{n-1} \to S_{n-1}$. The following are equivalent (omitting basepoints for brevity).
(1) For every composition […] we have $\alpha(\iota \circ \sigma \circ c) = e \in \pi_{n-1}(F^0)$.
(2) $\sigma$ admits an extension of the form $\sigma^+ : A_n \to F^0_n$.

Proof. Because $A_{n-1} = B_{n-1}$, the boundary of every n-simplex in A appears in B. Apply the previous lemma for each $f = \iota \circ \sigma \circ c$ as a map $f : \partial\Delta^n \to X$ for $X = F^0$.

This corollary is revisited as Lemma 4.9. The corollary fails when no such extension can be found. Then, the question remains: how to measure the failure of this corollary? That measurement is the purpose of filtered obstruction theory.

Filtrations and Obstructions
This section concludes the theoretical framework outlined in Section 1(a). Section 4(a) introduces a filtration from a data subcomplex S to its ambient X using the Wasserstein distance. Each level of the filtration is fibrant, which allows one to define an obstruction cocycle (Section 4(b)) at each level of the filtration. Eventually, for a high enough level in the filtration, the obstruction cocycle becomes trivial, so the importance of the obstruction cocycle can be measured using topological persistence. This statement is formalized in Theorem 4.13, which can be seen as the main payoff of our theoretical development in terms of database engineering. As promised in the introduction, the theory of data complexes does not just mathematize the notion of table merging; rather, it provides further powerful operations when traditional merging is impossible.

4(a). Filtrations from Data Subcomplexes. A general notion of persistence on simplicial sets appears in [15]. In summary, a fibrant filtration of simplicial sets is a bi-graded collection of sets $\{F^t_n\}$ for $0 \le t \le \infty$ and $n \in \mathbb{N}$ equipped with maps $d_i$ and $s_i$ such that
(1) $(F^t, d_i, s_i)$ is a simplicial set for each t, […]
The fibrant condition implies that $\pi_n(F^t)$ is well-defined for all t, and the inclusion maps $F^s \to F^t$ induce maps on homotopy, $\pi_n(F^s) \to \pi_n(F^t)$.
We now define a specific filtration for a data subcomplex that is designed to meet our application regarding joining data tables. Recall that (V(a), ρ a ) is a Radon space for each attribute a.
The reductions ↓ 1 and ↓ 2 refer to the two copies of the attribute a.
For any $T \in A$ and $\tau_1, \tau_2 \in M(T)$, let […] (4.2) The reductions $\downarrow_1$ and $\downarrow_2$ refer to the two interwoven copies of the attribute list T.

Remark 4.2. Recall that $\rho_T(x_1, x_2) = \max_{a \in T} \rho_a(x_{1,a}, x_{2,a})$, the $L^\infty$-metric obtained from the individual attribute metrics. Also, in the special case that $\downarrow_{[]} \tau_1 = \downarrow_{[]} \tau_2$, the infimum argument $\mu$ lies in the space of trivial joins, $\mathrm{Joins}(\tau_1, \tau_2, [])$, so the Wasserstein distance is tied to our notion of fibrant data complexes.
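As a small illustration of the product metric recalled in Remark 4.2, the following sketch builds $\rho_T$ as the maximum of per-attribute metrics. The helper name `product_metric` and the two example metrics (absolute difference on a numeric attribute, the discrete metric on a categorical attribute) are assumptions chosen for illustration; the sketch does not compute the Wasserstein distance itself.

```python
def product_metric(metrics):
    """Combine per-attribute metrics into the max (L-infinity) metric on tuples.

    `metrics` is a list of two-argument functions, one per attribute, in the
    same order as the entries of the points being compared.
    """
    def rho(x1, x2):
        # Assumes a nonempty attribute list; max() of an empty sequence raises.
        return max(m(u, v) for m, u, v in zip(metrics, x1, x2))
    return rho


# Hypothetical attributes: a numeric reading and a categorical label.
numeric = lambda u, v: abs(u - v)
discrete = lambda u, v: 0.0 if u == v else 1.0
rho_T = product_metric([numeric, discrete])

print(rho_T((1.0, "a"), (3.5, "a")))   # numeric gap dominates
print(rho_T((1.0, "a"), (1.25, "b")))  # label mismatch dominates
```

With this choice, two rows are close only when they are close in every attribute simultaneously, which is the behavior an $L^\infty$-type ground metric is meant to capture.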
The proof is identical to the proof of Lemma 3.20, replacing the equality with an inequality.
Proof. Recall that the data complex S is closed under face maps and degeneracy maps. Note the face and degeneracy bounds for the Wasserstein distance given above. Application of those maps to the (S, σ) and (T, τ ) in the definition shows that F t is closed under the face maps and degeneracy maps as well. Therefore, F t is a data subcomplex.
To verify that $F^t$ is fibrant, apply Theorem 3.15 to obtain all joins $(T_{012}, \tau_{012}) \in X$ from any $(T_{01}, \tau_{01})$ and $(T_{02}, \tau_{02})$ in $F^t$. We must show such $\tau_{012}$ lies in $F^t$. Fix $a \in T_{012}$; it appears in $T_{01}$ or $T_{02}$. For concreteness, assume $a \in T_{01}$. There is some $(S, \sigma) \in \mathcal{S}$ such that $w_S(\downarrow_S \tau_{01}, \sigma) \le t$. By the construction of $\tau_{012}$, we have $\downarrow_{T_{01}} \tau_{012} = \tau_{01}$, so […]

Because $F^t$ is fibrant, all of the usual consequences apply in homotopical algebra, such as the following.

Corollary 4.7. Fix a data subcomplex S of an ambient X. For each $t \in [0, \infty]$, and for each $n \ge 0$, the pointed homotopy group $\pi_n(F^t, *)$ is well-defined. Moreover, for $t_1 \le t_2$, the inclusion of data subcomplexes $F^{t_1} \subset F^{t_2}$ induces a homomorphism of pointed homotopy groups $\pi_n(F^{t_1}, *) \to \pi_n(F^{t_2}, *)$.
Obstructions in dimension n − 1 = 2 detect spheres in F t , which will prevent some n + 1 = 4 data tables from being mutually joinable.
Obstructions in dimension n − 1 = 0 detect non-path-connectedness of F t , which would prevent some n + 1 = 2 data tables from being joinable (but this is impossible with our definitions including trivial joins).
The next theorem is an adaptation of Theorem 34.6 and Corollary 34.7 in [18], which is summarized in Theorem 4.5 of [10]. It relies on defining a difference cochain that compares a homology class of sections.
Theorem 4.12. Fix a data section $\sigma : B_{n-1} \to S_{n-1}$. Suppose $\xi^t_\sigma = \delta\eta$ for some $\eta \in C^{n-1}(A; \pi_{n-1}(F^t))$. Then there exists a data section $\tau : A_n \to F^t_n$ such that $\tau|_{n-2} = \sigma|_{n-2}$. The converse holds as well.

[…]

(1) $\xi^t_\sigma = e$ as a cocycle. Every (n−1)-cycle of n+1 data tables in S over a total of n+1 attributes can be approximately joined to a single data table over those n+1 attributes, allowing error at most t in any reduction to the original data.
(2) $\xi^t_\sigma \ne e$ as a cocycle, but $[\xi^t_\sigma] = e$ as a cohomology class. There is some (n−1)-cycle of n+1 data tables $(T_0, \tau_0), \dots, (T_n, \tau_n)$ in S such that the combined attribute list $T = [a_0, \dots, a_n]$ does not admit an approximate join $(T, \tau)$ with error at most t. However, if one considers all of the faces of these data tables, then there is an approximate join to $(T, \tau)$ of error at most t.
(3) $[\xi^t_\sigma] \ne e$ as a cohomology class. There is some (n−1)-cycle of n+1 data tables $(T_0, \tau_0), \dots, (T_n, \tau_n)$ in S such that the combined attribute list $T = [a_0, \dots, a_n]$ does not admit an approximate join $(T, \tau)$ with error at most t, even when omitting attributes from the original data tables. The only way to produce a single joined table is to increase the error threshold t.

Definition 4.14 (Persistence of Obstruction). Let $S \subseteq F^0 \subset \cdots \subset F^t \subset \cdots \subset F^\infty = X$ be the filtration of a path-connected data complex. Fix a dimension n such that $d_i Y \in B_{n-1}$ for all faces $d_i$ of all $Y \in A_n$. Let $\sigma : B_{n-1} \to S_{n-1}$ be a data section. Fix a basepoint $(T_0, \tau_0) \in S \subset F^0$. Define the cocycle-level value $t_n(\sigma) := \inf\{t : \xi^t_\sigma = e \in C^n(A; \pi_{n-1}(F^t))\}$ and the cohomology-level value $t_n(\sigma) := \inf\{t : [\xi^t_\sigma] = e \in H^n(A; \pi_{n-1}(F^t))\}$. Note that the cohomology-level value is at most the cocycle-level value, because a trivial cocycle represents a trivial cohomology class.
Remark 4.15. Consider a data section $\sigma : B \to S$. A specific cocycle-level value $t_n(\sigma) = t$ means that $\sigma$ admits an extension into $F^t$, but not for any level of the filtration less than t. In other words, there is no obstruction to extension beyond the mere existence of the data section $\sigma : B_{n-1} \to F^t_{n-1}$. Similarly, by Theorem 4.13, a specific cohomology-level value $t_n(\sigma) = t$ means that there is no obstruction to extension beyond the mere existence of the data section $\sigma|_{n-2} : B_{n-2} \to F^t_{n-2}$.
Remark 4.16. When obstructions are resolved, there are typically many solutions to Problems 1.2/1.3. That is, if any hypothesis is consistent in 1.1, then there are typically many other hypotheses that are consistent as well. Typical methods for choosing among them often involve posing and then solving some optimization problem. We might propose enriching those optimization problems via inclusion of a measure of global inconsistency. More precisely, the cost of a proposed data section σ might be some combination of a local cost and some decreasing function of t n (σ) or t n (σ); in other words, one might penalize proposed local mergers based on the degree of difficulty they cause in forming global consensus with other local mergers.

Discussion
This paper provides a mathematical foundation for semi-automated data-table-alignment tools that are common in commercial database software. Data tables are abstracted as measures over value spaces, and the problem of merging tables, or indeed merging previously-merged tables, is recast as the search for a measure that marginalizes correctly. This abstraction, and the simplicial set structure built with it, permits several advances over the current state of the art in database engineering. Ongoing and future work will focus on developing clear algorithms for application of persistent obstruction theory to real-world database engineering and related problems in data science.
We conclude this paper with several brief remarks about further work and also some practicalities for future use of this theory:
• A data sample X in any metric space V provides a measure, by counting. The measure is $\mu(U) = \#(U \cap X)$, or normalized as $\mu(U) = \#(U \cap X) / \#X$, for any $U \in 2^V$.
• For computational purposes, most infinite metric spaces can be considered as compact or finite spaces, using bounds or bins or kernel methods or distributional coordinates that are appropriate to the problem at hand.
• On the compact metric spaces $V(T)$, measures of interest can be described as density functions via a Radon-Nikodym comparison to the uniform probability measure $\kappa_T$.
• One attribute can represent models on other attributes, providing an interpretation of Bayesian inference and an opportunity to apply persistent obstruction theory to compact parameterized model spaces. In machine learning, one could use this framework to describe the compatibility of solutions in ensemble methods.
• Any list of attributes can be considered as a single attribute, because it still provides measures over some metric space. There is no requirement that attribute value spaces are "minimal" or "1-dimensional" in any sense.
• Filtrations other than $L^\infty$-Wasserstein might work, too, but someone has to prove that all levels of the filtration are fibrant.
• The most important conclusions of this work are: any manual or automatic data-merging system must analyze homotopy in order to guarantee success; and obstructions can only be resolved in two ways: backing up one step, or allowing additional leeway in the data comparison.
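The counting measure in the first remark above can be sketched in a few lines. The helper name `counting_measure` is hypothetical, and the sketch assumes that repeated sample points are counted with multiplicity and that a test set U is given as a Python set of values.

```python
from fractions import Fraction


def counting_measure(sample, normalize=False):
    """Return mu with mu(U) = #(U ∩ X), or #(U ∩ X) / #X when normalize=True."""
    points = list(sample)
    if normalize and not points:
        raise ValueError("cannot normalize the measure of an empty sample")

    def mu(U):
        count = sum(1 for x in points if x in U)
        return Fraction(count, len(points)) if normalize else count

    return mu


sample = [1, 2, 2, 5]
mu = counting_measure(sample)
nu = counting_measure(sample, normalize=True)

print(mu({2, 5}), nu({2, 5}))
```

The unnormalized version keeps the total mass equal to the sample size, which matters when tables of different sizes are compared; the normalized version always has total mass 1.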

Appendix A. Categorical Definitions
This appendix provides a rapid summary of a categorical interpretation of the development in Section 2. For more on these topics, and for the notion of homotopy for fibrant objects in model categories, see [9,11,16,8,6]. The reader is warned that each of these references uses a slightly different convention for ordering, opposite categories, and co-/contra-variant functors.
A(a). Simplex. Let Set denote the set category, whose objects are sets and whose morphisms are functions.
Let $\Delta$ denote the simplex category, whose objects are the nonempty sets of natural numbers with the standard ordering ≤, written $n := \{0, 1, \dots, n\}$, and whose morphisms are order-preserving functions. Let $\Delta_a$ denote the augmented simplex category, whose objects are sets of natural numbers with the standard ordering, and whose morphisms are order-preserving functions. The augmented simplex category includes the empty set, denoted −1 or ∅, which is the initial object in the category. So, $\Delta_a = \Delta \cup \{\emptyset\}$. A monomorphism in $\Delta_a$ is a one-to-one order-preserving function. The only bimorphisms/isomorphisms in $\Delta_a$ are the identity maps. Among the morphisms in $\Delta$ and $\Delta_a$ are the co-faces $d_i$ and co-degeneracies $s_i$, defined as follows. […]
Every non-identity morphism in $\Delta$ or $\Delta_a$ can be written as a finite composition of co-face and co-degeneracy morphisms, so these five properties essentially characterize $\Delta$ and $\Delta_a$.
For our applications, the following lemmas about monomorphisms in ∆ a are very useful. They are elementary, but do not appear in the standard references in this form. Merged indexing is merely an ordered formulation of the inclusion-exclusion principle.
Lemma A.1 (Complementary Monomorphism). For any monomorphism $\iota : n' \to n$ in $\Delta_a$, write $m = n - n' - 1$. There is a monomorphism $\iota^c : m \to n$ in $\Delta_a$ that enumerates the entries of n that are not in the image of $\iota$.
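A monomorphism in the augmented simplex category can be written as a strictly increasing sequence of its values, so the complementary monomorphism of Lemma A.1 is easy to compute. The sketch below uses a hypothetical helper `complement`, and writes the source of the monomorphism as n' (so the input sequence has n' + 1 entries), which is an assumption about the lemma's notation.

```python
def complement(iota, n):
    """Return the increasing sequence of entries of {0, ..., n} missed by `iota`.

    `iota` is a monomorphism into {0, ..., n}, written as a strictly
    increasing list of its values.
    """
    if any(b <= a for a, b in zip(iota, iota[1:])):
        raise ValueError("a monomorphism must be strictly increasing")
    if any(not 0 <= v <= n for v in iota):
        raise ValueError("values must lie in {0, ..., n}")
    image = set(iota)
    return [j for j in range(n + 1) if j not in image]


iota = [1, 4]          # a monomorphism 1 -> 5, so n' = 1 and n = 5
iota_c = complement(iota, 5)

# m = n - n' - 1 = 3, so the complement has m + 1 = 4 entries.
print(iota_c)
```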
Example A.3. Consider $n_0 = 1$ and $n_{01} = 5$ and $n_{02} = 4$. Then $n_{012} = 8$. Let $\iota_{01} : 1 \to 5$ be the monomorphism that is written as the sequence [1,4]. Let $\iota_{02} : 1 \to 4$ be the monomorphism that is written as the sequence [1,3]. Visually, the merged indexing means […]

A(b). Simplicial Sets. For any category C, the "simplicial category over C" is sC. An object in sC is a contravariant functor $X : \Delta \to C$. That is, an object in sC is an assignment of:
• for each object n in $\Delta$, an object $X_n$ in C;
• for each morphism (order-preserving function) $\mu : n' \to n$ in $\Delta$, a morphism $X(\mu) : X_n \to X_{n'}$ in C.
The augmented simplicial category, asC, allows a terminal object in C to correspond to the initial object $-1 \in \Delta_a$. That is, the trivial map $-1 \to n$ yields a corresponding map $X_n \to X_{-1}$, if the category C happens to admit a terminal object.
The morphisms X → Y in sC or asC are the natural transformations as in (A.2).
The most important case is sSet, the category of simplicial sets, which is augmented to asSet. The following lemma shows that augmented simplicial sets are given by face and degeneracy maps.
Lemma A.4. Any object in asSet is a set X (called an augmented simplicial set) graded by −1, 0, 1, 2, … and equipped with morphisms $d_i : X_n \to X_{n-1}$ and $s_i : X_n \to X_{n+1}$ for $0 \le i \le n$ such that […]

Proof. The objects are apparent. As for morphisms, each co-face $d_i : n - 1 \to n$ and co-degeneracy $s_i : n + 1 \to n$ morphism in $\Delta_a$ must correspond to face $d_i : X_n \to X_{n-1}$ and degeneracy $s_i : X_n \to X_{n+1}$ morphisms in X. Because the co-face and co-degeneracy morphisms generate all non-identity morphisms in $\Delta_a$, it is sufficient to specify these face and degeneracy maps.

A particularly important example of a simplicial set is $\Delta^n$, the n-simplex. (See 3(a).)

Definition A.6 (Simplex). The standard n-simplex $\Delta^n$ is the simplicial set generated (via face and degeneracy maps) by the ordered set $n = \{0, \dots, n\}$ in the simplex category $\Delta$.
By the Yoneda Lemma, a simplicial set X is characterized by the simplicial maps ∆ n → X; that is, a simplicial set is characterized by its simplices.
Definition A.7 (Horn). The kth horn Λ n k of the n-simplex ∆ n is the simplicial subset generated by the union of all the faces of ∆ n except the kth face.
By Lemma A.4 and the Yoneda Lemma, if X is a simplicial set, then a horn in X is a collection of n (n−1)-simplices $f_0, \dots, f_{k-1}, f_{k+1}, \dots, f_n$ such that $d_i f_j = d_{j-1} f_i$ for $i < j$.
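The compatibility condition $d_i f_j = d_{j-1} f_i$ can be checked mechanically when simplices are modeled as tuples of vertices and the face map $d_i$ deletes the entry at position i. This is a combinatorial sketch only (no degeneracies, no measures), and the helper names `face`, `horn`, and `is_compatible` are hypothetical.

```python
def face(simplex, i):
    """d_i: delete the entry at position i of a simplex given as a tuple."""
    return simplex[:i] + simplex[i + 1:]


def horn(n, k):
    """Faces f_j (j != k) of the standard n-simplex (0, ..., n), keyed by j."""
    top = tuple(range(n + 1))
    return {j: face(top, j) for j in range(n + 1) if j != k}


def is_compatible(faces):
    """Check d_i f_j == d_{j-1} f_i for all available indices i < j."""
    return all(
        face(faces[j], i) == face(faces[i], j - 1)
        for i in faces
        for j in faces
        if i < j
    )


faces = horn(3, 1)          # the horn of the 3-simplex missing its face 1
broken = dict(faces)
broken[3] = (0, 1, 9)       # replace one face by a simplex that does not align

print(is_compatible(faces), is_compatible(broken))
```

For data tables, the analogous check replaces tuple deletion by marginalization, so that a horn is a family of tables whose shared marginals agree.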
A simplicial map f : X → Y is called a cofibration iff it is a monomorphism. A simplicial map f : X → Y is called a fibration iff for any cofibration i : Λ n k → ∆ n , the commutative diagram (A.3) can be completed.
Weak-equivalences are defined to be compatible with fibrations and cofibrations according to [16]. See also [8]. These definitions of cofibration, fibration, and weak equivalence make sSet into a (closed) model category.
A simplicial set X is called fibrant or to satisfy the Kan extension condition if f : X → { * } is a fibration; that is, a simplicial set satisfies the Kan condition if and only if each horn Λ n k in X can be extended to a simplex ∆ n in X. Let sSet f denote the subcategory of fibrant simplicial sets. Then there is a homotopy category Π n (sSet f ), and any X ∈ sSet f admits pointed homotopy groups π n (X, x) that characterize the weak equivalence. Moreover, the simplicial homotopy groups of X ∈ sSet f are isomorphic to the continuous homotopy groups of its topological realization, |X|, as discussed in [16, §3] and [8,Chap I.2]. See also [9] and [11] for historical explanations that minimize categorical language.
A(c). Data Complexes. Let DataCplx denote the category of data complexes. An object in DataCplx is a pair of augmented simplicial sets (X , A) with simplicial map p : X → A such that for each n ∈ ∆ a , the set X n is a set of data tables over attribute lists A n from some attribute set A, as in Section 2, with d i and s i by marginalization and Dirac-delta intersection, respectively.
A morphism in DataCplx is a simplicial map $f : (X, A) \to (Y, B)$ as in (A.4) with some compatibility conditions.
The vertical maps are tuples (ϕ n , {ψ a } a∈A , f n ) satisfying the following compatibility conditions.
(1) ϕ n : A n → B n is a level of a simplicial map ϕ : A → B on sets of attribute lists.
(2) $f_n : X_n \to Y_n$ is a level of a simplicial map $f : X \to Y$ on sets of measures, with $\phi_n = p \circ f_n$.

These conditions guarantee simply that the attribute lists T, the value spaces $V(T)$, and the measure spaces $M(T)$ remain compatible. As with sSet, in (A.4), the map $\mu : n' \to n$ can be taken to be $d_i : n - 1 \to n$ or $s_i : n + 1 \to n$ so that the diagram describes naturality with respect to face and degeneracy maps on X and Y. These conditions are sensible for $n \ge 0$, so they apply to the trivial data table ( […]

Every data complex X admits a morphism to the terminal data complex $\mathbb{R}_{\ge 0}$. This terminal morphism f maps each data table $(T, \tau) \in X$ to the singleton mass $([*, \dots, *], \downarrow_{[]} \tau) \in \mathbb{R}_{\ge 0}$. If all data tables in X share the same mass (say, M = 1), then the image of the terminal morphism goes to some $M \subset \mathbb{R}_{\ge 0}$.
A morphism in DataCplx is called a cofibration iff it is a monomorphism. A morphism in DataCplx is called a fibration iff for any cofibration i from a well-aligned pair $(\tau_{01}, \tau_{02})$ over $T_0$ to a join $\tau_{012}$, the commutative diagram (A.6) can be completed.

(A.6) [Diagram: lifting square relating the cofibration i, from the well-aligned pair $\tau_{01}, \tau_{02}$ over $T_0$ to the join $\tau_{012}$, and the morphism $f : X \to Y$.]

A data complex X is called fibrant if the terminal morphism $X \to \mathbb{R}_{\ge 0}$ is a fibration. By Theorem 3.11, if X is a fibrant data complex, then X is a fibrant simplicial set. Thus, the category DataCplx is a (closed) model category, and the fibrant subcategory $\mathrm{DataCplx}_f$ inherits a well-defined homotopy category $\Pi_n(\mathrm{DataCplx}_f)$ from $\mathrm{sSet}_f$, and any $X \in \mathrm{DataCplx}_f$ admits pointed homotopy groups $\pi_n(X, \tau_0)$ that characterize the weak equivalence. Moreover, the homotopy groups are isomorphic to the continuous homotopy groups of the topological realization of the underlying simplicial set.