Mike_Kiev - Fotolia


How deterministic and probabilistic matching work

Expert David Loshin explores the benefits and challenges of the two classes of record matching in master data management systems: deterministic matching vs. probabilistic matching.

Record matching and linkage is the foundation of most modern master data management systems. There are many different tools and techniques that can be applied to match records, ranging from exact matching (in which two records are deemed to be a match only if all the corresponding data attribute values match exactly) to more complex matching using approximate similarity scoring to compare and match character strings to determine how close two records are.

Record matching and linkage methods are often divided into two classes, deterministic matching and probabilistic matching.

Deterministic matching applies discrete rules about how record attribute values are compared; the strictest approach requires exact matching, but in some cases standardizations can be applied to "smooth" the data in preparation for matching; character strings can be mapped to standard form (Reggie can be standardized as Reginald), punctuation can be removed (Dr. Kelley can be standardized as Dr Kelley), and tokens can be reordered within an attribute ("Kelley, Dr Richard" becomes Dr Richard Kelley). Deterministic matching can be extended to include some approximate methods, such as comparing equality of character string encodings.

Probabilistic matching analyzes the data attribute values to determine how data frequencies affect the probabilities of two matching records sharing the same values, versus two nonmatching records having the same values purely by chance. Once these probabilities are calculated, they are used to determine weights that are subsequently applied to the similarity scores for pairs of identifying attribute values computed using one or more approximate scoring methods. The sum of the weighted scores provides an overall similarity score. Record pairs whose similarity score meets or exceeds a match threshold are deemed to be matches, while a record pair with similarity score that is below an nonmatch threshold are deemed to be nonmatches.

How probabilistic record matching works

As can be expected, a key aspect of probabilistic matching is the determination of the probabilistic weighting factors to be applied to the similarity score for each pair of corresponding data elements. It relies on a Bayesian model of conditional probability to develop the weights and matching rules. These weights are determined in relation to two different probabilities:

  • the probability that if the values of two corresponding data elements agree, then the two records represent the same entity (called the m probability); and
  • the probability that the values of two corresponding data elements agree, despite the fact that the two records do not represent the same entity (called the u probability).

As an example, an m probability of 0.97 for a LastName field means that the probability that two records that share the same LastName field value represent the same individual is 0.97, and the probability that two records that share the same LastName field value do not represent the same individual is 0.03. An example of a u probability, which can be more simply stated as the chance that records have values that randomly match, would be 0.083 for a BirthMonth field. There are 12 months in the year, and the chance that two records share the same month is (12 matches)/(12*12 combinations), which = 1/12, which is 0.08333(repeating).

Once the m probability (m) and the u probability (u) is calculated for an attribute, the weighting factor for that attribute is log2(m/u). For example, if there is a Gender field that can be populated with the values {M, F}, the u probability is 0.5 (which is chance that any two records will have the same value). If we can determine that the m probability is 0.95 (by examining a sample of records and determining that the field is miscoded 5% of the time), then the weighting factor for the gender field would be log2(0.95/0.5), or 0.926. Should the record-matching algorithm use a similarity score function that assigns a score of 1 if the values match and a value of 0 if the values don't match, that value would be multiplied by 0.926 to yield its contribution to the overall similarity score.

Workarounds for probabilistic matching challenges

The challenge is in determining these probabilities. The m probability is often estimated; one approach is to take a random sample of record pairs, perform a manual match, determine when two records with matching values are truly matched, and then calculate the m probability accordingly.

Most modern probabilistic matching tools go beyond this approach by performing frequency analysis of the values and applying some iterative methods that alternate phases of the algorithm. They start with an assumed set of expected probabilities, then try to fit the data to those expected values and adjust the expectation for the subsequent iteration. The algorithm completes when there is an insignificant change between the fitting phase and the next iteration of the expectation-setting phase. The goal is to adjust the probabilities under the circumstances associated with the data value frequencies -- two records that share a common last name like Smith are more likely to chance match than two records with the last name Hospfeffer.

The ability to algorithmically calculate the m- and u-probabilities significantly improves the scoring methods.

However, the greater the variation among the values, coupled with the known commonality of those values, eventually leads the record-matching algorithms to tweak the weights so that they are optimized to contribute to the overall similarity score as appropriate. The upshot is that the ability to algorithmically calculate the m- and u-probabilities significantly improves the scoring methods, contributes to narrowing the gap between the upper (match) threshold and the lower (nonmatch) threshold, and enables analysts to narrow the width of the band between those two thresholds.

Dig Deeper on Data management strategies

Business Analytics
Content Management