The single most valuable dataset to almost any business is the customer profile, often collectively managed through a customer relationship management (CRM) system. For any business, managing a customer database can be challenging, and maintenance becomes increasingly complex as the enterprise grows. Sources of data become more numerous and varied in format. As small incremental changes to customer profiles are introduced, issues such as managing duplicate records and maintaining chronological sequence become increasingly important. Computationally, things get exponentially more expensive, as you are comparing each new record with all existing ones. If you are lucky enough to have a lot of customers, you may even be outgrowing the capabilities of a traditional technology stack.

At the center of many of these problems is the issue of record linkage (also referred to as entity resolution, merge/dedupe, and many other terms), or managing the connections between two customer profiles that belong to the same person. Because of the complexities involved, this is — not surprisingly — a very popular area of research in computer/data science.

Savvy hotel operators don’t want to be in the data processing and warehousing business, so they rely on Revinate’s proprietary guest relationship management (GRM) system. We receive and manage streaming guest data for millions of guests 24 hrs/day in widely varied formats. What follows, is a brief look into how we at Revinate manage the record linkage problem to provide value for our clients, from an engineering perspective.

In order to be able to adapt dynamically to changing data formats, quality, and potentially very large volumes, we’ve chosen to use Kafka, Cassandra, and Apache Spark with ML (Spark’s machine learning library) to perform a lot of the heavy lifting and data analysis around our guest data. With respect to record linkage, the steps we go through are roughly as follows:

  1. Normalize the structure and format of incoming guest data.

  2. Find potential duplicates (aka merge candidates) amongst all guest profiles.

  3. Build similarity features for potential duplicates.

  4. Train a machine learning model on a sample data set. (Done once).

  5. Make predictions on whether these potential duplicates are or are not matches based on this model.

  6. Consolidate and merge the duplicates.

Normalizing raw data

Any data engineer or data scientist knows: it is easy to underestimate the difficulty and complexity of normalizing data. Consider the small sample of customer data that we might receive from a property management system (PMS), below. Although a reservation is usually for a guest (or two, or three), it may also be for a wedding party, for a corporation or club, or it may be a placeholder reservation with a name like “walk-in” or “guest”. Names can often be misspelled, especially when dictated over the phone.

000010,William Faulkner,+16622343284,916 Old Taylor Rd Oxford MS 38655,
000011,will faulkner,(662)234-3284,rowan oak oxford ms 38655 usa,
000012,Bill F,662 234 3284,null,
000012,Guillaume Faulkneur,+49 662 234 3284,paris france,
000013,walk-in guest,null,null,
000014,Faulkner Society,null,PO Box 5272 Mississippi State MS 39762,

Until non-ascii characters are normalized, a first name like “José” will only be a partial match with “Jose.” Shortened names (aka hypocorisms) must also be accounted for, so that “William” is a match with “Bill”, “Margaret” with “Meg”, etc. Similarly, if a reservation was made over the phone, it is likely that “Julie N.” is the same person as “Julian,” and a phonetic matching algorithm like Soundex, Beider-Morse, or Metaphone might give better results than a simple string comparison when building features for a machine-learning model.

Phone numbers and addresses can be particularly troublesome to normalize, especially for international data. Although many libraries exist for parsing phone and address data, none validate addresses perfectly and most are specialized for a particular region, like the US. Matters are further complicated by PMSs that provide default values that hotel staff may blindly accept. We employed a mix of several libraries and proprietary code to get an acceptable level of coverage and accuracy. Libpostal and libphonenumber are popular alternatives for address and phone number normalization. If we want to send out an automatic greeting on someone’s birthday, dates have to be standardized to ISO from potentially hundreds of different kinds of formats.

At Revinate, we run our applications directly on a stand-alone Spark cluster and submit jobs using the spark-submit script. The first application listens for data from Kafka, handles data cleansing and normalization logic in Scala, and then writes normalized data to Cassandra. In this way, we can ensure consistent and scalable HA performance for both modes: batch and streaming data. Below, is an example of a character normalization method, to match “José” with “Jose.”

  def normalizeStringAndRemoveSpecialCharacters(str: String): String =
      _.replaceAll("[$&+_~`,:;=.,!?@#|'<>.^*()\\[\\]%!\\-/{}ƜɯƟɵƢƣƻƼƽƾǀǁǂǃDŽDždžLJLjljNJNjnjǰDZDzdz]", "")
        .replaceAll("[ÀàÁáÂâÃãÄäÅåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧ]", "a")
        .replaceAll("[ÆæǢǣǼǽ]", "ae")
        .replaceAll("[ƀƁƂƃƄƅ]", "b")
        .replaceAll("[ÇçĆćĈĉĊċČčƆƇƈ]", "c")
        .replaceAll("[ĎďĐđƉƊƋƌÐð]", "d")
        .replaceAll("[ÈèÉéÊêËëĒēĔĕĖėĘęĚěȄȅȆȇȨȩƎƏəƐɛ]", "e")

Generating match candidates

Depending on the size of your data set, evaluating all possible combinations of profiles for potential duplicates can be extremely expensive computationally. For a property with 50K profiles, there are 2.5 billion possible combinations of profiles to evaluate, at O(n2) runtime complexity. If one manages to eliminate reciprocal pairs (i.e. the n(n-1)/2 "handshake problem"), one still has nearly 1.25 billion possible combinations. A common strategy to make handling this problem more feasible is called “blocking.” In our use case, merge candidates are generated by creating a merged list of profiles from a list of all profile pairs with the same first name, combined with all profile pairs with the same last name, email, phone number, etc. This list of blocks by field results in a dramatically smaller set of data than the 1.25-2.5B mentioned above.

As a more concrete example, with a brute force approach, a customer base of 100 customers will entail 1002 operations, or 10,000 comparisons. With a blocking approach, if we have 12 matches for last name for instance, and 10 for first name, we would have 122 (last name) = 144 + 102  (first name) = 100, or 244 comparisons, …​ plus n2 comparisons for any additional fields. For a reasonable number of fields, this result renders vastly less than 10K pairs.

When performing blocking, evaluating the overall “health” of the data is mandatory. Blocking may only be used for fields that are both highly populated and have a high degree of uniqueness. At Revinate, we refer to these as coverage (the percentage of all values that are populated for a given field for all profiles) and cardinality (the ratio of unique values to the total value count for all profiles). If cardinality for a given field like phone number, for instance, is near 0, meaning nearly all values are the same, then by attempting to match on that field, we will again get nearly all possible combinations of profiles and we are right back up to 1.25-2.5 billion match candidates. On the other hand, if coverage is low, we will only match amongst a very small subset of the profiles, and therefore miss a large portion of possible matches. Without first evaluating the quality of the data, it is very easy to far exceed the original 2.5B count by a large margin, since we are running comparisons for each field in the profile. But as long as we construct a blocking scheme carefully, it is possible to greatly reduce the volume of data that needs to be processed.

To reduce block size for low-cardinality fields, it may also be effective to combine fields to build new, composite blocking features. For example, comparing a combination of “zip code + first 3 letters of last name” would likely yield more accurate results than simply comparing the zip code. The same may apply to a very common name, like first name: "John." As food for thought, here is a blocking scheme used for US Census data, below, where "^" means an intersection of two blocks (like an inner join) and "∪" means the union or concatenation of two blocks:

({token, zip} ∧ {first, surname}) ∪ ({token, phone}) ∪ ({first-3, zip} ∧ {token, day-of-birth} ∧ {token, month-of-birth}) ∪ ({token, zip} ∧ {token, house number}) ∪ ({first-3, surname} ∧ {first-3, first name})

- from "Learning Blocking Schemes for Record Linkage", Matthew Michelson & Craig A. Knoblock, 2006

Evaluating match similarity

Now that we have found potential duplicates, we need to calculate similarity features for each field in the profile pairs. After vectorizing these fields, they can be fed into a machine-learning model to help it learn. Most often, one will want to use different string comparison algorithms for different fields, depending on their type, usage, and length. For a phone number, you would probably want to check for exact match, so a simple string comparison is enough. A last name might be almost identical, so an algorithm like Levenshtein distance might make sense, which calculates the number of insertions, deletions, and replacements necessary to transform string A to string B. More advanced algorithms, with variants like Jaccard, Jaro-Winkler, Hamming distance, and others, also factor in string length in various ways to more evenly distribute output. An address, on the other hand, will likely have small variations, like “St.” versus “Street,” a missing country, or similar. Here, it makes much more sense to use an algorithm like n-gram (a.k.a. q-gram or minHash), which performs what is called shingling, a method that splits both strings into sets of small parts, typically of 2 or 3 letters, and then calculates the intersect between both sets. This blog post by R. Vogler has a very nice comparison of various string similarities and their output/distribution.

When we are finished running all of these comparisons on each field of a dataset, we have a table of numerical values, normalized to a range of 0-100 for similarity, from which we can calculate an aggregate similarity.

Note that the aggregate similarity value for a pair of profiles should not necessarily be the arithmetic mean of all field similarities, because fields like name are more important than other fields. For example, even if all the fields except name are similar, it is highly unlikely that two profiles should be merged if the names are dramatically different. The last step in creating features is feature weighting. We considered two preferred approaches:

  1. Multiply each field with a coefficient for a total sum of 1 between all weights to shift “influence” from one field to another, or

  2. Build a decision tree of sorts with a threshold for weights, so that you can give, say, 70% significance to a full name match, and use a match in any other field to tip the mean similarity across the decision border.

Now that we have numerical values for the similarity of each pair of fields and an ideal weighting scheme with which to calculate the aggregated similarity of two profiles, we have a vectorized series of features that can be used to train a machine learning model. We’ll cover this process in Part II of this blog post.