In Part I of this post about a record linkage pipeline, we normalized the structure and format of incoming guest data, found potential duplicates (aka merge candidates), and built similarity features for the potential duplicates we found. Now that we have our features, we can proceed with the next steps: training the machine-learning model, clustering the merge candidates, and finally, merging the profiles.

Preparing the Training Data

When blocking and match similarity computation are finished, we need to decide which of the merge candidates we want to merge and which we want to leave as separate. In most cases, a rational approach to deciding whether to merge candidates would be to set “decision borders,” or thresholds, on the aggregate similarity score. In other words, if the score for a candidate pair is below, say, 0.75, it is probably safer to assume that this is not a valid match. Conversely, if the score is 0.90 or higher, we can assume that we almost certainly have a valid match. But what do we do with the 0.75-0.90 cases? Evaluating these ambiguous pairs often involves fuzzy logic and judgement that is best made by a human being, so we need manual review. Do we delegate all of this to TaskRabbit or Mechanical Turk and call it done? That could be costly and doesn’t sound very sustainable.
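
The thresholding logic above can be sketched in a few lines. This is a minimal illustration, not our production code; the 0.75 and 0.90 cutoffs mirror the hypothetical decision borders discussed here, and the function name is ours:

```python
def classify_pair(similarity: float,
                  lower: float = 0.75,
                  upper: float = 0.90) -> str:
    """Route a candidate pair based on its aggregate similarity score."""
    if similarity < lower:
        return "no_match"        # safe to leave as separate profiles
    if similarity >= upper:
        return "match"           # almost certainly a valid match
    return "manual_review"       # ambiguous: send to a human reviewer

# A pair scoring 0.82 lands in the manual-review zone
decision = classify_pair(0.82)
```

Everything in the middle band is routed to a human, whose judgement then becomes a label in the training set, as described next.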

As a human being reviews these more ambiguous profile pairs and makes judgements about whether they are matches or not, this output can be fed back into the training data. In machine learning jargon, this “control” data is usually called the “label”. Each label that we add increases the size and quality of the training set, and since these records are the most ambiguous, they are also potentially the most valuable for fine-tuning the model. In this way, we can start out with manual work, but eventually get to full or near full automation by improving the performance of the model.

So…nothing left to do but label somewhere between 30,000 and 100,000 carefully selected and randomized profile pairs! Seems like a good time to break for coffee, or something stronger. When this somewhat tedious process is finished, the vectorized data will look something like this, where -1.0 signifies the absence of field values to compare:

p1_id   p2_id   label  fName_sim  mName_sim  lName_sim  phone_sim  email_sim
f43549  e06362  1      1          -1.0       0.8333     1          -1.0
603a21  9bb2f9  1      0.125      0          0.7667     1          1
4asd25  53e18a  0      0          -1.0       0.3333     1          -1.0
1f3bc3  e4b5b4  1      1          1          0.6667     1          1
898282  66e3e9  0      0.125      -1.0       0.1667     1          0.125
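
A row of this vectorized form can be turned into a (label, features) pair for training. The sketch below is illustrative (the helper name is ours); it keeps -1.0 as-is so the model can learn to treat it as "no value available to compare":

```python
def parse_row(line: str):
    """Split a whitespace-delimited row into ids, label, and feature vector."""
    parts = line.split()
    p1_id, p2_id = parts[0], parts[1]
    label = int(parts[2])                       # 1 = match, 0 = non-match
    features = [float(x) for x in parts[3:]]    # -1.0 marks a missing field
    return (p1_id, p2_id), label, features

ids, label, features = parse_row("f43549  e06362  1  1  -1.0  0.8333  1  -1.0")
```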

Training and Evaluating a Model

With the data prepared, we are ready to load it and test the accuracy of our models. Since we are on Spark, we have a rich set of machine learning tools already available to us. A typical rule of thumb is to use roughly 70% of training set data to train a model and the remaining 30% to test the model’s performance. Spark MLlib is accommodating enough to handle the training and test data split with config settings, so that is one less thing for us to manage manually. For loading data, MLlib typically requires a sparse data format called libsvm. Rather than format the data into libsvm manually, we chose to load the data via Amazon S3 into a Spark DataFrame, allowing us to mix and match features for iterations in the model more flexibly with a simple SQL query.
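
When prototyping outside of Spark, the 70/30 split itself is simple to do by hand. This is a plain-Python sketch using a seeded shuffle for reproducibility; in MLlib you would use the framework's own split facilities instead:

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle deterministically, then split into training and test sets."""
    shuffled = rows[:]                      # avoid mutating the caller's list
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(100)))   # 70 train rows, 30 test rows
```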

For a more thorough evaluation, an optimally performing model should learn conjunctions that minimize false positives (claimed matches that are actually invalid), but learn enough of these conjunctions to cover as many of the valid true matches as possible. These two goals of record linkage can be referred to as pairs quality (or precision) and pairs completeness (or recall), respectively. The reduction ratio (RR) indicates how much the candidate generation step has shrunk the comparison space relative to the total number of possible record pairs, and gives some idea of how dramatically you are affecting the data. In more concrete terms, the most common metrics are:

Pairs Quality (precision)

# correct candidate matches found / # total candidate matches found
If we find 80 candidate matches, of which 65 are correct, then PQ = 65/80 = 0.8125.

Pairs Completeness (recall)

# successful matches / # total possible true matches
If there are 100 true dupes possible and we catch 80, then PC = 0.8.

Reduction Ratio

1 - (# candidate pairs generated / # total possible record pairs)
If blocking reduces 10,000 possible pairs to 500 candidate pairs, then RR = 1 - 500/10,000 = 0.95.


Harmonic Mean (F-measure)

(2 x PC x RR) / (PC + RR)
A normalized mean between PC and RR which accounts for the imbalanced distribution of matches and non-matches.
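
These metrics reduce to simple arithmetic once the counts are known. A minimal sketch (function names are ours; the harmonic mean is taken as 2 x PC x RR / (PC + RR)):

```python
def pairs_quality(correct_matches, total_candidate_matches):
    """Precision: fraction of candidate matches that are correct."""
    return correct_matches / total_candidate_matches

def pairs_completeness(found_true_matches, total_true_matches):
    """Recall: fraction of the true duplicates we actually caught."""
    return found_true_matches / total_true_matches

def reduction_ratio(candidate_pairs, total_possible_pairs):
    """How much candidate generation shrank the comparison space."""
    return 1 - candidate_pairs / total_possible_pairs

def harmonic_mean(pc, rr):
    """Normalized mean of pairs completeness and reduction ratio."""
    return (2 * pc * rr) / (pc + rr)

pq = pairs_quality(65, 80)        # the 65-correct-out-of-80 example above
pc = pairs_completeness(80, 100)  # 80 of 100 true duplicates caught
```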

After the data has been loaded, we can simply set config for the label and features columns and run the model. Random forest regression would also have been a good choice, but after training on 100K of the most ambiguous records, gradient-boosted tree regression performed best for us, with a 2.8% root mean squared error (RMSE). The record linkage metrics described above are specific to this use case and must be computed manually.

When the model has been trained, it can be saved and run on all subsequent streaming data. If the data set changes or if the model degrades, it can be retrained at any point on new data. Our data flow would look something like this:

Merge Candidate Review


We now have a series of profile pairs that are merge candidates. The final step before merging two profiles is clustering. One might assume that we can simply chain all matched profiles together and merge them: if profile A == profile B, and profile B == profile C, then all three can simply be merged. However, our model does not guarantee that profile A has a sufficiently high aggregate similarity to profile C. If we have set an upper threshold or decision border of 0.90, for instance, then the similarity of A and C could be as low as 0.80, making a merge decision questionable. The 0.75-0.90 manual review zone complicates matters even further. These clustering conflicts must be evaluated to avoid joining profiles that should remain separate. In our case, clustering conflicts were rare enough (roughly 0.01% on average) that we chose to be "greedy" when clustering, but there is no shortage of more sophisticated approaches to implement, such as agglomerative hierarchical clustering.
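
Greedy clustering of this kind can be sketched with a small union-find structure: every accepted pair is simply chained together, which is exactly the transitive behavior described above. A minimal illustration (class and variable names are ours, not from our pipeline):

```python
class UnionFind:
    """Minimal union-find for greedily chaining matched profile pairs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Accepted merge candidates: A~B and B~C chain A, B, and C into one
# cluster, even though A and C were never directly compared.
uf = UnionFind()
for a, b in [("A", "B"), ("B", "C"), ("X", "Y")]:
    uf.union(a, b)

same_cluster = uf.find("A") == uf.find("C")
```

This is the "greedy" behavior: it never re-checks the A-C similarity, which is precisely why clustering conflicts have to be rare (or separately audited) for it to be acceptable.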

Merging Profiles

Once all profile pairs have been queued for merge, the merge process is triggered and the two profiles are combined. We will not go into detail on this process in this post, as this functionality will be very different for each use case. Generally, it is important to have a systematic way of evaluating how each field is to be treated. When names are combined (e.g. “William F” and “Will Faulkner” become “William Faulkner”), how does one choose which first and last name take precedence? In this case, string length and similarity might be a reasonable combined metric. For other fields, like emails, phone numbers, and addresses, we want to persist all distinct values. An address that is 99% similar is almost certainly a duplicate, however, so in a case like this, we need to evaluate which address is the most complete and save only one record. Still other fields, like sales or revenue data, may need to be aggregated.
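
A field-by-field merge policy like the one described can be sketched as follows. This is a deliberately simplified illustration (field names and rules are hypothetical; real merge logic would be per-field and use-case specific): names keep the longer value, contact fields keep all distinct values, and revenue is aggregated.

```python
def merge_profiles(p1: dict, p2: dict) -> dict:
    """Combine two matched profiles field by field."""
    merged = {}
    # Names: keep the longer (presumably more complete) value.
    for field in ("first_name", "last_name"):
        merged[field] = max(p1.get(field, ""), p2.get(field, ""), key=len)
    # Contact fields: persist all distinct values.
    for field in ("emails", "phones"):
        merged[field] = sorted(set(p1.get(field, [])) | set(p2.get(field, [])))
    # Numeric fields like revenue: aggregate.
    merged["revenue"] = p1.get("revenue", 0) + p2.get("revenue", 0)
    return merged

a = {"first_name": "Will", "last_name": "Faulkner",
     "emails": ["wf@example.com"], "revenue": 120}
b = {"first_name": "William", "last_name": "F",
     "emails": ["wf@example.com", "will@example.com"], "revenue": 80}
merged = merge_profiles(a, b)   # yields "William" / "Faulkner"
```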

Since this will be frontend-facing data, it is also worth noting that when merging we will want to work with the original data, not the data that was homogenized/normalized for profile synthesis, so that we retain all of the accented characters, diacritics, capitalisation, etc. This may also be the stage where we want to consider supplementing the data with additional analysis, like inferring gender or adding latitude and longitude on top of address data, for instance.

At Revinate, the methods outlined above have allowed us to deal with very large volumes of data in a sophisticated, flexible, and scalable way. I hope this short look into how we approach record linkage and building a guest relationship management system has been interesting and informative!