Record Linkage

Due: Friday, Feb 16th at 4pm

You must work alone on this assignment.

In this assignment, you will take a pair of datasets containing restaurant names and addresses and link them, i.e., find records in the two datasets that refer to the same restaurant. This task is non-trivial when there are discrepancies in the names and addresses of the same restaurants across the two data sources.

This assignment will give you experience in probabilistic record linkage, a technique that you may find valuable for your project and when working with real-world data in the future.

Here is an overview of the tasks you will perform. Some of these will make more sense after you have read the full description.

  1. The restaurant datasets are tables of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each dataset consisting of four columns: (i) the sequentially-numbered index from the file, (ii) the restaurant name, (iii) the city, and (iv) the street address (except for the city). Call the dataframes zagat and fodors.
  2. For record linkage, you will use a training dataset of known links, available in the file known_links.csv. For a given row, the first column indicates the index of a restaurant in the Zagat dataset, and the second column indicates the index of the same restaurant in the Fodor’s dataset. Your next task is to read this dataset into two parallel pandas dataframes: one with the rows corresponding to the Zagat indices in the known links, and another with the rows corresponding to the Fodor’s indices, both in the order in which they appear in the known links dataset. Call a pair from the known links a match and the resulting parallel dataframes matches.
  3. You will also need a pair of datasets that are not matches. We will call a pair of corresponding rows from these datasets an unmatch. You will create these dataframes by choosing random rows from zagat and fodors. Call the resulting pair of dataframes unmatches. (The terms match and unmatch follow the convention in record linkage.)
  4. You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records (one each from zagat and fodors). We have provided a set of histograms that show the distribution of these scores, for each of the fields (name, city, and address), for matches and for unmatches. You should examine these histograms to understand the rationale for the algorithm we will adopt.
  5. To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this part you will represent the distances between the fields of a pair of restaurants as a tuple of three values. Each field will be rated as having a "low", "medium", or "high" similarity across the two datasets. (It turns out that categorizing the similarity values into three buckets works better than the two that would result from a simple cutoff.) Since each tuple has three slots, and each can be one of three values, there are 27 possible tuples. Your next task is to count the occurrence of each tuple for the matches and the unmatches. Here you will use the matches and unmatches dataframes constructed earlier.
  6. Next, you will place the 27 different possible tuples into three categories: match_tuples, possible_tuples, and unmatch_tuples. If a test pair maps to a tuple in the first set, we classify it as a match; if it maps to a tuple in the second set, we classify it as a possible match; else we classify it as an unmatch. (We will trust the decisions made by our algorithm for the clear matches and unmatches, but possible matches need to be inspected manually.)
  7. It will then be time to find matches. You will iterate through all potential pairs from the two data sets, for each rank the similarities of the three fields and generate the corresponding tuple, and then determine the category to which it belongs. You will generate a CSV file that includes a row for each potential pair. Each row will have the zagat row index, fodors row index, and category (one of match, possible match, or unmatch). Use a comma to separate fields in the output file. For example, the line 0,2,match would indicate that the first row of the zagat data file is a match with the third row of the fodors data file.
  8. Iterating through all possible pairs is time consuming in general. In the final step, you will repeat the above algorithm, but only consider potential pairs that also have the same value for city.

Create restaurant dataframes

The restaurant datasets are tables of names, cities, and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each dataset consisting of four columns: (i) the index number from the CSV file, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city).

The review companies actually provide the restaurant information as a single string. We have split this string into the name, city, and address using regular expressions for you. Because an automated process was used, there may be restaurants for which this extraction process was performed imperfectly. This situation is not ideal, but it is realistic, and the data is in sufficiently good shape to use for the probabilistic algorithm we will employ.

  • It is best not to name the restaurant name column name, because pandas dataframes have a predefined attribute called name, and you will get unexpected results if you use the dataframe_name.column_name style of referring to a column.
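
As a starting point, here is a minimal sketch of loading one of these files. The column names below, and the assumption that the files have no header row, are guesses; check the actual files and adjust accordingly.

import pandas as pd

def load_restaurants(filename):
    # Assumed layout: index, restaurant name, city, address, with no header row.
    # The name column is deliberately not called "name" (see the note above).
    cols = ["index", "restaurant", "city", "address"]
    return pd.read_csv(filename, names=cols, header=None)

zagat = load_restaurants("zagat.csv")
fodors = load_restaurants("fodors.csv")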

Create the matches pair

For record linkage, you will use a training dataset of known links, available in the file known_links.csv. It contains 50 manually-linked pairs of restaurants. In other words, a human has chosen some of the rows in one dataset, and determined the corresponding rows in the other dataset. The first restaurant in the pair is from Zagat and the second from Fodor’s. Rows in the file correspond to these pairs, and the two columns give the indices for a restaurant within the corresponding files.

Using this file and the other dataframes, create a pair of dataframes, each with three columns: the name, city, and address. If the ith row of the known links file contains zx and fy, then the ith row of the first dataframe should contain the row with index zx from the Zagat dataframe, and the ith row of the second dataframe should contain the row with index fy from the Fodor’s dataframe.
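
One possible sketch of this step is shown below. The column names for known_links.csv are placeholders (the file simply has two columns of indices), and the row selection assumes zagat and fodors use the same row labels as the indices in the links file.

import pandas as pd

def build_matches(zagat, fodors, links_filename="known_links.csv"):
    # Placeholder column names for the two index columns.
    links = pd.read_csv(links_filename, names=["z_idx", "f_idx"], header=None)
    fields = ["restaurant", "city", "address"]   # assumed column names
    # Select rows in the order they appear in the known-links file, and reset
    # the index so the two dataframes line up row-by-row.
    z_matches = zagat.loc[links["z_idx"], fields].reset_index(drop=True)
    f_matches = fodors.loc[links["f_idx"], fields].reset_index(drop=True)
    return z_matches, f_matches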

Create the unmatches pair

It turns out that for record linkage, you also need a dataset of unmatches. For this, choose 1000 random restaurants from zagat and pair them with 1000 random restaurants from fodors. To make the random choices, the pandas sample method for sampling rows of a dataframe is suitable.

Technically, the random pairs may include some pairs that actually match, but the overwhelming number will be unmatches, and we will ignore this minor source of error.

Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? Because we can, and the more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would have been helpful as well, but that requires expensive manual work.

The random choices you make will influence the rest of your results. This makes it hard to debug a program if the bug only shows up for some random choices. Furthermore, making the same random choices in your program as we do in ours makes it easier for us to check your results. Therefore, we want you to seed the random generator with a fixed number. Specifically, use the following calls to choose the samples:

zs = zagat.sample(1000, replace=True, random_state=1234)
fs = fodors.sample(1000, replace=True, random_state=5678)

These calls will each always return the same 1000 rows because the random state is fixed through the seeds 1234 and 5678. In normal circumstances, once your code is working, you would remove these hard-coded seeds. But, we ask that you leave them in to facilitate testing of your submitted code.

Your function should simply return the resulting pair of dataframes: (zs, fs).

Explore Jaro-Winkler scores in histograms

To link records, we take a potential pair, one each from zagat and fodors, and decide if their strings are sufficiently similar. Our results are better when the strings are broken up into natural components (like name, city, and address), which we have done for you.

When the components are strings, a natural measure of similarity, or rather the distance between two strings, is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string to another. In practice, however, the Levenshtein distance is computationally intensive, and a related measure, called the Jaro-Winkler distance, is used instead.

The exact definition of the Jaro-Winkler distance is somewhat technical (but available here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1, and a complete mismatch (“Medici” and “Subway”) has a score of 0.

You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. Install it on your VM using the standard:

sudo pip3 install -U jellyfish

To find the Jaro-Winkler distance, use, e.g.,:

import jellyfish
score = jellyfish.jaro_winkler("medici", "noodles etc.")

To make sense of the algorithm we will be using for record linkage, you need to gain an understanding of the distribution of the Jaro-Winkler distance between names, cities, and addresses of pairs of restaurants, both when the pairs are matches and when they are unmatches.

Here is a link to a PDF file containing six histograms, labelled appropriately. The first pair of histograms shows the Jaro-Winkler distances between restaurant names, one histogram for matches and one for unmatches. The second and third pairs are similar, but for cities and addresses, respectively.

Estimate Tuple probabilities given match or unmatch

Observe that the probability of a match does not increase monotonically with the similarity for a field; rather, there are clusters in which this probability is high. For instance, for the address field, the probability is lower in the range 0.80-0.95 than just outside this range. The clusters are even more pronounced for the unmatch pairs.

The above implies that a single “cut-off” on the similarity for a field, or even a group of fields, will not serve our purpose. So we break up the range of similarity values into discrete chunks, which we will call "low", "medium", and "high", and determine the probability of a match for each combination of field-wise chunks separately.

We have adopted (a simplified version of) the probabilistic record linkage approach proposed by Fellegi and Sunter. We have provided a simple utility function get_jw_category() in util.py that takes a Jaro-Winkler distance and returns the string "low", "medium", or "high", essentially breaking the range of the Jaro-Winkler score into three blocks. We apply this function to all three fields: names, cities, and addresses. Thus for any pair of restaurants, we can create a tuple (name_category, city_category, address_category) that represents the similarity of the fields for that pair. Whether, during record linkage, the pair should be classified as a match or unmatch, or something in between, will depend on whether that tuple was most closely associated with matches or unmatches when we trained the algorithm, and our tolerance for error.
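
For a single pair of rows, building this tuple might look like the sketch below; the column names are placeholders, and the exact signature of util.get_jw_category is assumed from the description above.

import jellyfish
import util  # provides get_jw_category(), per the assignment description

def similarity_tuple(z_row, f_row):
    # Map one (zagat, fodors) pair to a (name, city, address) category tuple.
    # "restaurant", "city", and "address" are assumed column names.
    return tuple(
        util.get_jw_category(jellyfish.jaro_winkler(str(z_row[col]), str(f_row[col])))
        for col in ("restaurant", "city", "address")
    )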

Specifically, we determine whether a tuple should be classified as a match, possible match, or unmatch, based on estimates of the probability that a matched pair results in that tuple as well as the probability that an unmatched pair results in that tuple. Formally, assume that \((r_1, r_2)\) is a potential pair, \(t(r_1, r_2)\) is the tuple formed from its field similarities, and \(T\) is the set of all possible tuples. For every \(w \in T\) we need estimates for two quantities:

\[\begin{split}\begin{array}{l} m(w) \text{ defined as }\mathrm{P}[t(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is a match}], \text{ and} \\ u(w) \text{ defined as } \mathrm{P}[t(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is an unmatch}]. \end{array}\end{split}\]

Compute estimates for the former by iterating through all the corresponding rows in the two dataframes in matches, determining their tuples, and counting the frequency of each of the 27 possible tuples (combinations of the three similarity levels in the three different positions) during this process. The relative frequency for a tuple \(w\) is the estimate of the probability for \(w\) given a matching pair. Similarly find an estimate for the probability given an unmatch by iterating through tuples for all corresponding rows in the two dataframes in unmatches.
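
A sketch of this counting step, assuming the similarity_tuple helper from the previous sketch and two parallel dataframes whose rows line up positionally:

from collections import Counter

def tuple_probabilities(df_a, df_b):
    # Estimate P[tuple | match] (or P[tuple | unmatch]) by relative frequency.
    counts = Counter(
        similarity_tuple(row_a, row_b)
        for (_, row_a), (_, row_b) in zip(df_a.iterrows(), df_b.iterrows())
    )
    total = len(df_a)
    # Tuples that never occur are simply absent, i.e., have probability 0.
    return {t: count / total for t, count in counts.items()}

# m_probs = tuple_probabilities(*matches)    # estimates of m(w)
# u_probs = tuple_probabilities(*unmatches)  # estimates of u(w)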

Partition Tuples into match, possible match, and unmatch sets

How should we decide which tuples belong in which of the three different categories: match_tuples, possible_tuples, and unmatch_tuples? It depends on our tolerance for two kinds of errors:

  • false positive rate, the probability that we classify an actual unmatch as a match, and
  • false negative rate, the probability that we classify an actual match as an unmatch.

Assume that \(\mu\) is the maximum false positive rate and \(\lambda\) is the maximum false negative rate we are willing to accept, and, given these are satisfied, we want to maximize the number of our matches.

In order to classify which tuples should be associated with matches, unmatches, and possible matches, we do the following—

  1. Assign all tuples \(w\) that have \(m(w) = 0\) and \(u(w) = 0\) to possible_tuples. Intuition: we have not seen either matches or unmatches with these similarity levels in our training data, so our algorithm cannot make any inferences about them, and human intervention will be required.
  2. Sort the rest of the 27 tuples such that the tuples with \(u(w) = 0\) are in front of the list, and the remaining are in decreasing (non-increasing) order of \(m(w)/u(w)\). Intuition: we want a ranked order of how likely a tuple is to be associated with a match. If we have never seen a tuple associated with an unmatch in our training data, then we will assume that any pair with the same tuple is a match. We prioritize these “especially good” tuples in our sorted order. After those, we rank the tuples in terms of the ratio of how often they have been associated with matches, versus the number of times they have been associated with unmatches, and put the ones most likely to be associated with matches earlier.
  3. Starting at the first tuple in this sorted order, find the last tuple \(w_1\) such that the cumulative sum of \(u(w)\) values (for \(w\) from first to \(w_1\)) is at most \(\mu\). Add all tuples from the first till \(w_1\) (inclusive) to match_tuples. Intuition: we want to limit ourselves to no more than a specified false positive rate (\(\mu\)). We have sorted the tuples in order of their being most likely to be associated with matches. We take as many as we can, starting with the best ones, until we are at the point where taking another one would cause us to be likely to exceed our false positive rate because the likelihood that a pair is actually an unmatch has become too high.
  4. Starting at the last tuple in the reverse order, find the furthest-from-last tuple \(w_2\) such that the cumulative sum of \(m(w)\) values is at most \(\lambda\). Add tuples from the last till \(w_2\) (inclusive) to unmatch_tuples. Intuition: we choose the worst tuples, in terms of being likely to be associated with a match, and assign them to be used for unmatches. If we overdo this, then we might exceed our false negative rate, so we stop just before the point where this would happen.
  5. Add any remaining tuples (in the ones that were sorted) to possible_tuples. Intuition: we are unwilling to let an algorithm make any decisions about pairs that have these tuples, because we have already used up our false positive and false negative budget. Any such restaurant pairs will need to be looked at by a human.

There is a subtlety in this description: the above only works when \(w_1\) happens to be ahead of \(w_2\). If they “cross over” instead, then this means that some of the tuples could sensibly be used to assign matches or unmatches, and either way, we would not exceed our targets for false positives or false negatives.

What should we do, if faced with this situation? It comes down to our priorities. If we want to prioritize having more matches (and we do), then we should count all of these ambiguous tuples as being associated with matches. Remember, in this scenario, this is okay, since we still will not exceed our false positive rate in the process.

A note about data structures: you can represent the partition as three sets of tuples or as a dictionary that maps a tuple to a partition ("match", "possible match", or "unmatch"). We used a dictionary, but either approach is fine.
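
Putting the five steps and the crossover rule together, one possible sketch (using the dictionary representation) looks like this, where m_probs and u_probs are the probability estimates from the previous step and missing tuples are treated as having probability 0:

def partition_tuples(m_probs, u_probs, mu, lambda_):
    # Return a dict mapping each of the 27 tuples to "match", "possible match",
    # or "unmatch".
    levels = ("low", "medium", "high")
    all_tuples = [(n, c, a) for n in levels for c in levels for a in levels]

    labels = {}
    seen = []
    for w in all_tuples:
        m_w, u_w = m_probs.get(w, 0.0), u_probs.get(w, 0.0)
        if m_w == 0.0 and u_w == 0.0:
            labels[w] = "possible match"   # step 1: never seen in training
        else:
            seen.append((w, m_w, u_w))

    # Step 2: tuples with u(w) == 0 first, then decreasing m(w)/u(w).
    seen.sort(key=lambda x: x[1] / x[2] if x[2] > 0 else float("inf"), reverse=True)

    # Step 3: take tuples from the front while the cumulative u(w) stays within mu.
    cum_u, i = 0.0, 0
    while i < len(seen) and cum_u + seen[i][2] <= mu:
        cum_u += seen[i][2]
        i += 1

    # Step 4: take tuples from the back while the cumulative m(w) stays within lambda_.
    cum_m, j = 0.0, len(seen)
    while j > 0 and cum_m + seen[j - 1][1] <= lambda_:
        cum_m += seen[j - 1][1]
        j -= 1

    # Crossover: if the two regions overlap, prefer matches (see above).
    j = max(i, j)

    for k, (w, _, _) in enumerate(seen):
        labels[w] = "match" if k < i else ("unmatch" if k >= j else "possible match")
    return labels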

Find matches, possible matches, and unmatches

Once you have the three sets for given maximum false positive and false negative rates, simply iterate through every possible pair of rows in zagat and fodors and classify the pair as a "match", a "possible match", or an "unmatch". Your final result will be a CSV file that includes a row for each possible pair. Each row will have the zagat row index, fodors row index, and category.
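
A rough sketch of this step, reusing the hypothetical similarity_tuple helper and the labels dictionary produced by the partitioning sketch above:

import csv
import itertools

def write_all_pairs(zagat, fodors, labels, output_filename):
    # Classify every (zagat, fodors) pair and write one CSV row per pair.
    with open(output_filename, "w", newline="") as f:
        writer = csv.writer(f)
        for (zi, z_row), (fi, f_row) in itertools.product(zagat.iterrows(),
                                                          fodors.iterrows()):
            category = labels[similarity_tuple(z_row, f_row)]
            writer.writerow([zi, fi, category])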

Block on City

Iterating through all potential pairs is computationally prohibitive for large datasets. So the number of potential pairs is often reduced by applying a constraint, such as only considering pairs that have the same value for a blocking key. Implement a version of the record linkage algorithm that only considers the restaurants that have the exact same city in both datasets.

This version will be similar to the previous one; you should therefore have a single implementation that takes a parameter indicating whether or not to block on city and proceeds accordingly.
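
One way to structure this is to generate candidate index pairs first and classify only those. The sketch below assumes a "city" column name:

import itertools

def candidate_pairs(zagat, fodors, block_on_city):
    # Yield the (zagat_index, fodors_index) pairs that should be classified.
    if not block_on_city:
        yield from itertools.product(zagat.index, fodors.index)
        return
    # Group the Fodor's row labels by city once, then only pair rows whose
    # city strings are exactly equal.
    fodors_by_city = fodors.groupby("city").groups
    for city, z_indices in zagat.groupby("city").groups.items():
        for zi, fi in itertools.product(z_indices, fodors_by_city.get(city, [])):
            yield zi, fi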

Task checklist

Write Python code, with appropriate auxiliary functions and documentation, that implements the following function:

find_matches(output_filename, mu, lambda_, block_on_city)

which takes output_filename, the name of the output file; mu, the maximum false positive rate; lambda_, the maximum false negative rate; and block_on_city, a boolean that indicates whether or not to block on the city. It should generate a CSV file as described above.

In particular, ensure you—

  1. Create zagat, fodors, matches, and unmatches.
  2. Determine relative frequencies for the 27 tuples corresponding to pairs from matches and unmatches.
  3. Sort tuples and create sets match_tuples, possible_tuples, and unmatch_tuples.
  4. Classify each potential pair from zagat and fodors as a match, possible match, or unmatch, possibly after blocking on city.
  5. Generate a CSV file with a row for each pair. The rows should include the zagat row index, fodors row index, and the category (one of match, possible match, or unmatch).

Getting started

We have seeded your repository with a directory for this assignment. To pick it up, change to your capp30122-win-18 directory (where the string username should be replaced with your username), run git pull to make sure that your local copy of the repository is in sync with the server, and then run git pull upstream master to pick up the distribution.

Submission

To submit your assignment, make sure that you have:

Remember to push your code to the server early and often! Also, remember that you can submit as many times as you like before the deadline.

Acknowledgments

This assignment was originally designed and written by Amitabh Chaudhary.