Record Linkage¶
Due: Friday, Feb 16th at 4pm
You must work alone on this assignment.
In this assignment, you will take a pair of datasets containing restaurant names and addresses and link them, i.e., find records in the two datasets that refer to the same restaurant. This task is non-trivial when there are discrepancies in the names and addresses of the same restaurants across the two data sources.
This assignment will give you experience in probabilistic record linkage, a technique that you may find valuable for your project and when working with real-world data in the future.
Here is an overview of the tasks you will perform. Some of these will make more sense after you have read the full description.
- The restaurant datasets are tables of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each dataset consisting of four columns: (i) the sequentially-numbered index from the file, (ii) the restaurant name, (iii) the city, and (iv) the street address (except for the city). Call the dataframes zagat and fodors.
- For record linkage, you will use a training dataset of known links, available in the file known_links.csv. For a given row, the first column indicates the index of a restaurant in the Zagat dataset, and the second column indicates the index of the same restaurant in the Fodor’s dataset. Your next task is to read this dataset into two parallel pandas dataframes: one with the rows corresponding to the indices of the known links from the Zagat dataset, in the order they appear in the known links dataset, and another with the rows corresponding to the indices of the known links from the Fodor’s dataset, again in the order they appear in the known links dataset. Call a pair from the known links a match and the resulting parallel dataframes matches.
- You will also need a pair of datasets that are not matches. We will call a pair of corresponding rows from these datasets an unmatch. You will create these dataframes by choosing random rows from zagat and fodors. Call these the unmatches. (The terms match and unmatch follow the convention in record linkage.)
- You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records (one each from zagat and fodors). We have provided a set of histograms that show the distribution of these scores, for each of the fields (name, city, and address), for matches and for unmatches. You should examine these histograms to understand the rationale for the algorithm we will adopt.
- To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this part you will represent the distances between the fields of a pair of restaurants as a tuple of three values. Each field will be rated as having a "low", "medium", or "high" similarity across the two datasets. (It turns out that categorizing the similarity values into three buckets works better than the two that would result from a simple cutoff.) Since each tuple has three slots, and each can be one of three values, there are 27 possible tuples. Your next task is to count the occurrences of each tuple for the matches and the unmatches. Here you will use the matches and unmatches dataframes constructed earlier.
- Next, you will place the 27 different possible tuples into three categories: match_tuples, possible_tuples, and unmatch_tuples. If a test pair maps to a tuple in the first set, we classify it as a match; if it maps to a tuple in the second set, we classify it as a possible match; otherwise we classify it as an unmatch. (We will trust the decisions made by our algorithm for the clear matches and unmatches, but possible matches need to be inspected manually.)
- It will then be time to find matches. You will iterate through all potential pairs from the two datasets; for each pair, rate the similarities of the three fields, generate the corresponding tuple, and then determine the category to which it belongs. You will generate a CSV file that includes a row for each potential pair. Each row will have the zagat row index, the fodors row index, and the category (one of match, possible match, or unmatch). Use a comma to separate fields in the output file. For example, the line 0,2,match would indicate that the first row of the Zagat data file is a match with the third row of the Fodor’s data file.
- Iterating through all possible pairs is time consuming in general. In the final step, you will repeat the above algorithm, but only consider potential pairs that also have the same value for city.
Create restaurant dataframes¶
The restaurant datasets are tables of names, cities, and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each consisting of four columns: (i) the index number from the csv file, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city).
The review companies actually provide the restaurant information as a single string. We have split this string into the name, city, and address using regular expressions for you. Because an automated process was used, there may be restaurants for which this extraction process was performed imperfectly. This situation is not ideal, but it is realistic, and the data is in sufficiently good shape to use for the probabilistic algorithm we will employ.
- It is best not to name the restaurant name column name, because pandas dataframes have a predefined attribute called name, and you get unexpected results if you use the dataframe_name.column_name style of referring to a column.
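As a starting point, here is a minimal sketch of loading the two files. It assumes, hypothetically, that each file has four columns and no header row; the column labels used here (including restaurant_name, chosen to avoid the name pitfall above) are illustrative, so adjust them to the actual layout of zagat.csv and fodors.csv.

import pandas as pd

def load_restaurants(filename):
    # Assumes four columns and no header row; check the actual files and
    # adjust the column labels as needed.
    return pd.read_csv(filename,
                       header=None,
                       names=["index", "restaurant_name", "city", "address"])

zagat = load_restaurants("zagat.csv")
fodors = load_restaurants("fodors.csv")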
Create the matches pair¶
For record linkage, you will use a training dataset of known links, available in the file known_links.csv. It contains 50 manually-linked pairs of restaurants. In other words, a human has chosen some of the rows in one dataset, and determined the corresponding rows in the other dataset. The first restaurant in the pair is from Zagat and the second from Fodor’s. Rows in the file correspond to these pairs, and the two columns give the indices for a restaurant within the corresponding files.

Using this file and the other dataframes, create a pair of dataframes, each with three columns: the name, city, and address. If the ith row of the known links file contains zx and fy, then the ith row of the first dataframe should contain the row with index zx from the Zagat dataframe, and the ith row of the second dataframe should contain the row with index fy from the Fodor’s dataframe.
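One possible way to build these parallel dataframes is sketched below. It assumes known_links.csv has two header-less columns of 0-based row positions, and that zagat and fodors are the dataframes created above; the column names are hypothetical.

import pandas as pd

def create_matches(zagat, fodors, links_filename="known_links.csv"):
    # Read the 50 known links; assume two columns of 0-based row positions, no header
    links = pd.read_csv(links_filename, header=None, names=["z_index", "f_index"])

    # Keep only the name, city, and address columns (hypothetical labels)
    cols = ["restaurant_name", "city", "address"]
    zagat_matches = zagat.iloc[links["z_index"].values][cols].reset_index(drop=True)
    fodors_matches = fodors.iloc[links["f_index"].values][cols].reset_index(drop=True)
    return zagat_matches, fodors_matches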
Create the unmatches pair¶
It turns out that for record linkage, you also need a dataset of unmatches. For this, choose 1000 random restaurants from zagat and pair them with 1000 random restaurants from fodors.

To make random choices, the pandas sample function for sampling a dataframe is suitable.
Technically, the random pairs may include some pairs that actually match, but the overwhelming number will be unmatches, and we will ignore this minor source of error.
Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? Because we can, and the more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would have been helpful as well, but that requires expensive manual work.
The random choices you make will influence the rest of your results. This makes it hard to debug a program if the bug only shows up for some random choices. Furthermore, making the same random choices in your program as we do in ours makes it easier for us to check your results. Therefore, we want you to seed the random number generator with a fixed value. Specifically, use the following calls to choose the samples:
zs = zagat.sample(1000, replace = True, random_state = 1234)
fs = fodors.sample(1000, replace = True, random_state = 5678)
These calls will each always return the same 1000 rows because the random state is fixed through the seeds 1234 and 5678. In normal circumstances, once your code is working, you would remove these hard-coded seeds. But, we ask that you leave them in to facilitate testing of your submitted code.
Your function should simply return the resulting pair of dataframes: (zs, fs).
Explore Jaro-Winkler scores in histograms¶
To link records, we take a potential pair, one each from zagat and fodors, and decide if their strings are sufficiently similar. Our results are better when the strings are broken up into natural components (like name, city, and address), which we have done for you.
When the components are strings, a natural measure of similarity, or rather the distance between two strings, is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string to another. In practice, however, the Levenshtein distance is computationally intensive, and a related measure, called the Jaro-Winkler distance is used instead.
The exact definition of the Jaro-Winkler distance is somewhat technical (but available here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1, and a complete mismatch (“Medici” and “Subway”) has a score of 0.
You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. Install it on your VM using the standard:
sudo pip3 install -U jellyfish
To find the Jaro-Winkler distance, use, e.g.,:
import jellyfish
score = jellyfish.jaro_winkler("medici", "noodles etc.")
To make sense of the algorithm we will be using for record linkage, you need to gain an understanding of the distribution of the Jaro-Winkler distance between names, cities, and addresses of pairs of restaurants, both when the pairs are matches and when they are unmatches.
Here is a link to a PDF file containing six histograms, labelled appropriately. The first pair of histograms is for the Jaro-Winkler distances between names for restaurants, one each for matches and unmatches. The second and third are similar, except for cities and addresses.
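If you would like to reproduce plots like these yourself (not required for the assignment), here is one possible sketch. It assumes the parallel matches dataframes built earlier, a hypothetical column label restaurant_name, and matplotlib, which may need to be installed separately.

import jellyfish
import matplotlib.pyplot as plt

def plot_score_histogram(df_a, df_b, column, title):
    # Jaro-Winkler score for the given field of each corresponding pair of rows
    scores = [jellyfish.jaro_winkler(str(a), str(b))
              for a, b in zip(df_a[column], df_b[column])]
    plt.hist(scores, bins=20, range=(0.0, 1.0))
    plt.title(title)
    plt.xlabel("Jaro-Winkler score")
    plt.ylabel("Frequency")
    plt.show()

# For example, for the names of the matched pairs:
# plot_score_histogram(zagat_matches, fodors_matches, "restaurant_name", "Names: matches")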
Estimate Tuple probabilities given match or unmatch¶
Observe that the probability of a match does not increase monotonically with the similarity for a field; rather, there are clusters in which this probability is high. For instance, for the address field, the probability is lower in the range 0.80-0.95 than just outside this range. The clusters are even more pronounced for the unmatch pairs.
The above implies that a single “cut-off” on the similarity for a field, or even a group of fields, will not serve our purpose. So we break up the range of similarity values into discrete chunks, which we will call "low", "medium", and "high", and determine the probability of a match for each combination of field-wise chunks separately.
We have adopted (a simplified version of) the probabilistic record
linkage approach proposed by Fellegi and Sunter.
We have provided a simple utility function get_jw_category() in util.py that takes a Jaro-Winkler distance and returns the string "low", "medium", or "high", essentially breaking the range of the Jaro-Winkler score into three blocks. We apply this function to all three fields: names, cities, and addresses. Thus for any pair of restaurants, we can create a tuple (name_category, city_category, address_category) that represents the similarity of the fields for that pair. Whether, during record linkage, the pair should be classified as a match or unmatch, or something in between, will depend on whether that tuple was most closely associated with matches or unmatches when we trained the algorithm, and our tolerance for error.
Specifically, we determine whether a tuple should be classified as a match, possible match, or unmatch, based on estimates of the probability that a matched pair results in that tuple as well as the probability that an unmatched pair results in that tuple. Formally, assume that \((r_1, r_2)\) is a potential pair, \(t(r_1, r_2)\) is the tuple formed from its field similarities, and \(T\) is the set of all possible tuples. For every \(w \in T\) we need estimates for two quantities:

- \(m(w)\), the probability that a matched pair \((r_1, r_2)\) has \(t(r_1, r_2) = w\), and
- \(u(w)\), the probability that an unmatched pair \((r_1, r_2)\) has \(t(r_1, r_2) = w\).
Compute estimates for the former by iterating through all the corresponding rows in the two dataframes in matches, determining their tuples, and counting the frequency of each of the 27 possible tuples (combinations of the three similarity levels in the three different positions) during this process. The relative frequency for a tuple \(w\) is the estimate of the probability for \(w\) given a matching pair. Similarly, find an estimate for the probability given an unmatch by iterating through the tuples for all corresponding rows in the two dataframes in unmatches.
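A minimal sketch of this counting step, assuming the parallel dataframes built earlier, the provided util.get_jw_category(), and hypothetical column labels:

import itertools
import jellyfish
import util

CATEGORIES = ("low", "medium", "high")
ALL_TUPLES = list(itertools.product(CATEGORIES, repeat=3))  # the 27 possibilities
FIELDS = ["restaurant_name", "city", "address"]  # hypothetical column labels

def row_pair_tuple(row_a, row_b):
    # Rate the similarity of each field and combine the ratings into a tuple
    return tuple(util.get_jw_category(jellyfish.jaro_winkler(str(row_a[f]), str(row_b[f])))
                 for f in FIELDS)

def estimate_tuple_probabilities(df_a, df_b):
    # Relative frequency of each of the 27 tuples over the corresponding rows
    counts = {t: 0 for t in ALL_TUPLES}
    for (_, row_a), (_, row_b) in zip(df_a.iterrows(), df_b.iterrows()):
        counts[row_pair_tuple(row_a, row_b)] += 1
    total = len(df_a)
    return {t: counts[t] / total for t in ALL_TUPLES}

# m_probs = estimate_tuple_probabilities(zagat_matches, fodors_matches)
# u_probs = estimate_tuple_probabilities(zs, fs)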
Partition Tuples into match, possible match, and unmatch sets¶
How should we decide which tuples belong in which of the three different categories: match_tuples, possible_tuples, and unmatch_tuples? It depends on our tolerance for two kinds of errors:
- false positive rate, the probability that we classify an actual unmatch as a match, and
- false negative rate, the probability that we classify an actual match as an unmatch.
Assume that \(\mu\) is the maximum false positive rate and \(\lambda\) is the maximum false negative rate we are willing to accept, and, given these are satisfied, we want to maximize the number of our matches.
In order to classify which tuples should be associated with matches, unmatches, and possible matches, we do the following—
- Assign all tuples \(w\) that have \(m(w) = 0\) and \(u(w) = 0\) to possible_tuples. Intuition: we have not seen either matches or unmatches with these similarity levels in our training data, so our algorithm cannot make any inferences about them, and human intervention will be required.
- Sort the rest of the 27 tuples such that the tuples with \(u(w) = 0\) are in front of the list, and the remaining are in decreasing (non-increasing) order of \(m(w)/u(w)\). Intuition: we want a ranked order of how likely a tuple is to be associated with a match. If we have never seen a tuple associated with an unmatch in our training data, then we will assume that any pair with the same tuple is a match. We prioritize these “especially good” tuples in our sorted order. After those, we rank the tuples in terms of the ratio of how often they have been associated with matches versus the number of times they have been associated with unmatches, and put the ones most likely to be associated with matches earlier.
- Starting at the first tuple in this sorted order, find the last tuple \(w_1\) such that the cumulative sum of \(u(w)\) values (for \(w\) from first to \(w_1\)) is at most \(\mu\). Add all tuples from the first till \(w_1\) (inclusive) to match_tuples. Intuition: we want to limit ourselves to no more than a specified false positive rate (\(\mu\)). We have sorted the tuples in order of their being most likely to be associated with matches. We take as many as we can, starting with the best ones, until we are at the point where taking another one would cause us to be likely to exceed our false positive rate because the likelihood that a pair is actually an unmatch has become too high.
- Starting at the last tuple in the reverse order, find the furthest-from-last tuple \(w_2\) such that the cumulative sum of \(m(w)\) values is at most \(\lambda\). Add tuples from the last till \(w_2\) (inclusive) to unmatch_tuples. Intuition: we choose the worst tuples, in terms of being likely to be associated with a match, and assign them to be used for unmatches. If we overdo this, then we might exceed our false negative rate, so we stop just before the point where this would happen.
- Add any remaining tuples (in the ones that were sorted) to possible_tuples. Intuition: we are unwilling to let an algorithm make any decisions about pairs that have these tuples, because we have already used up our false positive and false negative budget. Any such restaurant pairs will need to be looked at by a human.
There is a subtlety in this description: the above only works when \(w_1\) happens to be ahead of \(w_2\). If they “cross over” instead, then this means that some of the tuples could sensibly be used to assign matches or unmatches, and either way, we would not exceed our targets for false positives or false negatives.
What should we do, if faced with this situation? It comes down to our priorities. If we want to prioritize having more matches (and we do), then we should count all of these ambiguous tuples as being associated with matches. Remember, in this scenario, this is okay, since we still will not exceed our false positive rate in the process.
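Here is one possible sketch of this partitioning step, represented as a dictionary from tuple to category (see the note on data structures below). It assumes m_probs and u_probs are the dictionaries of estimated probabilities from the previous step, and it resolves the “cross over” case by favoring matches, as discussed above; treat it as a sketch, not a reference implementation.

import itertools

ALL_TUPLES = list(itertools.product(("low", "medium", "high"), repeat=3))

def partition_tuples(m_probs, u_probs, mu, lambda_):
    # Tuples never seen with either matches or unmatches cannot be classified
    partition = {t: "possible match" for t in ALL_TUPLES
                 if m_probs[t] == 0 and u_probs[t] == 0}

    def sort_key(t):
        # u(w) == 0 tuples go first; the rest in decreasing order of m(w)/u(w)
        if u_probs[t] == 0:
            return (0, 0.0)
        return (1, -m_probs[t] / u_probs[t])

    rest = sorted((t for t in ALL_TUPLES if t not in partition), key=sort_key)

    # From the front: add tuples to the match set while the cumulative u(w)
    # (estimated false positive rate) stays at most mu
    i, cum_u = 0, 0.0
    while i < len(rest) and cum_u + u_probs[rest[i]] <= mu:
        cum_u += u_probs[rest[i]]
        partition[rest[i]] = "match"
        i += 1

    # From the back: add tuples to the unmatch set while the cumulative m(w)
    # (estimated false negative rate) stays at most lambda_; on a cross over,
    # stop at the match boundary so the ambiguous tuples count as matches
    j, cum_m = len(rest) - 1, 0.0
    while j >= i and cum_m + m_probs[rest[j]] <= lambda_:
        cum_m += m_probs[rest[j]]
        partition[rest[j]] = "unmatch"
        j -= 1

    # Whatever remains in the middle is a possible match
    for k in range(i, j + 1):
        partition[rest[k]] = "possible match"

    return partition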
A note about data structures: you can represent the partition as three sets of tuples or as a dictionary that maps a tuple to a partition ("match", "possible match", or "unmatch"). We used a dictionary, but either approach is fine.
Find matches, possible matches, and unmatches¶
Once you have the three sets for given maximum false positive and false negative rates, simply iterate through every possible pair of rows in zagat and fodors and classify the pair as a "match", a "possible match", or an "unmatch". Your final result will be a CSV file that includes a row for each possible pair. Each row will have the zagat row index, fodors row index, and category.
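One way to sketch this step, assuming the partition dictionary from the previous sketch, the provided util.get_jw_category(), hypothetical column labels, and Python's csv module for the output:

import csv
import jellyfish
import util

FIELDS = ["restaurant_name", "city", "address"]  # hypothetical column labels

def classify_all_pairs(zagat, fodors, partition, output_filename):
    # Compare every Zagat row with every Fodor's row and record its category
    with open(output_filename, "w", newline="") as f:
        writer = csv.writer(f)
        for z_index, z_row in zagat.iterrows():
            for f_index, f_row in fodors.iterrows():
                t = tuple(util.get_jw_category(
                              jellyfish.jaro_winkler(str(z_row[c]), str(f_row[c])))
                          for c in FIELDS)
                writer.writerow([z_index, f_index, partition[t]])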
Block on City¶
Iterating through all potential pairs is computationally prohibitive for large datasets. So the number of potential pairs is often reduced by applying a constraint, such as only considering pairs that have the same value for a blocking key. Implement a version of the record linking algorithm that only considers the restaurants that have the exact same city in both datasets.
This version will be similar to the previous one; thus, you should have a single implementation that takes a parameter indicating whether or not to block on city, and proceeds accordingly.
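As a sketch of the blocking idea (using the same hypothetical city column label as above): instead of comparing all pairs, you can group one dataframe by city and only compare rows that share a city. A generator like the following, for example, could feed either version of the comparison loop.

def candidate_pairs(zagat, fodors, block_on_city):
    # Yield (zagat_index, fodors_index) pairs to compare, optionally
    # restricted to pairs whose city fields are exactly equal
    if not block_on_city:
        for z_index in zagat.index:
            for f_index in fodors.index:
                yield z_index, f_index
    else:
        fodors_by_city = fodors.groupby("city").groups  # city -> row labels
        for z_index, z_row in zagat.iterrows():
            for f_index in fodors_by_city.get(z_row["city"], []):
                yield z_index, f_index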
Task checklist¶
Write Python code, with appropriate auxiliary functions and documentation that implements the following function:
find_matches(output_filename, mu, lambda_, block_on_city)
which takes output_filename, the name of the output file, mu, the maximum false positive rate, lambda_, the maximum false negative rate, and block_on_city, a boolean that indicates whether to block on the city or not. It should generate a CSV file as described above.
In particular, ensure you—
- Create zagat, fodors, matches, and unmatches.
- Determine relative frequencies for the 27 tuples corresponding to pairs from matches and unmatches.
- Sort tuples and create the sets match_tuples, possible_tuples, and unmatch_tuples.
- Assign each pair to matches, possible matches, and unmatches for all potential pairs from zagat and fodors, possibly after blocking on city.
- Generate a CSV file with a row for each pair. The rows should include the zagat row index, fodors row index, and the category (one of match, possible match, or unmatch).
Getting started¶
We have seeded your repository with a directory for this assignment. To pick it up, change to your capp30122-win-18-username directory (where the string username should be replaced with your username), run git pull to make sure that your local copy of the repository is in sync with the server, and then run git pull upstream master to pick up the distribution.
Submission¶
To submit your assignment, make sure that you have:
- put your name at the top of your file,
- registered for the assignment using chisubmit,
- added, committed, and pushed record_linkage.py to the git server, and
- run the chisubmit submission command for pa4.
Remember to push your code to the server early and often! Also, remember that you can submit as many times as you like before the deadline.
Acknowledgments¶
This assignment was originally designed and written by Amitabh Chaudhary.