Tuesday, July 19, 2011

Identifying duplicate records in Excel, ArcGIS

Let me start by saying that I have no doubt there is an easier method. This is just the way I attacked this problem today, and it seemed to work, so I thought I'd share.

There was a question as to why I didn't just use Excel's highlight or delete duplicate action. This is a great addition to the Excel package (I believe it introduced with Excel 2007) and you can read about it here: http://www.lytebyte.com/2008/06/10/highlight-remove-delete-duplicates-excel-2007/. But since I didn't want to blindly delete duplicates, I needed to find a quicker way than eyeballing highlighted cells to investigate and ultimately delete duplicate records.

Problem: I have four layers of patient data, each with over 3000 records. An eyeball review suggests that each layer has duplicate records, defined as records with the same Patient ID and same diagnosis (Dx) date. How do I identify these duplicates so that I can delete them?

Roundabout solution: First, I exported the records for one layer (i.e. 2005 cases) into Excel. I deleted all fields except patient ID, dx date, and STFID. I translated all columns into numbers with no decimal places (select the cells, hit the exclamation point drop down and select "convert to number").

I created a concatenated column named "ID_and_Dx". By using the formula "=CONCATENATE(A2,C2)", I created a record number of patient ID (located in A2) and dx date in numerical form (located in C2).

The important next step: I sorted the data by the concatenated column (ID_and_Dx).

I created a column called "Dup_ID_Dx". This is my indicator variable for duplicate fields. I used the formula "@if(D3=D2,"Dup", "Not Dup"). Filling this formula down through the sheet will mark the first record in a duplicate set as "Dup".

The important next step: I copied the Dup_ID_Dx column, highlighted the next blank column, right click and selected "Paste special", and chose to paste values. This will create a column of "Dup" or "Not Dup" without the formula. I then deleted the formula version column.

One more problem: I noticed that some records have the same patient ID and dx date, but different STFID. I would like to be able to investigate these records later in ArcGIS. How can I denote these kind-of-duplicates in Excel before I bring them back into ArcGIS?

And back to the circuitous solutions: I created another concatenated column and named it "ID_and_Dx_and_STFID". I used the concatenate function as described above to mash up my previously created ID_and_Dx field and my STFID field. Note that if a records couldn't geocode and therefore didn't have an associated STFID, this ID_and_Dx_and_STFID cell only contains the ID_and_Dx value.

I then made another duplicate indicator column, called "ID_Dx_STFID_Dup" and used the formula "@if(F3=F2,"Dup", "Not Dup")" to fill down and identify duplicates. Note that duplicates in this column indicate records that have the same patient ID, Dx date, and STFID. I am actually interested in records that have the same Patient ID and Dx date but different STFID. I will get to these kind-of-duplicates later.

Then it was time to bring it in to ArcGIS10. I created a patient ID field type double and named it "Pat_ID_dbl" on the 2005 case layer on ArcGIS and used the field calculator to populate it with the patient ID. I did this because my duplicate investigation xls document has patient ID listed as a number.

I joined the 2005 layer to the duplicate investigation xls document on the patient ID field. I sorted by the Dup_ID_Dx_ field and selected all duplicate records (using "select by attribute") to calculate my % duplication. I then used "switch selection" to select the non-duplicate records and created a new layer from these selected features (right click on layer name -> data -> export data and make sure you're exporting "selected features"). This will be the layer I will use for my analysis because it contains no duplicate records.

Now, back to the issue of kind-of-duplicates (same patient ID, same Dx date, different STFIDs, indicating that something went quite wrong in the data entry). I used my original, joined layer to select duplicate records (Dup_ID_dx). I created a new layer from this selection (right click layer -> selection -> new layer from selected features). In this layer, I sorted by the kind-of-duplicate field (sort by "ID_Dx_STFID_Dup") so I can investigate the "Not Dup" within these records to try to figure out what's going on with these records. In my case, I noticed all of the patients with the same patient ID and Dx date had the same birthday (DOB). So even though the two kind-of-duplicate records have different home addresses and therefore different STFID's, they seem to indicate that they are true duplicates of the same patient and diagnosis.

No comments:

Post a Comment