a scholarly smorgasbord: mistakes

Showing posts with label mistakes. Show all posts

Thursday, October 20, 2011

Adding coordinate data to a shapefile and lining up your layers

...or, Why is Maryland floating out in the Atlantic Ocean?

If you find that one (or more) of your layers aren't lining up in your mxd file, it's probably a problem with the geographic and/or projection coordinate systems. Your layers need to be projected in the same system. You can check the projections by right clicking the layer in the table of contents, selecting "properties", and looking at the information listed for both Projected Coordinate System and (by scrolling down) Geographic Coordinate System.

If you need to re-project a layer, see this post (steps 1 - 3) on how to. Remember to add your newly projected layer to your mxd file, and remove your "floating in the ocean" layer.

A problem arises if you have a shapefile that lacks coordinate data. You just can't re-project something that hasn't been projected yet.

For example, the 2000 census zip code shapefiles (available here) lack a coordinate system. When you download the zip file, it does not include the projection (.prj) file for the shapefile. Adding this layer will result in the zip code layer floating in the ocean.

To add a coordinate system, open Arc Catalog (not just the shortcut thru ArcMap). Select the shapefile that corresponds to the layer that is missing the coordinate data by double clicking it in the table of contents so you get a popup window. In this window, go to the XY Coordinate System tab, and click "select" to select a pre-defined coordinate system. Chose the appropriate coordinate system, click "add", and then "ok".

Re-add the layer to your mxd file. Check the properties to make sure the coordinate system has been updated. If the layer is still out of line with the other layers on your map, go through the re-projection steps described above.

Source: ArcGIS Desktop Help

Tuesday, July 19, 2011

Identifying duplicate records in Excel, ArcGIS

Let me start by saying that I have no doubt there is an easier method. This is just the way I attacked this problem today, and it seemed to work, so I thought I'd share.

There was a question as to why I didn't just use Excel's highlight or delete duplicate action. This is a great addition to the Excel package (I believe it introduced with Excel 2007) and you can read about it here: http://www.lytebyte.com/2008/06/10/highlight-remove-delete-duplicates-excel-2007/. But since I didn't want to blindly delete duplicates, I needed to find a quicker way than eyeballing highlighted cells to investigate and ultimately delete duplicate records.

Problem: I have four layers of patient data, each with over 3000 records. An eyeball review suggests that each layer has duplicate records, defined as records with the same Patient ID and same diagnosis (Dx) date. How do I identify these duplicates so that I can delete them?

Roundabout solution: First, I exported the records for one layer (i.e. 2005 cases) into Excel. I deleted all fields except patient ID, dx date, and STFID. I translated all columns into numbers with no decimal places (select the cells, hit the exclamation point drop down and select "convert to number").

I created a concatenated column named "ID_and_Dx". By using the formula "=CONCATENATE(A2,C2)", I created a record number of patient ID (located in A2) and dx date in numerical form (located in C2).

The important next step: I sorted the data by the concatenated column (ID_and_Dx).

I created a column called "Dup_ID_Dx". This is my indicator variable for duplicate fields. I used the formula "@if(D3=D2,"Dup", "Not Dup"). Filling this formula down through the sheet will mark the first record in a duplicate set as "Dup".

The important next step: I copied the Dup_ID_Dx column, highlighted the next blank column, right click and selected "Paste special", and chose to paste values. This will create a column of "Dup" or "Not Dup" without the formula. I then deleted the formula version column.

One more problem: I noticed that some records have the same patient ID and dx date, but different STFID. I would like to be able to investigate these records later in ArcGIS. How can I denote these kind-of-duplicates in Excel before I bring them back into ArcGIS?

And back to the circuitous solutions: I created another concatenated column and named it "ID_and_Dx_and_STFID". I used the concatenate function as described above to mash up my previously created ID_and_Dx field and my STFID field. Note that if a records couldn't geocode and therefore didn't have an associated STFID, this ID_and_Dx_and_STFID cell only contains the ID_and_Dx value.

I then made another duplicate indicator column, called "ID_Dx_STFID_Dup" and used the formula "@if(F3=F2,"Dup", "Not Dup")" to fill down and identify duplicates. Note that duplicates in this column indicate records that have the same patient ID, Dx date, and STFID. I am actually interested in records that have the same Patient ID and Dx date but different STFID. I will get to these kind-of-duplicates later.

Then it was time to bring it in to ArcGIS10. I created a patient ID field type double and named it "Pat_ID_dbl" on the 2005 case layer on ArcGIS and used the field calculator to populate it with the patient ID. I did this because my duplicate investigation xls document has patient ID listed as a number.

I joined the 2005 layer to the duplicate investigation xls document on the patient ID field. I sorted by the Dup_ID_Dx_ field and selected all duplicate records (using "select by attribute") to calculate my % duplication. I then used "switch selection" to select the non-duplicate records and created a new layer from these selected features (right click on layer name -> data -> export data and make sure you're exporting "selected features"). This will be the layer I will use for my analysis because it contains no duplicate records.

Now, back to the issue of kind-of-duplicates (same patient ID, same Dx date, different STFIDs, indicating that something went quite wrong in the data entry). I used my original, joined layer to select duplicate records (Dup_ID_dx). I created a new layer from this selection (right click layer -> selection -> new layer from selected features). In this layer, I sorted by the kind-of-duplicate field (sort by "ID_Dx_STFID_Dup") so I can investigate the "Not Dup" within these records to try to figure out what's going on with these records. In my case, I noticed all of the patients with the same patient ID and Dx date had the same birthday (DOB). So even though the two kind-of-duplicate records have different home addresses and therefore different STFID's, they seem to indicate that they are true duplicates of the same patient and diagnosis.

Tuesday, June 7, 2011

ArcGIS fields: What's the difference between "address" and "ARC_Street"?

From Getting to Know ArcGIS Desktop, Ormsby et al:

"When you rematch unmatched addresses, you can edit the values in ARC_Street and ARC_Zone to make matches. You would do this, for example, if you found mistakes in the (original data) table. The ADDRESS and ZIP attributes, on the other hand, preserves the original information from the (original data) table."

Friday, May 6, 2011

Joining insanity in ArcGIS 10, Excel 2010

Problem: I haven't been able to join the Excel file that contains demographic data by census tract to the census tract shape file.

My specific issue is that when I open the Join window, the joinable field names are not selectable... if the CT shapefile field name is selected, the corresponding field name in the Excel file is not available to select, and vice versa.

I downloaded both sets of info from the census website... so why the technical trubs???

After renaming, resaving, repopulating, and refilling my beer, I finally thought to check the field properties for my problem fields in both layers (open attribute table, highlight column of interest, right click column heading, select "Properties", check "Type"). Surprise! My field of interest in my shapefile layer is type "string", while the corresponding field of interest in my Excel file layer is type "double".

Solution: I created a same-type field to complete the join.

Since I couldn't seem to add a string type field to the Excel layer in ArcGIS, I ended up adding a double type field to the shapefile layer: Open attribute table, select "Add field...", named it "GeoId10d" and type "double" leaving the default for precision and scale. Then I started Editor (button that looks like a pencil connecting the dots), selected the appropriate layer to edit, opened the attribute table, right clicked on my new GeoID10d column, selected "Field Calculator" and just double clicked the "GeoID10" column (telling ArcGIS that GeoID10d = [GeoId10]... in English, setting the new GeoID10d column to copy the values from the old GeoID10 column and store those values as type double).

Once I had a matching type field, the join worked! Horray!

Wednesday, April 13, 2011

Entering data field by hand in Arc GIS

Problem: You've opened your attribute table and created a new field. It auto-populated with 0's, and no matter how many times you click those cells, it won't let you edit the attribute table.

Solution: Right click the layer and go to "Edit Features" -> "Start Editing". When you're done, go to the Editor toolbar (if it's not active, click the button around the top center that looks like a pencil connecting dots), then editor dropdown -> "Save Edits" then "Stop Editing".

It seems that you cannot add a field in the editor mode... curious.

Wednesday, March 30, 2011

Setting your working directory

You tried to set your working directory using the location listed in the properties tab of your file of interest by coding something like:

> setwd("C:\Users\Me\Documents\Rfolder\lab1")

But you get the following error:

Error: '\u' used without hex digits in character string starting "C:\u"

Q: What to do???

A: Change all your backslashes \ to forward slashes /

> setwd("C:/Users/Me/Documents/Rfolder/lab1")

> getwd()

[1] "C:/Users/Me/Documents/Rfolder/lab1"

Or you could just shortcut it:

File -> Change dir...

But that may be too easy

Friday, March 18, 2011

Messed up R code?

... And now it won't let you start a new line (aka hitting "enter" starts a continued line, denoted with "+" instead of "<")?

Just hit the red "Stop" button located on the top.