[phpBB Debug] PHP Notice: in file /viewtopic.php on line 988: date(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected the timezone 'UTC' for now, but please set date.timezone to select your timezone.
[phpBB Debug] PHP Notice: in file /viewtopic.php on line 988: getdate(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected the timezone 'UTC' for now, but please set date.timezone to select your timezone.
mcdc.missouri.edu • View topic - Using a correlation list to convert data

Using a correlation list to convert data

Discussion of the Mable/Geocorr application, the output it generates, and how to use the output to solve everyday problems.

Using a correlation list to convert data

Postby johnson » Thu Sep 09, 2010 11:39 am

I have a client who has some data that is by census tract that he would like to have converted to ZIP code. I seem to recall that the key to doing this was a file that showed which ZIP code each census tract was in so that you could tag each of your data records with the appropriate ZIP code and then just aggregate the data to get ZIP totals. Some of the data items to be aggregated are averages and percentages and there are even a couple of medians. I have no idea how to deal with those since they can't be simply summed up. I understand that MABLE/Geocorr can be of help in creating the census tract to ZIP code crosswalk file, which is why I am posting here.
johnson
 
Posts: 11
Joined: Thu Sep 09, 2010 8:01 am

Re: Using a correlation list to convert data

Postby blodgettj » Mon Sep 13, 2010 8:41 am

MABLE/Geocorr can be used to define the "crosswalk file" (aka "correlation list", "equivalency file", etc.) showing how Census Tracts are related to ZIP codes. Just invoke the application (at http://mcdc.missouri.edu/websas/geocorr2k.html) and select the state of interest. Then select Census Tract as the Source geocode and "5-digit ZIP/ZCTA" as the Target geocode from the two side-by-side select lists.
I ran it to generate a list for Missouri and saved a csv file with codes only (no names for the geographic areas). Here is an excerpt of what it looks like:
county tract zcta5 pop2k afact

29001 9501 63501 1522 0.432
29001 9501 63533 819 0.233
29001 9501 63540 160 0.045
29001 9501 63546 847 0.241
29001 9501 63547 33 0.009
29001 9501 63549 139 0.039
29001 9502 63501 1104 0.371
29001 9502 63544 371 0.12529001 9502 63546 19 0.006
29001 9502 63549 93 0.031
29001 9502 63559 1386 0.466
29001 9503 63501 3209 1
29001 9504 63501 3905 1
29001 9505 63501 6576 1
29001 9509 63501 2206 1
29001 9510 63501 2588 1

These are all the tracts in Adair county. The first two, 9501 and 9502, are large rural tracts which intersect lots of ZIP codes, while the remaining five tracts are all inside a single ZIP code in the city of Kirksville. The first row tells us that tract 9501 intersects ZIP 63501 with about 1522 people living in the intersection (in 2000) and that this is about 43.2% of the tract's total pop. The tract also intersects with 5 other ZIP codes.

This equivalency file can be linked to your client's data file with census tract summaries. It will be a many-to-1 match with multiple equivalency file records matching a single tract data record (at least for some tracts). The key to converting the data to the target
geocode layer (in this case, ZIP codes) is to "disaggregate", or allocate, the data using the allocation factors. Suppose the tract records have these variables: TotPop (total population), TotHHS (total households), PctAsian (percent Asian population) and PCI (per capita income). We want to estimate the values of these variables for the portions of each tract intersecting with each ZIP. For tract 9501 we'll be creating six allocated data records since the tract intersects with six ZIPs. The first output record is for 63501 and we want that ZIP to get about 43.2% of whatever we have for the tract. We apply the allocation factor (.432) to TotPop and TotHHs since they are simple counts. Granted, we do not know that just because 43.2% of the total pop (in 2000) was in this ZIP that the exact same proportion of households will be as well -- but the estimation method presumes that 43.2 is a reasonable estimate for the true value. Allocating the PctAsian and PCI variables is a little trickier since these are not simple quantities. Fortunately, there is a simple way to handle such variables, provided you have their "weight" variables on the record. The weight variable is the one used in the denominator to calculate the percentage or mean item. In this example PctAsian is calculated by taking Asian (count of Asian person) as a percentage of TotPop, and PCI is calculated by taking the total aggregate income and dividing it by Totpop. By applying the weight variable to these items we get the numerator values (total Asian persons, total income dollars), which we can then allocate and aggregate. We just have to remember to complete the process after aggregating by dividing the weighted sums by the aggregated denominators. In our example we would generate 6 intermediate records for the 6 ZIP codes intersecting with tract 9501. Each of these records would have a a PCI variable creating by weighting and the allocating the tract's PCI value. These values would be the estimated total income dollars for each of the areas (weighting PCI by TotPop gives you total income dollars; applying the "afact" apporoximates how many of these income dollars are to allocated to the ZIP). After we generate all these intermediate ZIP records we sort and aggregate by ZIP; as part of the aggregation step we need to take those weighted aggregations and divide them by the weight variables. When we do this the aggregated PCI values (which are now actually total income dollars) are divided by TotPop thus making it an average dollars per person again.

Handling weighted averages is probably the trickiest part of the data allocation process. Except for handling medians. You can always cheat and treat medians as if they were means and take a weighted average. The gods of mathematics will not smile upon this, however. the preferred way requires that your data contain a distribution table (e.g. if you have Median Household Income, then you need a table showing a distribution of households by household income). You do not directly aggregate the median; instead you aggregate the distribution and then post-process the resulting aggregated data and apply an algorithm that estimates the median from the distribution table. More detail than we can go into here.

We did a paper describing our work of allocating some 1990 census data to telephone Wire Centers back around 1999. It can be accessed at
http://www.oseda.missouri.edu/telexdata/Allocating.html. It contains links to some SAS code that we use to apply our algorithms, especially an aggregation macro that handles allocation factors and variables that need to be handled as weighted averages.
(http://www.oseda.missouri.edu/mscdc/sasmacro/agg.sas. (There is also a link to a macro that calculates medians from a distribution table.)
blodgettj
 
Posts: 41
Joined: Tue Jan 19, 2010 1:52 pm
Location: Columbia, MO


Return to Mable/Geocorr

Who is online

Users browsing this forum: No registered users and 1 guest

cron