MABLE/Geocorr can be used to generate the "crosswalk file" (aka "correlation list", "equivalency file", etc.) showing how Census Tracts are related to ZIP codes. Just invoke the application (at http://mcdc.missouri.edu/websas/geocorr2k.html) and select the state of interest. Then select Census Tract as the Source geocode and "5-digit ZIP/ZCTA" as the Target geocode from the two side-by-side select lists.
I ran it to generate a list for Missouri and saved a CSV file with codes only (no names for the geographic areas). Here is an excerpt of what it looks like:
county tract zcta5 pop2k afact
29001 9501 63501 1522 0.432
29001 9501 63533 819 0.233
29001 9501 63540 160 0.045
29001 9501 63546 847 0.241
29001 9501 63547 33 0.009
29001 9501 63549 139 0.039
29001 9502 63501 1104 0.371
29001 9502 63544 371 0.125
29001 9502 63546 19 0.006
29001 9502 63549 93 0.031
29001 9502 63559 1386 0.466
29001 9503 63501 3209 1
29001 9504 63501 3905 1
29001 9505 63501 6576 1
29001 9509 63501 2206 1
29001 9510 63501 2588 1
These are all the tracts in Adair County. The first two, 9501 and 9502, are large rural tracts which intersect lots of ZIP codes, while the remaining five tracts are all inside a single ZIP code in the city of Kirksville. The first row tells us that tract 9501 intersects ZIP 63501, with about 1522 people living in the intersection (in 2000), and that this is about 43.2% of the tract's total pop. The tract also intersects with five other ZIP codes.
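If you want a quick sanity check on the file before using it, something like the following Python sketch works. It assumes the CSV was saved as mo_tract_to_zcta.csv with the column headers shown above (a hypothetical file name; Geocorr's actual output may also include an extra row of descriptive column labels that you would need to strip first). Since every resident of a tract falls in exactly one ZIP/ZCTA, the allocation factors for each tract should sum to roughly 1.0:

import csv
from collections import defaultdict

# Sum the allocation factors for each tract; they should total ~1.0.
afact_sums = defaultdict(float)
with open("mo_tract_to_zcta.csv", newline="") as f:
    for row in csv.DictReader(f):
        afact_sums[(row["county"], row["tract"])] += float(row["afact"])

for (county, tract), total in afact_sums.items():
    if abs(total - 1.0) > 0.01:  # allow for rounding in the published factors
        print(f"tract {county}-{tract}: afacts sum to {total:.3f}")

(For the Adair County excerpt above, tract 9501's factors sum to 0.999, which is just rounding.)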
This equivalency file can be linked to your client's data file of census tract summaries. It will be a many-to-one match, with multiple equivalency file records matching a single tract data record (at least for some tracts). The key to converting the data to the target geocode layer (in this case, ZIP codes) is to "disaggregate", or allocate, the data using the allocation factors.

Suppose the tract records have these variables: TotPop (total population), TotHHS (total households), PctAsian (percent Asian population) and PCI (per capita income). We want to estimate the values of these variables for the portion of each tract intersecting each ZIP. For tract 9501 we'll be creating six allocated data records, since the tract intersects six ZIPs. The first output record is for 63501, and we want that ZIP to get about 43.2% of whatever we have for the tract. We apply the allocation factor (.432) to TotPop and TotHHS, since they are simple counts. Granted, we do not know that just because 43.2% of the total pop (in 2000) was in this ZIP the exact same proportion of households will be as well -- but the estimation method presumes that 43.2% is a reasonable estimate of the true value.

Allocating the PctAsian and PCI variables is a little trickier, since these are not simple counts. Fortunately, there is a simple way to handle such variables, provided you have their "weight" variables on the record. The weight variable is the one used in the denominator when calculating the percentage or mean item. In this example PctAsian is calculated by taking Asian (count of Asian persons) as a percentage of TotPop, and PCI is calculated by dividing total aggregate income by TotPop. By applying the weight variable to these items we get the numerator values (total Asian persons, total income dollars), which we can then allocate and aggregate. We just have to remember to complete the process after aggregating by dividing the weighted sums by the aggregated denominators.

In our example we would generate six intermediate records for the six ZIP codes intersecting tract 9501. Each of these records would have a PCI variable created by weighting and then allocating the tract's PCI value. These values would be the estimated total income dollars for each intersection area (weighting PCI by TotPop gives you total income dollars; applying the afact approximates how many of those income dollars should be allocated to the ZIP). After we generate all these intermediate ZIP records we sort and aggregate by ZIP; as part of the aggregation step we take those weighted sums and divide them by the aggregated weight variables. When we do this the aggregated PCI values (which at that point are actually total income dollars) are divided by the aggregated TotPop, making them average dollars per person again. The sketch below walks through the whole pipeline.
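To make the procedure concrete, here is a rough Python/pandas sketch of the pipeline using the Adair County excerpt. The crosswalk rows and the tract population totals (3520 and 2973) come from the excerpt above; the TotHHS, PctAsian, and PCI values are made up for illustration. (Our own production code is SAS, as noted below; this is just the same logic in another language.)

import pandas as pd

# Crosswalk records for tracts 9501 and 9502, from the Geocorr excerpt.
rows = [
    ("29001", "9501", "63501", 0.432), ("29001", "9501", "63533", 0.233),
    ("29001", "9501", "63540", 0.045), ("29001", "9501", "63546", 0.241),
    ("29001", "9501", "63547", 0.009), ("29001", "9501", "63549", 0.039),
    ("29001", "9502", "63501", 0.371), ("29001", "9502", "63544", 0.125),
    ("29001", "9502", "63546", 0.006), ("29001", "9502", "63549", 0.031),
    ("29001", "9502", "63559", 0.466),
]
xwalk = pd.DataFrame(rows, columns=["county", "tract", "zcta5", "afact"])

# Hypothetical tract summary file: one record per tract.
tracts = pd.DataFrame({
    "county":   ["29001", "29001"],
    "tract":    ["9501", "9502"],
    "TotPop":   [3520, 2973],      # from the excerpt (sums of pop2k)
    "TotHHS":   [1410, 1190],      # made up
    "PctAsian": [0.4, 0.3],        # made up; percent of TotPop
    "PCI":      [14250, 13800],    # made up; dollars per person
})

# Many-to-one merge: several crosswalk rows match each tract record.
m = xwalk.merge(tracts, on=["county", "tract"])

# Simple counts are allocated directly by the allocation factor.
m["TotPop_alloc"] = m["TotPop"] * m["afact"]
m["TotHHS_alloc"] = m["TotHHS"] * m["afact"]

# Weighted items: convert to their numerators first (Asian persons,
# aggregate income dollars), then allocate those counts.
m["Asian_alloc"]  = m["TotPop"] * (m["PctAsian"] / 100.0) * m["afact"]
m["Income_alloc"] = m["TotPop"] * m["PCI"] * m["afact"]

# Aggregate the intermediate records by ZIP, then divide the weighted
# sums by the aggregated weights to recover the percent and the mean.
zips = m.groupby("zcta5")[["TotPop_alloc", "TotHHS_alloc",
                           "Asian_alloc", "Income_alloc"]].sum()
zips["PctAsian"] = 100.0 * zips["Asian_alloc"] / zips["TotPop_alloc"]
zips["PCI"]      = zips["Income_alloc"] / zips["TotPop_alloc"]
print(zips[["TotPop_alloc", "TotHHS_alloc", "PctAsian", "PCI"]])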
Handling weighted averages is probably the trickiest part of the data allocation process. Except, that is, for handling medians. You can always cheat and treat medians as if they were means and take a weighted average. The gods of mathematics will not smile upon this, however. The preferred way requires that your data contain a distribution table (e.g., if you have Median Household Income, then you need a table showing the distribution of households by household income). You do not directly aggregate the median; instead you aggregate the distribution, then post-process the resulting aggregated data, applying an algorithm that estimates the median from the distribution table. That is more detail than we can go into here, but the sketch below shows the basic idea.
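For the curious, here is one standard such algorithm sketched in Python: linear interpolation within the bracket of the distribution that contains the median case. (This is the generic grouped-data method, not necessarily the exact algorithm the macro referenced below uses; the brackets and counts in the example are made up.)

def median_from_distribution(brackets):
    """Estimate a median from a frequency distribution.

    brackets: list of (lower_bound, upper_bound, count) tuples in
    ascending order, e.g. households by household-income range.
    Interpolates linearly within the bracket containing the median case.
    """
    total = sum(count for _, _, count in brackets)
    half = total / 2.0
    running = 0.0
    for lower, upper, count in brackets:
        if running + count >= half and count > 0:
            # Position of the median case within this bracket.
            return lower + (half - running) / count * (upper - lower)
        running += count
    raise ValueError("empty distribution")

# Example: an aggregated household-income distribution for one ZIP.
dist = [(0, 10000, 120), (10000, 25000, 340),
        (25000, 50000, 410), (50000, 100000, 180)]
print(round(median_from_distribution(dist)))  # ~28963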
We did a paper describing our work of allocating some 1990 census data to telephone Wire Centers back around 1999. It can be accessed at http://www.oseda.missouri.edu/telexdata/Allocating.html. It contains links to some SAS code that we use to apply our algorithms, especially an aggregation macro that handles allocation factors and variables that need to be handled as weighted averages. (There is also a link to a macro that calculates medians from a distribution table.)