Who are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Predicts individual race/ethnicity using surname, geolocation, and other attributes, such as gender and age. The method utilizes the Bayes' Rule to compute the posterior probability of each racial category for any given individual. The package implements methods described in Imai and Khanna (2015) "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records" .


This R package implements the methods proposed in Imai, K. and Khanna, K. (2016). "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Record." Political Analysis, Vol. 24, No. 2 (Spring), pp. 263-272. doi: 10.1093/pan/mpw001.

Using wru

Here is a simple example that predicts the race/ethnicity of voters based only on their surnames.

library(wru)
data(voters)
predict_race(voter.file = voters, surname.only = T)

The above produces the following output, where the last five columns are probabilistic race/ethnicity predictions (e.g., 'pred.his' is the probability of being Hispanic/Latino):

"Proceeding with surname-only predictions ..."
VoterID    surname state CD county  tract block precinct age sex party PID place  pred.whi   pred.bla   pred.his   pred.asi   pred.oth
      1     Khanna    NJ 12    021 004000  3001        6  29   0   Ind   0 74000 0.0676000 0.00430000 0.00820000 0.86680000 0.05310000
      2       Imai    NJ 12    021 004501  1025           40   0   Dem   1 60900 0.0812000 0.00240000 0.06890000 0.73750000 0.11000000
      3    Velasco    NY 12    061 004800  6001           33   0   Rep   2 51000 0.0594000 0.00260000 0.82270000 0.10510000 0.01020000
      4    Fifield    NJ 12    021 004501  1025           27   0   Dem   1 60900 0.9355936 0.00220022 0.02850285 0.00780078 0.02590259
      5       Zhou    NJ 12    021 004501  1025           28   1   Dem   1 60900 0.0098000 0.00180000 0.00065000 0.98200000 0.00575000
      6   Ratkovic    NJ 12    021 004000  1025           35   0   Ind   0 60900 0.9187000 0.01083333 0.01083333 0.01083333 0.04880000
      7    Johnson    NY  9    061 015100  4000           25   0   Dem   1 51000 0.5897000 0.34630000 0.02360000 0.00540000 0.03500000
      8      Lopez    NJ 12    021 004501  1025           33   0   Rep   2 60900 0.0486000 0.00570000 0.92920000 0.01020000 0.00630000
      9 Wantchekon    NJ 12    021 004501  1025           50   0   Rep   2 60900 0.6665000 0.08530000 0.13670000 0.07970000 0.03180000
     10      Morse    DC  0    001 001301  3005           29   1   Rep   2 50000 0.9054000 0.04310000 0.02060000 0.00720000 0.02370000

In order to predict race/ethnicity based on surnames AND geolocation, a user needs to provide a valid U.S. Census API key to access the census statistics. You may request a U.S. Census API key here. Once you have an API key, you can use the package to download relevant Census geographic data on demand and condition race/ethnicity predictions on geolocation (county, tract, block, or place).

The following example predicts the race/ethnicity of voters based on their surnames, Census tract of residence (census.geo = "tract"), and which party registration (party = "PID"). Note that a valid API key must be provided in the input parameter 'census.key' in order for the function to download the relevant tract-level data.

library(wru)
data(voters)
predict_race(voter.file = voters, census.geo = "tract", census.key = "...", party = "PID")

The above returns the following output.

VoterID    surname state CD county  tract block precinct age sex party PID place    pred.whi     pred.bla     pred.his    pred.asi    pred.oth
      1     Khanna    NJ 12    021 004000  3001        6  29   0   Ind   0 74000 0.081856291 0.0021396565 0.0110451405 0.828313291 0.076645621
      6   Ratkovic    NJ 12    021 004000  1025           35   0   Ind   0 60900 0.916936771 0.0044432219 0.0120276229 0.008532929 0.058059455
      4    Fifield    NJ 12    021 004501  1025           27   0   Dem   1 60900 0.895620643 0.0022078678 0.0139457411 0.023345853 0.064879895
      5       Zhou    NJ 12    021 004501  1025           28   1   Dem   1 60900 0.003164229 0.0006092345 0.0001072684 0.991261466 0.004857802
      2       Imai    NJ 12    021 004501  1025           40   0   Dem   1 60900 0.029936354 0.0009275220 0.0129831039 0.850040743 0.106112277
      8      Lopez    NJ 12    021 004501  1025           33   0   Rep   2 60900 0.231046860 0.0016485574 0.6813780115 0.053180270 0.032746301
      9 Wantchekon    NJ 12    021 004501  1025           50   0   Rep   2 60900 0.817841573 0.0063677130 0.0258733496 0.107254103 0.042663261
      3    Velasco    NY 12    061 004800  6001           33   0   Rep   2 51000 0.223924118 0.0002913000 0.4451163607 0.313431417 0.017236805
      7    Johnson    NY  9    061 015100  4000           25   0   Dem   1 51000 0.241417483 0.6900686166 0.0293556870 0.011105140 0.028053073
     10      Morse    DC  0    001 001301  3005           29   1   Rep   2 50000 0.983300770 0.0006116706 0.0034070782 0.004823439 0.007857042

In predict_race, the census.geo options are "county", "tract", "block" and "place". Here is an example of prediction based on census statistics collected at the level of "place":

data(voters)
predict_race(voter.file = voters, census.geo = "place", census.key = "...", party = "PID")

It is also possible to pre-download Census geographic data, which can save time when running predict_race(). The example dataset 'voters' includes people in DC, NJ, and NY. The following example subsets voters in DC and NJ, and then uses get_census_data() to download Census geographic data in these two states (input parameter 'key' requires valid API key). Census data is assigned to an object named census.dc.nj. The predict_race() statement predicts the race/ethnicity of voters in DC and NJ using the pre-saved Census data (census.data = census.dc.nj). This example conditions race/ethnicity predictions on voters' surnames, block of residence (census.geo = "block"), age (age = TRUE), and party registration (party = "PID").

Please note that the input parameters 'age' and 'sex' must have the same values in get_census_data() and predict_race(), i.e., TRUE in both or FALSE in both. In this case, predictions are conditioned on age but not sex, so age = TRUE and sex = FALSE in both the get_census_data() and predict_race() statements.

library(wru)
data(voters)
voters.dc.nj <- voters[c(-3, -7), ]  # remove two NY cases from dataset
census.dc.nj <- get_census_data(key = "...", state = c("DC", "NJ"), age = TRUE, sex = FALSE)  # create Census data object covering DC and NJ 
predict_race(voter.file = voters.dc.nj, census.geo = "block", census.data = census.dc.nj, age = TRUE, sex = FALSE, party = "PID")

The last two lines above are equivalent to the following:

predict_race(voter.file = voters.dc.nj, census.geo = "block", census.key = "...", age = TRUE, sex = FALSE, party = "PID")

Using pre-downloaded Census data may be useful for the following reasons:

  • You can save a lot of time in future runs of predict_race() if the relevant Census data has already been saved;
  • The machines used to run predict_race() may not have internet access;
  • You can obtain timely snapshots of Census geographic data that match your voter file.

Downloading data using get_census_data() may take a long time, especially at the block level or in large states. If block-level Census data is not required, downloading Census data at the tract level will save time. Similarly, if tract-level Census data is not required, county-level data may be specified in order to save time.

library(wru)
data(voters)
voters.dc.nj <- voters[c(-3, -7), ]  # remove two NY cases from dataset
census.dc.nj2 <- get_census_data(key = "...", state = c("DC", "NJ"), age = TRUE, sex = FALSE, census.geo = "tract")  
predict_race(voter.file = voters.dc.nj, census.geo = "tract", census.data = census.dc.nj2, party = "PID", age = TRUE, sex = FALSE)
predict_race(voter.file = voters.dc.nj, census.geo = "county", census.data = census.dc.nj2, age = TRUE, sex = FALSE)  # Pr(Race | Surname, County)
predict_race(voter.file = voters.dc.nj, census.geo = "tract", census.data = census.dc.nj2, age = TRUE, sex = FALSE)  # Pr(Race | Surname, Tract)
predict_race(voter.file = voters.dc.nj, census.geo = "county", census.data = census.dc.nj2, party = "PID", age = TRUE, sex = FALSE)  # Pr(Race | Surname, County, Party)
predict_race(voter.file = voters.dc.nj, census.geo = "tract", census.data = census.dc.nj2, party = "PID", age = TRUE, sex = FALSE)  # Pr(Race | Surname, Tract, Party)

Or you can also use the census_geo_api() to maually construct a census object. The example below creates a census object with county-level and tract-level data in DC and NJ, while avoiding downloading block-level data. Note that this function has the input parameter 'state' that requires a two-letter state abbreviation to proceed.

censusObj2  = list()
 
county.dc <- census_geo_api(key = "...", state = "DC", geo = "county", age = TRUE, sex = FALSE)
tract.dc <- census_geo_api(key = "...", state = "DC", geo = "tract", age = TRUE, sex = FALSE)
censusObj2[["DC"]] <- list(state = "DC", county = county.dc, tract = tract.dc, age = TRUE, sex = FALSE)
 
tract.nj <- census_geo_api(key = "...", state = "NJ", geo = "tract", age = TRUE, sex = FALSE)
county.nj <- census_geo_api(key = "...", state = "NJ", geo = "county", age = TRUE, sex = FALSE)
censusObj2[["NJ"]] <- list(state = "NJ", county = county.nj, tract = tract.nj, age = TRUE, sex = FALSE)

Note: The age and sex parameters must be consistent when creating the Census object and using that Census object in the predict_race function. If one of these parameters is TRUE in the Census object, it must also be TRUE in the predict_race function.

After saving the data in censusObj2 above, we can condition race/ethnicity predictions on different combinations of input variables, without having to re-download the relevant Census data.

predict_race(voter.file = voters.dc.nj, census.geo = "county", census.data = censusObj2, age = TRUE, sex = FALSE)  # Pr(Race | Surname, County)
predict_race(voter.file = voters.dc.nj, census.geo = "tract", census.data = censusObj2, age = TRUE, sex = FALSE)  # Pr(Race | Surname, Tract)
predict_race(voter.file = voters.dc.nj, census.geo = "county", census.data = censusObj2, party = "PID", age = TRUE, sex = FALSE)  # Pr(Race | Surname, County, Party)
predict_race(voter.file = voters.dc.nj, census.geo = "tract", census.data = censusObj2, party = "PID", age = TRUE, sex = FALSE)  # Pr(Race | Surname, Tract, Party)

A related song

Watch this!

News

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("wru")

0.1-9 by Kabir Khanna, 4 months ago


https://github.com/kosukeimai/wru


Report a bug at https://github.com/kosukeimai/wru/issues


Browse source code at https://github.com/cran/wru


Authors: Kabir Khanna [aut, cre] , Kosuke Imai [aut]


Documentation:   PDF Manual  


GPL (>= 3) license


Imports devtools

Depends on utils

Suggests testthat


See at CRAN