`compstatr` provides a set of tools for creating yearly data sets of St. Louis Metropolitan Police Department (SLMPD) crime data, which are available from January 2008 onward as monthly CSV releases on their website.
The goal of `compstatr` is to provide a suite of tools for working with crime data made public by the City of St. Louis’s Metropolitan Police Department.
Among cities in the United States, St. Louis has had the highest, or one of the highest, violent crime and homicide rates since 2010. It is therefore an important site for social scientists, public health researchers, health care providers, and policy makers seeking to understand the effects of violent crime on urban communities.
The City’s crime data, however, are difficult to work with and present a number of challenges for researchers. These data are inconsistently organized, with all data before 2013 and some months of 2013 itself having eighteen variables. Beginning during 2013, most (but not all) months have twenty variables, many of which are named differently from their pre-2014 counterparts. These inconsistencies, and the fact that working with the data requires managing over 120 spreadsheets that each download with a `.html` file extension, are the motivating force behind `compstatr`.
We therefore provide a set of tools for accessing, preparing, editing, and mapping St. Louis Metropolitan Police Department (SLMPD) crime data, which are available on their website as `.csv` files. The categorization tools that are provided will work with any police department that uses 5- and 6-digit numeric codes to identify specific crimes.
You can install `compstatr` from GitHub with `remotes`:
```r
# install.packages("remotes")
remotes::install_github("slu-openGIS/compstatr")
```
St. Louis data can be downloaded month-by-month from the St. Louis Metropolitan Police Department’s website. `compstatr` assumes that no more than one year of crime data is included in a given folder within your project. The next examples assume you have downloaded all of the data for 2017 and 2018 and saved them in `data/raw/2017` and `data/raw/2018`, respectively. We’ll start by loading the `compstatr` package.
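In a fresh session, that is a single call:

```r
library(compstatr)
```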
The function `cs_prep_year()` can be used to rename files, which download with the wrong file extension (e.g. `January2018.csv.html`). Once renamed, you can load them into what we call year-list objects:
```r
> cs_prep_year(path = "data/raw/2017")
>
> yearList17 <- cs_load_year(path = "data/raw/2017")
```
The SLMPD data are inconsistently organized, and problems that need to be addressed prior to collapsing a year-list into a single object can be identified with `cs_validate()`:
```r
> cs_validate(yearList17, year = 2017)
[1] FALSE
```
If a `FALSE` value is returned, the `verbose = TRUE` argument provides additional detail:
```r
> cs_validate(yearList17, year = 2017, verbose = TRUE)
# A tibble: 12 x 8
   namedMonth codedMonth valMonth codedYear valYear oneMonth varCount valVars
   <chr>      <chr>      <lgl>        <int> <lgl>   <lgl>    <lgl>    <lgl>
 1 January    January    TRUE          2017 TRUE    TRUE     TRUE     TRUE
 2 February   February   TRUE          2017 TRUE    TRUE     TRUE     TRUE
 3 March      March      TRUE          2017 TRUE    TRUE     TRUE     TRUE
 4 April      April      TRUE          2017 TRUE    TRUE     TRUE     TRUE
 5 May        May        TRUE          2017 TRUE    TRUE     FALSE    NA
 6 June       June       TRUE          2017 TRUE    TRUE     TRUE     TRUE
 7 July       July       TRUE          2017 TRUE    TRUE     TRUE     TRUE
 8 August     August     TRUE          2017 TRUE    TRUE     TRUE     TRUE
 9 September  September  TRUE          2017 TRUE    TRUE     TRUE     TRUE
10 October    October    TRUE          2017 TRUE    TRUE     TRUE     TRUE
11 November   November   TRUE          2017 TRUE    TRUE     TRUE     TRUE
12 December   December   TRUE          2017 TRUE    TRUE     TRUE     TRUE
```
In this case, the month of May has the wrong number of variables (26). We can fix this by using `cs_standardize()` to create the correct number of columns (20) and name them appropriately:
```r
> # standardize
> yearList17 <- cs_standardize(yearList17, month = "May", config = 26)
>
> # confirm data are now valid
> cs_validate(yearList17, year = 2017)
[1] TRUE
```
For 2013 and prior years, there will be only 18 variables. The 2013 data need to be fixed month by month because some months are already correct, but years 2008 through 2012 can be fixed en masse:
```r
> yearList08 <- cs_standardize(yearList08, config = 18)
```
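For the 2013 data, the same fix is applied one month at a time. A sketch, assuming `yearList13` was loaded with `cs_load_year()` and that `cs_validate()` flagged January as a non-conforming month (the month chosen here is hypothetical):

```r
# hypothetical: standardize only the 2013 months that still have 18 variables
yearList13 <- cs_standardize(yearList13, month = "January", config = 18)
```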
Once the data have been standardized, we can collapse them into a single object with `cs_collapse()`:
```r
> reports17 <- cs_collapse(yearList17)
```
This gives us all of the crimes reported in 2017. However, there will be crimes that were reported that year that occurred in prior years, and there may also be crimes reported in 2018 that took place in our year of interest. We can address both issues (assuming we have the next year’s data) with `cs_combine()`:
```r
> # load and standardize 2018 data
> cs_prep_year(path = "data/raw/2018")
> yearList18 <- cs_load_year(path = "data/raw/2018")
> cs_validate(yearList18, year = 2018)
[1] TRUE
> reports18 <- cs_collapse(yearList18)
>
> # combine 2017 and 2018 data
> crimes17 <- cs_combine(type = "year", date = 2017, reports17, reports18)
```
We now have a tibble containing all of the known crimes that occurred in 2017 (including those reported in 2018).
Once we have the data prepared, we can easily pull out a specific set of crimes to inspect further. For example, we could identify homicides. In the next few examples, we’ll use the `january2018` example data that come with the package. We’ll start by using `cs_filter_crime()` to select only homicides as well as `cs_filter_count()` to remove any unfounded incidents:
```r
> # load dependencies
> library(compstatr)
> library(ggplot2)
> library(magrittr)
> library(mapview)
>
> # subset homicides and remove unfounded incidents
> janHomicides <- january2018 %>%
+   cs_filter_count(var = Count) %>%
+   cs_filter_crime(var = Crime, crime = "homicide")
```
Next, we’ll check for missing spatial data with `cs_missingXY()`:
```r
> # identify missing spatial data
> janHomicides <- cs_missingXY(janHomicides, varX = XCoord, varY = YCoord, newVar = missing)
>
> # check for any TRUE values
> table(janHomicides$missing)
```
We don’t have any missing spatial data in this example, but if we did, we would need to remove those observations with `dplyr::filter()` (or another subsetting tool). Finally, we can project and map our data:
```r
> # project data
> janHomicides_sf <- cs_projectXY(janHomicides, varX = XCoord, varY = YCoord)
>
> # preview data
> mapview(janHomicides_sf)
```
These data can also be mapped using `ggplot2` once they have been projected:
```r
> library(ggplot2)
> ggplot() +
+   geom_sf(data = janHomicides_sf, color = "red", fill = NA, size = .5)
```
If you work with data from other police departments, the `cs_crime()`, `cs_crime_cat()`, and `cs_filter_crime()` functions may be useful for identifying, grouping, and subsetting by crime, so long as the department uses a standard set of 5- and 6-digit codes based on the UCR system (e.g. `31111`, robbery with a firearm, or `142320`, malicious destruction of property).
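As an illustration of those categorization tools, here is a sketch using the bundled `january2018` data. The argument names below (`var`, `newVar`, `crime`, `output`) are assumed to mirror `cs_filter_crime()` above and are not confirmed here; check the package documentation before relying on them:

```r
library(compstatr)

# sketch: append a logical flag identifying violent crimes
# (argument names are assumptions)
jan18 <- cs_crime(january2018, var = Crime, newVar = violent, crime = "violent")

# sketch: label each incident with a string crime category
jan18 <- cs_crime_cat(jan18, var = Crime, newVar = crimeCat, output = "string")
```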
We wish to thank Taylor Braswell for his significant efforts compiling Stata code early in this project. Taylor’s code was used as a reference when developing this package, and many of the functions reflect issues that he worked to identify.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.