Convenience Functions for Education Data

Collection of convenience functions to make working with administrative records easier and more consistent. Includes functions to clean strings, identify cut points, and quickly combine shapefiles and data frames for plotting. Also includes three example data sets of administrative education records for learning how to process records with errors.


Introduction ============

eeptools is an R package that seeks to make it easier for analysts at state and local education agencies to analyze and visualize their data on student, school, and district performance. By putting simple wrappers around a number of R functions, eeptools strives to make many common tasks simpler and less prone to error specific to analysis of education data.

Datasets

eeptools provides three new datasets of interest to education researchers. These datasets are also used in the R Bootcamp for Education Analysts

library(eeptools) #> Warning: package 'ggplot2' was built under R version 3.2.4
data("stuatt")
head(stuatt)
#>   sid school_year male race_ethnicity birth_date
#> 1   1        2004    1              B      10869
#> 2   1        2005    1              H      10869
#> 3   1        2006    1              H      10869
#> 4   1        2007    1              H      10869
#> 5   2        2006    0              W      11948
#> 6   2        2007    0              B      11948
#>   first_9th_school_year_reported hs_diploma      hs_diploma_type
#> 1                           2004          0                     
#> 2                           2004          0                     
#> 3                           2004          0                     
#> 4                           2004          0                     
#> 5                             NA          1     Standard Diploma
#> 6                             NA          1 College Prep Diploma
#>   hs_diploma_date
#> 1                
#> 2                
#> 3                
#> 4                
#> 5        6/5/2008
#> 6       5/24/2009

The stuatt, student attributes, dataset is provided from the Strategic Data Project Toolkit for Effective Data Use. This dataset is useful for learning how to clean data in R and how to aggregate and summarize individual unit-record data into group-level data.

data(stulevel)
head(stulevel)
#>     X school  stuid grade schid dist white black hisp indian asian econ
#> 1  44      1 149995     3   495  105     0     1    0      0     0    0
#> 2  53      1  13495     3   495   45     0     1    0      0     0    1
#> 3 116      1 106495     3   495   45     0     1    0      0     0    1
#> 4 244      1  45205     3   205   15     0     1    0      0     0    1
#> 5 274      1 142705     3   205   75     0     1    0      0     0    1
#> 6 276      1  14995     3   495  105     0     1    0      0     0    1
#>   female ell disab sch_fay dist_fay luck   ability    measerr      teachq
#> 1      0   0     0       0        0    0  87.85405  11.133264 39.09024712
#> 2      0   0     0       0        0    1  97.78756   6.822394  0.09848192
#> 3      0   0     0       0        0    0 104.49303  -7.856159 39.53885270
#> 4      0   0     0       0        0    1 111.67151 -17.574152 24.11612277
#> 5      0   0     0       0        0    0  81.92539  52.983338 56.68061304
#> 6      0   0     0       0        0    0 101.92904  22.604145 71.62196655
#>   year attday schoolscore district schoolhigh schoolavg schoollow   readSS
#> 1 2000    180    29.22427        3          0         1         0 357.2865
#> 2 2000    180    55.96326        3          0         1         0 263.9046
#> 3 2000    160    55.96326        3          0         1         0 369.6722
#> 4 2000    168    55.96326        3          0         1         0 346.5957
#> 5 2000    156    55.96326        3          0         1         0 373.1254
#> 6 2000    157    55.96326        3          0         1         0 436.7607
#>     mathSS     proflvl race
#> 1 387.2803       basic    B
#> 2 302.5724 below basic    B
#> 3 365.4614       basic    B
#> 4 344.4964       basic    B
#> 5 441.1581       basic    B
#> 6 463.4033  proficient    B

The stulevel dataset is a simulated student-level longitudinal record. It contains student and school level attributes and is useful for practicing evaluating longitudinal analyses of student unit-record data.

data("midsch")
head(midsch)
#>   district_id school_id subject grade n1   ss1 n2   ss2 predicted
#> 1          14       130    math     4 44 433.1 40 463.0  468.7446
#> 2          70        20    math     4 18 443.0 20 477.2  476.4765
#> 3         112        80    math     4 86 445.4 94 472.6  478.3509
#> 4         119        50    math     4 95 427.1 94 460.7  464.0586
#> 5         147        60    math     4 27 424.2 27 458.7  461.7937
#> 6         147       125    math     4 17 423.5 26 463.1  461.2470
#>    residuals     resid_z     resid_t       cooks test_year     tprob
#> 1 -5.7445937 -0.59189645 -0.59170988 0.000171271      2007 0.2787298
#> 2  0.7235053  0.07455731  0.07452135 0.000003510      2007 0.4706873
#> 3 -5.7508949 -0.59266905 -0.59248250 0.000244921      2007 0.2774827
#> 4 -3.3585931 -0.34605798 -0.34591020 0.000059900      2007 0.3650957
#> 5 -3.0936928 -0.31877383 -0.31863490 0.000054100      2007 0.3762745
#> 6  1.8530072  0.19093568  0.19084643 0.000019800      2007 0.4250936
#>   flagged_t95
#> 1           0
#> 2           0
#> 3           0
#> 4           0
#> 5           0
#> 6           0

The midsch dataset contains an analysis on abnormality in school average assessment scores. It contains observed and predicted values of aggregated test scores at the school level for a large midwestern state.

Administrative Data Functions

For analysts using unit-record data of some type, there are several calc functions which automate common tasks including calculating ages (age_calc), grade retention (retained_calc), and student mobility (moves_calc).

age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), 
         units = "years")
#> [1] 8.087671
age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), 
         units = "months")
#> [1] 97.03571
age_calc(dob = as.Date('1995-01-15'), enddate = as.Date('2003-02-16'), 
         units = "days")
#> Time difference of 2954 days

age_calc also now properly accounts for leap years and leap seconds by default.

retained_calc takes a vector of student identifiers and a vector of grades and checks whether or not the student was retained in the grade level specified by the user. It returns a data.frame of all students who could have been retained and a yes or no indicator of whether they were retained.

x <- data.frame(sid = c(101, 101, 102, 103, 103, 103, 104, 105, 105, 106, 106),
                 grade = c(9, 10, 9, 9, 9, 10, 10, 8, 9, 7, 7))
retained_calc(df = x, sid = "sid", grade = "grade", grade_val = 9)
#>   sid retained
#> 1 101        N
#> 2 102        N
#> 3 103        Y
#> 4 105        N

retained_calc is intended to be used after you have processed your data as it does not take into account time or sequence other than the order in which the data is passed to it.

moves_calc is intended to identify based on enrollment dates whether a student experienced a school move within a school year.

df <- data.frame(sid = c(rep(1,3), rep(2,4), 3, rep(4,2)),
                   schid = c(1, 2, 2, 2, 3, 1, 1, 1, 3, 1),
                   enroll_date = as.Date(c('2004-08-26',
                   '2004-10-01', '2005-05-01', '2004-09-01',
                   '2004-11-03', '2005-01-11', '2005-04-02',
                   '2004-09-26', '2004-09-01','2005-02-02'), format='%Y-%m-%d'),
                   exit_date = as.Date(c('2004-08-26', '2005-04-10',
                    '2005-06-15', '2004-11-02', '2005-01-10',
                    '2005-03-01', '2005-06-15', '2005-05-30',
                    NA, '2005-06-15'), format='%Y-%m-%d'))
 
moves <- moves_calc(df, sid = "sid", schid = "schid", enroll_date = "enroll_date", 
                    exit_date = "exit_date")
moves
#>   sid moves
#> 1   1     4
#> 2   2     4
#> 3   3     2
#> 4   4    NA

Manipulate Data

Another set of key functions in the package are to make basic data manipulation easier. One thing users of other statistical packaegs may miss when using R is a convenient function for determining the mode of a vector. The statamode function is designed to do just that. statamode works with numeric, character, and factor data types. It also includes various options for how to deal with a tie demonstrated below.

vecA <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
statamode(vecA, method = "stata")
#> [1] "."
vecB <- c(1, 1, 1, 3:10)
statamode(vecB, method = "last")
#> [1] 1
vecC <- c(1, 1, 1, NA, NA, 5:10)
statamode(vecC, method = "last")
#> [1] 1
vecA <- c(LETTERS[1:10]); vecA <- factor(vecA)
statamode(vecA, method = "last")
#> [1] J
#> Levels: J
vecB <- c("A", "A", "A", LETTERS[3:10]); vecB <- factor(vecB)
statamode(vecB, method = "last")
#> [1] A
#> Levels: A
vecA <- c(LETTERS[1:10])
statamode(vecA, method = "sample")
#> [1] "A"
vecB <- c("A", "A", "A", LETTERS[3:10])
statamode(vecB, method = "stata")
#> [1] "A"
vecC <- c("A", "A", "A", NA, NA, LETTERS[5:10])
statamode(vecC, method = "stata")
#> [1] "A"

There are a number of functions to save you keystrokes like defac for converting a factor to a character, makenum for turning a factor variable into a numeric variable, max_mis for taking the maximum of a vector of numerics and ignoring any NAs (useful for inclusion in do.call or apply constructions). remove_char allows you to quickly gsub out a specific character from a string vector such as an * or .... decomma is a somewhat specialized version of this for processing data where numerics are written with commas. nth_max allows you to identify the 2nd, 3rd, etc. maximum value in a vector.

Regression Models

eeptools includes ways to simplify the use of regression analyses tools recommended by Gelman and Hill 2006 through the gelmansim function, which itself is a wrapper for the arm::sim() function.

require(MASS)
#> Loading required package: MASS
#> Warning: package 'MASS' was built under R version 3.2.4
#Examples of "sim" 
set.seed (1)
J <- 15
n <- J*(J+1)/2
group <- rep (1:J, 1:J)
mu.a <- 5
sigma.a <- 2
a <- rnorm (J, mu.a, sigma.a)
b <- -3
x <- rnorm (n, 2, 1)
sigma.y <- 6
y <- rnorm (n, a[group] + b*x, sigma.y)
u <- runif (J, 0, 3)
dat <- cbind (y, x, group)
# Linear regression 
dat <- as.data.frame(dat)
dat$group <- factor(dat$group)
M3 <- glm (y ~ x + group, data=dat)
cases <- expand.grid(x = seq(-2, 2, by=0.1), 
                     group=seq(1, 14, by=2))
cases$group <- factor(cases$group)
sim.results <- gelmansim(mod = M3, newdata = cases, n.sims=200, na.omit=TRUE)
head(sim.results)
#>      x group       yhats   yhatMin  yhatMax
#> 1 -2.0     1  1.49216152 -5.317070 8.826878
#> 2 -1.9     1  1.05530376 -6.141328 9.084779
#> 3 -1.8     1  1.50069226 -5.341525 8.256558
#> 4 -1.7     1 -0.04991641 -7.027672 6.982415
#> 5 -1.6     1  0.50262366 -6.315685 7.944253
#> 6 -1.5     1 -0.32213552 -7.343319 6.737021

Plotting Themes

The package includes a number of themes for ggplot2 named theme_dpi to reflect their internal use at the Wisconsin Department of Public Instruction. These themes are very similar to the theme_bw() within ggplot2 but with font sizes optimized for presentations and publications.

crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
require(reshape) # for melt
#> Loading required package: reshape
#> Warning: package 'reshape' was built under R version 3.2.5
crimesm <- melt(crimes, id = 1)
states_map <- map_data("state")
#> Warning: package 'maps' was built under R version 3.2.5
p1 <- ggplot(crimes, aes(map_id = state)) + geom_map(aes(fill = Murder), 
                                                     linetype = 1, map = states_map) + 
       expand_limits(x = states_map$long, y = states_map$lat) + labs(title="USA Crime")
p1 <- p1 + coord_map()
p1 + theme_dpi_map()
#> Warning: 'theme_dpi_map' is deprecated.
#> Use 'theme_bw' instead.
#> See help("Deprecated")

There is also a ggplot2 version of plot.lm included:

data(mpg)
mymod <- lm(cty~displ + cyl + drv, data=mpg)
autoplot(mymod)

Finally, there is a convenient method for creating labeled mosaic plots.

sampDat <- data.frame(cbind(x=seq(1,3,by=1), y=sample(LETTERS[6:8], 60, 
                                                        replace=TRUE)),
                        fac=sample(LETTERS[1:4], 60, replace=TRUE))
varnames<-c('Quality','Grade')
crosstabplot(sampDat, "y", "fac", varnames = varnames,  label = TRUE, 
             title = "Crosstab Plot", shade = FALSE)

Helping Out

eeptools is intended to be a useful project for the education analytics community. Contributions are welcomed. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

News

NEWS

  • Thanks to new ggplot2 theme options, the theme_dpi functions are now just wrappers for theme_bw
  • Fixed logical test in age_calc (GH #32)
  • Fixed issue with calls to library in examples (GH #33)
  • Compatibility with ggplot2 and roxygen updates
  • nth_max performance is improved thanks to use of partial sort contributed by @sgibb

This is a major update including removing little used functions and renaming and restructuring functions.

  • A new package vignette is now included
  • nth_max function for finding the nth highest value in a vector
  • retained_calc now accepts user specified values for sid and grade
  • destring function deprecated and renamed to makenum to better reflect the use of the function
  • crosstabs function exported to allow the user to generate the data behind crosstabplot but not draw the plot
  • dropbox_source deprecated, use the rdrop2 package
  • plotForWord function deprecated in favor of packages like knitr and rmarkdown
  • mapmerge2 has been deprecated in favor of a tested mapmerge
  • mosaictabs.labels has been deprecated in favor of crosstabplot
  • nsims in gelmansim was renamed to n.sims to align with the arm package
  • Fixed bug in retained_calc where user specified sid resulted in wrong ids being returned
  • Inserted a meaningful error in age_calc when the enddate is before the date of birth
  • Fixed issue with age_calc which lead to wrong fraction of age during leap years
  • lag_data now can do leads and lags and includes proper error messages
  • fix major bugs for statamode including faulty default to method and returning objects of the wrong class
  • add unit tests and continuous integration support for better package updating
  • fix behavior of max_mis in cases when it is passed an empty vector or a vector of NA
  • leading_zero function made robust to negative values
  • added NA handling options to cutoff and thresh
  • Codebase is now tested with lintr to improve readability
  • moves_calc function from Jason Becker
  • gelmansim function to do post-estimation prediction on new data from model objects using functionality in the arm package
  • statamode updated to work with data.table
  • age_calc function from Jason Becker given new precision option
  • lag_data function to create groupwise nested lags quickly
  • unit tests for decomma, gelmansim, and statamode using testthat package
  • new functions for building maps with shapefiles including

    • mapmerge to merge a dataframe and a shapefile, and
    • ggmapmerge to convert this to a document for making a map in ggplot2
  • statamode updated to allow for multiple methods for handling multiple modes

  • emove_stars deleted and replaced with remove_char to allow for users to specify an arbitrary character string to be removed

  • add plotForWord function to export plots in a Windows MetaFile for inclusion in Microsoft Office documents

  • add age_calc function to allow calculating the age of a vector of birthdates relative to the current date (h/t Jason Becker)

  • fix typos in documentation
  • fix startup message behavior
  • reduce dependencies of the package dramatically so loading is faster and more lightweight

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("eeptools")

1.0.0 by Jared E. Knowles, 7 months ago


Browse source code at https://github.com/cran/eeptools


Authors: Jason P. Becker [ctb], Jared E. Knowles [aut, cre]


Documentation:   PDF Manual  


GPL-3 license


Imports arm, data.table, vcd, memisc, maptools

Depends on ggplot2

Suggests reshape, sp, plyr, mapproj, testthat, maps, stringr, knitr, rmarkdown, MASS


See at CRAN