Title: | Analysis via Simulation of Interrupted Time Series (ITS) Data |
---|---|
Description: | Uses simulation to create prediction intervals for post-policy outcomes in interrupted time series (ITS) designs, following Miratrix (2020) <arXiv:2002.05746>. This package provides methods for fitting ITS models with lagged outcomes and variables to account for temporal dependencies. It then conducts inference via simulation, simulating a set of plausible counterfactual post-policy series to compare to the observed post-policy series. This package also provides methods to visualize such data, and also to incorporate seasonality models and smoothing and aggregation/summarization. This work partially funded by Arnold Ventures in collaboration with MDRC. |
Authors: | Luke Miratrix [aut, cre], Brit Henderson [ctb], Chloe Anderson [ctb], Arnold Ventures [fnd], MDRC [fnd] |
Maintainer: | Luke Miratrix <[email protected]> |
License: | GPL-3 |
Version: | 0.1.1 |
Built: | 2024-10-26 03:40:31 UTC |
Source: | https://github.com/lmiratrix/simits |
Take outcome and a list of covariates and add new columns with lagged versions. Assumes rows of dataframe are in time ascending order. Lagged outcome canonically called 'lag.outcome'. Covariates 'lag.XXX'.
add_lagged_covariates(dat, outcomename, covariates = NULL)
add_lagged_covariates(dat, outcomename, covariates = NULL)
dat |
The dataframe |
outcomename |
The outcome of interest (string) |
covariates |
The covariates to lag along with the outcome. This can be either of two things. First, it can be a list of string names. Covariates can also be a function with a "lags" attribute with the listed covariates (as returned by, e.g., make_fit_season_model) (which is a list of string names). NULL if no covariates other than outcome should be lagged. |
Augmented dataframe with lagged covariates as new columns. Will clobber old columns if the names (of form "lag.XXXX") conflict.
data( "newjersey" ) newjersey = add_lagged_covariates(newjersey, "n.warrant", c("sin.m","cos.m" ) ) head( newjersey[ c( "n.warrant", "sin.m", "lag.outcome", "lag.sin.m" ) ] )
data( "newjersey" ) newjersey = add_lagged_covariates(newjersey, "n.warrant", c("sin.m","cos.m" ) ) head( newjersey[ c( "n.warrant", "sin.m", "lag.outcome", "lag.sin.m" ) ] )
Reweight the components of a series to match target weights for several categories. This is a good preprocessing step to adjust for time-varying covariates such as changing mix of case types.
adjust_data( dat, outcomename = "Y", timename = "time", groupname = "G", Nname = "N", pi_star, is_count = FALSE, include_aggregate = FALSE, covariates = NULL )
adjust_data( dat, outcomename = "Y", timename = "time", groupname = "G", Nname = "N", pi_star, is_count = FALSE, include_aggregate = FALSE, covariates = NULL )
dat |
Dataframe of data. Requires an N column of total cases represented in each row. |
outcomename |
Name of column that has the outcome to calculated adjusted values for. |
groupname |
Name of categorical covariate that determines the groups. |
Nname |
Name of column in dat that contains total cases (this is the name of the variable used to generate the weights in pi_star). |
pi_star |
The target weights. Each time point will have its groups re-weighted to match these target weights. |
is_count |
Indicator of whether outcome is count data or a continuous measure (this impacts how aggregation is done). |
include_aggregate |
Include aggregated (unadjusted) totals in the output as well. |
covariates |
Covariates to be passed to aggregation (list of string variable names). |
Dataframe of adjusted data.
data( "meck_subgroup" ) head( meck_subgroup ) pis = calculate_group_weights( "category", Nname="n.cases", meck_subgroup, t_min=0, t_max= max( meck_subgroup$month ) ) pis agg = aggregate_data( meck_subgroup, outcomename="pbail", groupname="category", Nname="n.cases", is_count=FALSE, rich = TRUE, covariates = NULL ) head( agg ) adjdat = adjust_data( meck_subgroup, "pbail", "category", "n.cases", pis, include_aggregate=TRUE ) head( adjdat )
data( "meck_subgroup" ) head( meck_subgroup ) pis = calculate_group_weights( "category", Nname="n.cases", meck_subgroup, t_min=0, t_max= max( meck_subgroup$month ) ) pis agg = aggregate_data( meck_subgroup, outcomename="pbail", groupname="category", Nname="n.cases", is_count=FALSE, rich = TRUE, covariates = NULL ) head( agg ) adjdat = adjust_data( meck_subgroup, "pbail", "category", "n.cases", pis, include_aggregate=TRUE ) head( adjdat )
This will take a dataframe with each row being the outcomes, etc., for a given group for a given time point and aggregate those groups for each time point.
aggregate_data( dat, outcomename = "Y", timename = "time", groupname = "G", Nname = "N", is_count = FALSE, rich = TRUE, covariates = NULL )
aggregate_data( dat, outcomename = "Y", timename = "time", groupname = "G", Nname = "N", is_count = FALSE, rich = TRUE, covariates = NULL )
dat |
Dataframe with one row for each time point and group that we are going to post stratify on. This dataframe should also have an column with passed name "Nname" indicating the number of cases that make up each given row. It should have a column "timename" for the time. |
outcomename |
String name of the outcome variable in dat. |
groupname |
Name of the column that has the grouping categorical variable |
Nname |
Name of variable holding the counts (weight) in each group. |
is_count |
If TRUE the data are counts, and should be aggregated by sum rather than by mean. |
rich |
If TRUE, add a bunch of extra columns with proportions of the time point that are each group and so forth. |
covariates |
group-invariant covariates to preserve in the augmented rich dataframe. These are not used in this method for any calculations. Pass as list of column names of dat |
Dataframe of aggregated data, one row per time point If rich=TRUE many extra columns with further information.
data( "meck_subgroup" ) head( meck_subgroup ) pis = calculate_group_weights( "category", Nname="n.cases", meck_subgroup, t_min=0, t_max= max( meck_subgroup$month ) ) pis agg = aggregate_data( meck_subgroup, outcomename="pbail", groupname="category", Nname="n.cases", is_count=FALSE, rich = TRUE, covariates = NULL ) head( agg ) adjdat = adjust_data( meck_subgroup, "pbail", "category", "n.cases", pis, include_aggregate=TRUE ) head( adjdat )
data( "meck_subgroup" ) head( meck_subgroup ) pis = calculate_group_weights( "category", Nname="n.cases", meck_subgroup, t_min=0, t_max= max( meck_subgroup$month ) ) pis agg = aggregate_data( meck_subgroup, outcomename="pbail", groupname="category", Nname="n.cases", is_count=FALSE, rich = TRUE, covariates = NULL ) head( agg ) adjdat = adjust_data( meck_subgroup, "pbail", "category", "n.cases", pis, include_aggregate=TRUE ) head( adjdat )
This method is used to look at summary statistics such as average impact post-policy, and see how the predictive distribution compares to the observed.
aggregate_simulation_results( orig.data, predictions, outcomename = "Y", timename = "time", summarizer = calculate_average_outcome, ... )
aggregate_simulation_results( orig.data, predictions, outcomename = "Y", timename = "time", summarizer = calculate_average_outcome, ... )
orig.data |
The raw data (dataframe) |
predictions |
The results from process_outcome_model. |
outcomename |
Outcome to use. |
summarizer |
A function to calculate some summary quantity, Default: calculate_average_outcome |
... |
Extra arguments passed to the summarizer function. |
List of length two, with first item being the observed value of the test statistic and the second being a numeric vector representing the emperical reference distribution.
predictions = process_outcome_model( "pbail", mecklenberg, t0=0, R = 5, summarize = FALSE, smooth=FALSE ) sstat = aggregate_simulation_results( orig.data = mecklenberg, outcomename = "pbail", predictions = predictions, time_points = 1:18 ) sstat$t sstat$t.obs
predictions = process_outcome_model( "pbail", mecklenberg, t0=0, R = 5, summarize = FALSE, smooth=FALSE ) sstat = aggregate_simulation_results( orig.data = mecklenberg, outcomename = "pbail", predictions = predictions, time_points = 1:18 ) sstat$t sstat$t.obs
Given a set of simulation runs, estimate average impact over range of time points.
calculate_average_outcome(res, outcomename, timename, time_points = 1:54, ...)
calculate_average_outcome(res, outcomename, timename, time_points = 1:54, ...)
res |
Dataframe of a single series (simulated or otherwise) |
outcomename |
Name of outcome in res |
time_points |
Which time points to average over, Default: 1:18 |
... |
Other parameters (ignored) |
Single number (in this case mean of given time points)
See aggregate_simulation_results
for how
this function would be used.
data( mecklenberg ) calculate_average_outcome( mecklenberg, "pbail", time_points=1:24 ) calculate_average_outcome( mecklenberg, "pbail", time_points = 1:18 )
data( mecklenberg ) calculate_average_outcome( mecklenberg, "pbail", time_points=1:24 ) calculate_average_outcome( mecklenberg, "pbail", time_points = 1:18 )
Calculate overall proportion of cases in each group that lie within a given interval of time defined by t_min and t_max.
calculate_group_weights(groupname, dat, t_min, t_max = Inf, Nname = "N")
calculate_group_weights(groupname, dat, t_min, t_max = Inf, Nname = "N")
groupname |
Name of the column that has the grouping categorical variable |
dat |
Dataframe with one row for each time point and group that we are going to post stratify on. This dataframe should also have an column with passed name "Nname" indicating the number of cases that make up each given row. It should have a column "timename" for the time. |
t_min |
The start time point to aggregate cases over. |
t_max |
The final time point (default is last time point). |
Nname |
Name of variable holding the counts (weight) in each group. |
Dataframe of each group along with overall average group weight in the specified timespan.
data( "meck_subgroup" ) head( meck_subgroup ) pis = calculate_group_weights( "category", Nname="n.cases", meck_subgroup, t_min=0, t_max= max( meck_subgroup$month ) ) pis agg = aggregate_data( meck_subgroup, outcomename="pbail", groupname="category", Nname="n.cases", is_count=FALSE, rich = TRUE, covariates = NULL ) head( agg ) adjdat = adjust_data( meck_subgroup, "pbail", "category", "n.cases", pis, include_aggregate=TRUE ) head( adjdat )
data( "meck_subgroup" ) head( meck_subgroup ) pis = calculate_group_weights( "category", Nname="n.cases", meck_subgroup, t_min=0, t_max= max( meck_subgroup$month ) ) pis agg = aggregate_data( meck_subgroup, outcomename="pbail", groupname="category", Nname="n.cases", is_count=FALSE, rich = TRUE, covariates = NULL ) head( agg ) adjdat = adjust_data( meck_subgroup, "pbail", "category", "n.cases", pis, include_aggregate=TRUE ) head( adjdat )
This function takes a fitted model and uses it to make post-policy predictions by simulating data.
extrapolate_model( M0, dat, outcomename = "Y", timename = "time", t0 = 0, R = 400, summarize = FALSE, smooth = FALSE, smoother = smooth_series, full_output = FALSE, fix_parameters = FALSE, ... )
extrapolate_model( M0, dat, outcomename = "Y", timename = "time", t0 = 0, R = 400, summarize = FALSE, smooth = FALSE, smoother = smooth_series, full_output = FALSE, fix_parameters = FALSE, ... )
M0 |
The fit model |
dat |
Dataframe with data being analyzed. |
outcomename |
Outcome of interest (name of column). |
timename |
Name of the time variable (name of column). |
t0 |
Last pre-policy timepoint |
R |
Number of replications |
summarize |
Boolean, TRUE means collapse all simulated trajectories into single aggregate. FALSE means return all paths. |
smooth |
Boolean. TRUE means fit a smoother to the trajectories and look at distribution of smoothed trajectories. FALSE means look at raw data treajectories. |
smoother |
Function to do smoothing, if smoothing set to TRUE. Default is smooth_series() |
full_output |
TRUE means smoother returns residuals as well as smoothed series. |
fix_parameters |
Keep the parameters in the model M0 as fixed; do not add parameter uncertainty. |
... |
Extra arguments to be passed to smoother (e.g, bandwidth). |
Dataframe with columns corresponding to the simulations. If summarize=TRUE, one row per time point in original data. If FALSE, all the details of all the runs are returned.
process_outcome_model
for wrapper function
for this method that is easier to use.
data("mecklenberg" ) mecklenberg = add_lagged_covariates( mecklenberg, "pbail" ) mecklenberg.pre = dplyr::filter( mecklenberg, month <= 0 ) M0 = fit_model_default( mecklenberg.pre, "pbail" ) res = extrapolate_model( M0, "pbail", mecklenberg, 0, 1, smooth=TRUE) tail( res )
data("mecklenberg" ) mecklenberg = add_lagged_covariates( mecklenberg, "pbail" ) mecklenberg.pre = dplyr::filter( mecklenberg, month <= 0 ) M0 = fit_model_default( mecklenberg.pre, "pbail" ) res = extrapolate_model( M0, "pbail", mecklenberg, 0, 1, smooth=TRUE) tail( res )
This fits the model 'outcomename ~ lag.outcome + time', with no covariates.
fit_model_default(dat, outcomename, timename = "time", lagless = FALSE, ...)
fit_model_default(dat, outcomename, timename = "time", lagless = FALSE, ...)
dat |
Dataframe of pre-policy data to fit model to. Needs a "time" column |
outcomename |
Outcome of interest |
lagless |
Boolean, include the lagged outcome, or not? |
... |
Extra arguments passed to the lm() call. |
A fit model (a 'lm' object from a 'lm()' call) from fitting a simple regression of outcome onto time and lagged time.
mecklenberg = add_lagged_covariates(mecklenberg, "pbail") meck.pre = filter( mecklenberg, month <= 0 ) mod = fit_model_default( meck.pre, "pbail", "month", lagless = TRUE ) summary( mod ) mod = fit_model_default( meck.pre, "pbail", "month", lagless = FALSE ) summary( mod )
mecklenberg = add_lagged_covariates(mecklenberg, "pbail") meck.pre = filter( mecklenberg, month <= 0 ) mod = fit_model_default( meck.pre, "pbail", "month", lagless = TRUE ) summary( mod ) mod = fit_model_default( meck.pre, "pbail", "month", lagless = FALSE ) summary( mod )
Defaults have heavy seasonality, and an extra bump in impact kicks in at 12 months post-policy.
generate_fake_data( t_min = -40, t_max = 9, t0 = 0, rho = 0.5, sd.omega = 1, coef_line = c(20, 0.05), coef_q = c(1, 0, -1, 0), coef_temp = 0.1, coef_sin = c(0, 0), coef_tx = c(0, 0.25, 5) )
generate_fake_data( t_min = -40, t_max = 9, t0 = 0, rho = 0.5, sd.omega = 1, coef_line = c(20, 0.05), coef_q = c(1, 0, -1, 0), coef_temp = 0.1, coef_sin = c(0, 0), coef_tx = c(0, 0.25, 5) )
t_min |
Index of first month |
t_max |
Index of last month |
t0 |
Last pre-policy time point |
rho |
Autocorrelation |
sd.omega |
Standard deviation of the true residual |
coef_line |
Intercept and slope of the main trendline (list of 2). |
coef_q |
Coefficients for the four quarters (list of 4). |
coef_temp |
Coefficient for temperature. |
coef_sin |
Coefficents for sin and cos features (list of 2) |
coef_tx |
Coefficient for treatment post-policy (list of 3, initial offset, initial slope, additional slope past 12 months). Treatment is a piecewise linear function. |
A tibble
having month
, temperature
,
sin.m
, cos.m
, Q1
, Q2
, Q3
, Q4
,
post
, Ystr0
, Ystr
, Y
fdat = generate_fake_data(-100,100, rho = 0.95, coef_q=c(0,0,0,0), coef_temp = 0) plot( fdat$month, fdat$Y, type="l" ) fdat2 = generate_fake_data(-100, 100, rho = 0.0, coef_q=c(0,0,0,0), coef_temp = 0) plot( fdat$month, fdat2$Y, type="l" )
fdat = generate_fake_data(-100,100, rho = 0.95, coef_q=c(0,0,0,0), coef_temp = 0) plot( fdat$month, fdat$Y, type="l" ) fdat2 = generate_fake_data(-100, 100, rho = 0.0, coef_q=c(0,0,0,0), coef_temp = 0) plot( fdat$month, fdat2$Y, type="l" )
This code makes synthetic grouped data that can be used to illustrate benefits of post stratification.
generate_fake_grouped_data( t_min, t0, t_max, method = c("complex", "linear", "jersey") )
generate_fake_grouped_data( t_min, t0, t_max, method = c("complex", "linear", "jersey") )
t_min |
Index of first time point |
t0 |
last pre-policy timepoint |
t_max |
Index of last time point |
method |
Type of post-stratification structure to generate (three designs of 'complex', 'linear' and 'jersey' were originally concieved of when designing simulation studies with different types of structure). |
Dataframe of fake data, with one row per group per time period.
fdat = generate_fake_grouped_data(t_min=-5,t_max=10, t0 = 0) table( fdat$time ) table( fdat$type )
fdat = generate_fake_grouped_data(t_min=-5,t_max=10, t0 = 0) table( fdat$time ) table( fdat$type )
This method builds a ggplot object with the trendline and prediction envelope. It can be customized after the fact by adding more ggplot layers via normal ggplot "+" syntax.
make_envelope_graph(envelope, t0, ylab = "Y", xlab = "month")
make_envelope_graph(envelope, t0, ylab = "Y", xlab = "month")
envelope |
The result of a 'process_outcome_model()' call, i.e. dataframe with columns of original data, imputed data and, potentially, smoothed data. |
t0 |
Last pre-policy timepoint. Will draw vertical line here. |
ylab |
Y label of plot |
xlab |
X label of plot |
Returns (does not yet display) a ggplot plot object containing the time series along with extrapolation and prediction envelope. This plot can be augmented and changed via standard ggplot commands.
The ggplot2
package.
data( "mecklenberg" ) t0 = 0 envelope = process_outcome_model( "pbail", mecklenberg, t0=t0, R = 10, summarize = TRUE, smooth=FALSE ) make_envelope_graph(envelope, t0=t0, ylab = "Proportion given bail") + ggplot2::labs( title="Sample ITS plot") data( "mecklenberg" ) t0 = 0 envelope = process_outcome_model( "pbail", mecklenberg, t0=t0, R = 10, summarize = TRUE, smooth=FALSE ) make_envelope_graph(envelope, t0=t0, ylab = "Proportion given bail") + ggplot2::labs( title="Sample ITS plot")
data( "mecklenberg" ) t0 = 0 envelope = process_outcome_model( "pbail", mecklenberg, t0=t0, R = 10, summarize = TRUE, smooth=FALSE ) make_envelope_graph(envelope, t0=t0, ylab = "Proportion given bail") + ggplot2::labs( title="Sample ITS plot") data( "mecklenberg" ) t0 = 0 envelope = process_outcome_model( "pbail", mecklenberg, t0=t0, R = 10, summarize = TRUE, smooth=FALSE ) make_envelope_graph(envelope, t0=t0, ylab = "Proportion given bail") + ggplot2::labs( title="Sample ITS plot")
This method returns a function that will fit a model both with and without lagged outcomes.
make_fit_season_model(formula, no_lag = NULL)
make_fit_season_model(formula, no_lag = NULL)
formula |
Formula specifying seasonality. No outcome or time needed. |
no_lag |
Formula specifying additional covariates that should be included, but without lag (usually used due to colinearity of lagged outcomes, such as with a sin and cos component). |
You hand it a formula object specifying the seasonality, e.g., " ~ Q2 + Q3 + Q4", if you have quarterly season effects. This method assumes you want models with a linear time component as well, and will add that and an intercept in automatically.
It gives you a function back, that you can use to analyze data.
Callable function. See make_fit_model.
make_fit_model for the type of function this method will generate.
data( "newjersey") modF = make_fit_season_model( ~ temperature, timename = "month" ) newjersey = add_lagged_covariates( newjersey, "n.warrant", covariates = c("temperature") ) modF( newjersey, "n.warrant" )
data( "newjersey") modF = make_fit_season_model( ~ temperature, timename = "month" ) newjersey = add_lagged_covariates( newjersey, "n.warrant", covariates = c("temperature") ) modF( newjersey, "n.warrant" )
Given a fit linear model 'fit0', generate R prediction series starting at t0. This takes model uncertainty into account by pulling from the pseudo-posterior of the model parameters (from Gelman and Hill arm package).
make_many_predictions(fit0, dat, R, outcomename, timename, t0) make_many_predictions_plug(fit0, dat, R, outcomename, timename, t0)
make_many_predictions(fit0, dat, R, outcomename, timename, t0) make_many_predictions_plug(fit0, dat, R, outcomename, timename, t0)
fit0 |
The fit linear model to simulate from. |
dat |
A dataframe with the covariates needed by the model fit0 for both pre and post-policy time points |
R |
Number of series to generate. |
outcomename |
The name of the column in dat which is our outcome. |
t0 |
Last time point of pre-policy. Will start predicting at t0+1. |
A data.frame with the collection of predicted series, one row per time point per replicate (so will have R*nrow(dat) rows).
make_many_predictions_plug()
: This version makes multiple predictions
using estimated parameters without additional uncertainty. This takes point
estimates from the fit model as fixed parameters. WARNING: This method will
not capture true uncertainty as it is not taking parameter uncertainty into
account.
The 'arm' package, see https://cran.r-project.org/package=arm
Also see Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevelhierarchical models (Vol. 1). New York, NY, USA: Cambridge University Press.
data("mecklenberg" ) mecklenberg = add_lagged_covariates( mecklenberg, "pbail" ) mecklenberg.pre = dplyr::filter( mecklenberg, month <= 0 ) M0 = fit_model_default( mecklenberg.pre, "pbail" ) res = make_many_predictions( M0, dat=mecklenberg, outcome="pbail", t0=0, R=2 ) tail( res )
data("mecklenberg" ) mecklenberg = add_lagged_covariates( mecklenberg, "pbail" ) mecklenberg.pre = dplyr::filter( mecklenberg, month <= 0 ) M0 = fit_model_default( mecklenberg.pre, "pbail" ) res = make_many_predictions( M0, dat=mecklenberg, outcome="pbail", t0=0, R=2 ) tail( res )
This helper function gives back a function that takes the resulting simulation data from a single iteration of the simulation, and fits 'fit_model' to it, smoothes the residuals, and puts the predictions from 'fit_model' back.
make_model_smoother(fit_model, covariates)
make_model_smoother(fit_model, covariates)
fit_model |
A function that takes data, fits a linear model, and returns the fit model. This function needs an option to include (or not) lagged covariates. |
covariates |
A dataframe with all covariates needed in the model fitting defined by fit_model. |
This can be used to build smoothers that smooth using models other than the model being used for extrapolation (e.g., a model without temperature).
Resulting functions have the following parameters: 'res' (the data), 't0' (start time), 'outcomename', 'post.only' flag (for smoothing only post data or not), and 'smooth_k', a tuning parameter for degree of smoothing.
A smoother function that can be passed to the smoothing routines. This function is of the form listed above.
data( "newjersey") modA = make_fit_season_model( ~ temperature ) modB = make_fit_season_model( ~ sin.m + cos.m ) newjersey = add_lagged_covariates( newjersey, "n.warrant", covariates = c("sin.m", "cos.m", "temperature") ) smoother = make_model_smoother( fit_model = modA, covariates = newjersey ) class(smoother) # Pass made function to process_outcome_model() envelope = process_outcome_model( "n.warrant", newjersey, t0=-8, R = 1, summarize = TRUE, smooth=TRUE, smoother = smoother, smooth_k = 11, fit.model = modB )
data( "newjersey") modA = make_fit_season_model( ~ temperature ) modB = make_fit_season_model( ~ sin.m + cos.m ) newjersey = add_lagged_covariates( newjersey, "n.warrant", covariates = c("sin.m", "cos.m", "temperature") ) smoother = make_model_smoother( fit_model = modA, covariates = newjersey ) class(smoother) # Pass made function to process_outcome_model() envelope = process_outcome_model( "n.warrant", newjersey, t0=-8, R = 1, summarize = TRUE, smooth=TRUE, smoother = smoother, smooth_k = 11, fit.model = modB )
Mecklenberg data that gives proportion of different charge categories of cases given bail (by month).
meck_subgroup
meck_subgroup
A data frame with 144 rows and 5 variables:
month
integer Month, with 0 being month of policy implementation.
n.cases
integer Number of cases of that subgroup for that month
n.bail
interger Total number of cases given bail for that subgroup for that month
pbail
double Proportion of new cases in given subgroup in that month assigned bail
category
character Category of group (charge type).
Monthly aggregate outcomes of various measures of interest from Mecklenberg. See MDRC Report.
mecklenberg
mecklenberg
A data frame with 54 rows and 10 variables:
month
integer Month, with 0 being month of policy implementation.
karr
integer Total count of arrests.
pbail
double Proportion of cases in a given month assigned bail (or outright detention).
pptrel
double Proportion of cases assigned to pretrial supervised release.
pror
double Proportion of cases released on own recognizance.
pb4c
double Proportion of cases assigned to money bail (alternate coding from pbail, above).
avg_days_initial
double Average number of days spent detained before release due to bail, case resolution, etc.
avg_t2d
double Average time to case resolution (in days).
pstint7
double Proportion detained longer than 7 days.
pstint30
double Proportion detained longer than 30 days.
Montly aggregate counts of arrests of different types in New Jersey.
newjersey
newjersey
A data frame with 106 rows and 11 variables:
month
integer Index of month.
sin.m
double cos of month number
cos.m
double sin of month number
M12
integer Month number
Q1
integer Indicator of 1st quarter.
Q2
integer Indicator of 2nd quarter.
Q3
integer Indicator of 3rd quarter.
Q4
integer Indicator of 4th quarter.
n.warrant
double Number of warrant arrests
n.summons
double Number of summons arrests
n
double Total number of arrests
temperature
double Average temperature in New Jersey that month.
This is the primary function to use to use this approach on a given dataset.
process_outcome_model( dat, outcomename = "Y", timename = "time", t0 = 0, R = 400, summarize = FALSE, smooth = FALSE, smoother = NULL, fit_model = fit_model_default, covariates = NULL, plug_in = FALSE, ... )
process_outcome_model( dat, outcomename = "Y", timename = "time", t0 = 0, R = 400, summarize = FALSE, smooth = FALSE, smoother = NULL, fit_model = fit_model_default, covariates = NULL, plug_in = FALSE, ... )
dat |
Dataframe with a column for time. The time column is assumed to be a sequence of integer values. |
outcomename |
Name of column in dat containing the observed outcomes. |
timename |
Name of column in dat containing the time points. |
t0 |
Last pre-policy timepoint |
R |
Number of simulated pre-policy extrapolations to generate. |
summarize |
Summarise the series? (TRUE/FALSE) |
smooth |
Smooth the series? (TRUE/FALSE) |
smoother |
Function to smooth residuals, if smoothing set to TRUE. If NULL, will dynamically make a model smoother based on the fit_model method if covariates are passed. Otherwise it will use the simple smoother on the outcomes. |
fit_model |
The function used to fit the model to simulate from. (This model could be a seasonality model. Default is simple linear model with no covariates.) |
covariates |
Vector of covariate names of all covariates used in the passed model function fit_model. If null, will attempt to get list of covariates form the "lags" attribute of the passed 'fit_model'. |
plug_in |
Use the estimated parameters as fixed and do not include extra uncertainty of parameter estimation in the simulation. (Not recommended as it destroys inference.) |
... |
Extra arguments to be passed to 'extrapolate_model()' |
Take a given outcome variable, fit an ITS model, use it to extrapolate R plusible trajectories, and then using these trajectories, generate final impact results by averaging (if summarize is set to TRUE).
This function is basically a wrapper for 'extrapolate_model()' with some extra calls to 'make_model_smoother()' to prepare, in the case of smoothing, and adding on a summary trend via 'generate_Ybars()' in the case of summarizing.
If summarize=TRUE, A dataframe with several columns of interest and one row per time point of data. The columns are Ymin and Ymax, the limits of the envelope, 'range', the range of the envelope, 'SE', the standard deviation of the trajectories at that time point, 'Ysmooth' the median smoothed value at that time point (if smoothing), 'Ystar' the median unsmoothed value at that time point (regardless of smooth flag), 'Y', the observed outcome, 'Ysmooth1', the smoothed observed outcomes, and 'Ybar' the predicted outcome given the model with no autoregressive aspect.
If summarize=FALSE, a dataframe of all the raw series generated.
The core internal function that this method is a wrapper
for is extrapolate_model
.
data( "mecklenberg" ) t0 = 0 envelope = process_outcome_model( "pbail", mecklenberg, t0=t0, R = 10, summarize = TRUE, smooth=FALSE ) make_envelope_graph(envelope, t0=t0, ylab = "Proportion given bail") + ggplot2::labs( title="Sample ITS plot")
data( "mecklenberg" ) t0 = 0 envelope = process_outcome_model( "pbail", mecklenberg, t0=t0, R = 10, summarize = TRUE, smooth=FALSE ) make_envelope_graph(envelope, t0=t0, ylab = "Proportion given bail") + ggplot2::labs( title="Sample ITS plot")
simITS
package overviewAnalysis via Simulation of Interrupted Time Series
This package is based on the backbone analytic code for the analyses in, e.g., Redcross et al. (2019) or Golub et al. (2019). See companion paper Miratrix (2020) for technical discussion of the overall approach.
Broadly, this package provides methods for fitting Interrupted Time Series models with lagged outcomes and variables to account for temporal dependencies. It then conducts inference via simulation, simulating a set of plausible counterfactual post-policy series to compare to the observed post-policy series. This package provides methods to visualize such data, and also to incorporate seasonality models and smoothing and aggregation/summarization. See the vignette for a guide of how to conduct such analyses.
Redcross, C., Henderson, B., Valentine, E. & Miratrix, L. (2019). Evaluation of pretrial justice system reforms that use the public safety assessment: Effects in Mecklenburg County, North Carolina. Technical report, MDRC (link)
Golub, C. A., Redcross, C., Valentine, E., & Miratrix, L. (2019). Evaluation of pretrial justice system reforms that use the public safety assessment: Effects of New Jersey’s criminal justice reform. Technical report, MDRC. (link)
Miratrix, L. (2020). Using Simulation to Analyze Interrupted Time Series Designs (link)
Smooth a series by fitting the model to the data, smoothing the residuals, and then putting the model predictions back.
smooth_residuals( res, t0 = 0, outcomename = "Y", timename = "time", post.only = TRUE, smooth_k = SMOOTH_K, fit_model = fit_model_default, covariates = res, full_output = FALSE )
smooth_residuals( res, t0 = 0, outcomename = "Y", timename = "time", post.only = TRUE, smooth_k = SMOOTH_K, fit_model = fit_model_default, covariates = res, full_output = FALSE )
res |
A dataframe with a 'timename' column and an 'outcomename' column (which is the column that will be smoothed). |
t0 |
last pre-policy timepoint |
outcomename |
String name of the outcome variable in dat. Defaut is "Y". |
timename |
Name of the time column. Default is "time". |
post.only |
If TRUE fit model and smooth post-policy only. WHY fit model on post-policy data only? Because this will make sure the fixed pre-policy does not dominate too much? We are focusing on post-policy so we want a good fitting model for that so we can get our residuals as "white noise" as possible before smoothing. |
smooth_k |
A rough proxy for the number of observations to primarily consider to kernal weight in the neighborhood of each timepoint (this is a bandwidth, and the loess smoother gets smooth_k / n as a span value). We want to smooth with an absolute bandwidth, not one as function of how long the time series is. |
fit_model |
A function that takes data, fits a linear model, and returns the fit model. This function needs an option to include (or not) lagged covariates. |
covariates |
A dataframe with all covariates needed in the model fitting defined by fit_model. |
full_output |
If TRUE give back pieces for diagnostics of smoothing process. |
Use loess smoother on complete series of residuals including original data pre-policy and synthetic data post policy (i.e., smooth the entire plausible series).
A numeric vector of the smoothed residuals. If full_output=TRUE return a dataframe with several other columns: 'resid', the residuals based on Ystar and the model, ‘residStar' the smoothed residuals, ’Ybar.sm' the structural predictions of the model used for smoothing. Here the smoothed values will be 'Ysmooth'.
See smooth_series
for a more vanilla version that
smooths without the model fitting step.
data( "newjersey" ) smooth = smooth_series( newjersey, outcomename = "n.warrant", t0= -8, smooth_k = 30, post.only = FALSE) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" ) mod = make_fit_season_model( ~ temperature ) newjersey = add_lagged_covariates( newjersey, outcomename = "n.warrant", covariates = c("temperature") ) smooth = smooth_residuals( newjersey, outcomename = "n.warrant", t0=-8, smooth_k = 30, post.only = FALSE, fit_model = mod ) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" )
data( "newjersey" ) smooth = smooth_series( newjersey, outcomename = "n.warrant", t0= -8, smooth_k = 30, post.only = FALSE) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" ) mod = make_fit_season_model( ~ temperature ) newjersey = add_lagged_covariates( newjersey, outcomename = "n.warrant", covariates = c("temperature") ) smooth = smooth_residuals( newjersey, outcomename = "n.warrant", t0=-8, smooth_k = 30, post.only = FALSE, fit_model = mod ) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" )
Use loess smoother on complete series of residuals including original data pre-policy and synthetic data post policy (i.e., smooth the entire plausible series).
smooth_series( res, outcomename = "Y", timename = "time", t0 = 0, smooth_k = SMOOTH_K, post.only = TRUE, ... )
smooth_series( res, outcomename = "Y", timename = "time", t0 = 0, smooth_k = SMOOTH_K, post.only = TRUE, ... )
res |
A dataframe with a 'timename' column and an 'outcomename' column (which is the column that will be smoothed). |
outcomename |
String name of the outcome variable in dat. Defaut is "Y". |
timename |
Name of the time column. Default is "time". |
t0 |
last pre-policy timepoint |
smooth_k |
A rough proxy for the number of observations to primarily consider to kernal weight in the neighborhood of each timepoint (this is a bandwidth, and the loess smoother gets smooth_k / n as a span value). We want to smooth with an absolute bandwidth, not one as function of how long the time series is. |
post.only |
If TRUE fit model and smooth post-policy only. WHY fit model on post-policy data only? Because this will make sure the fixed pre-policy does not dominate too much? We are focusing on post-policy so we want a good fitting model for that so we can get our residuals as "white noise" as possible before smoothing. |
... |
Extra arguments (not used in this function). |
This method takes several parameters it does not use, to maintain compatability with smooth_residuals.
An updated version of the 'res' dataframe with 'Ysmooth', the smoothed predictions of the original Ystar outcome. Also includes 'Ystar' the original sequence to be smoothed.
data( "newjersey" ) smooth = smooth_series( newjersey, outcomename = "n.warrant", t0= -8, smooth_k = 30, post.only = FALSE) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" ) mod = make_fit_season_model( ~ temperature ) newjersey = add_lagged_covariates( newjersey, outcomename = "n.warrant", covariates = c("temperature") ) smooth = smooth_residuals( newjersey, outcomename = "n.warrant", t0=-8, smooth_k = 30, post.only = FALSE, fit_model = mod ) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" )
data( "newjersey" ) smooth = smooth_series( newjersey, outcomename = "n.warrant", t0= -8, smooth_k = 30, post.only = FALSE) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" ) mod = make_fit_season_model( ~ temperature ) newjersey = add_lagged_covariates( newjersey, outcomename = "n.warrant", covariates = c("temperature") ) smooth = smooth_residuals( newjersey, outcomename = "n.warrant", t0=-8, smooth_k = 30, post.only = FALSE, fit_model = mod ) plot( newjersey$month, newjersey$n.warrant ) lines( newjersey$month, smooth, col="red" )