ATTgt
- class differences.attgt.attgt.ATTgt(data: DataFrame, cohort_name: str, strata_name: str | None = None, base_period: str = 'varying', anticipation: int = 0, freq: str | None = None)
Difference-in-differences with:
balanced panels, unbalanced panels, or repeated cross-sections
two or multiple periods
fixed or staggered treatment timing
binary or multi-valued treatment
heterogeneous treatment effects
Based on the work by [CS2021], [CGS2022], and [SZ2020].
- Parameters:
data (DataFrame) –
pandas DataFrame indexed by entity and time. The index can be set with
df = df.set_index(['entity', 'time'])
where df is the DataFrame to use, 'entity' is replaced with the name of the entity column, and 'time' is replaced with the name of the time column.
cohort_name (str) – the name of the cohort column
base_period (str, default: "varying") – either "universal" or "varying"
anticipation (int, default: 0) – The number of time periods before treatment in which units can anticipate participating in the treatment, so that anticipation can affect their untreated potential outcomes.
strata_name (str | None, default: None) – The name of the column to be used in case of multi-valued treatment, used to calculate the cohort-time-stratum ATT. If strata_name is None, fit() will return the cohort-time ATT.
freq (str | None, default: None) – The date frequency of the panel data. Required if the time index is datetime. For example, if the time column is a monthly datetime, then freq='M'. Check the pandas offset aliases for a list of available frequencies.
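The indexing requirement on data can be illustrated with a minimal sketch. The column names and values are hypothetical, and the cohort encoding used here (first treated period, missing for never-treated units) is an assumption that this page does not confirm.

```python
import pandas as pd

# Hypothetical long-format panel: one row per entity and time period.
df = pd.DataFrame(
    {
        "entity": [1, 1, 2, 2],
        "time": [2000, 2001, 2000, 2001],
        # Assumed encoding: first treated period, missing if never treated.
        "cohort": [2001, 2001, None, None],
        "y": [1.0, 2.5, 0.8, 1.1],
    }
)

# Index by (entity, time), as the `data` parameter describes.
df = df.set_index(["entity", "time"])

# With the `differences` package installed, the estimator could then be
# built following the class signature, for example:
#   ATTgt(data=df, cohort_name="cohort")
```

After set_index, the entity and time columns live in the index, so only the cohort and outcome columns remain as regular columns.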
- group_time(feasible: bool = False) → list[dict]
- Returns:
a list of dictionaries, where the keys of each dictionary are
cohort, base_period, time, (stratum)
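The shape of this return value can be pictured with a small standalone sketch. It is not the library's implementation: the function name is hypothetical, the feasible flag and strata are not modeled, periods are assumed to be consecutive integers, and the base-period rule (last pre-treatment period for "universal"; preceding period for pre-treatment cells under "varying") is an assumption borrowed from the convention commonly used with [CS2021].

```python
def group_time_sketch(cohorts, times, base_period="varying", anticipation=0):
    """Enumerate cohort-time cells as dictionaries (illustrative only)."""
    cells = []
    for g in cohorts:
        # Last period assumed to be unaffected by the treatment.
        last_pre = g - anticipation - 1
        for t in times:
            if base_period == "universal" or t >= g - anticipation:
                base = last_pre
            else:
                # Assumed "varying" rule: compare with the preceding period.
                base = t - 1
            # Keep only cells whose base period is observed and differs from t.
            if base in times and base != t:
                cells.append({"cohort": g, "base_period": base, "time": t})
    return cells


times = [2000, 2001, 2002, 2003]
varying = group_time_sketch([2002], times, base_period="varying")
universal = group_time_sketch([2002], times, base_period="universal")
```

Under these assumptions the two settings differ only in the pre-treatment cells; the post-treatment cells share the same base period.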
- fit(formula: str, weights_name: str = None, control_group: str = 'never_treated', base_delta: str | list | dict = 'base', est_method: str | Callable = 'dr', as_repeated_cross_section: bool = None, boot_iterations: int = 0, random_state: int = None, alpha: float = 0.05, cluster_var: list | str = None, split_sample_by: Callable | str | dict = None, n_jobs: int = 1, backend: str = 'loky', progress_bar: bool = True) → DataFrame
Computes the cohort-time-(stratum) average treatment effects:
the effect for each cohort, at each time, (for each stratum).
- Parameters:
formula (str) –
Wilkinson formula for the outcome variable and covariates.
If there are no covariates, the formula must contain only the name of the outcome variable.
# example with covariates
formula = 'y ~ a + b + a:b'
# example without covariates
formula = 'y'
Formulas are implemented using formulaic; refer to its documentation for additional details.
weights_name (str | None, default: None) – The name of the column containing the sampling weights. If None, all observations have the same weight.
control_group (str, default: "never_treated") – either "never_treated" or "not_yet_treated"
base_delta (str | list | dict, default: "base") – Whether to use the base period values of the covariates and/or their delta values, i.e. the change between the value of a covariate at time and its value at the base period.
Available options are:
"base": the value of each covariate is set to its base period value.
"delta": the value of each time-varying covariate is set to the delta. Time-constant covariates included through the formula are dropped, and a warning is issued.
["base", "delta"] or "base_delta": the value of each covariate is set to its base period value, and the value of each time-varying covariate is set to the delta.
{'base': ['a', 'b', ..]}: the value of the specified covariates is set to its base period value, and the value of each time-varying covariate is set to the delta. A warning is issued if the formula includes time-constant covariates that are not included in base_delta.
{'delta': ['c', 'd', ..]}: the value of each covariate is set to its base period value, and the value of the specified time-varying covariates is set to the delta. Covariates listed in 'delta' that are not time-varying are removed from the list.
{'base': ['a', 'b', ..], 'delta': ['c', 'd', ..]}: the value of the specified covariates is set to its base period value, and the value of the specified time-varying covariates is set to the delta. A warning is issued if the formula includes time-constant covariates that are not included in 'delta'. Covariates listed in 'delta' that are not time-varying are removed from the list.
est_method (str, default: "dr-mle") –
"dr-mle" or "dr": locally efficient doubly robust DiD estimator, with a logistic propensity score model for the probability of being treated
"dr-ipt": locally efficient doubly robust DiD estimator, with the propensity score estimated using inverse probability tilting
"reg": outcome regression DiD estimator
"std_ipw-mle" or "std_ipw": standardized inverse probability weighted DiD estimator, with a logistic propensity score model for the probability of being treated
as_repeated_cross_section (bool | None, default: None) –
boot_iterations (int, default: 0) –
random_state (int | None, default: None) –
alpha (float, default: 0.05) – The significance level.
cluster_var (str | list | None, default: None) –
split_sample_by (str | Callable | None, default: None) – The name of the column along which to split the data, or a function that takes the data and returns a sample mask for a binary split, for example:
lambda x: x['column name'] >= x['column name'].median()
The estimation of the ATT is run separately for each specified sample; used for heterogeneity analysis.
n_jobs (int, default: 1) – The maximum number of concurrently running jobs. If -1, all CPUs are used.
If ≠ 1, concurrent jobs will be run for two separate tasks:
computing the cohort-time ATT; each cohort-time is assigned to a job
computing the bootstrap; the influence function is split into n_jobs parts and the bootstrap is computed concurrently for each part
Parallelization is implemented using joblib; refer to its documentation for additional details on n_jobs.
backend (str, default: "loky") – Parallelization backend implementation.
Parallelization is implemented using joblib; refer to its documentation for additional details on backend.
progress_bar (bool, default: True) – If True, a progress bar will display the progress over the cohort-time iterations and/or the iterations over the concurrent bootstrap splits (not the bootstrap iterations).
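The callable form of split_sample_by can be sketched as a function that receives the data and returns a boolean mask. The column name income and the data are hypothetical; how fit() labels the two resulting samples is not covered here.

```python
import pandas as pd

# Hypothetical data with a covariate used to split the sample in two.
df = pd.DataFrame({"income": [10, 20, 30, 40], "y": [1.0, 2.0, 3.0, 4.0]})


def above_median(x):
    """Return True for rows at or above the median of the 'income' column."""
    return x["income"] >= x["income"].median()


# The mask marks which rows fall in each of the two samples.
mask = above_median(df)
```

A function like above_median could then be passed as split_sample_by, in place of a column name, to request a binary split.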
- Return type:
A DataFrame with the group time ATTs
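The distinction between the "base" and "delta" options of base_delta can be sketched for a single covariate of a single unit. The helper name and the data are hypothetical, and the sign convention (value at time minus value at the base period) is an assumption read from the parameter description, not the library's code.

```python
def base_and_delta(values_by_time, base_period, time):
    """Return (base value, delta) of one covariate for one unit."""
    base = values_by_time[base_period]
    # Assumed convention: change from the base period to `time`.
    delta = values_by_time[time] - base
    return base, delta


# Hypothetical time-varying covariate observed in three periods.
x = {2000: 3.0, 2001: 3.5, 2002: 5.0}
base, delta = base_and_delta(x, base_period=2001, time=2002)
```

For a time-constant covariate the delta is zero in every period, which is consistent with such covariates being dropped when only "delta" is requested.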
- aggregate(type_of_aggregation: str | None = 'simple', overall: bool = False, difference: bool | list | dict[str, list] = False, alpha: float = 0.05, cluster_var: list | str = None, boot_iterations: int = 0, random_state: int = None, n_jobs: int = 1, backend: str = 'loky') → DataFrame
Aggregates the ATTgt estimates.
- Parameters:
type_of_aggregation (str | None, default: "simple") –
"simple": calculates the weighted average of all cohort-time average treatment effects, with weights proportional to the cohort size.
"event": calculates the average effect in each relative period, i.e. each period relative to the treatment, as in an event study.
"cohort": calculates the average treatment effect in each cohort.
"time": calculates the average treatment effect in each time period.
overall (bool, default: False) – Calculates the average effect within each type_of_aggregation:
- if type_of_aggregation is set to "event", the average effect of the treatment across positive relative periods
- if type_of_aggregation is set to "cohort", the average effect of the treatment across cohorts
- if type_of_aggregation is set to "time", the average effect of the treatment across time periods
difference (bool | list | dict, default: False) – Takes the difference of the estimates.
Available options are:
True: calculates the difference between 2 samples or 2 strata of treatments
Note
Samples difference: if the estimation is run on 2 samples and more than 2 strata, the estimates for the two samples are subtracted, as long as no strata have the same names as the samples; in that case, use a dictionary as indicated below.
Strata difference: if the estimation is run on 2 strata and more than 2 samples, the estimates for the two strata are subtracted, as long as no samples have the same names as the strata; in that case, use a dictionary as indicated below.
[sample-0, sample-1] or [stratum-A, stratum-B]: calculates the difference between the 2 samples or the 2 strata of treatments listed in the argument
Note
Samples difference: if there are strata with the same names as the two samples listed, use a dictionary as indicated below.
Strata difference: if there are samples with the same names as the two strata listed, use a dictionary as indicated below.
{'strata': [stratum-A, stratum-B]} or {'sample_names': [sample-0, sample-1]}: states explicitly whether the listed names refer to strata or to samples
alpha (float, default: 0.05) – The significance level.
cluster_var (str | list | None, default: None) – cluster variables
boot_iterations (int, default: 0) – bootstrap iterations
random_state (int | None, default: None) – seed for bootstrap
n_jobs (int, default: 1) – The maximum number of concurrently running jobs. If -1, all CPUs are used.
If ≠ 1, concurrent jobs will be run for:
computing the bootstrap; the influence function is split into n_jobs parts and the bootstrap is computed concurrently for each part
Parallelization is implemented using joblib; refer to its documentation for additional details on n_jobs.
backend (str, default: "loky") – Parallelization backend implementation.
Parallelization is implemented using joblib; refer to its documentation for additional details on backend.
- Return type:
A DataFrame with the requested aggregation
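The "simple" and "event" aggregations can be sketched by hand on a few cohort-time estimates. The numbers and field names are hypothetical; weighting the event-study averages by cohort size is an assumption, since the description above only states the weights for "simple".

```python
from collections import defaultdict

# Hypothetical post-treatment cohort-time estimates.
att_gt = [
    {"cohort": 2001, "time": 2001, "att": 1.0, "cohort_size": 30},
    {"cohort": 2001, "time": 2002, "att": 2.0, "cohort_size": 30},
    {"cohort": 2002, "time": 2002, "att": 4.0, "cohort_size": 10},
]

# "simple": weighted average of all cells, weights proportional to cohort size.
total = sum(r["cohort_size"] for r in att_gt)
simple = sum(r["att"] * r["cohort_size"] for r in att_gt) / total

# "event": average within each relative period (time minus cohort),
# here also weighted by cohort size (an assumption).
sums = defaultdict(lambda: [0.0, 0])
for r in att_gt:
    e = r["time"] - r["cohort"]
    sums[e][0] += r["att"] * r["cohort_size"]
    sums[e][1] += r["cohort_size"]
event = {e: num / den for e, (num, den) in sorted(sums.items())}
```

The sketch covers point estimates only; standard errors, clustering, and the bootstrap are left out.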
- property sample_names
- property wald_pre_test
- results(type_of_aggregation: str | None = None, overall: bool = False, difference: bool = False, to_dataframe: bool = True, add_info: bool = False)
Provides easy access to cached results. This method must be called after fit() and/or aggregate(), depending on the parameters requested.
- Parameters:
type_of_aggregation (str | None, default: None) –
"simple": returns the weighted average of all cohort-time average treatment effects, with weights proportional to the cohort size.
"event": returns the average effect in each relative period, i.e. each period relative to the treatment, as in an event study.
"cohort": returns the average treatment effect in each cohort.
"time": returns the average treatment effect in each time period.
overall (bool, default: False) – Returns the average effect within each type_of_aggregation:
- if type_of_aggregation is set to "event", the average effect of the treatment across positive relative periods
- if type_of_aggregation is set to "cohort", the average effect of the treatment across cohorts
- if type_of_aggregation is set to "time", the average effect of the treatment across time periods
difference (bool, default: False) – whether to return the most recently estimated difference
to_dataframe (bool, default: True) – whether to return the result in a DataFrame or a list of namedtuples
- Return type:
Either a pandas dataframe or a list of namedtuples
- plot(type_of_aggregation: str | None = None, overall: bool = False, difference: bool = False, estimation_details: bool = True, estimate_in_x_axis: bool = False, **plotting_parameters)
- Parameters:
type_of_aggregation (str | None, default: None) –
"simple": plots the weighted average of all cohort-time average treatment effects, with weights proportional to the cohort size.
"event": plots the average effect in each relative period, i.e. each period relative to the treatment, as in an event study.
"cohort": plots the average treatment effect in each cohort.
"time": plots the average treatment effect in each time period.
overall (bool, default: False) – Plots the average effect within each type_of_aggregation:
- if type_of_aggregation is set to "event", the average effect of the treatment across positive relative periods
- if type_of_aggregation is set to "cohort", the average effect of the treatment across cohorts
- if type_of_aggregation is set to "time", the average effect of the treatment across time periods
difference (bool, default: False) – Takes the difference of the estimates.
Available options are:
True: plots the difference between 2 samples or 2 strata of treatments
estimation_details (bool | list | str, default: True) – whether to include the estimation details in the plot. The format can be modified through plotting_parameters.
estimate_in_x_axis (bool, default: False) – whether to display the ATT estimates on the x-axis
plotting_parameters – a set of parameters to customize the plot. Please refer to the separate documentation for the plotting functionalities built into the library.
- Return type:
An interactive plot for the requested estimates
Citation
[CS2021] Callaway, Brantly, and Pedro HC Sant’Anna. “Difference-in-differences with multiple time periods.” Journal of Econometrics 225, no. 2 (2021): 200-230.
[CGS2022] Callaway, Brantly, Andrew Goodman-Bacon, and Pedro HC Sant’Anna. “Difference-in-differences with a continuous treatment.” arXiv preprint arXiv:2107.02637 (2021).
[SZ2020] Sant’Anna, Pedro HC, and Jun Zhao. “Doubly robust difference-in-differences estimators.” Journal of Econometrics 219, no. 1 (2020): 101-122.