ATTgt

class differences.attgt.attgt.ATTgt(data: DataFrame, cohort_name: str, strata_name: str | None = None, base_period: str = 'varying', anticipation: int = 0, freq: str | None = None)

Difference in differences with

  • balanced panels, unbalanced panels or repeated cross-section

  • two or multiple periods

  • fixed or staggered treatment timing

  • binary or multi-valued treatment

  • heterogeneous treatment effects

based on the work by [CS2021], [CGS2022], [SZ2020]

Parameters:
  • data (DataFrame) –

    pandas DataFrame indexed by entity and time. Set the index before passing the data:

    df = df.set_index(['entity', 'time'])
    

    where df is the DataFrame to use, ‘entity’ is replaced with the name of the entity column, and ‘time’ is replaced with the name of the time column.

  • cohort_name (str) – the name of the column containing the cohort

  • base_period (str, default: "varying") –

    • "universal"

    • "varying"

  • anticipation (int, default: 0) – The number of time periods before treatment in which units can anticipate participating in the treatment, so that anticipation can affect their untreated potential outcomes.

  • strata_name (str, default: None) –

    The name of the column to be used in case of multi-valued treatment, used to calculate cohort-time-stratum ATT.

    If strata_name is None, fit() will return the cohort-time ATT.

  • freq (str | None, default: None) – The date frequency of the panel data. Required if the time index is datetime. For example, if the time column is a monthly datetime, then freq='M'. See the pandas offset aliases for a list of available frequencies.
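A minimal sketch of preparing the input described above, assuming a long-format panel whose columns are named entity, time, cohort and y (hypothetical names and made-up values). The constructor call uses only the arguments listed in the signature, and the import is kept inside the helper so that the data preparation runs even where the package is not installed.

```python
import pandas as pd

# Toy long-format panel; column names and values are illustrative only.
raw = pd.DataFrame(
    {
        "entity": [1, 1, 1, 2, 2, 2],
        "time": [2000, 2001, 2002, 2000, 2001, 2002],
        "cohort": [2001, 2001, 2001, 2002, 2002, 2002],
        "y": [1.0, 1.5, 2.1, 0.9, 1.0, 1.6],
    }
)

# The data parameter expects the entity and time columns as the index.
df = raw.set_index(["entity", "time"])


def make_estimator(data, freq=None):
    """Build the estimator; assumes the package is installed and the class
    is importable from the module path shown in the signature."""
    from differences.attgt.attgt import ATTgt

    return ATTgt(data=data, cohort_name="cohort", freq=freq)
```

With a datetime time index, the same helper would be called with a frequency, for example make_estimator(df, freq="M") for monthly data.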

group_time(feasible: bool = False) → list[dict]
Returns:

  • a list of dictionaries, each with the keys cohort, base_period, time, (stratum)
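As a sketch of working with a return value of this shape, the dictionaries can be filtered with plain Python. The entries below are hypothetical and only mirror the documented keys; they are not output from the library.

```python
# Hypothetical entries with the documented keys; not real library output.
group_times = [
    {"cohort": 2001, "base_period": 2000, "time": 2001},
    {"cohort": 2001, "base_period": 2000, "time": 2002},
    {"cohort": 2002, "base_period": 2001, "time": 2002},
]

# Keep the cohort-time pairs that belong to a single cohort.
cohort_2001 = [gt for gt in group_times if gt["cohort"] == 2001]
times_2001 = [gt["time"] for gt in cohort_2001]
```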

fit(formula: str, weights_name: str = None, control_group: str = 'never_treated', base_delta: str | list | dict = 'base', est_method: str | Callable = 'dr', as_repeated_cross_section: bool = None, boot_iterations: int = 0, random_state: int = None, alpha: float = 0.05, cluster_var: list | str = None, split_sample_by: Callable | str | dict = None, n_jobs: int = 1, backend: str = 'loky', progress_bar: bool = True) → DataFrame

Computes the cohort-time-(stratum) average treatment effects: the effects for each cohort, in each time period, (and for each stratum).

Parameters:
  • formula (str) –

    Wilkinson formula for the outcome variable and covariates.

    If there are no covariates, the formula must contain only the name of the outcome variable.

    # example with covariates
    formula = 'y ~ a + b + a:b'
    
    # example without covariates
    formula = 'y'
    

    Formulas are implemented using formulaic; refer to its documentation for additional details.

  • weights_name (str | None, default: None) – The name of the column containing the sampling weights. If None, all observations have the same weight.

  • control_group (str, default: "never_treated") –

    • "never_treated"

    • "not_yet_treated"

  • base_delta (str | list | dict, default: "base") –

    Whether to use the base period values of the covariates and/or their delta values, i.e. the change between the value of a covariate at a given time and its value at the base period.

    Available options are:

    • "base"

      the value of each covariate is set to its base period value

    • "delta"

      the value of each time-varying covariate is set to the delta. Time-constant covariates included through the formula are dropped, and a warning is issued.

    • ["base", "delta"] or "base_delta"

      the value of each covariate is set to its base period value, and the value of each time-varying covariate is set to the delta.

    • {'base': ['a', 'b', ..]}

      the value of the specified covariates is set to their base period value, and the value of each time-varying covariate is set to the delta. A warning is issued if the formula included time-constant covariates that are not included in base_delta.

    • {'delta': ['c', 'd', ..]}

      the value of each covariate is set to its base period value, and the value of the specified time-varying covariates is set to the delta. If the covariates included in ‘delta’ are not time-varying they will be removed from the list.

    • {'base': ['a', 'b', ..], 'delta': ['c', 'd', ..]}

      the value of the specified covariates is set to their base period value, and the value of the specified time-varying covariates is set to the delta. A warning is issued if the formula included time-constant covariates that are not included in ‘delta’. If the covariates included in ‘delta’ are not time-varying, they will be removed from the list.

  • est_method (str | Callable, default: "dr") –

    • "dr-mle" or "dr"

      for locally efficient doubly robust DiD estimator, with logistic propensity score model for the probability of being treated

    • "dr-ipt"

      for locally efficient doubly robust DiD estimator, with propensity score estimated using the inverse probability tilting

    • "reg"

      for outcome regression DiD estimator

    • "std_ipw-mle" or "std_ipw"

      for standardized inverse probability weighted DiD estimator, with logistic propensity score model for the probability of being treated

  • as_repeated_cross_section (bool | None, default: None) –

  • boot_iterations (int, default: 0) –

  • random_state (int | None, default: None) –

  • alpha (float, default: 0.05) – The significance level.

  • cluster_var (str | list | None, default: None) –

  • split_sample_by (str | Callable | dict | None, default: None) –

    The name of the column along which to split the data, or a function which takes the data and returns a sample mask for a binary split, for example:

    lambda x: x['column name'] >= x['column name'].median()
    

    The estimation of the ATT will be run separately for each specified sample; used for heterogeneity analysis.

  • n_jobs (int, default: 1) –

    The maximum number of concurrently running jobs. If -1 all CPUs are used.

    If ≠ 1, concurrent jobs will be run for two separate tasks:

    • computing the cohort-time ATT; each cohort-time is assigned to a job

    • computing the bootstrap; the influence function is split into n_jobs parts and the bootstrap is computed concurrently for each part

    Parallelization is implemented using joblib, refer to its documentation for additional details on n_jobs.

  • backend (str, default: "loky") –

    Parallelization backend implementation.

    Parallelization is implemented using joblib, refer to its documentation for additional details on backend.

  • progress_bar (bool, default: True) – If True, a progress bar will display the progress over the cohort-time iterations and/or over the concurrent bootstrap splits (not the bootstrap iterations).
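The difference between the "base" and "delta" covariate values described for base_delta can be illustrated with pandas alone. This is a sketch of the definitions for a single unit and a fixed base period, not the library's internal code; the covariate name a and the base period 2000 are assumptions.

```python
import pandas as pd

# One unit observed in three periods, with a time-varying covariate "a".
panel = pd.DataFrame(
    {"time": [2000, 2001, 2002], "a": [10.0, 12.0, 15.0]}
).set_index("time")

base_period = 2000  # assumed base period for this illustration

# "base": every period carries the covariate's base period value.
base_value = panel.loc[base_period, "a"]
panel["a_base"] = base_value

# "delta": change between the value at each time and the base period value.
panel["a_delta"] = panel["a"] - base_value
```

A time-constant covariate would have a delta of zero in every period, which is consistent with such covariates being dropped under "delta".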

Return type:

A DataFrame with the group time ATTs
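A hedged sketch of a fit call that passes a callable to split_sample_by. The column names y, a, b and size are hypothetical, and att_gt is assumed to be an already constructed ATTgt instance; only arguments that appear in the signature above are used.

```python
import pandas as pd


def above_median(x: pd.DataFrame) -> pd.Series:
    """Boolean mask for a binary split: True where the hypothetical
    column 'size' is at or above its median."""
    return x["size"] >= x["size"].median()


def run_fit(att_gt):
    # att_gt is assumed to be an ATTgt instance; 'y', 'a' and 'b' are
    # hypothetical column names in its data.
    return att_gt.fit(
        formula="y ~ a + b",
        control_group="not_yet_treated",
        est_method="dr",
        split_sample_by=above_median,
    )


# The mask itself can be checked on a small frame without the library.
demo = pd.DataFrame({"size": [1, 2, 3, 4]})
mask = above_median(demo)
```

Naming the split function instead of writing an inline lambda makes the mask easy to inspect before it is handed to the estimator.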

aggregate(type_of_aggregation: str | None = 'simple', overall: bool = False, difference: bool | list | dict[str, list] = False, alpha: float = 0.05, cluster_var: list | str = None, boot_iterations: int = 0, random_state: int = None, n_jobs: int = 1, backend: str = 'loky') → DataFrame

Aggregate the ATTgt

Parameters:
  • type_of_aggregation (str | None, default: "simple") –

    • "simple"

      to calculate the weighted average of all cohort-time average treatment effects, with weights proportional to the cohort size.

    • "event"

      to calculate the average effect in each relative period, i.e. each period relative to the treatment, as in an event study.

    • "cohort"

      to calculate the average treatment effect in each cohort.

    • "time"

      to calculate the average treatment effect in each time period.

  • overall (bool, default: False) –

    calculates the average effect within each type_of_aggregation.

    • if type_of_aggregation is set to "event"

      to calculate the average effect of the treatment across positive relative periods

    • if type_of_aggregation is set to "cohort"

      to calculate the average effect of the treatment across cohorts

    • if type_of_aggregation is set to "time"

      to calculate the average effect of the treatment across time periods

  • difference (bool | list | dict, default: False) –

    take the difference of the estimates

    Available options are:

    • True

      to calculate the difference between 2 samples or 2 strata of treatments

    Note

    • Samples difference: if the estimation is run on 2 samples and more than 2 strata, the estimates for the two samples will be subtracted, as long as no strata have the same names as the samples; in that case, use a dictionary as indicated below

    • Strata difference: if the estimation is run on 2 strata and more than 2 samples, the estimates for the two strata will be subtracted, as long as no samples have the same names as the strata; in that case, use a dictionary as indicated below

    • [sample-0, sample-1] or [stratum-A, stratum-B]

      to calculate the difference between 2 samples listed in the argument or the 2 strata of treatments listed in the argument

    Note

    • Samples difference: if there are strata with the same name as the two samples listed, use a dictionary as indicated below

    • Strata difference: if there are samples with the same name as the two strata listed, use a dictionary as indicated below

    • {'strata': [stratum-A, stratum-B]} or {'sample_names': [sample-0, sample-1]}

  • alpha (float, default: 0.05) – The significance level.

  • cluster_var (str | list | None, default: None) – cluster variables

  • boot_iterations (int, default: 0) – bootstrap iterations

  • random_state (int | None, default: None) – seed for bootstrap

  • n_jobs (int, default: 1) –

    The maximum number of concurrently running jobs. If -1 all CPUs are used.

    If ≠ 1, concurrent jobs will be run for:

    • computing the bootstrap; the influence function is split into n_jobs parts and the bootstrap is computed concurrently for each part

    Parallelization is implemented using joblib, refer to its documentation for additional details on n_jobs.

  • backend (str, default: "loky") –

    Parallelization backend implementation.

    Parallelization is implemented using joblib, refer to its documentation for additional details on backend.
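The alpha parameter above is a significance level, so alpha = 0.05 corresponds to 95% confidence intervals. As a rough sketch of what that means for a single estimate, assuming a normal approximation (the library may compute its intervals differently, for example from the bootstrap), with made-up numbers:

```python
from statistics import NormalDist


def normal_ci(estimate: float, std_error: float, alpha: float = 0.05):
    """Two-sided (1 - alpha) interval under a normal approximation."""
    # Critical value of the standard normal for the chosen alpha.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical estimate and standard error, for illustration only.
lower, upper = normal_ci(estimate=1.0, std_error=0.5, alpha=0.05)
```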

Return type:

A DataFrame with the requested aggregation
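The "simple" and "event" aggregations described above can be sketched with plain Python. The numbers are made up, and the weighting follows only the wording of the parameter description (weights proportional to cohort size); the library's exact weighting scheme may differ.

```python
from collections import defaultdict

# Hypothetical cohort-time ATTs as (cohort, time, att); not library output.
att_gt = [
    (2001, 2001, 1.0),
    (2001, 2002, 2.0),
    (2002, 2002, 4.0),
]
cohort_size = {2001: 30, 2002: 10}  # hypothetical number of units per cohort

# "simple": weighted average of all cohort-time ATTs,
# with weights proportional to the cohort size.
weights = [cohort_size[c] for c, _, _ in att_gt]
simple = sum(w * att for w, (_, _, att) in zip(weights, att_gt)) / sum(weights)

# "event": average effect at each relative period e = time - cohort.
# Weighting by cohort size within a relative period is an assumption here.
num = defaultdict(float)
den = defaultdict(float)
for cohort, time, att in att_gt:
    e = time - cohort
    num[e] += cohort_size[cohort] * att
    den[e] += cohort_size[cohort]
event = {e: num[e] / den[e] for e in sorted(num)}
```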

property sample_names
property wald_pre_test
results(type_of_aggregation: str | None = None, overall: bool = False, difference: bool = False, to_dataframe: bool = True, add_info: bool = False)

Provides easy access to cached results. This method must be called after fit and/or aggregate, depending on the parameters requested.

Parameters:
  • type_of_aggregation (str | None, default: None) –

    • "simple"

      to return the weighted average of all cohort-time average treatment effects, with weights proportional to the cohort size.

    • "event"

      to return the average effect in each relative period, i.e. each period relative to the treatment, as in an event study.

    • "cohort"

      to return the average treatment effect in each cohort.

    • "time"

      to return the average treatment effect in each time period.

  • overall (bool, default: False) –

    returns the average effect within each type_of_aggregation.

    • if type_of_aggregation is set to "event"

      to return the average effect of the treatment across positive relative periods

    • if type_of_aggregation is set to "cohort"

      to return the average effect of the treatment across cohorts

    • if type_of_aggregation is set to "time"

      to return the average effect of the treatment across time periods

  • difference (bool, default: False) – to return the most recent estimated difference

  • to_dataframe (bool, default: True) – whether to return the result as a DataFrame or as a list of namedtuples

Return type:

Either a pandas dataframe or a list of namedtuples

plot(type_of_aggregation: str | None = None, overall: bool = False, difference: bool = False, estimation_details: bool = True, estimate_in_x_axis: bool = False, **plotting_parameters)
Parameters:
  • type_of_aggregation (str | None, default: None) –

    • "simple"

      to plot the weighted average of all cohort-time average treatment effects, with weights proportional to the cohort size.

    • "event"

      to plot the average effect in each relative period, i.e. each period relative to the treatment, as in an event study.

    • "cohort"

      to plot the average treatment effect in each cohort.

    • "time"

      to plot the average treatment effect in each time period.

  • overall (bool, default: False) –

    to plot the average effect within each type_of_aggregation.

    • if type_of_aggregation is set to "event"

      to plot the average effect of the treatment across positive relative periods

    • if type_of_aggregation is set to "cohort"

      to plot the average effect of the treatment across cohorts

    • if type_of_aggregation is set to "time"

      to plot the average effect of the treatment across time periods

  • difference (bool, default: False) –

    take the difference of the estimates

    Available options are:

    • True

      to plot the difference between 2 samples or 2 strata of treatments

  • estimation_details (bool | list | str, default: True) – include the estimation details in the plot. One can modify the format through plotting_parameters

  • estimate_in_x_axis (bool, default: False) – whether to display the ATT estimates in the x-axis

  • plotting_parameters – a set of parameters to customize the plot. Please refer to the separate documentation for the plotting functionalities built in the library

Return type:

An interactive plot for the requested estimates
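A sketch of how aggregate, results and plot might be chained for an event-study summary, using only the parameters documented above. The helper name is hypothetical, and att_gt is assumed to be an ATTgt instance on which fit has already been called.

```python
def summarize_event_study(att_gt):
    """Aggregate by relative period, then fetch the cached table and plot.

    att_gt is assumed to be an ATTgt instance that has already been fitted.
    """
    # Compute the overall average effect across positive relative periods.
    att_gt.aggregate(type_of_aggregation="event", overall=True)

    # Retrieve the cached estimates using the same aggregation arguments.
    table = att_gt.results(type_of_aggregation="event", overall=True)

    # Plot the same estimates.
    figure = att_gt.plot(type_of_aggregation="event", overall=True)
    return table, figure
```

Passing the same type_of_aggregation and overall values to all three calls keeps the table and the plot aligned with the aggregation that was actually computed.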

Citation

[CS2021]

Callaway, Brantly, and Pedro HC Sant’Anna. “Difference-in-differences with multiple time periods.” Journal of Econometrics 225, no. 2 (2021): 200-230.

[CGS2022]

Callaway, Brantly, Andrew Goodman-Bacon, and Pedro HC Sant’Anna. “Difference-in-differences with a continuous treatment.” arXiv preprint arXiv:2107.02637 (2021).

[SZ2020]

Sant’Anna, Pedro HC, and Jun Zhao. “Doubly robust difference-in-differences estimators.” Journal of Econometrics 219, no. 1 (2020): 101-122.