Methodological Notes ==================== This document provides the theoretical foundation and methodological details for the Lee and Wooldridge difference-in-differences methods implemented in ``lwdid``. .. contents:: Table of Contents :local: :depth: 2 Overview -------- The ``lwdid`` package implements the Lee and Wooldridge methods for difference-in- differences estimation with panel data, covering three scenarios: 1. **Small-sample common timing** (Lee and Wooldridge 2026): Exact t-based inference under classical linear model assumptions when the number of cross-sectional units is small. 2. **Large-sample common timing** (Lee and Wooldridge 2025): Rolling transformation approach with asymptotic inference using robust standard errors. 3. **Staggered adoption** (Lee and Wooldridge 2025): Extension to settings where units are treated at different times, with cohort-time specific effect estimation. The core method transforms panel data into cross-sectional form via unit-specific time-series operations: 1. Transform panel data into cross-sectional form via unit-specific time-series operations 2. Estimate treatment effects from a cross-sectional regression on transformed data 3. Under classical linear model assumptions (including normality and homoskedasticity), conduct exact t-based inference from the cross-sectional regression The method exploits an algebraic equivalence: by removing unit-specific pre-treatment patterns (means or trends), the panel DiD estimator can be obtained from a cross-sectional regression where exact finite-sample inference is available under the classical linear model assumptions, particularly normality and homoskedasticity of the error term, when homoskedastic OLS standard errors are used. The Panel-to-Cross-Section Transformation ------------------------------------------ Core Principle ~~~~~~~~~~~~~~ The method removes unit-specific time-series patterns using only pre-treatment data, then pools the transformed outcomes across units. This proceeds in two steps: 1. **Transformation step**: Apply a unit-specific time-series transformation using pre-treatment periods only 2. **Regression step**: Estimate treatment effects from a cross-sectional regression on the transformed outcomes The transformation uses only pre-treatment information, preserving the treatment variation for estimation in the second step. Demean Transformation ~~~~~~~~~~~~~~~~~~~~~~ **Setup** Conceptually, unit :math:`i` is observed over :math:`T` periods (:math:`t = 1, \ldots, T`). Treatment begins at period :math:`S`, where :math:`S \in \{2, \ldots, T\}`. Pre-treatment periods are :math:`t = 1, \ldots, S-1` (:math:`T_0 = S-1` periods). Post-treatment periods are :math:`t = S, \ldots, T` (:math:`T_1 = T-S+1` periods). In the notation of Lee and Wooldridge (2026), this description uses a balanced-panel setup where each unit is observed in all T periods. The ``lwdid`` implementation also accommodates unbalanced panels: units need not appear in every period. However, each unit included in the data must have at least one pre-treatment observation so that its pre-treatment mean can be computed. Units without any post-treatment observations remain in the panel but do not contribute to the main ATT regression because their post-treatment average is undefined. **Procedure** 1. Compute the pre-treatment mean for each unit i: .. math:: \bar{Y}_{i,pre} = \frac{1}{S-1} \sum_{t=1}^{S-1} Y_{it} 2. Compute the post-treatment mean for each unit i: .. math:: \bar{Y}_{i,post} = \frac{1}{T-S+1} \sum_{t=S}^{T} Y_{it} 3. Construct the transformed outcome: .. math:: \Delta\bar{Y}_i = \bar{Y}_{i,post} - \bar{Y}_{i,pre} 4. Estimate the cross-sectional regression: .. math:: \Delta\bar{Y}_i = \alpha + \tau D_i + U_i, \quad i = 1, \ldots, N where :math:`D_i = 1` for treated units and :math:`D_i = 0` for control units. **Estimand**: :math:`\tau` is the average treatment effect on the treated (ATT). **Equivalence**: This transformation yields the same point estimate as the two-way fixed effects (TWFE) DiD estimator. The cross-sectional representation enables exact t-based inference under classical linear model assumptions. Detrend Transformation ~~~~~~~~~~~~~~~~~~~~~~~ **Motivation**: When units exhibit heterogeneous linear trends in pre-treatment periods, the standard parallel trends assumption (constant trends across units) may be too restrictive. Detrending allows for unit-specific linear trends by removing them before estimating treatment effects. **Procedure** 1. For each unit i, estimate a linear trend using pre-treatment data: .. math:: Y_{it} = A_i + B_i t + \varepsilon_{it} \quad \text{for } t = 1, \ldots, S-1 This requires at least two pre-treatment periods (:math:`T_0 \geq 2`). 2. Compute predicted values for all periods using the estimated trend: .. math:: \hat{Y}_{it} = \hat{A}_i + \hat{B}_i t \quad \text{for } t = 1, \ldots, T 3. For post-treatment periods, compute out-of-sample residuals: .. math:: \ddot{Y}_{it} = Y_{it} - \hat{Y}_{it} \quad \text{for } t = S, \ldots, T 4. Average the residuals over post-treatment periods: .. math:: \bar{\ddot{Y}}_i = \frac{1}{T-S+1} \sum_{t=S}^{T} \ddot{Y}_{it} 5. Estimate the cross-sectional regression: .. math:: \bar{\ddot{Y}}_i = \alpha + \tau_{DT} D_i + U_i, \quad i = 1, \ldots, N **Estimand**: :math:`\tau_{DT}` is the ATT after removing unit-specific linear trends. **Note**: The trend is estimated using only pre-treatment data, so the treatment variation remains available for estimation in the cross-sectional regression. As in the demean case, the implementation allows unbalanced panels: units need not appear in every period, but each unit included in the data must have at least two pre-treatment observations so that its trend can be estimated. Units without any post-treatment observations remain in the panel but do not contribute to the main ATT regression because their post-treatment average is undefined. Seasonal Transformations (demeanq/detrendq) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **demeanq**: Extends the demean transformation to include seasonal fixed effects, removing seasonal patterns in periodic data. **detrendq**: Extends the detrend transformation to include both linear trends and seasonal fixed effects. Both methods include seasonal dummies in the pre-treatment regression to remove seasonal variation before computing post-treatment residuals. **Generalized Q Parameter** The seasonal transformations support arbitrary seasonal periods through the ``Q`` parameter: - **Q=4** (default): Quarterly data with 4 seasons per year - **Q=12**: Monthly data with 12 seasons per year - **Q=52**: Weekly data with 52 seasons per year The mathematical formulation generalizes to Q seasons. For demeanq, the pre-treatment regression for each unit i is: .. math:: Y_{it} = \alpha_i + \sum_{q=2}^{Q} \gamma_q S_{itq} + \varepsilon_{it} where :math:`S_{itq}` is a dummy variable equal to 1 if observation t falls in season q (with season 1 as the reference category). For detrendq, the pre-treatment regression includes both trend and seasonal terms: .. math:: Y_{it} = \alpha_i + \beta_i t + \sum_{q=2}^{Q} \gamma_q S_{itq} + \varepsilon_{it} **Minimum Pre-Treatment Requirements** The minimum number of pre-treatment observations per unit depends on Q: - **demeanq**: :math:`n_{pre} \geq Q + 1` (to estimate intercept + Q-1 seasonal dummies) - **detrendq**: :math:`n_{pre} \geq Q + 2` (to estimate intercept + trend + Q-1 seasonal dummies) For monthly data (Q=12), this means at least 13 pre-treatment observations for demeanq and 14 for detrendq. For weekly data (Q=52), at least 53 and 54 observations respectively. **Season Coverage Requirement** For each unit, every season that appears in the post-treatment period must also appear in the pre-treatment period. This ensures that seasonal effects can be properly removed from post-treatment observations. **Numerical Stability** For high-dimensional seasonal adjustments (especially Q=52), the implementation includes numerical stability checks: - Condition number monitoring for the design matrix - Warnings when the design matrix approaches singularity - Robust OLS estimation using QR decomposition Inference Under CLM Assumptions -------------------------------- Classical Linear Model Assumptions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When homoskedastic OLS standard errors are used (``vce=None`` in ``lwdid``), exact finite-sample inference is available under the classical linear model (CLM) assumptions for the cross-sectional regression. For the demean transformation, the model is :math:`\Delta\bar{Y}_i = \alpha + \tau D_i + U_i` (for detrending, replace :math:`\Delta\bar{Y}_i` with :math:`\bar{\ddot{Y}}_i`). The CLM assumptions are: 1. **Linearity**: :math:`E[U_i | D_i] = 0` (zero conditional mean) 2. **Random sampling**: Units are independently sampled 3. **No perfect collinearity**: Treatment indicator varies across units 4. **Homoskedasticity**: :math:`\text{Var}(U_i | D_i) = \sigma^2` (constant variance) 5. **Normality**: :math:`U_i | D_i \sim N(0, \sigma^2)` (conditional normality) Under these assumptions and with homoskedastic OLS standard errors, the t-statistic :math:`(\hat{\tau} - \tau)/\text{se}(\hat{\tau})` follows an exact t-distribution with residual degrees of freedom equal to :math:`N - k`, where :math:`k` is the number of estimated parameters in the cross-sectional regression (:math:`k = 2` without controls: intercept and treatment indicator). The normality and homoskedasticity assumptions are critical for exact inference. Degrees of Freedom ~~~~~~~~~~~~~~~~~~ **Homoskedastic standard errors**: .. math:: df = N - k where :math:`N` is the number of cross-sectional units and :math:`k` is the number of parameters (:math:`k = 2` without controls: intercept and treatment indicator). **Cluster-robust standard errors**: .. math:: df = G - 1 where :math:`G` is the number of clusters. The cross-sectional regression has :math:`N` observations (one per unit), not :math:`N \times T`. With cluster-robust standard errors, degrees of freedom equal the number of clusters minus one. Period-Specific Effects ~~~~~~~~~~~~~~~~~~~~~~~ In addition to an overall post-treatment average effect, the Lee and Wooldridge framework allows estimation of period-specific treatment effects by running separate cross-sectional regressions for each post-treatment period, using the transformed outcome in that period as the dependent variable. In ``lwdid``, these regressions use the same variance estimator (``vce``) and control-variable specification as the main ATT regression. Robust Inference ---------------- Heteroskedasticity-Robust Standard Errors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When homoskedasticity is violated, heteroskedasticity-robust standard errors provide asymptotically valid inference. The ``lwdid`` package supports the following HC estimators: - **HC0**: The original heteroskedasticity-consistent estimator. Tends to underestimate standard errors in small samples. - **HC1**: Degrees-of-freedom adjusted version of HC0 (applies :math:`n/(n-k)` correction). Equivalent to ``vce='robust'``. - **HC2**: Leverage-adjusted estimator that divides squared residuals by :math:`(1 - h_{ii})` where :math:`h_{ii}` is the diagonal of the hat matrix. - **HC3**: Small-sample adjusted version that divides squared residuals by :math:`(1 - h_{ii})^2`. Recommended for small to moderate samples. - **HC4**: High-leverage adjusted estimator with adaptive exponent based on leverage. Use when data contains influential observations. Robust standard errors rely on asymptotic approximations and may be less accurate in very small samples. Simulation evidence suggests HC3 can perform reasonably well even with small sample sizes, though caution is warranted with very small samples. Exact inference is not available under heteroskedasticity. Cluster-Robust Standard Errors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When errors are correlated within clusters, cluster-robust standard errors account for within-cluster correlation. These standard errors rely on large-sample approximations in the number of clusters. Inference is more reliable when the number of clusters is not too small; ``lwdid`` issues a warning when the cluster count is very low. Degrees of freedom for cluster-robust inference are taken as G - 1. The cluster variable must be nested within the unit identifier (each unit belongs to exactly one cluster across all time periods), reflecting the usual assumption that clusters are independent. Randomization Inference ~~~~~~~~~~~~~~~~~~~~~~~ Randomization inference constructs a reference distribution for the test statistic under the sharp null hypothesis without relying on normality or homoskedasticity assumptions. **Procedure**: 1. Compute the observed test statistic. In ``lwdid``, this is the ATT estimate :math:`\hat{\tau}` from the cross-sectional regression. 2. Randomly reassign treatment status across units 3. Re-estimate the model with reassigned treatment 4. Repeat steps 2-3 :math:`R` times (e.g., :math:`R = 1000`) 5. Compute the p-value as the proportion of replications with the test statistic at least as extreme (in absolute value) as the observed value **Methods**: - **Bootstrap** (with replacement): Treatment group size may vary across replications - **Permutation** (without replacement): Fisher randomization inference; treatment group size is fixed across permutations In ``lwdid``, randomization inference always re-estimates the cross-sectional regression by OLS (ignoring the ``vce`` option used for the main estimate). The default ``ri_method`` is ``'bootstrap'`` for backward compatibility, while ``'permutation'`` corresponds to the classical Fisher randomization approach and is generally recommended for design-based randomization inference. **Advantages**: - Does not rely on normality or homoskedasticity assumptions - Naturally accommodates heteroskedasticity and non-normality under the maintained randomization scheme **Limitations**: - Computationally intensive for large numbers of replications - Tests only the sharp null hypothesis (zero treatment effect for all units) Clustering at Higher Levels --------------------------- When the policy or treatment varies at a level higher than the unit of observation, cluster standard errors at the policy variation level (Lee and Wooldridge 2026). This section provides guidance on choosing the appropriate clustering level and tools for diagnosing clustering structure. When to Cluster at Higher Levels ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Principle**: Cluster at the level where treatment is assigned or varies. **Example**: If studying a state-level policy using county-level data, cluster at the state level because the policy varies across states, not counties:: result = lwdid( data, y='outcome', ivar='county', tvar='year', gvar='first_treat', vce='cluster', cluster_var='state' ) **Rationale**: When treatment is assigned at a higher level (e.g., state), errors are likely correlated within that level. Clustering at the unit level (county) would understate standard errors because it ignores this correlation. Minimum Cluster Requirements ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The reliability of cluster-robust inference depends on the number of clusters: - **Recommended**: ≥ 20-30 clusters for reliable asymptotic inference - **Warning threshold**: < 10 clusters may produce unreliable inference - **Alternative**: Use wild cluster bootstrap when clusters are few ``lwdid`` automatically warns when the number of clusters is small: - **< 10 clusters**: Warning with recommendation to use wild cluster bootstrap - **10-19 clusters**: Informational message suggesting wild cluster bootstrap - **Cluster size imbalance** (CV > 1.0): Warning about potential reliability issues Degrees of Freedom ~~~~~~~~~~~~~~~~~~ With cluster-robust standard errors, degrees of freedom are: .. math:: df = G - 1 where :math:`G` is the number of clusters. This is more conservative than the usual :math:`N - k` degrees of freedom, reflecting the effective sample size being the number of clusters rather than the number of observations. Diagnosing Clustering Structure ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use ``diagnose_clustering()`` to analyze potential clustering options and get recommendations:: from lwdid import diagnose_clustering diag = diagnose_clustering( data, ivar='county', potential_cluster_vars=['state', 'region', 'county'], gvar='first_treat' ) print(diag.summary()) The diagnostics include: - **Cluster counts**: Total, treated, and control clusters for each variable - **Cluster sizes**: Min, max, mean, median, and coefficient of variation - **Level detection**: Whether each variable is at a higher/same/lower level than unit - **Treatment variation**: Whether treatment varies within clusters - **Recommendation**: Suggested clustering variable with explanation Getting a Clustering Recommendation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For a detailed recommendation with confidence scores and alternatives:: from lwdid import recommend_clustering_level rec = recommend_clustering_level( data, ivar='county', tvar='year', potential_cluster_vars=['state', 'region', 'county'], gvar='first_treat', min_clusters=20 ) print(rec.summary()) if rec.use_wild_bootstrap: print("Consider using wild_cluster_bootstrap() for inference") The recommendation considers: - Treatment variation level (cluster at this level when possible) - Number of clusters (more is better, ≥20 recommended) - Balance between treated and control clusters - Cluster size variation (lower CV is better) Checking Clustering Consistency ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Verify that your chosen clustering level is consistent with the treatment assignment mechanism:: from lwdid import check_clustering_consistency result = check_clustering_consistency( data, ivar='county', cluster_var='state', gvar='first_treat' ) if not result.is_consistent: print(f"Warning: {result.recommendation}") A clustering choice is consistent when: - Treatment does not vary within clusters (or varies minimally, < 5%) - Cluster level is at or above the treatment variation level If treatment varies within clusters, standard errors may be conservative (too large), leading to under-rejection of the null hypothesis. Wild Cluster Bootstrap ~~~~~~~~~~~~~~~~~~~~~~ When the number of clusters is small (< 20), the wild cluster bootstrap provides more reliable inference than asymptotic cluster-robust standard errors. The procedure constructs a bootstrap distribution by: 1. Estimating the original model and obtaining residuals 2. Generating cluster-level random weights 3. Creating bootstrap residuals by multiplying original residuals by weights 4. Re-estimating the model with bootstrap outcomes 5. Computing the bootstrap distribution of t-statistics **Basic usage**:: from lwdid.inference import wild_cluster_bootstrap result = wild_cluster_bootstrap( data, y_transformed='ydot', d='d_', cluster_var='state', n_bootstrap=999 ) print(f"ATT: {result.att:.4f}") print(f"Bootstrap SE: {result.se_bootstrap:.4f}") print(f"Bootstrap p-value: {result.pvalue:.4f}") print(f"95% CI: [{result.ci_lower:.4f}, {result.ci_upper:.4f}]") **Weight types**: - **rademacher** (default): P(w=1) = P(w=-1) = 0.5. Simplest and most common. - **mammen**: Two-point distribution matching first three moments. Better for asymmetric error distributions. - **webb**: Six-point distribution. Recommended for very few clusters (G < 10). **Example with Webb weights for few clusters**:: result = wild_cluster_bootstrap( data, y_transformed='ydot', d='d_', cluster_var='state', weight_type='webb', # Better for very few clusters n_bootstrap=999, seed=42 # For reproducibility ) **Null hypothesis imposition**: By default (``impose_null=True``), the bootstrap constructs samples under the null hypothesis that the treatment effect is zero. This provides better size control for hypothesis testing. Set ``impose_null=False`` for confidence interval construction when the null may not hold. Clustering Workflow ~~~~~~~~~~~~~~~~~~~ A recommended workflow for choosing and validating clustering: 1. **Diagnose**: Use ``diagnose_clustering()`` to understand the data structure 2. **Recommend**: Use ``recommend_clustering_level()`` to get a recommendation 3. **Check**: Use ``check_clustering_consistency()`` to validate the choice 4. **Estimate**: Run ``lwdid()`` with the chosen ``cluster_var`` 5. **Bootstrap**: If clusters < 20, use ``wild_cluster_bootstrap()`` for inference **Complete example**:: import pandas as pd from lwdid import ( lwdid, diagnose_clustering, recommend_clustering_level, check_clustering_consistency ) from lwdid.inference import wild_cluster_bootstrap # Step 1-2: Diagnose and get recommendation rec = recommend_clustering_level( data, ivar='county', tvar='year', potential_cluster_vars=['state', 'region'], gvar='first_treat' ) # Step 3: Check consistency consistency = check_clustering_consistency( data, ivar='county', cluster_var=rec.recommended_var, gvar='first_treat' ) # Step 4: Estimate with cluster-robust SE result = lwdid( data, y='outcome', ivar='county', tvar='year', gvar='first_treat', vce='cluster', cluster_var=rec.recommended_var ) # Step 5: Wild bootstrap if few clusters if rec.use_wild_bootstrap: boot_result = wild_cluster_bootstrap( result.data_transformed, y_transformed='ydot', d='d_', cluster_var=rec.recommended_var, n_bootstrap=999 ) print(f"Bootstrap p-value: {boot_result.pvalue:.4f}") Large-Sample Asymptotic Inference --------------------------------- When the cross-sectional sample size N is moderately large, asymptotic inference based on robust standard errors becomes appropriate. Lee and Wooldridge (2025) develops the theoretical foundation for large-sample inference in the rolling transformation framework. Asymptotic Theory ~~~~~~~~~~~~~~~~~ Under standard regularity conditions (independent sampling across units, finite moments, and either correct specification of the outcome model or propensity score model for doubly robust estimators), the ATT estimator is asymptotically normal: .. math:: \sqrt{N} (\hat{\tau} - \tau) \xrightarrow{d} N(0, V) where :math:`V` is the asymptotic variance. This justifies the use of heteroskedasticity-robust standard errors (HC0-HC4) or cluster-robust standard errors for constructing confidence intervals and hypothesis tests. The key insight from Lee and Wooldridge (2025) is that the rolling transformation converts the panel DiD problem into a cross-sectional treatment effects problem, enabling application of standard large-sample theory from the treatment effects literature. Doubly Robust Estimation ~~~~~~~~~~~~~~~~~~~~~~~~ Lee and Wooldridge (2025) shows that the rolling transformation approach enables application of doubly robust estimators, which provide consistent estimates when either the outcome model or the propensity score model is correctly specified. **IPWRA (Inverse Probability Weighted Regression Adjustment)**: The doubly robust IPWRA estimator combines regression adjustment and inverse probability weighting. For cohort :math:`g` in period :math:`r`, the estimator takes the form: .. math:: \hat{\tau}_{IPWRA} = N_1^{-1} \sum_i D_i [\hat{Y}_{ir} - \hat{m}_0(X_i)] - N_1^{-1} \sum_i (1-D_i) \frac{\hat{p}(X_i)}{1-\hat{p}(X_i)} [\hat{Y}_{ir} - \hat{m}_0(X_i)] where :math:`N_1` is the number of treated units, :math:`\hat{m}_0(\cdot)` is the estimated conditional mean for control units, and :math:`\hat{p}(\cdot)` is the estimated propensity score. **Double robustness property**: The IPWRA estimator is consistent if: 1. The outcome model :math:`E[Y|D=0, X]` is correctly specified, OR 2. The propensity score :math:`P(D=1|X)` is correctly specified This property makes IPWRA particularly attractive when functional form assumptions are uncertain. **IPW (Inverse Probability Weighting)**: The IPW estimator weights observations by the inverse of their propensity scores: .. math:: \hat{\tau}_{IPW} = N_1^{-1} \sum_i D_i \hat{Y}_{ir} - \frac{\sum_i (1-D_i)\hat{w}_i\hat{Y}_{ir}}{\sum_i (1-D_i)\hat{w}_i} where :math:`\hat{w}_i = \hat{p}(X_i)/(1-\hat{p}(X_i))`. IPW is consistent when the propensity score is correctly specified. **PSM (Propensity Score Matching)**: Propensity score matching estimates treatment effects by matching treated units to control units with similar propensity scores. Nearest-neighbor matching finds the closest control unit(s) for each treated unit based on the estimated propensity score. Efficiency Considerations ~~~~~~~~~~~~~~~~~~~~~~~~~ Under correct specification of all models, Lee and Wooldridge (2025) shows that: 1. **Regression adjustment (RA)** is both best linear unbiased (BLUE) and asymptotically efficient under standard assumptions 2. **IPWRA** achieves efficiency close to RA while providing robustness to model misspecification 3. **Long differencing methods** can be considerably less efficient because they use only the period just prior to intervention Monte Carlo Simulation Evidence ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The Monte Carlo simulations in Lee and Wooldridge (2025) provide quantitative evidence for these theoretical results. Key findings include: **Efficiency comparison under correct specification**: - **RA/POLS**: Relative SD = 1.00 (baseline), RMSE ratio to RA = 1.00. Best linear unbiased estimator. - **IPWRA**: Relative SD = 1.03-1.05, RMSE ratio to RA ≈ 1.03. Doubly robust property. - **IPW**: Relative SD = 1.08-1.15, RMSE ratio to RA ≈ 1.10. Propensity score-based weighting. - **PSM**: Relative SD = 1.25-1.40, RMSE ratio to RA ≈ 1.30. Matching-based estimator. **Rolling vs. long differencing efficiency**: The rolling transformation uses all pre-treatment periods to estimate unit-specific means or trends, whereas long differencing methods use only the period immediately prior to treatment. This difference has substantial efficiency implications: - Rolling transformation: Uses :math:`T_0` pre-treatment periods per unit - Long differencing: Uses only 1 pre-treatment period per unit - Efficiency gain: Rolling achieves standard deviation approximately :math:`\sqrt{T_0}` times smaller than long differencing when :math:`T_0 > 1`, under homoskedastic errors In the simulation designs with :math:`T_0 = 4`, the rolling approach achieves approximately 25-40% smaller standard deviations than long differencing methods, translating to substantially more precise estimates. **Robustness to misspecification**: When outcome models are misspecified but propensity scores are correct: - IPWRA maintains small bias due to double robustness property - RA may exhibit larger bias under outcome model misspecification - IPWRA RMSE becomes comparable to or smaller than RA RMSE These findings support using IPWRA as the primary estimator when functional form assumptions are uncertain. When to Use Large-Sample Methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Large-sample asymptotic inference is appropriate when: - The cross-sectional sample size :math:`N` is moderately large (e.g., :math:`N \geq 50`) - Homoskedasticity and normality assumptions may not hold - Doubly robust estimation is desired for robustness to model misspecification For small samples where CLM assumptions (normality and homoskedasticity) are plausible, exact t-based inference with ``vce=None`` remains appropriate. Inference Distribution by Estimator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``lwdid`` package uses different reference distributions for statistical inference depending on the estimator and sample size considerations: **Regression Adjustment (RA)**: - Uses the t-distribution with degrees of freedom :math:`df = N - k`, where :math:`N` is the cross-sectional sample size and :math:`k` is the number of regression parameters - Under CLM assumptions (homoskedasticity and normality), this provides exact finite-sample inference - With robust standard errors (HC0-HC4), inference is asymptotically valid **Inverse Probability Weighting (IPW)**: - Uses the normal distribution for asymptotic inference - This follows the standard inverse probability weighting framework - Valid for moderate to large samples where asymptotic approximations hold **Inverse Probability Weighted Regression Adjustment (IPWRA)**: - Uses the normal distribution for asymptotic inference - This is consistent with Lee and Wooldridge (2025), which develops the asymptotic theory for doubly robust estimators in the rolling transformation framework - For small samples, consider using RA with ``vce=None`` for exact t-based inference instead **Propensity Score Matching (PSM)**: - Uses the normal distribution for asymptotic inference - Standard errors are computed analytically - PSM inference is generally valid for moderate to large samples **Practical Guidance**: - **Small samples** (:math:`N < 50`): Use RA (``estimator='ra'``) with ``vce=None`` for exact t-based inference under CLM assumptions, or use randomization inference (``ri=True``) for assumption-free testing - **Moderate samples** (:math:`50 \leq N < 200`): Use RA or IPWRA with HC3 standard errors (``vce='hc3'``) - **Large samples** (:math:`N \geq 200`): All estimators with asymptotic inference are appropriate; IPWRA provides robustness to model misspecification Identification Assumptions --------------------------- No Anticipation ~~~~~~~~~~~~~~~ Units do not change behavior in anticipation of future treatment. Pre-treatment outcomes are unaffected by future treatment status. Example violation: Firms may alter behavior before a regulation takes effect. Parallel Trends ~~~~~~~~~~~~~~~ **Demean**: In the absence of treatment, the average change in outcomes from pre- to post-treatment periods would be the same for treated and control units. Formally, :math:`E[Y_{it}(0) - Y_{i1}(0) | D_i]` is constant across treatment groups for all :math:`t`. This is the standard parallel trends assumption. **Detrend**: In the absence of treatment, the average change in outcomes after removing unit-specific linear trends would be the same for treated and control units. This allows for heterogeneous linear trends across units, relaxing the standard parallel trends assumption. The parallel trends assumption is not directly testable because the counterfactual outcome (what would have happened to treated units without treatment) is unobserved. Researchers can examine pre-treatment trends visually, conduct placebo tests on pre-treatment periods, or use detrending when units exhibit different linear trends in the pre-treatment period. Common Treatment Timing (Common Timing Mode) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In common timing mode, all treated units begin treatment in the same period. This is specified via the ``post`` parameter, which indicates post-treatment periods. For staggered adoption settings where units are treated at different times, use the ``gvar`` parameter instead of ``post``. See the Staggered Adoption section below. Treatment Persistence ~~~~~~~~~~~~~~~~~~~~~ Once treated, units remain treated. At the implementation level, the post-treatment indicator must be monotone non-decreasing in the time index (once ``post`` switches from 0 to 1, it cannot revert to 0). Treatment reversals or temporary treatments are therefore not permitted. Unbalanced Panels and Selection Mechanism ----------------------------------------- Lee and Wooldridge (2025) addresses the treatment of unbalanced panels, where not all units are observed in all time periods. This section describes the selection mechanism assumption and guidance for working with incomplete panel data. Selection Mechanism Assumption ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The key assumption for unbalanced panels is that selection into the sample may depend on unobserved time-invariant heterogeneity (which is removed by the transformation), but cannot systematically depend on the shocks to :math:`Y_{it}(\infty)`. Formally, if :math:`S_{it}` indicates whether unit :math:`i` is observed at time :math:`t`: .. math:: E[Y_{it}(\infty) | S_{it} = 1, D_i, X_i] = E[Y_{it}(\infty) | D_i, X_i] This means that, conditional on treatment status and covariates, being observed does not systematically predict the untreated potential outcome. **Analogy to Fixed Effects**: This assumption is analogous to the standard fixed effects assumption where the error term may be correlated with time-invariant characteristics but not with idiosyncratic shocks. The rolling transformation removes the time-invariant component, so selection based on permanent characteristics does not cause bias. Acceptable Missing Data Patterns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Missing data is acceptable when it depends on: 1. **Time-invariant characteristics**: Units that are always high or low performers may be more or less likely to remain in the sample 2. **Observable covariates**: Missingness related to baseline characteristics that are controlled for in the analysis 3. **Random factors**: Missing completely at random (MCAR) 4. **Deterministic patterns**: Units observed only in certain calendar time periods due to sample design **Examples of acceptable patterns**: - A panel of firms where smaller firms are less likely to survive (selection on time-invariant firm quality) - A survey where certain regions are only surveyed in specific years - Administrative data where units enter and exit based on observable eligibility Problematic Missing Data Patterns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Missing data is problematic when it depends on: 1. **Outcome shocks**: Units experiencing negative shocks being more likely to leave the sample 2. **Treatment anticipation**: Units selecting out based on expected treatment effects 3. **Unobserved time-varying factors**: Missingness correlated with transitory components of the outcome **Examples of problematic patterns**: - Firms exiting the sample after experiencing losses (selection on outcome shocks) - Workers leaving a survey after job loss, where job loss is correlated with outcomes of interest - Treatment group units dropping out more frequently than control units in a pattern correlated with outcomes Detrending for Robustness ~~~~~~~~~~~~~~~~~~~~~~~~~ Lee and Wooldridge (2025) notes that detrending provides additional robustness to unbalanced panels. Removing unit-specific trends allows for two sources of heterogeneity — level and trend — to be correlated with selection into the sample. When selection may depend on both the level and growth rate of the outcome, detrending removes both sources of heterogeneity, making the estimates more robust to selection bias. **Recommendation**: When panel imbalance is substantial or selection mechanisms are uncertain, consider using ``rolling='detrend'`` for additional robustness. Diagnosing Selection Risk ~~~~~~~~~~~~~~~~~~~~~~~~~ The ``lwdid`` package provides diagnostic tools for assessing selection risk: .. code-block:: python from lwdid import diagnose_selection_mechanism diagnostics = diagnose_selection_mechanism( data, ivar='unit', tvar='year', gvar='first_treat' ) print(diagnostics.summary()) The diagnostics include: - **Balance statistics**: Panel balance ratio, observation counts - **Attrition analysis**: Dropout rates by cohort and period - **Missing pattern classification**: MCAR, MAR, or MNAR assessment - **Risk level**: LOW, MEDIUM, or HIGH selection risk - **Recommendations**: Suggested actions based on diagnostics **Interpreting risk levels**: - **LOW**: Proceed with estimation; selection mechanism likely satisfied - **MEDIUM**: Consider using detrending; perform sensitivity analysis - **HIGH**: Results should be interpreted with caution; consider balanced subsample Controlling Panel Balance ~~~~~~~~~~~~~~~~~~~~~~~~~ The ``balanced_panel`` parameter in ``lwdid()`` controls behavior when unbalanced panels are detected: - ``balanced_panel='warn'`` (default): Issue warning and continue estimation - ``balanced_panel='error'``: Raise exception; require balanced panel - ``balanced_panel='ignore'``: Silently proceed with unbalanced panel .. code-block:: python # Strict balanced panel requirement result = lwdid( data, y='outcome', ivar='unit', tvar='year', gvar='first_treat', balanced_panel='error' # Raises exception if unbalanced ) See :doc:`api/selection_diagnostics` for detailed documentation of the diagnostic functions. Heterogeneous Trends and Assumption CHT ---------------------------------------- Lee and Wooldridge (2025) introduces Assumption CHT (Cohort-specific Heterogeneous Trends), which relaxes the standard parallel trends assumption to allow for cohort-specific linear trends. Assumption CHT ~~~~~~~~~~~~~~ Under Assumption CHT, the potential outcome without treatment follows: .. math:: E[Y_t(\infty)|D, X] = \eta_S(D_S \cdot t) + \cdots + \eta_T(D_T \cdot t) + q_\infty(X) + \sum_g D_g q_g(X) + m_t(X) where: - :math:`\eta_g` is the cohort-specific linear trend for cohort :math:`g` - :math:`D_g = 1` indicates membership in cohort :math:`g` - :math:`q_\infty(X)` is the baseline level for never-treated units - :math:`q_g(X)` is the cohort-specific level shift - :math:`m_t(X)` is the common time effect The key implication is that different cohorts may have different pre-treatment trends, violating the standard parallel trends assumption but still allowing valid causal inference through detrending. First Difference Under CHT ~~~~~~~~~~~~~~~~~~~~~~~~~~ Taking first differences under Assumption CHT: .. math:: E[Y_t(\infty) - Y_{t-1}(\infty)|D, X] = \eta_g D_g + \cdots + [m_t(X) - m_{t-1}(X)] The first difference depends on cohort membership through the :math:`\eta_g` terms. This means that if cohorts have different trends (:math:`\eta_g \neq \eta_h`), the standard parallel trends assumption fails, but detrending can remove these cohort-specific trends. Choosing Between Demean and Detrend ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The choice between ``demean`` and ``detrend`` depends on whether the parallel trends assumption holds: **Use demean when**: - Pre-treatment trends are parallel across treatment groups - The number of pre-treatment periods is limited (:math:`T_0 < 3`) - More efficient estimates are desired (demean is more efficient when PT holds) **Use detrend when**: - Pre-treatment trends differ across treatment groups - Sufficient pre-treatment periods are available (:math:`T_0 \geq 2`, preferably :math:`\geq 3`) - Visual inspection or formal tests suggest heterogeneous trends **Decision procedure**: 1. Examine pre-treatment trends visually using ``plot_cohort_trends()`` 2. Test parallel trends formally using ``test_parallel_trends()`` 3. Diagnose trend heterogeneity using ``diagnose_heterogeneous_trends()`` 4. Get a recommendation using ``recommend_transformation()`` Detrending Under CHT ~~~~~~~~~~~~~~~~~~~~ Lee and Wooldridge (2025) describes the detrending procedure for staggered adoption under Assumption CHT: **Step 1**: For each cohort :math:`g \in \{S, \ldots, T\}`, run unit-specific regressions using only pre-treatment periods: .. math:: Y_{it} \text{ on } 1, t \quad \text{for } t = 1, \ldots, g-1 This estimates unit-specific intercept :math:`\hat{A}_i` and slope :math:`\hat{B}_i`. **Step 2**: For post-treatment periods :math:`r \in \{g, \ldots, T\}`, compute out-of-sample predictions: .. math:: \hat{Y}_{irg} = \hat{A}_i + \hat{B}_i \cdot r **Step 3**: Compute detrended residuals: .. math:: \ddot{Y}_{irg} = Y_{ir} - \hat{Y}_{irg} The detrended outcome :math:`\ddot{Y}_{irg}` removes the unit-specific linear trend, allowing valid estimation of treatment effects even when cohorts have different pre-treatment trends. Testing Parallel Trends ~~~~~~~~~~~~~~~~~~~~~~~ The ``lwdid`` package provides tools for testing the parallel trends assumption: **Placebo test**: Estimate treatment effects in pre-treatment periods. Under parallel trends, these "placebo" effects should be zero: .. code-block:: python from lwdid.trend_diagnostics import test_parallel_trends result = test_parallel_trends( data, y='outcome', ivar='unit', tvar='time', gvar='first_treat', method='placebo' ) print(result.summary()) **Interpretation**: - If ``reject_null=False``: No evidence against parallel trends; ``demean`` is appropriate - If ``reject_null=True``: Evidence of pre-treatment differences; consider ``detrend`` Diagnosing Heterogeneous Trends ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To diagnose whether cohorts have different pre-treatment trends: .. code-block:: python from lwdid.trend_diagnostics import diagnose_heterogeneous_trends diag = diagnose_heterogeneous_trends( data, y='outcome', ivar='unit', tvar='time', gvar='first_treat' ) print(diag.summary()) The diagnostics include: - Estimated trend slope for each cohort - F-test for trend heterogeneity across cohorts - Pairwise trend difference tests - Recommendation for transformation method Getting a Transformation Recommendation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For an automated recommendation: .. code-block:: python from lwdid.trend_diagnostics import recommend_transformation rec = recommend_transformation( data, y='outcome', ivar='unit', tvar='time', gvar='first_treat' ) print(rec.summary()) The recommendation considers: - Parallel trends test results - Trend heterogeneity diagnostics - Number of pre-treatment periods - Panel balance - Seasonal patterns (for quarterly data) Pre-treatment Period Dynamics ----------------------------- Lee and Wooldridge (2025) develops a methodology for estimating treatment effects in pre-treatment periods to assess the validity of the parallel trends assumption. This section describes the theoretical foundation and implementation. Motivation ~~~~~~~~~~ The parallel trends assumption is fundamental to difference-in-differences identification but cannot be directly tested because it concerns counterfactual outcomes. However, examining pre-treatment dynamics provides indirect evidence: - Under parallel trends, the transformed outcome difference between treated and control groups should be zero in all pre-treatment periods - Systematic non-zero pre-treatment effects suggest potential violations Rolling Transformation for Pre-treatment Periods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For pre-treatment periods, the transformation uses future pre-treatment periods rather than past periods. This ensures the transformation is well-defined for all pre-treatment periods. **Pre-treatment Demeaning** For cohort :math:`g` and pre-treatment period :math:`t < g`, the demeaned outcome is: .. math:: \dot{Y}_{itg} = Y_{it} - \frac{1}{g-1-t} \sum_{q=t+1}^{g-1} Y_{iq} where the sum is over future pre-treatment periods :math:`\{t+1, \ldots, g-1\}`. Key properties: - Uses only pre-treatment data (periods before :math:`g`) - Rolling window looks forward, not backward - Window size decreases as :math:`t` approaches :math:`g-1` **Pre-treatment Detrending** For cohort :math:`g` and pre-treatment period :math:`t < g`, fit an OLS regression: .. math:: Y_{iq} = A + B \cdot q + \varepsilon_{iq} \quad \text{for } q \in \{t+1, \ldots, g-1\} Then compute the detrended outcome: .. math:: \ddot{Y}_{itg} = Y_{it} - (\hat{A} + \hat{B} \cdot t) This requires at least 2 future pre-treatment periods for OLS estimation. Anchor Point Convention ~~~~~~~~~~~~~~~~~~~~~~~ The period immediately before treatment (:math:`t = g-1`, event time :math:`e = -1`) serves as the anchor point: .. math:: \dot{Y}_{i,g-1,g} = 0 \quad \text{(by construction)} This occurs because the rolling window :math:`\{g, \ldots, g-1\}` is empty, and by convention the transformation returns zero. The anchor point provides: 1. A reference point for interpreting other pre-treatment effects 2. Normalization that facilitates comparison across cohorts 3. Consistency with event study visualization conventions Pre-treatment ATT Estimation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For each cohort :math:`g` and pre-treatment period :math:`t < g`, the pre-treatment ATT is estimated by regressing the transformed outcome on treatment status: .. math:: \dot{Y}_{itg} = \alpha + \tau_{gt}^{pre} D_i + U_i where :math:`D_i = 1` for units in cohort :math:`g` and :math:`D_i = 0` for control units (never-treated or not-yet-treated at time :math:`t`). Under parallel trends, :math:`E[\tau_{gt}^{pre}] = 0` for all :math:`t < g`. Control Group for Pre-treatment Periods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For pre-treatment period :math:`t < g`, valid control units are: - **Never-treated**: Units with :math:`D_\infty = 1` - **Not-yet-treated at t**: Units with first treatment after :math:`t` (cohorts :math:`h > t`) This differs from post-treatment control groups because units must be untreated at time :math:`t`, not just at time :math:`r \geq g`. Joint Test for Parallel Trends ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The parallel trends test performs a joint F-test of the null hypothesis: .. math:: H_0: \tau_{gt}^{pre} = 0 \quad \text{for all } t < g-1 The anchor point (:math:`t = g-1`) is excluded because it is zero by construction. **Test statistic**: .. math:: F = \frac{1}{K} \sum_{k=1}^{K} \left(\frac{\hat{\tau}_k}{\text{SE}(\hat{\tau}_k)}\right)^2 where :math:`K` is the number of pre-treatment periods tested (excluding anchor). Under :math:`H_0`, the F-statistic follows an :math:`F(K, \nu)` distribution, where :math:`\nu` is the degrees of freedom from the estimation. **Interpretation**: - Rejection suggests evidence against parallel trends - Non-rejection does not prove parallel trends holds - Consider effect sizes, not just statistical significance Comparison with Other DiD Methods ---------------------------------- Two-Way Fixed Effects (TWFE) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The demean transformation is algebraically equivalent to TWFE with unit and time fixed effects. The Lee and Wooldridge method makes the cross-sectional structure explicit, enabling exact t-based inference in small samples under the CLM assumptions when homoskedastic OLS standard errors are used. In contrast, TWFE inference is usually based on large-sample approximations with cluster-robust standard errors. Long Differencing Approaches ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Long differencing approaches to staggered DiD use only the single period just prior to the intervention as a reference. Lee and Wooldridge (2025) extends the rolling transformation approach to staggered adoption settings and develops large-sample asymptotic inference using doubly robust estimators (IPWRA). The rolling approach uses all pre-treatment periods, achieving better efficiency while permitting application of various treatment effect estimators. Lee and Wooldridge (2026) focuses on small cross-sectional sample sizes in common timing settings, showing that exact t-based inference is available under classical linear model assumptions (normality and homoskedasticity). When to Use This Method ~~~~~~~~~~~~~~~~~~~~~~~ **Small-sample exact inference** (common timing with ``vce=None``): - Small numbers of treated and/or control units - Willingness to assume normality and homoskedasticity - Alternative: use randomization inference (``ri=True``) **Large-sample asymptotic inference** (common timing or staggered): - Moderately large cross-sectional samples - Use robust standard errors (``vce='robust'`` or ``vce='hc3'``) - For staggered designs, use ``gvar`` to specify first treatment period **Staggered adoption**: - Units treated at different times - Supports cohort-time specific effect estimation - Multiple aggregation options (none, cohort, overall) - Flexible control group strategies (never-treated or not-yet-treated) **This implementation does not cover**: - Treatment reversals or temporary treatments (violates persistence assumption) - Continuous treatment intensity (treatment must be binary) Practical Considerations ------------------------ Sample Size Requirements ~~~~~~~~~~~~~~~~~~~~~~~~ **Minimum (panel-level checks)**: ``lwdid`` requires at least :math:`N \geq 3` units in the panel, with at least one treated unit (:math:`N_1 \geq 1`) and at least one control unit (:math:`N_0 \geq 1`). In addition, the firstpost cross-sectional regression used for ATT estimation must contain at least three units in total and at least one treated and one control unit. If some units drop out of the panel before the post-treatment period, the effective cross-sectional sample entering the main regression can be smaller than the panel :math:`N`; in such cases ``lwdid`` raises an error rather than reporting an ATT based on an under-identified regression. **Practical considerations**: - Exact t-based inference under CLM assumptions is technically available at this minimum, but in practice inference is more stable when the cross-sectional sample is meaningfully larger than :math:`N = 3`. - Cluster-robust standard errors rely on having a non-trivial number of clusters. A commonly used rule of thumb in applied work is to have around :math:`G \geq 10` clusters for more reliable cluster-robust inference, although there is no universal threshold. Time Index and Panel Structure ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - The time index used internally by ``lwdid`` must form a contiguous sequence without gaps (e.g., years 2000, 2001, 2002, ...). If the original time variables have gaps, the function raises an error rather than silently interpolating. - The post-treatment indicator ``post`` must be a pure function of time (common timing): at any given time period, all units are either pre-treatment or post-treatment. This aligns with the common timing assumption described above. Pre-Treatment Periods ~~~~~~~~~~~~~~~~~~~~~~ **Minimum requirements in this implementation**: - ``demean``: At least one pre-treatment period overall (:math:`T_0 \geq 1`), and each unit must have at least one pre-treatment observation so that its pre-treatment mean can be computed. - ``detrend``: At least two pre-treatment periods overall (:math:`T_0 \geq 2`), and each unit must have at least two pre-treatment observations so that a unit-specific linear trend can be estimated. - ``demeanq`` and ``detrendq``: The minimum number of pre-treatment observations depends on the seasonal period Q: - **demeanq**: :math:`n_{pre} \geq Q + 1` per unit (e.g., 5 for quarterly, 13 for monthly, 53 for weekly) - **detrendq**: :math:`n_{pre} \geq Q + 2` per unit (e.g., 6 for quarterly, 14 for monthly, 54 for weekly) Additionally, for each unit, every season that appears in the post-treatment period must also appear in the pre-treatment period; otherwise seasonal effects for that season cannot be removed. **Practical recommendations**: - :math:`T_0 \geq 3` for detrend/detrendq (more stable trend estimation) - More pre-treatment periods improve statistical power and facilitate visual assessment of parallel trends Control Variables ~~~~~~~~~~~~~~~~~ Time-invariant controls (e.g., baseline characteristics) are permitted. **Effects**: - Reduces residual variance, increasing power - Reduces degrees of freedom - Does not affect the transformation step Including many controls relative to sample size can lead to overfitting and unstable estimates. In this implementation, time-invariant controls are included only when both groups are sufficiently large relative to the number of controls: controls enter the regression if :math:`N_1 > K+1` and :math:`N_0 > K+1`, where :math:`K` is the number of control variables. Otherwise, ``lwdid`` issues a warning and estimates the model without controls to avoid under-identified or extremely fragile regressions. When controls are included, observations with missing values in any included control variable are dropped from the main regression sample; ``lwdid`` reports how many observations were removed. If dropping those observations would cause either group to violate the :math:`N_1 > K+1` or :math:`N_0 > K+1` conditions, controls are omitted instead and the full sample is retained. Computational Considerations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Speed**: The transformation and cross-sectional regression steps are computationally efficient for typical panel datasets. Randomization inference requires :math:`R` replications and is more computationally intensive (computation time scales linearly with :math:`R`). **Memory**: Memory requirements are modest for typical panel datasets. Randomization inference stores :math:`R` test statistics for computing the empirical p-value. Limitations ----------- This implementation has the following limitations: 1. **Binary treatment**: Treatment is either on or off (no continuous treatment intensity) 2. **Time-invariant controls**: Controls must not vary over time 3. **Treatment persistence**: Once treated, units must remain treated 4. **Seasonal methods data requirements**: The ``demeanq`` and ``detrendq`` transformations require sufficient pre-treatment observations for seasonal parameter estimation. In staggered designs, early cohorts with limited pre-treatment data may be excluded if they lack the required observations Staggered Adoption ------------------ When units are treated at different times, the staggered adoption framework applies. This is activated by specifying the ``gvar`` parameter (first treatment period for each unit) instead of ``post``. Identification ~~~~~~~~~~~~~~ For treatment cohort :math:`g` (units first treated in period :math:`g`) and calendar time :math:`r \geq g`, the ATT is identified under: 1. **No anticipation**: :math:`E[Y_t(g) - Y_t(\infty) | D_g = 1] = 0` for :math:`t < g` 2. **Conditional parallel trends**: :math:`E[Y_t(\infty) - Y_1(\infty) | D, X] = E[Y_t(\infty) - Y_1(\infty) | X]` These assumptions ensure that never-treated and not-yet-treated units provide valid counterfactuals for estimating treatment effects. Transformation ~~~~~~~~~~~~~~ For each cohort :math:`g` and period :math:`r \geq g`, the transformed outcome is: **Demean**: .. math:: \dot{Y}_{irg} = Y_{ir} - \frac{1}{g-1} \sum_{s=1}^{g-1} Y_{is} **Detrend**: .. math:: \ddot{Y}_{irg} = Y_{ir} - \hat{A}_i - \hat{B}_i r where the trend coefficients are estimated from pre-treatment periods :math:`\{1, \ldots, g-1\}`. Control Group Strategies ~~~~~~~~~~~~~~~~~~~~~~~~ Lee and Wooldridge (2025) establishes that, under the conditional parallel trends assumption (CPTS), the cohort treatment indicators are unconfounded with respect to the transformed potential outcome. This implies: - For estimating :math:`\tau_{gr}` (effect for cohort :math:`g` at time :math:`r`), cohorts :math:`h > r` (not-yet-treated at time :math:`r`) can serve as valid controls in addition to never-treated units - Under CPTS, the conditional expectation :math:`E[\dot{Y}_{rg}(\infty)|D_\infty=1, X] = E[\dot{Y}_{rg}(\infty)|D_h=1, X]` is the same across all cohorts - By the no-anticipation assumption, :math:`E[\dot{Y}_{rg}(\infty)|D_h=1, X] = E[\dot{Y}_{rg}(h)|D_h=1, X]` for :math:`h > r`, so not-yet-treated units can substitute for never-treated units **Strategies**: - **never_treated**: Only units with :math:`D_\infty = 1` (never treated during observation period) - **not_yet_treated**: Never treated plus units with first treatment after period :math:`r` (i.e., cohorts :math:`h \in \{r+1, \ldots, T, \infty\}`) The not-yet-treated strategy uses more control observations, potentially improving efficiency, while the never-treated strategy may be more robust to violations of no anticipation. All Units Eventually Treated ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Lee and Wooldridge (2025) addresses the case where no units remain untreated through period :math:`T` (no never-treated group). In this setting: 1. Treatment effects are defined relative to :math:`Y_t(T)`, the state of being treated only in the final period, rather than :math:`Y_t(\infty)` 2. Effects can only be estimated for cohorts :math:`g \in \{S, \ldots, T-1\}`; no effect can be estimated for the final cohort (:math:`g = T`) because there are no control units 3. By no anticipation, for :math:`r < T`: :math:`E[Y_r(g) - Y_r(T)|D_g = 1] = E[Y_r(g) - Y_r(\infty)|D_g = 1]`, so except for the final period the ATTs have the same interpretation as when a never-treated group exists 4. When :math:`r = T`, the only available control is cohort :math:`g = T` (units first treated in the final period) The implementation handles this automatically: when aggregating to cohort or overall effects, only cohorts with valid control groups are included. Estimation Methods ~~~~~~~~~~~~~~~~~~ In staggered mode, multiple estimators are available: - **ra** (Regression Adjustment): OLS on transformed outcome - **ipw** (Inverse Probability Weighting): Propensity score weighting - **ipwra** (Doubly Robust): Combines regression and IPW - **psm** (Propensity Score Matching): Nearest neighbor matching The doubly robust IPWRA estimator is consistent if either the outcome model or propensity score model is correctly specified, making it particularly attractive when functional form assumptions are uncertain. Inference Distribution by Estimator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Different estimators use different reference distributions for inference: .. list-table:: :header-rows: 1 :widths: 20 20 60 * - Estimator - Distribution - Rationale * - RA (OLS) - t-distribution - Exact inference under CLM assumptions; df = N - k * - IPW - Normal - Asymptotic inference based on influence functions * - IPWRA - Normal - Asymptotic inference based on influence functions * - PSM - Normal - Asymptotic inference; SE based on matching-based variance estimator The RA estimator uses the t-distribution because Lee and Wooldridge (2026) provides exact finite-sample inference under classical linear model assumptions. The IPW, IPWRA, and PSM estimators use the normal distribution because these estimators rely on asymptotic theory (influence functions or matching-based variance estimators) where t-distribution adjustments are not directly applicable. For large samples (N > 50), the practical difference between t and normal distributions is negligible. Aggregation ~~~~~~~~~~~ Cohort-time specific effects :math:`\tau_{gr}` can be aggregated: - **none**: Report :math:`(g,r)`-specific effects only - **cohort**: Average effects within each cohort: .. math:: \tau_g = \frac{1}{T-g+1} \sum_{r=g}^{T} \tau_{gr} - **overall**: Cohort-share weighted average: .. math:: \tau_\omega = \sum_g \omega_g \tau_g \quad \text{where } \omega_g = N_g/N_{treat} Event Time Aggregation (WATT) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The weighted average treatment effect on the treated (WATT) by event time provides an aggregation of cohort-time specific effects :math:`\tau_{gr}` across cohorts sharing the same relative time since treatment onset, as developed in Lee and Wooldridge (2025). For event time :math:`e = r - g` (relative time since treatment), the WATT is: .. math:: \text{WATT}(e) = \sum_{g \in G_e} w(g,e) \cdot \tau_{g, g+e} where :math:`G_e` is the set of cohorts with a valid ATT estimate at event time :math:`e`, and the weights are cohort-size proportions: .. math:: w(g,e) = \frac{N_g}{\sum_{g' \in G_e} N_{g'}} **Standard Error Computation** The standard error for the WATT at event time :math:`e` is computed as: .. math:: SE(\text{WATT}(e)) = \sqrt{\sum_{g \in G_e} [w(g,e)]^2 \cdot [SE(\tau_{g,g+e})]^2} This formula assumes independence across cohorts and uses the normalized weights. **Degrees of Freedom** For t-distribution inference on event-time aggregated effects, the degrees of freedom are chosen conservatively as the minimum across contributing cohorts: .. math:: df(e) = \min_{g \in G_e} df_g This provides conservative coverage that accounts for finite-sample uncertainty. **Event Study Visualization** Event time aggregation is particularly useful for event study plots, which display treatment effects as a function of relative time. The pre-treatment effects (negative event times) serve as placebo tests for the parallel trends assumption, while post-treatment effects reveal the dynamic treatment response. The anchor point at event time :math:`e = -1` is set to zero by convention, serving as the reference baseline for interpreting other effects. Robustness to Pre-treatment Period Selection -------------------------------------------- Lee and Wooldridge (2026) recommends studying the robustness of DiD estimates by varying the number of pre-treatment periods used in the transformation. This section describes the sensitivity analysis tools implemented in ``lwdid``. Motivation ~~~~~~~~~~ The rolling transformation approach uses pre-treatment periods to estimate unit-specific means or trends. The choice of how many pre-treatment periods to include can affect the estimates: - **Too few periods**: May not adequately capture unit-specific patterns - **Too many periods**: May include periods where parallel trends do not hold Robustness of the estimates can be studied by adjusting the number of pre-treatment time periods used in the transformation, analogous to the variation of pre-treatment windows in synthetic control methods. Sensitivity Ratio ~~~~~~~~~~~~~~~~~ The sensitivity ratio measures how much ATT estimates vary across specifications with different numbers of pre-treatment periods: .. math:: \text{Sensitivity Ratio} = \frac{\max_k \hat{\tau}_k - \min_k \hat{\tau}_k}{|\hat{\tau}_{\text{baseline}}|} where :math:`\hat{\tau}_k` is the ATT estimate using :math:`k` pre-treatment periods, and :math:`\hat{\tau}_{\text{baseline}}` is the estimate using all available pre-treatment periods. **Interpretation**: .. list-table:: :header-rows: 1 :widths: 20 25 55 * - Sensitivity Ratio - Robustness Level - Interpretation * - < 10% - Highly Robust - Estimates stable across specifications * - 10-25% - Moderately Robust - Some sensitivity, generally acceptable * - 25-50% - Sensitive - Interpret with caution * - ≥ 50% - Highly Sensitive - Results depend heavily on specification Using robustness_pre_periods() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``robustness_pre_periods()`` function implements this sensitivity analysis:: from lwdid import robustness_pre_periods result = robustness_pre_periods( data, y='outcome', ivar='unit', tvar='year', gvar='first_treat', rolling='detrend', pre_period_range=(3, 10) ) print(result.summary()) result.plot() The function: 1. Estimates ATT using different numbers of pre-treatment periods 2. Computes the sensitivity ratio 3. Assesses robustness level 4. Generates recommendations No-Anticipation Sensitivity ~~~~~~~~~~~~~~~~~~~~~~~~~~~ When policy is announced before implementation, units may adjust behavior in anticipation. The ``sensitivity_no_anticipation()`` function tests robustness by excluding periods immediately before treatment:: from lwdid import sensitivity_no_anticipation result = sensitivity_no_anticipation( data, y='outcome', ivar='unit', tvar='year', gvar='first_treat', max_anticipation=3 ) if result.anticipation_detected: print(f"Consider excluding {result.recommended_exclusion} periods") The function: 1. Estimates ATT excluding 0, 1, 2, ... periods before treatment 2. Detects significant changes in estimates 3. Recommends how many periods to exclude Using exclude_pre_periods Parameter ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Based on sensitivity analysis results, periods can be excluded in the main estimation using the ``exclude_pre_periods`` parameter:: from lwdid import lwdid # Exclude 2 periods immediately before treatment result = lwdid( data, y='outcome', d='d', ivar='unit', tvar='year', post='post', rolling='demean', exclude_pre_periods=2 ) This parameter: - Excludes the specified number of pre-treatment periods from transformation - Addresses potential anticipation effects - Implements the robustness check from Lee and Wooldridge (2026) Comprehensive Sensitivity Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``sensitivity_analysis()`` function provides a comprehensive assessment:: from lwdid import sensitivity_analysis result = sensitivity_analysis( data, y='outcome', ivar='unit', tvar='year', gvar='first_treat', analyses=['pre_periods', 'anticipation'] ) print(result.summary()) result.plot_all() This function combines multiple sensitivity analyses and provides an overall assessment of estimate robustness. References ---------- Lee, S. J., and Wooldridge, J. M. (2026). Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes. *Available at SSRN 5325686*. Lee, S. J., and Wooldridge, J. M. (2025). A Simple Transformation Approach to Difference-in-Differences Estimation for Panel Data. *Available at SSRN 4516518*. Further Reading --------------- - :doc:`user_guide` - Comprehensive usage guide - :doc:`quickstart` - Quick start tutorial - :doc:`api/index` - Complete API reference