Validation Module (validation)

The validation module ensures data quality and assumption compliance before estimation. It checks panel structure, treatment timing, and data requirements.

Input validation and data preparation for difference-in-differences.

This module provides comprehensive validation utilities for panel data used in difference-in-differences analysis. It ensures data integrity, validates structural assumptions (time-invariance, continuity), and prepares data for downstream transformation and estimation.

The module supports both common timing designs (all units treated simultaneously) and staggered adoption designs (treatment timing varies by cohort). Validation checks include column existence, data types, time-invariance of treatment indicators and controls, time continuity, and treatment/control group adequacy.

Notes

The following column names are reserved for columns created internally and must not appear in the input data:

  • d_ : Binary treatment indicator

  • post_ : Binary post-period indicator

  • tindex : Sequential time index

  • tq : Quarter index for quarterly data

  • ydot : Residualized outcome

  • ydot_postavg : Post-period average of residualized outcome

  • firstpost : Main regression sample indicator
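A quick pre-check for these names can be sketched with pandas as follows. The helper name find_reserved_columns is hypothetical and not part of lwdid; the list of names is copied from the notes above, and the renaming step is just one possible way to resolve a conflict.

```python
import pandas as pd

# Reserved names copied from the list above.
RESERVED_COLUMNS = ["d_", "post_", "tindex", "tq", "ydot", "ydot_postavg", "firstpost"]

def find_reserved_columns(data: pd.DataFrame) -> list:
    """Return reserved names that already exist as columns (hypothetical helper)."""
    return [name for name in RESERVED_COLUMNS if name in data.columns]

df = pd.DataFrame({"unit": [1, 1], "year": [2000, 2001], "tindex": [1, 2]})
conflicts = find_reserved_columns(df)
print(conflicts)  # ['tindex']

# One possible fix: rename conflicting columns before validation.
df = df.rename(columns={name: f"{name}_orig" for name in conflicts})
```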

lwdid.validation.validate_and_prepare_data(data, y, d, ivar, tvar, post, rolling, controls=None, season_var=None)

Validate input data and execute data preparation pipeline.

This is the main entry point for all data validation and preparation in the lwdid package. It performs comprehensive checks and transformations to ensure data integrity before transformation and estimation.

The validation pipeline consists of five stages:

  1. Input validation: DataFrame type check, reserved column names check, required columns existence check, rolling parameter validation.

  2. Data type validation: Outcome variable numeric type check, control variables numeric type check.

  3. Time-invariance validation: Treatment indicator time-invariance check, control variables time-invariance check.

  4. Data preparation: String ID conversion to numeric codes, time index creation (tindex), binary treatment/post indicator creation (d_, post_), missing value handling.

  5. Time structure validation: Time continuity validation, post-treatment monotonicity check.

Parameters:
  • data (pd.DataFrame) – Long-format panel data with one row per unit-time observation.

  • y (str) – Outcome variable column name. Must be numeric.

  • d (str) – Treatment indicator column name. Must be time-invariant (constant within unit). d=1 for treated units, d=0 for control units.

  • ivar (str) – Unit identifier column name. Can be string or numeric.

  • tvar (str or list of str) –

    Time variable column name(s):

    • str: Annual data (e.g., 'year')

    • [str, str]: Quarterly data (e.g., ['year', 'quarter'])

  • post (str) – Post-treatment indicator column name. Must be monotone non-decreasing in time. post=0 for pre-treatment periods, post=1 for post-treatment periods.

  • rolling (str) –

    Transformation method. Must be one of:

    • 'demean': Unit-specific demeaning

    • 'detrend': Unit-specific detrending

    • 'demeanq': Quarterly demeaning with seasonal effects

    • 'detrendq': Quarterly detrending with seasonal effects

  • controls (list of str, optional) – Control variable column names. Must be numeric and time-invariant. Default: None (no controls).

  • season_var (str, optional) – Column name of seasonal indicator variable for seasonal transformations (demeanq, detrendq). Values should be integers from 1 to Q representing seasonal periods (e.g., quarters 1-4, months 1-12, or weeks 1-52). If provided, allows demeanq/detrendq to work with a single tvar column. Default: None (uses tvar[1] for legacy quarterly data format).

Return type:

tuple[DataFrame, dict]

Returns:

  • data_clean (pd.DataFrame) – Cleaned and prepared data with the following modifications:

    • Original columns preserved

    • New columns added: tindex, d_, post_ (and tq for quarterly data)

    • String IDs converted to numeric codes (if applicable)

    • Missing values handled (rows with NaN in y, d, post, ivar, or time variables are dropped; missing values in control variables are handled later at the estimation stage)

  • metadata (dict) – Metadata dictionary containing:

    • 'N': Total number of units

    • 'N_treated': Number of treated units (d_ = 1)

    • 'N_control': Number of control units (d_ = 0)

    • 'T': Total number of time periods

    • 'K': Number of pre-treatment periods

    • 'tpost1': First post-treatment period index

    • 'is_quarterly': Boolean indicating quarterly data

    • 'id_mapping': Dict mapping original string IDs to numeric codes (if applicable)
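To make the metadata entries concrete, the sketch below derives comparable quantities from a toy panel with plain pandas. It is an illustration only, not the library's code: the column names are made up, and it assumes that the time index starts at 1 and that non-zero treatment and post values count as 1.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "year": [2000, 2001, 2002] * 2,
    "treated": [1, 1, 1, 0, 0, 0],
    "post": [0, 0, 1] * 2,
})

# Sequential time index starting at 1 (assumed convention).
df["tindex"] = df["year"] - df["year"].min() + 1

# One treatment value per unit, since the indicator is time-invariant.
unit_d = df.groupby("unit")["treated"].first()

summary = {
    "N": int(unit_d.size),
    "N_treated": int((unit_d != 0).sum()),
    "N_control": int((unit_d == 0).sum()),
    "T": int(df["tindex"].nunique()),
    "K": int(df.loc[df["post"] == 0, "tindex"].nunique()),
    "tpost1": int(df.loc[df["post"] != 0, "tindex"].min()),
}
print(summary)
```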

Raises:

Notes

This function creates several internal columns (d_, post_, tindex, tq, ydot, ydot_postavg, firstpost). Input data must not contain columns with these names.

See also

_validate_required_columns

Validates column existence.

_validate_time_continuity

Validates time series continuity.

validate_quarter_coverage

Validates quarter coverage for quarterly methods.

lwdid.validation.validate_season_diversity(data, ivar, season_var, post, Q=4)

Validate seasonal diversity and coverage for seasonal effects identification.

Ensures each unit has at least two distinct seasons in the pre-treatment period (required to identify seasonal effects) and that all post-period seasons also appear in the pre-period.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • ivar (str) – Unit identifier column name.

  • season_var (str) – Seasonal variable column name (values should be 1 to Q).

  • post (str) – Post-treatment indicator column name.

  • Q (int, default 4) – Number of seasonal periods per cycle. Common values:

    • 4: Quarterly data (default)

    • 12: Monthly data

    • 52: Weekly data

Raises:

InsufficientQuarterDiversityError – If any unit has fewer than 2 distinct seasons in pre-period, or if any post-period season does not appear in the pre-period.

Return type:

None

See also

validate_season_coverage

Validates only seasonal coverage.
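The two rules described for validate_season_diversity can be approximated with a few lines of pandas. The helper check_season_diversity below is hypothetical: it returns a dictionary of failing units instead of raising InsufficientQuarterDiversityError, and it assumes that post is coded 0 for pre-treatment rows.

```python
import pandas as pd

def check_season_diversity(data, ivar, season_var, post):
    """Return {unit: reason} for units failing either rule (hypothetical helper)."""
    problems = {}
    for unit, grp in data.groupby(ivar):
        pre = set(grp.loc[grp[post] == 0, season_var])
        after = set(grp.loc[grp[post] != 0, season_var])
        if len(pre) < 2:
            problems[unit] = "fewer than 2 distinct pre-period seasons"
        elif not after <= pre:
            missing = sorted(int(s) for s in after - pre)
            problems[unit] = f"post-period seasons missing from pre-period: {missing}"
    return problems

df = pd.DataFrame({
    "unit":    ["a"] * 6 + ["b"] * 4,
    "quarter": [1, 2, 3, 4, 1, 2] + [1, 2, 3, 4],
    "post":    [0, 0, 0, 0, 1, 1] + [0, 0, 1, 1],
})
problems = check_season_diversity(df, "unit", "quarter", "post")
print(problems)  # only unit 'b' is reported: quarters 3 and 4 never occur pre-treatment
```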

lwdid.validation.validate_season_coverage(data, ivar, season_var, post, Q=4)

Validate that post-period seasons appear in pre-period for each unit.

Seasonal transformation methods assume seasonal effects are constant over time. Each post-period season must appear in the pre-period to identify its seasonal coefficient.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • ivar (str) – Unit identifier column name.

  • season_var (str) – Seasonal variable column name (values should be 1 to Q).

  • post (str) – Post-treatment indicator column name.

  • Q (int, default 4) – Number of seasonal periods per cycle. Common values:

    • 4: Quarterly data (default)

    • 12: Monthly data

    • 52: Weekly data

Raises:

InsufficientQuarterDiversityError – If any unit has a post-period season that does not appear in its pre-period data.

Return type:

None

See also

validate_season_diversity

Also validates minimum seasonal diversity.

lwdid.validation.validate_quarter_diversity(data, ivar, quarter, post)

Validate quarter diversity and coverage for seasonal effects identification.

Ensures each unit has at least two distinct quarters in the pre-treatment period (required to identify seasonal effects) and that all post-period quarters also appear in the pre-period.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • ivar (str) – Unit identifier column name.

  • quarter (str) – Quarter variable column name (values should be 1, 2, 3, or 4).

  • post (str) – Post-treatment indicator column name.

Raises:

InsufficientQuarterDiversityError – If any unit has fewer than 2 distinct quarters in pre-period, or if any post-period quarter does not appear in the pre-period.

Return type:

None

See also

validate_quarter_coverage

Validates only quarter coverage.

Notes

This is a backward-compatible wrapper for validate_season_diversity with Q=4.

lwdid.validation.validate_quarter_coverage(data, ivar, quarter, post)

Validate that post-period quarters appear in pre-period for each unit.

Quarterly transformation methods assume seasonal effects are constant over time. Each post-period quarter must appear in the pre-period to identify its seasonal coefficient.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • ivar (str) – Unit identifier column name.

  • quarter (str) – Quarter variable column name (values should be 1, 2, 3, or 4).

  • post (str) – Post-treatment indicator column name.

Raises:

InsufficientQuarterDiversityError – If any unit has a post-period quarter that does not appear in its pre-period data.

Return type:

None

See also

validate_quarter_diversity

Also validates minimum quarter diversity.

Notes

This is a backward-compatible wrapper for validate_season_coverage with Q=4.

lwdid.validation.get_cohort_mask(unit_gvar, g)

Create a boolean mask identifying units belonging to a specific cohort.

Handles floating-point comparison with tolerance to account for potential rounding errors in gvar values.

Parameters:
  • unit_gvar (pd.Series) – Series mapping unit identifiers to their gvar (first treatment period) values. Index should be unit identifiers.

  • g (int) – Target cohort (first treatment period) to identify.

Returns:

Boolean series with same index as unit_gvar. True for units in cohort g, False otherwise.

Return type:

pd.Series

Notes

Uses COHORT_FLOAT_TOLERANCE for floating-point comparison to handle potential precision issues from data transformations.
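A tolerance-based comparison of this kind might look like the sketch below. The function name and the TOLERANCE value are assumptions made for illustration; the actual value of COHORT_FLOAT_TOLERANCE is not stated on this page.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: the real COHORT_FLOAT_TOLERANCE value is not documented here.
TOLERANCE = 1e-9

def cohort_mask_sketch(unit_gvar: pd.Series, g: float) -> pd.Series:
    """True where gvar equals g up to an absolute tolerance; NaN compares as False."""
    values = unit_gvar.astype(float)
    return (values - g).abs() <= TOLERANCE

unit_gvar = pd.Series(
    [2005.0, 2005.0000000001, 2006.0, np.nan, 0.0],
    index=["u1", "u2", "u3", "u4", "u5"],
)
mask = cohort_mask_sketch(unit_gvar, 2005)
print(mask.tolist())  # [True, True, False, False, False]
```

An exact comparison (unit_gvar == 2005) would miss unit u2, whose value differs from 2005 only by rounding noise.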

lwdid.validation.is_never_treated(gvar_value)

Determine if a unit is never treated based on its gvar value.

This is the single source of truth for never-treated status identification. All modules (validation, control_groups, aggregation) should use this function to ensure consistent never-treated determination.

Parameters:

gvar_value (int or float) – The gvar (first treatment period) value for a unit.

Returns:

True if the unit is never treated, False otherwise.

Return type:

bool

Raises:

InvalidStaggeredDataError – If gvar_value is negative infinity (-np.inf), which is not a valid gvar value.

Notes

A unit is considered never treated if its gvar value is:

  • NaN or None (missing value)

  • 0 (explicitly coded as never treated)

  • np.inf (positive infinity, explicitly coded as never treated)

Positive integers indicate the first treatment period (cohort membership). Negative values are invalid and should be caught by validate_staggered_data().

See also

validate_staggered_data

Validates staggered DiD data structure.

lwdid.validation.validate_staggered_data(data, gvar, ivar, tvar, y, controls=None)

Validate staggered DiD data and extract cohort structure.

Performs comprehensive validation for staggered adoption settings, checking gvar column validity, cohort identification, and data integrity.

Parameters:
  • data (pd.DataFrame) – Panel data in long format with one row per unit-time observation.

  • gvar (str) – Column name for first treatment period (cohort indicator). Valid values: positive integers (cohort), 0/inf/NaN (never treated).

  • ivar (str) – Unit identifier column name.

  • tvar (str or list of str) – Time variable column name(s). Single string for annual data, list of two strings for quarterly data (year, quarter).

  • y (str) – Outcome variable column name.

  • controls (list of str, optional) – Control variable column names. Default: None.

Returns:

Validation result dictionary containing:

  • cohorts : list of int, sorted list of treatment cohorts (excludes NT)

  • n_cohorts : int, number of distinct treatment cohorts

  • n_never_treated : int, number of never-treated units

  • n_treated : int, total number of treated units across all cohorts

  • has_never_treated : bool, whether never-treated units exist

  • cohort_sizes : dict, {cohort: n_units} mapping

  • T_min : int, minimum time period in data

  • T_max : int, maximum time period in data

  • N_total : int, total number of units

  • N_obs : int, total number of observations

  • warnings : list of str, warning messages

Return type:

dict

Raises:

See also

is_never_treated

Determines never-treated status from gvar value.

validate_and_prepare_data

Validation for common timing designs.
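The cohort-level entries of the result dictionary described above can be previewed with plain pandas before calling the validator. This is a sketch on made-up data, not the library's implementation; it assumes the never-treated encodings 0, inf, and NaN and leaves out the warnings entry.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": np.repeat([1, 2, 3, 4, 5], 3),
    "year": [2000, 2001, 2002] * 5,
    "gvar": np.repeat([0, np.inf, np.nan, 2001, 2002], 3),
})

# One gvar value per unit (gvar is assumed to be time-invariant).
unit_gvar = df.groupby("unit")["gvar"].first()
never = unit_gvar.isna() | (unit_gvar == np.inf) | (unit_gvar == 0)
treated = unit_gvar[~never].astype(int)

summary = {
    "cohorts": sorted(treated.unique().tolist()),
    "n_cohorts": int(treated.nunique()),
    "n_never_treated": int(never.sum()),
    "n_treated": int(treated.size),
    "has_never_treated": bool(never.any()),
    "cohort_sizes": {int(g): int(n) for g, n in treated.value_counts().items()},
    "T_min": int(df["year"].min()),
    "T_max": int(df["year"].max()),
    "N_total": int(unit_gvar.size),
    "N_obs": int(len(df)),
}
print(summary)
```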

lwdid.validation.detect_frequency(data, tvar, ivar=None)

Detect data frequency (quarterly, monthly, weekly) from time variable.

Analyzes the time variable to determine the most likely data frequency. Uses multiple heuristics including time interval analysis and observations per year counting.

Parameters:
  • data (pd.DataFrame) – Panel data containing the time variable.

  • tvar (str) – Column name of the time variable. Can be:

    • Integer time index (e.g., 1, 2, 3, …)

    • Year values (e.g., 2020, 2021, …)

    • Datetime values

  • ivar (str, optional) – Column name of the unit identifier. If provided, frequency detection is performed per-unit and aggregated for more robust estimation.

Returns:

Detection results with keys:

  • frequency (str or None) – Detected frequency: 'quarterly', 'monthly', 'weekly', 'annual', or None if detection failed.

  • Q (int or None) – Corresponding Q value: 4 (quarterly), 12 (monthly), 52 (weekly), 1 (annual), or None if detection failed.

  • confidence (float) – Confidence score in [0, 1] indicating detection reliability. Higher values indicate more consistent patterns.

  • method (str) – Detection method used: 'interval', 'obs_per_year', or 'heuristic'.

  • details (dict) – Additional diagnostic information including:

    • median_interval: Median time interval between observations

    • obs_per_year: Average observations per year (if applicable)

    • n_units_analyzed: Number of units used in detection

Return type:

dict

Notes

Detection heuristics:

  1. Interval-based: Analyzes median time interval between consecutive observations. Works best with datetime or continuous time indices.

  2. Observations per year: Counts average observations per calendar year. Works best with year-based time variables.

  3. Value range: Examines the range of time values to infer frequency.

The function returns frequency=None when:

  • Time variable has insufficient variation

  • Multiple frequencies are equally likely

  • Data pattern is irregular or ambiguous
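The observations-per-year heuristic can be illustrated with a small standalone sketch. The function guess_frequency and its rel_tol parameter are hypothetical; the sketch assumes datetime input and complete calendar years, and it does not reproduce the confidence score or the other heuristics of detect_frequency.

```python
import pandas as pd

Q_BY_FREQUENCY = {"annual": 1, "quarterly": 4, "monthly": 12, "weekly": 52}

def guess_frequency(dates: pd.Series, rel_tol: float = 0.1):
    """Map average observations per calendar year to (label, Q), or (None, None)."""
    obs_per_year = dates.groupby(dates.dt.year).size().mean()
    for name, q in Q_BY_FREQUENCY.items():
        if abs(obs_per_year - q) <= rel_tol * q:
            return name, q
    return None, None  # irregular or ambiguous pattern

quarterly = pd.Series(pd.date_range("2019-01-01", periods=12, freq="QS"))
monthly = pd.Series(pd.date_range("2019-01-01", periods=24, freq="MS"))
irregular = pd.Series(pd.to_datetime([
    "2020-01-01", "2020-02-01", "2020-03-05", "2020-05-01",
    "2020-06-01", "2020-09-01", "2020-11-01",
]))

print(guess_frequency(quarterly))  # ('quarterly', 4)
print(guess_frequency(monthly))    # ('monthly', 12)
print(guess_frequency(irregular))  # (None, None)
```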

See also

lwdid

Main estimation function with auto_detect_frequency parameter.

Overview

This module performs comprehensive validation of input data to ensure:

  1. Panel structure is correct (unique unit-time pairs, continuous time)

  2. Treatment timing follows the common timing assumption

  3. Sample size meets minimum requirements

  4. Control variables are time-invariant

  5. Data types are appropriate

All validation functions raise informative exceptions when requirements are violated, helping users identify and fix data issues quickly.

Validation Checks

Panel Structure Validation

Checks performed:

  • No duplicate (unit, time) observations

  • Time index forms a continuous sequence (no gaps)

  • Sufficient units for estimation (\(N \geq 3\))

  • At least one treated unit (d = 1) and one control unit (d = 0)

Why it matters:

  • Duplicate observations indicate data errors

  • Time gaps violate the continuous panel assumption

  • Too few units make inference unreliable

  • Need at least one treated and one control unit for DiD estimation

Example error (conceptual):

InvalidParameterError indicating duplicate (unit, time) observations.
Each (unit, time) combination must appear at most once.

Treatment Timing Validation

Checks performed:

  • post indicator is binarized internally as 0/1 (non-zero values are treated as 1)

  • post is the same for all units in each time period (common timing)

  • post is monotone (no treatment reversals)

  • At least one pre-treatment period (post = 0) and one post-treatment period (post != 0) exist

Why it matters:

  • Common timing is a core assumption of the method

  • Treatment reversals violate the persistence assumption

  • Need both pre- and post-treatment periods for DiD

Example errors (conceptual):

InvalidParameterError indicating that the common timing assumption is violated
because 'post' varies across units within the same period.
TimeDiscontinuityError indicating that 'post' is not monotone in time
(treatment reversals or suspensions).
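The treatment timing checks listed in this section can be reproduced by hand with pandas. The helper check_post_indicator below is a hypothetical sketch that collects problems in a list instead of raising the library's exceptions.

```python
import pandas as pd

def check_post_indicator(data, tvar, post):
    """Return a list of problems found in the post indicator (hypothetical helper)."""
    problems = []
    post_bin = (data[post] != 0).astype(int)   # non-zero values are treated as 1
    by_period = post_bin.groupby(data[tvar])
    if by_period.nunique().max() > 1:
        problems.append("post varies across units within a period")
    path = by_period.max().sort_index()        # one post value per period, in time order
    if not path.is_monotonic_increasing:
        problems.append("post is not monotone in time")
    if path.nunique() < 2:
        problems.append("need at least one pre- and one post-treatment period")
    return problems

good = pd.DataFrame({"unit": [1, 1, 1, 2, 2, 2],
                     "year": [2000, 2001, 2002] * 2,
                     "post": [0, 0, 1] * 2})
reversal = good.assign(post=[0, 1, 0] * 2)

print(check_post_indicator(good, "year", "post"))      # []
print(check_post_indicator(reversal, "year", "post"))  # ['post is not monotone in time']
```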

Pre-Treatment Period Validation

Checks performed:

  • Each unit has sufficient pre-treatment periods for the chosen transformation:

    • demean: at least 1 pre-treatment observation per unit

    • detrend: at least 2 pre-treatment observations per unit

    • demeanq: at least 1 pre-treatment observation per unit, and enough pre-period observations to estimate quarterly fixed effects (number of pre-period observations >= number of distinct pre-period quarters + 1)

    • detrendq: at least 2 pre-treatment observations per unit, and enough pre-period observations to estimate a linear trend plus quarterly fixed effects (number of pre-period observations >= 1 + number of distinct pre-period quarters)

Why it matters:

  • Demean requires at least 1 pre-treatment observation per unit to compute the pre-treatment mean

  • Detrend requires at least 2 pre-treatment observations per unit to estimate a linear trend

  • Quarterly methods additionally require sufficient pre-treatment observations within each unit to estimate quarterly fixed effects without rank deficiency

Example error (conceptual):

InsufficientPrePeriodsError indicating that some units have fewer
pre-treatment periods than required by the chosen transformation.
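For the quarterly methods, the counting rule stated above (pre-period observations >= distinct pre-period quarters + 1) can be checked per unit as sketched below. The helper name is hypothetical, and the sketch covers only this one rule.

```python
import pandas as pd

def units_failing_quarterly_rule(data, ivar, quarter, post):
    """Units whose pre-period observation count is below distinct quarters + 1."""
    pre = data[data[post] == 0]
    stats = pre.groupby(ivar)[quarter].agg(n_obs="size", n_quarters="nunique")
    # Units without any pre-period rows get zero counts and therefore fail too.
    stats = stats.reindex(data[ivar].unique(), fill_value=0)
    return stats[stats["n_obs"] < stats["n_quarters"] + 1]

df = pd.DataFrame({
    "unit":    ["a"] * 6 + ["b"] * 4,
    "quarter": [1, 2, 3, 4, 1, 2] + [1, 2, 3, 4],
    "post":    [0, 0, 0, 0, 0, 1] + [0, 0, 0, 1],
})
failing = units_failing_quarterly_rule(df, "unit", "quarter", "post")
print(failing)  # unit 'b': 3 pre-period observations but 3 distinct quarters
```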

Control Variables Validation

Checks performed:

  • Control variables exist in the data

  • Controls are time-invariant (constant within each unit)

  • Control variables are numeric; missing values (if any) are handled at the estimation stage rather than in the validation step

Why it matters:

  • Time-varying controls can be endogenous to treatment

  • Missing controls can lead to dropped observations when controls are included in the regression

Example error (conceptual):

InvalidParameterError indicating that control variable 'income' is
not time-invariant within units.
For example, 'income' varies within unit 'unit_42'.

Data Type Validation

Checks performed:

  • Outcome variable is numeric

  • Treatment indicator can be converted to numeric and is interpreted as 0 vs non-zero (non-zero values are treated as 1)

  • Unit and time identifiers are present

  • Rows with missing values in required variables (outcome, treatment, unit identifier, time variable(s), or post) are dropped with a warning

Why it matters:

  • Non-numeric outcomes cannot be used in regression

  • Missing values in key variables change the effective sample after dropping affected rows

Example error (conceptual):

InvalidParameterError indicating that outcome variable 'y' is not
numeric.
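A numeric-type problem of this kind can be diagnosed and repaired before validation. The sketch below uses pandas.api.types.is_numeric_dtype and pd.to_numeric; the helper name and the toy data are illustrative only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def non_numeric_columns(data, columns):
    """Return the subset of columns whose dtype is not numeric."""
    return [c for c in columns if not is_numeric_dtype(data[c])]

df = pd.DataFrame({"y": ["1.5", "2.0", "n/a"], "x1": [0.1, 0.2, 0.3]})
bad = non_numeric_columns(df, ["y", "x1"])
print(bad)  # ['y']

# Convert to numeric; unparseable entries become NaN.
df["y"] = pd.to_numeric(df["y"], errors="coerce")
print(df["y"].isna().sum())  # 1
```

Rows that become NaN in the outcome would then be dropped with a warning, as described in the checks above.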

Validation Functions

In practice, structural validation of panel layout, treatment timing, control variables, and data types is performed internally by validate_and_prepare_data(), which is called at the beginning of lwdid(). Additional pre-treatment period requirements for each rolling method and quarterly coverage checks are enforced in the transformation step (see lwdid.transformations). The earlier sections (panel structure, treatment timing, pre-treatment periods, controls, data types) describe conceptual checks rather than public helper functions.

The pandas-based snippets in the following sections illustrate how to diagnose and fix common problems yourself before calling lwdid(), but there are no separate public functions named validate_panel_structure, validate_treatment_timing, or validate_controls in the current implementation.

validate_and_prepare_data()

from lwdid.validation import validate_and_prepare_data
import pandas as pd

data = pd.read_csv('panel_data.csv')

data_clean, metadata = validate_and_prepare_data(
    data=data,
    y='outcome',
    d='treated',
    ivar='unit',
    tvar='year',          # or ['year', 'quarter'] for quarterly data
    post='post',
    rolling='demean',     # required rolling method: 'demean', 'detrend', 'demeanq', or 'detrendq'
    controls=['x1', 'x2'] # optional time-invariant controls
)

print(metadata['N'], metadata['T'], metadata['K'])

Quarterly Helper Checks

For quarterly data, the module also provides helper functions such as validate_quarter_coverage that can be used in advanced workflows to pre-check seasonal coverage requirements for demeanq/detrendq. In typical usage these helpers are called indirectly by lwdid.lwdid() via the transformation module rather than being used directly.

Never-Treated Unit Identification

The is_never_treated() function provides a standardized way to identify never-treated units in staggered adoption designs. This function is the single source of truth for never-treated identification across all modules.

is_never_treated() Function

from lwdid.validation import is_never_treated
import numpy as np
import pandas as pd

# Check individual values
is_never_treated(0)        # True - zero indicates never-treated
is_never_treated(np.inf)   # True - infinity indicates never-treated
is_never_treated(np.nan)   # True - NaN indicates never-treated
is_never_treated(None)     # True - None indicates never-treated
is_never_treated(pd.NA)    # True - pandas NA indicates never-treated
is_never_treated(2005)     # False - positive integer is treatment cohort
is_never_treated(-np.inf)  # Raises InvalidStaggeredDataError

Valid Never-Treated Encodings:

The following values are recognized as never-treated:

  1. Zero (0 or 0.0): Common encoding in Stata and other software

  2. Positive infinity (np.inf): Represents “treated at infinity” (never)

  3. NaN/NA/None: Missing treatment time indicates never-treated

  4. Near-zero values: Values within floating-point tolerance (\(|x| < 10^{-10}\))

Invalid Values:

  • Negative infinity (-np.inf): Raises InvalidStaggeredDataError

  • Negative numbers: Should be caught by validate_staggered_data()
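The encoding rules above can be restated as a short function. This is a sketch of the documented behaviour, not the library's source: the name never_treated_sketch is hypothetical, and it raises ValueError where the library raises InvalidStaggeredDataError.

```python
import math
import numpy as np
import pandas as pd

def never_treated_sketch(gvar_value) -> bool:
    """Restatement of the documented rules; not the library's code."""
    if gvar_value is None or pd.isna(gvar_value):
        return True                      # NaN / NA / None
    value = float(gvar_value)
    if math.isinf(value):
        if value < 0:
            raise ValueError("-inf is not a valid gvar value")
        return True                      # +inf
    return abs(value) < 1e-10            # zero or near-zero

checks = [never_treated_sketch(v)
          for v in (0, 1e-12, np.inf, np.nan, None, pd.NA, 2005)]
print(checks)  # [True, True, True, True, True, True, False]
```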

Usage with DataFrames:

import pandas as pd
import numpy as np
from lwdid.validation import is_never_treated

# Create sample data
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5] * 3,
    'year': [2000, 2001, 2002] * 5,
    'y': np.random.randn(15),
    'gvar': [0, np.inf, np.nan, 2001, 2002] * 3
})

# Identify never-treated units
unit_gvar = data.groupby('id')['gvar'].first()
nt_mask = unit_gvar.apply(is_never_treated)

print(f"Never-treated units: {nt_mask.sum()}")  # Output: 3
print(f"NT unit IDs: {unit_gvar[nt_mask].index.tolist()}")  # [1, 2, 3]

Cross-Module Consistency:

The is_never_treated() function is used consistently across all modules:

  • lwdid.validation: Primary definition

  • lwdid.staggered.control_groups: Control group selection

  • lwdid.staggered.aggregation: Weight calculations

  • lwdid.staggered.randomization: Randomization inference

This ensures that never-treated identification is consistent throughout the estimation pipeline.

Staggered Adoption Validation

For staggered adoption designs (where units are treated at different times), additional validation checks are performed when the gvar parameter is specified instead of post.

Staggered-Specific Checks

Checks performed:

  • gvar column exists and is time-invariant within units

  • gvar values are valid:

    • Positive integers indicate treatment cohorts (first treatment period)

    • Values of 0, inf, or NaN indicate never-treated units

  • At least one treatment cohort exists

  • At least one control unit exists (never-treated or not-yet-treated, depending on the control_group strategy)

  • Each cohort has sufficient pre-treatment periods for the chosen transformation (demean requires \(g - 1 \geq 1\), detrend requires \(g - 1 \geq 2\))

Why it matters:

  • Time-varying gvar violates the staggered design assumption

  • Invalid gvar values prevent proper cohort identification

  • Insufficient pre-treatment periods make transformation impossible

Example errors (conceptual):

InvalidStaggeredDataError indicating that 'gvar' varies within unit.
The first treatment period must be constant across all observations
for a given unit.
NoNeverTreatedError indicating that no never-treated units exist
when control_group='never_treated' is specified.
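The time-invariance requirement on gvar can be checked directly with a groupby. The helper below is a hypothetical sketch; it counts NaN as a distinct value so that a unit mixing a year with missing values is also flagged.

```python
import numpy as np
import pandas as pd

def units_with_varying_gvar(data, ivar, gvar):
    """Return units whose gvar takes more than one value (NaN counted as a value)."""
    n_values = data.groupby(ivar)[gvar].nunique(dropna=False)
    return n_values[n_values > 1].index.tolist()

df = pd.DataFrame({
    "unit": np.repeat([1, 2, 3], 3),
    "year": [2000, 2001, 2002] * 3,
    "gvar": [2001, 2001, 2001, 2002, 2002, 2003, np.nan, np.nan, np.nan],
})
varying = units_with_varying_gvar(df, "unit", "gvar")
print(varying)  # [2]
```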

Control Group Strategy Validation

The validation differs based on the chosen control group strategy:

never_treated:

  • Requires at least one unit with gvar equal to 0, inf, or NaN

  • These units serve as controls for all cohort-time effect estimations

  • Required when using aggregate='cohort' or aggregate='overall'

not_yet_treated:

  • Uses never-treated units plus units not yet treated at each calendar time

  • More flexible but requires the no-anticipation assumption to hold

  • For cohort \(g\) at time \(r\), valid controls include units with first treatment period \(h > r\)
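The not_yet_treated rule can be written as a boolean mask over units. The sketch below is an illustration under the encodings described on this page; the function name is hypothetical and the library's own selection logic may apply further conditions.

```python
import numpy as np
import pandas as pd

def not_yet_treated_controls(unit_gvar: pd.Series, r: float) -> pd.Series:
    """True for never-treated units and for units first treated after period r."""
    never = unit_gvar.isna() | (unit_gvar == 0) | (unit_gvar == np.inf)
    return never | (unit_gvar > r)

unit_gvar = pd.Series([0, np.nan, 2003, 2005, 2007], index=list("abcde"))

# Valid controls at calendar time r = 2004, e.g. for cohort g = 2003.
controls = not_yet_treated_controls(unit_gvar, r=2004)
print(controls.tolist())  # [True, True, False, True, True]
```

Units a and b are never treated, while d and e are first treated after 2004; unit c is already treated and is excluded.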

Staggered Data Usage

For staggered designs, validation is performed internally by lwdid() when the gvar parameter is provided:

from lwdid import lwdid
import pandas as pd

data = pd.read_csv('staggered_data.csv')

# gvar indicates first treatment period; 0, inf, or NaN for never-treated
results = lwdid(
    data=data,
    y='outcome',
    ivar='unit',
    tvar='year',
    gvar='first_treat_year',  # First treatment period column
    rolling='demean',
    control_group='never_treated',  # aggregate='overall' requires never-treated units
    aggregate='overall'
)

Error: Invalid Cohort Values

Problem (conceptual):

InvalidStaggeredDataError indicating that 'gvar' contains invalid values.

Cause: gvar contains values that cannot be interpreted as valid cohort indicators (e.g., negative numbers, non-numeric values other than NaN).

Solution:

# Check gvar values
print(data['gvar'].unique())

# Ensure valid cohort values: positive integers for treated, 0/NaN for never-treated
# Convert never-treated indicator if needed
data['gvar'] = data['gvar'].replace({-1: 0, 'never': 0})

# Ensure numeric type
data['gvar'] = pd.to_numeric(data['gvar'], errors='coerce')

Error: No Never-Treated Units

Problem (conceptual):

NoNeverTreatedError indicating that control_group='never_treated' requires
at least one never-treated unit, but none were found.

Cause: All units are eventually treated, but control_group='never_treated' was specified.

Solution:

import numpy as np

# Check for never-treated units (gvar equal to 0, inf, or NaN)
never_treated = data[data['gvar'].isin([0, np.inf]) | data['gvar'].isna()]
print(f"Never-treated units: {never_treated['unit'].nunique()}")

# Option 1: Switch to not_yet_treated control group
results = lwdid(..., control_group='not_yet_treated')

# Option 2: Use 'never_treated' only if such units exist
# Note: aggregate='cohort' and aggregate='overall' require never-treated units

Error: Insufficient Pre-Treatment Periods for Cohort

Problem (conceptual):

InsufficientPrePeriodsError indicating that cohort g=2005 has insufficient
pre-treatment periods for detrend transformation (requires at least 2).

Cause: Some cohorts are treated too early in the panel, leaving insufficient pre-treatment periods for the chosen transformation.

Solution:

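import numpy as np

# Check pre-treatment periods by cohort (finite, positive gvar values only;
# inf marks never-treated units and must not be counted as a cohort)
is_cohort = np.isfinite(data['gvar']) & (data['gvar'] > 0)
cohorts = data.loc[is_cohort, 'gvar'].unique()
min_year = data['year'].min()

for g in sorted(cohorts):
    pre_periods = g - min_year
    print(f"Cohort {g}: {pre_periods} pre-treatment periods")

# Option 1: Use 'demean' instead of 'detrend' (requires only 1 pre-period)
results = lwdid(..., rolling='demean')

# Option 2: Exclude early cohorts (here: fewer than 2 pre-treatment periods)
early_cohorts = [g for g in cohorts if g - min_year < 2]
data = data[~data['gvar'].isin(early_cohorts)]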

Common Validation Errors and Solutions

Error: Duplicate Observations

Problem (conceptual):

InvalidParameterError indicating duplicate (unit, time) observations.

Cause: Multiple rows with the same (unit, time) combination.

Solution:

# Check for duplicates
duplicates = data[data.duplicated(subset=['unit', 'year'], keep=False)]
print(duplicates)

# Remove duplicates (if appropriate)
data = data.drop_duplicates(subset=['unit', 'year'], keep='first')

# Or aggregate duplicates instead (averages numeric columns only)
data = data.groupby(['unit', 'year']).mean(numeric_only=True).reset_index()

Error: Non-Common Treatment Timing

Problem (conceptual):

InvalidParameterError indicating violation of the common timing assumption.

Cause: post varies across units in the same time period.

Solution:

# Check if post is time-based
post_by_time = data.groupby('year')['post'].nunique()
print(post_by_time[post_by_time > 1])  # Periods with varying post

# Create time-based post indicator
treatment_year = 2020
data['post'] = (data['year'] >= treatment_year).astype(int)

Error: Insufficient Pre-Treatment Periods

Problem (conceptual):

InsufficientPrePeriodsError indicating that some units lack enough
pre-treatment periods for the chosen transformation.

Cause: Some units have fewer pre-treatment periods than the chosen transformation requires (for example, fewer than 2 for detrend/detrendq).

Solution:

# Check pre-treatment periods by unit
pre_periods = data[data['post'] == 0].groupby('unit').size()
print(pre_periods[pre_periods < 2])  # Units with < 2 pre-periods

# Option 1: Use 'demean' instead (requires T0 >= 1)
results = lwdid(..., rolling='demean')

# Option 2: Drop units with insufficient pre-treatment periods
units_to_keep = pre_periods[pre_periods >= 2].index
data = data[data['unit'].isin(units_to_keep)]

Error: Time-Varying Controls

Problem (conceptual):

InvalidParameterError indicating that control variable 'income' is
not time-invariant (constant within each unit).

Cause: Control variable changes over time for some units.

Solution:

# Check which controls vary
for control in ['income', 'population']:
    varying = data.groupby('unit')[control].nunique()
    print(f"{control} varies in {(varying > 1).sum()} units")

# Options 1 and 2 are alternatives; apply only one, since each drops 'income'
# Option 1: Use baseline (earliest period) value
baseline = data.sort_values('year').groupby('unit')['income'].first().reset_index()
baseline.columns = ['unit', 'income_baseline']
data = data.drop('income', axis=1).merge(baseline, on='unit')

# Option 2: Use pre-treatment average
pre_avg = data[data['post'] == 0].groupby('unit')['income'].mean()
pre_avg = pre_avg.reset_index()
pre_avg.columns = ['unit', 'income_pre_avg']
data = data.drop('income', axis=1).merge(pre_avg, on='unit')

Error: Time Gaps

Problem (conceptual):

TimeDiscontinuityError indicating that the time index is not continuous.

Cause: Missing time periods in the data.

Solution:

# Check time sequence: list each gap as (last year before gap, first year after gap)
time_values = sorted(data['year'].unique())
gaps = [(prev, nxt) for prev, nxt in zip(time_values, time_values[1:])
        if nxt - prev > 1]
print(f"Gaps found: {gaps}")

# Option 1: Fill gaps with missing values (if appropriate)
# Create complete panel
units = data['unit'].unique()
years = range(data['year'].min(), data['year'].max() + 1)
complete_index = pd.MultiIndex.from_product([units, years],
                                             names=['unit', 'year'])
data = data.set_index(['unit', 'year']).reindex(complete_index).reset_index()

# Option 2: Restrict to continuous sub-period
data = data[data['year'] >= 2015]  # Use only recent years

Best Practices

Pre-Validation Checks

Before running lwdid(), perform these checks:

import pandas as pd

# 1. Check for duplicates
assert not data.duplicated(subset=['unit', 'year']).any(), "Duplicates found"

# 2. Check time continuity
time_seq = sorted(data['year'].unique())
assert all(time_seq[i+1] - time_seq[i] == 1
           for i in range(len(time_seq)-1)), "Time gaps found"

# 3. Check post is time-based
assert data.groupby('year')['post'].nunique().max() == 1, "Post varies by unit"

# 4. Check controls are time-invariant
for control in ['x1', 'x2']:
    assert data.groupby('unit')[control].nunique().max() == 1, \
           f"{control} varies within units"

# 5. Check sample size
n_units = data['unit'].nunique()
assert n_units >= 3, f"Too few units: {n_units}"

print("All pre-validation checks passed!")

Data Preparation Checklist

Before using lwdid(), ensure:

  1. Data is in long format (one row per unit-time observation)

  2. No duplicate (unit, time) pairs

  3. Time variable forms a continuous sequence

  4. post is binary (0/1) and time-based

  5. d is a time-invariant treatment group indicator

  6. Control variables (if any) are time-invariant

  7. No missing values in required variables

  8. Sufficient pre-treatment periods for chosen transformation

  9. At least 3 units (\(N \geq 3\))

See Also