Validation Module (validation)

The validation module ensures data quality and assumption compliance before estimation. It checks panel structure, treatment timing, and data requirements.

Input validation and data preparation for difference-in-differences.

This module provides comprehensive validation utilities for panel data used in difference-in-differences analysis. It ensures data integrity, validates structural assumptions (time-invariance, continuity), and prepares data for downstream transformation and estimation.

The module supports both common timing designs (all units treated simultaneously) and staggered adoption designs (treatment timing varies by cohort). Validation checks include column existence, data types, time-invariance of treatment indicators and controls, time continuity, and treatment/control group adequacy.

Notes

Reserved column names created internally should not exist in input data to avoid conflicts:

d_ : Binary treatment indicator
post_ : Binary post-period indicator
tindex : Sequential time index
tq : Quarter index for quarterly data
ydot : Residualized outcome
ydot_postavg : Post-period average of residualized outcome
firstpost : Main regression sample indicator
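A quick way to check for conflicts before calling the package is sketched below. The helper name `find_reserved_columns` and the `_orig` suffix are illustrative assumptions, not part of lwdid; only the reserved names come from the list above.

```python
import pandas as pd

# Reserved names taken from the list above.
RESERVED_COLUMNS = ["d_", "post_", "tindex", "tq", "ydot", "ydot_postavg", "firstpost"]


def find_reserved_columns(data: pd.DataFrame) -> list:
    """Return reserved names already present in `data` (hypothetical helper)."""
    return [name for name in RESERVED_COLUMNS if name in data.columns]


df = pd.DataFrame({"unit": [1, 1], "year": [2000, 2001], "tindex": [1, 2]})
conflicts = find_reserved_columns(df)
if conflicts:
    # Rename conflicting columns before passing the data on (suffix is arbitrary).
    df = df.rename(columns={name: f"{name}_orig" for name in conflicts})
```

Renaming rather than dropping keeps the original information available after estimation.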

lwdid.validation.validate_and_prepare_data(data, y, d, ivar, tvar, post, rolling, controls=None, season_var=None)

Validate input data and execute data preparation pipeline.

This is the main entry point for all data validation and preparation in the lwdid package. It performs comprehensive checks and transformations to ensure data integrity before transformation and estimation.

The validation pipeline consists of five stages:

Input validation: DataFrame type check, reserved column names check, required columns existence check, rolling parameter validation.
Data type validation: Outcome variable numeric type check, control variables numeric type check.
Time-invariance validation: Treatment indicator time-invariance check, control variables time-invariance check.
Data preparation: String ID conversion to numeric codes, time index creation (tindex), binary treatment/post indicator creation (d_, post_), missing value handling.
Time structure validation: Time continuity validation, post-treatment monotonicity check.
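The data preparation stage can be pictured with plain pandas. This is a minimal sketch under stated assumptions, not the package's implementation: it assumes IDs are coded with `pd.factorize` (codes starting at 0), that `tindex` starts at 1 for annual data, and that any non-zero indicator value counts as 1. The column name `unit_code` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": ["a", "a", "b", "b"],
    "year": [2019, 2020, 2019, 2020],
    "treated": [2, 2, 0, 0],   # non-zero means treated
    "post": [0, 1, 0, 1],
})

# String IDs -> integer codes, keeping a mapping back to the original labels.
codes, labels = pd.factorize(df["unit"])
id_mapping = {label: code for code, label in enumerate(labels)}
df["unit_code"] = codes

# Sequential time index starting at 1 (assumed convention for annual data).
df["tindex"] = df["year"] - df["year"].min() + 1

# Binary indicators: any non-zero value is treated as 1.
df["d_"] = (df["treated"] != 0).astype(int)
df["post_"] = (df["post"] != 0).astype(int)
```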

Parameters:

data (pd.DataFrame) – Long-format panel data with one row per unit-time observation.
y (str) – Outcome variable column name. Must be numeric.
d (str) – Treatment indicator column name. Must be time-invariant (constant within unit). d=1 for treated units, d=0 for control units.
ivar (str) – Unit identifier column name. Can be string or numeric.
tvar (str or list of str) –
Time variable column name(s):
- str: Annual data (e.g., 'year')
- [str, str]: Quarterly data (e.g., ['year', 'quarter'])
post (str) – Post-treatment indicator column name. Must be monotone non-decreasing in time. post=0 for pre-treatment periods, post=1 for post-treatment periods.
rolling (str) –
Transformation method. Must be one of:
- 'demean': Unit-specific demeaning
- 'detrend': Unit-specific detrending
- 'demeanq': Quarterly demeaning with seasonal effects
- 'detrendq': Quarterly detrending with seasonal effects
controls (list of str, optional) – Control variable column names. Must be numeric and time-invariant. Default: None (no controls).
season_var (str, optional) – Column name of seasonal indicator variable for seasonal transformations (demeanq, detrendq). Values should be integers from 1 to Q representing seasonal periods (e.g., quarters 1-4, months 1-12, or weeks 1-52). If provided, allows demeanq/detrendq to work with a single tvar column. Default: None (uses tvar[1] for legacy quarterly data format).

Return type:

tuple[DataFrame, dict]

Returns:

data_clean (pd.DataFrame) – Cleaned and prepared data with the following modifications:
- Original columns preserved
- New columns added: tindex, d_, post_ (and tq for quarterly data)
- String IDs converted to numeric codes (if applicable)
- Missing values handled (rows with NaN in y, d, post, ivar, or time variables are dropped; missing values in control variables are handled later at the estimation stage)
metadata (dict) – Metadata dictionary containing:
- 'N': Total number of units
- 'N_treated': Number of treated units (d_ = 1)
- 'N_control': Number of control units (d_ = 0)
- 'T': Total number of time periods
- 'K': Number of pre-treatment periods
- 'tpost1': First post-treatment period index
- 'is_quarterly': Boolean indicating quarterly data
- 'id_mapping': Dict mapping original string IDs to numeric codes (if applicable)

Raises:

TypeError – If data is not a pandas DataFrame.
MissingRequiredColumnError – If required columns are missing from data.
InvalidParameterError – If reserved column names exist in data.
InvalidRollingMethodError – If rolling parameter is not one of the four valid methods.
InsufficientDataError – If sample size is insufficient (no treated/control units).
TimeDiscontinuityError – If time series has gaps or post variable is non-monotone.
InsufficientQuarterDiversityError – If quarterly helper checks for demeanq/detrendq detect invalid quarter patterns (raised indirectly via quarterly validation utilities).

Notes

This function creates several internal columns (d_, post_, tindex, tq, ydot, ydot_postavg, firstpost). Input data must not contain columns with these names.

See also

_validate_required_columns: Validates column existence.
_validate_time_continuity: Validates time series continuity.
validate_quarter_coverage: Validates quarter coverage for quarterly methods.

lwdid.validation.validate_season_diversity(data, ivar, season_var, post, Q=4)

Validate seasonal diversity and coverage for seasonal effects identification.

Ensures each unit has at least two distinct seasons in the pre-treatment period (required to identify seasonal effects) and that all post-period seasons also appear in the pre-period.

Parameters:

data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
season_var (str) – Seasonal variable column name (values should be 1 to Q).
post (str) – Post-treatment indicator column name.
Q (int, default 4) – Number of seasonal periods per cycle. Common values:
- 4: Quarterly data (default)
- 12: Monthly data
- 52: Weekly data

Raises:

InsufficientQuarterDiversityError – If any unit has fewer than 2 distinct seasons in pre-period, or if any post-period season does not appear in the pre-period.

Return type:

None

See also

validate_season_coverage: Validates only seasonal coverage.
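The two conditions checked by validate_season_diversity can be reproduced with a short pandas loop, which is handy for finding the offending units before estimation. This is a sketch of the documented rules only; `season_problems` is a hypothetical helper, and it returns a dictionary instead of raising InsufficientQuarterDiversityError.

```python
import pandas as pd


def season_problems(data, ivar, season_var, post):
    """Return {unit: reason} for units failing either condition (hypothetical helper)."""
    problems = {}
    for unit, grp in data.groupby(ivar):
        pre = set(grp.loc[grp[post] == 0, season_var])
        post_seasons = set(grp.loc[grp[post] != 0, season_var])
        if len(pre) < 2:
            # Condition 1: at least two distinct seasons in the pre-period.
            problems[unit] = "fewer than 2 distinct pre-period seasons"
        elif not post_seasons <= pre:
            # Condition 2: every post-period season also appears in the pre-period.
            missing = sorted(post_seasons - pre)
            problems[unit] = f"post-period seasons missing from pre-period: {missing}"
    return problems


df = pd.DataFrame({
    "unit":    [1, 1, 1, 1, 2, 2, 2, 2],
    "quarter": [1, 2, 1, 2, 1, 2, 3, 4],
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],
})
problems = season_problems(df, "unit", "quarter", "post")
```

Here unit 1 passes both conditions, while unit 2 is flagged because quarters 3 and 4 occur only after treatment.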

lwdid.validation.validate_season_coverage(data, ivar, season_var, post, Q=4)

Validate that post-period seasons appear in pre-period for each unit.

Seasonal transformation methods assume seasonal effects are constant over time. Each post-period season must appear in the pre-period to identify its seasonal coefficient.

Parameters:

data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
season_var (str) – Seasonal variable column name (values should be 1 to Q).
post (str) – Post-treatment indicator column name.
Q (int, default 4) – Number of seasonal periods per cycle. Common values:
- 4: Quarterly data (default)
- 12: Monthly data
- 52: Weekly data

Raises:

InsufficientQuarterDiversityError – If any unit has a post-period season that does not appear in its pre-period data.

Return type:

None

See also

validate_season_diversity: Also validates minimum seasonal diversity.

lwdid.validation.validate_quarter_diversity(data, ivar, quarter, post)

Validate quarter diversity and coverage for seasonal effects identification.

Ensures each unit has at least two distinct quarters in the pre-treatment period (required to identify seasonal effects) and that all post-period quarters also appear in the pre-period.

Parameters:

data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
quarter (str) – Quarter variable column name (values should be 1, 2, 3, or 4).
post (str) – Post-treatment indicator column name.

Raises:

InsufficientQuarterDiversityError – If any unit has fewer than 2 distinct quarters in pre-period, or if any post-period quarter does not appear in the pre-period.

Return type:

None

See also

validate_quarter_coverage: Validates only quarter coverage.

Notes

This is a backward-compatible wrapper for validate_season_diversity with Q=4.

lwdid.validation.validate_quarter_coverage(data, ivar, quarter, post)

Validate that post-period quarters appear in pre-period for each unit.

Quarterly transformation methods assume seasonal effects are constant over time. Each post-period quarter must appear in the pre-period to identify its seasonal coefficient.

Parameters:

data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
quarter (str) – Quarter variable column name (values should be 1, 2, 3, or 4).
post (str) – Post-treatment indicator column name.

Raises:

InsufficientQuarterDiversityError – If any unit has a post-period quarter that does not appear in its pre-period data.

Return type:

None

See also

validate_quarter_diversity: Also validates minimum quarter diversity.

Notes

This is a backward-compatible wrapper for validate_season_coverage with Q=4.

lwdid.validation.get_cohort_mask(unit_gvar, g)

Create a boolean mask identifying units belonging to a specific cohort.

Handles floating-point comparison with tolerance to account for potential rounding errors in gvar values.

Parameters:

unit_gvar (pd.Series) – Series mapping unit identifiers to their gvar (first treatment period) values. Index should be unit identifiers.
g (int) – Target cohort (first treatment period) to identify.

Returns:

Boolean series with same index as unit_gvar. True for units in cohort g, False otherwise.

Return type:

pd.Series

Notes

Uses COHORT_FLOAT_TOLERANCE for floating-point comparison to handle potential precision issues from data transformations.
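The idea of a tolerance-based cohort comparison can be sketched as follows. The tolerance value below is a hypothetical stand-in, since the actual value of COHORT_FLOAT_TOLERANCE is not stated here, and `cohort_mask_sketch` is an illustrative name rather than the library function.

```python
import numpy as np
import pandas as pd

TOLERANCE = 1e-9  # hypothetical stand-in; the real COHORT_FLOAT_TOLERANCE may differ


def cohort_mask_sketch(unit_gvar: pd.Series, g: int) -> pd.Series:
    """True where a unit's gvar equals g up to an absolute tolerance."""
    # NaN and inf never match a finite g: their distance from g is NaN or inf.
    return (unit_gvar - g).abs() <= TOLERANCE


unit_gvar = pd.Series(
    [2005.0, 2004.9999999999, np.nan, np.inf, 2006.0],
    index=["a", "b", "c", "d", "e"],
)
mask = cohort_mask_sketch(unit_gvar, 2005)
```

With exact equality, unit "b" would be excluded from cohort 2005 even though its value differs only by rounding noise.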

lwdid.validation.is_never_treated(gvar_value)

Determine if a unit is never treated based on its gvar value.

This is the single source of truth for never-treated status identification. All modules (validation, control_groups, aggregation) should use this function to ensure consistent never-treated determination.

Parameters:

gvar_value (int or float) – The gvar (first treatment period) value for a unit.

Returns:

True if the unit is never treated, False otherwise.

Return type:

bool

Raises:

InvalidStaggeredDataError – If gvar_value is negative infinity (-np.inf), which is not a valid gvar value.

Notes

A unit is considered never treated if its gvar value is:

NaN or None (missing value)
0 (explicitly coded as never treated)
np.inf (positive infinity, explicitly coded as never treated)

Positive integers indicate the first treatment period (cohort membership). Negative values are invalid and should be caught by validate_staggered_data().

See also

validate_staggered_data: Validates staggered DiD data structure.

lwdid.validation.validate_staggered_data(data, gvar, ivar, tvar, y, controls=None)

Validate staggered DiD data and extract cohort structure.

Performs comprehensive validation for staggered adoption settings, checking gvar column validity, cohort identification, and data integrity.

Parameters:

data (pd.DataFrame) – Panel data in long format with one row per unit-time observation.
gvar (str) – Column name for first treatment period (cohort indicator). Valid values: positive integers (cohort), 0/inf/NaN (never treated).
ivar (str) – Unit identifier column name.
tvar (str or list of str) – Time variable column name(s). Single string for annual data, list of two strings for quarterly data (year, quarter).
y (str) – Outcome variable column name.
controls (list of str, optional) – Control variable column names. Default: None.

Returns:

Validation result dictionary containing:

cohorts : list of int, sorted list of treatment cohorts (excludes NT)
n_cohorts : int, number of distinct treatment cohorts
n_never_treated : int, number of never-treated units
n_treated : int, total number of treated units across all cohorts
has_never_treated : bool, whether never-treated units exist
cohort_sizes : dict, {cohort: n_units} mapping
T_min : int, minimum time period in data
T_max : int, maximum time period in data
N_total : int, total number of units
N_obs : int, total number of observations
warnings : list of str, warning messages

Return type:

dict

Raises:

TypeError – If data is not a pandas DataFrame.
MissingRequiredColumnError – If required columns (gvar, ivar, tvar, y) are missing from data.
InvalidStaggeredDataError – If gvar column contains invalid values (negative numbers, strings) or if there are no valid treatment cohorts.

See also

is_never_treated: Determines never-treated status from gvar value.
validate_and_prepare_data: Validation for common timing designs.
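Several entries of the returned dictionary can be approximated directly with pandas when exploring a dataset. The following is a sketch under the never-treated coding described above (0, inf, or NaN); it does not call validate_staggered_data and performs none of its error checks.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3, 4, 4],
    "year": [2000, 2001] * 4,
    "gvar": [2001, 2001, 0, 0, np.nan, np.nan, 2001, 2001],
})

# gvar must be constant within unit; list any units where it is not.
varies = df.groupby("unit")["gvar"].nunique(dropna=False) > 1
units_with_varying_gvar = varies[varies].index.tolist()

# One gvar value per unit (first() returns NaN when all values are NaN).
unit_gvar = df.groupby("unit")["gvar"].first()

# Never-treated codes per the documentation: 0, +inf, or NaN.
never = unit_gvar.isna() | (unit_gvar == 0) | (unit_gvar == np.inf)
cohort_values = unit_gvar[~never]

summary = {
    "cohorts": sorted(int(g) for g in cohort_values.unique()),
    "n_never_treated": int(never.sum()),
    "n_treated": int((~never).sum()),
    "cohort_sizes": {int(g): int(n) for g, n in cohort_values.value_counts().items()},
    "T_min": int(df["year"].min()),
    "T_max": int(df["year"].max()),
}
```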

lwdid.validation.detect_frequency(data, tvar, ivar=None)

Detect data frequency (quarterly, monthly, weekly) from time variable.

Analyzes the time variable to determine the most likely data frequency. Uses multiple heuristics including time interval analysis and observations per year counting.

Parameters:

data (pd.DataFrame) – Panel data containing the time variable.
tvar (str) – Column name of the time variable. Can be:
- Integer time index (e.g., 1, 2, 3, …)
- Year values (e.g., 2020, 2021, …)
- Datetime values
ivar (str, optional) – Column name of the unit identifier. If provided, frequency detection is performed per-unit and aggregated for more robust estimation.

Returns:

Detection results with keys:

frequency (str or None) – Detected frequency: 'quarterly', 'monthly', 'weekly', 'annual', or None if detection failed.
Q (int or None) – Corresponding Q value: 4 (quarterly), 12 (monthly), 52 (weekly), 1 (annual), or None if detection failed.
confidence (float) – Confidence score in [0, 1] indicating detection reliability. Higher values indicate more consistent patterns.
method (str) – Detection method used: 'interval', 'obs_per_year', or 'heuristic'.
details (dict) – Additional diagnostic information including:
- median_interval: Median time interval between observations
- obs_per_year: Average observations per year (if applicable)
- n_units_analyzed: Number of units used in detection

Return type:

dict

Notes

Detection heuristics:

Interval-based: Analyzes median time interval between consecutive observations. Works best with datetime or continuous time indices.
Observations per year: Counts average observations per calendar year. Works best with year-based time variables.
Value range: Examines the range of time values to infer frequency.
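To make the observations-per-year heuristic concrete, here is a deliberately simplified sketch for a single datetime series. It is not the detect_frequency implementation: `guess_frequency` is a hypothetical helper, it requires an exact match on the median count, and it ignores confidence scoring and per-unit aggregation.

```python
import pandas as pd

Q_BY_FREQUENCY = {"annual": 1, "quarterly": 4, "monthly": 12, "weekly": 52}


def guess_frequency(dates: pd.Series):
    """Guess frequency from the median number of observations per calendar year."""
    per_year = dates.dt.year.value_counts()
    median_obs = per_year.median()
    for name, q in Q_BY_FREQUENCY.items():
        if median_obs == q:
            return name, q
    return None, None  # ambiguous or irregular pattern


dates = pd.Series(pd.date_range("2020-01-01", periods=8, freq="QS"))
frequency, Q = guess_frequency(dates)
```

A production heuristic would need some slack around the expected counts, for example because a calendar year can contain 53 weekly observations.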

The function returns frequency=None when:
- Time variable has insufficient variation
- Multiple frequencies are equally likely
- Data pattern is irregular or ambiguous

See also

lwdid: Main estimation function with auto_detect_frequency parameter.

Overview

This module performs comprehensive validation of input data to ensure:

Panel structure is correct (unique unit-time pairs, continuous time)
Treatment timing follows common timing assumption
Sample size meets minimum requirements
Control variables are time-invariant
Data types are appropriate

All validation functions raise informative exceptions when requirements are violated, helping users identify and fix data issues quickly.

Validation Checks

Panel Structure Validation

Checks performed:

No duplicate (unit, time) observations
Time index forms a continuous sequence (no gaps)
Sufficient observations for estimation (\(N \geq 3\))
At least one treated unit (d = 1) and one control unit (d = 0)

Why it matters:

Duplicate observations indicate data errors
Time gaps violate the continuous panel assumption
Too few units make inference unreliable
Need at least one treated and one control unit for DiD estimation

Example error (conceptual):

InvalidParameterError indicating duplicate (unit, time) observations.
Each (unit, time) combination must appear at most once.

Treatment Timing Validation

Checks performed:

post indicator is binarized internally as 0/1 (non-zero values are treated as 1)
post is the same for all units in each time period (common timing)
post is monotone (no treatment reversals)
At least one pre-treatment period (post == 0) and one post-treatment period (post != 0) exist

Why it matters:

Common timing is a core assumption of the method
Treatment reversals violate the persistence assumption
Need both pre- and post-treatment periods for DiD

Example errors (conceptual):

InvalidParameterError indicating that the common timing assumption is violated
because 'post' varies across units within the same period.

TimeDiscontinuityError indicating that 'post' is not monotone in time
(treatment reversals or suspensions).
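The monotonicity requirement can be checked by hand with a few lines of pandas. This sketch assumes common timing, so that one post value per period is meaningful, and binarizes post the same way as described above; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "year": [2018, 2019, 2020, 2018, 2019, 2020],
    "post": [0, 1, 0, 0, 1, 0],   # treatment switches off in 2020
})

# One post value per period (valid only under common timing), ordered by time.
post_by_year = df.groupby("year")["post"].max().sort_index()
post_binary = (post_by_year != 0).astype(int)

# Monotone non-decreasing means the period-to-period change is never negative.
reversals = post_binary[post_binary.diff() < 0].index.tolist()
is_monotone = post_binary.is_monotonic_increasing
```

The periods listed in `reversals` are the ones where post falls back from 1 to 0, which is the pattern the TimeDiscontinuityError above describes.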

Pre-Treatment Period Validation

Checks performed:

Each unit has sufficient pre-treatment periods for the chosen transformation:
- demean: at least 1 pre-treatment observation per unit
- detrend: at least 2 pre-treatment observations per unit
- demeanq: at least 1 pre-treatment observation per unit, and enough pre-period observations to estimate quarterly fixed effects (number of pre-period observations >= number of distinct pre-period quarters + 1)
- detrendq: at least 2 pre-treatment observations per unit, and enough pre-period observations to estimate a linear trend plus quarterly fixed effects (number of pre-period observations >= 1 + number of distinct pre-period quarters)

Why it matters:

Demean requires at least 1 pre-treatment observation per unit to compute the pre-treatment mean
Detrend requires at least 2 pre-treatment observations per unit to estimate a linear trend
Quarterly methods additionally require sufficient pre-treatment observations within each unit to estimate quarterly fixed effects without rank deficiency

Example error (conceptual):

InsufficientPrePeriodsError indicating that some units have fewer
pre-treatment periods than required by the chosen transformation.
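The per-method minimums listed above can be turned into a small diagnostic that lists the units falling short. This is a sketch of the documented rules, not the check performed inside the transformation step; `units_lacking_pre_periods` is a hypothetical helper, and it counts rows with post == 0 as pre-treatment observations.

```python
import pandas as pd


def units_lacking_pre_periods(data, ivar, post, rolling, quarter=None):
    """Return units whose pre-period count is below the documented minimum."""
    pre = data[data[post] == 0]
    n_pre = pre.groupby(ivar).size()
    # Units with no pre-period rows at all must be counted as zero.
    n_pre = n_pre.reindex(data[ivar].unique(), fill_value=0)

    base = 1 if rolling in ("demean", "demeanq") else 2
    required = pd.Series(base, index=n_pre.index)
    if rolling in ("demeanq", "detrendq"):
        n_quarters = pre.groupby(ivar)[quarter].nunique()
        n_quarters = n_quarters.reindex(n_pre.index, fill_value=0)
        # Quarterly rules as listed above: n_pre >= distinct pre-period quarters + 1.
        required = pd.concat([required, n_quarters + 1], axis=1).max(axis=1)
    return n_pre[n_pre < required].index.tolist()


df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 3],
    "post": [0, 0, 1, 0, 1, 1],
})
short_for_demean = units_lacking_pre_periods(df, "unit", "post", "demean")
short_for_detrend = units_lacking_pre_periods(df, "unit", "post", "detrend")
```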

Control Variables Validation

Checks performed:

Control variables exist in the data
Controls are time-invariant (constant within each unit)
Control variables are numeric; missing values (if any) are handled at the estimation stage rather than in the validation step

Why it matters:

Time-varying controls can be endogenous to treatment
Missing controls can lead to dropped observations when controls are included in the regression

Example error (conceptual):

InvalidParameterError indicating that control variable 'income' is
not time-invariant within units.
For example, 'income' varies within unit 'unit_42'.

Data Type Validation

Checks performed:

Outcome variable is numeric
Treatment indicator can be converted to numeric and is interpreted as 0 vs non-zero (non-zero values are treated as 1)
Unit and time identifiers are present
Rows with missing values in required variables (outcome, treatment, unit identifier, time variable(s), or post) are dropped with a warning

Why it matters:

Non-numeric outcomes cannot be used in regression
Missing values in key variables change the effective sample after dropping affected rows

Example error (conceptual):

InvalidParameterError indicating that outcome variable 'y' is not
numeric.
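The row-dropping and binarization behaviour described in this section can be previewed on your own data. The sketch below mirrors the description only (drop rows with missing required values, warn, then treat non-zero treatment values as 1); the warning text and column names are made up for illustration.

```python
import warnings

import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 2, 2],
    "year": [2000, 2001, 2000, 2001],
    "y": [1.0, None, 0.5, 0.7],
    "treated": [3, 3, 0, 0],
    "post": [0, 1, 0, 1],
})

# The outcome must be numeric before anything else is attempted.
outcome_is_numeric = pd.api.types.is_numeric_dtype(df["y"])

# Drop rows with missing values in any required variable and report how many.
required = ["y", "treated", "unit", "year", "post"]
clean = df.dropna(subset=required).copy()
n_dropped = len(df) - len(clean)
if n_dropped:
    warnings.warn(f"Dropped {n_dropped} rows with missing required values")

# Non-zero treatment values are interpreted as treated.
clean["d_"] = (pd.to_numeric(clean["treated"]) != 0).astype(int)
```

Comparing `len(df)` with `len(clean)` before estimation shows how much the effective sample shrinks.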

Validation Functions

In practice, structural validation of panel layout, treatment timing, control variables, and data types is performed internally by validate_and_prepare_data(), which is called at the beginning of lwdid(). Additional pre-treatment period requirements for each rolling method and quarterly coverage checks are enforced in the transformation step (see lwdid.transformations). The earlier sections (panel structure, treatment timing, pre-treatment periods, controls, data types) describe conceptual checks rather than public helper functions.

The pandas-based snippets in the following sections illustrate how to diagnose and fix common problems yourself before calling lwdid(), but there are no separate public functions named validate_panel_structure, validate_treatment_timing, or validate_controls in the current implementation.

validate_and_prepare_data()

from lwdid.validation import validate_and_prepare_data
import pandas as pd

data = pd.read_csv('panel_data.csv')

data_clean, metadata = validate_and_prepare_data(
    data=data,
    y='outcome',
    d='treated',
    ivar='unit',
    tvar='year',          # or ['year', 'quarter'] for quarterly data
    post='post',
    rolling='demean',     # required rolling method: 'demean', 'detrend', 'demeanq', or 'detrendq'
    controls=['x1', 'x2'] # optional time-invariant controls
)

print(metadata['N'], metadata['T'], metadata['K'])

Quarterly Helper Checks

For quarterly data, the module also provides helper functions such as validate_quarter_coverage that can be used in advanced workflows to pre-check seasonal coverage requirements for demeanq/detrendq. In typical usage these helpers are called indirectly by lwdid.lwdid() via the transformation module rather than being used directly.

Never-Treated Unit Identification

The is_never_treated() function provides a standardized way to identify never-treated units in staggered adoption designs. This function is the single source of truth for never-treated identification across all modules.

is_never_treated() Function

from lwdid.validation import is_never_treated
import numpy as np
import pandas as pd

# Check individual values
is_never_treated(0)        # True - zero indicates never-treated
is_never_treated(np.inf)   # True - infinity indicates never-treated
is_never_treated(np.nan)   # True - NaN indicates never-treated
is_never_treated(None)     # True - None indicates never-treated
is_never_treated(pd.NA)    # True - pandas NA indicates never-treated
is_never_treated(2005)     # False - positive integer is treatment cohort
is_never_treated(-np.inf)  # Raises InvalidStaggeredDataError

Valid Never-Treated Encodings:

The following values are recognized as never-treated:

Zero (0 or 0.0): Common encoding in Stata and other software
Positive infinity (np.inf): Represents “treated at infinity” (never)
NaN/NA/None: Missing treatment time indicates never-treated
Near-zero values: Values within floating-point tolerance (\(|x| < 10^{-10}\))

Invalid Values:

Negative infinity (-np.inf): Raises InvalidStaggeredDataError
Negative numbers: Should be caught by validate_staggered_data()

Usage with DataFrames:

import pandas as pd
import numpy as np
from lwdid.validation import is_never_treated

# Create sample data
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5] * 3,
    'year': [2000, 2001, 2002] * 5,
    'y': np.random.randn(15),
    'gvar': [0, np.inf, np.nan, 2001, 2002] * 3
})

# Identify never-treated units
unit_gvar = data.groupby('id')['gvar'].first()
nt_mask = unit_gvar.apply(is_never_treated)

print(f"Never-treated units: {nt_mask.sum()}")  # Output: 3
print(f"NT unit IDs: {unit_gvar[nt_mask].index.tolist()}")  # [1, 2, 3]

Cross-Module Consistency:

The is_never_treated() function is used consistently across all modules:

lwdid.validation: Primary definition
lwdid.staggered.control_groups: Control group selection
lwdid.staggered.aggregation: Weight calculations
lwdid.staggered.randomization: Randomization inference

This ensures that never-treated identification is consistent throughout the estimation pipeline.

Staggered Adoption Validation

For staggered adoption designs (where units are treated at different times), additional validation checks are performed when the gvar parameter is specified instead of post.

Staggered-Specific Checks

Checks performed:

gvar column exists and is time-invariant within units
gvar values are valid:
- Positive integers indicate treatment cohorts (first treatment period)
- Values of 0, inf, or NaN indicate never-treated units
At least one treatment cohort exists
At least one control unit exists (never-treated or not-yet-treated, depending on the control_group strategy)
Each cohort has sufficient pre-treatment periods for the chosen transformation (demean requires \(g - 1 \geq 1\), detrend requires \(g - 1 \geq 2\))

Why it matters:

Time-varying gvar violates the staggered design assumption
Invalid gvar values prevent proper cohort identification
Insufficient pre-treatment periods make transformation impossible

Example errors (conceptual):

InvalidStaggeredDataError indicating that 'gvar' varies within unit.
The first treatment period must be constant across all observations
for a given unit.

NoNeverTreatedError indicating that no never-treated units exist
when control_group='never_treated' is specified.

Control Group Strategy Validation

The validation differs based on the chosen control group strategy:

never_treated:

Requires at least one unit with gvar equal to 0, inf, or NaN
These units serve as controls for all cohort-time effect estimations
Required when using aggregate='cohort' or aggregate='overall'

not_yet_treated:

Uses never-treated units plus units not yet treated at each calendar time
More flexible but requires the no-anticipation assumption to hold
For cohort \(g\) at time \(r\), valid controls include units with first treatment period \(h > r\)
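The difference between the two strategies can be illustrated with a unit-level gvar series. This is a sketch of the selection rule stated above, not the code in lwdid.staggered.control_groups; `control_mask` is a hypothetical helper and the never-treated coding (0, inf, NaN) follows this page.

```python
import numpy as np
import pandas as pd

unit_gvar = pd.Series({"a": 2003, "b": 2005, "c": 0, "d": np.inf, "e": 2004})


def control_mask(unit_gvar: pd.Series, r: int, strategy: str) -> pd.Series:
    """Controls at calendar time r under the two strategies (hypothetical helper)."""
    never = unit_gvar.isna() | (unit_gvar == 0) | (unit_gvar == np.inf)
    if strategy == "never_treated":
        return never
    # not_yet_treated: never-treated units plus units first treated after r.
    return never | (unit_gvar > r)


nt = control_mask(unit_gvar, r=2004, strategy="never_treated")
nyt = control_mask(unit_gvar, r=2004, strategy="not_yet_treated")
```

At r = 2004, unit "b" (first treated in 2005) is a valid control only under not_yet_treated, while units "a" and "e" are already treated under both strategies.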

Staggered Data Usage

For staggered designs, validation is performed internally by lwdid() when the gvar parameter is provided:

from lwdid import lwdid
import pandas as pd

data = pd.read_csv('staggered_data.csv')

# gvar indicates first treatment period; 0, inf, or NaN for never-treated
results = lwdid(
    data=data,
    y='outcome',
    ivar='unit',
    tvar='year',
    gvar='first_treat_year',  # First treatment period column
    rolling='demean',
    control_group='never_treated',  # aggregate='overall' requires never-treated units
    aggregate='overall'
)

Error: Invalid Cohort Values

Problem (conceptual):

InvalidStaggeredDataError indicating that 'gvar' contains invalid values.

Cause: gvar contains values that cannot be interpreted as valid cohort indicators (e.g., negative numbers, non-numeric values other than NaN).

Solution:

# Check gvar values
print(data['gvar'].unique())

# Ensure valid cohort values: positive integers for treated, 0/NaN for never-treated
# Convert never-treated indicator if needed
data['gvar'] = data['gvar'].replace({-1: 0, 'never': 0})

# Ensure numeric type
data['gvar'] = pd.to_numeric(data['gvar'], errors='coerce')

Error: No Never-Treated Units

Problem (conceptual):

NoNeverTreatedError indicating that control_group='never_treated' requires
at least one never-treated unit, but none were found.

Cause: All units are eventually treated, but control_group='never_treated' was specified.

Solution:

import numpy as np

# Check for never-treated units (gvar coded as 0, inf, or NaN)
never_treated = data[data['gvar'].isin([0, np.inf]) | data['gvar'].isna()]
print(f"Never-treated units: {never_treated['unit'].nunique()}")

# Option 1: Switch to not_yet_treated control group
results = lwdid(..., control_group='not_yet_treated')

# Option 2: Use 'never_treated' only if such units exist
# Note: aggregate='cohort' and aggregate='overall' require never-treated units

Error: Insufficient Pre-Treatment Periods for Cohort

Problem (conceptual):

InsufficientPrePeriodsError indicating that cohort g=2005 has insufficient
pre-treatment periods for detrend transformation (requires at least 2).

Cause: Some cohorts are treated too early in the panel, leaving insufficient pre-treatment periods for the chosen transformation.

Solution:

import numpy as np

# Check pre-treatment periods by cohort (skip never-treated codes 0, inf, NaN)
gvar = data['gvar']
cohorts = sorted(gvar[(gvar > 0) & np.isfinite(gvar)].unique())
min_year = data['year'].min()

pre_periods = {g: g - min_year for g in cohorts}
for g, n_pre in pre_periods.items():
    print(f"Cohort {g}: {n_pre} pre-treatment periods")

# Option 1: Use 'demean' instead of 'detrend' (requires only 1 pre-period)
results = lwdid(..., rolling='demean')

# Option 2: Exclude early cohorts (here: fewer than 2 pre-periods, as detrend requires)
early_cohorts = [g for g, n_pre in pre_periods.items() if n_pre < 2]
data = data[~data['gvar'].isin(early_cohorts)]

Common Validation Errors and Solutions

Error: Duplicate Observations

Problem (conceptual):

InvalidParameterError indicating duplicate (unit, time) observations.

Cause: Multiple rows with the same (unit, time) combination.

Solution:

# Check for duplicates
duplicates = data[data.duplicated(subset=['unit', 'year'], keep=False)]
print(duplicates)

# Remove duplicates (if appropriate)
data = data.drop_duplicates(subset=['unit', 'year'], keep='first')

# Or aggregate duplicates (numeric columns only; non-numeric columns are dropped)
data = data.groupby(['unit', 'year']).mean(numeric_only=True).reset_index()

Error: Non-Common Treatment Timing

Problem (conceptual):

InvalidParameterError indicating violation of the common timing assumption.

Cause: post varies across units in the same time period.

Solution:

# Check if post is time-based
post_by_time = data.groupby('year')['post'].nunique()
print(post_by_time[post_by_time > 1])  # Periods with varying post

# Create time-based post indicator
treatment_year = 2020
data['post'] = (data['year'] >= treatment_year).astype(int)

Error: Insufficient Pre-Treatment Periods

Problem (conceptual):

InsufficientPrePeriodsError indicating that some units lack enough
pre-treatment periods for the chosen transformation.

Cause: Some units have fewer pre-treatment periods than the chosen transformation requires, for example fewer than 2 for detrend or detrendq.

Solution:

# Check pre-treatment periods by unit
pre_periods = data[data['post'] == 0].groupby('unit').size()
print(pre_periods[pre_periods < 2])  # Units with < 2 pre-periods

# Option 1: Use 'demean' instead (requires T0 >= 1)
results = lwdid(..., rolling='demean')

# Option 2: Drop units with insufficient pre-treatment periods
units_to_keep = pre_periods[pre_periods >= 2].index
data = data[data['unit'].isin(units_to_keep)]

Error: Time-Varying Controls

Problem (conceptual):

InvalidParameterError indicating that control variable 'income' is
not time-invariant (constant within each unit).

Cause: Control variable changes over time for some units.

Solution:

# Check which controls vary
for control in ['income', 'population']:
    varying = data.groupby('unit')[control].nunique()
    print(f"{control} varies in {(varying > 1).sum()} units")

# Option 1: Use baseline (first period) value
baseline = data.groupby('unit')['income'].first().reset_index()
baseline.columns = ['unit', 'income_baseline']
data = data.drop('income', axis=1).merge(baseline, on='unit')

# Option 2: Use pre-treatment average
pre_avg = data[data['post'] == 0].groupby('unit')['income'].mean()
pre_avg = pre_avg.reset_index()
pre_avg.columns = ['unit', 'income_pre_avg']
data = data.drop('income', axis=1).merge(pre_avg, on='unit')

Error: Time Gaps

Problem (conceptual):

TimeDiscontinuityError indicating that the time index is not continuous.

Cause: Missing time periods in the data.

Solution:

# Check time sequence
time_values = sorted(data['year'].unique())
gaps = [time_values[i+1] - time_values[i]
        for i in range(len(time_values)-1) if time_values[i+1] - time_values[i] > 1]
print(f"Gaps found: {gaps}")

# Option 1: Fill gaps with missing values (if appropriate)
# Create complete panel
units = data['unit'].unique()
years = range(data['year'].min(), data['year'].max() + 1)
complete_index = pd.MultiIndex.from_product([units, years],
                                             names=['unit', 'year'])
data = data.set_index(['unit', 'year']).reindex(complete_index).reset_index()

# Option 2: Restrict to continuous sub-period
data = data[data['year'] >= 2015]  # Use only recent years

Best Practices

Pre-Validation Checks

Before running lwdid(), perform these checks:

import pandas as pd

# 1. Check for duplicates
assert not data.duplicated(subset=['unit', 'year']).any(), "Duplicates found"

# 2. Check time continuity
time_seq = sorted(data['year'].unique())
assert all(time_seq[i+1] - time_seq[i] == 1
           for i in range(len(time_seq)-1)), "Time gaps found"

# 3. Check post is time-based
assert data.groupby('year')['post'].nunique().max() == 1, "Post varies by unit"

# 4. Check controls are time-invariant
for control in ['x1', 'x2']:
    assert data.groupby('unit')[control].nunique().max() == 1, \
           f"{control} varies within units"

# 5. Check sample size
n_units = data['unit'].nunique()
assert n_units >= 3, f"Too few units: {n_units}"

print("All pre-validation checks passed!")

Data Preparation Checklist

Before using lwdid(), ensure:

Data is in long format (one row per unit-time observation)
No duplicate (unit, time) pairs
Time variable forms continuous sequence
post is binary (0/1) and time-based
d is time-invariant treatment group indicator
Control variables (if any) are time-invariant
No missing values in required variables
Sufficient pre-treatment periods for chosen transformation
At least \(N \geq 3\) units
