Validation Module (validation)
===============================

The validation module ensures data quality and assumption compliance before
estimation. It checks panel structure, treatment timing, and data requirements.

.. automodule:: lwdid.validation
   :members:
   :undoc-members:
   :show-inheritance:

Overview
--------

This module performs comprehensive validation of input data to ensure:

1. **Panel structure** is correct (unique unit-time pairs, continuous time)
2. **Treatment timing** follows the common timing assumption
3. **Sample size** meets minimum requirements
4. **Control variables** are time-invariant
5. **Data types** are appropriate

All validation functions raise informative exceptions when requirements are
violated, helping users identify and fix data issues quickly.

Validation Checks
-----------------

Panel Structure Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Checks performed:**

- No duplicate (unit, time) observations
- Time index forms a continuous sequence (no gaps)
- Sufficient units for estimation (:math:`N \geq 3`)
- At least one treated unit (d = 1) and one control unit (d = 0)

**Why it matters:**

- Duplicate observations indicate data errors
- Time gaps violate the continuous panel assumption
- Too few units make inference unreliable
- DiD estimation needs at least one treated and one control unit

**Example error (conceptual):**

.. code-block:: text

   InvalidParameterError indicating duplicate (unit, time) observations.
   Each (unit, time) combination must appear at most once.
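The panel-structure checks listed above can be approximated with plain pandas before calling ``lwdid()``. The sketch below is illustrative only: ``check_panel_structure`` is a hypothetical helper, not part of ``lwdid``, and it assumes an integer-valued time column and a treatment column where any non-zero value means treated.

```python
import pandas as pd


def check_panel_structure(data, ivar, tvar, d, min_units=3):
    """Return a list of human-readable problems found in a long-format panel.

    Hypothetical helper for illustration; not part of the lwdid API.
    """
    problems = []

    # 1. Each (unit, time) pair may appear at most once.
    n_dupes = int(data.duplicated(subset=[ivar, tvar]).sum())
    if n_dupes:
        problems.append(f"{n_dupes} duplicate ({ivar}, {tvar}) rows")

    # 2. The observed time values should form a gap-free integer sequence.
    times = sorted(data[tvar].unique())
    if any(b - a != 1 for a, b in zip(times, times[1:])):
        problems.append(f"gaps in {tvar}")

    # 3. Enough units for inference.
    n_units = data[ivar].nunique()
    if n_units < min_units:
        problems.append(f"only {n_units} units (need >= {min_units})")

    # 4. At least one treated and one control unit (non-zero d = treated).
    treated = (data[d] != 0).groupby(data[ivar]).any()
    if not treated.any():
        problems.append("no treated units")
    if treated.all():
        problems.append("no control units")

    return problems


panel = pd.DataFrame({
    "unit": [1, 1, 2, 2, 3, 3, 3],
    "year": [2000, 2002, 2000, 2002, 2000, 2002, 2002],
    "treated": [1, 1, 0, 0, 0, 0, 0],
})
print(check_panel_structure(panel, "unit", "year", "treated"))
# ['1 duplicate (unit, year) rows', 'gaps in year']
```

Collecting all problems in a list, rather than raising on the first one, makes it easier to repair several data issues in a single pass.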
Treatment Timing Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Checks performed:**

- ``post`` indicator is binarized internally as 0/1 (non-zero values are treated as 1)
- ``post`` is the same for all units in each time period (common timing)
- ``post`` is monotone (no treatment reversals)
- At least one pre-treatment period (``post == 0``) and one post-treatment
  period (``post != 0``) exist

**Why it matters:**

- Common timing is a core assumption of the method
- Treatment reversals violate the persistence assumption
- DiD needs both pre- and post-treatment periods

**Example errors (conceptual):**

.. code-block:: text

   InvalidParameterError indicating that the common timing assumption is
   violated because 'post' varies across units within the same period.

.. code-block:: text

   TimeDiscontinuityError indicating that 'post' is not monotone in time
   (treatment reversals or suspensions).

Pre-Treatment Period Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Checks performed:**

- Each unit has sufficient pre-treatment periods for the chosen transformation:

  - ``demean``: at least 1 pre-treatment observation per unit
  - ``detrend``: at least 2 pre-treatment observations per unit
  - ``demeanq``: at least 1 pre-treatment observation per unit, and enough
    pre-period observations to estimate quarterly fixed effects (number of
    pre-period observations >= number of distinct pre-period quarters + 1)
  - ``detrendq``: at least 2 pre-treatment observations per unit, and enough
    pre-period observations to estimate a linear trend plus quarterly fixed
    effects (number of pre-period observations >= 1 + number of distinct
    pre-period quarters)

**Why it matters:**

- Demean requires at least 1 pre-treatment observation per unit to compute the
  pre-treatment mean
- Detrend requires at least 2 pre-treatment observations per unit to estimate a
  linear trend
- Quarterly methods additionally require sufficient pre-treatment observations
  within each unit to estimate quarterly fixed effects without rank deficiency

**Example error (conceptual):**

.. code-block:: text

   InsufficientPrePeriodsError indicating that some units have fewer
   pre-treatment periods than required by the chosen transformation.

Control Variables Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Checks performed:**

- Control variables exist in the data
- Controls are time-invariant (constant within each unit)
- Control variables are numeric; missing values (if any) are handled at the
  estimation stage rather than in the validation step

**Why it matters:**

- Time-varying controls can be endogenous to treatment
- Missing controls can lead to dropped observations when controls are included
  in the regression

**Example error (conceptual):**

.. code-block:: text

   InvalidParameterError indicating that control variable 'income' is not
   time-invariant within units. For example, 'income' varies within unit
   'unit_42'.

Data Type Validation
~~~~~~~~~~~~~~~~~~~~~

**Checks performed:**

- Outcome variable is numeric
- Treatment indicator can be converted to numeric and is interpreted as 0 vs
  non-zero (non-zero values are treated as 1)
- Unit and time identifiers are present
- Rows with missing values in required variables (outcome, treatment, unit
  identifier, time variable(s), or post) are dropped with a warning

**Why it matters:**

- Non-numeric outcomes cannot be used in regression
- Missing values in key variables change the effective sample after dropping
  affected rows

**Example error (conceptual):**

.. code-block:: text

   InvalidParameterError indicating that outcome variable 'y' is not numeric.

Validation Functions
--------------------

In practice, structural validation of panel layout, treatment timing, control
variables, and data types is performed internally by
``validate_and_prepare_data()``, which is called at the beginning of
``lwdid()``. Additional pre-treatment period requirements for each ``rolling``
method and quarterly coverage checks are enforced in the transformation step
(see :mod:`lwdid.transformations`).
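To see ahead of time which units would fail the per-method pre-treatment minimums listed under Pre-Treatment Period Validation, a small pandas sketch along the following lines may help. ``MIN_PRE_PERIODS`` and ``units_lacking_pre_periods`` are hypothetical names chosen for illustration, not ``lwdid`` functions, and the extra quarterly-coverage condition for ``demeanq``/``detrendq`` is deliberately left out.

```python
import pandas as pd

# Per-unit minimum number of pre-treatment observations, taken from the
# requirements listed in this document. The additional quarterly-coverage
# condition for 'demeanq'/'detrendq' is NOT checked in this sketch.
MIN_PRE_PERIODS = {"demean": 1, "detrend": 2, "demeanq": 1, "detrendq": 2}


def units_lacking_pre_periods(data, ivar, post, rolling):
    """Return pre-period counts for units below the minimum for `rolling`.

    Hypothetical helper for illustration; not part of the lwdid API.
    """
    required = MIN_PRE_PERIODS[rolling]
    all_units = data[ivar].unique()
    # Count rows with post == 0 per unit; units with none get a count of 0.
    counts = (
        data.loc[data[post] == 0]
        .groupby(ivar)
        .size()
        .reindex(all_units, fill_value=0)
    )
    return counts[counts < required]


panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 3],
    "year": [2000, 2001, 2002, 2001, 2002, 2002],
    "post": [0, 0, 1, 0, 1, 1],
})
print(units_lacking_pre_periods(panel, "unit", "post", "detrend").to_dict())
# {2: 1, 3: 0}
print(units_lacking_pre_periods(panel, "unit", "post", "demean").to_dict())
# {3: 0}
```

Reindexing over all units matters here: a unit with no pre-treatment rows at all would otherwise be missing from the grouped counts and silently pass the check.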
The earlier sections (panel structure, treatment timing, pre-treatment periods,
controls, data types) describe **conceptual checks** rather than public helper
functions. The pandas-based snippets in the following sections illustrate how to
diagnose and fix common problems yourself before calling ``lwdid()``, but there
are no separate public functions named ``validate_panel_structure``,
``validate_treatment_timing``, or ``validate_controls`` in the current
implementation.

validate_and_prepare_data()
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from lwdid.validation import validate_and_prepare_data
   import pandas as pd

   data = pd.read_csv('panel_data.csv')

   data_clean, metadata = validate_and_prepare_data(
       data=data,
       y='outcome',
       d='treated',
       ivar='unit',
       tvar='year',  # or ['year', 'quarter'] for quarterly data
       post='post',
       rolling='demean',  # required rolling method: 'demean', 'detrend', 'demeanq', or 'detrendq'
       controls=['x1', 'x2']  # optional time-invariant controls
   )

   print(metadata['N'], metadata['T'], metadata['K'])

Quarterly Helper Checks
~~~~~~~~~~~~~~~~~~~~~~~

For quarterly data, the module also provides helper functions such as
``validate_quarter_coverage`` that can be used in advanced workflows to
pre-check seasonal coverage requirements for ``demeanq``/``detrendq``. In
typical usage these helpers are called indirectly by :func:`lwdid.lwdid` via the
transformation module rather than being used directly.

Never-Treated Unit Identification
----------------------------------

The ``is_never_treated()`` function provides a standardized way to identify
never-treated units in staggered adoption designs. This function is the single
source of truth for never-treated identification across all modules.

is_never_treated() Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from lwdid.validation import is_never_treated
   import numpy as np
   import pandas as pd

   # Check individual values
   is_never_treated(0)        # True - zero indicates never-treated
   is_never_treated(np.inf)   # True - infinity indicates never-treated
   is_never_treated(np.nan)   # True - NaN indicates never-treated
   is_never_treated(None)     # True - None indicates never-treated
   is_never_treated(pd.NA)    # True - pandas NA indicates never-treated
   is_never_treated(2005)     # False - positive integer is treatment cohort
   is_never_treated(-np.inf)  # Raises InvalidStaggeredDataError

**Valid Never-Treated Encodings:**

The following values are recognized as never-treated:

1. **Zero (0 or 0.0)**: Common encoding in Stata and other software
2. **Positive infinity (np.inf)**: Represents "treated at infinity" (never)
3. **NaN/NA/None**: Missing treatment time indicates never-treated
4. **Near-zero values**: Values within floating-point tolerance
   (:math:`|x| < 10^{-10}`)

**Invalid Values:**

- **Negative infinity (-np.inf)**: Raises ``InvalidStaggeredDataError``
- **Negative numbers**: Should be caught by ``validate_staggered_data()``

**Usage with DataFrames:**

.. code-block:: python

   import pandas as pd
   import numpy as np
   from lwdid.validation import is_never_treated

   # Create sample data
   data = pd.DataFrame({
       'id': [1, 2, 3, 4, 5] * 3,
       'year': [2000, 2001, 2002] * 5,
       'y': np.random.randn(15),
       'gvar': [0, np.inf, np.nan, 2001, 2002] * 3
   })

   # Identify never-treated units
   unit_gvar = data.groupby('id')['gvar'].first()
   nt_mask = unit_gvar.apply(is_never_treated)

   print(f"Never-treated units: {nt_mask.sum()}")  # Output: 3
   print(f"NT unit IDs: {unit_gvar[nt_mask].index.tolist()}")  # [1, 2, 3]

**Cross-Module Consistency:**

The ``is_never_treated()`` function is used consistently across all modules:

- ``lwdid.validation``: Primary definition
- ``lwdid.staggered.control_groups``: Control group selection
- ``lwdid.staggered.aggregation``: Weight calculations
- ``lwdid.staggered.randomization``: Randomization inference

This ensures that never-treated identification is consistent throughout the
estimation pipeline.

Staggered Adoption Validation
-----------------------------

For staggered adoption designs (where units are treated at different times),
additional validation checks are performed when the ``gvar`` parameter is
specified instead of ``post``.
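As a rough illustration of what checks on ``gvar`` involve, the sketch below classifies units using the never-treated encodings described earlier (0, ``inf``, ``NaN``, or :math:`|x| < 10^{-10}`) without importing ``lwdid``. ``summarize_gvar`` is a hypothetical helper written for this page; it is an approximation under those stated encodings and may differ from what ``lwdid`` does internally.

```python
import numpy as np
import pandas as pd


def summarize_gvar(data, ivar, gvar):
    """Classify units by first-treatment period.

    Hypothetical helper for illustration; not part of the lwdid API.
    """
    # gvar must be constant within each unit (NaN counts as one value).
    n_values = data.groupby(ivar)[gvar].nunique(dropna=False)
    varying = n_values[n_values > 1].index.tolist()

    # One value per unit; units whose gvar is all-NaN stay NaN.
    g = data.groupby(ivar)[gvar].first()

    # Never-treated encodings: NaN, +inf, zero (within tolerance).
    never = g.isna() | (g == np.inf) | (g.abs() < 1e-10)
    # Negative values (including -inf) are not valid cohorts.
    invalid = ~never & (g < 0)
    cohorts = g[~never & ~invalid]

    return {
        "varying_units": varying,
        "never_treated": g[never].index.tolist(),
        "invalid_units": g[invalid].index.tolist(),
        "cohort_sizes": cohorts.value_counts().sort_index().to_dict(),
    }


data = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "year": [2000, 2001] * 6,
    "gvar": [0, 0, np.inf, np.inf, np.nan, np.nan,
             2001, 2001, 2001, 2002, -1, -1],
})
summary = summarize_gvar(data, "id", "gvar")
print(summary["varying_units"])   # [5]
print(summary["never_treated"])   # [1, 2, 3]
print(summary["invalid_units"])   # [6]
```

Here unit 5 is flagged because its ``gvar`` changes between rows, and unit 6 because a negative first-treatment period cannot be interpreted as a cohort.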
Staggered-Specific Checks
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Checks performed:**

- ``gvar`` column exists and is time-invariant within units
- ``gvar`` values are valid:

  - Positive integers indicate treatment cohorts (first treatment period)
  - Values of 0, ``inf``, or ``NaN`` indicate never-treated units

- At least one treatment cohort exists
- At least one control unit exists (never-treated or not-yet-treated, depending
  on the ``control_group`` strategy)
- Each cohort has sufficient pre-treatment periods for the chosen transformation
  (``demean`` requires :math:`g - 1 \geq 1`, ``detrend`` requires
  :math:`g - 1 \geq 2`)

**Why it matters:**

- Time-varying ``gvar`` violates the staggered design assumption
- Invalid ``gvar`` values prevent proper cohort identification
- Insufficient pre-treatment periods make transformation impossible

**Example errors (conceptual):**

.. code-block:: text

   InvalidStaggeredDataError indicating that 'gvar' varies within unit.
   The first treatment period must be constant across all observations
   for a given unit.

.. code-block:: text

   NoNeverTreatedError indicating that no never-treated units exist
   when control_group='never_treated' is specified.

Control Group Strategy Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The validation differs based on the chosen control group strategy:

**never_treated:**

- Requires at least one unit with ``gvar`` equal to 0, ``inf``, or ``NaN``
- These units serve as controls for all cohort-time effect estimations
- Required when using ``aggregate='cohort'`` or ``aggregate='overall'``

**not_yet_treated:**

- Uses never-treated units plus units not yet treated at each calendar time
- More flexible but requires the no-anticipation assumption to hold
- For cohort :math:`g` at time :math:`r`, valid controls include units with
  first treatment period :math:`h > r`

Staggered Data Usage
~~~~~~~~~~~~~~~~~~~~

For staggered designs, validation is performed internally by ``lwdid()`` when
the ``gvar`` parameter is provided:

.. code-block:: python

   from lwdid import lwdid
   import pandas as pd

   data = pd.read_csv('staggered_data.csv')

   # gvar indicates first treatment period; 0, inf, or NaN for never-treated
   results = lwdid(
       data=data,
       y='outcome',
       ivar='unit',
       tvar='year',
       gvar='first_treat_year',  # First treatment period column
       rolling='demean',
       control_group='not_yet_treated',
       aggregate='overall'
   )

Error: Invalid Cohort Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   InvalidStaggeredDataError indicating that 'gvar' contains invalid values.

**Cause:** ``gvar`` contains values that cannot be interpreted as valid cohort
indicators (e.g., negative numbers, non-numeric values other than ``NaN``).

**Solution:**

.. code-block:: python

   # Check gvar values
   print(data['gvar'].unique())

   # Ensure valid cohort values: positive integers for treated, 0/NaN for never-treated
   # Convert never-treated indicator if needed
   data['gvar'] = data['gvar'].replace({-1: 0, 'never': 0})

   # Ensure numeric type
   data['gvar'] = pd.to_numeric(data['gvar'], errors='coerce')

Error: No Never-Treated Units
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   NoNeverTreatedError indicating that control_group='never_treated' requires
   at least one never-treated unit, but none were found.

**Cause:** All units are eventually treated, but
``control_group='never_treated'`` was specified.

**Solution:**

.. code-block:: python

   import numpy as np

   # Check for never-treated units
   never_treated = data[data['gvar'].isin([0, np.inf]) | data['gvar'].isna()]
   print(f"Never-treated units: {never_treated['unit'].nunique()}")

   # Option 1: Switch to not_yet_treated control group
   # (the ... stands for your other lwdid() arguments)
   results = lwdid(..., control_group='not_yet_treated')

   # Option 2: Use 'never_treated' only if such units exist
   # Note: aggregate='cohort' and aggregate='overall' require never-treated units

Error: Insufficient Pre-Treatment Periods for Cohort
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   InsufficientPrePeriodsError indicating that cohort g=2005 has insufficient
   pre-treatment periods for detrend transformation (requires at least 2).

**Cause:** Some cohorts are treated too early in the panel, leaving insufficient
pre-treatment periods for the chosen transformation.

**Solution:**

.. code-block:: python

   import numpy as np

   # Check pre-treatment periods by cohort
   # (exclude never-treated encodings: 0, inf, NaN)
   is_cohort = (data['gvar'] > 0) & np.isfinite(data['gvar'])
   cohorts = data.loc[is_cohort, 'gvar'].unique()
   min_year = data['year'].min()

   for g in sorted(cohorts):
       pre_periods = g - min_year
       print(f"Cohort {g}: {pre_periods} pre-treatment periods")

   # Option 1: Use 'demean' instead of 'detrend' (requires only 1 pre-period)
   results = lwdid(..., rolling='demean')

   # Option 2: Exclude early cohorts (here: fewer than 2 pre-periods)
   early_cohorts = [g for g in cohorts if g - min_year < 2]
   data = data[~data['gvar'].isin(early_cohorts)]

Common Validation Errors and Solutions
---------------------------------------

Error: Duplicate Observations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   InvalidParameterError indicating duplicate (unit, time) observations.

**Cause:** Multiple rows with the same (unit, time) combination.

**Solution:**

.. code-block:: python

   # Check for duplicates
   duplicates = data[data.duplicated(subset=['unit', 'year'], keep=False)]
   print(duplicates)

   # Remove duplicates (if appropriate)
   data = data.drop_duplicates(subset=['unit', 'year'], keep='first')

   # Or aggregate duplicates (numeric columns only)
   data = data.groupby(['unit', 'year']).mean(numeric_only=True).reset_index()

Error: Non-Common Treatment Timing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   InvalidParameterError indicating violation of the common timing assumption.

**Cause:** ``post`` varies across units in the same time period.

**Solution:**

.. code-block:: python

   # Check if post is time-based
   post_by_time = data.groupby('year')['post'].nunique()
   print(post_by_time[post_by_time > 1])  # Periods with varying post

   # Create time-based post indicator
   treatment_year = 2020
   data['post'] = (data['year'] >= treatment_year).astype(int)

Error: Insufficient Pre-Treatment Periods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   InsufficientPrePeriodsError indicating that some units lack enough
   pre-treatment periods for the chosen transformation.

**Cause:** Some units have fewer pre-treatment periods than the chosen
transformation requires. For example, ``detrend``/``detrendq`` need at least two
pre-treatment periods per unit.

**Solution:**

.. code-block:: python

   # Check pre-treatment periods by unit
   pre_periods = data[data['post'] == 0].groupby('unit').size()
   print(pre_periods[pre_periods < 2])  # Units with < 2 pre-periods

   # Option 1: Use 'demean' instead (requires T0 >= 1)
   results = lwdid(..., rolling='demean')

   # Option 2: Drop units with insufficient pre-treatment periods
   units_to_keep = pre_periods[pre_periods >= 2].index
   data = data[data['unit'].isin(units_to_keep)]

Error: Time-Varying Controls
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   InvalidParameterError indicating that control variable 'income' is not
   time-invariant (constant within each unit).

**Cause:** Control variable changes over time for some units.

**Solution:**

.. code-block:: python

   # Check which controls vary
   for control in ['income', 'population']:
       varying = data.groupby('unit')[control].nunique()
       print(f"{control} varies in {(varying > 1).sum()} units")

   # Option 1: Use baseline (first period) value
   baseline = data.groupby('unit')['income'].first().reset_index()
   baseline.columns = ['unit', 'income_baseline']
   data = data.drop('income', axis=1).merge(baseline, on='unit')

   # Option 2 (alternative to Option 1): Use pre-treatment average
   pre_avg = data[data['post'] == 0].groupby('unit')['income'].mean()
   pre_avg = pre_avg.reset_index()
   pre_avg.columns = ['unit', 'income_pre_avg']
   data = data.drop('income', axis=1).merge(pre_avg, on='unit')

Error: Time Gaps
~~~~~~~~~~~~~~~~

**Problem (conceptual):**

.. code-block:: text

   TimeDiscontinuityError indicating that the time index is not continuous.

**Cause:** Missing time periods in the data.

**Solution:**

.. code-block:: python

   # Check time sequence
   time_values = sorted(data['year'].unique())
   gaps = [time_values[i+1] - time_values[i]
           for i in range(len(time_values)-1)
           if time_values[i+1] - time_values[i] > 1]
   print(f"Gaps found: {gaps}")

   # Option 1: Fill gaps with missing values (if appropriate)
   # Create complete panel
   units = data['unit'].unique()
   years = range(data['year'].min(), data['year'].max() + 1)
   complete_index = pd.MultiIndex.from_product([units, years],
                                               names=['unit', 'year'])
   data = data.set_index(['unit', 'year']).reindex(complete_index).reset_index()

   # Option 2: Restrict to continuous sub-period
   data = data[data['year'] >= 2015]  # Use only recent years

Best Practices
--------------

Pre-Validation Checks
~~~~~~~~~~~~~~~~~~~~~~

Before running ``lwdid()``, perform these checks:

.. code-block:: python

   import pandas as pd

   # 1. Check for duplicates
   assert not data.duplicated(subset=['unit', 'year']).any(), "Duplicates found"

   # 2. Check time continuity
   time_seq = sorted(data['year'].unique())
   assert all(time_seq[i+1] - time_seq[i] == 1
              for i in range(len(time_seq)-1)), "Time gaps found"

   # 3. Check post is time-based
   assert data.groupby('year')['post'].nunique().max() == 1, "Post varies by unit"

   # 4. Check controls are time-invariant
   for control in ['x1', 'x2']:
       assert data.groupby('unit')[control].nunique().max() == 1, \
           f"{control} varies within units"

   # 5. Check sample size
   n_units = data['unit'].nunique()
   assert n_units >= 3, f"Too few units: {n_units}"

   print("All pre-validation checks passed!")

Data Preparation Checklist
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before using ``lwdid()``, ensure:

1. Data is in long format (one row per unit-time observation)
2. No duplicate (unit, time) pairs
3. Time variable forms a continuous sequence
4. ``post`` is binary (0/1) and time-based
5. ``d`` is a time-invariant treatment group indicator
6. Control variables (if any) are time-invariant
7. No missing values in required variables
8. Sufficient pre-treatment periods for the chosen transformation
9. At least 3 units (:math:`N \geq 3`)

See Also
--------

- :func:`lwdid.lwdid` - Main estimation function
- :doc:`../user_guide` - Comprehensive usage guide
- :doc:`exceptions` - Exception classes raised by validation