Preprocessing Module (preprocessing)

The preprocessing module provides functionality to aggregate repeated cross-sectional data to panel format for use with lwdid estimation methods.

Preprocessing utilities for repeated cross-sectional data.

The aggregation follows the methodology described in Lee & Wooldridge (2026).

Key Functions

aggregate_to_panel: Aggregate repeated cross-sectional data to panel format.

Key Classes

AggregationResult: Container for aggregation results and metadata.
CellStatistics: Statistics for individual (unit, period) cells.

Example

>>> from lwdid.preprocessing import aggregate_to_panel
>>> result = aggregate_to_panel(
...     data=survey_data,
...     unit_var='state',
...     time_var='year',
...     outcome_var='income',
...     weight_var='survey_weight',
...     treatment_var='treated',
... )
>>> panel_data = result.panel_data
>>> print(result.summary())

Overview

When treatment is assigned at the unit level (e.g., state, county) but data is collected at a lower level (e.g., individuals, firms), it is common to aggregate outcomes to the unit-by-period level before applying DiD methods.

This module implements the aggregation methodology described in Lee and Wooldridge (2026), using the weighted average formula:

\[\bar{Y}_{st} = \sum_{i \in (s,t)} w_{ist} Y_{ist}\]

where the weights \(w_{ist}\) are normalized within each (unit, period) cell to sum to one.
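As a minimal sketch of the formula above (not the library's implementation), the following pure-Python function groups observations into (unit, period) cells and returns the weighted mean of each cell. The function name `weighted_cell_means` and the tuple layout of the input rows are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_cell_means(rows):
    """Return {(unit, period): weighted mean} for rows of (unit, period, y, w).

    Hypothetical helper for illustration; not part of lwdid.
    """
    sum_wy = defaultdict(float)  # running sum of w * y per cell
    sum_w = defaultdict(float)   # running sum of w per cell
    for unit, period, y, w in rows:
        sum_wy[(unit, period)] += w * y
        sum_w[(unit, period)] += w
    # Dividing by the cell's total weight is equivalent to normalizing
    # the weights to sum to one before averaging.
    return {cell: sum_wy[cell] / sum_w[cell] for cell in sum_wy}


rows = [
    ("A", 2000, 10.0, 1.0),
    ("A", 2000, 20.0, 3.0),
    ("B", 2000, 5.0, 2.0),
]
means = weighted_cell_means(rows)
print(means[("A", 2000)])  # 17.5
```

Cell ("A", 2000) has total weight 4, so its mean is (1 * 10 + 3 * 20) / 4 = 17.5.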

Main Functions

aggregate_to_panel

lwdid.preprocessing.aggregate_to_panel(data, unit_var, time_var, outcome_var, *, weight_var=None, controls=None, treatment_var=None, gvar=None, frequency='annual', min_cell_size=1, compute_variance=False)[source]

Aggregate repeated cross-sectional data to panel format.

This function aggregates lower-level repeated cross-sectional data (e.g., individuals, counties) to the unit-by-period level (e.g., state-year) using weighted means. The aggregation follows Lee & Wooldridge (2026), Section 3.

Formula: Y_bar_st = sum_{i in (s,t)} w_ist * Y_ist, where sum w_ist = 1

Parameters:

data (pd.DataFrame) – Repeated cross-sectional data in long format.
unit_var (str) – Column name for aggregation unit (e.g., ‘state’).
time_var (str or list of str) – Time variable(s). Single string for annual data, list of [year, quarter/month/week] for high-frequency data.
outcome_var (str) – Outcome variable column name.
weight_var (str, optional) – Survey weight column name. If None, uses equal weights (1/n_st).
controls (list of str, optional) – Control variable column names to aggregate.
treatment_var (str, optional) – Treatment indicator column name. Must be constant within each cell.
gvar (str, optional) – Treatment timing variable. Must be constant within each unit.
frequency ({'annual', 'quarterly', 'monthly', 'weekly'}, default='annual') – Aggregation frequency.
min_cell_size (int, default=1) – Minimum observations per cell. Cells below threshold are excluded.
compute_variance (bool, default=False) – Whether to compute within-cell variance estimates.

Returns:

Container with aggregated panel data and metadata.

Return type:

AggregationResult

Raises:

TypeError – If data is not a pandas DataFrame.
ValueError – If input data is empty or outcome is not numeric.
MissingRequiredColumnError – If required columns are missing.
InvalidAggregationError – If treatment varies within cells or gvar varies within units.
InsufficientCellSizeError – If all cells are below min_cell_size threshold.

Examples

>>> import pandas as pd
>>> from lwdid.preprocessing import aggregate_to_panel
>>> # Create sample repeated cross-section data
>>> data = pd.DataFrame({
...     'state': ['CA', 'CA', 'CA', 'TX', 'TX', 'TX'],
...     'year': [2000, 2000, 2001, 2000, 2001, 2001],
...     'income': [50000, 55000, 60000, 45000, 48000, 52000],
...     'weight': [1.0, 1.2, 0.8, 1.0, 1.1, 0.9],
... })
>>> result = aggregate_to_panel(
...     data, 'state', 'year', 'income', weight_var='weight'
... )
>>> print(result.panel_data)
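For reference, the weighted cell means implied by the formula can be computed directly with pandas on the same sample data. This is only a cross-check of the arithmetic under the stated formula; it does not reproduce the column layout of `result.panel_data`, and the intermediate column name `wy` is an illustrative choice.

```python
import pandas as pd

data = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "year": [2000, 2000, 2001, 2000, 2001, 2001],
    "income": [50000, 55000, 60000, 45000, 48000, 52000],
    "weight": [1.0, 1.2, 0.8, 1.0, 1.1, 0.9],
})

# Weighted mean per (state, year) cell: sum(w * y) / sum(w).
cells = (
    data.assign(wy=data["income"] * data["weight"])
    .groupby(["state", "year"], as_index=False)[["wy", "weight"]]
    .sum()
)
cells["income"] = cells["wy"] / cells["weight"]
expected = cells[["state", "year", "income"]]
print(expected)
```

By hand, the CA-2000 cell is (1.0 * 50000 + 1.2 * 55000) / 2.2, roughly 52727.27, and the TX-2001 cell is (1.1 * 48000 + 0.9 * 52000) / 2.0 = 49800. The two single-observation cells keep their original values.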

Result Classes

AggregationResult

class lwdid.preprocessing.AggregationResult(panel_data, n_original_obs, n_cells, n_units, n_periods, cell_stats, min_cell_size, max_cell_size, mean_cell_size, median_cell_size, unit_var, time_var, outcome_var, weight_var, frequency, n_excluded_cells=0, excluded_cells_info=<factory>)[source]

Container for aggregation results and metadata.

This class holds the aggregated panel data along with comprehensive metadata about the aggregation process, including cell statistics and configuration parameters.

Variables:

panel_data (pd.DataFrame) – Aggregated panel data with one row per (unit, period) combination.
n_original_obs (int) – Total number of observations in the original data.
n_cells (int) – Number of (unit, period) cells in the output.
n_units (int) – Number of unique units in the output.
n_periods (int) – Number of unique periods in the output.
cell_stats (pd.DataFrame) – DataFrame with statistics for each cell.
min_cell_size (int) – Minimum cell size across all cells.
max_cell_size (int) – Maximum cell size across all cells.
mean_cell_size (float) – Mean cell size across all cells.
median_cell_size (float) – Median cell size across all cells.
unit_var (str) – Name of the unit variable column.
time_var (str or list of str) – Name(s) of the time variable column(s).
outcome_var (str) – Name of the outcome variable column.
weight_var (str or None) – Name of the weight variable column (None if equal weights).
frequency (str) – Aggregation frequency (‘annual’, ‘quarterly’, ‘monthly’, ‘weekly’).
n_excluded_cells (int) – Number of cells excluded due to min_cell_size or all-NaN outcomes.
excluded_cells_info (list of dict) – Information about excluded cells.

panel_data: DataFrame

n_original_obs: int

n_cells: int

n_units: int

n_periods: int

cell_stats: DataFrame

min_cell_size: int

max_cell_size: int

mean_cell_size: float

median_cell_size: float

unit_var: str

time_var: str | list[str]

outcome_var: str

weight_var: str | None

frequency: str

n_excluded_cells: int = 0

excluded_cells_info: list

summary()[source]

Return formatted summary of aggregation.

Returns: Multi-line string with aggregation statistics.
Return type: str

Examples

>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> print(result.summary())
Aggregation Summary
===================
Original observations: 10000
Output cells: 150
Units: 50
Periods: 3
...

to_dict()[source]

Return aggregation parameters as dictionary.

Returns: Dictionary containing all aggregation parameters and statistics.
Return type: dict

Examples

>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> params = result.to_dict()
>>> params['n_units']
50

to_csv(path, include_metadata=True)[source]

Export panel data to CSV with optional metadata header.

Parameters:

path (str) – Output file path.
include_metadata (bool, default=True) – Whether to include aggregation metadata as header comments.

Return type:

None

Examples

>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> result.to_csv('aggregated_panel.csv')

CellStatistics

class lwdid.preprocessing.CellStatistics(unit, period, n_obs, outcome_mean, outcome_variance=None, effective_sample_size=None, weight_type='equal')[source]

Statistics for a single (unit, period) cell.

Variables:

unit (Any) – Unit identifier value.
period (Any) – Period identifier value (year, or tuple of year/quarter/month/week).
n_obs (int) – Number of observations in the cell.
outcome_mean (float) – Weighted mean of the outcome variable.
outcome_variance (float or None) – Weighted variance of the outcome (None if not computed or n_obs == 1).
effective_sample_size (float or None) – Effective sample size when survey weights are used.
weight_type ({'equal', 'survey'}) – Type of weights used for aggregation.

unit: Any

period: Any

n_obs: int

outcome_mean: float

outcome_variance: float | None = None

effective_sample_size: float | None = None

weight_type: Literal['equal', 'survey'] = 'equal'
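The reference above does not state how `effective_sample_size` is defined. A common choice for survey weights is Kish's approximation, (sum of weights)^2 / (sum of squared weights). The sketch below assumes that definition purely for illustration; lwdid may compute the quantity differently, and the function name is hypothetical.

```python
def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of w)^2 / (sum of w^2).

    Assumed definition for illustration; not taken from lwdid.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


print(kish_effective_sample_size([1.0, 1.0, 1.0, 1.0]))  # 4.0
print(kish_effective_sample_size([3.0, 1.0]))            # 1.6
```

With equal weights the effective sample size equals the number of observations. Unequal weights reduce it below the raw count.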

Usage Examples

Basic Aggregation

Aggregate individual-level survey data to state-year panel:

from lwdid.preprocessing import aggregate_to_panel

# Aggregate to state-year panel
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    treatment_var='treated'
)

# Access the aggregated panel data
panel_data = result.panel_data

# View summary statistics
print(result.summary())

Weighted Aggregation

Use survey weights for proper aggregation:

result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    weight_var='survey_weight',
    treatment_var='treated',
    gvar='first_treat_year'
)

# Inspect cell sizes (number of observations per cell)
print(f"Min cell size: {result.min_cell_size}")
print(f"Mean cell size: {result.mean_cell_size:.1f}")

High-Frequency Aggregation

Aggregate to quarterly or monthly panels:

# Quarterly aggregation
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var=['year', 'quarter'],
    outcome_var='income',
    frequency='quarterly'
)

# Monthly aggregation
result = aggregate_to_panel(
    data=survey_data,
    unit_var='county',
    time_var=['year', 'month'],
    outcome_var='employment',
    frequency='monthly'
)
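The calls above expect separate columns for the year and the sub-annual period. If the micro-data only carries a date, those columns can be derived first. The sketch below assumes a hypothetical `interview_date` column and uses the pandas `.dt` accessor.

```python
import pandas as pd

# Hypothetical micro-data with one interview date per respondent.
survey_data = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "interview_date": pd.to_datetime(
        ["2020-02-15", "2020-05-01", "2020-11-30"]
    ),
    "income": [50000, 52000, 47000],
})

dates = survey_data["interview_date"]
survey_data["year"] = dates.dt.year
survey_data["quarter"] = dates.dt.quarter  # values 1 to 4
survey_data["month"] = dates.dt.month      # values 1 to 12

print(survey_data[["state", "year", "quarter", "month"]])
```

The resulting `year` and `quarter` (or `month`) columns can then be passed as `time_var=['year', 'quarter']` or `time_var=['year', 'month']`.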

Aggregating Control Variables

Include time-invariant controls in the aggregation:

result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    controls=['population', 'median_age', 'urban_pct'],
    treatment_var='treated'
)

# Control variables are included in the output
print(result.panel_data.columns.tolist())

Minimum Cell Size Requirement

Exclude cells with insufficient observations:

result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    min_cell_size=30,  # Require at least 30 observations per cell
    compute_variance=True
)

# Check excluded cells
print(f"Excluded {result.n_excluded_cells} cells")
for info in result.excluded_cells_info:
    print(f"  {info}")
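To make the exclusion rule concrete, the following sketch counts observations per (unit, period) cell and separates the cells that meet the threshold from those that fall below it. It illustrates the rule as documented for `min_cell_size`, not the library's internal code; the helper name is hypothetical.

```python
from collections import Counter


def split_cells_by_size(cell_keys, min_cell_size):
    """Split cells into kept and excluded by observation count.

    cell_keys holds one (unit, period) key per observation.
    Hypothetical helper for illustration; not part of lwdid.
    """
    counts = Counter(cell_keys)
    kept = {cell: n for cell, n in counts.items() if n >= min_cell_size}
    excluded = {cell: n for cell, n in counts.items() if n < min_cell_size}
    return kept, excluded


keys = [("CA", 2000)] * 3 + [("CA", 2001)] * 1 + [("TX", 2000)] * 2
kept, excluded = split_cells_by_size(keys, min_cell_size=2)
print(kept)      # {('CA', 2000): 3, ('TX', 2000): 2}
print(excluded)  # {('CA', 2001): 1}
```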

Integration with lwdid

After aggregation, use the panel data with lwdid:

import lwdid
from lwdid.preprocessing import aggregate_to_panel

# Step 1: Aggregate repeated cross-sectional data
agg_result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    treatment_var='treated',
    gvar='first_treat_year'
)

# Step 2: Estimate treatment effects on aggregated panel
results = lwdid.lwdid(
    data=agg_result.panel_data,
    y='income',
    ivar='state',
    tvar='year',
    gvar='first_treat_year',
    rolling='demean',
    estimator='ipwra'
)

print(results.summary())

Methodological Notes

Weight Normalization

Survey weights are normalized within each (unit, period) cell to sum to one:

\[w_{ist}^* = \frac{w_{ist}}{\sum_{j \in (s,t)} w_{jst}}\]

This makes \(\bar{Y}_{st}\) a weighted mean that does not depend on the scale of the raw weights. Each cell then enters the panel as a single observation, regardless of how many lower-level observations it contains.
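A small sketch of the normalization step, using made-up weights for a single cell. It also shows that multiplying the raw weights by a constant leaves the normalized weights unchanged. The function name `normalize` is an illustrative choice.

```python
def normalize(weights):
    """Rescale weights within one cell so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


raw = [1.0, 1.2, 0.8]                        # raw survey weights in one cell
w_star = normalize(raw)                      # normalized weights
rescaled = normalize([100 * w for w in raw])  # same weights on another scale

print(w_star)
print(sum(w_star))
```

Here the raw weights sum to 3.0, so the normalized weights are approximately 0.333, 0.4, and 0.267.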

Variance Computation

When compute_variance=True, the weighted variance within each cell is computed for diagnostic purposes:

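\[s_{st}^2 = \frac{\sum_{i \in (s,t)} w_{ist}^* (Y_{ist} - \bar{Y}_{st})^2}{1 - \sum_{i \in (s,t)} (w_{ist}^*)^2}\]

where \(w_{ist}^*\) are the normalized weights and \(\bar{Y}_{st}\) is the weighted cell mean. The denominator is Bessel's correction generalized to weights: with equal weights \(w_{ist}^* = 1/n_{st}\) it equals \((n_{st} - 1)/n_{st}\), so the expression reduces to the usual sample variance. For a cell with a single observation the denominator is zero, which is why outcome_variance is None when n_obs == 1.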
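The weighted variance formula above can be sketched in a few lines of pure Python. This is an illustration under the stated formula rather than the library's code. The function name is hypothetical, and returning None when the denominator is zero mirrors the documented behaviour of `outcome_variance` for single-observation cells.

```python
import statistics


def weighted_cell_variance(values, weights):
    """Weighted within-cell variance with the weight-adjusted correction."""
    total = sum(weights)
    w_star = [w / total for w in weights]  # normalized weights
    mean = sum(w * y for w, y in zip(w_star, values))
    numerator = sum(w * (y - mean) ** 2 for w, y in zip(w_star, values))
    denominator = 1.0 - sum(w * w for w in w_star)
    if denominator <= 0.0:
        return None  # e.g. a cell with a single observation
    return numerator / denominator


y = [50000.0, 55000.0, 60000.0]
equal_weight_var = weighted_cell_variance(y, [1.0, 1.0, 1.0])
sample_var = statistics.variance(y)  # ordinary (n - 1) sample variance
print(equal_weight_var, sample_var)
```

With equal weights the two values agree up to floating-point error, which is a quick way to check the correction term.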

Treatment Consistency

The function validates that treatment_var is constant within each (unit, period) cell and that gvar is constant within each unit. If either check fails, an InvalidAggregationError is raised.
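Before aggregating, it can be useful to locate inconsistent groups yourself. The sketch below checks the constraints stated in the parameter descriptions (treatment constant within each cell, `gvar` constant within each unit) using pandas. The helper name and the sample data are hypothetical, and lwdid performs its own validation independently of this check.

```python
import pandas as pd


def find_inconsistent_groups(df, group_cols, value_col):
    """Return the groups in which value_col takes more than one value."""
    n_distinct = df.groupby(group_cols)[value_col].nunique(dropna=False)
    return n_distinct[n_distinct > 1].index.tolist()


df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "year": [2000, 2000, 2000, 2000],
    "treated": [0, 0, 0, 1],  # TX-2000 mixes 0 and 1
    "first_treat_year": [2001, 2001, 2002, 2002],
})

bad_cells = find_inconsistent_groups(df, ["state", "year"], "treated")
bad_units = find_inconsistent_groups(df, "state", "first_treat_year")
print(bad_cells)  # one offending cell: TX in 2000
print(bad_units)  # empty: the timing variable is constant within each state
```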

Data Requirements

Input data must satisfy:

Unit identifier: Column identifying the aggregation unit (e.g., state)
Time identifier: Column(s) identifying the time period
Outcome variable: Numeric column to be aggregated
No duplicate observations: Each row should represent a unique lower-level observation (e.g., individual in a specific state-year)

Optional inputs:

Survey weights: For weighted aggregation
Treatment indicator: Binary treatment status (validated for consistency)
Cohort variable: First treatment period for staggered designs
Control variables: Time-invariant covariates to include in output
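The requirements above can be checked up front with a few lines of pandas. This is a hypothetical pre-flight helper, not lwdid's validation code: the function name is invented, and built-in exceptions stand in for the library's own exception classes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_inputs(data, unit_var, time_var, outcome_var, weight_var=None):
    """Hypothetical pre-flight checks; lwdid performs its own validation."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")
    if data.empty:
        raise ValueError("data is empty")
    # time_var may be a single column name or a list of column names.
    time_cols = [time_var] if isinstance(time_var, str) else list(time_var)
    required = [unit_var, *time_cols, outcome_var]
    if weight_var is not None:
        required.append(weight_var)
    missing = [col for col in required if col not in data.columns]
    if missing:
        # lwdid documents MissingRequiredColumnError for this case;
        # KeyError is used here only as a stand-in.
        raise KeyError(f"missing columns: {missing}")
    if not is_numeric_dtype(data[outcome_var]):
        raise ValueError("outcome must be numeric")


sample = pd.DataFrame({"state": ["CA"], "year": [2000], "income": [1.0]})
check_inputs(sample, "state", "year", "income")  # passes silently
```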