Preprocessing Module (preprocessing)

The preprocessing module provides functionality to aggregate repeated cross-sectional data to panel format for use with lwdid estimation methods.

Preprocessing utilities for repeated cross-sectional data.

The aggregation follows the methodology described in Lee & Wooldridge (2026).

Key Functions

aggregate_to_panel

Aggregate repeated cross-sectional data to panel format.

Key Classes

AggregationResult

Container for aggregation results and metadata.

CellStatistics

Statistics for individual (unit, period) cells.

Example

>>> from lwdid.preprocessing import aggregate_to_panel
>>> result = aggregate_to_panel(
...     data=survey_data,
...     unit_var='state',
...     time_var='year',
...     outcome_var='income',
...     weight_var='survey_weight',
...     treatment_var='treated',
... )
>>> panel_data = result.panel_data
>>> print(result.summary())

Overview

When treatment is assigned at the unit level (e.g., state, county) but data is collected at a lower level (e.g., individuals, firms), it is common to aggregate outcomes to the unit-by-period level before applying DiD methods.

This module implements the aggregation methodology described in Lee and Wooldridge (2026), using the weighted average formula:

\[\bar{Y}_{st} = \sum_{i \in (s,t)} w_{ist} Y_{ist}\]

where the weights \(w_{ist}\) are normalized within each (unit, period) cell to sum to one.
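As a rough illustration of the formula above, the cell means can be reproduced with plain pandas. This is a minimal sketch rather than lwdid's actual implementation; the helper name `weighted_cell_means` and the internal column names `_w` and `_wy` are hypothetical, and missing outcomes are not handled.

```python
import pandas as pd

def weighted_cell_means(df, unit, time, outcome, weight=None):
    """Weighted mean of `outcome` within each (unit, time) cell (illustrative only)."""
    work = df[[unit, time, outcome]].copy()
    # With no weight column, every row gets weight 1, i.e. 1/n_st after normalization.
    work["_w"] = 1.0 if weight is None else df[weight].to_numpy()
    work["_wy"] = work["_w"] * work[outcome]
    sums = work.groupby([unit, time])[["_wy", "_w"]].sum()
    # Dividing by the cell's weight total is equivalent to normalizing
    # the weights to sum to one within the cell.
    return (sums["_wy"] / sums["_w"]).rename(outcome).reset_index()

data = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "year": [2000, 2000, 2001, 2000, 2001, 2001],
    "income": [50000, 55000, 60000, 45000, 48000, 52000],
    "weight": [1.0, 1.2, 0.8, 1.0, 1.1, 0.9],
})
panel = weighted_cell_means(data, "state", "year", "income", weight="weight")
```

For the (CA, 2000) cell this gives (1.0 × 50000 + 1.2 × 55000) / 2.2, and cells with a single observation simply return that observation's outcome.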

Main Functions

aggregate_to_panel

lwdid.preprocessing.aggregate_to_panel(data, unit_var, time_var, outcome_var, *, weight_var=None, controls=None, treatment_var=None, gvar=None, frequency='annual', min_cell_size=1, compute_variance=False)

Aggregate repeated cross-sectional data to panel format.

This function aggregates lower-level repeated cross-sectional data (e.g., individuals, counties) to the unit-by-period level (e.g., state-year) using weighted means. The aggregation follows Lee & Wooldridge (2026), Section 3.

Formula: Y_bar_st = sum_{i in (s,t)} w_ist * Y_ist, where sum w_ist = 1

Parameters:
  • data (pd.DataFrame) – Repeated cross-sectional data in long format.

  • unit_var (str) – Column name for aggregation unit (e.g., 'state').

  • time_var (str or list of str) – Time variable(s). Single string for annual data, list of [year, quarter/month/week] for high-frequency data.

  • outcome_var (str) – Outcome variable column name.

  • weight_var (str, optional) – Survey weight column name. If None, uses equal weights (1/n_st).

  • controls (list of str, optional) – Control variable column names to aggregate.

  • treatment_var (str, optional) – Treatment indicator column name. Must be constant within each cell.

  • gvar (str, optional) – Treatment timing variable. Must be constant within each unit.

  • frequency ({'annual', 'quarterly', 'monthly', 'weekly'}, default='annual') – Aggregation frequency.

  • min_cell_size (int, default=1) – Minimum observations per cell. Cells with fewer observations than this threshold are excluded.

  • compute_variance (bool, default=False) – Whether to compute within-cell variance estimates.

Returns:

Container with aggregated panel data and metadata.

Return type:

AggregationResult

Raises:

Examples

>>> import pandas as pd
>>> from lwdid.preprocessing import aggregate_to_panel
>>> # Create sample repeated cross-section data
>>> data = pd.DataFrame({
...     'state': ['CA', 'CA', 'CA', 'TX', 'TX', 'TX'],
...     'year': [2000, 2000, 2001, 2000, 2001, 2001],
...     'income': [50000, 55000, 60000, 45000, 48000, 52000],
...     'weight': [1.0, 1.2, 0.8, 1.0, 1.1, 0.9],
... })
>>> result = aggregate_to_panel(
...     data, 'state', 'year', 'income', weight_var='weight'
... )
>>> print(result.panel_data)

Result Classes

AggregationResult

class lwdid.preprocessing.AggregationResult(panel_data, n_original_obs, n_cells, n_units, n_periods, cell_stats, min_cell_size, max_cell_size, mean_cell_size, median_cell_size, unit_var, time_var, outcome_var, weight_var, frequency, n_excluded_cells=0, excluded_cells_info=<factory>)

Container for aggregation results and metadata.

This class holds the aggregated panel data along with comprehensive metadata about the aggregation process, including cell statistics and configuration parameters.

Variables:
  • panel_data (pd.DataFrame) – Aggregated panel data with one row per (unit, period) combination.

  • n_original_obs (int) – Total number of observations in the original data.

  • n_cells (int) – Number of (unit, period) cells in the output.

  • n_units (int) – Number of unique units in the output.

  • n_periods (int) – Number of unique periods in the output.

  • cell_stats (pd.DataFrame) – DataFrame with statistics for each cell.

  • min_cell_size (int) – Minimum cell size across all cells.

  • max_cell_size (int) – Maximum cell size across all cells.

  • mean_cell_size (float) – Mean cell size across all cells.

  • median_cell_size (float) – Median cell size across all cells.

  • unit_var (str) – Name of the unit variable column.

  • time_var (str or list of str) – Name(s) of the time variable column(s).

  • outcome_var (str) – Name of the outcome variable column.

  • weight_var (str or None) – Name of the weight variable column (None if equal weights).

  • frequency (str) – Aggregation frequency ('annual', 'quarterly', 'monthly', 'weekly').

  • n_excluded_cells (int) – Number of cells excluded due to min_cell_size or all-NaN outcomes.

  • excluded_cells_info (list of dict) – Information about excluded cells.

panel_data: DataFrame
n_original_obs: int
n_cells: int
n_units: int
n_periods: int
cell_stats: DataFrame
min_cell_size: int
max_cell_size: int
mean_cell_size: float
median_cell_size: float
unit_var: str
time_var: str | list[str]
outcome_var: str
weight_var: str | None
frequency: str
n_excluded_cells: int = 0
excluded_cells_info: list

summary()

Return formatted summary of aggregation.

Returns:

Multi-line string with aggregation statistics.

Return type:

str

Examples

>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> print(result.summary())
Aggregation Summary
===================
Original observations: 10000
Output cells: 150
Units: 50
Periods: 3
...

to_dict()

Return aggregation parameters as dictionary.

Returns:

Dictionary containing all aggregation parameters and statistics.

Return type:

dict

Examples

>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> params = result.to_dict()
>>> params['n_units']
50

to_csv(path, include_metadata=True)

Export panel data to CSV with optional metadata header.

Parameters:
  • path (str) – Output file path.

  • include_metadata (bool, default=True) – Whether to include aggregation metadata as header comments.

Return type:

None

Examples

>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> result.to_csv('aggregated_panel.csv')
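Reading such a file back requires skipping the metadata header. The sketch below assumes the header lines are prefixed with `#`; that prefix and the example header fields are assumptions for illustration, since the exact header format is not specified here.

```python
import io
import pandas as pd

# Hypothetical file contents: the '#' prefix and these header fields are assumptions.
csv_text = """# unit_var: state
# time_var: year
# n_cells: 2
state,year,income
CA,2000,52500.0
TX,2000,45000.0
"""

# comment='#' makes pandas ignore fully commented lines before locating the header row.
panel = pd.read_csv(io.StringIO(csv_text), comment="#")
```

If the metadata lines use a different marker, passing `include_metadata=False` when exporting avoids the issue altogether.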

CellStatistics

class lwdid.preprocessing.CellStatistics(unit, period, n_obs, outcome_mean, outcome_variance=None, effective_sample_size=None, weight_type='equal')

Statistics for a single (unit, period) cell.

Variables:
  • unit (Any) – Unit identifier value.

  • period (Any) – Period identifier value (year, or tuple of year/quarter/month/week).

  • n_obs (int) – Number of observations in the cell.

  • outcome_mean (float) – Weighted mean of the outcome variable.

  • outcome_variance (float or None) – Weighted variance of the outcome (None if not computed or n_obs == 1).

  • effective_sample_size (float or None) – Effective sample size when survey weights are used.

  • weight_type ({'equal', 'survey'}) – Type of weights used for aggregation.

unit: Any
period: Any
n_obs: int
outcome_mean: float
outcome_variance: float | None = None
effective_sample_size: float | None = None
weight_type: Literal['equal', 'survey'] = 'equal'
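The definition of effective_sample_size is not spelled out above. One common choice is Kish's approximation, (Σw)² / Σw²; whether lwdid uses exactly this formula is an assumption, and the helper name below is hypothetical. The sketch is meant only to show how unequal weights reduce the effective number of observations in a cell.

```python
import numpy as np

def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.square(w).sum()

# Equal weights: the effective size equals the number of observations.
n_eff_equal = kish_effective_sample_size([1.0, 1.0, 1.0, 1.0])
# One dominant weight: the effective size falls below the number of observations.
n_eff_skewed = kish_effective_sample_size([1.0, 1.0, 1.0, 5.0])
```

The measure is invariant to rescaling the weights, so it gives the same answer before and after within-cell normalization.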

Usage Examples

Basic Aggregation

Aggregate individual-level survey data to a state-year panel:

from lwdid.preprocessing import aggregate_to_panel

# Aggregate to state-year panel
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    treatment_var='treated'
)

# Access the aggregated panel data
panel_data = result.panel_data

# View summary statistics
print(result.summary())

Weighted Aggregation

Pass a survey weight column to compute weighted cell means:

result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    weight_var='survey_weight',
    treatment_var='treated',
    gvar='first_treat_year'
)

# Check cell sizes
print(f"Min cell size: {result.min_cell_size}")
print(f"Mean cell size: {result.mean_cell_size:.1f}")

High-Frequency Aggregation

Aggregate to quarterly or monthly panels:

# Quarterly aggregation
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var=['year', 'quarter'],
    outcome_var='income',
    frequency='quarterly'
)

# Monthly aggregation
result = aggregate_to_panel(
    data=survey_data,
    unit_var='county',
    time_var=['year', 'month'],
    outcome_var='employment',
    frequency='monthly'
)
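The calls above expect separate year and sub-period columns. If the raw data only carries a date, those columns could be derived first, as in this sketch; the `interview_date` column is a hypothetical name used for illustration.

```python
import pandas as pd

survey_data = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "interview_date": pd.to_datetime(["2001-02-15", "2001-05-03", "2001-11-20"]),
    "income": [50000, 55000, 45000],
})

# Derive the period columns that would be passed as time_var=['year', 'quarter']
# or time_var=['year', 'month'].
survey_data["year"] = survey_data["interview_date"].dt.year
survey_data["quarter"] = survey_data["interview_date"].dt.quarter
survey_data["month"] = survey_data["interview_date"].dt.month
```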

Aggregating Control Variables

Include time-invariant controls in the aggregation:

result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    controls=['population', 'median_age', 'urban_pct'],
    treatment_var='treated'
)

# Control variables are included in the output
print(result.panel_data.columns.tolist())

Minimum Cell Size Requirement

Exclude cells with insufficient observations:

result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    min_cell_size=30,  # Require at least 30 observations per cell
    compute_variance=True
)

# Check excluded cells
print(f"Excluded {result.n_excluded_cells} cells")
for info in result.excluded_cells_info:
    print(f"  {info}")
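To preview which cells a given threshold would drop before running the aggregation, a per-cell count is enough. The sketch below is an illustration with a hypothetical helper name; it counts all rows in a cell, whereas the library may count only rows with a non-missing outcome.

```python
import pandas as pd

def split_cells_by_size(df, unit, time, min_cell_size):
    """Return (rows in cells that pass, table of cells that fail) for a size threshold."""
    sizes = df.groupby([unit, time]).size().rename("n_obs").reset_index()
    excluded = sizes[sizes["n_obs"] < min_cell_size]
    keep_keys = sizes.loc[sizes["n_obs"] >= min_cell_size, [unit, time]]
    # An inner merge keeps only rows whose (unit, time) cell met the threshold.
    kept = df.merge(keep_keys, on=[unit, time], how="inner")
    return kept, excluded

data = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "year": [2000, 2000, 2001, 2000, 2001, 2001],
    "income": [50000, 55000, 60000, 45000, 48000, 52000],
})
kept, excluded = split_cells_by_size(data, "state", "year", min_cell_size=2)
```

Here the (CA, 2001) and (TX, 2000) cells each hold one observation, so a threshold of 2 would exclude both.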

Integration with lwdid

After aggregation, use the panel data with lwdid:

import lwdid
from lwdid.preprocessing import aggregate_to_panel

# Step 1: Aggregate repeated cross-sectional data
agg_result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    treatment_var='treated',
    gvar='first_treat_year'
)

# Step 2: Estimate treatment effects on aggregated panel
results = lwdid.lwdid(
    data=agg_result.panel_data,
    y='income',
    ivar='state',
    tvar='year',
    gvar='first_treat_year',
    rolling='demean',
    estimator='ipwra'
)

print(results.summary())

Methodological Notes

Weight Normalization

Survey weights are normalized within each (unit, period) cell to sum to one:

\[w_{ist}^* = \frac{w_{ist}}{\sum_{j \in (s,t)} w_{jst}}\]

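Because of this normalization, each cell mean is a proper weighted average that does not depend on the overall scale of the raw weights, and each cell enters the aggregated panel as a single observation regardless of sample size differences across cells.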
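The normalization can be sketched with a grouped transform. This is an illustration only; the column name `w_star` is hypothetical and mirrors the \(w_{ist}^*\) notation above.

```python
import pandas as pd

data = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX", "TX"],
    "year": [2000, 2000, 2000, 2000, 2000],
    "weight": [1.0, 3.0, 2.0, 2.0, 4.0],
})

# Total raw weight of the cell each row belongs to, aligned row by row.
cell_total = data.groupby(["state", "year"])["weight"].transform("sum")
data["w_star"] = data["weight"] / cell_total

# Sanity check: normalized weights sum to one within every cell.
check = data.groupby(["state", "year"])["w_star"].sum()
```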

Variance Computation

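When compute_variance=True, the weighted variance of the outcome within each cell is computed for diagnostic purposes:

\[\widehat{\mathrm{Var}}(Y_{ist}) = \frac{\sum_{i} w_{ist}^* (Y_{ist} - \bar{Y}_{st})^2}{1 - \sum_{i} (w_{ist}^*)^2}\]

This is the weighted analogue of Bessel's correction: with equal weights \(w_{ist}^* = 1/n_{st}\), it reduces to the usual sample variance with divisor \(n_{st} - 1\). The denominator is zero for a cell with a single observation, so no variance is reported in that case.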
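A minimal NumPy sketch of this formula follows. The function name is hypothetical and the code is not taken from lwdid; it returns None for cells where the denominator vanishes, matching the documented behavior of outcome_variance for single-observation cells.

```python
import numpy as np

def weighted_cell_variance(y, weights=None):
    """Weighted within-cell variance with the weight-adjusted Bessel correction."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)
    w_star = w / w.sum()                    # normalize weights to sum to one
    y_bar = np.sum(w_star * y)              # weighted cell mean
    denom = 1.0 - np.sum(w_star ** 2)
    if denom <= 1e-12:                      # single observation, or all weight on one row
        return None
    return float(np.sum(w_star * (y - y_bar) ** 2) / denom)

var_equal = weighted_cell_variance([1.0, 2.0, 3.0, 4.0])
var_single = weighted_cell_variance([5.0])
```

With equal weights the result coincides with `np.var(y, ddof=1)`, which is a convenient check on the formula.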

Treatment Consistency

The function validates that treatment status is consistent within each unit across all time periods. If treatment varies within a unit, a warning is issued and the modal treatment value is used.
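The check described above might look like the following sketch. The helper name is hypothetical, and resolving ties between equally frequent values by taking the smallest one is an assumption, not documented behavior.

```python
import warnings
import pandas as pd

def resolve_unit_treatment(df, unit, treatment):
    """Return one treatment value per unit, warning if it varies within a unit."""
    n_values = df.groupby(unit)[treatment].nunique()
    inconsistent = n_values[n_values > 1].index.tolist()
    if inconsistent:
        warnings.warn(
            f"Treatment varies within units {inconsistent}; using the modal value."
        )
    # Series.mode() returns the modes in sorted order; taking the first one
    # means ties resolve to the smallest value (an assumption of this sketch).
    return df.groupby(unit)[treatment].agg(lambda s: s.mode().iloc[0])

data = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX"],
    "treated": [1, 1, 0, 0, 0],
})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unit_treatment = resolve_unit_treatment(data, "state", "treated")
```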

Data Requirements

Input data must satisfy:

  1. Unit identifier: Column identifying the aggregation unit (e.g., state)

  2. Time identifier: Column(s) identifying the time period

  3. Outcome variable: Numeric column to be aggregated

  4. No duplicate observations: Each row should represent a unique lower-level observation (e.g., individual in a specific state-year)

Optional inputs:

  • Survey weights: For weighted aggregation

  • Treatment indicator: Binary treatment status (validated for consistency)

  • Cohort variable: First treatment period for staggered designs

  • Control variables: Time-invariant covariates to include in output
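The required inputs above can be checked before calling aggregate_to_panel. The helper below is a hypothetical pre-flight check, not part of lwdid. Flagging fully duplicated rows is only a heuristic, since two distinct respondents can share identical values when the data has no respondent identifier.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_inputs(df, unit_var, time_var, outcome_var):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    time_cols = [time_var] if isinstance(time_var, str) else list(time_var)
    required = [unit_var, *time_cols, outcome_var]
    missing = [c for c in required if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if not is_numeric_dtype(df[outcome_var]):
        problems.append(f"outcome '{outcome_var}' is not numeric")
    n_dup = int(df.duplicated().sum())
    if n_dup:
        problems.append(f"{n_dup} fully duplicated rows")
    return problems

data = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "year": [2000, 2000, 2000],
    "income": [50000, 50000, 45000],
})
problems = check_inputs(data, "state", "year", "income")
```

In this toy data the first two rows are identical, so the check reports one duplicated row; whether that is a genuine duplicate or two similar respondents is for the analyst to decide.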