Preprocessing Module (preprocessing)
Preprocessing utilities for repeated cross-sectional data.
This module provides functionality to aggregate repeated cross-sectional data to panel format for use with lwdid estimation methods. The aggregation follows the methodology described in Lee & Wooldridge (2026).
Key Functions
- aggregate_to_panel
Aggregate repeated cross-sectional data to panel format.
Key Classes
- AggregationResult
Container for aggregation results and metadata.
- CellStatistics
Statistics for individual (unit, period) cells.
Example
>>> from lwdid.preprocessing import aggregate_to_panel
>>> result = aggregate_to_panel(
...     data=survey_data,
...     unit_var='state',
...     time_var='year',
...     outcome_var='income',
...     weight_var='survey_weight',
...     treatment_var='treated',
... )
>>> panel_data = result.panel_data
>>> print(result.summary())
Overview
When treatment is assigned at the unit level (e.g., state, county) but data is collected at a lower level (e.g., individuals, firms), it is common to aggregate outcomes to the unit-by-period level before applying DiD methods.
This module implements the aggregation methodology described in Lee and Wooldridge (2026), using the weighted average formula:
\[ \bar{Y}_{st} = \sum_{i \in (s,t)} w_{ist} Y_{ist} \]
where the weights \(w_{ist}\) are normalized within each (unit, period) cell to sum to one.
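As an illustration of the weighted average, the following sketch computes cell means with plain pandas on a toy data set. It is not the library's implementation; the column names, values, and the helper cell_weighted_mean are made up for the example.

```python
import pandas as pd

# Toy repeated cross-section: two states, two years (illustrative values).
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "year": [2000, 2000, 2001, 2000, 2001, 2001],
    "income": [50000.0, 55000.0, 60000.0, 45000.0, 48000.0, 52000.0],
    "weight": [1.0, 1.0, 2.0, 1.0, 1.0, 3.0],
})


def cell_weighted_mean(cell: pd.DataFrame) -> float:
    """Hypothetical helper: weighted mean of income within one cell."""
    # Normalize the weights within the cell so that they sum to one.
    w = cell["weight"] / cell["weight"].sum()
    return float((w * cell["income"]).sum())


# One entry per (state, year) cell, keyed by the cell identifier.
cell_means = {
    key: cell_weighted_mean(cell)
    for key, cell in df.groupby(["state", "year"])
}
```

Setting every weight to the same value reproduces the unweighted cell mean, which matches the documented equal-weight behaviour when weight_var is None.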
Main Functions
aggregate_to_panel
- lwdid.preprocessing.aggregate_to_panel(data, unit_var, time_var, outcome_var, *, weight_var=None, controls=None, treatment_var=None, gvar=None, frequency='annual', min_cell_size=1, compute_variance=False)
Aggregate repeated cross-sectional data to panel format.
This function aggregates lower-level repeated cross-sectional data (e.g., individuals, counties) to the unit-by-period level (e.g., state-year) using weighted means. The aggregation follows Lee & Wooldridge (2026), Section 3.
Formula: Y_bar_st = sum_{i in (s,t)} w_ist * Y_ist, where sum w_ist = 1
- Parameters:
data (pd.DataFrame) – Repeated cross-sectional data in long format.
unit_var (str) – Column name for aggregation unit (e.g., ‘state’).
time_var (str or list of str) – Time variable(s). Single string for annual data, list of [year, quarter/month/week] for high-frequency data.
outcome_var (str) – Outcome variable column name.
weight_var (str, optional) – Survey weight column name. If None, uses equal weights (1/n_st).
controls (list of str, optional) – Control variable column names to aggregate.
treatment_var (str, optional) – Treatment indicator column name. Must be constant within each cell.
gvar (str, optional) – Treatment timing variable. Must be constant within each unit.
frequency ({'annual', 'quarterly', 'monthly', 'weekly'}, default='annual') – Aggregation frequency.
min_cell_size (int, default=1) – Minimum observations per cell. Cells below threshold are excluded.
compute_variance (bool, default=False) – Whether to compute within-cell variance estimates.
- Returns:
Container with aggregated panel data and metadata.
- Return type:
AggregationResult
- Raises:
TypeError – If data is not a pandas DataFrame.
ValueError – If input data is empty or outcome is not numeric.
MissingRequiredColumnError – If required columns are missing.
InvalidAggregationError – If treatment varies within cells or gvar varies within units.
InsufficientCellSizeError – If all cells are below min_cell_size threshold.
Examples
>>> import pandas as pd
>>> from lwdid.preprocessing import aggregate_to_panel
>>> # Create sample repeated cross-section data
>>> data = pd.DataFrame({
...     'state': ['CA', 'CA', 'CA', 'TX', 'TX', 'TX'],
...     'year': [2000, 2000, 2001, 2000, 2001, 2001],
...     'income': [50000, 55000, 60000, 45000, 48000, 52000],
...     'weight': [1.0, 1.2, 0.8, 1.0, 1.1, 0.9],
... })
>>> result = aggregate_to_panel(
...     data, 'state', 'year', 'income', weight_var='weight'
... )
>>> print(result.panel_data)
Result Classes
AggregationResult
- class lwdid.preprocessing.AggregationResult(panel_data, n_original_obs, n_cells, n_units, n_periods, cell_stats, min_cell_size, max_cell_size, mean_cell_size, median_cell_size, unit_var, time_var, outcome_var, weight_var, frequency, n_excluded_cells=0, excluded_cells_info=<factory>)
Container for aggregation results and metadata.
This class holds the aggregated panel data along with comprehensive metadata about the aggregation process, including cell statistics and configuration parameters.
- Variables:
panel_data (pd.DataFrame) – Aggregated panel data with one row per (unit, period) combination.
n_original_obs (int) – Total number of observations in the original data.
n_cells (int) – Number of (unit, period) cells in the output.
n_units (int) – Number of unique units in the output.
n_periods (int) – Number of unique periods in the output.
cell_stats (pd.DataFrame) – DataFrame with statistics for each cell.
min_cell_size (int) – Minimum cell size across all cells.
max_cell_size (int) – Maximum cell size across all cells.
mean_cell_size (float) – Mean cell size across all cells.
median_cell_size (float) – Median cell size across all cells.
unit_var (str) – Name of the unit variable column.
time_var (str or list of str) – Name(s) of the time variable column(s).
outcome_var (str) – Name of the outcome variable column.
weight_var (str or None) – Name of the weight variable column (None if equal weights).
frequency (str) – Aggregation frequency (‘annual’, ‘quarterly’, ‘monthly’, ‘weekly’).
n_excluded_cells (int) – Number of cells excluded due to min_cell_size or all-NaN outcomes.
excluded_cells_info (list of dict) – Information about excluded cells.
- panel_data: DataFrame
- n_original_obs: int
- n_cells: int
- n_units: int
- n_periods: int
- cell_stats: DataFrame
- min_cell_size: int
- max_cell_size: int
- mean_cell_size: float
- median_cell_size: float
- unit_var: str
- outcome_var: str
- frequency: str
- n_excluded_cells: int = 0
- excluded_cells_info: list
- summary()
Return formatted summary of aggregation.
- Returns:
Multi-line string with aggregation statistics.
- Return type:
str
Examples
>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> print(result.summary())
Aggregation Summary
===================
Original observations: 10000
Output cells: 150
Units: 50
Periods: 3
...
- to_dict()
Return aggregation parameters as dictionary.
- Returns:
Dictionary containing all aggregation parameters and statistics.
- Return type:
dict
Examples
>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> params = result.to_dict()
>>> params['n_units']
50
- to_csv(path, include_metadata=True)
Export panel data to CSV with optional metadata header.
- Parameters:
- Return type:
Examples
>>> result = aggregate_to_panel(data, 'state', 'year', 'income')
>>> result.to_csv('aggregated_panel.csv')
CellStatistics
- class lwdid.preprocessing.CellStatistics(unit, period, n_obs, outcome_mean, outcome_variance=None, effective_sample_size=None, weight_type='equal')
Statistics for a single (unit, period) cell.
- Variables:
unit (Any) – Unit identifier value.
period (Any) – Period identifier value (year, or tuple of year/quarter/month/week).
n_obs (int) – Number of observations in the cell.
outcome_mean (float) – Weighted mean of the outcome variable.
outcome_variance (float or None) – Weighted variance of the outcome (None if not computed or n_obs == 1).
effective_sample_size (float or None) – Effective sample size when survey weights are used.
weight_type ({'equal', 'survey'}) – Type of weights used for aggregation.
- unit: Any
- period: Any
- n_obs: int
- outcome_mean: float
- weight_type: Literal['equal', 'survey'] = 'equal'
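The reference above does not state how effective_sample_size is defined. A common choice for survey weights is Kish's approximation, the squared sum of the weights divided by the sum of squared weights. The sketch below assumes that definition, which may differ from what the library computes; the helper name is hypothetical.

```python
import numpy as np


def kish_effective_sample_size(weights) -> float:
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights).

    Assumed definition for illustration; not taken from the lwdid source.
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))


# With equal weights the effective sample size equals the cell size.
equal = kish_effective_sample_size([1.0, 1.0, 1.0, 1.0])
# Unequal weights reduce it below the number of observations.
unequal = kish_effective_sample_size([1.0, 3.0])
```

Under this definition the value is invariant to rescaling the weights, so it is the same for raw and normalized weights.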
Usage Examples
Basic Aggregation
Aggregate individual-level survey data to state-year panel:
from lwdid.preprocessing import aggregate_to_panel
# Aggregate to state-year panel
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    treatment_var='treated'
)
# Access the aggregated panel data
panel_data = result.panel_data
# View summary statistics
print(result.summary())
Weighted Aggregation
Use survey weights for proper aggregation:
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    weight_var='survey_weight',
    treatment_var='treated',
    gvar='first_treat_year'
)
# Check cell sizes (number of observations per cell)
print(f"Min cell size: {result.min_cell_size}")
print(f"Mean cell size: {result.mean_cell_size:.1f}")
High-Frequency Aggregation
Aggregate to quarterly or monthly panels:
# Quarterly aggregation
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var=['year', 'quarter'],
    outcome_var='income',
    frequency='quarterly'
)
# Monthly aggregation
result = aggregate_to_panel(
    data=survey_data,
    unit_var='county',
    time_var=['year', 'month'],
    outcome_var='employment',
    frequency='monthly'
)
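The high-frequency calls above expect separate year and quarter (or month) columns. If the survey only records an interview date, these columns can be derived with pandas before aggregating. The sketch below assumes a hypothetical interview_date column; the data are illustrative.

```python
import pandas as pd

# Illustrative survey extract with a single date column (hypothetical name).
survey = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "interview_date": pd.to_datetime(
        ["2020-02-15", "2020-05-01", "2021-11-30"]
    ),
    "income": [50000.0, 52000.0, 47000.0],
})

# Derive the calendar components used as time_var=['year', 'quarter']
# or time_var=['year', 'month'].
survey["year"] = survey["interview_date"].dt.year
survey["quarter"] = survey["interview_date"].dt.quarter
survey["month"] = survey["interview_date"].dt.month
```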
Aggregating Control Variables
Include time-invariant controls in the aggregation:
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    controls=['population', 'median_age', 'urban_pct'],
    treatment_var='treated'
)
# Control variables are included in the output
print(result.panel_data.columns.tolist())
Minimum Cell Size Requirement
Exclude cells with insufficient observations:
result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    min_cell_size=30,  # Require at least 30 observations per cell
    compute_variance=True
)
# Check excluded cells
print(f"Excluded {result.n_excluded_cells} cells")
for info in result.excluded_cells_info:
    print(f"  {info}")
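Before choosing a min_cell_size, it can help to inspect the distribution of cell sizes in the raw data. The following sketch uses plain pandas and a toy data set; it only previews which cells would fall below a threshold and does not reproduce the library's exclusion logic.

```python
import pandas as pd

# Toy data: CA has 3 observations, TX has 1, NY has 2 (all in year 2000).
df = pd.DataFrame({
    "state": ["CA"] * 3 + ["TX"] * 1 + ["NY"] * 2,
    "year": [2000] * 6,
    "income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

min_cell_size = 2  # threshold to preview

# Number of rows in each (state, year) cell.
cell_sizes = df.groupby(["state", "year"]).size()

# Cells that would fall below the threshold, and those that would remain.
small_cells = cell_sizes[cell_sizes < min_cell_size]
kept_cells = cell_sizes[cell_sizes >= min_cell_size]
```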
Integration with lwdid
After aggregation, use the panel data with lwdid:
import lwdid
from lwdid.preprocessing import aggregate_to_panel
# Step 1: Aggregate repeated cross-sectional data
agg_result = aggregate_to_panel(
    data=survey_data,
    unit_var='state',
    time_var='year',
    outcome_var='income',
    treatment_var='treated',
    gvar='first_treat_year'
)
# Step 2: Estimate treatment effects on aggregated panel
results = lwdid.lwdid(
    data=agg_result.panel_data,
    y='income',
    ivar='state',
    tvar='year',
    gvar='first_treat_year',
    rolling='demean',
    estimator='ipwra'
)
print(results.summary())
Methodological Notes
Weight Normalization
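Survey weights are normalized within each (unit, period) cell to sum to one:
\[ w_{ist} = \frac{w^{*}_{ist}}{\sum_{j \in (s,t)} w^{*}_{jst}}, \qquad \sum_{i \in (s,t)} w_{ist} = 1 \]
where \(w^{*}_{ist}\) denotes the raw survey weight of observation \(i\) in cell \((s,t)\).
This ensures that each cell contributes equally to the estimation regardless of sample size differences across cells.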
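A minimal pandas sketch of this normalization, assuming a hypothetical raw_weight column, is shown below. It illustrates the idea only and is not taken from the library source.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX", "TX"],
    "year": [2000, 2000, 2000, 2000, 2000],
    "raw_weight": [2.0, 6.0, 1.0, 1.0, 2.0],
})

# Total raw weight of the cell each row belongs to, aligned row by row.
cell_total = df.groupby(["state", "year"])["raw_weight"].transform("sum")

# Normalized weights: each cell's weights now sum to one.
df["norm_weight"] = df["raw_weight"] / cell_total

# Sanity check: one sum per cell, all equal to 1.
weight_sums = df.groupby(["state", "year"])["norm_weight"].sum()
```

Because only the relative weights within a cell matter, multiplying all raw weights in a cell by a constant leaves the normalized weights unchanged.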
Variance Computation
When compute_variance=True, the weighted variance within each cell is computed for diagnostic purposes, using a Bessel correction adjusted for weights.
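The exact variance formula is not reproduced here. One common weight-adjusted Bessel correction for normalized weights divides the weighted sum of squared deviations by one minus the sum of squared weights. The sketch below assumes that form, which may differ from the library's computation; the function name is hypothetical.

```python
import numpy as np


def weighted_cell_variance(y, w):
    """Weighted variance with a reliability-weight Bessel correction.

    Assumed formula: sum_i w_i (y_i - ybar)^2 / (1 - sum_i w_i^2),
    with weights normalized to sum to one. Returns None when the
    correction is undefined (for example, a single observation).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalize within the cell
    mean = np.sum(w * y)                 # weighted cell mean
    denom = 1.0 - np.sum(w ** 2)         # weight-adjusted Bessel term
    if denom <= 0:
        return None
    return float(np.sum(w * (y - mean) ** 2) / denom)


# With equal weights this reduces to the usual sample variance (ddof=1).
equal_weight_var = weighted_cell_variance([1, 2, 3, 4], [1, 1, 1, 1])
single_obs_var = weighted_cell_variance([5.0], [2.0])
```

Returning None for a single-observation cell mirrors the documented behaviour of outcome_variance when n_obs == 1.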
Treatment Consistency
The function validates that the treatment indicator is constant within each (unit, period) cell and that gvar is constant within each unit. If either check fails, InvalidAggregationError is raised, as listed in the Raises section of aggregate_to_panel.
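The parameter descriptions for aggregate_to_panel require treatment_var to be constant within each cell and gvar to be constant within each unit. A pre-check along these lines can locate offending cells before aggregation. The sketch uses plain pandas on toy data and is not the library's validation code.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX"],
    "year": [2000, 2000, 2001, 2000, 2000],
    "treated": [0, 0, 1, 0, 1],            # TX-2000 mixes 0 and 1
    "first_treat_year": [2001, 2001, 2001, 0, 0],
})

# Number of distinct treatment values in each (state, year) cell.
treat_levels = df.groupby(["state", "year"])["treated"].nunique()
bad_cells = treat_levels[treat_levels > 1].index.tolist()

# Number of distinct cohort values in each state.
gvar_levels = df.groupby("state")["first_treat_year"].nunique()
bad_units = gvar_levels[gvar_levels > 1].index.tolist()
```

Here bad_cells flags the TX cell in 2000, while bad_units is empty because the cohort variable is constant within each state.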
Data Requirements
Input data must satisfy:
Unit identifier: Column identifying the aggregation unit (e.g., state)
Time identifier: Column(s) identifying the time period
Outcome variable: Numeric column to be aggregated
No duplicate observations: Each row should represent a unique lower-level observation (e.g., individual in a specific state-year)
Optional inputs:
Survey weights: For weighted aggregation
Treatment indicator: Binary treatment status (validated for consistency)
Cohort variable: First treatment period for staggered designs
Control variables: Time-invariant covariates to include in output
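The required inputs listed above can be checked before calling aggregate_to_panel. The helper below is a hypothetical pre-flight check written with plain pandas. It reports problems as strings rather than raising the library's exceptions, and it flags fully duplicated rows for review only, since identical rows can be legitimate in survey data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_inputs(df, unit_var, time_vars, outcome_var):
    """Hypothetical pre-flight check; returns a list of problem messages."""
    problems = []
    required = [unit_var, *time_vars, outcome_var]
    missing = [c for c in required if c not in df.columns]
    if missing:
        # Without the required columns the other checks cannot run.
        problems.append(f"missing columns: {missing}")
        return problems
    if not is_numeric_dtype(df[outcome_var]):
        problems.append(f"outcome '{outcome_var}' is not numeric")
    n_dupes = int(df.duplicated().sum())
    if n_dupes:
        problems.append(f"{n_dupes} fully duplicated rows")
    return problems


clean = pd.DataFrame({
    "state": ["CA", "TX"],
    "year": [2000, 2000],
    "income": [1.0, 2.0],
})
ok = check_inputs(clean, "state", ["year"], "income")
no_quarter = check_inputs(clean, "state", ["year", "quarter"], "income")
```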