Preprocessing Module (preprocessing)
====================================

The preprocessing module provides functionality to aggregate repeated
cross-sectional data to panel format for use with lwdid estimation methods.

.. automodule:: lwdid.preprocessing
   :no-members:

Overview
--------

When treatment is assigned at the unit level (e.g., state, county) but data
is collected at a lower level (e.g., individuals, firms), it is common to
aggregate outcomes to the unit-by-period level before applying DiD methods.

This module implements the aggregation methodology described in Lee and
Wooldridge (2026), using the weighted average formula:

.. math::

   \bar{Y}_{st} = \sum_{i \in (s,t)} w_{ist} Y_{ist}

where the weights :math:`w_{ist}` are normalized within each (unit, period)
cell to sum to one.

Main Functions
--------------

aggregate_to_panel
~~~~~~~~~~~~~~~~~~

.. autofunction:: lwdid.preprocessing.aggregate_to_panel
   :no-index:

Result Classes
--------------

AggregationResult
~~~~~~~~~~~~~~~~~

.. autoclass:: lwdid.preprocessing.AggregationResult
   :members:
   :undoc-members:
   :no-index:

CellStatistics
~~~~~~~~~~~~~~

.. autoclass:: lwdid.preprocessing.CellStatistics
   :members:
   :undoc-members:
   :no-index:

Usage Examples
--------------

Basic Aggregation
~~~~~~~~~~~~~~~~~

Aggregate individual-level survey data to state-year panel:

.. code-block:: python

   from lwdid.preprocessing import aggregate_to_panel

   # Aggregate to state-year panel
   result = aggregate_to_panel(
       data=survey_data,
       unit_var='state',
       time_var='year',
       outcome_var='income',
       treatment_var='treated'
   )

   # Access the aggregated panel data
   panel_data = result.panel_data

   # View summary statistics
   print(result.summary())

Weighted Aggregation
~~~~~~~~~~~~~~~~~~~~

Use survey weights for proper aggregation:

.. code-block:: python

   result = aggregate_to_panel(
       data=survey_data,
       unit_var='state',
       time_var='year',
       outcome_var='income',
       weight_var='survey_weight',
       treatment_var='treated',
       gvar='first_treat_year'
   )

   # Check effective sample sizes
   print(f"Min cell size: {result.min_cell_size}")
   print(f"Mean cell size: {result.mean_cell_size:.1f}")

High-Frequency Aggregation
~~~~~~~~~~~~~~~~~~~~~~~~~~

Aggregate to quarterly or monthly panels:

.. code-block:: python

   # Quarterly aggregation
   result = aggregate_to_panel(
       data=survey_data,
       unit_var='state',
       time_var=['year', 'quarter'],
       outcome_var='income',
       frequency='quarterly'
   )

   # Monthly aggregation
   result = aggregate_to_panel(
       data=survey_data,
       unit_var='county',
       time_var=['year', 'month'],
       outcome_var='employment',
       frequency='monthly'
   )

Aggregating Control Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Include time-invariant controls in the aggregation:

.. code-block:: python

   result = aggregate_to_panel(
       data=survey_data,
       unit_var='state',
       time_var='year',
       outcome_var='income',
       controls=['population', 'median_age', 'urban_pct'],
       treatment_var='treated'
   )

   # Control variables are included in the output
   print(result.panel_data.columns.tolist())

Minimum Cell Size Requirement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Exclude cells with insufficient observations:

.. code-block:: python

   result = aggregate_to_panel(
       data=survey_data,
       unit_var='state',
       time_var='year',
       outcome_var='income',
       min_cell_size=30,  # Require at least 30 observations per cell
       compute_variance=True
   )

   # Check excluded cells
   print(f"Excluded {result.n_excluded_cells} cells")
   for info in result.excluded_cells_info:
       print(f"  {info}")

Integration with lwdid
~~~~~~~~~~~~~~~~~~~~~~

After aggregation, use the panel data with lwdid:

.. code-block:: python

   import lwdid
   from lwdid.preprocessing import aggregate_to_panel

   # Step 1: Aggregate repeated cross-sectional data
   agg_result = aggregate_to_panel(
       data=survey_data,
       unit_var='state',
       time_var='year',
       outcome_var='income',
       treatment_var='treated',
       gvar='first_treat_year'
   )

   # Step 2: Estimate treatment effects on aggregated panel
   results = lwdid.lwdid(
       data=agg_result.panel_data,
       y='income',
       ivar='state',
       tvar='year',
       gvar='first_treat_year',
       rolling='demean',
       estimator='ipwra'
   )

   print(results.summary())

Methodological Notes
--------------------

Weight Normalization
~~~~~~~~~~~~~~~~~~~~

Survey weights are normalized within each (unit, period) cell to sum to one:

.. math::

   w_{ist}^* = \frac{w_{ist}}{\sum_{j \in (s,t)} w_{jst}}

This ensures that each cell contributes equally to the estimation regardless
of sample size differences across cells.

Variance Computation
~~~~~~~~~~~~~~~~~~~~

When ``compute_variance=True``, the weighted variance within each cell is
computed for diagnostic purposes:

.. math::

   Var(\bar{Y}_{st}) = \frac{\sum_{i} w_{ist}^* (Y_{ist} - \bar{Y}_{st})^2}{1 - \sum_{i} (w_{ist}^*)^2}

This uses the Bessel correction adjusted for weights.

Treatment Consistency
~~~~~~~~~~~~~~~~~~~~~

The function validates that treatment status is consistent within each unit
across all time periods. If treatment varies within a unit, a warning is
issued and the modal treatment value is used.

Data Requirements
-----------------

Input data must satisfy:

1. **Unit identifier**: Column identifying the aggregation unit (e.g., state)
2. **Time identifier**: Column(s) identifying the time period
3. **Outcome variable**: Numeric column to be aggregated
4. **No duplicate observations**: Each row should represent a unique lower-level
   observation (e.g., individual in a specific state-year)

Optional inputs:

- **Survey weights**: For weighted aggregation
- **Treatment indicator**: Binary treatment status (validated for consistency)
- **Cohort variable**: First treatment period for staggered designs
- **Control variables**: Time-invariant covariates to include in output