Selection Diagnostics Module (selection_diagnostics)

Diagnostic tools for assessing potential selection bias in unbalanced panel data.

This module helps evaluate whether missing data patterns in unbalanced panels may compromise the validity of difference-in-differences estimation. The key assumption is that selection (missing data) may depend on unobserved time-invariant heterogeneity (which is removed by the rolling transformation), but cannot systematically depend on outcome shocks in the untreated state.

Overview

The selection mechanism assumption is analogous to the standard fixed effects assumption: units may be selected based on permanent characteristics, but not based on time-varying shocks. When this assumption holds, the rolling transformation removes selection bias along with unit fixed effects.
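The effect of demeaning on permanent unit characteristics can be illustrated with a small, self-contained sketch. It does not use lwdid; `demean_by_pre_period` and `make_unit` are hypothetical helpers, and the data are invented. Two units share the same time-varying path and the same missing period but differ in a permanent level, and the transformed outcomes come out identical.

```python
from statistics import mean


def demean_by_pre_period(outcomes, first_treat):
    """Subtract the unit's pre-treatment mean from every observed outcome.

    outcomes: dict mapping period -> observed outcome (missing periods absent).
    first_treat: first treated period g; periods t < g count as pre-treatment.
    """
    pre = [y for t, y in outcomes.items() if t < first_treat]
    if not pre:
        raise ValueError("demeaning needs at least one pre-treatment observation")
    pre_mean = mean(pre)
    return {t: y - pre_mean for t, y in outcomes.items()}


def make_unit(level, missing_periods):
    """Build a toy unit: permanent level plus a common time-varying path."""
    path = {1: 0.0, 2: 1.0, 3: 0.5, 4: 3.0}  # invented values
    return {t: level + s for t, s in path.items() if t not in missing_periods}


# Same path, same missing period (2), very different permanent levels.
unit_a = make_unit(level=10.0, missing_periods={2})
unit_b = make_unit(level=50.0, missing_periods={2})

transformed_a = demean_by_pre_period(unit_a, first_treat=4)
transformed_b = demean_by_pre_period(unit_b, first_treat=4)

# The permanent level drops out of the transformed outcomes.
print(transformed_a == transformed_b)
```

If missingness instead depended on the time-varying path itself, the pre-treatment mean would be computed from a selected set of periods, and this cancellation would no longer protect the comparison.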

This module provides:

Balance analysis: Assess how balanced the panel is across units and time
Attrition diagnostics: Identify patterns in unit dropout
Missing data classification: Classify missing patterns as MCAR, MAR, or MNAR
Risk assessment: Evaluate the overall risk of selection bias

Enums

class lwdid.selection_diagnostics.MissingPattern(value)[source]

Missing data pattern classification based on Rubin’s taxonomy.

Variables:

MCAR (str) – Missing Completely At Random - missingness is independent of all data, both observed and unobserved. This is the most benign pattern.
MAR (str) – Missing At Random - missingness depends only on observed data. Acceptable under the selection mechanism assumption when controls are included.
MNAR (str) – Missing Not At Random - missingness depends on unobserved data. This may violate the selection mechanism assumption if missingness depends on outcome shocks in the untreated state.
UNKNOWN (str) – Pattern could not be determined with available data.

Notes

The selection mechanism assumption requires that missingness may depend on unobserved time-invariant heterogeneity, but cannot systematically depend on time-varying outcome shocks.

MCAR and MAR patterns are generally acceptable. MNAR patterns may be acceptable if missingness depends only on time-invariant factors (which are removed by the rolling transformation), but problematic if missingness depends on time-varying outcome shocks.

MCAR = 'missing_completely_at_random'

MAR = 'missing_at_random'

MNAR = 'missing_not_at_random'

UNKNOWN = 'unknown'
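As a rough illustration of how the string values round-trip, the sketch below defines a local stand-in with the same member names and values as listed above. It is not the library class (whose exact base classes are not shown here), and `needs_closer_look` is a hypothetical helper that simply encodes the note that MCAR and MAR are generally acceptable.

```python
from enum import Enum


class MissingPattern(Enum):
    """Local stand-in mirroring the documented members and values."""

    MCAR = "missing_completely_at_random"
    MAR = "missing_at_random"
    MNAR = "missing_not_at_random"
    UNKNOWN = "unknown"


def needs_closer_look(pattern):
    """Return True for patterns not described as generally acceptable."""
    return pattern not in (MissingPattern.MCAR, MissingPattern.MAR)


# A stored string value (for example from a serialized report) maps back
# to the corresponding member.
restored = MissingPattern("missing_at_random")
print(restored.name, needs_closer_look(restored))
```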

class lwdid.selection_diagnostics.SelectionRisk(value)[source]

Risk level for selection bias in ATT estimation.

Variables:

LOW (str) – Low risk - selection mechanism assumption likely holds. Proceed with estimation.
MEDIUM (str) – Medium risk - some indicators suggest potential issues. Consider using detrending and sensitivity analysis.
HIGH (str) – High risk - strong evidence of problematic selection. Results should be interpreted with caution.
UNKNOWN (str) – Risk could not be assessed with available data.

Notes

Risk assessment is based on multiple factors:

Missing data pattern (risk increases from MCAR to MAR to MNAR)
Attrition rate (lower is better)
Differential attrition before/after treatment
Panel balance ratio

The rolling transformation removes unit-specific averages, so selection is allowed to depend on unobserved time-constant heterogeneity, similar to the standard fixed effects assumption.

LOW = 'low'

MEDIUM = 'medium'

HIGH = 'high'

UNKNOWN = 'unknown'
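One way to act on a risk level is a simple lookup from the level to the guidance given above. The sketch below uses a local stand-in enum with the documented values; `NEXT_STEP` and `next_step` are hypothetical names and not part of lwdid.

```python
from enum import Enum


class SelectionRisk(Enum):
    """Local stand-in mirroring the documented members and values."""

    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    UNKNOWN = "unknown"


# Guidance text paraphrased from the member descriptions above.
NEXT_STEP = {
    SelectionRisk.LOW: "Proceed with estimation.",
    SelectionRisk.MEDIUM: "Consider using detrending and sensitivity analysis.",
    SelectionRisk.HIGH: "Interpret results with caution.",
    SelectionRisk.UNKNOWN: "Risk could not be assessed with available data.",
}


def next_step(risk_value):
    """Map a risk value such as 'medium' to the documented guidance."""
    return NEXT_STEP[SelectionRisk(risk_value)]


print(next_step("medium"))
```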

Data Classes

class lwdid.selection_diagnostics.SelectionDiagnostics(missing_pattern, missing_pattern_confidence, selection_risk, attrition_analysis, balance_statistics, recommendations, warnings, missing_rate_overall, missing_rate_by_period, missing_rate_by_cohort, selection_tests=<factory>, unit_stats=<factory>)[source]

Complete selection mechanism diagnostics for unbalanced panels.

This class aggregates all diagnostic information about missing data patterns and potential selection bias in panel data for DiD estimation.

Variables:

missing_pattern (MissingPattern) – Classified missing data pattern (MCAR, MAR, MNAR, UNKNOWN).
missing_pattern_confidence (float) – Confidence level (0-1) in the pattern classification.
selection_risk (SelectionRisk) – Assessed risk level for selection bias.
attrition_analysis (AttritionAnalysis) – Detailed attrition pattern analysis.
balance_statistics (BalanceStatistics) – Panel balance statistics.
recommendations (list[str]) – Actionable recommendations based on diagnostics.
warnings (list[str]) – Warning messages about potential issues.
missing_rate_overall (float) – Overall missing rate across all unit-periods.
missing_rate_by_period (dict[int, float]) – Missing rate by time period.
missing_rate_by_cohort (dict[int, float]) – Missing rate by treatment cohort.
selection_tests (List[SelectionTestResult]) – Results of statistical tests for selection.
unit_stats (List[UnitMissingStats]) – Per-unit missing data statistics.

Notes

The selection mechanism assumption requires that selection may depend on unobserved time-invariant heterogeneity, but cannot systematically depend on time-varying outcome shocks.

This is analogous to the standard fixed effects assumption. The rolling transformation removes unit-specific averages (or trends), which eliminates bias from selection on time-invariant factors.

See also

diagnose_selection_mechanism: Function to create this diagnostics object.

missing_pattern: MissingPattern

missing_pattern_confidence: float

selection_risk: SelectionRisk

attrition_analysis: AttritionAnalysis

balance_statistics: BalanceStatistics

recommendations: list[str]

warnings: list[str]

missing_rate_overall: float

missing_rate_by_period: dict[int, float]

missing_rate_by_cohort: dict[int, float]

selection_tests: List[SelectionTestResult]

unit_stats: List[UnitMissingStats]

summary()[source]

Generate a human-readable summary of diagnostics.

Returns: Formatted summary string containing key diagnostic information, warnings, and recommendations.
Return type: str

to_dict()[source]

Convert diagnostics to dictionary format.

Returns: Dictionary containing all diagnostic information.
Return type: dict[str, Any]

class lwdid.selection_diagnostics.BalanceStatistics(is_balanced, n_units, n_periods, min_obs_per_unit, max_obs_per_unit, mean_obs_per_unit, std_obs_per_unit, balance_ratio, units_below_demean_threshold=0, units_below_detrend_threshold=0, pct_usable_demean=100.0, pct_usable_detrend=100.0)[source]

Panel balance statistics.

Variables:

is_balanced (bool) – True if all units have the same number of observations.
n_units (int) – Total number of unique units in the panel.
n_periods (int) – Total number of unique time periods in the panel.
min_obs_per_unit (int) – Minimum observations across all units.
max_obs_per_unit (int) – Maximum observations across all units.
mean_obs_per_unit (float) – Average observations per unit.
std_obs_per_unit (float) – Standard deviation of observations per unit.
balance_ratio (float) – Ratio of min to max observations (1.0 = perfectly balanced). Lower values indicate more severe imbalance.
units_below_demean_threshold (int) – Number of treated units with < 1 pre-treatment observation. These units cannot be used with demeaning.
units_below_detrend_threshold (int) – Number of treated units with < 2 pre-treatment observations. These units cannot be used with detrending.
pct_usable_demean (float) – Percentage of treated units usable for demeaning (0-100).
pct_usable_detrend (float) – Percentage of treated units usable for detrending (0-100).

Notes

For treatment cohort g in period r, the transformed outcome can only be computed if there are enough observed pre-treatment periods (t < g):

Demeaning requires at least one pre-treatment period to compute the mean.
Detrending requires at least two pre-treatment periods to estimate a linear trend.

Units with insufficient pre-treatment observations are excluded from the corresponding transformation method.

is_balanced: bool

n_units: int

n_periods: int

min_obs_per_unit: int

max_obs_per_unit: int

mean_obs_per_unit: float

std_obs_per_unit: float

balance_ratio: float

units_below_demean_threshold: int = 0

units_below_detrend_threshold: int = 0

pct_usable_demean: float = 100.0

pct_usable_detrend: float = 100.0
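The balance ratio and the two usability thresholds can be reproduced by hand on a toy panel. The sketch below is a plain-Python approximation under the definitions given above, not the library's implementation; `balance_summary` and the input layout (a dict of observed periods per unit) are assumptions made for illustration.

```python
from statistics import mean


def balance_summary(observed, first_treat):
    """Compute a subset of the balance statistics described above.

    observed: dict mapping unit -> set of observed periods.
    first_treat: dict mapping treated unit -> first treated period g.
    """
    counts = [len(periods) for periods in observed.values()]
    # Pre-treatment observations are observed periods with t < g.
    n_pre = {
        unit: sum(1 for t in observed[unit] if t < g)
        for unit, g in first_treat.items()
    }
    n_treated = len(n_pre)
    below_demean = sum(1 for n in n_pre.values() if n < 1)
    below_detrend = sum(1 for n in n_pre.values() if n < 2)

    def pct_usable(n_below):
        if n_treated == 0:
            return 100.0
        return 100.0 * (n_treated - n_below) / n_treated

    return {
        "is_balanced": min(counts) == max(counts),
        "min_obs_per_unit": min(counts),
        "max_obs_per_unit": max(counts),
        "mean_obs_per_unit": mean(counts),
        "balance_ratio": min(counts) / max(counts),
        "units_below_demean_threshold": below_demean,
        "units_below_detrend_threshold": below_detrend,
        "pct_usable_demean": pct_usable(below_demean),
        "pct_usable_detrend": pct_usable(below_detrend),
    }


# Invented panel over periods 1..4; units a, b, c are treated from period 3.
observed = {
    "a": {1, 2, 3, 4},  # two pre-treatment observations
    "b": {2, 3, 4},     # one pre-treatment observation
    "c": {3, 4},        # no pre-treatment observation
    "d": {1, 2, 3, 4},  # never treated
}
summary = balance_summary(observed, first_treat={"a": 3, "b": 3, "c": 3})
print(summary)
```

In this toy panel, unit c cannot be demeaned, and units b and c cannot be detrended.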

class lwdid.selection_diagnostics.AttritionAnalysis(n_units_complete, n_units_partial, attrition_rate, attrition_by_cohort=<factory>, attrition_by_period=<factory>, early_dropout_rate=0.0, late_entry_rate=0.0, dropout_before_treatment=0, dropout_after_treatment=0)[source]

Analysis of unit dropout patterns in panel data.

Variables:

n_units_complete (int) – Number of units with complete observations across all periods.
n_units_partial (int) – Number of units with at least one missing period.
attrition_rate (float) – Proportion of units with incomplete observations (n_partial / n_total).
attrition_by_cohort (dict[int, float]) – Attrition rate by treatment cohort. Keys are cohort identifiers, values are attrition rates within each cohort.
attrition_by_period (dict[int, float]) – Cumulative attrition rate by time period. Shows the proportion of units not observed at each time point.
early_dropout_rate (float) – Rate of units that exit before the final period (last_obs < T_max).
late_entry_rate (float) – Rate of units that enter after the first period (first_obs > T_min).
dropout_before_treatment (int) – Number of treated units that drop out before their treatment period. High values may indicate anticipation effects.
dropout_after_treatment (int) – Number of treated units that drop out after treatment starts. High values may indicate treatment-induced attrition.

Notes

Differential attrition patterns (e.g., more dropout after treatment than before) may indicate selection related to treatment effects, which would violate the selection mechanism assumption.

n_units_complete: int

n_units_partial: int

attrition_rate: float

attrition_by_cohort: dict[int, float]

attrition_by_period: dict[int, float]

early_dropout_rate: float = 0.0

late_entry_rate: float = 0.0

dropout_before_treatment: int = 0

dropout_after_treatment: int = 0
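The attrition quantities above can be sketched on a toy panel as follows. This is an approximation based on one reading of the field descriptions, not the library's code: `attrition_summary` is a hypothetical helper, and the before/after split assumes a unit "drops out" when its last observed period falls before the final panel period.

```python
def attrition_summary(observed, periods, first_treat):
    """Compute a subset of the attrition quantities described above.

    observed: dict mapping unit -> set of observed periods (non-empty).
    periods: iterable of all panel periods.
    first_treat: dict mapping treated unit -> first treated period g.
    """
    all_periods = set(periods)
    t_min, t_max = min(all_periods), max(all_periods)
    n_total = len(observed)
    n_complete = sum(1 for p in observed.values() if set(p) == all_periods)
    n_partial = n_total - n_complete

    early = sum(1 for p in observed.values() if max(p) < t_max)
    late = sum(1 for p in observed.values() if min(p) > t_min)

    # Assumed reading: compare each treated unit's last observed period
    # with its first treated period.
    before = sum(1 for u, g in first_treat.items() if max(observed[u]) < g)
    after = sum(
        1 for u, g in first_treat.items() if g <= max(observed[u]) < t_max
    )
    return {
        "n_units_complete": n_complete,
        "n_units_partial": n_partial,
        "attrition_rate": n_partial / n_total,
        "early_dropout_rate": early / n_total,
        "late_entry_rate": late / n_total,
        "dropout_before_treatment": before,
        "dropout_after_treatment": after,
    }


# Invented panel over periods 1..4.
observed = {
    "a": {1, 2, 3, 4},  # treated at 3, complete
    "b": {1, 2},        # treated at 3, exits before treatment
    "c": {1, 2, 3},     # treated at 2, exits after treatment starts
    "d": {2, 3, 4},     # never treated, late entry
    "e": {1, 2, 3, 4},  # never treated, complete
}
attrition = attrition_summary(
    observed, periods=range(1, 5), first_treat={"a": 3, "b": 3, "c": 2}
)
print(attrition)
```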

class lwdid.selection_diagnostics.UnitMissingStats(unit_id, cohort, is_treated, n_total_periods, n_observed, n_missing, missing_rate, first_observed, last_observed, observation_span, n_pre_treatment=None, n_post_treatment=None, pre_treatment_missing_rate=None, post_treatment_missing_rate=None, can_use_demean=True, can_use_detrend=True, reason_if_excluded=None)[source]

Missing data statistics for a single unit.

Variables:

unit_id (Any) – Unit identifier.
cohort (int | None) – Treatment cohort (None for never-treated units).
is_treated (bool) – Whether the unit is ever treated.
n_total_periods (int) – Total periods in the panel.
n_observed (int) – Number of observed periods for this unit.
n_missing (int) – Number of missing periods for this unit.
missing_rate (float) – Proportion of missing periods (n_missing / n_total_periods).
first_observed (int) – First period with observation.
last_observed (int) – Last period with observation.
observation_span (int) – Span from first to last observation (last - first + 1).
n_pre_treatment (int | None) – Pre-treatment observations (treated units only).
n_post_treatment (int | None) – Post-treatment observations (treated units only).
pre_treatment_missing_rate (float | None) – Missing rate in pre-treatment period.
post_treatment_missing_rate (float | None) – Missing rate in post-treatment period.
can_use_demean (bool) – Whether unit has sufficient data for demeaning (≥1 pre-treatment obs).
can_use_detrend (bool) – Whether unit has sufficient data for detrending (≥2 pre-treatment obs).
reason_if_excluded (str | None) – Reason for exclusion if unit cannot be used.

unit_id: Any

cohort: int | None

is_treated: bool

n_total_periods: int

n_observed: int

n_missing: int

missing_rate: float

first_observed: int

last_observed: int

observation_span: int

n_pre_treatment: int | None = None

n_post_treatment: int | None = None

pre_treatment_missing_rate: float | None = None

post_treatment_missing_rate: float | None = None

can_use_demean: bool = True

can_use_detrend: bool = True

reason_if_excluded: str | None = None
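To make the per-unit fields concrete, the following sketch derives most of them for a single unit from its observed periods. It is illustrative only: `unit_missing_stats` is a hypothetical helper (distinct from `get_unit_missing_stats`), periods are assumed to be consecutive integers, and the `reason_if_excluded` strings are invented.

```python
def unit_missing_stats(unit_id, observed_periods, all_periods, cohort=None):
    """Derive per-unit missing-data fields from the periods a unit is observed.

    cohort is the first treated period, or None for a never-treated unit.
    """
    periods = sorted(set(all_periods))
    observed = sorted(set(observed_periods) & set(periods))
    if not observed:
        raise ValueError("unit has no observed periods")

    n_total, n_observed = len(periods), len(observed)
    stats = {
        "unit_id": unit_id,
        "cohort": cohort,
        "is_treated": cohort is not None,
        "n_total_periods": n_total,
        "n_observed": n_observed,
        "n_missing": n_total - n_observed,
        "missing_rate": (n_total - n_observed) / n_total,
        "first_observed": observed[0],
        "last_observed": observed[-1],
        "observation_span": observed[-1] - observed[0] + 1,
        "n_pre_treatment": None,
        "n_post_treatment": None,
        "can_use_demean": True,
        "can_use_detrend": True,
        "reason_if_excluded": None,
    }
    if cohort is not None:
        n_pre = sum(1 for t in observed if t < cohort)
        stats["n_pre_treatment"] = n_pre
        stats["n_post_treatment"] = n_observed - n_pre
        stats["can_use_demean"] = n_pre >= 1
        stats["can_use_detrend"] = n_pre >= 2
        if n_pre < 1:
            stats["reason_if_excluded"] = "no pre-treatment observations"
        elif n_pre < 2:
            stats["reason_if_excluded"] = "fewer than 2 pre-treatment observations"
    return stats


panel_periods = range(2001, 2007)  # 2001..2006, invented
full = unit_missing_stats("u1", {2002, 2003, 2005, 2006}, panel_periods, cohort=2004)
thin = unit_missing_stats("u2", {2003, 2004}, panel_periods, cohort=2004)
print(full["missing_rate"], thin["reason_if_excluded"])
```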

class lwdid.selection_diagnostics.SelectionTestResult(test_name, statistic, pvalue, reject_null, interpretation, details=<factory>)[source]

Result of a statistical test for selection mechanism.

Variables:

test_name (str) – Name of the statistical test performed.
statistic (float) – Test statistic value.
pvalue (float) – P-value of the test.
reject_null (bool) – Whether to reject the null hypothesis at alpha=0.05.
interpretation (str) – Human-readable interpretation of the test result.
details (dict[str, Any]) – Additional test-specific details (e.g., means, correlations).

Notes

Common tests include:

Little’s MCAR Test: Tests if data is missing completely at random
Selection on Observables: Tests if missingness depends on controls
Lagged Outcome Test: Tests if missingness depends on past outcomes

test_name: str

statistic: float

pvalue: float

reject_null: bool

interpretation: str

details: dict[str, Any]
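The idea behind a lagged outcome test can be sketched with the standard library alone. The example below compares the previous-period outcome of units that go missing next period with that of units that stay, using a two-sample z statistic with a normal approximation. This is a simplified stand-in, not the test lwdid runs; `CheckResult`, `lagged_outcome_check`, and all data are hypothetical.

```python
from dataclasses import dataclass, field
from math import sqrt
from statistics import NormalDist, mean, variance


@dataclass
class CheckResult:
    """Local stand-in with the same fields as the documented result class."""

    test_name: str
    statistic: float
    pvalue: float
    reject_null: bool
    interpretation: str
    details: dict = field(default_factory=dict)


def lagged_outcome_check(lagged_y_stayers, lagged_y_leavers, alpha=0.05):
    """Compare lagged outcomes of leavers and stayers (normal approximation)."""
    m_leave, m_stay = mean(lagged_y_leavers), mean(lagged_y_stayers)
    se = sqrt(
        variance(lagged_y_leavers) / len(lagged_y_leavers)
        + variance(lagged_y_stayers) / len(lagged_y_stayers)
    )
    z = (m_leave - m_stay) / se
    pvalue = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    reject = pvalue < alpha
    text = (
        "Missingness appears related to past outcomes."
        if reject
        else "No evidence that missingness depends on past outcomes."
    )
    return CheckResult(
        test_name="Lagged outcome check (sketch)",
        statistic=z,
        pvalue=pvalue,
        reject_null=reject,
        interpretation=text,
        details={"mean_leavers": m_leave, "mean_stayers": m_stay},
    )


# Invented data: leavers had much lower outcomes in the previous period.
related = lagged_outcome_check(
    lagged_y_stayers=[1.0, 1.2, 0.9, 1.1, 1.0, 0.8],
    lagged_y_leavers=[0.2, 0.1, 0.4, 0.3],
)
# Invented data: both groups look alike.
unrelated = lagged_outcome_check(
    lagged_y_stayers=[1.0, 1.2, 0.9, 1.1],
    lagged_y_leavers=[1.1, 0.9, 1.0, 1.2],
)
print(related.reject_null, unrelated.reject_null)
```

With samples this small a normal approximation is crude; the sketch is meant to show the shape of the result object and the alpha=0.05 decision rule, not to recommend the statistic.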

Main Functions

lwdid.selection_diagnostics.diagnose_selection_mechanism(data, y, ivar, tvar, gvar=None, controls=None, never_treated_values=None, verbose=True)[source]

Diagnose potential selection mechanism violations in unbalanced panels.

This function implements diagnostic procedures to assess whether the selection mechanism assumption is likely to hold. The key assumption is that selection (missing data) may depend on time-invariant heterogeneity but not on time-varying outcome shocks.

Parameters:

data (pd.DataFrame) – Panel data in long format.
y (str) – Outcome variable column name.
ivar (str) – Unit identifier column name.
tvar (str) – Time variable column name.
gvar (str, optional) – Cohort variable for staggered designs. If None, assumes common timing and skips cohort-specific diagnostics.
controls (list of str, optional) – Control variable column names for additional diagnostics.
never_treated_values (list, optional) – Values in gvar indicating never-treated units. Default: [0, np.inf].
verbose (bool, default True) – Whether to print diagnostic summary.

Returns:

Comprehensive diagnostic results including:

missing_pattern: Classified pattern (MCAR, MAR, MNAR, or UNKNOWN)
selection_risk: Risk level for selection bias
attrition_analysis: Detailed attrition patterns
balance_statistics: Panel balance metrics
recommendations: Actionable suggestions
selection_tests: Statistical test results

Return type:

SelectionDiagnostics

Notes

The function performs several diagnostic procedures:

Balance Statistics: Computes panel balance metrics and identifies units that cannot be used for demeaning (< 1 pre-period) or detrending (< 2 pre-periods).
Attrition Analysis: Analyzes dropout patterns by cohort and time, distinguishing dropout before treatment from dropout after treatment starts.
Missing Pattern Classification: Uses Little’s MCAR test and auxiliary regressions to classify the missing data mechanism.
Selection Risk Assessment: Combines multiple indicators to assess the overall risk of selection bias.

The selection mechanism assumption requires that selection may depend on unobserved time-constant heterogeneity (which is removed by the rolling transformation, similar to the fixed effects estimator), but cannot systematically depend on time-varying outcome shocks.

See also

plot_missing_pattern: Visualize missing data patterns.
get_unit_missing_stats: Get per-unit missing statistics as DataFrame.

lwdid.selection_diagnostics.get_unit_missing_stats(data, y, ivar, tvar, gvar=None, never_treated_values=None)[source]

Compute per-unit missing data statistics as a DataFrame.

Parameters:

data (pd.DataFrame) – Panel data in long format.
y (str) – Outcome variable column name.
ivar (str) – Unit identifier column name.
tvar (str) – Time variable column name.
gvar (str, optional) – Cohort variable for staggered designs.
never_treated_values (list, optional) – Values indicating never-treated units. Default: [0, np.inf].

Returns:

DataFrame with one row per unit containing:

unit_id: Unit identifier
cohort: Treatment cohort (NaN for never-treated)
is_treated: Whether unit is ever treated
n_observed: Number of observed periods
n_missing: Number of missing periods
missing_rate: Proportion missing
n_pre_treatment: Pre-treatment observations
n_post_treatment: Post-treatment observations
can_use_demean: Sufficient data for demeaning
can_use_detrend: Sufficient data for detrending

Return type:

pd.DataFrame

See also

diagnose_selection_mechanism: Comprehensive diagnostics.

lwdid.selection_diagnostics.plot_missing_pattern(data, ivar, tvar, y=None, gvar=None, sort_by='cohort', figsize=(12, 8), cmap='RdYlGn', show_cohort_lines=True, never_treated_values=None, max_units=200, ax=None)[source]

Visualize missing data patterns in panel data.

Creates a heatmap showing observation availability across units and time. Units can be sorted by cohort, missing rate, or unit ID.

Parameters:

data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
tvar (str) – Time variable column name.
y (str, optional) – Outcome variable. If provided, checks for missing Y values. If None, checks for missing rows.
gvar (str, optional) – Cohort variable. If provided, shows treatment timing.
sort_by (str, default 'cohort') – How to sort units: 'cohort', 'missing_rate', or 'unit_id'.
figsize (tuple, default (12, 8)) – Figure size in inches.
cmap (str, default 'RdYlGn') – Colormap name for the heatmap. Not used: custom colors are applied instead.
show_cohort_lines (bool, default True) – Whether to show treatment timing lines.
never_treated_values (list, optional) – Values indicating never-treated units. Default: [0, np.inf].
max_units (int, default 200) – Maximum number of units to display. If more units exist, a random sample is shown.
ax (matplotlib.axes.Axes, optional) – Axes to plot on. If None, creates new figure.

Returns:

Figure containing the missing pattern heatmap.

Return type:

matplotlib.figure.Figure

Notes

The heatmap uses the following color coding:

Green: Observed (Y value present)
Red: Missing (Y value missing or row absent)
Black line: Treatment timing (if gvar provided)
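The observed/missing grid behind such a heatmap can be sketched without any plotting library. The helper below is hypothetical (`availability_grid` is not part of lwdid) and treats a cell as missing when the row is absent or the outcome is None or NaN, matching the color coding described above.

```python
def availability_grid(rows, units, periods):
    """Build a units x periods grid of 1 (observed) and 0 (missing).

    rows: iterable of (unit, period, y) tuples in long format.
    A cell counts as missing if the row is absent or y is None or NaN.
    """
    observed = {
        (unit, period)
        for unit, period, y in rows
        if y is not None and y == y  # y != y only for NaN
    }
    return [[1 if (u, t) in observed else 0 for t in periods] for u in units]


# Invented long-format data.
rows = [
    ("a", 1, 1.0), ("a", 2, 2.0), ("a", 3, 3.0),
    ("b", 1, 0.5), ("b", 3, None),          # period 2 row absent, period 3 y missing
    ("c", 2, float("nan")), ("c", 3, 1.5),  # period 1 row absent, period 2 y is NaN
]
grid = availability_grid(rows, units=["a", "b", "c"], periods=[1, 2, 3])

# Text rendering: '#' for observed (green), '.' for missing (red).
for unit, cells in zip(["a", "b", "c"], grid):
    print(unit, "".join("#" if c else "." for c in cells))
```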

See also

diagnose_selection_mechanism: Comprehensive diagnostics.

Example Usage

from lwdid import diagnose_selection_mechanism, get_unit_missing_stats

# panel_data is a long-format pd.DataFrame; column names below are placeholders

# Run comprehensive selection diagnostics (y is a required argument)
diagnostics = diagnose_selection_mechanism(
    data=panel_data,
    y='outcome',
    ivar='unit',
    tvar='year',
    gvar='first_treat'
)

# Check risk level
print(f"Selection risk: {diagnostics.selection_risk}")
print(f"Missing pattern: {diagnostics.missing_pattern}")

# Get per-unit statistics (y is a required argument)
unit_stats = get_unit_missing_stats(
    data=panel_data,
    y='outcome',
    ivar='unit',
    tvar='year'
)

# Visualize missing patterns (returns a single matplotlib Figure)
from lwdid import plot_missing_pattern
fig = plot_missing_pattern(
    data=panel_data,
    ivar='unit',
    tvar='year'
)

Interpretation Guide

Risk Levels:

LOW: Selection mechanism assumption likely holds. Proceed with estimation.
MEDIUM: Some indicators suggest potential issues. Consider using detrending and sensitivity analysis.
HIGH: Strong evidence of problematic selection. Results should be interpreted with caution.

Missing Patterns:

MCAR: Missing Completely At Random. Most benign pattern, no bias expected.
MAR: Missing At Random. Acceptable when controls are included.
MNAR: Missing Not At Random. May violate selection mechanism assumption if missingness depends on outcome shocks.
