Selection Diagnostics Module (selection_diagnostics)

Diagnostic tools for assessing potential selection bias in unbalanced panel data.

This module helps evaluate whether missing data patterns in unbalanced panels may compromise the validity of difference-in-differences estimation. The key assumption is that selection (missing data) may depend on unobserved time-invariant heterogeneity (which is removed by the rolling transformation), but cannot systematically depend on outcome shocks in the untreated state.

Overview

The selection mechanism assumption is analogous to the standard fixed effects assumption: units may be selected based on permanent characteristics, but not based on time-varying shocks. When this assumption holds, the rolling transformation removes selection bias along with unit fixed effects.
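
To see why selection on permanent characteristics is harmless, consider a minimal sketch in pure Python. The numbers are made up and `demean_by_pre_mean` is a hypothetical helper written for this illustration, not the library's implementation. Two units share the same time path but differ in a permanent level; subtracting each unit's pre-treatment mean removes that level regardless of its size.

```python
from statistics import mean

def demean_by_pre_mean(outcomes, first_treat):
    """Subtract the mean of observed pre-treatment outcomes (t < first_treat).

    `outcomes` maps period -> observed outcome; missing periods are simply absent.
    Hypothetical helper for illustration only.
    """
    pre = [y for t, y in outcomes.items() if t < first_treat]
    if not pre:
        raise ValueError("demeaning needs at least one pre-treatment observation")
    pre_mean = mean(pre)
    return {t: y - pre_mean for t, y in outcomes.items() if t >= first_treat}

# Two units share the same time path but differ in a permanent level (fixed effect).
time_path = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}
unit_a = {t: 10.0 + v for t, v in time_path.items()}  # fixed effect 10
unit_b = {t: 50.0 + v for t, v in time_path.items()}  # fixed effect 50

# Both units are observed in the same periods here; only the permanent level differs.
print(demean_by_pre_mean(unit_a, first_treat=3))
print(demean_by_pre_mean(unit_b, first_treat=3))
```

Both calls yield the same transformed outcomes, which is the sense in which selection that depends only on the permanent level cannot bias the comparison.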

This module provides:

  • Balance analysis: Assess how balanced the panel is across units and time

  • Attrition diagnostics: Identify patterns in unit dropout

  • Missing data classification: Classify missing patterns as MCAR, MAR, or MNAR

  • Risk assessment: Evaluate the overall risk of selection bias

Enums

class lwdid.selection_diagnostics.MissingPattern(value)[source]

Missing data pattern classification based on Rubin’s taxonomy.

Variables:
  • MCAR (str) – Missing Completely At Random - missingness is independent of all data, both observed and unobserved. This is the most benign pattern.

  • MAR (str) – Missing At Random - missingness depends only on observed data. Acceptable under the selection mechanism assumption when controls are included.

  • MNAR (str) – Missing Not At Random - missingness depends on unobserved data. This may violate the selection mechanism assumption if missingness depends on outcome shocks in the untreated state.

  • UNKNOWN (str) – Pattern could not be determined with available data.

Notes

The selection mechanism assumption requires that missingness may depend on unobserved time-invariant heterogeneity, but cannot systematically depend on time-varying outcome shocks.

MCAR and MAR patterns are generally acceptable. MNAR patterns may be acceptable if missingness depends only on time-invariant factors (which are removed by the rolling transformation), but problematic if missingness depends on time-varying outcome shocks.

MCAR = 'missing_completely_at_random'
MAR = 'missing_at_random'
MNAR = 'missing_not_at_random'
UNKNOWN = 'unknown'
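
As a small illustration of working with these members, the sketch below declares a local stand-in enum that mirrors the documented names and string values. It is not imported from lwdid, and whether the real class derives from a plain `Enum` is an assumption.

```python
from enum import Enum

class MissingPattern(Enum):
    """Local stand-in mirroring the documented members (not imported from lwdid)."""
    MCAR = 'missing_completely_at_random'
    MAR = 'missing_at_random'
    MNAR = 'missing_not_at_random'
    UNKNOWN = 'unknown'

# Look a member up by its string value, e.g. after reading it from a saved report.
pattern = MissingPattern('missing_at_random')

# Flag the patterns that the notes above single out for closer inspection.
needs_closer_look = pattern in (MissingPattern.MNAR, MissingPattern.UNKNOWN)
print(pattern.name, needs_closer_look)
```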

class lwdid.selection_diagnostics.SelectionRisk(value)[source]

Risk level for selection bias in ATT estimation.

Variables:
  • LOW (str) – Low risk - selection mechanism assumption likely holds. Proceed with estimation.

  • MEDIUM (str) – Medium risk - some indicators suggest potential issues. Consider using detrending and sensitivity analysis.

  • HIGH (str) – High risk - strong evidence of problematic selection. Results should be interpreted with caution.

  • UNKNOWN (str) – Risk could not be assessed with available data.

Notes

Risk assessment is based on multiple factors:

  • Missing data pattern (risk increases from MCAR to MAR to MNAR)

  • Attrition rate (lower is better)

  • Differential attrition before/after treatment

  • Panel balance ratio

The rolling transformation removes unit-specific averages, so selection is allowed to depend on unobserved time-constant heterogeneity, similar to the standard fixed effects assumption.

LOW = 'low'
MEDIUM = 'medium'
HIGH = 'high'
UNKNOWN = 'unknown'

Data Classes

class lwdid.selection_diagnostics.SelectionDiagnostics(missing_pattern, missing_pattern_confidence, selection_risk, attrition_analysis, balance_statistics, recommendations, warnings, missing_rate_overall, missing_rate_by_period, missing_rate_by_cohort, selection_tests=<factory>, unit_stats=<factory>)[source]

Complete selection mechanism diagnostics for unbalanced panels.

This class aggregates all diagnostic information about missing data patterns and potential selection bias in panel data for DiD estimation.

Variables:
  • missing_pattern (MissingPattern) – Classified missing data pattern (MCAR, MAR, MNAR, UNKNOWN).

  • missing_pattern_confidence (float) – Confidence level (0-1) in the pattern classification.

  • selection_risk (SelectionRisk) – Assessed risk level for selection bias.

  • attrition_analysis (AttritionAnalysis) – Detailed attrition pattern analysis.

  • balance_statistics (BalanceStatistics) – Panel balance statistics.

  • recommendations (list[str]) – Actionable recommendations based on diagnostics.

  • warnings (list[str]) – Warning messages about potential issues.

  • missing_rate_overall (float) – Overall missing rate across all unit-periods.

  • missing_rate_by_period (dict[int, float]) – Missing rate by time period.

  • missing_rate_by_cohort (dict[int, float]) – Missing rate by treatment cohort.

  • selection_tests (List[SelectionTestResult]) – Results of statistical tests for selection.

  • unit_stats (List[UnitMissingStats]) – Per-unit missing data statistics.

Notes

The selection mechanism assumption requires that selection may depend on unobserved time-invariant heterogeneity, but cannot systematically depend on time-varying outcome shocks.

This is analogous to the standard fixed effects assumption. The rolling transformation removes unit-specific averages (or trends), which eliminates bias from selection on time-invariant factors.

See also

diagnose_selection_mechanism

Function to create this diagnostics object.

missing_pattern: MissingPattern
missing_pattern_confidence: float
selection_risk: SelectionRisk
attrition_analysis: AttritionAnalysis
balance_statistics: BalanceStatistics
recommendations: list[str]
warnings: list[str]
missing_rate_overall: float
missing_rate_by_period: dict[int, float]
missing_rate_by_cohort: dict[int, float]
selection_tests: List[SelectionTestResult]
unit_stats: List[UnitMissingStats]

summary()[source]

Generate a human-readable summary of diagnostics.

Returns:

Formatted summary string containing key diagnostic information, warnings, and recommendations.

Return type:

str

to_dict()[source]

Convert diagnostics to dictionary format.

Returns:

Dictionary containing all diagnostic information.

Return type:

dict[str, Any]

class lwdid.selection_diagnostics.BalanceStatistics(is_balanced, n_units, n_periods, min_obs_per_unit, max_obs_per_unit, mean_obs_per_unit, std_obs_per_unit, balance_ratio, units_below_demean_threshold=0, units_below_detrend_threshold=0, pct_usable_demean=100.0, pct_usable_detrend=100.0)[source]

Panel balance statistics.

Variables:
  • is_balanced (bool) – True if all units have the same number of observations.

  • n_units (int) – Total number of unique units in the panel.

  • n_periods (int) – Total number of unique time periods in the panel.

  • min_obs_per_unit (int) – Minimum observations across all units.

  • max_obs_per_unit (int) – Maximum observations across all units.

  • mean_obs_per_unit (float) – Average observations per unit.

  • std_obs_per_unit (float) – Standard deviation of observations per unit.

  • balance_ratio (float) – Ratio of min to max observations (1.0 = perfectly balanced). Lower values indicate more severe imbalance.

  • units_below_demean_threshold (int) – Number of treated units with < 1 pre-treatment observation. These units cannot be used with demeaning.

  • units_below_detrend_threshold (int) – Number of treated units with < 2 pre-treatment observations. These units cannot be used with detrending.

  • pct_usable_demean (float) – Percentage of treated units usable for demeaning (0-100).

  • pct_usable_detrend (float) – Percentage of treated units usable for detrending (0-100).

Notes

For treatment cohort g in period r, the transformed outcome can only be computed if there are enough observed pre-treatment periods (t < g):

  • Demeaning requires at least one pre-treatment period to compute the mean.

  • Detrending requires at least two pre-treatment periods to estimate a linear trend.

Units with insufficient pre-treatment observations are excluded from the corresponding transformation method.
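
The balance ratio and the two usability thresholds can be sketched from each unit's set of observed periods. The toy panel, the `first_treat` mapping, and all variable names below are hypothetical; this is an illustration of the definitions above, not the library's code.

```python
from statistics import mean

# Hypothetical toy panel: unit -> set of periods with an observed outcome.
observed = {
    "a": {1, 2, 3, 4, 5},
    "b": {1, 2, 3, 4, 5},
    "c": {3, 4, 5},   # late entry
    "d": {2, 4},      # gaps
}
# Hypothetical first treatment period per treated unit ("b" is never treated).
first_treat = {"a": 4, "c": 4, "d": 2}

counts = [len(periods) for periods in observed.values()]
is_balanced = len(set(counts)) == 1
balance_ratio = min(counts) / max(counts)  # 1.0 means perfectly balanced
mean_obs_per_unit = mean(counts)

# Count observed pre-treatment periods (t < g) for each treated unit.
n_pre = {u: sum(t < g for t in observed[u]) for u, g in first_treat.items()}
below_demean = sum(n < 1 for n in n_pre.values())   # no pre-period: cannot demean
below_detrend = sum(n < 2 for n in n_pre.values())  # fewer than two: cannot detrend
pct_usable_demean = 100.0 * (1 - below_demean / len(n_pre))
pct_usable_detrend = 100.0 * (1 - below_detrend / len(n_pre))

print(is_balanced, balance_ratio, n_pre, below_demean, below_detrend)
```

Here unit "d" has no observed period before its treatment date and so cannot be demeaned, while "c" has one pre-period and can be demeaned but not detrended.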

is_balanced: bool
n_units: int
n_periods: int
min_obs_per_unit: int
max_obs_per_unit: int
mean_obs_per_unit: float
std_obs_per_unit: float
balance_ratio: float
units_below_demean_threshold: int = 0
units_below_detrend_threshold: int = 0
pct_usable_demean: float = 100.0
pct_usable_detrend: float = 100.0

class lwdid.selection_diagnostics.AttritionAnalysis(n_units_complete, n_units_partial, attrition_rate, attrition_by_cohort=<factory>, attrition_by_period=<factory>, early_dropout_rate=0.0, late_entry_rate=0.0, dropout_before_treatment=0, dropout_after_treatment=0)[source]

Analysis of unit dropout patterns in panel data.

Variables:
  • n_units_complete (int) – Number of units with complete observations across all periods.

  • n_units_partial (int) – Number of units with at least one missing period.

  • attrition_rate (float) – Proportion of units with incomplete observations (n_partial / n_total).

  • attrition_by_cohort (dict[int, float]) – Attrition rate by treatment cohort. Keys are cohort identifiers, values are attrition rates within each cohort.

  • attrition_by_period (dict[int, float]) – Cumulative attrition rate by time period. Shows the proportion of units not observed at each time point.

  • early_dropout_rate (float) – Rate of units that exit before the final period (last_obs < T_max).

  • late_entry_rate (float) – Rate of units that enter after the first period (first_obs > T_min).

  • dropout_before_treatment (int) – Number of treated units that drop out before their treatment period. High values may indicate anticipation effects.

  • dropout_after_treatment (int) – Number of treated units that drop out after treatment starts. High values may indicate treatment-induced attrition.

Notes

Differential attrition patterns (e.g., more dropout after treatment than before) may indicate selection related to treatment effects, which would violate the selection mechanism assumption.
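
The attrition quantities listed above can be sketched on a toy panel as follows. The data, the variable names, and in particular the exact rules used for `dropout_before` and `dropout_after` are assumptions representing one plausible reading of the field descriptions, not the library's implementation.

```python
# Hypothetical toy panel: unit -> list of observed periods; periods run 1..5.
observed = {
    "a": [1, 2, 3, 4, 5],  # complete
    "b": [1, 2, 3],        # exits early
    "c": [3, 4, 5],        # enters late
    "d": [1, 2, 4, 5],     # interior gap only
    "e": [1, 2, 3, 4],     # exits early
}
all_periods = sorted({t for periods in observed.values() for t in periods})
t_min, t_max = all_periods[0], all_periods[-1]

n_total = len(observed)
n_complete = sum(len(p) == len(all_periods) for p in observed.values())
n_partial = n_total - n_complete
attrition_rate = n_partial / n_total

# Units whose last observation precedes T_max, and whose first follows T_min.
early_dropout_rate = sum(max(p) < t_max for p in observed.values()) / n_total
late_entry_rate = sum(min(p) > t_min for p in observed.values()) / n_total

# Share of units not observed in each period.
missing_by_period = {
    t: sum(t not in p for p in observed.values()) / n_total for t in all_periods
}

# Hypothetical treatment dates; assumed rule: compare last observation with g.
first_treat = {"b": 4, "e": 3}
dropout_before = sum(max(observed[u]) < g for u, g in first_treat.items())
dropout_after = sum(g <= max(observed[u]) < t_max for u, g in first_treat.items())

print(attrition_rate, early_dropout_rate, late_entry_rate)
print(missing_by_period, dropout_before, dropout_after)
```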

n_units_complete: int
n_units_partial: int
attrition_rate: float
attrition_by_cohort: dict[int, float]
attrition_by_period: dict[int, float]
early_dropout_rate: float = 0.0
late_entry_rate: float = 0.0
dropout_before_treatment: int = 0
dropout_after_treatment: int = 0

class lwdid.selection_diagnostics.UnitMissingStats(unit_id, cohort, is_treated, n_total_periods, n_observed, n_missing, missing_rate, first_observed, last_observed, observation_span, n_pre_treatment=None, n_post_treatment=None, pre_treatment_missing_rate=None, post_treatment_missing_rate=None, can_use_demean=True, can_use_detrend=True, reason_if_excluded=None)[source]

Missing data statistics for a single unit.

Variables:
  • unit_id (Any) – Unit identifier.

  • cohort (int | None) – Treatment cohort (None for never-treated units).

  • is_treated (bool) – Whether the unit is ever treated.

  • n_total_periods (int) – Total periods in the panel.

  • n_observed (int) – Number of observed periods for this unit.

  • n_missing (int) – Number of missing periods for this unit.

  • missing_rate (float) – Proportion of missing periods (n_missing / n_total_periods).

  • first_observed (int) – First period with observation.

  • last_observed (int) – Last period with observation.

  • observation_span (int) – Span from first to last observation (last - first + 1).

  • n_pre_treatment (int | None) – Pre-treatment observations (treated units only).

  • n_post_treatment (int | None) – Post-treatment observations (treated units only).

  • pre_treatment_missing_rate (float | None) – Missing rate in pre-treatment period.

  • post_treatment_missing_rate (float | None) – Missing rate in post-treatment period.

  • can_use_demean (bool) – Whether unit has sufficient data for demeaning (≥1 pre-treatment obs).

  • can_use_detrend (bool) – Whether unit has sufficient data for detrending (≥2 pre-treatment obs).

  • reason_if_excluded (str | None) – Reason for exclusion if unit cannot be used.
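
As a rough illustration of how these per-unit fields relate to one another, the sketch below derives several of them for a single unit from its observed periods. `unit_missing_summary` is a hypothetical helper, and integer-valued periods are assumed so that the span formula applies.

```python
def unit_missing_summary(observed_periods, all_periods, first_treat=None):
    """Per-unit counts under the definitions listed above (hypothetical helper).

    Assumes the unit is observed in at least one period of `all_periods`.
    """
    observed = sorted(set(observed_periods) & set(all_periods))
    n_total = len(all_periods)
    n_observed = len(observed)
    summary = {
        "n_total_periods": n_total,
        "n_observed": n_observed,
        "n_missing": n_total - n_observed,
        "missing_rate": (n_total - n_observed) / n_total,
        "first_observed": observed[0],
        "last_observed": observed[-1],
        "observation_span": observed[-1] - observed[0] + 1,
    }
    if first_treat is not None:  # treated units only
        n_pre = sum(t < first_treat for t in observed)
        summary["n_pre_treatment"] = n_pre
        summary["n_post_treatment"] = n_observed - n_pre
        summary["can_use_demean"] = n_pre >= 1
        summary["can_use_detrend"] = n_pre >= 2
    return summary

# A unit observed in periods 2, 3, and 6 of a six-period panel, treated from period 3.
stats = unit_missing_summary([2, 3, 6], all_periods=range(1, 7), first_treat=3)
print(stats)
```

With a single pre-treatment observation, this unit qualifies for demeaning but not for detrending.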

unit_id: Any
cohort: int | None
is_treated: bool
n_total_periods: int
n_observed: int
n_missing: int
missing_rate: float
first_observed: int
last_observed: int
observation_span: int
n_pre_treatment: int | None = None
n_post_treatment: int | None = None
pre_treatment_missing_rate: float | None = None
post_treatment_missing_rate: float | None = None
can_use_demean: bool = True
can_use_detrend: bool = True
reason_if_excluded: str | None = None

class lwdid.selection_diagnostics.SelectionTestResult(test_name, statistic, pvalue, reject_null, interpretation, details=<factory>)[source]

Result of a statistical test for selection mechanism.

Variables:
  • test_name (str) – Name of the statistical test performed.

  • statistic (float) – Test statistic value.

  • pvalue (float) – P-value of the test.

  • reject_null (bool) – Whether to reject the null hypothesis at alpha=0.05.

  • interpretation (str) – Human-readable interpretation of the test result.

  • details (dict[str, Any]) – Additional test-specific details (e.g., means, correlations).

Notes

Common tests include:

  • Little’s MCAR Test: Tests if data is missing completely at random

  • Selection on Observables: Tests if missingness depends on controls

  • Lagged Outcome Test: Tests if missingness depends on past outcomes
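
To give a feel for the idea behind a lagged outcome test, the sketch below compares lagged outcomes between units that go missing in the next period and units that stay observed. It is a simplified stand-in using a Welch-type statistic with a normal approximation (crude in small samples); the function name, inputs, and returned keys are assumptions, and this is not necessarily the test the library runs.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def lagged_outcome_check(lag_y_missing, lag_y_observed, alpha=0.05):
    """Compare mean lagged outcomes by next-period missingness (hypothetical helper).

    Each group needs at least two values, and at least one group must vary.
    """
    m1, m0 = mean(lag_y_missing), mean(lag_y_observed)
    se = sqrt(variance(lag_y_missing) / len(lag_y_missing)
              + variance(lag_y_observed) / len(lag_y_observed))
    z = (m1 - m0) / se
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided, normal approximation
    return {"statistic": z, "pvalue": pvalue, "reject_null": pvalue < alpha}

# Made-up data: units about to go missing had much lower outcomes one period earlier.
about_to_leave = [1.0, 1.2, 0.8, 1.1, 0.9]
staying = [3.0, 3.2, 2.8, 3.1, 2.9]
result = lagged_outcome_check(about_to_leave, staying)
print(result)
```

A rejection here would suggest that missingness tracks past outcomes, which is the kind of time-varying dependence the selection mechanism assumption rules out.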

test_name: str
statistic: float
pvalue: float
reject_null: bool
interpretation: str
details: dict[str, Any]

Main Functions

lwdid.selection_diagnostics.diagnose_selection_mechanism(data, y, ivar, tvar, gvar=None, controls=None, never_treated_values=None, verbose=True)[source]

Diagnose potential selection mechanism violations in unbalanced panels.

This function implements diagnostic procedures to assess whether the selection mechanism assumption is likely to hold. The key assumption is that selection (missing data) may depend on time-invariant heterogeneity but not on time-varying outcome shocks.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • y (str) – Outcome variable column name.

  • ivar (str) – Unit identifier column name.

  • tvar (str) – Time variable column name.

  • gvar (str, optional) – Cohort variable for staggered designs. If None, assumes common timing and skips cohort-specific diagnostics.

  • controls (list of str, optional) – Control variable column names for additional diagnostics.

  • never_treated_values (list, optional) – Values in gvar indicating never-treated units. Default: [0, np.inf].

  • verbose (bool, default True) – Whether to print diagnostic summary.

Returns:

Comprehensive diagnostic results including:

  • missing_pattern: Classified pattern (MCAR, MAR, MNAR, or UNKNOWN)

  • selection_risk: Risk level for selection bias

  • attrition_analysis: Detailed attrition patterns

  • balance_statistics: Panel balance metrics

  • recommendations: Actionable suggestions

  • selection_tests: Statistical test results

Return type:

SelectionDiagnostics

Notes

The function performs several diagnostic procedures:

  1. Balance Statistics: Computes panel balance metrics and identifies units that cannot be used for demeaning (< 1 pre-period) or detrending (< 2 pre-periods).

  2. Attrition Analysis: Analyzes dropout patterns by cohort and time, distinguishing dropout before treatment from dropout after treatment.

  3. Missing Pattern Classification: Uses Little’s MCAR test and auxiliary regressions to classify the missing data mechanism.

  4. Selection Risk Assessment: Combines multiple indicators to assess the overall risk of selection bias.

The selection mechanism assumption requires that selection may depend on unobserved time-constant heterogeneity (which is removed by the rolling transformation, similar to the fixed effects estimator), but cannot systematically depend on time-varying outcome shocks.

See also

plot_missing_pattern

Visualize missing data patterns.

get_unit_missing_stats

Get per-unit missing statistics as DataFrame.

lwdid.selection_diagnostics.get_unit_missing_stats(data, y, ivar, tvar, gvar=None, never_treated_values=None)[source]

Compute per-unit missing data statistics as a DataFrame.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • y (str) – Outcome variable column name.

  • ivar (str) – Unit identifier column name.

  • tvar (str) – Time variable column name.

  • gvar (str, optional) – Cohort variable for staggered designs.

  • never_treated_values (list, optional) – Values indicating never-treated units. Default: [0, np.inf].

Returns:

DataFrame with one row per unit containing:

  • unit_id: Unit identifier

  • cohort: Treatment cohort (NaN for never-treated)

  • is_treated: Whether unit is ever treated

  • n_observed: Number of observed periods

  • n_missing: Number of missing periods

  • missing_rate: Proportion missing

  • n_pre_treatment: Pre-treatment observations

  • n_post_treatment: Post-treatment observations

  • can_use_demean: Sufficient data for demeaning

  • can_use_detrend: Sufficient data for detrending

Return type:

pd.DataFrame

See also

diagnose_selection_mechanism

Comprehensive diagnostics.

lwdid.selection_diagnostics.plot_missing_pattern(data, ivar, tvar, y=None, gvar=None, sort_by='cohort', figsize=(12, 8), cmap='RdYlGn', show_cohort_lines=True, never_treated_values=None, max_units=200, ax=None)[source]

Visualize missing data patterns in panel data.

Creates a heatmap showing observation availability across units and time. Units can be sorted by cohort, missing rate, or unit ID.

Parameters:
  • data (pd.DataFrame) – Panel data in long format.

  • ivar (str) – Unit identifier column name.

  • tvar (str) – Time variable column name.

  • y (str, optional) – Outcome variable. If provided, checks for missing Y values. If None, checks for missing rows.

  • gvar (str, optional) – Cohort variable. If provided, shows treatment timing.

  • sort_by (str, default 'cohort') – How to sort units: ‘cohort’, ‘missing_rate’, ‘unit_id’.

  • figsize (tuple, default (12, 8)) – Figure size in inches.

  • cmap (str, default 'RdYlGn') – Colormap name. Currently unused: the heatmap applies its own fixed colors (see Notes).

  • show_cohort_lines (bool, default True) – Whether to show treatment timing lines.

  • never_treated_values (list, optional) – Values indicating never-treated units. Default: [0, np.inf].

  • max_units (int, default 200) – Maximum number of units to display. If more units exist, a random sample is shown.

  • ax (matplotlib.axes.Axes, optional) – Axes to plot on. If None, creates new figure.

Returns:

Figure containing the missing pattern heatmap.

Return type:

matplotlib.figure.Figure

Notes

The heatmap uses the following color coding:

  • Green: Observed (Y value present)

  • Red: Missing (Y value missing or row absent)

  • Black line: Treatment timing (if gvar provided)
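
The grid behind such a heatmap can be sketched without any plotting: one row per unit, one column per period, with 1 for an observed outcome and 0 for a missing value or an absent row. The toy rows and the ordering rule below are hypothetical and only illustrate the color coding described above.

```python
# Hypothetical long-format rows: (unit, period, outcome); None marks a missing Y value.
rows = [
    ("a", 1, 2.0), ("a", 2, 2.5), ("a", 3, 3.1),
    ("b", 1, 1.0), ("b", 2, None),   # Y missing in period 2, row absent in period 3
    ("c", 2, 0.7), ("c", 3, 0.9),    # row absent in period 1
]
units = sorted({u for u, _, _ in rows})
periods = sorted({t for _, t, _ in rows})
seen = {(u, t) for u, t, y in rows if y is not None}

# 1 = observed (green cell), 0 = missing value or absent row (red cell).
grid = {u: [int((u, t) in seen) for t in periods] for u in units}

# One possible ordering: fewest missing cells first, ties broken by unit id.
order = sorted(units, key=lambda u: (grid[u].count(0), u))
for u in order:
    print(u, grid[u])
```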

See also

diagnose_selection_mechanism

Comprehensive diagnostics.

Example Usage

from lwdid import (
    diagnose_selection_mechanism,
    get_unit_missing_stats,
    plot_missing_pattern,
)

# panel_data is a long-format pd.DataFrame; 'outcome', 'unit', 'year', and
# 'first_treat' are placeholder column names.

# Run comprehensive selection diagnostics
diagnostics = diagnose_selection_mechanism(
    data=panel_data,
    y='outcome',
    ivar='unit',
    tvar='year',
    gvar='first_treat'
)

# Check risk level and missing pattern
print(f"Selection risk: {diagnostics.selection_risk}")
print(f"Missing pattern: {diagnostics.missing_pattern}")

# Get per-unit statistics
unit_stats = get_unit_missing_stats(
    data=panel_data,
    y='outcome',
    ivar='unit',
    tvar='year'
)

# Visualize missing patterns (returns a matplotlib Figure)
fig = plot_missing_pattern(
    data=panel_data,
    ivar='unit',
    tvar='year'
)

Interpretation Guide

Risk Levels:

  • LOW: Selection mechanism assumption likely holds. Proceed with estimation.

  • MEDIUM: Some indicators suggest potential issues. Consider using detrending and sensitivity analysis.

  • HIGH: Strong evidence of problematic selection. Results should be interpreted with caution.

Missing Patterns:

  • MCAR: Missing Completely At Random. Most benign pattern, no bias expected.

  • MAR: Missing At Random. Acceptable when controls are included.

  • MNAR: Missing Not At Random. May violate selection mechanism assumption if missingness depends on outcome shocks.

See Also