Clustering Diagnostics Module (clustering_diagnostics)

Clustering diagnostics and recommendations for difference-in-differences.

This module provides tools for analyzing clustering structure in panel data and recommending appropriate clustering levels for standard error estimation. When treatment varies at a level higher than the observation unit, standard errors should be clustered at the policy variation level.

Overview

Proper clustering is essential for valid inference in DiD settings. This module helps you:

Analyze hierarchical relationships between potential clustering variables
Detect the treatment variation level, identifying where treatment assignment varies
Recommend clustering variables with sufficient cluster counts
Check consistency between the clustering choice and treatment variation

For reliable cluster-robust inference, a minimum of 20-30 clusters is generally recommended. When clusters are fewer, wild cluster bootstrap methods provide more accurate inference.

Enums

class lwdid.clustering_diagnostics.ClusteringLevel(value)

Relative level of clustering variable to unit variable.

Variables:

LOWER (str) – Cluster variable is at a lower level than unit (invalid for clustering). Example: sub-unit ID when unit is individual.
SAME (str) – Cluster variable is at the same level as unit. Example: individual ID when unit is individual.
HIGHER (str) – Cluster variable is at a higher level than unit (recommended). Example: state when unit is county.

LOWER = 'lower'

SAME = 'same'

HIGHER = 'higher'
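
As an illustration of what the three levels mean, the relative level can be inferred from how units and clusters map onto each other. The following is a minimal sketch with pandas, not the module's own implementation; the function name `classify_level` and the toy columns are hypothetical.

```python
import pandas as pd


def classify_level(data: pd.DataFrame, ivar: str, cluster_var: str) -> str:
    """Classify cluster_var relative to ivar as 'lower', 'same', or 'higher'."""
    # Keep one row per observed (unit, cluster) combination.
    pairs = data[[ivar, cluster_var]].drop_duplicates()
    # Largest number of distinct cluster values seen for a single unit.
    clusters_per_unit = pairs.groupby(ivar)[cluster_var].nunique().max()
    # Largest number of distinct units seen inside a single cluster.
    units_per_cluster = pairs.groupby(cluster_var)[ivar].nunique().max()
    if clusters_per_unit > 1:
        return "lower"   # the variable splits a unit, so it sits below the unit
    if units_per_cluster > 1:
        return "higher"  # several units share one cluster
    return "same"        # one-to-one mapping between units and clusters


# Toy panel (hypothetical): counties nested in states, tracts nested in counties.
df = pd.DataFrame({
    "county": ["a", "a", "b", "b", "c", "c"],
    "fips": [10, 10, 20, 20, 30, 30],
    "state": ["X", "X", "X", "X", "Y", "Y"],
    "tract": [1, 2, 3, 4, 5, 6],
})
levels = {v: classify_level(df, "county", v) for v in ["state", "fips", "tract"]}
```

Note that this simple rule also labels a variable as 'lower' when a unit changes cluster over time, which is an assumption of the sketch rather than documented behavior.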

class lwdid.clustering_diagnostics.ClusteringWarningLevel(value)

Severity level for clustering warnings.

Variables:

INFO (str) – Informational message, no action required.
WARNING (str) – Warning that may affect inference reliability.
ERROR (str) – Critical issue that prevents valid inference.

INFO = 'info'

WARNING = 'warning'

ERROR = 'error'

Data Classes

class lwdid.clustering_diagnostics.ClusterVarStats(var_name, n_clusters, n_treated_clusters, n_control_clusters, min_cluster_size, max_cluster_size, mean_cluster_size, median_cluster_size, cluster_size_cv, level_relative_to_unit, units_per_cluster, is_nested_in_unit, treatment_varies_within_cluster, n_clusters_with_treatment_variation=0)

Statistics for a single potential clustering variable.

This class holds comprehensive statistics about a clustering variable, including cluster counts, size distributions, and validity indicators.

Variables:

var_name (str) – Name of the clustering variable.
n_clusters (int) – Total number of unique clusters.
n_treated_clusters (int) – Number of clusters containing treated units.
n_control_clusters (int) – Number of clusters containing only control units.
min_cluster_size (int) – Minimum number of observations in any cluster.
max_cluster_size (int) – Maximum number of observations in any cluster.
mean_cluster_size (float) – Mean cluster size.
median_cluster_size (float) – Median cluster size.
cluster_size_cv (float) – Coefficient of variation of cluster sizes (std/mean).
level_relative_to_unit (ClusteringLevel) – Whether cluster is at higher/same/lower level than unit.
units_per_cluster (float) – Average number of unique units per cluster.
is_nested_in_unit (bool) – True if cluster varies within unit (invalid for clustering).
treatment_varies_within_cluster (bool) – True if treatment status varies within clusters.
n_clusters_with_treatment_variation (int) – Number of clusters with within-cluster treatment variation.

Properties:

is_valid_cluster (bool) – Whether this is a valid clustering variable.
is_recommended (bool) – Whether this clustering level is recommended.
reliability_score (float) – Score indicating reliability of cluster-robust inference (0-1).
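
The size-related fields above can be reproduced by hand from the cluster column. The sketch below is one way to do it with pandas; the helper name `cluster_size_stats` is hypothetical, and the use of the population standard deviation (ddof=0) in the coefficient of variation is an assumption, since the text only says std/mean.

```python
import pandas as pd


def cluster_size_stats(data: pd.DataFrame, cluster_var: str) -> dict:
    """Summarize the distribution of observations per cluster."""
    sizes = data.groupby(cluster_var).size()  # observations in each cluster
    mean = sizes.mean()
    return {
        "n_clusters": int(sizes.shape[0]),
        "min_cluster_size": int(sizes.min()),
        "max_cluster_size": int(sizes.max()),
        "mean_cluster_size": float(mean),
        "median_cluster_size": float(sizes.median()),
        # Assumption: population std (ddof=0); the module may use ddof=1.
        "cluster_size_cv": float(sizes.std(ddof=0) / mean),
    }


# Toy data (hypothetical): clusters of size 4, 2, and 3.
df = pd.DataFrame({"state": ["X"] * 4 + ["Y"] * 2 + ["Z"] * 3})
stats = cluster_size_stats(df, "state")
```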

var_name: str

n_clusters: int

n_treated_clusters: int

n_control_clusters: int

min_cluster_size: int

max_cluster_size: int

mean_cluster_size: float

median_cluster_size: float

cluster_size_cv: float

level_relative_to_unit: ClusteringLevel

units_per_cluster: float

is_nested_in_unit: bool

treatment_varies_within_cluster: bool

n_clusters_with_treatment_variation: int = 0

property is_valid_cluster: bool

Whether this is a valid clustering variable.

A valid clustering variable must:

1. Not be nested within units (each unit belongs to one cluster)
2. Have at least 2 clusters
3. Not be at a lower level than the unit variable

Returns: True if valid for clustering.
Return type: bool

property is_recommended: bool

Whether this clustering level is recommended.

A recommended clustering variable must:

1. Be valid (see is_valid_cluster)
2. Have at least 20 clusters for reliable inference
3. Have no treatment variation within clusters

Returns: True if recommended for clustering.
Return type: bool
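
The conditions listed for is_valid_cluster and is_recommended translate directly into boolean checks. The sketch below uses a hypothetical stand-in class, `ClusterSummary`, rather than the library's ClusterVarStats, and only encodes the rules stated in this section.

```python
from dataclasses import dataclass


@dataclass
class ClusterSummary:  # hypothetical stand-in, not the library's ClusterVarStats
    n_clusters: int
    level: str  # 'lower', 'same', or 'higher' relative to the unit
    is_nested_in_unit: bool
    treatment_varies_within_cluster: bool

    @property
    def is_valid_cluster(self) -> bool:
        # Valid: not nested in units, at least 2 clusters, not below the unit.
        return (
            not self.is_nested_in_unit
            and self.n_clusters >= 2
            and self.level != "lower"
        )

    @property
    def is_recommended(self) -> bool:
        # Recommended: valid, at least 20 clusters, no within-cluster variation.
        return (
            self.is_valid_cluster
            and self.n_clusters >= 20
            and not self.treatment_varies_within_cluster
        )


state = ClusterSummary(50, "higher", False, False)
region = ClusterSummary(4, "higher", False, False)
tract = ClusterSummary(500, "lower", True, True)
```

Here `region` is valid but not recommended because it has only 4 clusters, while `tract` fails validity outright.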

property reliability_score: float

Score indicating reliability of cluster-robust inference (0-1).

Based on:

Number of clusters (more is better, saturates at 50)
Balance of treated/control clusters
Cluster size variation (less is better)

Returns: Reliability score between 0 and 1.
Return type: float
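
The exact formula behind reliability_score is not given here. To make the three ingredients concrete, the following sketch combines them in one possible way; the weights (0.5, 0.3, 0.2), the balance measure, and the 1/(1+CV) penalty are all illustrative assumptions, and the function name is hypothetical.

```python
def illustrative_reliability_score(
    n_clusters: int, n_treated: int, n_control: int, size_cv: float
) -> float:
    """Toy score in [0, 1]; weights and functional forms are assumptions."""
    # Cluster count: grows linearly and saturates at 50 clusters.
    count_term = min(n_clusters / 50.0, 1.0)
    # Balance: 1 when treated and control clusters are equal in number,
    # 0 when one group is empty.
    total = n_treated + n_control
    balance_term = 2.0 * min(n_treated, n_control) / total if total else 0.0
    # Size variation: 1 when all clusters have equal size, falling as CV grows.
    cv_term = 1.0 / (1.0 + max(size_cv, 0.0))
    score = 0.5 * count_term + 0.3 * balance_term + 0.2 * cv_term
    return max(0.0, min(1.0, score))


balanced = illustrative_reliability_score(50, 25, 25, 0.0)
saturated = illustrative_reliability_score(100, 50, 50, 0.0)
weak = illustrative_reliability_score(10, 1, 9, 1.0)
```

Under these assumptions, doubling the cluster count from 50 to 100 leaves the score unchanged, which is what "saturates at 50" means in practice.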

class lwdid.clustering_diagnostics.ClusteringDiagnostics(cluster_structure, recommended_cluster_var, recommendation_reason, treatment_variation_level, warnings=<factory>)

Diagnostic results for clustering structure analysis.

This class contains the complete results of clustering diagnostics, including statistics for each potential clustering variable and recommendations.

Variables:

cluster_structure (Dict[str, ClusterVarStats]) – Statistics for each potential clustering variable.
recommended_cluster_var (Optional[str]) – Recommended clustering variable name.
recommendation_reason (str) – Explanation for the recommendation.
treatment_variation_level (str) – Detected level at which treatment varies.
warnings (List[str]) – Warning messages about clustering choices.

cluster_structure: Dict[str, ClusterVarStats]

recommended_cluster_var: str | None

recommendation_reason: str

treatment_variation_level: str

warnings: List[str]

summary()

Generate human-readable summary of diagnostics.

Returns: Formatted summary string.
Return type: str

class lwdid.clustering_diagnostics.ClusteringRecommendation(recommended_var, n_clusters, n_treated_clusters, n_control_clusters, confidence, reasons, alternatives=<factory>, warnings=<factory>, use_wild_bootstrap=False, wild_bootstrap_reason=None)

Recommendation for clustering level selection.

This class provides a detailed recommendation for which clustering variable to use, along with confidence scores and alternatives.

Variables:

recommended_var (str) – Recommended clustering variable name.
n_clusters (int) – Number of clusters with recommended variable.
n_treated_clusters (int) – Number of treated clusters.
n_control_clusters (int) – Number of control clusters.
confidence (float) – Confidence in recommendation (0-1).
reasons (List[str]) – List of reasons supporting the recommendation.
alternatives (List[Dict[str, Any]]) – Alternative clustering options with their statistics.
warnings (List[str]) – Warning messages.
use_wild_bootstrap (bool) – Whether to recommend wild cluster bootstrap.
wild_bootstrap_reason (Optional[str]) – Reason for wild bootstrap recommendation.

recommended_var: str

n_clusters: int

n_treated_clusters: int

n_control_clusters: int

confidence: float

reasons: List[str]

alternatives: List[Dict[str, Any]]

warnings: List[str]

use_wild_bootstrap: bool = False

wild_bootstrap_reason: str | None = None

summary()

Generate human-readable summary of recommendation.

Returns: Formatted summary string.
Return type: str

class lwdid.clustering_diagnostics.ClusteringConsistencyResult(is_consistent, treatment_variation_level, cluster_level, n_clusters, n_treatment_changes_within_cluster, pct_clusters_with_variation, recommendation, details)

Result of clustering consistency check.

This class contains the results of checking whether the chosen clustering level is consistent with the treatment variation level.

Variables:

is_consistent (bool) – Whether clustering level is consistent with treatment variation.
treatment_variation_level (str) – Detected level at which treatment varies.
cluster_level (str) – Level of the clustering variable.
n_clusters (int) – Number of clusters.
n_treatment_changes_within_cluster (int) – Number of clusters with treatment variation within.
pct_clusters_with_variation (float) – Percentage of clusters with within-cluster treatment variation.
recommendation (str) – Suggested action if inconsistent.
details (str) – Detailed explanation of the consistency check.

is_consistent: bool

treatment_variation_level: str

cluster_level: str

n_clusters: int

n_treatment_changes_within_cluster: int

pct_clusters_with_variation: float

recommendation: str

details: str

summary()

Generate human-readable summary of consistency check.

Returns: Formatted summary string.
Return type: str
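
The two variation fields of ClusteringConsistencyResult describe how many clusters mix treated and control units. A minimal pandas sketch of that count follows; it is not the module's implementation, the helper name is hypothetical, and it assumes the treatment column is a time-invariant group indicator (ever treated) rather than a treated-times-post indicator.

```python
import pandas as pd


def within_cluster_variation(data: pd.DataFrame, cluster_var: str, d: str):
    """Count clusters in which the treatment indicator takes more than one value."""
    # Number of distinct treatment values observed inside each cluster.
    n_values = data.groupby(cluster_var)[d].nunique()
    n_varying = int((n_values > 1).sum())
    pct_varying = 100.0 * n_varying / len(n_values)
    return n_varying, pct_varying


# Toy data (hypothetical): only state Z mixes treated and control units.
df = pd.DataFrame({
    "state": ["X", "X", "Y", "Y", "Z", "Z"],
    "treated": [1, 1, 0, 0, 1, 0],
})
n_varying, pct_varying = within_cluster_variation(df, "state", "treated")
```

A clustering choice with zero varying clusters matches the treatment variation level; a positive count suggests treatment is assigned below the chosen cluster level.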

Main Functions

lwdid.clustering_diagnostics.diagnose_clustering(data, ivar, potential_cluster_vars, gvar=None, d=None, verbose=True)

Diagnose clustering structure and recommend clustering level.

Analyzes the hierarchical structure of potential clustering variables relative to the unit of observation and treatment assignment.

This function helps users choose the appropriate clustering level for standard error estimation in difference-in-differences analysis.

Parameters:

data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
potential_cluster_vars (List[str]) – List of potential clustering variable column names to evaluate.
gvar (str, optional) – Cohort variable for staggered designs. Use this for staggered adoption designs where treatment timing varies across units.
d (str, optional) – Treatment indicator variable (for common timing). Use this for designs where all treated units receive treatment at the same time.
verbose (bool, default True) – Whether to print diagnostic summary.

Returns:

Diagnostic results containing:

cluster_structure: Statistics for each potential clustering variable
recommended_cluster_var: Recommended clustering variable name
recommendation_reason: Explanation for the recommendation
treatment_variation_level: Detected level at which treatment varies
warnings: Warning messages about clustering choices

Return type:

ClusteringDiagnostics

Raises:

ValueError – If inputs are invalid (missing columns, no treatment variable, etc.)

Notes

When the policy or treatment varies at a level higher than the unit of observation, standard errors should be clustered at the policy variation level to properly account for within-cluster correlation.

The function evaluates each potential clustering variable based on:

Number of clusters (more is better, 20-30 minimum recommended)
Balance between treated and control clusters
Whether treatment varies within clusters
Cluster size variation (coefficient of variation)

Example Usage

Diagnosing Clustering Structure

from lwdid import diagnose_clustering

# Analyze potential clustering variables
# (pass the treatment indicator as d, or the cohort variable as gvar)
diagnostics = diagnose_clustering(
    data=panel_data,
    ivar='county',
    potential_cluster_vars=['state', 'region'],
    d='treated'
)

# Review cluster counts (cluster_structure maps variable name to ClusterVarStats)
for var_name, var_stats in diagnostics.cluster_structure.items():
    print(f"{var_name}: {var_stats.n_clusters} clusters")

# Check warnings (each warning is a plain string)
for warning in diagnostics.warnings:
    print(warning)

Getting Clustering Recommendations

from lwdid import lwdid, recommend_clustering_level

# Get recommendation with minimum cluster count
recommendation = recommend_clustering_level(
    data=panel_data,
    ivar='county',
    potential_cluster_vars=['state', 'region'],
    treatment_var='treated',
    min_clusters=20
)

print(f"Recommended: {recommendation.recommended_var}")
# reasons is a list of strings supporting the recommendation
for reason in recommendation.reasons:
    print(f"Reason: {reason}")

# Use in estimation
results = lwdid(
    panel_data, y='outcome', d='treated', ivar='county', tvar='year',
    post='post', rolling='demean',
    vce='cluster', cluster_var=recommendation.recommended_var
)

Checking Consistency

from lwdid import check_clustering_consistency

# Verify clustering choice is appropriate
consistency = check_clustering_consistency(
    data=panel_data,
    ivar='county',
    cluster_var='state',
    treatment_var='treated'
)

print(f"Consistent: {consistency.is_consistent}")
if not consistency.is_consistent:
    print(f"Issue: {consistency.details}")
    print(f"Suggested action: {consistency.recommendation}")

Wild Cluster Bootstrap

When cluster counts are small (< 20), use wild cluster bootstrap:

from lwdid import wild_cluster_bootstrap

# Run wild cluster bootstrap
bootstrap_result = wild_cluster_bootstrap(
    data=transformed_data,
    y_transformed='y_dot',
    d='treated',
    cluster_var='state',
    n_bootstrap=999,
    weight_type='rademacher'
)

print(f"Bootstrap SE: {bootstrap_result.se_bootstrap:.4f}")
print(f"Bootstrap p-value: {bootstrap_result.pvalue:.4f}")
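
To show what a wild cluster bootstrap does under the hood, the following is a generic NumPy sketch of the restricted (null-imposed) bootstrap-t with Rademacher weights for a regression of the outcome on an intercept and a treatment dummy. It is not lwdid's implementation: the function names, the CR0 variance estimator, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def cluster_t_stat(y, X, groups, coef=1):
    """OLS t-statistic for one coefficient with a CR0 cluster-robust variance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        mask = groups == g
        score = X[mask].T @ resid[mask]  # summed score within one cluster
        meat += np.outer(score, score)
    V = XtX_inv @ meat @ XtX_inv
    return beta[coef] / np.sqrt(V[coef, coef])


def wild_cluster_bootstrap_pvalue(y, d, groups, n_bootstrap=999, seed=0):
    """Two-sided bootstrap p-value for H0: coefficient on d equals zero."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    groups = np.asarray(groups)
    X = np.column_stack([np.ones_like(d), d])
    t_obs = cluster_t_stat(y, X, groups)
    # Impose the null: with the coefficient on d set to zero, the restricted
    # model is intercept-only, so fitted values equal the sample mean.
    fitted_r = np.full_like(y, y.mean())
    resid_r = y - fitted_r
    cluster_ids, cluster_idx = np.unique(groups, return_inverse=True)
    t_boot = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # One Rademacher draw (+1 or -1) per cluster, shared by all its rows.
        w = rng.choice([-1.0, 1.0], size=len(cluster_ids))
        y_star = fitted_r + w[cluster_idx] * resid_r
        t_boot[b] = cluster_t_stat(y_star, X, groups)
    return float(np.mean(np.abs(t_boot) >= np.abs(t_obs)))


# Simulated example: 12 clusters of 10 observations, 6 clusters treated.
sim = np.random.default_rng(42)
groups = np.repeat(np.arange(12), 10)
d = (groups < 6).astype(float)
cluster_effect = sim.normal(0.0, 0.5, size=12)[groups]
y = 3.0 * d + cluster_effect + sim.normal(0.0, 1.0, size=groups.size)
p_value = wild_cluster_bootstrap_pvalue(y, d, groups)
```

The key design point is that the random sign is drawn once per cluster, so the within-cluster correlation of the residuals is preserved in every bootstrap sample.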

Guidelines

Minimum Cluster Counts:

20-30 clusters: Generally sufficient for cluster-robust standard errors
10-20 clusters: Use wild cluster bootstrap for improved inference
< 10 clusters: Wild cluster bootstrap essential; consider randomization inference
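
The thresholds above can be written as a small decision rule. This is only a sketch of the guideline, with a hypothetical function name, and it treats exactly 20 clusters as sufficient for cluster-robust standard errors.

```python
def suggest_inference_method(n_clusters: int) -> str:
    """Map a cluster count to the inference approach suggested in the guidelines."""
    if n_clusters >= 20:
        return "cluster-robust standard errors"
    if n_clusters >= 10:
        return "wild cluster bootstrap"
    return "wild cluster bootstrap; consider randomization inference"


suggestions = {n: suggest_inference_method(n) for n in (8, 15, 50)}
```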

Clustering Level Selection:

Cluster at the level where treatment varies
If treatment varies at multiple levels, cluster at the highest level
Never cluster below the unit level
