Clustering Diagnostics Module (clustering_diagnostics)
Clustering diagnostics and recommendations for difference-in-differences.
This module provides tools for analyzing clustering structure in panel data and recommending appropriate clustering levels for standard error estimation. When treatment varies at a level higher than the observation unit, standard errors should be clustered at the policy variation level.
Overview
Proper clustering is essential for valid inference in DiD settings. This module helps:
Analyze hierarchical relationships: Between potential clustering variables
Detect treatment variation level: Identify where treatment assignment varies
Recommend clustering variables: With sufficient cluster counts
Check consistency: Between clustering choice and treatment variation
For reliable cluster-robust inference, a minimum of 20-30 clusters is generally recommended. When clusters are fewer, wild cluster bootstrap methods provide more accurate inference.
Enums
- class lwdid.clustering_diagnostics.ClusteringLevel(value)[source]
Relative level of clustering variable to unit variable.
- Variables:
LOWER (str) – Cluster variable is at a lower level than unit (invalid for clustering). Example: sub-unit ID when unit is individual.
SAME (str) – Cluster variable is at the same level as unit. Example: individual ID when unit is individual.
HIGHER (str) – Cluster variable is at a higher level than unit (recommended). Example: state when unit is county.
- LOWER = 'lower'
- SAME = 'same'
- HIGHER = 'higher'
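The three levels above can be illustrated with a short pandas sketch. This is not the module's implementation: the helper name `classify_level` and the toy columns are hypothetical, and the rule (count distinct clusters per unit, then distinct units per cluster) is an assumption about how such a classification could be done.

```python
import pandas as pd

def classify_level(data: pd.DataFrame, ivar: str, cluster_var: str) -> str:
    """Classify cluster_var as 'lower', 'same', or 'higher' relative to ivar."""
    # How many distinct cluster values does each unit take?
    clusters_per_unit = data.groupby(ivar)[cluster_var].nunique()
    if (clusters_per_unit > 1).any():
        # The cluster variable changes within a unit, so it sits below the unit.
        return "lower"
    # How many distinct units does each cluster contain?
    units_per_cluster = data.groupby(cluster_var)[ivar].nunique()
    if (units_per_cluster > 1).any():
        return "higher"
    return "same"

# Toy panel: three counties observed twice, nested in two states.
panel = pd.DataFrame({
    "county": ["a", "a", "b", "b", "c", "c"],
    "state": ["X", "X", "X", "X", "Y", "Y"],
    "visit": [1, 2, 3, 4, 5, 6],  # sub-unit identifier, varies within county
})
levels = {v: classify_level(panel, "county", v) for v in ["visit", "county", "state"]}
print(levels)
```

In this toy panel `visit` is classified as lower, `county` as same, and `state` as higher, matching the examples given for each enum member.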
- class lwdid.clustering_diagnostics.ClusteringWarningLevel(value)[source]
Severity level for clustering warnings.
- Variables:
- INFO = 'info'
- WARNING = 'warning'
- ERROR = 'error'
Data Classes
- class lwdid.clustering_diagnostics.ClusterVarStats(var_name, n_clusters, n_treated_clusters, n_control_clusters, min_cluster_size, max_cluster_size, mean_cluster_size, median_cluster_size, cluster_size_cv, level_relative_to_unit, units_per_cluster, is_nested_in_unit, treatment_varies_within_cluster, n_clusters_with_treatment_variation=0)[source]
Statistics for a single potential clustering variable.
This class holds comprehensive statistics about a clustering variable, including cluster counts, size distributions, and validity indicators.
- Variables:
var_name (str) – Name of the clustering variable.
n_clusters (int) – Total number of unique clusters.
n_treated_clusters (int) – Number of clusters containing treated units.
n_control_clusters (int) – Number of clusters containing only control units.
min_cluster_size (int) – Minimum number of observations in any cluster.
max_cluster_size (int) – Maximum number of observations in any cluster.
mean_cluster_size (float) – Mean cluster size.
median_cluster_size (float) – Median cluster size.
cluster_size_cv (float) – Coefficient of variation of cluster sizes (std/mean).
level_relative_to_unit (ClusteringLevel) – Whether cluster is at higher/same/lower level than unit.
units_per_cluster (float) – Average number of unique units per cluster.
is_nested_in_unit (bool) – True if cluster varies within unit (invalid for clustering).
treatment_varies_within_cluster (bool) – True if treatment status varies within clusters.
n_clusters_with_treatment_variation (int) – Number of clusters with within-cluster treatment variation.
- Properties:
is_valid_cluster (bool) – Whether this is a valid clustering variable.
is_recommended (bool) – Whether this clustering level is recommended.
reliability_score (float) – Score indicating reliability of cluster-robust inference (0-1).
- var_name: str
- n_clusters: int
- n_treated_clusters: int
- n_control_clusters: int
- min_cluster_size: int
- max_cluster_size: int
- mean_cluster_size: float
- median_cluster_size: float
- cluster_size_cv: float
- level_relative_to_unit: ClusteringLevel
- units_per_cluster: float
- is_nested_in_unit: bool
- treatment_varies_within_cluster: bool
- n_clusters_with_treatment_variation: int = 0
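As a rough illustration of where the size-related fields come from, the following sketch computes them with pandas. The helper `cluster_size_stats` is hypothetical and not part of lwdid; in particular, the use of the population standard deviation (ddof=0) in the coefficient of variation is an assumption, since the documentation only states std/mean.

```python
import pandas as pd

def cluster_size_stats(data: pd.DataFrame, ivar: str, cluster_var: str) -> dict:
    """Summarize cluster counts and sizes for one candidate variable."""
    sizes = data.groupby(cluster_var).size()  # observations per cluster
    mean_size = float(sizes.mean())
    return {
        "n_clusters": int(sizes.shape[0]),
        "min_cluster_size": int(sizes.min()),
        "max_cluster_size": int(sizes.max()),
        "mean_cluster_size": mean_size,
        "median_cluster_size": float(sizes.median()),
        # Assumption: population standard deviation (ddof=0) in the CV.
        "cluster_size_cv": float(sizes.std(ddof=0)) / mean_size,
        # Average number of distinct units per cluster.
        "units_per_cluster": float(data.groupby(cluster_var)[ivar].nunique().mean()),
    }

# Toy panel: counties a and b lie in state X, county c in state Y.
panel = pd.DataFrame({
    "county": ["a", "a", "b", "b", "c", "c"],
    "state": ["X", "X", "X", "X", "Y", "Y"],
})
stats = cluster_size_stats(panel, "county", "state")
print(stats)
```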
- property is_valid_cluster: bool
Whether this is a valid clustering variable.
A valid clustering variable must:
1. Not be nested within units (each unit belongs to one cluster)
2. Have at least 2 clusters
3. Not be at a lower level than the unit variable
- Returns:
True if valid for clustering.
- Return type:
bool
- property is_recommended: bool
Whether this clustering level is recommended.
A recommended clustering variable must:
1. Be valid (see is_valid_cluster)
2. Have at least 20 clusters for reliable inference
3. Have no treatment variation within clusters
- Returns:
True if recommended for clustering.
- Return type:
bool
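The two properties is_valid_cluster and is_recommended reduce to simple boolean rules. The sketch below restates them on a hypothetical stand-in class (`CandidateCheck` is not an lwdid class, and its `level` field stands in for level_relative_to_unit as a plain string); it follows only the conditions listed in this documentation.

```python
from dataclasses import dataclass

@dataclass
class CandidateCheck:
    """Hypothetical stand-in holding only the fields the two rules need."""
    n_clusters: int
    level: str  # 'lower', 'same', or 'higher'
    is_nested_in_unit: bool
    treatment_varies_within_cluster: bool

    @property
    def is_valid_cluster(self) -> bool:
        # Valid: not nested in units, at least 2 clusters, not below the unit level.
        return (
            not self.is_nested_in_unit
            and self.n_clusters >= 2
            and self.level != "lower"
        )

    @property
    def is_recommended(self) -> bool:
        # Recommended: valid, at least 20 clusters, no within-cluster treatment variation.
        return (
            self.is_valid_cluster
            and self.n_clusters >= 20
            and not self.treatment_varies_within_cluster
        )

many = CandidateCheck(30, "higher", False, False)
few = CandidateCheck(8, "higher", False, False)
nested = CandidateCheck(100, "lower", True, False)
print(many.is_recommended, few.is_recommended, nested.is_valid_cluster)
```

A variable with 8 clusters is still valid here, but it is not recommended because it falls below the 20-cluster threshold.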
- property reliability_score: float
Score indicating reliability of cluster-robust inference (0-1).
Based on:
- Number of clusters (more is better, saturates at 50)
- Balance of treated/control clusters
- Cluster size variation (less is better)
- Returns:
Reliability score between 0 and 1.
- Return type:
float
- class lwdid.clustering_diagnostics.ClusteringDiagnostics(cluster_structure, recommended_cluster_var, recommendation_reason, treatment_variation_level, warnings=<factory>)[source]
Diagnostic results for clustering structure analysis.
This class contains the complete results of clustering diagnostics, including statistics for each potential clustering variable and recommendations.
- Variables:
cluster_structure (Dict[str, ClusterVarStats]) – Statistics for each potential clustering variable.
recommended_cluster_var (Optional[str]) – Recommended clustering variable name.
recommendation_reason (str) – Explanation for the recommendation.
treatment_variation_level (str) – Detected level at which treatment varies.
warnings (List[str]) – Warning messages about clustering choices.
- recommendation_reason: str
- treatment_variation_level: str
- class lwdid.clustering_diagnostics.ClusteringRecommendation(recommended_var, n_clusters, n_treated_clusters, n_control_clusters, confidence, reasons, alternatives=<factory>, warnings=<factory>, use_wild_bootstrap=False, wild_bootstrap_reason=None)[source]
Recommendation for clustering level selection.
This class provides a detailed recommendation for which clustering variable to use, along with confidence scores and alternatives.
- Variables:
recommended_var (str) – Recommended clustering variable name.
n_clusters (int) – Number of clusters with recommended variable.
n_treated_clusters (int) – Number of treated clusters.
n_control_clusters (int) – Number of control clusters.
confidence (float) – Confidence in recommendation (0-1).
reasons (List[str]) – List of reasons supporting the recommendation.
alternatives (List[Dict[str, Any]]) – Alternative clustering options with their statistics.
warnings (List[str]) – Warning messages.
use_wild_bootstrap (bool) – Whether to recommend wild cluster bootstrap.
wild_bootstrap_reason (Optional[str]) – Reason for wild bootstrap recommendation.
- recommended_var: str
- n_clusters: int
- n_treated_clusters: int
- n_control_clusters: int
- confidence: float
- use_wild_bootstrap: bool = False
- class lwdid.clustering_diagnostics.ClusteringConsistencyResult(is_consistent, treatment_variation_level, cluster_level, n_clusters, n_treatment_changes_within_cluster, pct_clusters_with_variation, recommendation, details)[source]
Result of clustering consistency check.
This class contains the results of checking whether the chosen clustering level is consistent with the treatment variation level.
- Variables:
is_consistent (bool) – Whether clustering level is consistent with treatment variation.
treatment_variation_level (str) – Detected level at which treatment varies.
cluster_level (str) – Level of the clustering variable.
n_clusters (int) – Number of clusters.
n_treatment_changes_within_cluster (int) – Number of clusters with treatment variation within.
pct_clusters_with_variation (float) – Percentage of clusters with within-cluster treatment variation.
recommendation (str) – Suggested action if inconsistent.
details (str) – Detailed explanation of the consistency check.
- is_consistent: bool
- treatment_variation_level: str
- cluster_level: str
- n_clusters: int
- n_treatment_changes_within_cluster: int
- pct_clusters_with_variation: float
- recommendation: str
- details: str
Main Functions
- lwdid.clustering_diagnostics.diagnose_clustering(data, ivar, potential_cluster_vars, gvar=None, d=None, verbose=True)[source]
Diagnose clustering structure and recommend clustering level.
Analyzes the hierarchical structure of potential clustering variables relative to the unit of observation and treatment assignment.
This function helps users choose the appropriate clustering level for standard error estimation in difference-in-differences analysis.
- Parameters:
data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
potential_cluster_vars (List[str]) – List of potential clustering variable column names to evaluate.
gvar (str, optional) – Cohort variable for staggered designs. Use this for staggered adoption designs where treatment timing varies across units.
d (str, optional) – Treatment indicator variable (for common timing). Use this for designs where all treated units receive treatment at the same time.
verbose (bool, default True) – Whether to print diagnostic summary.
- Returns:
Diagnostic results containing:
- cluster_structure: Statistics for each potential clustering variable
- recommended_cluster_var: Recommended clustering variable name
- recommendation_reason: Explanation for the recommendation
- treatment_variation_level: Detected level at which treatment varies
- warnings: Warning messages about clustering choices
- Return type:
ClusteringDiagnostics
- Raises:
ValueError – If inputs are invalid (missing columns, no treatment variable, etc.)
Notes
When the policy or treatment varies at a level higher than the unit of observation, standard errors should be clustered at the policy variation level to properly account for within-cluster correlation.
The function evaluates each potential clustering variable based on:
Number of clusters (more is better, 20-30 minimum recommended)
Balance between treated and control clusters
Whether treatment varies within clusters
Cluster size variation (coefficient of variation)
See also
recommend_clustering_level: Get detailed recommendation with alternatives.
check_clustering_consistency: Validate clustering choice against treatment.
- lwdid.clustering_diagnostics.recommend_clustering_level(data, ivar, tvar, potential_cluster_vars, gvar=None, d=None, min_clusters=20, verbose=True)[source]
Recommend optimal clustering level based on data characteristics.
This function provides a detailed recommendation for which clustering variable to use, along with confidence scores, alternatives, and guidance on whether wild cluster bootstrap is needed.
Algorithm:
1. Analyze each potential cluster variable
2. Detect treatment variation level
3. Filter to valid clustering options
4. Rank by reliability score
5. Check if wild bootstrap is needed (when clusters < min_clusters)
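Steps 3 to 5 of the algorithm can be sketched in plain Python. This is an illustrative outline only: `pick_cluster_var` and the dictionary keys are hypothetical, and the sketch assumes steps 1 and 2 have already produced a validity flag and a reliability score for each candidate.

```python
def pick_cluster_var(candidates, min_clusters=20):
    """Filter valid candidates, rank by score, and flag small cluster counts."""
    # Step 3: keep only valid clustering options.
    valid = [c for c in candidates if c["is_valid"]]
    if not valid:
        raise ValueError("No valid clustering options found.")
    # Step 4: rank by reliability score, highest first.
    ranked = sorted(valid, key=lambda c: c["score"], reverse=True)
    best = ranked[0]
    return {
        "recommended_var": best["name"],
        "alternatives": [c["name"] for c in ranked[1:]],
        # Step 5: recommend wild bootstrap when clusters are below the minimum.
        "use_wild_bootstrap": best["n_clusters"] < min_clusters,
    }

# Hypothetical, precomputed candidate summaries.
candidates = [
    {"name": "household", "n_clusters": 900, "is_valid": False, "score": 0.9},
    {"name": "state", "n_clusters": 48, "is_valid": True, "score": 0.8},
    {"name": "region", "n_clusters": 4, "is_valid": True, "score": 0.2},
]
choice = pick_cluster_var(candidates)
print(choice)
```

The invalid candidate is dropped despite its high score, and the ValueError mirrors the documented behaviour when no valid option remains.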
- Parameters:
data (pd.DataFrame) – Panel data in long format.
ivar (str) – Unit identifier column name.
tvar (str) – Time variable column name.
potential_cluster_vars (List[str]) – List of potential clustering variable column names.
gvar (str, optional) – Cohort/treatment timing variable column name (for staggered designs).
d (str, optional) – Treatment indicator variable (for common timing designs).
min_clusters (int, default 20) – Minimum recommended number of clusters. If the recommended clustering variable has fewer clusters, wild cluster bootstrap will be recommended.
verbose (bool, default True) – Whether to print recommendation summary.
- Returns:
Recommendation containing:
- recommended_var: Recommended clustering variable name
- n_clusters: Number of clusters with recommended variable
- n_treated_clusters: Number of treated clusters
- n_control_clusters: Number of control clusters
- confidence: Confidence in recommendation (0-1)
- reasons: List of reasons supporting the recommendation
- alternatives: Alternative clustering options
- warnings: Warning messages
- use_wild_bootstrap: Whether to recommend wild cluster bootstrap
- wild_bootstrap_reason: Reason for wild bootstrap recommendation
- Return type:
ClusteringRecommendation
- Raises:
ValueError – If no valid clustering options are found.
Notes
The reliability score is computed as a weighted combination of:
Number of clusters (50% weight, saturates at 50 clusters)
Balance of treated/control clusters (30% weight)
Cluster size variation (20% weight, lower CV is better)
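The weighting above could be sketched as follows. Only the three weights and the saturation at 50 clusters come from this documentation; the balance term (ratio of the smaller to the larger group) and the size-variation term (1 minus CV, floored at 0) are assumptions chosen for illustration, and the function is not the library's implementation.

```python
def reliability_score(n_clusters, n_treated_clusters, n_control_clusters, cluster_size_cv):
    """Illustrative 0-1 score combining count, balance, and size variation."""
    # 50% weight: number of clusters, saturating at 50.
    count_term = min(n_clusters / 50.0, 1.0)
    # 30% weight: treated/control balance (assumed form: smaller / larger group).
    larger = max(n_treated_clusters, n_control_clusters)
    balance_term = min(n_treated_clusters, n_control_clusters) / larger if larger > 0 else 0.0
    # 20% weight: cluster size variation (assumed form: 1 - CV, floored at 0).
    cv_term = max(0.0, 1.0 - cluster_size_cv)
    return 0.5 * count_term + 0.3 * balance_term + 0.2 * cv_term

balanced = reliability_score(50, 25, 25, 0.0)   # many, balanced, equal-sized clusters
sparse = reliability_score(25, 5, 20, 1.0)      # fewer, unbalanced, uneven clusters
print(balanced, sparse)
```

Under these assumed forms, 50 balanced clusters of equal size reach the maximum score of 1, while halving the cluster count alone removes a quarter of the score.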
When the number of clusters is below min_clusters, the function recommends using wild cluster bootstrap for more reliable inference.
See also
diagnose_clustering: Get detailed diagnostics for clustering structure.
check_clustering_consistency: Check if clustering is consistent with treatment.
wild_cluster_bootstrap: Bootstrap inference for small cluster counts.
- lwdid.clustering_diagnostics.check_clustering_consistency(data, ivar, cluster_var, gvar=None, d=None, verbose=True)[source]
Check if clustering level is consistent with treatment variation level.
A consistent clustering choice means:
1. Treatment does not vary within clusters (or varies minimally)
2. Cluster level is at or above the treatment variation level
This function helps validate that the chosen clustering variable is appropriate for the treatment assignment mechanism.
- Parameters:
data (pd.DataFrame) – Panel data.
ivar (str) – Unit identifier.
cluster_var (str) – Clustering variable to check.
gvar (str, optional) – Treatment timing variable (for staggered designs).
d (str, optional) – Treatment indicator (for common timing designs).
verbose (bool, default True) – Whether to print results.
- Returns:
Consistency check results containing:
- is_consistent: Whether clustering level is consistent
- treatment_variation_level: Detected treatment variation level
- cluster_level: Level of the clustering variable
- n_clusters: Number of clusters
- n_treatment_changes_within_cluster: Clusters with treatment variation
- pct_clusters_with_variation: Percentage with variation
- recommendation: Suggested action if inconsistent
- details: Detailed explanation
- Return type:
ClusteringConsistencyResult
- Raises:
ValueError – If inputs are invalid.
Notes
A clustering choice is considered consistent if:
Less than 5% of clusters have within-cluster treatment variation
The cluster level is at the same level or higher than the unit
If treatment varies within clusters, standard errors may be conservative (too large), leading to under-rejection of the null hypothesis.
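The 5% rule can be sketched with pandas as below. The helper `within_cluster_variation` is hypothetical, it assumes a binary treatment indicator where a unit counts as treated if it is ever treated, and it assumes the percentage is expressed on a 0-100 scale; none of these details are confirmed by the documentation.

```python
import pandas as pd

def within_cluster_variation(data: pd.DataFrame, ivar: str, cluster_var: str, d: str):
    """Return (n clusters with variation, percentage of clusters, passes 5% rule)."""
    # Collapse to one row per unit: ever-treated status and the unit's cluster.
    units = data.groupby(ivar).agg(treated=(d, "max"), cluster=(cluster_var, "first"))
    # Count distinct treatment statuses among the units of each cluster.
    statuses = units.groupby("cluster")["treated"].nunique()
    n_varying = int((statuses > 1).sum())
    pct = 100.0 * n_varying / len(statuses)
    return n_varying, pct, pct < 5.0

# Toy panel: county a is treated, county b in the same state is not.
panel = pd.DataFrame({
    "county": ["a", "a", "b", "b", "c", "c"],
    "state": ["X", "X", "X", "X", "Y", "Y"],
    "treated": [0, 1, 0, 0, 0, 0],
})
n_varying, pct, passes = within_cluster_variation(panel, "county", "state", "treated")
print(n_varying, pct, passes)
```

Here state X mixes a treated and an untreated county, so one of the two clusters shows variation and the 5% rule fails.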
See also
diagnose_clustering: Get detailed diagnostics for clustering structure.
recommend_clustering_level: Get recommendation for clustering level.
Example Usage
Diagnosing Clustering Structure
from lwdid import diagnose_clustering
# Analyze potential clustering variables
diagnostics = diagnose_clustering(
    data=panel_data,
    ivar='county',
    potential_cluster_vars=['state', 'region'],
    d='treated'  # treatment indicator; use gvar=... for staggered designs
)
# Review cluster counts (cluster_structure maps variable name to ClusterVarStats)
for var_name, var_stats in diagnostics.cluster_structure.items():
    print(f"{var_name}: {var_stats.n_clusters} clusters")
# Check warnings (a list of strings)
for warning in diagnostics.warnings:
    print(warning)
Getting Clustering Recommendations
from lwdid import lwdid, recommend_clustering_level
# Get recommendation with minimum cluster count
recommendation = recommend_clustering_level(
    data=panel_data,
    ivar='county',
    tvar='year',
    potential_cluster_vars=['state', 'region'],
    d='treated',
    min_clusters=20
)
print(f"Recommended: {recommendation.recommended_var}")
print(f"Reasons: {'; '.join(recommendation.reasons)}")
# Use in estimation
results = lwdid(
    panel_data, y='outcome', d='treated', ivar='county', tvar='year',
    post='post', rolling='demean',
    vce='cluster', cluster_var=recommendation.recommended_var
)
Checking Consistency
from lwdid import check_clustering_consistency
# Verify clustering choice is appropriate
consistency = check_clustering_consistency(
    data=panel_data,
    ivar='county',
    cluster_var='state',
    d='treated'
)
print(f"Consistent: {consistency.is_consistent}")
if not consistency.is_consistent:
    # recommendation holds the suggested action; details holds the explanation
    print(f"Issue: {consistency.recommendation}")
Wild Cluster Bootstrap
When cluster counts are small (< 20), use wild cluster bootstrap:
from lwdid import wild_cluster_bootstrap
# Run wild cluster bootstrap
bootstrap_result = wild_cluster_bootstrap(
    data=transformed_data,
    y_transformed='y_dot',
    d='treated',
    cluster_var='state',
    n_bootstrap=999,
    weight_type='rademacher'
)
print(f"Bootstrap SE: {bootstrap_result.se_bootstrap:.4f}")
print(f"Bootstrap p-value: {bootstrap_result.pvalue:.4f}")
Guidelines
Minimum Cluster Counts:
20-30 clusters: Generally sufficient for cluster-robust standard errors
10-20 clusters: Use wild cluster bootstrap for improved inference
< 10 clusters: Wild cluster bootstrap essential; consider randomization inference
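The thresholds above can be written as a small lookup. The function name `inference_guidance` is hypothetical, and the treatment of the boundaries (exactly 20 clusters counts as sufficient, exactly 10 falls in the middle tier) is an assumption, since the ranges listed overlap at their endpoints.

```python
def inference_guidance(n_clusters: int) -> str:
    """Map a cluster count to the inference approach suggested by the guidelines."""
    if n_clusters >= 20:
        # 20 or more clusters: cluster-robust standard errors are generally sufficient.
        return "cluster-robust standard errors"
    if n_clusters >= 10:
        # 10 to 19 clusters: wild cluster bootstrap for improved inference.
        return "wild cluster bootstrap"
    # Fewer than 10 clusters: bootstrap is essential; also consider randomization inference.
    return "wild cluster bootstrap (consider randomization inference)"

for n in (48, 12, 6):
    print(n, "->", inference_guidance(n))
```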
Clustering Level Selection:
Cluster at the level where treatment varies
If treatment varies at multiple levels, cluster at the highest level
Never cluster below the unit level
See Also
Inference Module (inference) - Wild cluster bootstrap implementation
Methodological Notes - Theoretical foundations of clustering
lwdid.lwdid() - Main estimation with vce='cluster'