Methodology
How we collected, processed, and analyzed approximately 227 million aggregated Medicaid records to build Follow the Billions.
1. Data Source
T-MSIS via CMS Integrated Data Repository
The primary data source is the Transformed Medicaid Statistical Information System (T-MSIS) Analytic Files, maintained by the Centers for Medicare & Medicaid Services (CMS). These files represent the most comprehensive source of Medicaid claims data available, capturing enrollment, claims, and provider information across all 50 states, the District of Columbia, and U.S. territories.
Data was accessed through the CMS Integrated Data Repository (IDR), which provides aggregated, de-identified datasets for research purposes. Our extract covers the period from January 2018 through December 2024, comprising approximately 227 million aggregated records.
2. Data Schema
| Field | Type | Description |
|---|---|---|
| BILLING_PROVIDER_NPI_NUM | VARCHAR(10) | National Provider Identifier of the billing entity. Grouping key for provider-level aggregation. |
| SERVICING_PROVIDER_NPI_NUM | VARCHAR(10) | NPI of the provider who delivered the service. Used for network analysis when different from billing NPI. |
| HCPCS_CODE | VARCHAR(5) | Healthcare Common Procedure Coding System code. Identifies the service or procedure billed. |
| CLAIM_FROM_MONTH | DATE | Month of service. Aggregated from claim-level dates to monthly granularity. |
| TOTAL_UNIQUE_BENEFICIARIES | INTEGER | Count of distinct Medicaid beneficiaries served by this provider/procedure/month combination. |
| TOTAL_CLAIMS | INTEGER | Total number of claims submitted for this provider/procedure/month combination. |
| TOTAL_PAID | DECIMAL | Total amount paid by Medicaid for this provider/procedure/month combination, in USD. |
3. Aggregation Methods
Temporal Aggregation
Claims data is aggregated at monthly granularity. Each record represents the sum of all claims for a specific provider-procedure-month combination. This allows trend analysis at sub-annual resolution while maintaining manageable data volumes. Yearly aggregations are derived from monthly totals for growth rate and summary statistics.
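As a rough illustration of how yearly totals and growth rates can be derived from monthly records, the sketch below sums TOTAL_PAID by calendar year and computes year-over-year change. The sample rows and the function names `yearly_totals` and `yoy_growth` are hypothetical and are not taken from the project's actual pipeline.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Hypothetical sample rows: (CLAIM_FROM_MONTH, TOTAL_CLAIMS, TOTAL_PAID).
monthly_rows = [
    (date(2018, 1, 1), 10, Decimal("1000.00")),
    (date(2018, 2, 1), 12, Decimal("1500.00")),
    (date(2019, 1, 1), 20, Decimal("3000.00")),
]

def yearly_totals(rows):
    """Sum monthly TOTAL_PAID into one figure per calendar year."""
    totals = defaultdict(Decimal)
    for month, _claims, paid in rows:
        totals[month.year] += paid
    return dict(totals)

def yoy_growth(totals):
    """Year-over-year percentage change, for years whose prior year is present."""
    growth = {}
    for year in sorted(totals):
        prev = totals.get(year - 1)
        if prev:  # skip years with missing or zero prior-year spending
            growth[year] = float((totals[year] - prev) / prev * 100)
    return growth

totals = yearly_totals(monthly_rows)
growth = yoy_growth(totals)
```

With these sample rows, 2018 totals 2,500 and 2019 totals 3,000, so the 2019 growth rate is 20%. The first year in the series has no growth rate because there is no prior year to compare against.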
Provider Aggregation
Providers are identified by their National Provider Identifier (NPI). Spending is aggregated across all procedure codes and months to produce total provider-level spending figures. Provider rankings use cumulative 2018-2024 spending. Concentration metrics (e.g., top-N share of spending) are calculated from these provider-level totals.
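A minimal sketch of the provider-level roll-up and a top-N concentration ratio follows. The NPIs, codes, and dollar amounts are made-up sample values, and `provider_totals` and `top_n_share` are illustrative names rather than the project's own functions.

```python
from collections import defaultdict

# Hypothetical rows: (BILLING_PROVIDER_NPI_NUM, HCPCS_CODE, TOTAL_PAID).
rows = [
    ("1000000001", "T1019", 600.0),
    ("1000000001", "S5125", 200.0),
    ("1000000002", "T1019", 150.0),
    ("1000000003", "H2016", 50.0),
]

def provider_totals(rows):
    """Aggregate spending across all procedure codes (and months) per NPI."""
    totals = defaultdict(float)
    for npi, _code, paid in rows:
        totals[npi] += paid
    return dict(totals)

def top_n_share(totals, n):
    """Share of all spending that goes to the n highest-spending providers."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0
    top = sorted(totals.values(), reverse=True)[:n]
    return sum(top) / grand_total

totals = provider_totals(rows)
share_top1 = top_n_share(totals, 1)
share_top2 = top_n_share(totals, 2)
```

In this toy data the largest provider accounts for 80% of spending and the top two for 95%.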
Procedure Code Aggregation
HCPCS procedure codes are aggregated across all providers and months to identify system-wide service patterns. Category labels (Personal Care, Residential, etc.) are assigned based on CMS code descriptions and manual classification of the top 200 codes by spending volume.
Derived Metrics
- Per-claim cost: TOTAL_PAID / TOTAL_CLAIMS
- Per-beneficiary cost: TOTAL_PAID / TOTAL_UNIQUE_BENEFICIARIES
- Growth rates: Year-over-year percentage change in spending
- Concentration ratios: Cumulative share of spending by top-N providers
- Claims velocity: TOTAL_CLAIMS / time period (e.g., claims per day)
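The ratio metrics above can be sketched as follows. The helper names and the choice to return `None` for a zero denominator are assumptions made for illustration; the source does not say how empty denominators are handled.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def derived_metrics(total_paid, total_claims, total_beneficiaries, days_in_period):
    """Compute the ratio metrics for one provider/procedure/period record."""
    return {
        "per_claim_cost": safe_div(total_paid, total_claims),
        "per_beneficiary_cost": safe_div(total_paid, total_beneficiaries),
        "claims_per_day": safe_div(total_claims, days_in_period),
    }

# Hypothetical record: $1,200 paid, 40 claims, 10 beneficiaries, 30-day month.
metrics = derived_metrics(1200.0, 40, 10, 30)
```

One caution when applying these ratios to multi-month totals: distinct-beneficiary counts are not additive, so summing TOTAL_UNIQUE_BENEFICIARIES across months counts a returning beneficiary more than once and would understate per-beneficiary cost.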
4. Network Methodology
Bipartite Graph Construction
The provider network is modeled as a bipartite graph with two node types: providers (identified by NPI) and procedure codes (identified by HCPCS code). An edge connects a provider to a procedure code if the provider billed that code during the analysis period. Edge weights reflect total spending on that provider-procedure pair.
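One way to represent this graph is a nested mapping from provider to procedure code to summed spending. The sketch below uses made-up sample rows, and `build_bipartite` is a hypothetical name, not the project's actual code.

```python
from collections import defaultdict

# Hypothetical rows: (BILLING_PROVIDER_NPI_NUM, HCPCS_CODE, TOTAL_PAID),
# one row per provider/procedure/month.
rows = [
    ("1000000001", "T1019", 400.0),
    ("1000000001", "T1019", 200.0),  # same pair, different month
    ("1000000001", "S5125", 100.0),
    ("1000000002", "T1019", 150.0),
]

def build_bipartite(rows):
    """Return {npi: {hcpcs_code: total_paid}}.

    Each inner key is an edge from a provider node to a procedure-code node;
    the value is the edge weight (total spending on that pair).
    """
    edges = defaultdict(lambda: defaultdict(float))
    for npi, code, paid in rows:
        edges[npi][code] += paid
    return {npi: dict(codes) for npi, codes in edges.items()}

graph = build_bipartite(rows)
```

Here the two monthly rows for the same provider and code collapse into a single edge with weight 600.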
Community Detection
Community structure is identified using modularity-based clustering on the projected provider-provider network, in which two providers are connected if they share procedure codes. This reveals groups of providers that operate in similar service spaces, such as Personal Care, Residential, and Behavioral Health. The algorithm identifies 9 distinct communities in the top-155 provider network.
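To make the projection step and the modularity objective concrete, here is a small sketch. It assumes that one shared code is enough to connect two providers and that edges are unweighted; the source does not specify either choice, nor which clustering algorithm was used. The code scores a given partition and does not search for the best one.

```python
from itertools import combinations

# Hypothetical providers mapped to the set of codes each one billed.
provider_codes = {
    "A": {"T1019", "S5125"},
    "B": {"T1019"},
    "C": {"H2016"},
    "D": {"H2016", "H0004"},
}

def project_providers(provider_codes):
    """Connect two providers when they billed at least one code in common."""
    edges = set()
    for p, q in combinations(sorted(provider_codes), 2):
        if provider_codes[p] & provider_codes[q]:
            edges.add((p, q))
    return edges

def modularity(edges, communities):
    """Newman modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    degree = {}
    for p, q in edges:
        degree[p] = degree.get(p, 0) + 1
        degree[q] = degree.get(q, 0) + 1
    total = 0.0
    for community in communities:
        internal = sum(1 for p, q in edges if p in community and q in community)
        degree_sum = sum(degree.get(node, 0) for node in community)
        total += internal / m - (degree_sum / (2 * m)) ** 2
    return total

edges = project_providers(provider_codes)
q_split = modularity(edges, [{"A", "B"}, {"C", "D"}])
q_merged = modularity(edges, [{"A", "B", "C", "D"}])
```

Splitting the toy network into its two natural groups scores Q = 0.5, while lumping all four providers together scores Q = 0. A modularity-based algorithm searches for the partition with the highest score.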
Force-Directed Simulation
Network visualization uses a D3.js force-directed layout. Nodes are positioned using charge repulsion, link attraction, and collision detection forces. Node size encodes total spending (log scale). Node color encodes community membership. The simulation runs in the browser, allowing interactive exploration of the network structure.
Bridge Providers
Bridge providers are identified by their betweenness centrality in the projected network — providers with connections to multiple communities who serve as structural connectors in the system. Approximately 8 providers in the top-155 network exhibit significant bridge behavior, connecting otherwise separate service communities.
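Betweenness centrality counts how many shortest paths between other nodes pass through a given node. The sketch below implements Brandes' algorithm for an unweighted, undirected graph on a toy network. The graph, the node labels, and the use of unnormalized scores are illustrative assumptions, not the project's actual configuration.

```python
from collections import deque

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes) for an undirected graph."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search, counting shortest paths from the source.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, processing the farthest nodes first.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}

# Toy network: two triangles (A, B, C) and (D, E, F) joined through X.
edge_list = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "X"),
             ("X", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
adj = {}
for u, v in edge_list:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = betweenness(adj)
bridge = max(scores, key=scores.get)
```

In this toy network, X sits on every shortest path between the two triangles and receives the highest score, which is the pattern the bridge-provider analysis looks for.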
5. Limitations
2024 Data is Preliminary
The 2024 T-MSIS data is preliminary and subject to revision by CMS. Because of claims processing lags, 2024 figures may be understated, particularly for Q3 and Q4. The 76% spending decline in 2024 is partly attributable to this reporting lag, in addition to the enrollment unwinding.
Redacted NPIs
Some provider NPIs are redacted in the public-use analytic files. As a result, certain high-volume providers, including the largest in the dataset, cannot be identified by name or organization. This is a significant limitation for transparency analysis.
No Outcome Data
The T-MSIS analytic files contain claims and spending data but no clinical outcome measures. High spending does not necessarily indicate waste, fraud, or poor care; it may reflect higher acuity, better access, or different service models. Conversely, low spending does not necessarily indicate efficiency.
State Reporting Variation
States vary in their T-MSIS reporting completeness and timeliness. Some states have data quality issues that may affect cross-state comparisons. We do not adjust for state-level reporting quality in the aggregate analysis.
Aggregation Level
All data is pre-aggregated at the provider-procedure-month level. Individual claim-level detail and beneficiary-level information are not available in this dataset. This limits the types of analysis that can be performed (e.g., no patient journey analysis, no individual claim auditing).
6. Citation Guide
Citing the Podcast
Cole, M. & Chen, S. (2026). Follow the Billions: Inside the Medicaid Machine. DataPulse Media. https://followthebillions.com
Citing a Specific Episode
Cole, M. & Chen, S. (2026). "The Billion Dollar Question" [Podcast episode]. In Follow the Billions (Season 1, Episode 1). DataPulse Media. https://followthebillions.com/episodes/1/1
Citing the Data
Centers for Medicare & Medicaid Services (CMS). (2024). T-MSIS Analytic Files. Accessed via CMS Integrated Data Repository. Analysis by Follow the Billions.
If you use data, analysis, or findings from Follow the Billions in your own research, reporting, or policy work, we ask that you cite the source appropriately. For press inquiries or data sharing requests, contact us through the links in the footer.