The purpose of the project is to provide a containerized database service loaded with CMS data in order to increase the utility of the dataset and reduce the technical overhead associated with consuming it.
Stakeholders pursuing healthcare price transparency can use this database to learn about the costs & volumes of HCPCS services, to examine provider performance (as reported in the source data and in additional derived analytic data), and to understand finer details related to healthcare costs.
Examples of detailed healthcare costs:
- variance between charges & payments
- variance in payments between providers for the same service
- average costs nationally for individual healthcare services
Examples of derived analytic data (a query sketch follows this list):
- Ranking of Providers by Total Services Provided
- Ranking of Providers by HCPCS Services Provided
- Estimated Sum Amount charged & received by Provider
- Estimated Sum Amount charged & received by Provider & HCPCS Service
- Estimated Sum Amount charged & received by HCPCS Service
- Ranking of Providers by Total $ Charged to CMS
- Ranking of Providers by Total $ Paid from CMS
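These derived metrics live in the analytic tables described later under Schema, so they can be queried directly. A minimal sketch against `service_provider_performance_summary` (the column names `npi`, `est_total_chrgd`, and `rank_est_total_chrgd` are assumptions for illustration; check the actual schema):

```sql
-- Sketch: top 10 providers by estimated total $ charged to CMS.
-- Table name is from the Schema section; the column names (npi,
-- est_total_chrgd, rank_est_total_chrgd) are assumptions.
SELECT npi,
       est_total_chrgd,
       rank_est_total_chrgd
FROM   service_provider_performance_summary
ORDER  BY rank_est_total_chrgd
LIMIT  10;
```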
For background knowledge, see:
- Source Data: CMS Utilization & Payment Data
- NPPES NPI: wikipedia
- NPPES NPI: api
- HCPCS: wikipedia
- HCPCS: CMS Datasets
- Quick Start
- Start with Prisma for GraphQL API
- Source data
- Schema
- Geocoding Addresses
- Prisma IDs
- GraphQL Queries
- Development
- Project Origin & Inspiration
- Additional Data Sets
- Versioning
- License
make start
- Postgres Database | port 5432
- Credentials
- database: govdata
- user: dbuser
- password: dbpassword
This command starts the database and restores the schema (structure & data) from a db.dump that is baked into the docker image.
This initialization process takes 10+ minutes to complete on the first run (a sample time report is below). Subsequent starts rely on persistent data.
:: real 11m5.184s
:: user 0m15.680s
:: sys 0m2.800s
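Once the restore completes, a quick row count against a few of the tables (described later under Schema) confirms the data landed. A minimal sketch, run with the credentials above:

```sql
-- Sanity check: confirm the restore populated the core tables.
-- Table names are from the Schema section of this README.
SELECT 'providers' AS tbl, count(*) AS row_count FROM providers
UNION ALL
SELECT 'services', count(*) FROM services
UNION ALL
SELECT 'provider_performance', count(*) FROM provider_performance;
```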
# get host ip address
# this is the ip the prisma container will call to reach the db container
ifconfig | grep inet
# update PRISMA_CONFIG in docker-compose.prisma.yml
# by replacing __REPLACE__WITH__LOCAL__IP__ with ip address from above
# start db & prisma service
make start-with-prisma
# ensure prisma is installed globally
sudo npm i -g prisma
# deploy prisma config to prisma server
npx prisma deploy
- Postgres Database | port 5432
- GraphQL Playground | port 4466
This command starts the database and a prisma service that provides a GraphQL API to the data.
Note 1: Use Node v10.
Note 2: Before you can successfully deploy the prisma services, you must update the host definition inside the PRISMA_CONFIG environment variable stored in docker-compose.prisma.yml. To accomplish this, simply replace the placeholder __REPLACE__WITH__LOCAL__IP__ (near line 18) with the IP of your local machine so that the prisma container can reach the DB container.
The foundation of this project is a .csv dataset titled Medicare Provider Utilization and Payment Data: Physician and Other Supplier PUF CY2016.
You can find a full description of the dataset on the CMS site, but for convenience, a summarized description is below:
The Centers for Medicare & Medicaid Services (CMS) has prepared a public data set that provides information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals. The dataset contains information on utilization, payment (allowed amount and Medicare payment), and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service. This data represents calendar year 2016 and contains 100% final-action physician/supplier Part B non-institutional line items for the Medicare fee-for-service population.
Within the database, the data is organized into 9 tables (5 source data tables & 4 analytic data tables). A join sketch follows the list.
- `services` contains all HCPCS services represented in the source file.
- `providers` (and related tables `providers_individuals` & `providers_organizations`) contains identity information & address for each provider (distinguished by `providers.entity_type`).
- `provider_performance` contains performance statistics for each HCPCS service provided by each provider.
- `service_performance` aggregates performance statistics for each HCPCS service across all providers.
- `service_provider_performance` extends the data contained in `provider_performance`, estimates cumulative amounts charged & paid (`n_of_svcs * amt`), and generates performance rankings by comparing each provider's performance attributes against all other providers that also perform the service.
- `service_provider_performance_summary` aggregates each provider's performance statistics across all HCPCS services. The summary also estimates cumulative amounts charged & paid (`n_of_svcs * amt`) and generates performance rankings by comparing each provider's performance attributes against all other providers.
- `service_provider_performance_summary_type` defines the different ways of generating summaries (provider groupings and/or service subsets) within `service_provider_performance_summary`. Currently supported summaries:
  - all providers + all services
  - all providers + non-drug HCPCS services
  - all providers + drug HCPCS services
  - planned feature: enterprise providers + all services
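To make the relationships concrete, here is a minimal join sketch across the source tables (the join columns `npi` and `hcpcs_code` are assumptions for illustration; verify against the actual schema):

```sql
-- Sketch: pair each provider with its per-service performance rows.
-- Table names are from the list above; the join columns
-- (npi, hcpcs_code) are assumptions.
SELECT p.npi,
       s.hcpcs_code,
       pp.*
FROM   providers p
JOIN   provider_performance pp ON pp.npi = p.npi
JOIN   services s ON s.hcpcs_code = pp.hcpcs_code
LIMIT  5;
```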
The next piece in this larger CMS project is a GraphQL API that will leverage this db along with two other data sources -- the NPI Registry Public Search and an Elastic instance seeded with records from this database.
There are several benefits of using Elastic in an application, but the most important to me is its ability to power geographic searches, which would allow users to search by proximity for providers that perform specific HCPCS services.
The value here is simple -- while being able to lookup the average costs for HCPCS services nationally is useful, being able to see what the costs are for the providers in your market has much greater utility.
Implementing proximity search is pretty easy using native features in Elastic, but before I could implement that with this dataset... I'd have to add Geo-Coordinates to over 500k unique addresses.
I had little appetite for setting up a job that would batch these addresses against the geocoding services I've used before (google, mapquest, census) -- as those are throttled and would require considerable time & effort.
Instead, I set out to find a geocoding service that would be ideal for a (large) one-time batch job. Pricing was very important, as it always is. As was speed -- as I recently started at a startup and have little time to hack.
After a few searches, I found a service that fit the bill: Geocodio.
Their pricing page included a pricing calculator with a toggle for one-time/recurring (which was exactly what I was looking for) and I found the price reasonable: ~$235 for 525k unique addresses. I was also pretty happy that the estimated processing time was 2 hours. Decision made *.
I exported the unique addresses from the dataset to a csv, uploaded to their service, and added the results to the original provider records in the providers table.
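With coordinates stored alongside each provider, even plain SQL can approximate a proximity search before Elastic enters the picture. A rough sketch using a bounding box (the `latitude`/`longitude` column names are assumptions):

```sql
-- Sketch: providers within ~0.5 degrees of a point (NYC-ish).
-- The latitude/longitude column names on providers are assumptions.
SELECT npi, latitude, longitude
FROM   providers
WHERE  latitude  BETWEEN 40.7 - 0.5 AND 40.7 + 0.5
AND    longitude BETWEEN -74.0 - 0.5 AND -74.0 + 0.5;
```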
And now that I've got the geocoordinates available in the db, I can seed an Elastic instance in a manner that will power autocomplete, suggestion, and geo search. For more detail on how that process works, check out the cms-utilization-search repo.
*NOTE: After selecting Geocodio as my vendor for geocoding services -- I reached out to them to see if they'd be willing to beat the estimate provided by their pricing calculator, because of the open-source nature of this project. Specifically, I offered them a mention here for a discounted rate and they agreed. #Victory
The first planned consumer of this DB was a GraphQL service.
That feature uses prisma to provide a GraphQL API to the data. Unfortunately, when developing that feature, I noticed that prisma currently doesn't support composite keys (although the feature is proposed for Datamodel v2).
To resolve this in the meantime, new auto-incrementing primary keys were defined (as `prisma_id`). The original keys still exist, but are no longer the primary keys for their respective tables.
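In practice the workaround looks roughly like the following (a sketch, not the actual migration; the table, constraint, and key column names are assumptions for illustration):

```sql
-- Sketch of the workaround: demote the composite key, add prisma_id.
-- provider_performance and its (npi, hcpcs_code) key are assumptions.
ALTER TABLE provider_performance DROP CONSTRAINT provider_performance_pkey;
ALTER TABLE provider_performance ADD COLUMN prisma_id SERIAL PRIMARY KEY;
-- The original key columns remain and can keep a unique index:
CREATE UNIQUE INDEX ON provider_performance (npi, hcpcs_code);
```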
GraphQL queries can be pretty overwhelming to the newly initiated.
In order to speed adoption (and ease developers into the data), I've built out a few GraphQL queries and saved them on GraphQL Bin -- which is like a pastebin for these kinds of things.
I'll add to this list regularly, so be sure to check back. Also, for anyone wanting to contribute... this is some low-hanging fruit.
make run
This command starts the DB service and mounts several local directories (scripts, sql, tmp) in the container.
# @docker-compose stop
make stop
# @docker-compose down --remove-orphan
make clean
These commands stop all containers (`make stop`) and remove all containers (`make clean`).
make db-dump-sql
This command connects to the container running the DB service and generates a sql export via pg_dump.
NOTE 1: This sql.dump file is large (~1 GB) and, as such, is not stored in the repo (see .gitignore). Whether for inspection or for publishing a new docker image, you must generate this file and have it locally before baking the docker image.
make db-load-via-sql
This command connects to the container running the DB service and restores the db using the sql.dump file generated above.
make db-load-via-etl
This command builds the database through an ETL process that downloads the original .csv, loads it into temp tables, then builds prod tables and populates them from the temp tables.
Anyone interested in validating the integrity of the data provided by the published docker image for this project should focus their attention here.
NOTE 1: ETL process takes 2.5+ hours.
NOTE 2: ETL process downloads the original CSV from the government’s servers and stores it inside the container at /tmp/cms-utilization.csv
If you're running the project in development mode (make run), this file should be available to you at /volumes/tmp/cms-utilization.csv
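The load follows the classic temp-table staging pattern; a condensed sketch of one step (table and column names here are illustrative, not the actual ETL SQL):

```sql
-- Condensed sketch of the ETL pattern: stage the raw CSV, then build
-- prod tables from the staged rows. The real staging table mirrors
-- every CSV column; it is trimmed here for brevity.
CREATE TEMP TABLE tmp_cms_raw (npi text, hcpcs_code text, line_srvc_cnt numeric);
COPY tmp_cms_raw FROM '/tmp/cms-utilization.csv' WITH (FORMAT csv, HEADER true);
INSERT INTO provider_performance (npi, hcpcs_code, n_of_svcs)
SELECT npi, hcpcs_code, line_srvc_cnt FROM tmp_cms_raw;
```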
make enter
This command is a shortcut for entering the running cms-utilization-db_cms-utilization-postgres_1 container. You're welcome.
I've been interested in healthcare utilization & payment data since I read an article by Steven Brill in Time Magazine in 2013.
In the years since, I've seen dozens of smaller, similar reports in various media, but I've yet to see anything as comprehensive and detailed as the original piece.
In appreciation of Brill's work, and to direct interested parties towards information that is probably relevant to their projects, I'm publishing links to some of those resources here.
- Original Time Article
- Title: Bitter Pill: Why Medical Bills Are Killing Us
- Author: Steven Brill
- Date: Apr 04, 2013
- Minutes from Congressional Hearing
- Event: Senate Hearing before the Committee on Finance
- Title: High Prices, Low Transparency: The Bitter Pill of Health Care Costs
- Date: June 18, 2013
- Note: Pay specific attention to Brill's opening statement on pages 5-6 (PDF pages 9-10).
- Federal Register Volume 83, Number 192
- Section X: Requirements for Hospitals To Make Public a List of Their Standard Charges via the Internet
- CMS FAQs about 2019 Pricing Transparency Requirements
Similar projects can be built using data from these additional data stores:
- CMS Data Catalog
- Open Payments: Overview
- Open Payments: Data
- Medicare.gov: Data
- Medicaid & CHIP Open Data
- DATA.GOV: Health Data Catalog
- HealthData.gov
- Chronic Conditions Data Warehouse
We use SemVer for versioning. For the versions available, see the tags on this repository.
This project is licensed under the MIT License - see the LICENSE.md file for details.

