The purpose of the project is to provide a containerized database service loaded with CMS data in order to increase the utility of the dataset and reduce the technical overhead associated with consuming it.
Stakeholders pursuing healthcare price transparency can use this database to learn about the costs & volumes of HCPCS services, to examine provider performance (as reported in the source data and in additional derived analytic data), and to understand finer details related to healthcare costs.
Examples of detailed healthcare costs:
- variance between charges & payments
- variance in payments between providers for the same service
- average costs nationally for individual healthcare services
Examples of derived analytic data (a query sketch follows this list):
- Ranking of Providers by Total Services Provided
- Ranking of Providers by HCPCS Services Provided
- Estimated Sum Amount charged & received by Provider
- Estimated Sum Amount charged & received by Provider & HCPCS Service
- Estimated Sum Amount charged & received by HCPCS Service
- Ranking of Providers by Total $ Charged to CMS
- Ranking of Providers by Total $ Paid from CMS
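These derived metrics live in the analytic tables described later under Schema, so they can be queried directly. A minimal sketch against `service_provider_performance_summary` (the column names `npi`, `est_total_chrgd`, and `rank_est_total_chrgd` are assumptions for illustration; check the actual schema):

```sql
-- Sketch: top 10 providers by estimated total $ charged to CMS.
-- Table name is from the Schema section; the column names (npi,
-- est_total_chrgd, rank_est_total_chrgd) are assumptions.
SELECT npi,
       est_total_chrgd,
       rank_est_total_chrgd
FROM   service_provider_performance_summary
ORDER  BY rank_est_total_chrgd
LIMIT  10;
```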
For background knowledge, see:
- Source Data: CMS Utilization & Payment Data
- NPPES NPI: wikipedia
- NPPES NPI: api
- HCPCS: wikipedia
- HCPCS: CMS Datasets
- Quick Start
- Start with Prisma for GraphQL API
- Source data
- Schema
- Geocoding Addresses
- Prisma IDs
- GraphQL Queries
- Development
- Project Origin & Inspiration
- Additional Data Sets
- Versioning
- License
make start
- Postgres Database | port 5432
- Credentials
- database: govdata
- user: dbuser
- password: dbpassword
This command starts the database and restores the schema (structure & data) from a db.dump that is baked into the docker image.
This initialization process takes 10+ minutes to complete on the first run (a sample time report is below). Subsequent starts rely on persistent data.
:: real 11m5.184s
:: user 0m15.680s
:: sys 0m2.800s
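Once the restore completes, a quick row count against a few of the tables (described later under Schema) confirms the data landed. A minimal sketch, run with the credentials above:

```sql
-- Sanity check: confirm the restore populated the core tables.
-- Table names are from the Schema section of this README.
SELECT 'providers' AS tbl, count(*) AS row_count FROM providers
UNION ALL
SELECT 'services', count(*) FROM services
UNION ALL
SELECT 'provider_performance', count(*) FROM provider_performance;
```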
# get host ip address
# this is the ip the prisma container will call to reach the db container
ifconfig | grep inet
# update PRISMA_CONFIG in docker-compose.prisma.yml
# by replacing __REPLACE__WITH__LOCAL__IP__ with ip address from above
# start db & prisma service
make start-with-prisma
# ensure prisma is installed globally
sudo npm i -g prisma
# deploy prisma config to prisma server
npx prisma deploy
- Postgres Database | port 5432
- GraphQL Playground | port 4466
This command starts the database and a prisma service that provides a GraphQL API to the data.
Note 1: Use Node v10.
Note 2: Before you can successfully deploy the prisma services, you must update the host definition inside the PRISMA_CONFIG environment variable stored in docker-compose.prisma.yml. To accomplish this, simply replace the placeholder __REPLACE__WITH__LOCAL__IP__ (near line 18) with the IP of your local machine so that the prisma container can reach the DB container.
The foundation of this project is a .csv dataset titled Medicare Provider Utilization and Payment Data: Physician and Other Supplier PUF CY2016.
You can find a full description of the dataset on the CMS site, but for convenience, a summarized description is below:
The Centers for Medicare & Medicaid Services (CMS) has prepared a public data set that provides information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals. The dataset contains information on utilization, payment (allowed amount and Medicare payment), and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service. This data represents calendar year 2016 and contains 100% final-action physician/supplier Part B non-institutional line items for the Medicare fee-for-service population.
Within the database, the data is organized into 9 tables (5 source data tables & 4 analytic data tables). A join sketch follows the list.
- `services` contains all HCPCS services represented in the source file.
- `providers` (and related tables `providers_individuals` & `providers_organizations`) contains identity information & address for each provider (distinguished by `providers.entity_type`).
- `provider_performance` contains performance statistics for each HCPCS service provided by each provider.
- `service_performance` aggregates performance statistics for each HCPCS service across all providers.
- `service_provider_performance` extends the data contained in `provider_performance`, estimates cumulative amounts charged & paid (`n_of_svcs * amt`), and generates performance rankings by comparing each provider's performance attributes against all other providers that also perform the service.
- `service_provider_performance_summary` aggregates each provider's performance statistics across all HCPCS services. The summary also estimates cumulative amounts charged & paid (`n_of_svcs * amt`) and generates performance rankings by comparing each provider's performance attributes against all other providers.
- `service_provider_performance_summary_type` defines the different ways of generating summaries (provider groupings and/or service subsets) within `service_provider_performance_summary`. Currently supported summaries:
  - all providers + all services
  - all providers + non-drug HCPCS services
  - all providers + drug HCPCS services
  - planned feature: enterprise providers + all services
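To make the relationships concrete, here is a minimal join sketch across the source tables (the join columns `npi` and `hcpcs_code` are assumptions for illustration; verify against the actual schema):

```sql
-- Sketch: pair each provider with its per-service performance rows.
-- Table names are from the list above; the join columns
-- (npi, hcpcs_code) are assumptions.
SELECT p.npi,
       s.hcpcs_code,
       pp.*
FROM   providers p
JOIN   provider_performance pp ON pp.npi = p.npi
JOIN   services s ON s.hcpcs_code = pp.hcpcs_code
LIMIT  5;
```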
The next piece in this larger CMS project is a GraphQL API that will leverage this db along with two other data sources -- the NPI Registry Public Search and an Elastic instance seeded with records from this database.
There are several benefits of using Elastic in an application, but the most important to me is its ability to power geographic searches, which would allow users to search by proximity for providers that perform specific HCPCS services.
The value here is simple -- while being able to lookup the average costs for HCPCS services nationally is useful, being able to see what the costs are for the providers in your market has much greater utility.
Implementing proximity search is pretty easy using native features in Elastic, but before I could implement that with this dataset... I'd have to add Geo-Coordinates to over 500k unique addresses.
I had little appetite for setting up a job that would batch these addresses against the geocoding services I've used before (google, mapquest, census) -- as those are throttled and would require considerable time & effort.
Instead, I set out to find a geocoding service that would be ideal for a (large) one-time batch job. Pricing was very important, as it always is. As was speed -- as I recently started at a startup and have little time to hack.
After a few searches, I found a service that fit the bill: Geocodio.
Their pricing page included a pricing calculator with a toggle for one-time/recurring (which was exactly what I was looking for) and I found the price reasonable: ~$235 for 525k unique addresses. I was also pretty happy that the estimated processing time was 2 hours. Decision made *.
I exported the unique addresses from the dataset to a csv, uploaded to their service, and added the results to the original provider records in the providers table.
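With coordinates stored alongside each provider, even plain SQL can approximate a proximity search before Elastic enters the picture. A rough sketch using a bounding box (the `latitude`/`longitude` column names are assumptions):

```sql
-- Sketch: providers within ~0.5 degrees of a point (NYC-ish).
-- The latitude/longitude column names on providers are assumptions.
SELECT npi, latitude, longitude
FROM   providers
WHERE  latitude  BETWEEN 40.7 - 0.5 AND 40.7 + 0.5
AND    longitude BETWEEN -74.0 - 0.5 AND -74.0 + 0.5;
```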
And now that I've got the geocoordinates available in the db, I can seed an Elastic instance in a manner that will power autocomplete, suggestion, and geo search. For more detail on how that process works, check out the cms-utilization-search repo.
*NOTE: After selecting Geocodio as my vendor for geocoding services -- I reached out to them to see if they'd be willing to beat the estimate provided by their pricing calculator, because of the open-source nature of this project. Specifically, I offered them a mention here for a discounted rate and they agreed. #Victory
The first planned consumer of this DB was a GraphQL service.
That feature uses prisma to provide a GraphQL API to the data. Unfortunately, when developing that feature, I noticed that prisma currently doesn't support composite keys (although the feature is proposed for Datamodel v2).
To resolve this in the meantime, new auto-incrementing primary keys were defined (as `prisma_id`). The original keys still exist, but are no longer the primary keys for their respective tables.
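In practice the workaround looks roughly like the following (a sketch, not the actual migration; the table, constraint, and key column names are assumptions for illustration):

```sql
-- Sketch of the workaround: demote the composite key, add prisma_id.
-- provider_performance and its (npi, hcpcs_code) key are assumptions.
ALTER TABLE provider_performance DROP CONSTRAINT provider_performance_pkey;
ALTER TABLE provider_performance ADD COLUMN prisma_id SERIAL PRIMARY KEY;
-- The original key columns remain and can keep a unique index:
CREATE UNIQUE INDEX ON provider_performance (npi, hcpcs_code);
```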
GraphQL queries can be pretty overwhelming to the newly initiated.
In order to speed adoption (and ease developers into the data), I've built out a few GraphQL queries and saved them on GraphQL Bin -- which is like a pastebin for these kinds of things.
I'll add to this list regularly, so be sure to check back. Also, for anyone wanting to contribute... this is some low-hanging fruit.
make run
This command starts the DB service and mounts several local directories (scripts, sql, tmp) in the container.
# @docker-compose stop
make stop
# @docker-compose down --remove-orphan
make clean
These commands stop all containers (`make stop`) and remove all containers (`make clean`).
make db-dump-sql
This command connects to the container running the DB service and generates a sql export via pg_dump.
NOTE 1: This sql.dump file is large (~1 GB) and, as such, is not stored in the repo (see .gitignore). Whether for inspection or for publishing a new docker image, you must generate this file and have it locally before baking the docker image.
make db-load-via-sql
This command connects to the container running the DB service and restores the db using the sql.dump file generated above.
make db-load-via-etl
This command builds the database through an ETL process that downloads the original .csv, loads it into temp tables, then builds prod tables and populates them from the temp tables.
Anyone interested in validating the integrity of the data provided by the published docker image for this project should focus their attention here.
NOTE 1: ETL process takes 2.5+ hours.
NOTE 2: ETL process downloads the original CSV from the government’s servers and stores it inside the container at /tmp/cms-utilization.csv
If you're running the project in development mode (make run), this file should be available to you at /volumes/tmp/cms-utilization.csv
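The load follows the classic temp-table staging pattern; a condensed sketch of one step (table and column names here are illustrative, not the actual ETL SQL):

```sql
-- Condensed sketch of the ETL pattern: stage the raw CSV, then build
-- prod tables from the staged rows. The real staging table mirrors
-- every CSV column; it is trimmed here for brevity.
CREATE TEMP TABLE tmp_cms_raw (npi text, hcpcs_code text, line_srvc_cnt numeric);
COPY tmp_cms_raw FROM '/tmp/cms-utilization.csv' WITH (FORMAT csv, HEADER true);
INSERT INTO provider_performance (npi, hcpcs_code, n_of_svcs)
SELECT npi, hcpcs_code, line_srvc_cnt FROM tmp_cms_raw;
```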
make enter
This command is a shortcut for entering the running cms-utilization-db_cms-utilization-postgres_1 container. You're welcome.
I've been interested in healthcare utilization & payment data since I read an article by Steven Brill in Time Magazine in 2013.
In the years since, I've seen dozens of smaller, similar reports in various media, but I've yet to see anything as comprehensive and detailed as the original piece.
In appreciation of Brill's work, and to direct interested parties towards information that is probably relevant to their projects, I'm publishing links to some of those resources here.
- Original Time Article
- Title: Bitter Pill: Why Medical Bills Are Killing Us
- Author: Steven Brill
- Date: Apr 04, 2013
- Minutes from Congressional Hearing
- Event: Senate Hearing before the Committee on Finance
- Title: High Prices, Low Transparency: The Bitter Pill of Health Care Costs
- Date: June 18, 2013
- Note: Pay specific attention to Brill's opening statement on pages 5-6 (PDF pages 9-10).
- Federal Register Volume 83, Number 192
- Section X: Requirements for Hospitals To Make Public a List of Their Standard Charges via the Internet
- CMS FAQs about 2019 Pricing Transparency Requirements
Similar projects can be built using data from these additional data stores:
- CMS Data Catalog
- Open Payments: Overview
- Open Payments: Data
- Medicare.gov: Data
- Medicaid & CHIP Open Data
- DATA.GOV: Health Data Catalog
- HealthData.gov
- Chronic Conditions Data Warehouse
We use SemVer for versioning. For the versions available, see the tags on this repository.
This project is licensed under the MIT License - see the LICENSE.md file for details.

