NOTE: This project is still very much a work in progress, with much of the larger model restructuring still to come. See the TODO file for more info.
First and foremost, this project is based on the dbt GA4 Package by Velir, but it has been modified and refactored for internal purposes. It uses Google Analytics 4 BigQuery Exports as its source data and offers base transformations that produce report-ready dimension & fact models for reporting, blending with other data, and/or feature engineering for ML models.
Find more info about Google Analytics 4 BigQuery Exports here.
Features Overview:
- Four final tables (ga4__events, ga4__pages, ga4__sessions, and ga4__users) that are completely unnested to be wide & denormalized for easy querying by the end user.
- Conversion of the date-sharded events_YYYYMMDD & events_intraday_YYYYMMDD tables into singular date-partitioned incremental base models.
- Dynamically flattens event_params into their own individual columns.
- Dynamically flattens user_props into their own individual columns.
- Dynamically extracts & flattens URL query_params (e.g., gclid, fbclid, _ga) into their own individual columns.
- Custom Variables. See here for more info.
- Custom Macros. See here for more info.
This project, and any future projects based on this initial dbt_ga4_project, will follow This Project's Style Guide (IN PROGRESS), which borrows ideas from the following Style Guides:
DAG Overview
NOTE: This DAG Image is NOT current & will continue to CHANGE until all models are finalized.
| Model Name | Description |
|---|---|
| ga4__events | This is the table for event-level metrics & dimensions, transformed to be wide & denormalized for easier querying. |
| ga4__pages | This is the table for page-level metrics & dimensions, such as page_views, exits, and users. This table is grouped by page_title, event_date, and page_path. |
| ga4__sessions | This is the table for session-level metrics & dimensions, such as is_engaged_session, engagement_duration, and page_views. This table is grouped by both session_key and user_key. |
| ga4__users | This is the table for user-level metrics & dimensions, such as first & last_seen_date, geo, and traffic_source. This table is grouped by the hashed user_key dimension, which is based on user_id, or user_pseudo_id if one doesn't exist. |
| Model Name | Description |
|---|---|
| stg_ga4__events | Creates a table with event data that is enhanced with useful event_keys, page_keys, session_keys, and user_keys. |
| stg_ga4__event_params | Creates a table that unnests all of the event parameters specific to each event (e.g. page_view, click, or scroll), except for those marked in the dbt_project.yml file. |
| stg_ga4__traffic_sources | Creates a table that designates a default_channel_grouping via the source, medium, campaign columns. |
| stg_ga4__user_props | Creates a table that unnests the user_properties, except for those marked in the dbt_project.yml file. |
| stg_ga4__query_params | Maps any and all query parameters (e.g. gclid, fbclid, etc.) contained in each event's page_location. |
| stg_ga4__conversions | Creates a table for the events that you mark as a conversion_event in the dbt_project.yml file. |
| int_ga4__events_joined | ...[TO DO]... |
| int_ga4__pages_grouped | ...[TO DO]... |
| int_ga4__sessions_grouped | ...[TO DO]... |
| int_ga4__users_grouped | ...[TO DO]... |
NOTE: These Macros are also not finalized & are likely to change.
get_first(by_column_name,from_column_name) source
This macro returns the FIRST value of the specified from_column_name, partitioned by the by_column_name.
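To make the partitioning idea concrete, here is a minimal Python sketch of what "first value per partition" means (the same logic with the last value mirrors get_last). It is not the macro's SQL: the helper name `first_and_last_by` is hypothetical, and it assumes the rows are already in event order, since the ordering column the macro uses is not documented here.

```python
def first_and_last_by(rows, by_column_name, from_column_name):
    """Return {partition_value: (first_value, last_value)}.

    Assumption: `rows` is already sorted in event order.
    """
    result = {}
    for row in rows:
        key = row[by_column_name]
        value = row[from_column_name]
        if key not in result:
            # First row seen for this partition: it is both first and last so far.
            result[key] = (value, value)
        else:
            # Keep the first value, overwrite the last value.
            result[key] = (result[key][0], value)
    return result


events = [
    {"session_key": "s1", "page_path": "/home"},
    {"session_key": "s1", "page_path": "/pricing"},
    {"session_key": "s2", "page_path": "/blog"},
    {"session_key": "s1", "page_path": "/checkout"},
]

landing_and_exit = first_and_last_by(events, "session_key", "page_path")
print(landing_and_exit["s1"])  # ('/home', '/checkout')
```

In this toy data, session s1 has /home as its landing page, which is the kind of value the landing_page example for this macro selects.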
Args:
- by_column_name (required): The name of the column to partition your selection by.
- from_column_name (required): The name of the column to get the first value of.

Usage:

    {{ get_first('<by_column_name>', '<from_column_name>') }}

Example: Get the landing_page of a corresponding Session by selecting the first page_path using that Session's session_key.

    SELECT
        {{ get_first('session_key', 'page_path') }} AS landing_page
    ...

get_last(by_column_name, from_column_name) source
This macro returns the LAST value of the specified from_column_name, partitioned by the by_column_name.

Args:
- by_column_name (required): The name of the column to partition your selection by.
- from_column_name (required): The name of the column to get the last value of.

Usage:

    {{ get_last('<by_column_name>', '<from_column_name>') }}

Example: Get the last event_key for a corresponding Session using that Session's session_key.

    SELECT
        {{ get_last('session_key', 'event_key') }} AS last_session_event_key,
    ...

extract_hostname_from_url(url) source
This macro extracts the hostname from a column containing a url.
Args:
- url (required): The column containing URLs.

Usage:

    {{ extract_hostname_from_url('<url>') }}

Example: Extract the hostname from the page_location column.

    SELECT
        {{ extract_hostname_from_url('page_location') }} AS page_hostname,
    ...

extract_query_string_from_url(url) source
This macro extracts the query_string from a column containing a url.
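For intuition about which parts of a URL this macro and extract_hostname_from_url refer to, the following Python sketch splits a sample URL with the standard library. The helper name `split_page_location` and the sample URL are illustrative assumptions. The macros themselves run as SQL in BigQuery, and details such as whether the leading `?` is kept may differ.

```python
from urllib.parse import urlsplit


def split_page_location(url):
    """Return (hostname, query_string) for a URL string."""
    parts = urlsplit(url)
    # `hostname` is lowercased and excludes any port;
    # `query` excludes the leading "?" and the "#fragment".
    return parts.hostname, parts.query


hostname, query_string = split_page_location(
    "https://www.example.com/pricing?utm_source=google&gclid=abc123#plans"
)
print(hostname)      # www.example.com
print(query_string)  # utm_source=google&gclid=abc123
```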
Args:
- url (required): The column containing URLs.

Usage:

    {{ extract_query_string_from_url('<url>') }}

Example: Extract the query_string from the page_location column.

    SELECT
        {{ extract_query_string_from_url('page_location') }} AS page_query_string,
    ...

remove_query_parameters(url, [parameters]) source
This macro removes the specified parameters from a column containing a url.
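As a rough illustration of the intended effect, the Python sketch below drops a list of parameters from a URL's query string and leaves everything else intact. It is only a behavioural sketch under assumptions: the helper name `strip_query_parameters` is hypothetical, and the macro's own SQL may handle edge cases (encoding, repeated keys, fragments) differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query_parameters(url, parameters):
    """Return `url` without the query parameters named in `parameters`."""
    excluded = set(parameters)
    parts = urlsplit(url)
    # Keep every key/value pair whose key is not excluded, preserving order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in excluded
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


clean_page_location = strip_query_parameters(
    "https://www.example.com/pricing?gclid=abc123&plan=pro&_ga=2.1",
    ["gclid", "fbclid", "_ga"],
)
print(clean_page_location)  # https://www.example.com/pricing?plan=pro
```

When every parameter is removed, the result has no trailing `?`, which keeps otherwise identical pages from being counted as separate URLs.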
Args:
- url (required): The column containing URLs.
- parameters (required, default=[]): A list of query parameters to remove from the URL.

Usage:

    {{ remove_query_parameters('<url>', '[parameters]') }}

Example: Remove the parameters gclid, fbclid, and _ga from the page_location column.

    {% set parameters = ['gclid','fbclid','_ga'] %}
    SELECT
        {{ remove_query_parameters('page_location', parameters) }} AS clean_page_location,
    ...

unnest_by_key(column_to_unnest, key_to_extract, value_type = "string") source
This macro unnests a single key's value from an array. This macro will dynamically alias the sub-query with the name of the column_to_unnest.
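The GA4 BigQuery export stores event_params as an array of key/value records, where each value holds typed fields such as string_value and int_value. The Python sketch below mimics pulling one key's value out of such an array. The helper name `value_for_key` and the sample data are assumptions for illustration; the macro itself generates a SQL sub-query and is not implemented this way.

```python
def value_for_key(params, key_to_extract, value_type="string"):
    """Return the typed value stored under `key_to_extract`, or None if absent."""
    field = f"{value_type}_value"  # e.g. "string_value" or "int_value"
    for param in params:
        if param["key"] == key_to_extract:
            return param["value"].get(field)
    return None


# Sample data shaped like the nested event_params column.
event_params = [
    {"key": "page_location",
     "value": {"string_value": "https://www.example.com/", "int_value": None}},
    {"key": "ga_session_number",
     "value": {"string_value": None, "int_value": 3}},
]

page_location = value_for_key(event_params, "page_location")
ga_session_number = value_for_key(event_params, "ga_session_number", "int")
print(page_location, ga_session_number)  # https://www.example.com/ 3
```

Asking for the wrong value_type returns an empty value here, which is one reason the value_type argument has to match how the parameter is stored.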
Args:
- column_to_unnest (required): The array column to unnest the key's value from.
- key_to_extract (required): The key to get the corresponding value for.
- value_type (optional, default="string"): The data type of the key's value column.

Usage:

    {{ unnest_by_key('<column_to_unnest>', '<key_to_extract>', '<value_type>') }}

Example: Unnest the corresponding values for the keys page_location and ga_session_number from the nested event_params column.

    SELECT
        -- Unnest the default STRING value type
        {{ unnest_by_key('event_params', 'page_location') }},
        -- Unnest the INT value type
        {{ unnest_by_key('event_params', 'ga_session_number', 'int') }},
    ...

unnest_by_key_alt(column_to_unnest, key_to_extract, value_type = "string") source
This macro unnests a single key's value from an array. This macro allows for a custom alias named sub-query.
Args:
- column_to_unnest (required): The array column to unnest the key's value from.
- key_to_extract (required): The key to get the corresponding value for.
- value_type (optional, default="string"): The data type of the key's value column.

Usage:

    {{ unnest_by_key_alt('<column_to_unnest>', '<key_to_extract>', '<value_type>') }} AS <custom_alias_name>,

Example: Unnest the corresponding values for the keys page_location and ga_session_number from the nested event_params column.

    SELECT
        -- Unnest the default STRING value type & use a custom alias
        {{ unnest_by_key_alt('event_params', 'page_location') }} AS url,
        -- Unnest the INT value type & use a custom alias
        {{ unnest_by_key_alt('event_params', 'ga_session_number', 'int') }} AS session_number,
    ...

get_event_params() source
This macro will dynamically return all of the keys and their corresponding value_types found in the event_params array column.
- This macro will exclude event_params added to the excluded_event_params variable, which is specified in the dbt_project.yml file.

Usage / Example:

    SELECT
        {% for event_param in get_event_params() -%}
        {{ unnest_by_key('event_params', event_param['event_param_key'], event_param['event_param_value']) }}
        {{- "," if not loop.last }}
        {% endfor %}
    ...

default_channel_grouping(source, medium, source_category) source
This macro determines the default_channel_grouping and will result in one of the following classifications:
- Direct
- Paid Social
- Organic Social
- Email
- Affiliates
- Paid Shopping
- Paid Search
- Display
- Other Advertising
- Organic Search
- Organic Video
- Organic Shopping
- Audio
- SMS
- (Other)
Args:
- source (required): The source column used in determining the default channel grouping.
- medium (required): The medium column used in determining the default channel grouping.
- source_category (required): The source category column used in determining the default channel grouping. These are designated in the ga4_source_categories.csv seed file.

Usage:

    {{ default_channel_grouping('<source>', '<medium>', '<source_category>') }}

Example:

    SELECT
        {{ default_channel_grouping('source', 'medium', 'source_category') }} AS default_channel_grouping,
    ...

| Seed File | Description |
|---|---|
| ga4_source_categories.csv | Google's mapping between source and source_category. More info and the download can be found here. |
Make sure to run dbt seed before running dbt run.
...[TO DO]...
This package assumes that you have an existing dbt project with a BigQuery profile and a BigQuery GCP instance available with GA4 event data loaded. Source data is located using the following variables, which must be set in your dbt_project.yml file.

    vars:
      project: '<gcp_project>' # Set your Project ID here.
      dataset: '<ga4_dataset>' # Set your Dataset name here.
      start_date: 'YYYYMMDD' # Set the start date that you want to retrieve data from.
      frequency: 'daily' # daily|streaming|daily+streaming Match to the type of export configured in GA4; daily+streaming appends today's intraday data to daily data.

If you don't have any GA4 data of your own, you can connect to Google's public data set with the following settings:

    vars:
      project: 'bigquery-public-data'
      dataset: 'ga4_obfuscated_sample_ecommerce'
      start_date: '20210120'

Find more info about the GA4 obfuscated dataset here.
NOTE: These Variables are also NOT finalized & are LIKELY to change.
Setting any query_parameter_exclusions will remove query string parameters from the page_location field for all downstream processing. Original parameters are captured in a new original_page_location field. Ex:

    vars:
      query_parameter_exclusions: ['gclid', 'fbclid', '_ga']

Specific events can be set as conversions with the conversion_events variable in your dbt_project.yml file. These events will be counted against each session and included in the final mart models. Ex:

    vars:
      conversion_events: ['purchase', 'download']

Specific events can be set as considerations with the consideration_events variable in your dbt_project.yml file. These events will be counted against each session and included in the final mart models. Ex:

    vars:
      consideration_events: ['cta_click', 'view_search_results']

Set specific events to be stages in a funnel.

    vars:
      funnel_stages: ['begin_checkout', 'add_shipping_info', 'add_payment_info', 'purchase']

Exclude specific events from the final tables.

    vars:
      excluded__events: ['session_start']

Exclude specific event parameters from the final tables.

    vars:
      excluded__event_params: ['ga_session_id', 'page_location', 'ga_session_number', 'session_engaged', 'engagement_time_msec', 'entrances', 'page_title', 'page_referrer', 'source', 'medium', 'campaign', 'debug_mode', 'term', 'clean_event', 'value', 'tax', 'coupon', 'promotion_name', 'transaction_id']

Exclude specific default columns from the final tables.

    vars:
      excluded__columns: ['event_previous_timestamp', 'event_bundle_sequence_id', 'event_server_timestamp_offset', 'user_id', 'user_pseudo_id', 'stream_id', 'ga_session_id', 'privacy_info', 'event_dimensions', 'app_info']

Exclude specific user properties from the final tables.

    vars:
      excluded__user_props: ['logged_in']

Include specific query parameters to be in the final tables.

    vars:
      included__query_params: ['utm_source', 'utm_medium', 'utm_campaign', 'utm_content', 'utm_term', 'gclid', 'fbclid', 'gclsrc', '_ga']

- GA4 Resources:
- SQL & BigQuery Resources:
- dbt Resources:
- Project References:
- GA4 dbt Package
- Stacktonic dbt Example Project
- Also inspired by this
