Direct Data API is a new class of API that provides high-speed, read-only data access to Vault. It is a reliable, easy-to-use, timely, and consistent API for extracting Vault data, designed for organizations that wish to replicate large amounts of Vault data to an external database, data warehouse, or data lake.
The Direct Data accelerators are discrete groups of Python scripts, intended to facilitate the loading of data from Vault to external systems. Each accelerator provides a working example of using Direct Data API to connect Vault to an object storage system and a target data system.
This project provides accelerator implementations that facilitate the loading of data from Vault to the following systems:
- Vault -> AWS S3 -> Snowflake
- Vault -> AWS S3 -> Databricks
- Vault -> AWS S3 -> Redshift
- Vault -> Azure Blob Storage -> Azure SQL Database
- Vault -> Azure Blob Storage -> Microsoft Fabric Warehouse
These accelerators perform the following fundamental processes:
- Download Direct Data files from Vault and upload them to object storage (currently AWS S3 or Azure Blob Storage)
- Extract content from the archived Direct Data file
- Optionally convert the extracted CSV files to Parquet
- Load the data into the target data system
- Optionally extract document source content from Vault and upload to object storage
- Optionally retrieve document version text from Vault and upload to object storage
The architecture of the accelerators is designed to be easily extendable, so that they can be custom fit to individual developer needs and systems. The core components of each accelerator are described below.
Classes:
The four fundamental classes leveraged by each accelerator are:
- VaultService: This class handles all interactions with Vault. This primarily consists of executing API calls (e.g., authentication, listing and downloading Direct Data files, extracting document source content, and downloading document text).
- ObjectStorageService: This class interacts with the object storage system that stores the files extracted from Vault. This includes uploading and downloading files, as well as managing file paths. Currently, this service supports AWS S3 and Azure Blob Storage.
- DatabaseService: This class handles interactions with the target database. This includes loading Full, Incremental, and Log files. Table schemas are managed here as well.
- DatabaseConnection: This class handles connecting to the target database, activating a database cursor, and executing SQL commands. It is used by the DatabaseService to execute the specific database SQL commands.
These classes, except for the VaultService class, can be extended to support any target system.
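For example, supporting a new target system typically means providing new DatabaseConnection and DatabaseService implementations while reusing the Vault and object storage layers. The sketch below assumes a hypothetical PostgreSQL target; the class names, method names, and the psycopg2 driver are illustrative choices, not part of the accelerator codebase.

```python
# Minimal sketch of extending the accelerator classes for a new target system.
# All names below (PostgresConnection, PostgresService, execute, load_csv) are
# illustrative assumptions; check the actual base classes in the accelerator
# source for the real interfaces.
import psycopg2  # hypothetical driver choice for a PostgreSQL target


class PostgresConnection:
    """Plays the DatabaseConnection role: connect, open a cursor, run SQL."""

    def __init__(self, params: dict):
        self._conn = psycopg2.connect(
            host=params["host"],
            port=params["port"],
            user=params["user"],
            password=params["password"],
            dbname=params["database"],
        )

    def execute(self, sql: str) -> None:
        # Activate a cursor, run the SQL command, and commit the transaction.
        with self._conn.cursor() as cur:
            cur.execute(sql)
        self._conn.commit()


class PostgresService:
    """Plays the DatabaseService role: owns the schema and issues the SQL
    needed to load Full, Incremental, and Log files."""

    def __init__(self, params: dict):
        self.connection = PostgresConnection(params)
        self.schema = params["schema"]

    def load_csv(self, table: str, csv_path: str) -> None:
        # The target-specific bulk-load statement goes here; COPY is one option.
        self.connection.execute(
            f"COPY {self.schema}.{table} FROM '{csv_path}' CSV HEADER"
        )
```

An implementation like this would then be instantiated in accelerator.py in place of the RedshiftService shown in the Scripts section below.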
Configuration Files:
Each accelerator includes two configuration files that contain the required parameters for connecting to Vault and the external systems. Each accelerator's config file has examples of the required parameters for that specific implementation. The files are described below.
vapil_settings.json: This file contains the required parameters to authenticate to Vault, and will vary depending on whether a Basic username/password or OAuth security policy is used in the target Vault.
```json
{
  "authenticationType": "BASIC",
  "idpOauthAccessToken": "",
  "idpOauthScope": "openid",
  "idpUsername": "",
  "idpPassword": "",
  "vaultUsername": "[email protected]",
  "vaultPassword": "Password123",
  "vaultDNS": "cholecap.veevavault.com",
  "vaultSessionId": "",
  "vaultClientId": "Cholecap-Vault-",
  "vaultOauthClientId": "",
  "vaultOauthProfileId": "",
  "logApiErrors": true,
  "httpTimeout": null,
  "validateSession": true
}
```
connector_config.json: This file contains the required parameters to connect to the external object storage and data system. The parameters will vary depending on the systems being connected. Below is an example for the Redshift Connector. Review the sample files for each accelerator implementation for additional examples.
```json
{
  "convert_to_parquet": false,
  "extract_document_content": true,
  "retrieve_document_text": true,
  "direct_data": {
    "start_time": "2000-01-01T00:00Z",
    "stop_time": "2025-04-09T00:00Z",
    "extract_type": "incremental"
  },
  "s3": {
    "iam_role_arn": "arn:aws:iam::123456789:role/Direct-Data-Role",
    "bucket_name": "vault-direct-data-bucket",
    "direct_data_folder": "direct-data",
    "archive_filepath": "direct-data/201287-20250409-0000-F.tar.gz",
    "extract_folder": "201287-20250409-0000-F",
    "document_content_folder": "extracted_doc_content",
    "document_text_folder": "extracted_doc_text"
  },
  "redshift": {
    "host": "direct-data.123GUID.us-east-1.redshift.amazonaws.com",
    "port": "5439",
    "user": "user",
    "password": "password",
    "database": "database",
    "schema": "direct_data",
    "iam_redshift_s3_read": "arn:aws:iam::123456789:role/RedshiftS3Read"
  }
}
```
Scripts:
The logic that moves and transforms data between systems is handled in the included scripts.
accelerator.py: This is the entry point for the program. It's used to instantiate the required classes and pass them into the following scripts that contain the core logic. Below is an example accelerator.py script for the Redshift Accelerator.
```python
import sys

from accelerators.redshift.services.redshift_service import RedshiftService

sys.path.append('.')

from common.scripts import (direct_data_to_object_storage, download_and_unzip_direct_data_files,
                            extract_doc_content, load_data, retrieve_doc_text)
from common.services.aws_s3_service import AwsS3Service
from common.services.vault_service import VaultService
from common.utilities import read_json_file


def main():
    # Read the configuration files.
    config_filepath: str = "path/to/connector_config.json"
    vapil_settings_filepath: str = "path/to/vapil_settings.json"
    config_params: dict = read_json_file(config_filepath)
    direct_data_params: dict = config_params['direct_data']
    s3_params: dict = config_params['s3']
    redshift_params: dict = config_params['redshift']
    extract_document_content: bool = config_params.get('extract_document_content')
    retrieve_document_text: bool = config_params.get('retrieve_document_text')

    object_storage_root: str = f's3://{s3_params["bucket_name"]}'
    s3_params['convert_to_parquet'] = config_params['convert_to_parquet']
    redshift_params['convert_to_parquet'] = config_params['convert_to_parquet']
    redshift_params['object_storage_root'] = object_storage_root

    # Instantiate the service classes and pass them into the core scripts.
    s3_service: AwsS3Service = AwsS3Service(s3_params)
    redshift_service: RedshiftService = RedshiftService(redshift_params)
    vault_service: VaultService = VaultService(vapil_settings_filepath)

    direct_data_to_object_storage.run(vault_service=vault_service,
                                      object_storage_service=s3_service,
                                      direct_data_params=direct_data_params)
    download_and_unzip_direct_data_files.run(object_storage_service=s3_service)
    load_data.run(object_storage_service=s3_service,
                  database_service=redshift_service,
                  direct_data_params=direct_data_params)

    # Document content and text extraction are optional steps.
    if extract_document_content:
        extract_doc_content.run(object_storage_service=s3_service,
                                vault_service=vault_service)
    if retrieve_document_text:
        retrieve_doc_text.run(object_storage_service=s3_service,
                              vault_service=vault_service)


if __name__ == "__main__":
    main()
```
direct_data_to_object_storage.py: This script downloads a designated Direct Data file from Vault and uploads the Direct Data tar.gz file to an object storage system. Multiple file parts are handled natively.
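At a high level, the flow of this script can be sketched as follows. The method names on the service objects (list_files, download_file_part, upload_file) and the fields on the returned file item are placeholders for illustration; the real VaultService and ObjectStorageService wrap the corresponding Vault and storage APIs.

```python
# Illustrative flow of direct_data_to_object_storage.py (method and field names
# are placeholders, not the accelerator's actual API).
def run(vault_service, object_storage_service, direct_data_params: dict) -> None:
    extract_type = direct_data_params["extract_type"]   # "full", "incremental", or "log"
    start_time = direct_data_params["start_time"]
    stop_time = direct_data_params["stop_time"]

    # 1. Ask Vault which Direct Data files exist for the requested window and
    #    pick the one to replicate (selection logic simplified here).
    file_item = vault_service.list_files(extract_type, start_time, stop_time)[-1]

    # 2. Download every file part and reassemble the .tar.gz archive.
    archive_bytes = b"".join(
        vault_service.download_file_part(file_item["name"], part_number)
        for part_number in range(1, file_item["file_part_count"] + 1)
    )

    # 3. Upload the reassembled archive to object storage (S3 or Blob Storage).
    object_storage_service.upload_file(
        f"direct-data/{file_item['filename']}", archive_bytes
    )
```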
download_and_unzip_direct_data_files.py: This script downloads the Direct Data file from object storage, unzips the content, optionally converts the CSV files to Parquet format, and uploads the unzipped content back to the object storage system.
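The unzip and optional Parquet conversion step can be sketched roughly as below, assuming pandas with a Parquet engine (pyarrow or fastparquet) is available; the real script may stream or chunk large files differently.

```python
import io
import tarfile

import pandas as pd


def unzip_and_convert(archive_path: str, convert_to_parquet: bool) -> dict:
    """Extract every file from the Direct Data tar.gz and optionally convert
    CSVs to Parquet, returning {relative_path: file_bytes} ready for upload."""
    extracted: dict = {}
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            if convert_to_parquet and member.name.endswith(".csv"):
                # Round-trip the CSV through pandas to produce a Parquet file.
                frame = pd.read_csv(io.BytesIO(data))
                buffer = io.BytesIO()
                frame.to_parquet(buffer, index=False)  # requires pyarrow or fastparquet
                extracted[member.name.replace(".csv", ".parquet")] = buffer.getvalue()
            else:
                extracted[member.name] = data
    return extracted
```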
load_data.py: This script facilitates loading the Direct Data extracts from object storage into tables in the target data system. Logic is included to handle Full, Incremental, and Log file types.
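Conceptually, the script dispatches on the extract type for each table file in the extract, roughly as below; the DatabaseService method names are placeholders rather than the accelerator's actual API.

```python
# Simplified dispatch logic of load_data.py (method names are illustrative).
def run(object_storage_service, database_service, direct_data_params: dict) -> None:
    extract_type = direct_data_params["extract_type"]

    for table_file in object_storage_service.list_extract_files():
        if extract_type == "full":
            # Full files recreate each table and bulk-load the entire extract.
            database_service.create_or_replace_table(table_file)
        elif extract_type == "incremental":
            # Incremental files carry creates, updates, and deletes since the
            # previous extract, so they are merged into existing tables.
            database_service.apply_incremental(table_file)
        elif extract_type == "log":
            # Log files are appended to the corresponding log tables.
            database_service.load_log_file(table_file)
```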
extract_doc_content.py: This script retrieves the doc version IDs from the Full or Incremental document_version_sys.csv file, calls the Export Document Versions endpoint to export document content to File Staging, then downloads the content from File Staging and uploads to object storage.
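A simplified outline of that flow is shown below; the VaultService method names (export_document_versions, list_exported_items, download_item_content) and the CSV column name are placeholders for the real wrappers around those Vault endpoints.

```python
import csv
import io


# Outline of extract_doc_content.py (service method names are placeholders).
def run(object_storage_service, vault_service) -> None:
    # 1. Collect document version IDs from the extract's document_version_sys.csv.
    csv_bytes = object_storage_service.download_file("document_version_sys.csv")
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    version_ids = [row["id"] for row in reader]  # actual ID column name may differ

    # 2. Ask Vault to export those versions' source content to File Staging.
    job = vault_service.export_document_versions(version_ids)

    # 3. Download each exported file from File Staging and push it to object storage.
    for item in vault_service.list_exported_items(job):
        content = vault_service.download_item_content(item)
        object_storage_service.upload_file(f"extracted_doc_content/{item['name']}", content)
```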
retrieve_doc_text.py: This script retrieves the doc version IDs from the Full or Incremental document_version_sys.csv file, calls the Retrieve Document Version Text endpoint, then saves the content to a text file in object storage.
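Its flow is similar but simpler, roughly as follows; retrieve_document_version_text and read_document_version_ids are placeholder names for the underlying VaultService and ObjectStorageService calls.

```python
# Minimal sketch of retrieve_doc_text.py (method names are placeholders).
def run(object_storage_service, vault_service) -> None:
    for version_id in object_storage_service.read_document_version_ids():
        text = vault_service.retrieve_document_version_text(version_id)
        object_storage_service.upload_file(
            f"extracted_doc_text/{version_id}.txt", text.encode("utf-8")
        )
```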
To use the Direct Data accelerators, see the following prerequisites. Individual implementations will have additional prerequisites.
The following accelerator implementations are currently available as working examples.
The Snowflake accelerator leverages Snowflake's native integration with S3 to load data from S3 directly into Snowflake. This process uses the COPY INTO command, which loads data directly from a file and can also infer the table schema from that file. Inferring schemas is generally recommended only when dealing with Full files.
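For reference, a single table load from the stage might look roughly like the following, assuming the snowflake-connector-python package, an external stage named vault_direct_data_stage pointing at the S3 bucket, a named Parquet file format called parquet_format, and an example table object__v; none of these identifiers come from the accelerator itself.

```python
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_warehouse", database="my_database", schema="direct_data",
)
cur = conn.cursor()

# Create the table by inferring its schema from the staged Parquet files
# (generally only recommended for Full files), then load it with COPY INTO.
cur.execute("""
    CREATE TABLE IF NOT EXISTS object__v USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(INFER_SCHEMA(
            LOCATION => '@vault_direct_data_stage/201287-20250409-0000-F/object__v/',
            FILE_FORMAT => 'parquet_format')))
""")
cur.execute("""
    COPY INTO object__v
    FROM @vault_direct_data_stage/201287-20250409-0000-F/object__v/
    FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
```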
Pre-requisites
- AWS CLI
- S3 to Snowflake integration
- S3 Stage
- The following must be present:
Supported File Formats
- CSV
- PARQUET
There are several ways to handle and load data into Databricks. The Databricks accelerator utilizes the COPY INTO command to load data directly from S3 into Delta Lake.
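A hedged example of that COPY INTO call through the databricks-sql-connector package is shown below; the workspace hostname, HTTP path, access token, and schema/table names are placeholders.

```python
from databricks import sql

# Connection details are placeholders for a Databricks SQL warehouse.
with sql.connect(server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",
                 http_path="/sql/1.0/warehouses/abc123",
                 access_token="dapi-example-token") as conn:
    with conn.cursor() as cursor:
        # COPY INTO loads the staged CSV into a Delta table; with CSV input the
        # columns land as strings (see Considerations below).
        cursor.execute("""
            COPY INTO direct_data.object__v
            FROM 's3://vault-direct-data-bucket/201287-20250409-0000-F/object__v.csv'
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('header' = 'true')
            COPY_OPTIONS ('mergeSchema' = 'true')
        """)
```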
Pre-requisites
- AWS CLI
- Configure data access for ingestion
- The following must be present:
Supported File Formats
- CSV
Considerations
- The data loaded into Delta Lake tables is loaded as a String data type.
Similar to the other accelerators, the Redshift accelerator leverages the COPY command to load data into tables from S3.
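The COPY statement issued for one table looks roughly like the following, shown here through the redshift_connector package with placeholder identifiers and the IAM role from the connector_config.json example above.

```python
import redshift_connector

# Connection details mirror the "redshift" block of connector_config.json.
conn = redshift_connector.connect(
    host="direct-data.123GUID.us-east-1.redshift.amazonaws.com",
    port=5439, database="database", user="user", password="password",
)
cursor = conn.cursor()

# COPY pulls the CSV straight from S3 using the IAM role granted S3 read access.
cursor.execute("""
    COPY direct_data.object__v
    FROM 's3://vault-direct-data-bucket/201287-20250409-0000-F/object__v.csv'
    IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftS3Read'
    FORMAT AS CSV
    IGNOREHEADER 1
""")
conn.commit()
```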
Pre-requisites
Supported File Formats
- CSV
Considerations
- The Amazon Redshift character type has a limit of 65,535 bytes.
- Due to this limit, some Rich Text field data may be truncated.
The SQL Database accelerator leverages the BULK INSERT command to load data into tables from Blob Storage.
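Under the setup listed in the prerequisites below (an external data source backed by a database scoped credential), the BULK INSERT for one table looks roughly like this; the connection string, data source, table, and file names are placeholders.

```python
import pyodbc

# Connection string values are placeholders for an Azure SQL Database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
cursor = conn.cursor()

# BULK INSERT reads the CSV from Blob Storage through the external data source,
# which wraps the SAS-based database scoped credential.
cursor.execute("""
    BULK INSERT direct_data.object__v
    FROM '201287-20250409-0000-F/object__v.csv'
    WITH (DATA_SOURCE = 'VaultBlobStorage',
          FORMAT = 'CSV',
          FIRSTROW = 2,
          FIELDTERMINATOR = ',',
          ROWTERMINATOR = '0x0a')
""")
conn.commit()
```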
Pre-requisites
- Azure Storage Account
- Blob Container
- Master Key
- Shared Access Signature
- Database Scoped Credential
- External Data Source
Supported File Formats
- CSV
Considerations
- Boolean values from Vault are loaded as an NVARCHAR data type.
The Fabric accelerator leverages the COPY INTO command to load data into tables from Blob Storage.
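A hedged example of that COPY INTO statement, run via pyodbc against the warehouse's SQL endpoint, is below; the endpoint, storage account, container, SAS secret, and table names are placeholders.

```python
import pyodbc

# Connect to the Fabric Warehouse SQL endpoint (connection values are placeholders).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.datawarehouse.fabric.microsoft.com;DATABASE=my_warehouse;"
    "Authentication=ActiveDirectoryInteractive;UID=user@example.com"
)
cursor = conn.cursor()

# COPY INTO loads the Parquet file from Blob Storage using a SAS credential.
cursor.execute("""
    COPY INTO direct_data.object__v
    FROM 'https://mystorageaccount.blob.core.windows.net/vault-direct-data/201287-20250409-0000-F/object__v.parquet'
    WITH (FILE_TYPE = 'PARQUET',
          CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = 'sas-token-placeholder'))
""")
conn.commit()
```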
Pre-requisites
Supported File Formats
- CSV
- PARQUET