A flexible and powerful data generation tool built in Go that generates synthetic data based on YAML manifest definitions.
- 🚀 High-performance batch data generation
- 📝 YAML-based manifest configuration
- 🔄 Support for table dependencies and relationships
- ✨ Rich data type support including JSON
- 🎯 Customizable data patterns and distributions
- ✅ Data validation and constraints
- 🧰 Powerful expression-based rules engine
- 🔄 Support for Cassandra-specific data types
- Define your data schema in a YAML manifest file:
    tables:
      - name: users
        priority: 1
        columns:
          - name: id
            pattern: "USER####"
            parent: true
            validation:
              unique: true
          - name: name
            type: string
            value: ["John Doe", "Jane Doe"]
          - name: metadata
            type: json
            json_config:
              min_keys: 2
              max_keys: 4
              fields: ["age", "email", "preferences"]
              types: ["int", "email", "string"]
          - name: created_at
            type: timestamp
            format: "2006-01-02 15:04:05"
- Run the generator:
    go run main.go -manifest manifest/application.yaml -count 1000

The Data Generator follows a modular architecture designed for flexibility and extensibility:
- Generator: The main orchestrator that reads manifest files and coordinates data generation
- ValueGenerator: Interface for type-specific data generation
- DataSink: Interface for storing generated data in different formats (CSV, etc.)
- Rules Engine: Expression-based system for conditional data generation
- pkg/: Core package containing the data generation logic
- types/: Data structure definitions for YAML manifest
- sink/: Data sink implementations (CSV, etc.)
- manifest/: Example manifest files
- Schema parsing from YAML manifest
- Table dependency resolution and sorting
- Generation of records with type-specific generators
- Application of rules and validation
- Output to configured data sink
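
The dependency resolution step in the flow above can be pictured as a topological sort over `depends_on`. The following is a minimal sketch, not the project's actual code: `Table` and `sortTables` are hypothetical names, each table is assumed to name at most one dependency, and `priority` is ignored for simplicity.

```go
package main

import (
	"fmt"
)

// Table is a hypothetical, trimmed-down view of a manifest table entry.
type Table struct {
	Name      string
	DependsOn string // empty when the table has no dependency
}

// sortTables returns table names ordered so that every table appears after
// the table it depends on. Unknown or cyclic dependencies produce an error.
func sortTables(tables []Table) ([]string, error) {
	byName := make(map[string]Table, len(tables))
	for _, t := range tables {
		byName[t.Name] = t
	}

	const (
		unvisited = iota
		visiting
		done
	)
	state := make(map[string]int, len(tables))
	order := make([]string, 0, len(tables))

	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case done:
			return nil
		case visiting:
			return fmt.Errorf("dependency cycle involving %q", name)
		}
		state[name] = visiting
		if dep := byName[name].DependsOn; dep != "" {
			if _, ok := byName[dep]; !ok {
				return fmt.Errorf("table %q depends on unknown table %q", name, dep)
			}
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		order = append(order, name)
		return nil
	}

	// Walking in manifest order keeps the result deterministic.
	for _, t := range tables {
		if err := visit(t.Name); err != nil {
			return nil, err
		}
	}
	return order, nil
}

func main() {
	order, err := sortTables([]Table{
		{Name: "orders", DependsOn: "users"},
		{Name: "users"},
	})
	fmt.Println(order, err)
}
```

Parent tables such as `users` come out first, so their generated keys are available when child tables such as `orders` are populated.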
- string: Basic string values
- int: Integer values with range support
- decimal: Decimal numbers with precision
- timestamp: Date and time with format and range
- bool: Boolean values
- uuid: Unique identifiers
- sentence: Random sentence generation
- pattern: Custom pattern-based strings (e.g., "ABC#####")
- json: Nested JSON objects with configurable fields
The generator supports Cassandra-specific data types for generating data that matches Cassandra's data model:
- Map Type:
    - name: user_preferences
      type: map
      key_type: string
      value_type: string
      map_config:
        min_entries: 2
        max_entries: 5
        keys: ["theme", "language", "notifications"]
        values: ["dark", "light", "en", "fr", "on", "off"]
- Set Type:
    - name: tags
      type: set
      element_type: string
      set_config:
        min_elements: 1
        max_elements: 3
        values: ["urgent", "important", "normal", "low"]
- User Defined Type (UDT):
    - name: address
      type: udt
      udt_config:
        name: address_type
        fields:
          - name: street
            type: string
          - name: city
            type: string
          - name: state
            type: string
          - name: zip_code
            type: string
            pattern: "#####"
- List Type:
    - name: phone_numbers
      type: list
      element_type: string
      list_config:
        min_elements: 1
        max_elements: 3
        pattern: "+1-###-###-####"
- Tuple Type:
    - name: coordinates
      type: tuple
      tuple_config:
        elements:
          - type: decimal
            range:
              min: -180.0
              max: 180.0
          - type: decimal
            range:
              min: -90.0
              max: 90.0
- Map Configuration:
    map_config:
      min_entries: 1      # Minimum number of key-value pairs
      max_entries: 5      # Maximum number of key-value pairs
      keys:               # Optional predefined keys
        - key1
        - key2
      values:             # Optional predefined values
        - value1
        - value2
    key_type: string      # Type of keys (string, int, etc.)
    value_type: string    # Type of values (string, int, etc.)
- Set Configuration:
    set_config:
      min_elements: 1     # Minimum number of elements
      max_elements: 5     # Maximum number of elements
      values:             # Optional predefined values
        - value1
        - value2
    element_type: string  # Type of elements
- UDT Configuration:
    udt_config:
      name: type_name     # Name of the UDT
      fields:             # List of fields in the UDT
        - name: field1
          type: string
        - name: field2
          type: int
- List Configuration:
    list_config:
      min_elements: 1     # Minimum number of elements
      max_elements: 5     # Maximum number of elements
      pattern: "pattern"  # Optional pattern for elements
    element_type: string  # Type of elements
- Tuple Configuration:
    tuple_config:
      elements:           # List of element configurations
        - type: string    # Type of first element
        - type: int       # Type of second element
          range:          # Optional range for numeric types
            min: 1
            max: 100

    tables:
      - name: table_name        # Table name
        priority: 1             # Processing priority (higher numbers = higher priority)
        depends_on: other_table # Table dependency
        validation:
          min_records: 1        # Minimum records to generate
          max_records: 1000     # Maximum records to generate

    columns:
      - name: column_name       # Column name
        type: string            # Data type
        pattern: "ABC####"      # Pattern for generated values
        value: ["A", "B"]       # Predefined values
        mandatory: true         # Required field
        validation:
          unique: true          # Unique constraint
        range:                  # Value range
          min: 1
          max: 100
        format: "format_string" # Format specification

    columns:
      - name: metadata
        type: json
        json_config:
          min_keys: 2           # Minimum number of keys in JSON
          max_keys: 5           # Maximum number of keys in JSON
          fields:               # Predefined field names
            - name
            - age
            - email
          types:                # Corresponding field types
            - string
            - int
            - email

The data generator features a powerful rule-based data generation system with expressions. Rules can be defined at both column and table levels.
    rules:
      # Time-based rules
      - when: "fields.submitted_date <= fields.created_on || addDuration(fields.created_on, '2h') > fields.submitted_date"
        then:
          submitted_date: "${addDuration(fields.created_on, '2h')}"
      # Conditional value setting
      - when: "fields.salary > 50000"
        then:
          priority: "HIGH"
        otherwise:
          priority: "${fields.salary > 25000 ? 'MEDIUM' : 'LOW'}"

The expression engine provides a rich set of helper functions and variables in its evaluation environment:
- All field values are accessible via the fields object
- Helper functions for string, time, and math operations
- Support for dynamic evaluation and complex conditionals
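
The when/then/otherwise structure of a rule can be sketched in plain Go. This is only an illustration of the control flow: conditions and assignments are Go closures here rather than expression strings, and `Record`, `Rule`, `apply`, and `salaryRule` are hypothetical names that do not come from the project.

```go
package main

import "fmt"

// Record holds one generated row keyed by column name.
type Record map[string]any

// Rule is a hypothetical in-memory form of a manifest rule. The real engine
// evaluates expression strings; closures stand in for them in this sketch.
type Rule struct {
	When      func(fields Record) bool
	Then      map[string]func(fields Record) any
	Otherwise map[string]func(fields Record) any
}

// apply evaluates the condition and writes the matching branch into the record.
// All new values are computed before any is stored, so assignments within one
// branch never observe each other's results.
func (r Rule) apply(fields Record) {
	branch := r.Otherwise
	if r.When(fields) {
		branch = r.Then
	}
	updates := make(map[string]any, len(branch))
	for column, value := range branch {
		updates[column] = value(fields)
	}
	for column, v := range updates {
		fields[column] = v
	}
}

// salaryRule mirrors the "Conditional value setting" rule from the manifest.
func salaryRule() Rule {
	return Rule{
		When: func(f Record) bool { return f["salary"].(int) > 50000 },
		Then: map[string]func(Record) any{
			"priority": func(Record) any { return "HIGH" },
		},
		Otherwise: map[string]func(Record) any{
			"priority": func(f Record) any {
				if f["salary"].(int) > 25000 {
					return "MEDIUM"
				}
				return "LOW"
			},
		},
	}
}

func main() {
	rec := Record{"salary": 30000}
	salaryRule().apply(rec)
	fmt.Println(rec["priority"]) // MEDIUM
}
```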
Time Functions:
- addDuration(time, duration): Add duration to time (e.g., '1h', '30m', '2h')
- parseTime(layout, value): Parse time string using layout
- format(time, layout): Format time using layout
- now(): Get current time
Math Functions:
- min(a, b): Return minimum of two numbers
- max(a, b): Return maximum of two numbers
String Functions:
- contains(str, substr): Check if string contains substring
- hasPrefix(str, prefix): Check if string starts with prefix
- hasSuffix(str, suffix): Check if string ends with suffix
- lower(str): Convert string to lowercase
- upper(str): Convert string to uppercase
- trim(str): Remove leading/trailing whitespace
- len(str): Get the length of a string
- Time Constraints:
    - when: "fields.submitted_date <= fields.created_on"
      then:
        submitted_date: "${addDuration(fields.created_on, '2h')}"
- Status-based Updates:
    - when: 'fields.status == "COMPLETED"'
      then:
        completed_on: "${addDuration(fields.modified_on, '2h')}"
- Complex Conditions:
    - when: 'fields.status == "IN_PROGRESS" && fields.priority == "HIGH"'
      then:
        completed_on: "${addDuration(fields.modified_on, '1h')}"
        modified_by: "John Doe"
      otherwise:
        completed_on: "${addDuration(fields.modified_on, '2h')}"
        modified_by: "Jane Doe"

Supported JSON field types:
- string: Random words
- int: Integer numbers (0-1000)
- float: Floating point numbers (0-1000)
- bool: Boolean values
- date: Date strings
- email: Email addresses
- url: URLs
- Unique value constraints
- Min/max record counts
- Mandatory field validation
- Range validation for numeric and date fields
- Table dependencies
- Foreign key relationships
- Parent-child relationships
- Batch processing
- Configurable batch sizes
- Efficient memory usage
- Weighted random values
- Custom value distributions
- Range-based generation
- Configurable number of fields
- Predefined or random field names
- Multiple value types
- Nested structure support
- User Profile Generation:
    - name: users
      columns:
        - name: id
          pattern: "U####"
          validation:
            unique: true
        - name: profile
          type: json
          json_config:
            fields: ["age", "location", "interests"]
            types: ["int", "string", "string"]
- Related Tables:
    - name: orders
      depends_on: users
      columns:
        - name: order_id
          pattern: "ORD####"
        - name: user_id
          foreign: "users.id"
        - name: metadata
          type: json
          json_config:
            fields: ["items", "total", "shipping"]

The CSV sink allows you to output generated data to CSV files. Each table will be written to a separate CSV file in the specified output directory.
Example usage in your manifest:
    sink:
      type: csv
      config:
        output_dir: "./output"

Features:
- Automatic header generation based on column names
- Support for all data types including JSON fields
- Proper escaping and formatting of values
- Multiple table support with separate files
- Automatic output directory creation
The CSV files will be named after the table names (e.g., users.csv, orders.csv). Each file will include a header row with column names followed by the data rows.
JSON fields are formatted in a readable string format: {key1:value1,key2:value2}.
The codebase follows a modular architecture for maintainability:
- Generator: Central component that orchestrates data generation process
- ValueGenerator: Interface implementation for different data types
- Expression Engine: Uses the expr library for rule evaluation with a shared environment
- Data Sink: Interface for output in different formats
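
The component boundaries listed above might look roughly like the sketch below. The method sets are assumptions made for illustration, since the actual interface definitions are not shown in this document; `constantGenerator`, `memorySink`, and `generate` are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"strings"
)

// ValueGenerator produces one value for a column.
// The method set is an assumption, not the project's definition.
type ValueGenerator interface {
	Generate() (any, error)
}

// DataSink receives finished records for one table.
// The method set is an assumption, not the project's definition.
type DataSink interface {
	Write(table string, record map[string]any) error
}

// constantGenerator is a trivial ValueGenerator that always returns the same value.
type constantGenerator struct{ value any }

func (g constantGenerator) Generate() (any, error) { return g.value, nil }

// memorySink is a DataSink that keeps formatted records in memory.
type memorySink struct{ lines []string }

func (s *memorySink) Write(table string, record map[string]any) error {
	s.lines = append(s.lines, fmt.Sprintf("%s: %v", table, record))
	return nil
}

// generate builds count records by calling each column's generator,
// then hands every record to the sink.
func generate(table string, columns map[string]ValueGenerator, count int, sink DataSink) error {
	for i := 0; i < count; i++ {
		record := make(map[string]any, len(columns))
		for name, g := range columns {
			v, err := g.Generate()
			if err != nil {
				return fmt.Errorf("column %q: %w", name, err)
			}
			record[name] = v
		}
		if err := sink.Write(table, record); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	sink := &memorySink{}
	cols := map[string]ValueGenerator{"name": constantGenerator{value: "Jane Doe"}}
	if err := generate("users", cols, 2, sink); err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(sink.lines, "\n"))
}
```

Keeping generators and sinks behind small interfaces means a new data type or output format only needs one new implementation, without changes to the orchestration loop.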
The expression evaluation environment is centralized in the initEnv function, which provides a consistent set of helper functions and variables for all expressions in the system. This ensures:
- Consistent behavior across all expression evaluations
- Single source of truth for environment initialization
- Easy extension with new helper functions
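
A centralized environment of this kind could be as simple as one function that returns a map of helpers. The sketch below only borrows the `initEnv` name from the text; its signature and contents are assumptions, and it maps helpers directly to Go standard library functions instead of wiring them into the expr library.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"time"
)

// initEnv returns a fresh environment for one record. Every expression
// would be evaluated against a map built here, so helpers are defined once.
// The signature and the exact helper set are assumptions for illustration.
func initEnv(fields map[string]any) map[string]any {
	return map[string]any{
		"fields":    fields,
		"contains":  strings.Contains,
		"hasPrefix": strings.HasPrefix,
		"hasSuffix": strings.HasSuffix,
		"lower":     strings.ToLower,
		"upper":     strings.ToUpper,
		"trim":      strings.TrimSpace,
		"min":       math.Min,
		"max":       math.Max,
		"now":       time.Now,
	}
}

func main() {
	env := initEnv(map[string]any{"status": "completed"})

	// Look up a helper and a field the way an evaluator would.
	upper := env["upper"].(func(string) string)
	status := env["fields"].(map[string]any)["status"].(string)
	fmt.Println(upper(status)) // COMPLETED
}
```

Adding a new helper then means adding one map entry, which matches the "easy extension" point above.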
Contributions are welcome! Please feel free to submit a Pull Request.