This project is an implementation of the C4.5 Decision Tree classifier in Go. It is designed to be high-performance, scalable, and capable of handling large datasets with minimal memory overhead. The implementation supports categorical and numerical attributes, missing value handling, and parallelization for improved efficiency.
- Train & Predict: CLI commands for training a decision tree and making predictions.
- Scalability: Efficient handling of large datasets with optimized memory usage.
- Modular Design: Well-structured code for easy maintainability and extension.
- Error Handling: Clear error messages and validation for input/output files.
- JSON Model Serialization: Stores trained models in a JSON format.
Ensure you have Go 1.18+ installed. Then, clone the repository and build the executable:
git clone https://github.com/rodneyo1/decision-tree-go
cd decision-tree-go
go build
./dt -c train -i <input_data_file.csv> -t <target_column> -o <output_tree.dt>
Arguments:
-c train → Specifies the training command.
-i <input_data_file.csv> → Path to the training dataset (CSV).
-t <target_column> → Column name containing the target labels.
-o <output_tree.dt> → Path where the trained model is saved (JSON format).
Example:
./dt -c train -i datasets/train.csv -t class -o model.dt
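The flags above can be wired up with Go's standard `flag` package. This is an illustrative sketch of such parsing, not the project's actual implementation; the `trainOptions` type and `parseTrainArgs` function are hypothetical names:

```go
package main

import (
	"flag"
	"fmt"
)

// trainOptions mirrors the documented training flags.
type trainOptions struct {
	Command, Input, Target, Output string
}

// parseTrainArgs parses a slice of CLI arguments into trainOptions.
func parseTrainArgs(args []string) (trainOptions, error) {
	fs := flag.NewFlagSet("dt", flag.ContinueOnError)
	var opts trainOptions
	fs.StringVar(&opts.Command, "c", "", "command to run (train or predict)")
	fs.StringVar(&opts.Input, "i", "", "path to the input CSV")
	fs.StringVar(&opts.Target, "t", "", "name of the target column")
	fs.StringVar(&opts.Output, "o", "", "path for the output model")
	err := fs.Parse(args)
	return opts, err
}

func main() {
	opts, err := parseTrainArgs([]string{
		"-c", "train", "-i", "datasets/train.csv", "-t", "class", "-o", "model.dt",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", opts)
}
```

Using `flag.NewFlagSet` rather than the global `flag.CommandLine` keeps parsing testable in isolation.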
./dt -c predict -i <prediction_data_file.csv> -m <model_file.dt> -o <predictions.csv>
Arguments:
-c predict → Specifies the prediction command.
-i <prediction_data_file.csv> → Path to the dataset for predictions.
-m <model_file.dt> → Path to the trained model file.
-o <predictions.csv> → Path to save predictions.
Example:
./dt -c predict -i datasets/test.csv -m model.dt -o predictions.csv
- The dataset must be in CSV format with a header row.
- Feature columns may include categorical, numeric, date, or timestamp values.
- The target column must be specified during training.
- The trained model is saved in JSON format.
- The test dataset for predictions should have the same feature columns as the training dataset.
The project includes a Makefile for simplified build and execution:
make build # Build the binary
make train # Train a new model
make predict # Run predictions
make clean # Clean generated files
make all # Run complete workflow (clean, build, train, predict)
Example workflow:
make all TRAIN_DATA="datasets/large_dataset.csv" TARGET_COLUMN="target"
Available Make variables:
TRAIN_DATA : Path to the training dataset
TARGET_COLUMN : Name of the target column
MODEL_FILE : Output path for the model (default: model.dt)
PREDICT_OUTPUT : Output path for predictions (default: predictions.csv)
PREDICT_DATA : Path to the dataset to predict
- Efficient Data Structures: Uses optimized data structures to minimize memory overhead
- Streaming Processing: Handles large datasets by processing rows in chunks
- Memory Pool: Implements object pooling for frequently allocated structures
- Parallel Processing: Utilizes goroutines for parallel node splitting and prediction
- Detection: Automatically detects and handles null/missing values
- Strategies: Supports multiple imputation strategies:
- Mean/mode imputation for numerical features
- Most frequent value for categorical features
- Special split handling during tree construction
No extra flags are needed; missing values are detected and handled during a normal training run:
./dt -c train -i data.csv -t target -o model.dt
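The mean-imputation strategy listed above can be sketched as follows; here a missing numeric value is represented by a `nil` pointer, which may differ from the project's internal representation:

```go
package main

import "fmt"

// ptr is a small helper to take the address of a literal.
func ptr(x float64) *float64 { return &x }

// imputeMean replaces nil (missing) entries in a numeric column
// with the mean of the observed values.
func imputeMean(col []*float64) []float64 {
	sum, n := 0.0, 0
	for _, v := range col {
		if v != nil {
			sum += *v
			n++
		}
	}
	mean := 0.0
	if n > 0 {
		mean = sum / float64(n)
	}

	out := make([]float64, len(col))
	for i, v := range col {
		if v != nil {
			out[i] = *v
		} else {
			out[i] = mean // fill the gap with the column mean
		}
	}
	return out
}

func main() {
	col := []*float64{ptr(1), nil, ptr(3)} // middle value is missing
	fmt.Println(imputeMean(col))           // [1 2 3]
}
```

The categorical analogue substitutes the most frequent value instead of the mean.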
The project includes unit tests to validate correctness and performance.
Run the tests using:
go test ./...