Contributor

@bmtcril bmtcril commented Apr 22, 2025

Features / Changes

  • Adds new asyncio-based backends for ClickHouse direct, Ralph, CSV, and CHDB/S3, built on a new task management system
  • Adds an urwid-based UI that displays the load state as the various loads happen
  • Unpins clickhouse-connect as it's no longer needed
  • Adds type hinting across the application
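
For reviewers unfamiliar with the pattern, here is a minimal sketch of how a fixed pool of asyncio workers can drain a queue of batches. It is not the code in this PR: `worker`, `run_batches`, and the placeholder `asyncio.sleep(0)` are hypothetical names and stand-ins, and only the `num_xapi_batches` and `num_workers` parameter names are borrowed from the config changes described here.

```python
import asyncio
from typing import List, Tuple


async def worker(name: str, queue: asyncio.Queue, results: List[Tuple[str, int]]) -> None:
    """Pull batch numbers off the queue until the task is cancelled."""
    while True:
        batch = await queue.get()
        try:
            # Placeholder for real work, e.g. generating and inserting one batch.
            await asyncio.sleep(0)
            results.append((name, batch))
        finally:
            # Always mark the item done so queue.join() can return.
            queue.task_done()


async def run_batches(num_xapi_batches: int, num_workers: int) -> List[Tuple[str, int]]:
    """Process every batch using a fixed number of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: List[Tuple[str, int]] = []
    for batch in range(num_xapi_batches):
        queue.put_nowait(batch)

    workers = [
        asyncio.create_task(worker(f"worker-{i}", queue, results))
        for i in range(num_workers)
    ]
    await queue.join()  # wait until every queued batch has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    done = asyncio.run(run_batches(num_xapi_batches=10, num_workers=4))
    print(f"processed {len(done)} batches")
```

A queue with a fixed worker count keeps concurrency bounded no matter how many batches are requested, which is one plausible reason for making a worker count a required setting.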

Breaking Changes

Refactors several configuration variables:

  • num_batches is renamed to num_xapi_batches to clarify its scope
  • num_workers is added and required
  • csv_load_from_s3_after is renamed to load_from_s3_after as it is now shared with the CHDB backend

Refactors backend names, the available backends are now:

  • chdb
  • csv
  • clickhouse
  • ralph
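
The renames and backend list above could be applied to an already-parsed config with something like the sketch below. `migrate_config` is a hypothetical helper, not part of xapi-db-load, and it assumes the backend is stored under a top-level `backend` key; check your own config for the actual key name. The default of 4 workers is only the value suggested in the testing notes.

```python
from typing import Any, Dict, Optional

# Backend names listed in this PR description.
VALID_BACKENDS = {"chdb", "csv", "clickhouse", "ralph"}

# Old config key -> new config key, as described under "Breaking Changes".
RENAMED_KEYS = {
    "num_batches": "num_xapi_batches",
    "csv_load_from_s3_after": "load_from_s3_after",
}


def migrate_config(
    old: Dict[str, Any],
    backend: Optional[str] = None,
    num_workers: int = 4,
) -> Dict[str, Any]:
    """Return a copy of `old` with renamed keys and the new required setting."""
    new = {RENAMED_KEYS.get(key, key): value for key, value in old.items()}
    # num_workers is now required; keep an existing value if one is present.
    new.setdefault("num_workers", num_workers)
    if backend is not None:
        # Assumption: the backend lives under a top-level "backend" key.
        new["backend"] = backend
    if new.get("backend") not in VALID_BACKENDS:
        raise ValueError(
            f"backend must be one of {sorted(VALID_BACKENDS)}, got {new.get('backend')!r}"
        )
    return new
```

For example, `migrate_config({"backend": "csv", "num_batches": 3})` returns a dict with `num_xapi_batches` set to 3 and `num_workers` set to 4.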

Removes the load_db_from_s3 command in favor of a new --load_db_only flag on load_db that fulfills the same behavior

Todo

  • Documentation
  • Improve test coverage, especially in the UI
  • Test in tutor-contrib-aspects
  • Update the example configs and make sure they all still work
  • Add chdb example config

Testing

  • Any existing config can be updated by changing the backend name, renaming num_batches to num_xapi_batches, and adding num_workers: 4
  • To test the UI, use the new command xapi-db-load ui --config_file <path to your config>
  • The default config should also work
  • To test the command line interface, run xapi-db-load load_db --config_file <path to your config>, which should still work

The new CHDB backend writes lz4-compressed ClickHouse native files to S3, then optionally loads them into ClickHouse by running insert commands on ClickHouse, similar to how the CSV backend works. However, it does not currently support writing the files anywhere but S3.
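
As a rough illustration of the "insert commands" step, the sketch below builds an `INSERT ... SELECT` statement around ClickHouse's `s3` table function with the `Native` format. This is an assumption about the general shape of such a statement, not the query this PR actually issues: `build_s3_insert`, the table name, the URL, and the credential values are all hypothetical, and the sketch assumes ClickHouse detects the lz4 compression from the file extension.

```python
def quote(value: str) -> str:
    """Wrap a value in single quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_s3_insert(table: str, s3_url: str, access_key: str, secret_key: str) -> str:
    """Build an INSERT that reads Native-format files from S3 into `table`.

    Assumption: compression is auto-detected from the file extension,
    so no explicit compression argument is passed to s3().
    """
    source = (
        f"s3({quote(s3_url)}, {quote(access_key)}, {quote(secret_key)}, 'Native')"
    )
    return f"INSERT INTO {table} SELECT * FROM {source}"


if __name__ == "__main__":
    # Hypothetical table, bucket, and credentials for illustration only.
    print(
        build_s3_insert(
            "xapi.example_table",
            "https://example-bucket.s3.amazonaws.com/load/*.native.lz4",
            "EXAMPLE_KEY",
            "EXAMPLE_SECRET",
        )
    )
```

Running the insert on the ClickHouse server means the data moves directly from S3 to ClickHouse, rather than through the machine that generated the files.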

@bmtcril bmtcril force-pushed the bmtcril/async_base branch from db628d3 to fb1382b Compare April 22, 2025 20:52
@bmtcril bmtcril marked this pull request as draft April 22, 2025 20:54
@saraburns1
Contributor

looks good - ran with ralph & clickhouse backends through tutor-contrib-aspects. will approve after conflicts are resolved

@bmtcril bmtcril marked this pull request as ready for review May 2, 2025 17:05
@bmtcril bmtcril force-pushed the bmtcril/async_base branch from d48eefc to a100a1a Compare May 2, 2025 18:35
- Adds async backends using asyncio for ClickHouse direct, Ralph, CSV,
and CHDB/S3 using a new task management system
- Adds an urwid based UI for displaying the load state as the various
loads happen
- Refactors several configuration variables
@bmtcril bmtcril force-pushed the bmtcril/async_base branch from a100a1a to 87f752e Compare May 5, 2025 14:26
@bmtcril bmtcril merged commit dd8c495 into main May 5, 2025
4 checks passed
@bmtcril bmtcril deleted the bmtcril/async_base branch May 5, 2025 14:32
