2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@ coverage.xml
dist/
.venv
.dir-locals.el
private_configs/
logs/
259 changes: 217 additions & 42 deletions README.rst
@@ -1,36 +1,237 @@
Scripts for generating Aspects xAPI events
******************************************

Purpose
=======
This package generates a variety of test data used for integration and
performance testing of Open edX Aspects. Currently it populates the following
datasets:

- xAPI statements, simulating those generated by event-routing-backends
- Course and learner data, simulating that generated by event-sink-clickhouse

The xAPI events generated match the current specifications of the Open edX
event-routing-backends package, but are not yet maintained to advance alongside
them, so they may be expected to fall out of sync over time. Almost all current
statements are simulated, but statements not yet used in Aspects reporting
have been skipped.

Features
========
Once an appropriate database has been created using Aspects, data can be
generated in the following ways:

Ralph to ClickHouse
-------------------
Useful for testing configuration, integration, and permissions, this uses batch
POSTs to Ralph for xAPI statements, but still writes directly to ClickHouse for
course and actor data. This is the slowest method, but exercises the largest
surface area of the project.

Direct to ClickHouse
--------------------
Useful for getting a medium to large amount of data into the database to test
configuration and view reports. xAPI statements are batched; other data is
currently inserted one row at a time.

CSV files
---------
Useful for creating datasets that can be reused for checking performance
changes with the exact same data, and for extremely large tests. The files can
be generated locally or on any service supported by smart_open. They can then
optionally be imported to ClickHouse if written locally or to S3. They can also
be directly imported from S3 to ClickHouse at any time using the
``load-db-from-s3`` subcommand. This is by far the fastest method for large
scale tests.


Getting Started
===============

Usage
-----

A configuration file is required to run a test. If no file is given, a small
test will be run using the ``default_config.yaml`` included in the project:

::

❯ xapi-db-load load-db

To specify a config file:

::

❯ xapi-db-load load-db --config_file private_configs/my_huge_test.yaml

There is also a sub-command that just loads previously generated CSV data
from S3:

::

❯ xapi-db-load load-db-from-s3 --config_file private_configs/my_s3_test.yaml


Configuration Format
--------------------
There are a number of different configuration options for tuning the output.
In addition to the documentation below, there are example settings files to
review in the ``example_configs`` directory.

Common Settings
^^^^^^^^^^^^^^^
These settings apply to all backends, and determine the size and makeup of the
test::

# Location where timing logs will be saved
log_dir: logs

# xAPI statements will be generated in batches, the total number of
# statements is ``num_batches * batch_size``. The batch size is the number
# of statements sent to the backend (Ralph POST, ClickHouse insert, etc.)
num_batches: 3
batch_size: 100
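# For example, the values above produce 3 * 100 = 300 total statements.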

# Overall start and end date for the entire run. All xAPI statements
# will fall within these dates. Different courses will have different start
# and end dates within this window, based on course_length_days below.
start_date: 2014-01-01
end_date: 2023-11-27

# All courses will be this long, and will be fit between start_date and
# end_date, so this must be less than (end_date - start_date) days.
course_length_days: 120

# The number of organizations; courses will be evenly spread among these
num_organizations: 3

# The number of learners to create. Random subsets of these will be
# "registered" for each course and have statements generated for them
# between their registration date and the end of the course.
num_actors: 10

# How many courses of each size to create. The sum of these is the total
# number of courses created for the test. The keys are arbitrary; you can
# name them whatever you like and have as many or as few sizes as you like.
# The keys must exactly match the definitions in course_size_makeup below.
num_course_sizes:
  small: 1
  medium: 1
  ...

# Course size configurations: how many of each type of object are created
# for each course of this size. "actors" must be less than or equal to
# "num_actors". Keys here must exactly match the keys in num_course_sizes.
course_size_makeup:
  small:
    actors: 5
    problems: 20
    videos: 10
    chapters: 3
    sequences: 10
    verticals: 20
    forum_posts: 20
  medium:
    actors: 7
    problems: 40
    videos: 20
    chapters: 4
    sequences: 20
    verticals: 30
    forum_posts: 40
  ...
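
For a quick sense of scale, the sizing arithmetic above can be sketched in a
few lines of Python. This is an illustrative back-of-the-envelope estimate
based on the settings described above, not code from this package::

import yaml

# The path is illustrative; point this at any config file like the above.
with open("default_config.yaml") as f:
    config = yaml.safe_load(f)

# Total xAPI statements generated for the run (num_batches * batch_size).
total_statements = config["num_batches"] * config["batch_size"]

# Total courses is the sum of the per-size course counts.
total_courses = sum(config["num_course_sizes"].values())

print(f"{total_statements} statements across {total_courses} courses")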

CSV Backend, Local Files
^^^^^^^^^^^^^^^^^^^^^^^^
Generates gzipped CSV files to a local directory::

backend: csv_file
csv_output_destination: logs/

CSV Backend, S3 Compatible Destination
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Generates gzipped CSV files to a remote location::

backend: csv_file
# This can be anything smart_open can handle (e.g. a local directory or
# an S3 bucket), but importing to ClickHouse using this tool currently
# only supports S3 or compatible services like MinIO.
# Note that this *must* be an s3:// link; https links will not work.
# https://pypi.org/project/smart-open/
csv_output_destination: s3://openedx-aspects-loadtest/logs/large_test/

# These settings are shared with the ClickHouse backend
s3_key:
s3_secret:
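
As an aside, smart_open chooses its transport from the URL scheme and can
compress transparently based on the file extension, which is why the
destination must be an ``s3://`` link for S3 writes. A minimal, hypothetical
sketch of writing one gzipped CSV this way (the bucket name is a
placeholder)::

from smart_open import open as smart_open

# The s3:// scheme selects the S3 transport; the .gz extension enables
# transparent gzip compression on write.
with smart_open("s3://my-bucket/logs/test/statements.csv.gz", "w") as f:
    f.write("event_id,emission_time,event\n")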

CSV Backend, S3 Compatible Destination, Load to ClickHouse
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Generates gzipped CSV files to a remote location, then automatically loads
them to ClickHouse::

backend: csv_file
# csv_output_destination can be anything smart_open can handle (a local
# directory, an S3 bucket, etc.), but importing to ClickHouse using this
# tool currently only supports S3 or compatible services (e.g. MinIO)
# https://pypi.org/project/smart-open/
csv_output_destination: s3://openedx-aspects-loadtest/logs/large_test/
csv_load_from_s3_after: true

# Note that this *must* be an https link; s3:// links will not work.
# It must point to the same location as csv_output_destination.
s3_source_location: https://openedx-aspects-loadtest.s3.amazonaws.com/logs/large_test/

# This also requires all of the ClickHouse backend variables!

ClickHouse Backend
^^^^^^^^^^^^^^^^^^
This ``backend`` setting is only necessary if you are writing directly to
ClickHouse; for the Ralph or CSV integrations, use their ``backend`` values
instead::

backend: clickhouse

Variables necessary to connect to ClickHouse, whether directly, through Ralph, or
as part of loading CSV files::

# ClickHouse connection variables
db_host: localhost
# db_port is also used to determine the "secure" parameter. If the port
# ends in 443 or 440, the "secure" flag will be set on the connection.
db_port: 8443
db_username: ch_admin
db_password: secret

# Schema name for the xAPI schema
db_name: xapi

# Schema name for the event sink schema
db_event_sink_name: event_sink

# These S3 settings are shared with the CSV backend, but passed to
# ClickHouse when loading files from S3
s3_key: <...>
s3_secret: <...>
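
To illustrate how these settings map onto a connection, here is a rough
sketch using the ``clickhouse-connect`` Python client, including the
port-based "secure" heuristic described above. The table and column names in
the insert are placeholders, not necessarily this package's schema::

import clickhouse_connect

db_port = 8443
# Per the note above: ports ending in 443 or 440 imply a TLS connection.
secure = str(db_port).endswith(("443", "440"))

client = clickhouse_connect.get_client(
    host="localhost",
    port=db_port,
    username="ch_admin",
    password="secret",
    database="xapi",
    secure=secure,
)

# xAPI statements are written in batches; one insert per generated batch.
rows = [("<uuid>", "2023-11-27 12:00:00", "{...statement json...}")]
client.insert(
    "xapi_events_all",  # placeholder table name
    rows,
    column_names=["event_id", "emission_time", "event"],
)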

Ralph / ClickHouse Backend
^^^^^^^^^^^^^^^^^^^^^^^^^^
Variables necessary to send xAPI statements via Ralph::

backend: ralph_clickhouse
lrs_url: http://ralph.tutor-nightly-local.orb.local/xAPI/statements
lrs_username: ralph
lrs_password: secret

# This also requires all of the ClickHouse backend variables!
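
For context, a batch POST to an LRS such as Ralph looks roughly like the
sketch below, using the ``requests`` library with the example values above;
this illustrates the mechanism rather than this package's exact code::

import requests

# A batch of pre-generated xAPI statement dicts (contents elided).
statements = []

resp = requests.post(
    "http://ralph.tutor-nightly-local.orb.local/xAPI/statements",
    json=statements,
    auth=("ralph", "secret"),
    # LRS endpoints require an xAPI version header.
    headers={"X-Experience-API-Version": "1.0.3"},
)
resp.raise_for_status()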

Load from S3 configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^
Variables necessary to run ``xapi-db-load load-db-from-s3``, which skips the
event generation process and just loads pre-existing CSV files from S3::

# Note that this must be an https link; s3:// links will not work
s3_source_location: https://openedx-aspects-loadtest.s3.amazonaws.com/logs/large_test/

# This also requires all of the ClickHouse backend variables!
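
A load like this can be expressed with ClickHouse's ``s3()`` table function,
which accepts an https URL, consistent with the note above that ``s3://``
links will not work. A hypothetical sketch via ``clickhouse-connect`` (the
table name and CSV layout are placeholders)::

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8443, username="ch_admin",
    password="secret", secure=True,
)

# Read the gzipped CSVs directly from S3 over https and insert them.
client.command(
    """
    INSERT INTO xapi_events_all
    SELECT * FROM s3(
        'https://openedx-aspects-loadtest.s3.amazonaws.com/logs/large_test/*.csv.gz',
        '<s3_key>', '<s3_secret>', 'CSV'
    )
    """
)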

Developing
----------
@@ -162,29 +363,3 @@

Reporting Security Issues
*************************

Please do not report security issues in public. Please email [email protected].

100 changes: 100 additions & 0 deletions default_config.yaml
@@ -0,0 +1,100 @@
# CSV backend configuration
# #########################
backend: csv_file
# This can be anything smart_open can handle (a local directory, an S3
# bucket, etc.), but importing to ClickHouse only supports S3 right now
# https://pypi.org/project/smart-open/
csv_output_destination: logs/ # s3://openedx-aspects-loadtest/logs/large_test/
csv_load_from_s3_after: false

# ClickHouse Backend configuration
# ################################
# backend: clickhouse
# db_host: localhost
# db_port: 8443
# db_name: xapi_lt
# db_event_sink_name: event_sink
# db_username: ch_admin
# db_password:
# s3_key:
# s3_secret:


# Ralph / ClickHouse backend configuration
# ########################################
# backend: ralph_clickhouse
# db_host: localhost
# db_port: null
# db_name: xapi
# db_username: ch_admin
# db_password: 7NRe69D4zWWT0rf2G7gWa7RB
# lrs_url: http://ralph.tutor-nightly-local.orb.local/xAPI/statements
# lrs_username: ralph
# lrs_password: sdtiqjqwixhzcboqzbiryrulzcpvfmsfvqqw

# Load from S3 configuration
# ##########################
# s3_source_location: https://openedx-aspects-loadtest.s3.amazonaws.com/logs/large_test/

# Run options
log_dir: logs
num_batches: 3
batch_size: 100

# Overall start and end date for the entire run
start_date: 2014-01-01
end_date: 2023-11-27

# All courses will be this long, and will be fit into the start / end dates.
# This must be less than (end_date - start_date) days.
course_length_days: 120

# The size of the test
num_organizations: 3
num_actors: 10

# How many courses of each size to create. The sum of these is the total
# number of courses created for the test.
num_course_sizes:
  small: 1
  medium: 1
  large: 1
  huge: 1

# Course size configurations: how many of each type of object are created for
# each course of this size. "actors" must be less than or equal to "num_actors".
# For a course of this size to be created it needs to exist both here and in
# "num_course_sizes".
course_size_makeup:
  small:
    actors: 5
    problems: 20
    videos: 10
    chapters: 3
    sequences: 10
    verticals: 20
    forum_posts: 20
  medium:
    actors: 7
    problems: 40
    videos: 20
    chapters: 4
    sequences: 20
    verticals: 30
    forum_posts: 40
  large:
    actors: 10
    problems: 80
    videos: 30
    chapters: 5
    sequences: 40
    verticals: 80
    forum_posts: 200
  huge:
    actors: 10
    problems: 160
    videos: 40
    chapters: 10
    sequences: 50
    verticals: 100
    forum_posts: 1000