# Adding a Custom Task

To add a new task, first open an issue to determine whether it should be
integrated into the core evaluations of lighteval, the extended tasks, or the
community tasks, and add its dataset to the Hugging Face Hub.

- Core evaluations are evaluations that only require standard logic in their
  metrics and processing, and that we will add to our test suite to ensure
  non-regression over time. They already see high usage in the community.
- Extended evaluations are evaluations that require custom logic in their
  metrics (complex normalisation, an LLM as a judge, ...), which we added to
  make users' lives easier. They already see high usage in the community.
- Community evaluations are new tasks submitted by the community.

A popular community evaluation can move to become an extended or core evaluation over time.

> [!TIP]
> You can find examples of custom tasks in the <a href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_tasks</a> directory.

## Step by step creation of a custom task

> [!WARNING]
> To contribute your custom task to the lighteval repo, you will first need
> to install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

First, create a Python file under the `community_tasks` directory.

You need to define a prompt function that will convert a line from your
dataset into a document to be used for evaluation.

```python
from lighteval.tasks.requests import Doc


# Define as many prompt functions as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        instruction="",
    )
```
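
For illustration, here is how a hypothetical dataset row maps to a `Doc` (the field names `question`, `choices`, and `gold` are the assumptions made by the `prompt_fn` above, not a requirement of lighteval):

```python
# A hypothetical dataset row with the fields prompt_fn expects
line = {
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Paris", "Madrid", "Rome"],
    "gold": 1,  # index of the correct choice
}

doc = prompt_fn(line, task_name="community|mytask")
print(doc.query)       # What is the capital of France?
print(doc.choices)     # [' Berlin', ' Paris', ' Madrid', ' Rome']
print(doc.gold_index)  # 1
```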

Then, you need to choose a metric: you can either use an existing one (defined
in `lighteval/metrics/metrics.py`) or [create a custom one](adding-a-new-metric).

```python
import numpy as np

# SampleLevelMetric, MetricCategory and MetricUseCase come from lighteval's
# metric utilities (see lighteval/metrics for the exact import path in your version)
custom_metric = SampleLevelMetric(
    metric_name="my_custom_metric_name",
    higher_is_better=True,
    category=MetricCategory.IGNORED,
    use_case=MetricUseCase.NONE,
    sample_level_fn=lambda x: x,  # how to compute the score for one sample
    corpus_level_fn=np.mean,  # how to aggregate the sample-level scores
)
```
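
As a minimal sketch of what a `sample_level_fn` could look like, here is a simple exact-match scorer; the exact arguments lighteval passes depend on the metric category and version, so treat the signature as an assumption:

```python
# Hypothetical sample-level scorer: 1.0 if any prediction exactly matches any
# gold answer after stripping whitespace, 0.0 otherwise. Adapt the signature
# to what your metric category expects.
def exact_match_sample(golds: list[str], predictions: list[str], **kwargs) -> float:
    return float(any(p.strip() == g.strip() for g in golds for p in predictions))
```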

Then, you need to define your task. You can define a task with or without subsets.
To define a task with no subsets:

```python
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/default_prompts.py
    suite=["community"],
    hf_repo="",
    hf_subset="default",
    hf_avail_splits=[],
    evaluation_splits=[],
    few_shots_split=None,
    few_shots_select=None,
    metric=[],  # select your metric in Metrics
)
```

If you want to create a task with multiple subsets, add them to the
`SAMPLE_SUBSETS` list and create a task for each subset.

```python
SAMPLE_SUBSETS = []  # list of all the subsets to use for this eval


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/default_prompts.py
            hf_repo="",
            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```

Here is a list of the parameters and their meaning (a filled-in example follows the list):

- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This
  field allows us to compare different task implementations and is used as a
  task selection to differentiate the versions to launch. At the moment, you'll
  find the keywords ["helm", "bigbench", "original", "lighteval", "community",
  "custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step
  above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation
  (note: when the dataset has no subset, fill this field with `"default"`, not
  with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train,
  valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you
  want to select samples for your few-shot examples. It should be different
  from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to
  select items for your few-shot examples. Can be `null`, or one of:
  - `balanced` selects examples from the `few_shots_split` with balanced
    labels, to avoid skewing the few-shot examples (hence the model
    generations) toward one specific label
  - `random` selects examples at random from the `few_shots_split`
  - `random_sampling` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is removed from the available samples
  - `random_sampling_from_train` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is kept! Only use this if you know what you are
    doing.
  - `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a
  generative evaluation. If your evaluation is a log-likelihood evaluation
  (multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end-of-sentence tokens
  for your generation
- `metric` (list), the metrics you want to use for your evaluation (see next
  section for a detailed explanation)
- `output_regex` (str), a regex string that will be used to filter your
  generation. (Generative metrics will only select tokens that are between the
  first and the second sequence matched by the regex. For example, for a regex
  matching `\n` and a generation `\nModel generation output\nSome other text`,
  the metric will only be fed with `Model generation output`)
- `frozen` (bool), for now set to `False`, but we will progressively move all
  stable tasks to `True`
- `trust_dataset` (bool), set to `True` if you trust the dataset

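To make these parameters concrete, here is a hedged sketch of a fully filled-in configuration for a hypothetical multiple-choice dataset (the repo name, splits, and metric choice are illustrative assumptions, not an existing dataset):

```python
# Illustrative values only: "your-username/your-dataset" is a placeholder repo
task = LightevalTaskConfig(
    name="mytask",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="your-username/your-dataset",  # hypothetical dataset on the hub
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random_sampling",
    generation_size=-1,  # log-likelihood (multi-choice) evaluation
    stop_sequence=["\n"],
    metric=[custom_metric],
    trust_dataset=True,
)
```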

Then you need to add your task to the `TASKS_TABLE` list.

```python
# STORE YOUR EVALS

# tasks with subsets:
TASKS_TABLE = SUBSET_TASKS

# tasks without subsets:
# TASKS_TABLE = [task]
```

Finally, add a small block of module logic at the end of the file. You should not need to modify it: it simply prints the names and number of the tasks you defined when the file is run directly, which serves as a sanity check.

```python
# MODULE LOGIC
# You should not need to touch this
# Print the tasks and their number as a sanity check
if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
```

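As a quick check (the file path below is illustrative), running the file directly should print your task names:

```bash
python community_tasks/my_task.py
```
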
Once your file is created, you can run the evaluation with the following command:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
    --custom_tasks {path_to_your_custom_task_file} \
    --output_dir "./evals"
```
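
For example, assuming the subset task defined above (`mytask:subset1` and the file path are hypothetical names used for illustration), a zero-shot run could look like:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --tasks "community|mytask:subset1|0|0" \
    --custom_tasks community_tasks/my_task.py \
    --output_dir "./evals"
```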