
Commit 4a78793

albertvillanova, clefourrier, and alvarobartt authored

Set up docs (#403)

* Add docs
* Add wiki to docs
* Adapt wiki as docs
* Force docs build
* Fix link in _toctree
* Add titles to docs pages
* Update docs/source/evaluate-the-model-on-a-server-or-container.mdx
  Co-authored-by: Alvaro Bartolome <[email protected]>

---------

Co-authored-by: Clémentine Fourrier <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>
1 parent 681c2d9 commit 4a78793

16 files changed (+2407, -0 lines)

.github/workflows/doc-build.yml

Lines changed: 18 additions & 0 deletions
```yaml
name: Build Documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: lighteval
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
```
.github/workflows/doc-pr-build.yml

Lines changed: 16 additions & 0 deletions
```yaml
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: lighteval
```
docs/source/_toctree.yml

Lines changed: 30 additions & 0 deletions
```yaml
- sections:
  - local: index
    title: 🤗 Lighteval
  - local: installation
    title: Installation
  - local: quicktour
    title: Quicktour
  title: Getting started
- sections:
  - local: saving-and-reading-results
    title: Save and read results
  - local: using-the-python-api
    title: Use the Python API
  - local: adding-a-custom-task
    title: Add a custom task
  - local: adding-a-new-metric
    title: Add a custom metric
  - local: use-vllm-as-backend
    title: Use VLLM as backend
  - local: evaluate-the-model-on-a-server-or-container
    title: Evaluate on Server
  - local: contributing-to-multilingual-evaluations
    title: Contributing to multilingual evaluations
  title: Guides
- sections:
  - local: metric-list
    title: Available Metrics
  - local: available-tasks
    title: Available Tasks
  title: API
```
docs/source/adding-a-custom-task.mdx

Lines changed: 196 additions & 0 deletions
# Adding a Custom Task

To add a new task, first open an issue to determine whether it will be
integrated into the core evaluations of lighteval, the extended tasks, or the
community tasks, and add its dataset to the Hub.

- Core evaluations are evaluations that only require standard logic in their
  metrics and processing, and that we will add to our test suite to ensure
  non-regression over time. They already see high usage in the community.
- Extended evaluations are evaluations that require custom logic in their
  metrics (complex normalisation, an LLM as a judge, ...), which we added to
  make users' lives easier. They already see high usage in the community.
- Community evaluations are new tasks submitted by the community.

A popular community evaluation can become an extended or core evaluation over time.

> [!TIP]
> You can find examples of custom tasks in the <a href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_tasks</a> directory.

## Step by step creation of a custom task

> [!WARNING]
> To contribute your custom task to the lighteval repo, you first need
> to install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

First, create a Python file under the `community_tasks` directory.

You need to define a prompt function that converts a line from your
dataset into a document to be used for evaluation.

```python
# Define as many of these as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        instruction="",
    )
```
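To sanity-check the prompt function, you can call it on a hand-written line that mimics your dataset schema. Below is a minimal sketch, assuming the `question`/`choices`/`gold` fields used by `prompt_fn` above; the sample values and task name are hypothetical:

```python
# Hypothetical dataset row matching the fields prompt_fn expects
sample_line = {
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Paris", "Rome"],
    "gold": 1,  # index of the correct choice
}

doc = prompt_fn(sample_line, task_name="community|mytask")
print(doc.query, doc.choices, doc.gold_index)
```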
Then you need to choose a metric: you can either use an existing one (defined
in `lighteval/metrics/metrics.py`) or [create a custom one](adding-a-new-metric).

```python
custom_metric = SampleLevelMetric(
    metric_name="my_custom_metric_name",
    higher_is_better=True,
    category=MetricCategory.IGNORED,
    use_case=MetricUseCase.NONE,
    sample_level_fn=lambda x: x,  # how to compute the score for one sample
    corpus_level_fn=np.mean,  # how to aggregate the sample-level scores
)
```
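If an existing metric already fits your task, you can reference it directly from the `Metrics` enum instead of defining your own. A minimal sketch, where the member name `loglikelihood_acc` is an assumption; check `lighteval/metrics/metrics.py` in your installed version for the exact list:

```python
from lighteval.metrics.metrics import Metrics

# Reuse a built-in metric; loglikelihood_acc is assumed to exist in your
# lighteval version, see lighteval/metrics/metrics.py for available members.
chosen_metrics = [Metrics.loglikelihood_acc]
```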
Then you need to define your task. You can define a task with or without subsets.
To define a task with no subsets:

```python
# This is how you create a simple task (like HellaSwag) which has one single
# subset attached to it, and one possible evaluation.
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
    suite=["community"],
    hf_repo="",
    hf_subset="default",
    hf_avail_splits=[],
    evaluation_splits=[],
    few_shots_split=None,
    few_shots_select=None,
    metric=[],  # select your metric in Metrics
)
```

If you want to create a task with multiple subsets, add them to the
`SAMPLE_SUBSETS` list and create a task for each subset.

```python
SAMPLE_SUBSETS = []  # list of all the subsets to use for this eval


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
            hf_repo="",
            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```
Here is a list of the parameters and their meaning (a filled-in sketch follows the list):

- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This
  field allows us to compare different task implementations and is used as a
  task selection to differentiate the versions to launch. At the moment, you'll
  find the keywords ["helm", "bigbench", "original", "lighteval", "community",
  "custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step
  above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation
  (note: when the dataset has no subset, fill this field with `"default"`, not
  with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train,
  valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you
  want to select samples for your few-shot examples. It should be different
  from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to
  select items for your few-shot examples. Can be `null`, or one of:
    - `balanced` selects examples from the `few_shots_split` with balanced
      labels, to avoid skewing the few-shot examples (hence the model
      generations) toward one specific label
    - `random` selects examples at random from the `few_shots_split`
    - `random_sampling` selects new examples at random from the
      `few_shots_split` for every new item, but if a sampled item is equal to
      the current one, it is removed from the available samples
    - `random_sampling_from_train` selects new examples at random from the
      `few_shots_split` for every new item, but if a sampled item is equal to
      the current one, it is kept! Only use this if you know what you are
      doing.
    - `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a
  generative evaluation. If your evaluation is a log-likelihood evaluation
  (multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end-of-sentence tokens
  for your generation
- `metric` (list), the metrics you want to use for your evaluation (see the next
  section for a detailed explanation)
- `output_regex` (str), a regex string that will be used to filter your
  generation. (Generative metrics will only select tokens that are between the
  first and the second sequence matched by the regex. For example, for a regex
  matching `\n` and a generation `\nModel generation output\nSome other text`,
  the metric will only be fed with `Model generation output`)
- `frozen` (bool), for now set to False, but we will steadily move all
  stable tasks to True.
- `trust_dataset` (bool), set to True if you trust the dataset.
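As an illustration, here is a minimal sketch of a fully parameterized configuration for a hypothetical short-answer generative task. The repository name, splits, and metric choice are assumptions for the example, not values from the lighteval codebase; it reuses `prompt_fn` and `custom_metric` defined earlier in the file.

```python
# Hypothetical generative task: every value below is an example,
# adapt it to your own dataset and metric.
generative_task = LightevalTaskConfig(
    name="mygenerativetask",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="my-org/my-eval-dataset",  # assumed Hub dataset path
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="balanced",
    generation_size=256,               # generative eval, so > 0
    stop_sequence=["\n"],
    metric=[custom_metric],            # defined earlier in this file
    output_regex=None,
    frozen=False,
    trust_dataset=True,
)
```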
Then you need to add your task to the `TASKS_TABLE` list.

```python
# STORE YOUR EVALS

# tasks with subsets:
TASKS_TABLE = SUBSET_TASKS

# tasks without subsets:
# TASKS_TABLE = [task]
```

Finally, add the following module logic; running the file directly prints the tasks it registers, which is a quick sanity check.

```python
# MODULE LOGIC
# You should not need to touch this
if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
```
Once your file is created, you can run the evaluation with the following command:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
    --custom_tasks {path_to_your_custom_task_file} \
    --output_dir "./evals"
```
docs/source/adding-a-new-metric.mdx

Lines changed: 93 additions & 0 deletions
# Adding a New Metric

First, check whether you can use one of the parametrized functions in
`src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`.

If not, you can use the `custom_task` system to register your new metric:

> [!TIP]
> To see an example of a custom metric added along with a custom task, look at <a href="">the IFEval custom task</a>.

> [!WARNING]
> To contribute your custom metric to the lighteval repo, you first need
> to install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

- Create a new Python file, which should contain the full logic of your metric.
- The file also needs to start with these imports:

```python
from aenum import extend_enum

from lighteval.metrics import Metrics
```
You need to define a sample-level metric:

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]
```

Here the sample-level metric returns a single value. If you want to return multiple metrics per sample, return a dictionary with the metric names as keys and their values as values.

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
```
Then, you can define an aggregation function if needed; a common aggregation function is `np.mean`.

```python
def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```

Finally, you can define your metric. If it is a sample-level metric, you can use the following code:

```python
my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```
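For example, here is a minimal sketch filling in those placeholders for the exact-match style `custom_metric` defined above. The `GENERATIVE` category and `ACCURACY` use case are assumptions; pick the enum members that match your task type in your lighteval version.

```python
import numpy as np

# Hedged example: the enum members below are assumptions, check
# MetricCategory and MetricUseCase in your installed lighteval version.
my_custom_metric = SampleLevelMetric(
    metric_name="my_exact_match",        # hypothetical name
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,  # assumed: a generative task
    use_case=MetricUseCase.ACCURACY,     # assumed use case
    sample_level_fn=custom_metric,
    corpus_level_fn=np.mean,
)
```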
If your metric defines multiple metrics per sample, you can use the following code:

```python
custom_metric = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```

To finish, add the following so that your metric is added to our metrics list
when the file is loaded as a module.

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "metric_name", my_custom_metric)  # pass the metric object you defined above
if __name__ == "__main__":
    print("Imported metric")
```

You can then pass your custom metric to lighteval by using `--custom_tasks
path_to_your_file` when launching it.