Commit 6cb86db

Merge pull request #56 from e06084/main
feat: add zh readme
2 parents 2c06f11 + 80b4040 commit 6cb86db

2 files changed: +325 -92 lines changed

README.md

Lines changed: 95 additions & 92 deletions
@@ -1,86 +1,88 @@
11
# WebMainBench
22

3-
WebMainBench 是一个专门用于端到端评测网页正文抽取质量的基准测试工具。
3+
[简体中文](README_zh.md) | English
44

5-
## 功能特点
5+
WebMainBench is a specialized benchmark tool for end-to-end evaluation of web main content extraction quality.
66

7-
### 🎯 **核心功能**
8-
- **多抽取器支持**: 支持 trafilatura,resiliparse 等多种抽取工具
9-
- **全面的评测指标**: 包含文本编辑距离、表格结构相似度(TEDS)、公式抽取质量等多维度指标
10-
- **人工标注支持**: 评测数据集100%人工标注
7+
## Features
118

12-
#### 指标详细说明
9+
### 🎯 **Core Features**
10+
- **Multiple Extractor Support**: Supports multiple extraction tools, including trafilatura and resiliparse
11+
- **Comprehensive Evaluation Metrics**: Covers text edit distance, table structure similarity (TEDS), and formula extraction quality, among other metrics
12+
- **Manual Annotation Support**: 100% manually annotated evaluation dataset
1313

14-
| 指标名称 | 计算方式 | 取值范围 | 说明 |
14+
#### Metric Details
15+
16+
| Metric Name | Calculation Method | Value Range | Description |
1517
|---------|----------|----------|------|
16-
| `overall` | 所有成功指标的平均值 | 0.0-1.0 | 综合质量评分,分数越高质量越好 |
17-
| `text_edit` | `1 - (编辑距离 / 最大文本长度)` | 0.0-1.0 | 纯文本相似度,分数越高质量越好 |
18-
| `code_edit` | `1 - (编辑距离 / 最大代码长度)` | 0.0-1.0 | 代码内容相似度,分数越高质量越好 |
19-
| `table_TEDS` | `1 - (树编辑距离 / 最大节点数)` | 0.0-1.0 | 表格结构相似度,分数越高质量越好 |
20-
| `table_edit` | `1 - (编辑距离 / 最大表格长度)` | 0.0-1.0 | 表格内容相似度,分数越高质量越好 |
21-
| `formula_edit` | `1 - (编辑距离 / 最大公式长度)` | 0.0-1.0 | 公式内容相似度,分数越高质量越好 |
18+
| `overall` | Average of all successful metrics | 0.0-1.0 | Comprehensive quality score, higher is better |
19+
| `text_edit` | `1 - (edit distance / max text length)` | 0.0-1.0 | Plain text similarity, higher is better |
20+
| `code_edit` | `1 - (edit distance / max code length)` | 0.0-1.0 | Code content similarity, higher is better |
21+
| `table_TEDS` | `1 - (tree edit distance / max nodes)` | 0.0-1.0 | Table structure similarity, higher is better |
22+
| `table_edit` | `1 - (edit distance / max table length)` | 0.0-1.0 | Table content similarity, higher is better |
23+
| `formula_edit` | `1 - (edit distance / max formula length)` | 0.0-1.0 | Formula content similarity, higher is better |
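
The edit-based rows in the table above share one formula, `1 - (edit distance / max length)`. The following is a minimal sketch of that formula, not WebMainBench's actual implementation: the helper names `levenshtein`, `edit_similarity`, and `overall_score` are hypothetical, the use of plain character-level Levenshtein distance is an assumption, and so are the conventions that two empty strings score 1.0 and that a failed metric is represented as `None`.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(predicted: str, groundtruth: str) -> float:
    """1 - (edit distance / max length), which always lies in [0.0, 1.0]."""
    longest = max(len(predicted), len(groundtruth))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    return 1.0 - levenshtein(predicted, groundtruth) / longest


def overall_score(scores: Dict[str, Optional[float]]) -> float:
    """Average of the metrics that produced a score (None marks a failure here)."""
    valid = [s for s in scores.values() if s is not None]
    return sum(valid) / len(valid) if valid else 0.0


print(round(edit_similarity("kitten", "sitting"), 4))
print(overall_score({"text_edit": 1.0, "table_TEDS": None, "code_edit": 0.5}))
```

Dividing by the longer of the two lengths keeps the score within the 0.0-1.0 range listed in the table, because the edit distance can never exceed that length.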
2224

2325

24-
### 🏗️ **系统架构**
26+
### 🏗️ **System Architecture**
2527

2628
![WebMainBench Architecture](docs/assets/arch.png)
2729

28-
### 🔧 **核心模块**
29-
1. **data 模块**: 评测集文件和结果的读写管理
30-
2. **extractors 模块**: 各种抽取工具的统一接口
31-
3. **metrics 模块**: 评测指标的计算实现
32-
4. **evaluator 模块**: 评测任务的执行和结果输出
30+
### 🔧 **Core Modules**
31+
1. **data module**: Reads and writes evaluation sets and results
32+
2. **extractors module**: Unified interface for the extraction tools
33+
3. **metrics module**: Implements the evaluation metric calculations
34+
4. **evaluator module**: Runs evaluation tasks and outputs results
3335

3436

35-
## 快速开始
37+
## Quick Start
3638

37-
### 安装
39+
### Installation
3840

3941
```bash
40-
# 基础安装
42+
# Basic installation
4143
pip install webmainbench
4244

43-
# 安装所有可选依赖
45+
# Install with all optional dependencies
4446
pip install webmainbench[all]
4547

46-
# 开发环境安装
48+
# Development environment installation
4749
pip install webmainbench[dev]
4850
```
4951

50-
### 基本使用
52+
### Basic Usage
5153

5254
```python
5355
from webmainbench import DataLoader, Evaluator, ExtractorFactory
5456

55-
# 1. 加载评测数据集
57+
# 1. Load evaluation dataset
5658
dataset = DataLoader.load_jsonl("your_dataset.jsonl")
5759

58-
# 2. 创建抽取器
60+
# 2. Create extractor
5961
extractor = ExtractorFactory.create("trafilatura")
6062

61-
# 3. 运行评测
63+
# 3. Run evaluation
6264
evaluator = Evaluator()
6365
result = evaluator.evaluate(dataset, extractor)
6466

65-
# 4. 查看结果
67+
# 4. View results
6668
print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
6769
```
6870

69-
### 数据格式
71+
### Data Format
7072

71-
评测数据集应包含以下字段:
73+
Evaluation datasets should contain the following fields:
7274

7375
```jsonl
7476
{
7577
"track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
76-
"html": "<html><body><h1 cc-select=\"true\">这是标题</h1></body></html>", # 人工标注带cc-select="true" 属性
78+
"html": "<html><body><h1 cc-select=\"true\">This is a title</h1></body></html>", # Manually annotated with cc-select="true" attribute
7779
"url": "https://orderyourbooks.com/product-category/college-books-p-u/?products-per-page=all",
78-
"main_html": "<h1 cc-select=\"true\">这是标题</h1>", # 从html中剪枝得到的正文html
79-
"convert_main_content": "# 这是标题", # 从main_html+html2text转化来
80-
"groundtruth_content": "# 这是标题", # 人工校准的markdown(部分提供)
80+
"main_html": "<h1 cc-select=\"true\">This is a title</h1>", # Main content HTML pruned from html
81+
"convert_main_content": "# This is a title", # Converted from main_html + html2text
82+
"groundtruth_content": "# This is a title", # Manually calibrated markdown (partially provided)
8183
"meta": {
82-
"language": "en", # 网页的语言
83-
"style": "artical", # 网页的文体
84+
"language": "en", # Web page language
85+
"style": "artical", # Web page style
8486
"table": [], # [], ["layout"], ["data"], ["layout", "data"]
8587
"equation": [], # [], ["inline"], ["interline"], ["inline", "interline"]
8688
"code": [], # [], ["inline"], ["interline"], ["inline", "interline"]
@@ -89,73 +91,73 @@ print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
8991
}
9092
```
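
The `#` annotations in the record above are explanatory only; a real JSONL file holds one comment-free JSON object per line. As a rough sketch of how a record could be checked before evaluation, the snippet below uses only the standard library. The helper `parse_record` and the `REQUIRED_FIELDS` tuple are hypothetical and not part of WebMainBench; treating `groundtruth_content` as optional is an assumption based on the "partially provided" note, and the field list covers only the fields visible in this excerpt.

```python
import json

# Fields shown in the example record; groundtruth_content is left out
# because the README describes it as only partially provided (assumption).
REQUIRED_FIELDS = ("track_id", "html", "url", "main_html",
                   "convert_main_content", "meta")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and fail loudly if a required field is absent."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"record is missing fields: {', '.join(missing)}")
    return record


sample_line = json.dumps({
    "track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
    "html": "<html><body><h1 cc-select=\"true\">This is a title</h1></body></html>",
    "url": "https://orderyourbooks.com/product-category/college-books-p-u/?products-per-page=all",
    "main_html": "<h1 cc-select=\"true\">This is a title</h1>",
    "convert_main_content": "# This is a title",
    "meta": {"language": "en", "table": [], "equation": [], "code": []},
})

record = parse_record(sample_line)
print(record["track_id"], record["meta"]["language"])
```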
9193

92-
## 支持的抽取器
94+
## Supported Extractors
9395

94-
- **trafilatura**: trafilatura抽取器
95-
- **resiliparse**: resiliparse抽取器
96-
- **llm-webkit**: llm-webkit 抽取器
97-
- **magic-html**: magic-html 抽取器
98-
- **自定义抽取器**: 通过继承 `BaseExtractor` 实现
96+
- **trafilatura**: trafilatura extractor
97+
- **resiliparse**: resiliparse extractor
98+
- **llm-webkit**: llm-webkit extractor
99+
- **magic-html**: magic-html extractor
100+
- **Custom extractors**: Implement by inheriting from `BaseExtractor`
99101

100-
## 评测榜单
102+
## Evaluation Leaderboard
101103

102-
| extractor | extractor_version | dataset | total_samples | overallmacro avg | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
104+
| extractor | extractor_version | dataset | total_samples | overall (macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
103105
|-----------|-------------------|---------|---------------|---------------------|-----------|--------------|------------|-----------|-----------|
104106
| llm-webkit | 4.1.1 | WebMainBench1.0 | 545 | 0.8256 | 0.9093 | 0.9399 | 0.7388 | 0.678 | 0.8621 |
105107
| magic-html | 0.1.5 | WebMainBench1.0 | 545 | 0.5141 | 0.4117 | 0.7204 | 0.3984 | 0.2611 | 0.7791 |
106108
| trafilatura_md | 2.0.0 | WebMainBench1.0 | 545 | 0.3858 | 0.1305 | 0.6242 | 0.3203 | 0.1653 | 0.6887 |
107109
| trafilatura_txt | 2.0.0 | WebMainBench1.0 | 545 | 0.2657 | 0 | 0.6162 | 0 | 0 | 0.7126 |
108110
| resiliparse | 0.14.5 | WebMainBench1.0 | 545 | 0.2954 | 0.0641 | 0.6747 | 0 | 0 | 0.7381 |
109111

110-
## 高级功能
112+
## Advanced Features
111113

112-
### 多抽取器对比评估
114+
### Multi-Extractor Comparison
113115

114116
```python
115-
# 对比多个抽取器
117+
# Compare multiple extractors
116118
extractors = ["trafilatura", "resiliparse"]
117119
results = evaluator.compare_extractors(dataset, extractors)
118120

119121
for name, result in results.items():
120122
print(f"{name}: {result.overall_metrics['overall']:.4f}")
121123
```
122124

123-
#### 具体示例
125+
#### Detailed Example
124126

125127
```bash
126128
python examples/multi_extractor_compare.py
127129
```
128130

129-
这个例子演示了如何:
131+
This example demonstrates how to:
130132

131-
1. **加载测试数据集**:使用包含代码、公式、表格、文本等多种内容类型的样本数据
132-
2. **创建多个抽取器**
133-
- `magic-html`:基于 magic-html 库的抽取器
134-
- `trafilatura`:基于 trafilatura 库的抽取器
135-
- `resiliparse`:基于 resiliparse 库的抽取器
136-
3. **批量评估对比**:使用 `evaluator.compare_extractors()` 同时评估所有抽取器
137-
4. **生成对比报告**:自动保存多种格式的评估结果
133+
1. **Load test dataset**: Use sample data that covers several content types, including code, formulas, tables, and text
134+
2. **Create multiple extractors**:
135+
- `magic-html`: Extractor based on magic-html library
136+
- `trafilatura`: Extractor based on trafilatura library
137+
- `resiliparse`: Extractor based on resiliparse library
138+
3. **Batch evaluation comparison**: Use `evaluator.compare_extractors()` to evaluate all extractors simultaneously
139+
4. **Generate comparison report**: Automatically save evaluation results in multiple formats
138140

139-
#### 输出文件说明
141+
#### Output File Description
140142

141-
评估完成后会在 `results/` 目录下生成三个重要文件:
143+
After evaluation is complete, three important files will be generated in the `results/` directory:
142144

143-
| 文件名 | 格式 | 内容描述 |
145+
| File Name | Format | Content Description |
144146
|--------|------|----------|
145-
| `leaderboard.csv` | CSV | **排行榜文件**:包含各抽取器的整体排名和分项指标对比,便于快速查看性能差异 |
146-
| `evaluation_results.json` | JSON | **详细评估结果**:包含每个抽取器的完整评估数据、指标详情和元数据信息 |
147-
| `dataset_with_results.jsonl` | JSONL | **增强数据集**:原始测试数据加上所有抽取器的提取结果,便于人工检查和分析 |
147+
| `leaderboard.csv` | CSV | **Leaderboard file**: Contains overall rankings and sub-metric comparisons for each extractor, for quick performance comparison |
148+
| `evaluation_results.json` | JSON | **Detailed evaluation results**: Contains complete evaluation data, metric details and metadata for each extractor |
149+
| `dataset_with_results.jsonl` | JSONL | **Enhanced dataset**: Original test data plus extraction results from all extractors, for manual inspection and analysis |
148150

149151

150-
`leaderboard.csv` 内容示例:
152+
`leaderboard.csv` content example:
151153
```csv
152154
extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
153155
magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
154156
resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
155157
trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
156158
```
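
For a quick look at a leaderboard without opening a spreadsheet, the CSV can be read with Python's standard `csv` module. This is a sketch under stated assumptions: the helper `rank_extractors` is hypothetical, and the inline `SAMPLE` string simply repeats the example rows above rather than reading a real `results/leaderboard.csv`.

```python
import csv
import io

# Same rows as the README's leaderboard.csv example.
SAMPLE = """extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
"""


def rank_extractors(csv_text: str) -> list:
    """Return (extractor, overall) pairs sorted from best to worst overall score."""
    rows = csv.DictReader(io.StringIO(csv_text))
    pairs = [(row["extractor"], float(row["overall"])) for row in rows]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for name, overall in rank_extractors(SAMPLE):
    print(f"{name}: {overall:.4f}")
```

To run the same ranking on an actual results file, pass the file's text (for example, the result of `open(path, newline="").read()`) instead of `SAMPLE`.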
157159

158-
### 自定义指标
160+
### Custom Metrics
159161

160162
```python
161163
from webmainbench.metrics import BaseMetric, MetricResult
@@ -165,30 +167,30 @@ class CustomMetric(BaseMetric):
165167
pass
166168

167169
def _calculate_score(self, predicted, groundtruth, **kwargs):
168-
# 实现自定义评测逻辑
170+
# Implement custom evaluation logic
169171
score = your_calculation(predicted, groundtruth)
170172
return MetricResult(
171173
metric_name=self.name,
172174
score=score,
173175
details={"custom_info": "value"}
174176
)
175177

176-
# 添加到评测器
178+
# Add to evaluator
177179
evaluator.metric_calculator.add_metric("custom", CustomMetric("custom"))
178180
```
179181

180-
### 自定义抽取器
182+
### Custom Extractors
181183

182184
```python
183185
from webmainbench.extractors import BaseExtractor, ExtractionResult
184186

185187
class MyExtractor(BaseExtractor):
186188
def _setup(self):
187-
# 初始化抽取器
189+
# Initialize extractor
188190
pass
189191

190192
def _extract_content(self, html, url=None):
191-
# 实现抽取逻辑
193+
# Implement extraction logic
192194
content = your_extraction_logic(html)
193195

194196
return ExtractionResult(
@@ -197,34 +199,35 @@ class MyExtractor(BaseExtractor):
197199
success=True
198200
)
199201

200-
# 注册自定义抽取器
202+
# Register custom extractor
201203
ExtractorFactory.register("my-extractor", MyExtractor)
202204
```
203205

204-
## 项目架构
206+
## Project Architecture
205207

206208
```
207209
webmainbench/
208-
├── data/ # 数据处理模块
209-
│ ├── dataset.py # 数据集类
210-
│ ├── loader.py # 数据加载器
211-
│ └── saver.py # 数据保存器
212-
├── extractors/ # 抽取器模块
213-
│ ├── base.py # 基础接口
214-
│ ├── factory.py # 工厂模式
215-
│ └── ... # 具体实现
216-
├── metrics/ # 指标模块
217-
│ ├── base.py # 基础接口
218-
│ ├── text_metrics.py # 文本指标
219-
│ ├── table_metrics.py # 表格指标
220-
│ └── calculator.py # 指标计算器
221-
├── evaluator/ # 评估器模块
222-
│ └── evaluator.py # 主评估器
223-
└── utils/ # 工具模块
224-
└── helpers.py # 辅助函数
210+
├── data/ # Data processing module
211+
│ ├── dataset.py # Dataset class
212+
│ ├── loader.py # Data loader
213+
│ └── saver.py # Data saver
214+
├── extractors/ # Extractor module
215+
│ ├── base.py # Base interface
216+
│ ├── factory.py # Factory pattern
217+
│ └── ... # Specific implementations
218+
├── metrics/ # Metrics module
219+
│ ├── base.py # Base interface
220+
│ ├── text_metrics.py # Text metrics
221+
│ ├── table_metrics.py # Table metrics
222+
│ └── calculator.py # Metric calculator
223+
├── evaluator/ # Evaluator module
224+
│ └── evaluator.py # Main evaluator
225+
└── utils/ # Utility module
226+
└── helpers.py # Helper functions
225227
```
226228

227229

228-
## 许可证
230+
## License
231+
232+
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
229233

230-
本项目采用 MIT 许可证 - 查看 [LICENSE](LICENSE) 文件了解详情。
