11# WebMainBench
22
3- WebMainBench 是一个专门用于端到端评测网页正文抽取质量的基准测试工具。
3+ [简体中文](README_zh.md) | English
44
5- ## 功能特点
5+ WebMainBench is a benchmark tool for end-to-end evaluation of main-content extraction from web pages.
66
7- ### 🎯 ** 核心功能**
8- - ** 多抽取器支持** : 支持 trafilatura,resiliparse 等多种抽取工具
9- - ** 全面的评测指标** : 包含文本编辑距离、表格结构相似度(TEDS)、公式抽取质量等多维度指标
10- - ** 人工标注支持** : 评测数据集100%人工标注
7+ ## Features
118
12- #### 指标详细说明
9+ ### 🎯 **Core Features**
10+ - **Multiple Extractor Support**: Supports multiple extraction tools, including trafilatura and resiliparse
11+ - **Comprehensive Evaluation Metrics**: Covers several dimensions, including text edit distance, table structure similarity (TEDS), and formula extraction quality
12+ - **Manual Annotation Support**: The evaluation dataset is 100% manually annotated
1313
14- | 指标名称 | 计算方式 | 取值范围 | 说明 |
14+ #### Metric Details
15+
16+ | Metric Name | Calculation Method | Value Range | Description |
1517| ---------| ----------| ----------| ------|
16- | ` overall ` | 所有成功指标的平均值 | 0.0-1.0 | 综合质量评分,分数越高质量越好 |
17- | ` text_edit ` | ` 1 - (编辑距离 / 最大文本长度 ) ` | 0.0-1.0 | 纯文本相似度,分数越高质量越好 |
18- | ` code_edit ` | ` 1 - (编辑距离 / 最大代码长度 ) ` | 0.0-1.0 | 代码内容相似度,分数越高质量越好 |
19- | ` table_TEDS ` | ` 1 - (树编辑距离 / 最大节点数 ) ` | 0.0-1.0 | 表格结构相似度,分数越高质量越好 |
20- | ` table_edit ` | ` 1 - (编辑距离 / 最大表格长度 ) ` | 0.0-1.0 | 表格内容相似度,分数越高质量越好 |
21- | ` formula_edit ` | ` 1 - (编辑距离 / 最大公式长度 ) ` | 0.0-1.0 | 公式内容相似度,分数越高质量越好 |
18+ | `overall` | Average of all successfully computed metrics | 0.0-1.0 | Overall quality score; higher is better |
19+ | `text_edit` | `1 - (edit distance / max text length)` | 0.0-1.0 | Plain-text similarity; higher is better |
20+ | `code_edit` | `1 - (edit distance / max code length)` | 0.0-1.0 | Code content similarity; higher is better |
21+ | `table_TEDS` | `1 - (tree edit distance / max nodes)` | 0.0-1.0 | Table structure similarity; higher is better |
22+ | `table_edit` | `1 - (edit distance / max table length)` | 0.0-1.0 | Table content similarity; higher is better |
23+ | `formula_edit` | `1 - (edit distance / max formula length)` | 0.0-1.0 | Formula content similarity; higher is better |
2224
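As a rough illustration of the `*_edit` formulas in the table, the sketch below computes `1 - (edit distance / max length)` over characters. It assumes plain Levenshtein distance and treats two empty strings as a perfect match; the function names `levenshtein` and `edit_similarity` are hypothetical, and the actual WebMainBench implementation may normalize or tokenize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(predicted: str, groundtruth: str) -> float:
    """Return 1 - (edit distance / max length), a score in [0.0, 1.0]."""
    max_len = max(len(predicted), len(groundtruth))
    if max_len == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(predicted, groundtruth) / max_len


score = edit_similarity("# This is a title", "# This is the title")
print(f"text_edit-style score: {score:.4f}")
```

Dividing by the longer of the two strings keeps the score within the 0.0-1.0 range given in the table, because the edit distance can never exceed that length.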
2325
24- ### 🏗️ ** 系统架构 **
26+ ### 🏗️ **System Architecture**
2527
2628![WebMainBench Architecture](docs/assets/arch.png)
2729
28- ### 🔧 ** 核心模块 **
29- 1 . ** data 模块 ** : 评测集文件和结果的读写管理
30- 2 . ** extractors 模块 ** : 各种抽取工具的统一接口
31- 3 . ** metrics 模块 ** : 评测指标的计算实现
32- 4 . ** evaluator 模块 ** : 评测任务的执行和结果输出
30+ ### 🔧 **Core Modules**
31+ 1. **data module**: Reads and writes evaluation datasets and results
32+ 2. **extractors module**: Provides a unified interface to the extraction tools
33+ 3. **metrics module**: Implements the evaluation metrics
34+ 4. **evaluator module**: Runs evaluation tasks and outputs the results
3335
3436
35- ## 快速开始
37+ ## Quick Start
3638
37- ### 安装
39+ ### Installation
3840
3941``` bash
40- # 基础安装
42+ # Basic installation
4143pip install webmainbench
4244
43- # 安装所有可选依赖
45+ # Install with all optional dependencies
4446pip install webmainbench[all]
4547
46- # 开发环境安装
48+ # Development environment installation
4749pip install webmainbench[dev]
4850```
4951
50- ### 基本使用
52+ ### Basic Usage
5153
5254``` python
5355from webmainbench import DataLoader, Evaluator, ExtractorFactory
5456
55- # 1. 加载评测数据集
57+ # 1. Load evaluation dataset
5658dataset = DataLoader.load_jsonl("your_dataset.jsonl")
5759
58- # 2. 创建抽取器
60+ # 2. Create extractor
5961extractor = ExtractorFactory.create("trafilatura")
6062
61- # 3. 运行评测
63+ # 3. Run evaluation
6264evaluator = Evaluator()
6365result = evaluator.evaluate(dataset, extractor)
6466
65- # 4. 查看结果
67+ # 4. View results
6668print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
6769```
6870
69- ### 数据格式
71+ ### Data Format
7072
71- 评测数据集应包含以下字段:
73+ Evaluation datasets should contain the following fields:
7274
7375``` jsonl
7476{
7577  "track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
76- "html" : " <html><body><h1 cc-select=\" true\" >这是标题 </h1></body></html>" , # 人工标注带cc -select="true" 属性
78+   "html": "<html><body><h1 cc-select=\"true\">This is a title</h1></body></html>",  # Manually annotated with the cc-select="true" attribute
7779  "url": "https://orderyourbooks.com/product-category/college-books-p-u/?products-per-page=all",
78- "main_html" : " <h1 cc-select=\" true\" >这是标题 </h1>" , # 从html中剪枝得到的正文html
79- "convert_main_content" : " # 这是标题 " , # 从main_html+html2text转化来
80- "groundtruth_content" : " # 这是标题 " , # 人工校准的markdown(部分提供)
80+   "main_html": "<h1 cc-select=\"true\">This is a title</h1>",  # Main-content HTML pruned from html
81+   "convert_main_content": "# This is a title",  # Converted from main_html with html2text
82+   "groundtruth_content": "# This is a title",  # Manually calibrated Markdown (provided for some samples)
8183  "meta": {
82- "language" : " en" , # 网页的语言
83- "style" : " artical" , # 网页的文体
84+     "language": "en",  # Web page language
85+     "style": "artical",  # Web page style
8486    "table": [],  # [], ["layout"], ["data"], ["layout", "data"]
8587    "equation": [],  # [], ["inline"], ["interline"], ["inline", "interline"]
8688    "code": [],  # [], ["inline"], ["interline"], ["inline", "interline"]
@@ -89,73 +91,73 @@ print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
8991}
9092```
9193
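To make the record layout concrete, here is a minimal sketch that reads JSONL text with the standard `json` module and checks the top-level fields listed above. It is not the `DataLoader` implementation: `read_samples` and `REQUIRED_FIELDS` are hypothetical names, and `groundtruth_content` is treated as optional on the assumption that it is only provided for some samples. Note that the `#` annotations in the example record are explanatory only; a real JSONL file holds one comment-free JSON object per line.

```python
import io
import json

# Assumption: every record carries these keys; groundtruth_content is optional.
REQUIRED_FIELDS = {"track_id", "html", "url", "main_html", "convert_main_content", "meta"}


def read_samples(stream):
    """Yield one dict per non-empty JSONL line, validating the expected fields."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        yield record


# One record in the shape shown above, serialized onto a single line.
sample_line = json.dumps({
    "track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
    "html": "<html><body><h1 cc-select=\"true\">This is a title</h1></body></html>",
    "url": "https://orderyourbooks.com/product-category/college-books-p-u/?products-per-page=all",
    "main_html": "<h1 cc-select=\"true\">This is a title</h1>",
    "convert_main_content": "# This is a title",
    "groundtruth_content": "# This is a title",
    "meta": {"language": "en", "table": [], "equation": [], "code": []},
})

samples = list(read_samples(io.StringIO(sample_line + "\n")))
print(samples[0]["track_id"], samples[0]["meta"]["language"])
```

Passing a file object opened with `open(path, encoding="utf-8")` instead of the `io.StringIO` wrapper would read a dataset from disk in the same way.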
92- ## 支持的抽取器
94+ ## Supported Extractors
9395
94- - ** trafilatura** : trafilatura抽取器
95- - ** resiliparse** : resiliparse抽取器
96- - ** llm-webkit** : llm-webkit 抽取器
97- - ** magic-html** : magic-html 抽取器
98- - ** 自定义抽取器 ** : 通过继承 ` BaseExtractor ` 实现
96+ - **trafilatura**: trafilatura extractor
97+ - **resiliparse**: resiliparse extractor
98+ - **llm-webkit**: llm-webkit extractor
99+ - **magic-html**: magic-html extractor
100+ - **Custom extractors**: Implemented by subclassing `BaseExtractor`
99101
100- ## 评测榜单
102+ ## Evaluation Leaderboard
101103
102- | extractor | extractor_version | dataset | total_samples | overall( macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
104+ | extractor | extractor_version | dataset | total_samples | overall (macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
103105| -----------| -------------------| ---------| ---------------| ---------------------| -----------| --------------| ------------| -----------| -----------|
104106| llm-webkit | 4.1.1 | WebMainBench1.0 | 545 | 0.8256 | 0.9093 | 0.9399 | 0.7388 | 0.678 | 0.8621 |
105107| magic-html | 0.1.5 | WebMainBench1.0 | 545 | 0.5141 | 0.4117 | 0.7204 | 0.3984 | 0.2611 | 0.7791 |
106108| trafilatura_md | 2.0.0 | WebMainBench1.0 | 545 | 0.3858 | 0.1305 | 0.6242 | 0.3203 | 0.1653 | 0.6887 |
107109| trafilatura_txt | 2.0.0 | WebMainBench1.0 | 545 | 0.2657 | 0 | 0.6162 | 0 | 0 | 0.7126 |
108110| resiliparse | 0.14.5 | WebMainBench1.0 | 545 | 0.2954 | 0.0641 | 0.6747 | 0 | 0 | 0.7381 |
109111
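The `overall (macro avg)` column appears to be the unweighted mean of the five per-metric columns. The sketch below checks this against the published llm-webkit row; `macro_average` and `METRIC_COLUMNS` are hypothetical names, and because the inputs are already rounded to four decimals, a recomputed value can differ from the table in the last digit for some rows.

```python
from statistics import mean

# Per-metric columns of the leaderboard, in table order.
METRIC_COLUMNS = ("code_edit", "formula_edit", "table_TEDS", "table_edit", "text_edit")


def macro_average(row: dict) -> float:
    """Unweighted mean over the per-metric scores of one leaderboard row."""
    return mean(row[name] for name in METRIC_COLUMNS)


# Values copied from the llm-webkit row above.
llm_webkit = {
    "code_edit": 0.9093,
    "formula_edit": 0.9399,
    "table_TEDS": 0.7388,
    "table_edit": 0.678,
    "text_edit": 0.8621,
}

print(f"recomputed overall: {macro_average(llm_webkit):.4f}")
```

A macro average weights every metric equally, so an extractor that scores 0 on tables, as `trafilatura_txt` and `resiliparse` do, is pulled down even when its `text_edit` score is comparatively high.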
110- ## 高级功能
112+ ## Advanced Features
111113
112- ### 多抽取器对比评估
114+ ### Multi-Extractor Comparison
113115
114116``` python
115- # 对比多个抽取器
117+ # Compare multiple extractors
116118extractors = ["trafilatura", "resiliparse"]
117119results = evaluator.compare_extractors(dataset, extractors)
118120
119121for name, result in results.items():
120122    print(f"{name}: {result.overall_metrics['overall']:.4f}")
121123```
122124
123- #### 具体示例
125+ #### Detailed Example
124126
125127``` python
126128python examples/ multi_extractor_compare.py
127129```
128130
129- 这个例子演示了如何:
131+ This example demonstrates how to:
130132
131- 1 . ** 加载测试数据集 ** :使用包含代码、公式、表格、文本等多种内容类型的样本数据
132- 2 . ** 创建多个抽取器 ** :
133- - ` magic-html ` :基于 magic-html 库的抽取器
134- - ` trafilatura ` :基于 trafilatura 库的抽取器
135- - ` resiliparse ` :基于 resiliparse 库的抽取器
136- 3 . ** 批量评估对比 ** :使用 ` evaluator.compare_extractors() ` 同时评估所有抽取器
137- 4 . ** 生成对比报告 ** :自动保存多种格式的评估结果
133+ 1. **Load the test dataset**: Use sample data covering multiple content types, such as code, formulas, tables, and text
134+ 2. **Create multiple extractors**:
135+    - `magic-html`: Extractor based on the magic-html library
136+    - `trafilatura`: Extractor based on the trafilatura library
137+    - `resiliparse`: Extractor based on the resiliparse library
138+ 3. **Run a batch comparison**: Use `evaluator.compare_extractors()` to evaluate all extractors together
139+ 4. **Generate a comparison report**: Automatically save the evaluation results in multiple formats
138140
139- #### 输出文件说明
141+ #### Output File Description
140142
141- 评估完成后会在 ` results/ ` 目录下生成三个重要文件:
143+ After the evaluation completes, three files are written to the `results/` directory:
142144
143- | 文件名 | 格式 | 内容描述 |
145+ | File Name | Format | Content Description |
144146| --------| ------| ----------|
145- | ` leaderboard.csv ` | CSV | ** 排行榜文件 ** :包含各抽取器的整体排名和分项指标对比,便于快速查看性能差异 |
146- | ` evaluation_results.json ` | JSON | ** 详细评估结果 ** :包含每个抽取器的完整评估数据、指标详情和元数据信息 |
147- | ` dataset_with_results.jsonl ` | JSONL | ** 增强数据集 ** :原始测试数据加上所有抽取器的提取结果,便于人工检查和分析 |
147+ | `leaderboard.csv` | CSV | **Leaderboard file**: Overall ranking and per-metric scores for each extractor, for quick performance comparison |
148+ | `evaluation_results.json` | JSON | **Detailed evaluation results**: Complete evaluation data, metric details, and metadata for each extractor |
149+ | `dataset_with_results.jsonl` | JSONL | **Enhanced dataset**: Original test data plus the extraction results from every extractor, for manual inspection and analysis |
148150
149151
150- ` leaderboard.csv ` 内容示例:
152+ Example `leaderboard.csv` content:
151153``` csv
152154extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
153155magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
154156resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
155157trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
156158```
157159
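For quick inspection, a leaderboard in this shape can be ranked with the standard `csv` module. The sketch below embeds the example rows shown above as a string; `rank_by` is a hypothetical helper, not part of WebMainBench.

```python
import csv
import io

# The example leaderboard from above, embedded so the sketch is self-contained.
LEADERBOARD_CSV = """\
extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
"""


def rank_by(csv_text: str, column: str = "overall") -> list:
    """Parse leaderboard CSV text and sort rows by one score column, best first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)


ranking = rank_by(LEADERBOARD_CSV)
for position, row in enumerate(ranking, start=1):
    print(f"{position}. {row['extractor']}: {float(row['overall']):.4f}")
```

Sorting on a different column, for example `rank_by(LEADERBOARD_CSV, "text_edit")`, shows that the ordering depends on the metric: on this sample, resiliparse has the highest `text_edit` score while magic-html leads on `overall`.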
158- ### 自定义指标
160+ ### Custom Metrics
159161
160162``` python
161163from webmainbench.metrics import BaseMetric, MetricResult
@@ -165,30 +167,30 @@ class CustomMetric(BaseMetric):
165167        pass
166168
167169    def _calculate_score(self, predicted, groundtruth, **kwargs):
168- # 实现自定义评测逻辑
170+         # Implement custom evaluation logic
169171        score = your_calculation(predicted, groundtruth)
170172        return MetricResult(
171173            metric_name=self.name,
172174            score=score,
173175            details={"custom_info": "value"}
174176        )
175177
176- # 添加到评测器
178+ # Add to evaluator
177179evaluator.metric_calculator.add_metric("custom", CustomMetric("custom"))
178180```
179181
180- ### 自定义抽取器
182+ ### Custom Extractors
181183
182184``` python
183185from webmainbench.extractors import BaseExtractor, ExtractionResult
184186
185187class MyExtractor(BaseExtractor):
186188    def _setup(self):
187- # 初始化抽取器
189+         # Initialize extractor
188190        pass
189191
190192    def _extract_content(self, html, url=None):
191- # 实现抽取逻辑
193+         # Implement extraction logic
192194        content = your_extraction_logic(html)
193195
194196        return ExtractionResult(
@@ -197,34 +199,35 @@ class MyExtractor(BaseExtractor):
197199            success=True
198200        )
199201
200- # 注册自定义抽取器
202+ # Register custom extractor
201203ExtractorFactory.register("my-extractor", MyExtractor)
202204```
203205
204- ## 项目架构
206+ ## Project Architecture
205207
206208```
207209webmainbench/
208- ├── data/ # 数据处理模块
209- │ ├── dataset.py # 数据集类
210- │ ├── loader.py # 数据加载器
211- │ └── saver.py # 数据保存器
212- ├── extractors/ # 抽取器模块
213- │ ├── base.py # 基础接口
214- │ ├── factory.py # 工厂模式
215- │ └── ... # 具体实现
216- ├── metrics/ # 指标模块
217- │ ├── base.py # 基础接口
218- │ ├── text_metrics.py # 文本指标
219- │ ├── table_metrics.py # 表格指标
220- │ └── calculator.py # 指标计算器
221- ├── evaluator/ # 评估器模块
222- │ └── evaluator.py # 主评估器
223- └── utils/ # 工具模块
224- └── helpers.py # 辅助函数
210+ ├── data/                 # Data processing module
211+ │   ├── dataset.py        # Dataset class
212+ │   ├── loader.py         # Data loader
213+ │   └── saver.py          # Data saver
214+ ├── extractors/           # Extractor module
215+ │   ├── base.py           # Base interface
216+ │   ├── factory.py        # Factory pattern
217+ │   └── ...               # Specific implementations
218+ ├── metrics/              # Metrics module
219+ │   ├── base.py           # Base interface
220+ │   ├── text_metrics.py   # Text metrics
221+ │   ├── table_metrics.py  # Table metrics
222+ │   └── calculator.py     # Metric calculator
223+ ├── evaluator/            # Evaluator module
224+ │   └── evaluator.py      # Main evaluator
225+ └── utils/                # Utility module
226+     └── helpers.py        # Helper functions
225227```
226228
227229
228- ## 许可证
230+ ## License
231+
232+ This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
229233
230- 本项目采用 MIT 许可证 - 查看 [ LICENSE] ( LICENSE ) 文件了解详情。