Description
🧨 Describe the Bug
Hi, the docs weren't clear about how to save output, so I assumed I needed to use from marker.output import save_output. It works well with PdfConverter but not with ExtractionConverter (which I use for structured extraction): I get AttributeError: 'ExtractionOutput' object has no attribute 'metadata'. On top of that, all my attempts to use Ollama for structured extraction have failed, while it works well with Gemini, but that's a separate issue I guess. (PS: I found closed issue #785, which describes exactly this second problem with marker + Ollama, but I wonder why it's closed because it's still happening.)
📄 Input Document
It happens with any PDF, but here's a short 3-page PDF to test with.
hal.pdf
📤 Output Trace / Stack Trace
Running page extraction: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.42s/it]
Traceback (most recent call last):
  File "/home/kp276129/Documents/ontoflow/pdf_analysis/test.py", line 32, in <module>
    save_output(rendered, output_dir=OUTPUT_DIR, fname_base="hal_extracted_structured")
  File "/nobackup/kp276129/envs/ontoflow/lib/python3.12/site-packages/marker/output.py", line 97, in save_output
    f.write(json.dumps(rendered.metadata, indent=2))
                       ^^^^^^^^^^^^^^^^^
  File "/nobackup/kp276129/envs/ontoflow/lib/python3.12/site-packages/pydantic/main.py", line 1026, in __getattr__
    raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'ExtractionOutput' object has no attribute 'metadata'
⚙️ Environment
Please fill in all relevant details:
- Marker version: marker-pdf 1.10.1
- Surya version: 0.17.0
- Python version: 3.12.3
- PyTorch version: 2.9.0+cu126
- Transformers version: 4.57.1
- Operating System :
- Distributor ID: Ubuntu
- Description: Ubuntu 24.04.3 LTS
- Release: 24.04
- Codename: noble
✅ Expected Behavior
I expected Marker to output hal_extracted_structured.json in OUTPUT_DIR without any error.
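In the meantime, a possible workaround is to skip save_output for extraction results and write the JSON file directly. This is only a minimal sketch: it assumes the object returned by ExtractionConverter is a Pydantic v2 model (the traceback passes through pydantic/main.py, which suggests so) and therefore exposes model_dump_json. The helper name save_extraction_output is hypothetical and not part of marker.

```python
import json
from pathlib import Path
from typing import Any


def save_extraction_output(rendered: Any, output_dir: Path, fname_base: str) -> Path:
    """Write `rendered` to <output_dir>/<fname_base>.json and return the path.

    Hypothetical helper (not part of marker). It never touches
    `rendered.metadata`, which is the attribute missing in the traceback.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / f"{fname_base}.json"

    if hasattr(rendered, "model_dump_json"):
        # Assumption: Pydantic v2 model, which can serialize itself to JSON.
        text = rendered.model_dump_json(indent=2)
    else:
        # Fallback for plain dicts/lists; unknown types are stringified.
        text = json.dumps(rendered, indent=2, default=str)

    out_path.write_text(text, encoding="utf-8")
    return out_path
```

With the script from this report, the last line would become save_extraction_output(rendered, OUTPUT_DIR, "hal_extracted_structured").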
📟 Command or Code Used
# https://github.com/datalab-to/marker?tab=readme-ov-file#structured-extraction-beta
from pathlib import Path
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from marker.converters.extraction import ExtractionConverter
from marker.output import save_output
from templates import PaperMetadata
INPUT_DIR = Path("/home/kp276129/Documents/ontoflow/pdf_analysis/input")
OUTPUT_DIR = Path("/home/kp276129/Documents/ontoflow/pdf_analysis/output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
schema = PaperMetadata.model_json_schema()
config_parser = ConfigParser({
    "page_schema": schema,
    "use_llm": True,
    "disable_image_extraction": True,
    "ollama_base_url": "http://localhost:11434",
    "ollama_model": "gemma3",
    "llm_service": "marker.services.ollama.OllamaService",
})
converter = ExtractionConverter(
    artifact_dict=create_model_dict(),
    config=config_parser.generate_config_dict(),
    llm_service=config_parser.get_llm_service(),
)
rendered = converter(str(INPUT_DIR / "hal.pdf"))
save_output(rendered, output_dir=OUTPUT_DIR, fname_base="hal_extracted_structured")

Oops, forgot to include my PaperMetadata template:
from __future__ import annotations
from typing import List, Optional, Dict
from pydantic import BaseModel, Field
class Figure(BaseModel):
    """Represents a figure, diagram, or image in the document."""
    caption: Optional[str] = Field(
        None, description="The exact caption of the figure, if one exists."
    )
    description: str = Field(
        ...,
        description=(
            "A detailed textual description of what the image shows."
        ),
    )
    page_number: Optional[int] = Field(
        None, description="The page number where the figure appears."
    )
class PaperMetadata(BaseModel):
    """Metadata model for a scientific article / technical report."""
    title: str = Field(..., description="Title of the article")
    authors: List[str] = Field(
        default_factory=list, description="List of authors, order preserved"
    )
    affiliations: Optional[List[str]] = Field(
        default=None, description="List of affiliations"
    )
    abstract: Optional[str] = Field(None, description="Abstract")
    keywords: Optional[List[str]] = Field(default=None, description="Keywords")
    doi: Optional[str] = Field(None, description="DOI if present")
    publication_date: Optional[str] = Field(
        None,
        description=(
            "Publication date (ISO 'YYYY-MM-DD' preferred). "
            "Accepted formats: 'YYYY-MM-DD', '25 Jul 2017', 'Submitted on 25 Jul 2017'."
        ),
    )
    journal: Optional[str] = Field(
        None, description="Name of the journal / conference"
    )
    volume: Optional[str] = Field(None, description="Volume")
    issue: Optional[str] = Field(None, description="Issue number")
    pages: Optional[str] = Field(None, description="Pages, e.g. '123-135'")
    figures: Optional[List[Figure]] = Field(
        default_factory=list,
        description="List of all figures, diagrams, and images found in the document.",
    )