
[BUG: Breaking] AttributeError: 'ExtractionOutput' object has no attribute 'metadata' #939

@kipavy

🧨 Describe the Bug

Hi, the docs weren't clear about how to save output, so I assumed I needed to use from marker.output import save_output. That works well with PdfConverter, but not with ExtractionConverter (which I use for structured extraction): I get AttributeError: 'ExtractionOutput' object has no attribute 'metadata'. On top of that, all my attempts at structured extraction with ollama have failed, while it works well with gemini, but that's a separate issue I guess. (PS: I found closed issue #785, which is exactly my second problem with marker + ollama. I wonder why it's closed, because it's still happening.)

📄 Input Document

It happens with any PDF, but here is a short 3-page PDF to test.
hal.pdf

📤 Output Trace / Stack Trace

Running page extraction: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.42s/it]
Traceback (most recent call last):
  File "/home/kp276129/Documents/ontoflow/pdf_analysis/test.py", line 32, in <module>
    save_output(rendered, output_dir=OUTPUT_DIR, fname_base="hal_extracted_structured")
  File "/nobackup/kp276129/envs/ontoflow/lib/python3.12/site-packages/marker/output.py", line 97, in save_output
    f.write(json.dumps(rendered.metadata, indent=2))
                       ^^^^^^^^^^^^^^^^^
  File "/nobackup/kp276129/envs/ontoflow/lib/python3.12/site-packages/pydantic/main.py", line 1026, in __getattr__
    raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'ExtractionOutput' object has no attribute 'metadata'

⚙️ Environment

  • Marker version: marker-pdf 1.10.1
  • Surya version: 0.17.0
  • Python version: 3.12.3
  • PyTorch version: 2.9.0+cu126
  • Transformers version: 4.57.1
  • Operating System:
    • Distributor ID: Ubuntu
    • Description: Ubuntu 24.04.3 LTS
    • Release: 24.04
    • Codename: noble

✅ Expected Behavior

I expected Marker to output hal_extracted_structured.json in OUTPUT_DIR without any error.
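
A possible stopgap until save_output handles ExtractionOutput is a small guarded helper, sketched below. It rests on assumptions not confirmed in this report: that the extraction result exposes its JSON as a string attribute named document_json, and that skipping the metadata file is acceptable when the attribute is missing. save_output_safe and the _meta.json suffix are hypothetical names for illustration, and SimpleNamespace stands in for the real output object so the sketch runs without marker installed.

```python
import json
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_output_safe(rendered, output_dir, fname_base):
    """Write the main output and, only if present, a metadata sidecar."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: the extraction result carries its JSON as a string in
    # `document_json`; adjust the attribute name if the real class differs.
    text = getattr(rendered, "document_json", None)
    if text is None:
        raise ValueError("rendered object has no 'document_json' attribute")

    main_path = out_dir / f"{fname_base}.json"
    main_path.write_text(text, encoding="utf-8")

    # The crash in the trace comes from reading `rendered.metadata`
    # unconditionally, so only write the sidecar when the attribute exists.
    metadata = getattr(rendered, "metadata", None)
    meta_path = None
    if metadata is not None:
        meta_path = out_dir / f"{fname_base}_meta.json"
        meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return main_path, meta_path


if __name__ == "__main__":
    # Stand-in object, not marker's real ExtractionOutput.
    fake = SimpleNamespace(document_json='{"title": "Example"}')
    with tempfile.TemporaryDirectory() as tmp:
        main_path, meta_path = save_output_safe(fake, tmp, "hal_extracted_structured")
        print(main_path.name, meta_path)
```

In the script from this report, the last line would become save_output_safe(rendered, OUTPUT_DIR, "hal_extracted_structured"). getattr with a default also works on pydantic models, since a missing attribute raises AttributeError, which getattr turns into the default.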

📟 Command or Code Used

# https://github.com/datalab-to/marker?tab=readme-ov-file#structured-extraction-beta
from pathlib import Path
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from marker.converters.extraction import ExtractionConverter
from marker.output import save_output
from templates import PaperMetadata

INPUT_DIR = Path("/home/kp276129/Documents/ontoflow/pdf_analysis/input")
OUTPUT_DIR = Path("/home/kp276129/Documents/ontoflow/pdf_analysis/output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


schema = PaperMetadata.model_json_schema()

config_parser = ConfigParser({
    "page_schema": schema,
    "use_llm": True,
    "disable_image_extraction": True,
    "ollama_base_url": "http://localhost:11434",
    "ollama_model": "gemma3",
    "llm_service": "marker.services.ollama.OllamaService",
})

converter = ExtractionConverter(
    artifact_dict=create_model_dict(),
    config=config_parser.generate_config_dict(),
    llm_service=config_parser.get_llm_service(),
)

rendered = converter(str(INPUT_DIR / "hal.pdf"))
save_output(rendered, output_dir=OUTPUT_DIR, fname_base="hal_extracted_structured")

Oops, forgot to include my PaperMetadata template:

from __future__ import annotations

from typing import List, Optional
from pydantic import BaseModel, Field


class Figure(BaseModel):
    """Represents a figure, diagram, or image in the document."""

    caption: Optional[str] = Field(
        None, description="The exact caption of the figure, if one exists."
    )
    description: str = Field(
        ...,
        description=(
            "A detailed textual description of what the image shows."
        ),
    )
    page_number: Optional[int] = Field(
        None, description="The number of the page where the figure appears."
    )


class PaperMetadata(BaseModel):
    """Metadata model for a scientific article / technical report."""

    title: str = Field(..., description="Title of the article")
    authors: List[str] = Field(
        default_factory=list, description="List of authors, order preserved"
    )
    affiliations: Optional[List[str]] = Field(
        default=None, description="List of affiliations"
    )
    abstract: Optional[str] = Field(None, description="Summary / abstract")
    keywords: Optional[List[str]] = Field(default=None, description="Keywords")
    doi: Optional[str] = Field(None, description="DOI if present")
    publication_date: Optional[str] = Field(
        None,
        description=(
            "Publication date (ISO 'YYYY-MM-DD' preferred). "
            "Accepted formats: 'YYYY-MM-DD', '25 Jul 2017', 'Submitted on 25 Jul 2017'."
        ),
    )
    journal: Optional[str] = Field(
        None, description="Name of the journal / conference"
    )
    volume: Optional[str] = Field(None, description="Volume")
    issue: Optional[str] = Field(None, description="Issue number")
    pages: Optional[str] = Field(None, description="Pages, e.g. '123-135'")

    figures: Optional[List[Figure]] = Field(
        default_factory=list,
        description="List of all figures, diagrams, and images found in the document.",
    )
