DropSpecifiedFields processor implementation #144

Merged
merged 2 commits into from
Jul 21, 2025
3 changes: 3 additions & 0 deletions docs/src/sdp/api.rst
@@ -421,6 +421,9 @@ Miscellaneous
.. autodata:: sdp.processors.ipl.ipl_processors.InferenceCommandGenerator
   :annotation:

.. autodata:: sdp.processors.DropSpecifiedFields
   :annotation:

.. _sdp-base-classes:

Base classes
1 change: 1 addition & 0 deletions sdp/processors/__init__.py
@@ -89,6 +89,7 @@
    RenameFields,
    SortManifest,
    SplitOnFixedDuration,
    DropSpecifiedFields,
)
from sdp.processors.modify_manifest.create_manifest import (
    CreateCombinedManifests,
35 changes: 35 additions & 0 deletions sdp/processors/modify_manifest/common.py
@@ -401,3 +401,38 @@ def process(self):
        with open(self.output_manifest_file, "wt", encoding="utf8") as fout:
            for _, line in m3.iterrows():
                fout.write(json.dumps(dict(line), ensure_ascii=False) + "\n")


class DropSpecifiedFields(BaseProcessor):
    """
    A processor that removes specified fields from each data entry in the manifest.

    This processor reads an input manifest line by line, drops the fields listed in `fields_to_drop`
    from each JSON entry, and writes the cleaned entries to the output manifest.

    Args:
        fields_to_drop (List[str]): A list of keys to remove from each manifest entry.
        **kwargs: Additional arguments passed to the BaseProcessor (e.g., input/output manifest paths).

    Returns:
Review comment by @lilithgrigoryan (Collaborator), Jul 21, 2025:

Optionally, please add an example section to the docstring:

    Example:
        .. code-block:: yaml

            - _target_: sdp.processors.modify_manifest.common.DuplicateFields
              input_manifest_file: ${workspace_dir}/test1.json
              output_manifest_file: ${workspace_dir}/test2.json
              duplicate_fields: {"text":"answer"}

        A line-delimited JSON manifest, where each entry is the same as the input,
        but with the specified fields removed.
    """

    def __init__(self, fields_to_drop: List[str], **kwargs):
        super().__init__(**kwargs)
        self.fields_to_drop = fields_to_drop

    def process(self):
        # Open the input and output manifest files
        with open(self.input_manifest_file, "rt", encoding="utf8") as fin, open(
            self.output_manifest_file, "wt", encoding="utf8"
        ) as fout:
            # Iterate over each line (entry) in the input manifest
            for line in tqdm(fin):
                # Parse JSON entry from the current line
                entry = json.loads(line)
                # Create a new entry by excluding the specified fields
                new_line = {field: entry[field] for field in entry if field not in self.fields_to_drop}
                # Write the cleaned entry to the output manifest
                fout.write(json.dumps(new_line, ensure_ascii=False) + "\n")
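
To see what the processor does to a manifest, the same filtering step can be tried outside of SDP. The following is a minimal standalone sketch, not the code from this PR: the function names `drop_fields` and `demo` and the sample entries are hypothetical, and it uses only the standard library instead of `BaseProcessor` and `tqdm`. It also converts the field list to a set, which is a small variation on the list lookup used in the PR.

```python
import json
import os
import tempfile
from typing import Iterable, List


def drop_fields(input_path: str, output_path: str, fields_to_drop: Iterable[str]) -> None:
    """Copy a JSON-lines manifest, omitting the given keys from every entry.

    Hypothetical helper that mirrors the filtering logic of the processor.
    """
    drop = set(fields_to_drop)  # set membership avoids scanning a list per key
    with open(input_path, "rt", encoding="utf8") as fin, open(
        output_path, "wt", encoding="utf8"
    ) as fout:
        for line in fin:
            entry = json.loads(line)
            # Keep only the keys that are not listed for removal
            cleaned = {key: value for key, value in entry.items() if key not in drop}
            fout.write(json.dumps(cleaned, ensure_ascii=False) + "\n")


def demo() -> List[dict]:
    """Run drop_fields on a two-entry temporary manifest and return the result."""
    entries = [
        {"audio_filepath": "a.wav", "text": "hello", "duration": 1.5},
        {"audio_filepath": "b.wav", "text": "world"},  # has no "duration" key
    ]
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.json")
        dst = os.path.join(tmp, "out.json")
        with open(src, "wt", encoding="utf8") as f:
            for e in entries:
                f.write(json.dumps(e) + "\n")
        drop_fields(src, dst, ["duration", "text"])
        with open(dst, "rt", encoding="utf8") as f:
            return [json.loads(line) for line in f]


if __name__ == "__main__":
    print(demo())
```

Fields named in the drop list but absent from an entry are simply ignored, because the comprehension iterates over the keys the entry actually has.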