Chunk by chunk iteration made easier
pip install pybite
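A minimal quick-start sketch (assuming the documented functions are importable from the top-level pybite package, which is an assumption based on the package name):

from pybite import iterate_by

# Iterate over the data two items at a time
for chunk in iterate_by([1, 2, 3, 4, 5], 2):
    print(chunk)  # [1, 2], then [3, 4], then [5]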
iterate_by: Return an iterator of chunks for the iterable.
iterable: iter
Any iterable data e.g. list, tuple, dict, iter, ...
chunk_size: int
The size of each chunk
map: callable optional, defaults None
A map function to transform the data before dividing it into chunks
persist_header: bool optional, defaults False
Whether to persist a header on each chunk or not. If persist_header is True and header is None, the first line will be read as the actual header.
header: str optional, defaults None
A header to be included at the start of each chunk. If persist_header is True and header is None, the first line will be read as the actual header.
iter
New iterable for the chunked data
>>> iterate_by([1, 2, 3, 4, 5], 2)
iter([[1, 2], [3, 4], [5]])
>>> iterate_by([1, 2, 3, 4, 5], 2, map=lambda x: x * 2)
iter([[2, 4], [6, 8], [10]])
>>> iterate_by(["numbers", 1, 2, 3, 4, 5], 2, persist_header=True)
iter([["numbers", 1, 2], ["numbers", 3, 4], ["numbers", 5]])
iterate_file_by_lines: Return an iterator of file lines.
Each line is read on demand until there are no more lines to be read.
file_stream : str, io.StringIO
A file path or io.StringIO instance to the file to be read.
encoding : str optional, defaults None
The input file encoding.
strip_end : bool optional, defaults False
Flag to strip the end of each line. Same as calling line.rstrip().
iter
New iterable over the file lines
with open("test.txt", "w", encoding="utf-8") as f:
f.write("Symbols\nAyp\nBx\nCC\nDt")
>>> iterate_file_by_lines("test.txt", encoding="utf-8")
iter(["Symbols\n", "Ayp\n", "Bx\n", "CC\n", "Dt"])
>>> iterate_file_by_lines("test.txt", encoding="utf-8", strip_end=True)
iter(["Symbols", "Ayp", "Bx", "CC", "Dt"])
split_by_lines: Split a file into multiple files and store them in the output path.
The file is divided by lines into chunk files that can later be read individually or joined back together.
file_stream: str, io.StringIO
A file path or io.StringIO instance to the file you want to split.
output_path: str
The path to the directory where the chunks will be stored.
lines: int
The number of lines in each chunk file. If the header is included, the total will be lines + 1.
encoding: str optional, defaults None
The input file encoding. The chunks will be saved using the same encoding.
persist_header: bool optional, defaults False
Whether to persist a header on each chunk or not. If persist_header is True and header is None, the first line will be read as the actual header.
header: str optional, defaults None
A header to be written at the start of each file. If persist_header is True and header is None, the first line will be read as the actual header.
chunk_name_format: str optional, defaults "04d"
The format string for the chunk numbers in the output file names.
list
List of paths to the written files.
with open("test.txt", "w", encoding="utf-8") as f:
f.write("Symbols\nAyp\nBx\nCC\nDt")
>>> split_by_lines("test.txt", "out", 2, encoding="utf-8")
["out/test.chunk0000.txt", "out/test.chunk0001.txt",
"out/test.chunk0002.txt"]
>>> split_by_lines("test.txt", "out", 2, encoding="utf-8", persist_header=True)
["out/test.chunk0000.txt", "out/test.chunk0001.txt"]
Join chunk files into a single line stream.
Join the files created by split_by_lines into an iterable of str lines, read in name order. Missing chunks will raise an error unless ignore_missing_chunks is True.
Avoid saving chunks from different files in the same directory.
files_path: str
The path to a directory containing the chunk files, or a list of paths to the files.
encoding: str optional, defaults None
The encoding of the chunk files.
persisted_header: bool optional, defaults False
Whether a header was persisted on each chunk or not. If persisted_header is True and header is None, the first line of the first file will be read as the actual header.
header: str optional, defaults None
A header to be read at the start of the first file. If persisted_header is True and header is None, the first line of the first file will be read as the actual header.
ignore_missing_chunks: bool optional, defaults False
Flag to ignore missing chunks.
iter
An iterable over the lines of the files (read in name order).
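The join function's exact name and signature are not shown above, so the snippet below uses a hypothetical stand-in (join_sketch) that only illustrates the described behaviour: read the chunk files in name order and skip the repeated header in every chunk after the first.

import glob
import os

def join_sketch(files_path, encoding=None, persisted_header=False):
    # Chunk files are consumed in sorted-name order
    paths = sorted(glob.glob(os.path.join(files_path, "*")))
    for i, path in enumerate(paths):
        with open(path, encoding=encoding) as f:
            if persisted_header and i > 0:
                next(f, None)  # skip the header repeated in this chunk
            yield from f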
slice_file_by_lines: Slice a file by lines into an iterable of the sliced lines.
Returns only the cut lines as an iterable of str. It does not work with negative numbers.
files_stream: str
The path to the file to be cut.
start: int
Initial position of the cut. Use a positive number.
end: int
End position of the cut. Use a positive number.
encoding: str optional, defaults None
The input file encoding.
persisted_header: bool optional, defaults False
Whether a header is present or not. If persisted_header is True and header is None, the first line of the file will be read as the actual header.
header: str optional, defaults None
A header to be read at the start of the file. If persisted_header is True and header is None, the first line of the file will be read as the actual header.
iter
An iterable with the cut portion of the file
test_data = [
    "Name;Age;",
    "Test;00;",
    "Test;11;",
    "John;22;",
    "Test;33;",
    "Test;44;",
    "Test;55;",
    "Test;66;",
    "Test;77;",
    "Test;88;",
]
with open("./text.txt", "w") as f:
f.write("\n".join(test_data))
>>> slice_file_by_lines(input_path, 1, 2, persist_header=True)
iter(["Name;Age;\n", "Test;11;\n", "John;22;\n"])
>>> slice_file_by_lines(input_path, 1, 2)
iter(["Test;11;\n", "John;22;\n"])
Tests are handled by pytest and are included inside the test folder.
To run the tests, execute the command:
pytest