Description
FileInput and derived classes such as StringFileInput can handle lists of files from directory and glob.glob parameters. Still, all file content is read and passed as a single Packet. In addition, .zip files are handled by a dedicated class, ZipFileInput.
It should be possible to generalize FileInput so that derived classes can read from files regardless of whether those files come from directory structures, glob.glob-expanded file lists or .zip files. Even a mixture of these should be handled. For example, within NLExtract, https://github.com/nlextract/NLExtract/blob/master/bag/src/bagfilereader.py can handle any file structure provided.
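As a rough illustration of the generalization, the sketch below expands a mixed list of file specs into a single stream of open files. The function name `iter_files` and the `archive!member` naming of zip entries are assumptions made for this example; they are not existing Stetl API. Zip archives nested inside other zip archives are not handled.

```python
import glob
import os
import zipfile


def iter_files(specs):
    """Yield (name, binary file object) for every file reachable from specs.

    Each spec may be a directory, a glob pattern, a .zip archive or a plain
    file path (hypothetical helper, not part of Stetl). The file object is
    only valid until the next item is requested.
    """
    for spec in specs:
        if os.path.isdir(spec):
            # Walk the tree and recurse, so .zip files in directories are expanded too.
            for root, dirs, files in os.walk(spec):
                dirs.sort()
                for name in sorted(files):
                    yield from iter_files([os.path.join(root, name)])
        elif os.path.isfile(spec):
            if spec.lower().endswith(".zip") and zipfile.is_zipfile(spec):
                with zipfile.ZipFile(spec) as archive:
                    for info in archive.infolist():
                        if info.is_dir():
                            continue
                        with archive.open(info) as member:
                            yield f"{spec}!{info.filename}", member
            else:
                with open(spec, "rb") as fileobj:
                    yield spec, fileobj
        else:
            # Not an existing path: treat the spec as a glob pattern.
            for match in sorted(glob.glob(spec)):
                if os.path.exists(match):
                    yield from iter_files([match])
```

A derived input class could then consume `iter_files(["data/", "extra/*.xml", "dump.zip"])` without knowing where each file came from.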
A second aspect is file chunking: a FileInput may split up a single file into Packets containing data structures extracted from that file. For example, FileInputs like XmlElementStreamerFileInput and LineStreamerFileInput open/parse a file but pass file-content (lines, parsed elements) in fine-grained chunks on each read(). Currently these classes implement this fully within their read() function, but the generic pattern is that they maintain a "context" for the open/parsed file.
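The "context" idea can be sketched as a small object that holds the state of the currently open file between read() calls. The class names `LineChunkContext` and `ChunkedLineInput` are hypothetical and only illustrate the pattern; they do not reflect the current Stetl classes, and plain strings stand in for Packets.

```python
import io


class LineChunkContext:
    """Hypothetical holder for the state of one open file read in chunks."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.line_no = 0

    def read_chunk(self):
        """Return the next line without its newline, or None at end of file."""
        line = self.fileobj.readline()
        if not line:
            return None
        self.line_no += 1
        return line.rstrip("\n")


class ChunkedLineInput:
    """Hypothetical input that hands out one line per read() across many files."""

    def __init__(self, fileobjs):
        self._files = iter(fileobjs)
        self._context = None

    def read(self):
        while True:
            if self._context is None:
                fileobj = next(self._files, None)
                if fileobj is None:
                    return None  # all files consumed
                self._context = LineChunkContext(fileobj)
            chunk = self._context.read_chunk()
            if chunk is not None:
                return chunk
            self._context = None  # current file exhausted, move to the next


source = ChunkedLineInput([io.StringIO("a\nb\n"), io.StringIO("c\n")])
first = source.read()   # one chunk per call, state kept in the context
```

The read() method itself stays generic; only the context class needs to know how a particular file type is split up.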
So all in all this issue addresses two general aspects:
- handle any file specs: directories, maps, globbing, zip files and any mix of these
- handle fine-grained file chunking: each invoke()/read() may supply part of a file: a line, an XML element, etc.
See also issue #49 for additional discussion which led to this issue.
The Strategy Design Pattern may be applied (many refs on the web).
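One possible reading of the Strategy pattern here is to make the chunking behaviour a pluggable object passed to a single generic input class. In the sketch below all names (`whole_file`, `by_line`, `by_xml_element`, `StrategyFileInput`) are invented for illustration, and Python generators are used to keep the per-file state; this is an assumption about how the pattern could be applied, not a description of an agreed design.

```python
import io
import xml.etree.ElementTree as ET


def whole_file(fileobj):
    """Strategy: the complete file content as a single chunk."""
    yield fileobj.read()


def by_line(fileobj):
    """Strategy: one chunk per line of a text file."""
    for line in fileobj:
        yield line.rstrip("\n")


def by_xml_element(tag):
    """Build a strategy that yields each parsed element with the given tag."""
    def chunker(fileobj):
        for _event, elem in ET.iterparse(fileobj, events=("end",)):
            if elem.tag == tag:
                yield elem
    return chunker


class StrategyFileInput:
    """Hypothetical generic input: file handling here, chunking delegated."""

    def __init__(self, fileobjs, chunker):
        # Lazily chain the chunks of all files; nothing is read until read().
        self._chunks = (chunk for f in fileobjs for chunk in chunker(f))

    def read(self):
        """Return the next chunk, or None when every file is exhausted."""
        return next(self._chunks, None)


xml_source = io.BytesIO(b"<root><item>1</item><other/><item>2</item></root>")
xml_input = StrategyFileInput([xml_source], by_xml_element("item"))
first_item = xml_input.read()
```

Swapping `by_xml_element("item")` for `by_line` or `whole_file` changes the granularity of the chunks without touching the input class.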