tables_io.io_utils.iterator
IO Iterator Read Functions for tables_io
Functions
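iterator – Iterates through the data in a given file. The data is yielded (along with the start and stop index) as a Table-like object.
iterator_native – Iterates through the data in a given file. The data is yielded (along with the start and stop index) as a Table-like object that has the default format for the given file type.
get_input_data_length – Opens the given file and gets the length of data in that file. If the data is multi-dimensional, the function will give the length of the first axis of the data.
get_input_data_length_HDF5 – Open an HDF5 file and return the size of a group
iter_HDF5_to_dict – Iterates through an HDF5 file, yielding one chunk of data at a time
iter_H5_to_dataframe – Iterator for sending chunks of data in HDF5.
iter_pq_to_dataframe – Iterates through a parquet file, yielding one chunk of data at a time.
iter_csv_to_dataframe – Iterates through a CSV file, yielding one chunk of data at a time.
get_input_data_length_pq – Open a Parquet file and return the size of a group
get_input_data_length_csv – Open a CSV file and return the number of rows in it
iter_ds_to_table – Iterator for sending chunks of data in parquet
get_input_data_length_ds – Open a dataset and return the number of rows in a group
split_tasks_by_rank – Iterate through a list of tasks, yielding ones this process is responsible for.
data_ranges_by_rank – Split a number of rows by process.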
Module Contents
- iterator(filepath: str, tType: str | int, fmt: str | None = None, chunk_size: int | None = 100000, rank: int | None = 0, parallel_size: int | None = 1, **kwargs)
Iterates through the data in a given file. The data is yielded (along with the start and stop index) as a Table-like object. The data will be read in as the tabular format given by tType. Uses iterator_native() to iterate through data, and converts it as it is yielded.
For a given file type, there are additional arguments that can be supplied to the native file reader. The main arguments that can be supplied are groupname for HDF5 files, and columns for parquet files. Other arguments for reading parquet files can be found in the documentation of pyarrow.parquet.read_table or pyarrow.dataset.dataset.
This function currently only works for the following file types: numpyHDF5, pandasParquet, pyarrowParquet, pyarrowHDF5
Accepted tabular types:
Format string        Format integer
“astropyTable”       0
“numpyDict”          1
“numpyRecarray”      2
“pandasDataFrame”    3
“pyarrowTable”       4
- Parameters:
filepath (str) – File to load
tType (str, int or None) – Table type, given as one of the format strings or format integers listed above; if None this will use readNative
fmt (str or None) – File format, if None it will be taken from the file extension
chunk_size (int, by default 100,000) – The size of data chunk to iterate over
rank (int, by default 0) – The rank of this process under MPI
parallel_size (int, by default 1) – The number of processes under MPI
- Returns:
start (int) – The starting index for the data.
stop (int) – The end index for the data.
data (Table-like) – The data from [start:stop]. The format will be the native tabular format for the file if no tType is given. Otherwise, the data will be in the tabular format tType.
Optional kwargs
- groupname (str or None, by default None)
For HDF5 files, the group name where the data is.
- columns (list of str or None, by default None)
For parquet files, the names of the columns to read. None will read all the columns.
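As a rough illustration of how the yielded triples might be consumed, the sketch below tallies the rows in each chunk. The helper `chunk_row_counts` is hypothetical and not part of tables_io; it assumes only that each item is a `(start, stop, data)` triple as described above. The file name and group name in the commented call are placeholders.

```python
from typing import Any, Iterable, List, Tuple


def chunk_row_counts(
    chunks: Iterable[Tuple[int, int, Any]],
) -> List[Tuple[int, int, int]]:
    """Return (start, stop, n_rows) for every (start, stop, data) triple."""
    return [(start, stop, stop - start) for start, stop, _data in chunks]


# Stand-in for the triples that iterator() is documented to yield.
fake_chunks = [(0, 100, {"x": "..."}), (100, 150, {"x": "..."})]
print(chunk_row_counts(fake_chunks))

# With tables_io installed, the same helper could wrap the real iterator
# (file name and group name below are placeholders):
#
#   from tables_io.io_utils.iterator import iterator
#   chunks = iterator("data.hdf5", "numpyDict", groupname="data", chunk_size=100)
#   print(chunk_row_counts(chunks))
```

Keeping the bookkeeping separate from the file access makes it easy to check the chunk boundaries without needing a data file.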
- iterator_native(filepath: str, fmt: str | None = None, chunk_size: int | None = 100000, rank: int | None = 0, parallel_size: int | None = 1, **kwargs)
Iterates through the data in a given file. The data is yielded (along with the start and stop index) as a Table-like object that has the default format for the given file type.
This function currently only works for the following file types: numpyHDF5, pandasParquet, pyarrowParquet, pyarrowHDF5
Any kwargs are passed to the specific iterator function for the file type.
- Parameters:
filepath (str) – File to load
fmt (str or None) – File format, if None it will be taken from the file extension. By default None.
chunk_size (int, by default 100,000) – The size of data chunk to iterate over
rank (int, by default 0) – The rank of this process under MPI
parallel_size (int, by default 1) – The number of processes under MPI
- Returns:
start (int) – Data start index
stop (int) – Data ending index
data (Table-like) – The data in the native type for that file, from [start:stop]
Optional kwargs
- groupname (str or None)
For HDF5 files, the group name where the data is
- columns (list of str or None)
For parquet files, the names of the columns to read. None will read all the columns
- get_input_data_length(filepath: str, fmt: str | None = None, **kwargs)
Opens the given file and gets the length of data in that file. If the data is multi-dimensional, the function will give the length of the first axis of the data, which is typically the axis that you want to iterate over.
- Parameters:
filepath (str) – File to load
fmt (str or None) – File format, if None it will be taken from the file extension.
- Returns:
nrows – The length of the data
- Return type:
int
Notes
The kwargs are passed to the specific iterator type.
- get_input_data_length_HDF5(filepath: str, groupname: str | None = None) -> int
Open an HDF5 file and return the size of a group
- Parameters:
filepath (str) – The input filepath.
groupname (str or None) – The group name where the data is, by default None.
- Returns:
length – The length of the data. In the case of a multi-dimensional array, this is the length of the first axis.
- Return type:
int
Notes
For a multi-dimensional array this returns the length of the first axis and not the total size of the array.
Normally that is the axis you want to iterate over.
- iter_HDF5_to_dict(filepath: str, groupname: str | None = None, chunk_size: int = 100000, rank: int = 0, parallel_size: int = 1) -> collections.abc.Iterator[int, int, Mapping]
Iterates through an HDF5 file, yielding one chunk of data at a time as an OrderedDict of np.array objects.
- Parameters:
filepath (str) – The input filepath
groupname (str or None) – The group name where the data is, by default None.
chunk_size (int, by default 100,000) – The size of data chunk to iterate over
rank (int, by default 0) – The rank of this process under MPI
parallel_size (int, by default 1) – The number of processes under MPI
- Yields:
start (int) – Data start index
end (int) – Data ending index
data (dict) – OrderedDict of np.array of all data from start:end
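The chunking pattern described above can be sketched on in-memory data. This is an analogue for illustration, not the tables_io implementation: plain Python lists stand in for the HDF5 datasets, and the name `iter_columns_in_chunks` is hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Iterator, Sequence, Tuple


def iter_columns_in_chunks(
    columns: Dict[str, Sequence], chunk_size: int
) -> Iterator[Tuple[int, int, OrderedDict]]:
    """Yield (start, end, chunk) where chunk maps each name to rows start:end."""
    # Assumes every column has the same length as the first one.
    n_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, n_rows, chunk_size):
        end = min(start + chunk_size, n_rows)  # the last chunk may be shorter
        chunk = OrderedDict((name, values[start:end]) for name, values in columns.items())
        yield start, end, chunk


data = {"id": [0, 1, 2, 3, 4], "flux": [1.5, 2.5, 3.5, 4.5, 5.5]}
for start, end, chunk in iter_columns_in_chunks(data, chunk_size=2):
    print(start, end, dict(chunk))
```

Every column is sliced with the same `start:end` range, so rows stay aligned across columns within a chunk.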
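- iter_H5_to_dataframe(filepath: str, chunk_size: int | None = 100000, groupname=None, rank: int | None = 0, parallel_size: int | None = 1)
Iterator for sending chunks of data in HDF5.
- Parameters:
filepath (str) – Input file name
chunk_size (int, by default 100,000) – Size of chunk to iterate over
groupname (str or None) – The group name where the data is, by default None
rank (int, by default 0) – The rank of this process under MPI
parallel_size (int, by default 1) – The number of processes under MPI
- Returns:
output – Iterator chunk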
- iter_pq_to_dataframe(filepath: str, chunk_size: int = 100000, columns: List[str] | None = None, **kwargs)
Iterates through a parquet file, yielding one chunk of data at a time.
- Parameters:
filepath (str) – path to input file
chunk_size (int, by default = 100,000) – The maximum chunk size of the data
columns (list (str) or None) – Names of the columns to read, None will read all the columns
kwargs – Additional arguments to pass to the parquet read_table function
- Yields:
start (int) – Data start index
end (int) – Data ending index
data (pandas.DataFrame) – DataFrame of all data from start:end
- iter_csv_to_dataframe(filepath: str, chunk_size: int = 100000, **kwargs)
Iterates through a CSV file, yielding one chunk of data at a time.
- Parameters:
filepath (str) – path to input file
chunk_size (int, by default = 100,000) – The maximum chunk size of the data
kwargs – Additional arguments to pass to the pandas read_csv function
- Yields:
data (pandas.DataFrame) – DataFrame of all data from start:end
- get_input_data_length_pq(filepath: str, columns: List[str] | None = None, **kwargs) -> int
Open a Parquet file and return the size of a group
- Parameters:
filepath (str) – Path to input file
columns (List[str] or None) – Names of the columns to read, None will read all the columns
kwargs – Additional arguments to pass to the pyarrow.parquet.read_table function
- Returns:
nrow – The length of the data
- Return type:
int
Notes
For a multi-dimensional array this returns the length of the first axis and not the total size of the array.
Normally that is the axis you want to iterate over.
- get_input_data_length_csv(source: str, **kwargs) -> int
Open a CSV file and return the number of rows in it
- Parameters:
source (str) – Path to input file
- Returns:
nrows – The length of the data
- Return type:
int
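For intuition, counting the rows of a CSV can be sketched with the standard library alone. This is not how tables_io does it; the helper name `count_csv_rows` is hypothetical. The assumptions that the first row is a header excluded from the count, and that blank lines are skipped, are both illustrative choices rather than documented behaviour.

```python
import csv
import io
from typing import TextIO


def count_csv_rows(stream: TextIO, has_header: bool = True) -> int:
    """Count data rows in CSV text, streaming so the file is never fully loaded."""
    reader = csv.reader(stream)
    n_rows = sum(1 for row in reader if row)  # skip blank lines
    if has_header:
        n_rows = max(n_rows - 1, 0)  # assumption: header is not a data row
    return n_rows


sample = io.StringIO("id,flux\n0,1.5\n1,2.5\n2,3.5\n")
print(count_csv_rows(sample))
```

Knowing the row count up front is what allows the rows to be divided into fixed-size chunks before any data is read.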
- iter_ds_to_table(source, columns: List[str] | None = None, chunk_size: int = 100000, **kwargs)
Iterator for sending chunks of data in parquet
- Parameters:
source (str) – input file name
columns (List[str], default None) – The list of columns to use
chunk_size (int) – The maximum size of the batches to be read in
kwargs – Additional arguments to pass to the pyarrow.dataset.to_batches() function
- Yields:
start (int) – Data start index
end (int) – Data ending index
data (pyarrow.Table) – table of all data from start:end
- get_input_data_length_ds(source, **kwargs) -> int
Open a dataset and return the number of rows in a group
- Parameters:
source (str) – Path to input file or directory
kwargs – kwargs are passed to pyarrow.dataset.dataset()
- Returns:
nrows – The length of the data
- Return type:
int
- split_tasks_by_rank(tasks: collections.abc.Iterable[int], parallel_size: int, rank: int) -> collections.abc.Iterator[int]
Iterate through a list of tasks, yielding ones this process is responsible for.
Tasks are allocated in a round-robin way.
- Parameters:
tasks (iterator) – Tasks to split up
parallel_size (int) – The number of processes under MPI
rank (int) – The rank of this process under MPI
- Yields:
task (int) – The number of the next task assigned to this process
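Round-robin allocation, as mentioned above, can be sketched in a few lines. This is a minimal illustration assuming task `i` goes to the process whose rank equals `i % parallel_size`; the name `round_robin_tasks` is hypothetical and the library's own implementation may differ in detail.

```python
from typing import Iterable, Iterator


def round_robin_tasks(
    tasks: Iterable[int], parallel_size: int, rank: int
) -> Iterator[int]:
    """Yield every task whose position modulo parallel_size equals rank."""
    for index, task in enumerate(tasks):
        if index % parallel_size == rank:
            yield task


# Seven tasks shared between three processes.
for rank in range(3):
    print(rank, list(round_robin_tasks(range(7), parallel_size=3, rank=rank)))
# prints:
# 0 [0, 3, 6]
# 1 [1, 4]
# 2 [2, 5]
```

Each task is handled by exactly one rank, and the workload per rank differs by at most one task.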
- data_ranges_by_rank(n_rows: int, chunk_rows: int, parallel_size: int, rank: int) -> collections.abc.Iterator[int, int]
Split a number of rows by process.
Given a total number of rows to read and a chunk size, yield the ranges within them that this process should handle.
- Parameters:
n_rows (int) – Total number of rows to split up
chunk_rows (int) – Size of each chunk to be read
parallel_size (int) – The number of processes under MPI
rank (int) – The rank of this process under MPI
- Yields:
start (int) – Data start index
end (int) – Data ending index
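One way to picture the splitting of rows is sketched below. It assumes the rows are cut into consecutive chunks of `chunk_rows` (the last one possibly shorter) and that chunks are then dealt out round-robin by rank; the name `chunk_ranges_for_rank` is hypothetical and this is not necessarily the library's exact logic.

```python
from typing import Iterator, Tuple


def chunk_ranges_for_rank(
    n_rows: int, chunk_rows: int, parallel_size: int, rank: int
) -> Iterator[Tuple[int, int]]:
    """Yield the (start, end) row ranges that one process should read."""
    n_chunks = -(-n_rows // chunk_rows)  # ceiling division
    for chunk_index in range(n_chunks):
        if chunk_index % parallel_size != rank:
            continue  # this chunk belongs to another process
        start = chunk_index * chunk_rows
        end = min(start + chunk_rows, n_rows)  # clip the final chunk
        yield start, end


# Ten rows in chunks of three, shared between two processes.
for rank in range(2):
    print(rank, list(chunk_ranges_for_rank(10, 3, parallel_size=2, rank=rank)))
# prints:
# 0 [(0, 3), (6, 9)]
# 1 [(3, 6), (9, 10)]
```

Taken over all ranks, the ranges cover every row exactly once, so the processes can read independently without coordinating.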