tables_io.io_utils.read
IO Read Functions for tables_io
Module Contents
- read(filepath: str, tType: int | str | None = None, fmt: str | None = None, keys: List[str] | None = None, allow_missing_keys: bool = False, slice_dict: dict[str, slice | int] | None = None, **kwargs)
Reads in a given file to either a Table-like format if there is one table within the file, or a TableDict-like format if there are multiple tables or files. Uses read_native() to read the file.
The TableDict-like format is an OrderedDict of Table-like objects. The Table-like objects currently supported are: astropyTable, numpyRecarray, numpyDict (dict of numpy arrays), pandasDataFrame, and pyarrowTable.
If given just the filepath, the function will read any tables in the file to its default Table-like format in memory. If given a specific tabular type, the function will read in the file to the default type and then convert to the requested type.
The keys argument is required when reading in multi-dataset parquet files, to specify which dataset files to read in. Otherwise, the only required argument is the filepath.
Accepted tabular types:
Format string        Format integer
“astropyTable”       0
“numpyDict”          1
“numpyRecarray”      2
“pandasDataFrame”    3
“pyarrowTable”       4
“jsonString”         5
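As a rough illustration of how the format strings and format integers relate, the sketch below normalizes a tType value given in either form. The names TABULAR_TYPES and normalize_ttype are hypothetical and not part of tables_io; only the mapping itself is taken from the accepted tabular types listed above.

```python
# Hypothetical helper, not part of tables_io. The mapping mirrors the
# "Accepted tabular types" listing in this section.
TABULAR_TYPES = {
    "astropyTable": 0,
    "numpyDict": 1,
    "numpyRecarray": 2,
    "pandasDataFrame": 3,
    "pyarrowTable": 4,
    "jsonString": 5,
}


def normalize_ttype(t_type):
    """Return the format integer for a format string or integer, or None."""
    if t_type is None:
        # The caller would fall back to the file's default tabular type.
        return None
    if isinstance(t_type, str):
        if t_type not in TABULAR_TYPES:
            raise ValueError(f"Unknown tabular type: {t_type!r}")
        return TABULAR_TYPES[t_type]
    if t_type in TABULAR_TYPES.values():
        return t_type
    raise ValueError(f"Unknown tabular type: {t_type!r}")
```

With this sketch, normalize_ttype("pandasDataFrame") and normalize_ttype(3) refer to the same tabular type.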
- Parameters:
filepath (str) – Full path to the file to load
tType (int, str or None) – Table type, if None the default table type will be used.
fmt (str or None) – File format, if None it will be taken from the file extension.
keys (list or None) – This argument is required for reading multiple associated parquet files. The keys should be the unique identifiers for each dataset or file.
allow_missing_keys (bool, by default False) – If False will raise FileNotFoundError if a key is missing from the given file.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
kwargs – Additional arguments to pass to the native file reader
- Returns:
data – The data
- Return type:
OrderedDict ( str -> Table-like )
Example
For a single Table-like object, we can read it in as follows:
>>> import tables_io
>>> df = tables_io.read('filename.h5')
>>> print(df)
   col1  col2
0     1     3
1     2     4
Notice that it has been automatically read in as the default tabular type for h5 files, a pandasDataFrame.
For a TableDict-like object, we read it in as follows:
>>> table_dict = tables_io.read('filename.hdf5', tType='astropyTable')
>>> table_dict
OrderedDict({'tab_1': <Table length=2>
  x     y
int64 int64
----- -----
    2     1
    4     3, 'tab_2': <Table length=2>
  a     b
int64 int64
----- -----
    5     3
    7     4})
Notice that the resulting OrderedDict has astropyTable objects as the values.
- read_native(filepath: str, fmt: str | None = None, keys: List[str] | None = None, allow_missing_keys: bool = False, slice_dict: dict[str, slice | int] | None = None, **kwargs)
Reads in a file to its corresponding default tabular format.
The format of the file is either given by fmt, or determined based on the suffix of the file path. This determines what tabular format the file is read in as. In all cases, the data from the file is returned as an OrderedDict or TableDict-like object, with str keys and Table-like values. The Table-like values can be astropyTable, numpyRecarray, numpyDict (dict of numpy arrays), pandasDataFrame, and pyarrowTable.
- Parameters:
filepath (str) – Full path of the file to load
fmt (str or None) – File format, if None it will be taken from the file extension.
keys (list or None) – This argument is required for reading multiple associated parquet files. The keys should be the unique identifiers for each dataset or file.
allow_missing_keys (bool, by default False) – If False will raise FileNotFoundError if a key is missing from the given file.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
kwargs – Additional arguments to pass to the native file reader
- Returns:
data – The data
- Return type:
OrderedDict ( str -> Table-like )
Example
Reading in a file that is in NUMPY_HDF5 format:
>>> import tables_io
>>> tab = tables_io.read_native('filename.hdf5')
>>> print(tab)
OrderedDict({'tab_1': OrderedDict({'col_1': array([0., 2.]), 'col_2': array([2., 3.])}), 'tab_2': OrderedDict({'col_a': array([1., 1.]), 'col_b': array([3., 3.])})})
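The slice_dict argument maps table names to a slice or an int. The sketch below builds one that requests the same row range from several tables. build_slice_dict is a hypothetical helper, and the table names are borrowed from the example output above. How an int value is interpreted is not stated in this section, so the sketch uses slice objects only.

```python
def build_slice_dict(table_names, start, stop):
    """Map every table name to the same row range (hypothetical helper)."""
    return {name: slice(start, stop) for name in table_names}


slice_dict = build_slice_dict(["tab_1", "tab_2"], 0, 100)

# Assumed usage; requires tables_io and an existing file:
# import tables_io
# tab = tables_io.read_native("filename.hdf5", slice_dict=slice_dict)
```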
- io_open(filepath: str, fmt: str | None = None, **kwargs)
Returns the file object. This allows you to open large files without reading the whole file into memory.
It opens the file object with different packages depending on the file type. It uses astropy to open FITS files (astropy.io.fits.open()), h5py for any HDF5 files (h5py.File()), or pyarrow parquet for any parquet files (pyarrow.parquet.ParquetFile()). You can specify which file type you are supplying via the fmt argument, or it will automatically determine the file type from its suffix.
If the given file is not one of the supported types, it will raise a TypeError.
- Parameters:
filepath (str) – The path to the file to load.
fmt (str or None) – The file format, if None it will be taken from the file extension.
- Return type:
File object. One of pyarrow.parquet.ParquetFile, h5py.File or astropy.io.fits.HDUList.
Example
For example, to read in a sample fits file:
>>> import tables_io
>>> hdul = tables_io.io_open("./data/test.fits", "fits")
>>> hdul.info()
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       4   ()
  1  DF            1 BinTableHDU     37   10R x 14C   [K, E, E, E, E, E, E, E, E, E, E, E, E, D]
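The suffix-based dispatch described for io_open can be pictured with the following minimal sketch. It is not the library's implementation: OPENERS and guess_opener are hypothetical names, and the set of suffixes is an assumption, since the suffixes tables_io actually recognizes are not listed here. Only the three opener names and the TypeError behavior come from the description above.

```python
from pathlib import Path

# Assumed suffix-to-opener table; the suffixes tables_io actually
# recognizes may differ.
OPENERS = {
    ".fits": "astropy.io.fits.open",
    ".hdf5": "h5py.File",
    ".parquet": "pyarrow.parquet.ParquetFile",
}


def guess_opener(filepath, fmt=None):
    """Return the dotted name of the opener that would handle the file."""
    # An explicit fmt takes precedence over the file suffix.
    suffix = f".{fmt}" if fmt is not None else Path(filepath).suffix
    try:
        return OPENERS[suffix]
    except KeyError:
        raise TypeError(f"Unsupported file type: {suffix!r}") from None
```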
- check_columns(filepath: str, columns_to_check: List[str], fmt: str | None = None, parent_groupname: str | None = None, **kwargs)
Read the file column names from file and ensure that it contains at least the columns specified in a provided list. If not, an error will be raised.
For FITS files, columns across all extensions will be checked at one time.
For HDF5 files, only columns within a single level of the specified parent_groupname will be checked.
Note: If more columns are available in the file than specified in the list, the file will still pass the check.
- Parameters:
filepath (str) – File name for the file to read. If there’s no suffix, it will be applied based on the object type.
columns_to_check (list) – A list of columns to be compared with the data
fmt (str or None) – The input file format. If None, this will use io_open
parent_groupname (str or None) – For hdf5 files, the groupname for the data
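The check itself amounts to a subset test: every requested column must be present, and extra columns in the file are ignored. A minimal sketch of that logic follows. missing_columns and ensure_columns are hypothetical helpers, and raising KeyError is an assumption, because the error type raised by check_columns is not stated in this section.

```python
def missing_columns(available, columns_to_check):
    """Return the requested columns that are absent, in request order."""
    available = set(available)
    return [col for col in columns_to_check if col not in available]


def ensure_columns(available, columns_to_check):
    """Raise if any requested column is absent; extra columns are fine."""
    missing = missing_columns(available, columns_to_check)
    if missing:
        # KeyError is an assumption made for this sketch.
        raise KeyError(f"Missing columns: {missing}")
```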
- read_fits_to_ap_tables(filepath: str, keys: List[str] | None = None, slice_dict: dict[str, slice | int] | None = None) -> Mapping
Reads astropy.table.Table objects into an OrderedDict TableDict-like object from a FITS file. If a list of keys is given, will read only those tables.
- Parameters:
filepath (str) – Path to input file
keys (list or None) – A list of which tables to read, in lower case.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
- Returns:
tables – Keys will be HDU names, values will be tables
- Return type:
OrderedDict of astropy.table.Table
- read_fits_to_recarrays(filepath: str, keys: List[str] | None = None, slice_dict: dict[str, slice | int] | None = None) -> Mapping
Reads np.recarray objects into an OrderedDict TableDict-like object from a FITS file. If a list of keys is given, will read only those tables.
- Parameters:
filepath (str) – Path to input file
keys (list or None) – A list of which HDU names to read, in lower case.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
- Returns:
tables – Keys will be HDU names, values will be tables
- Return type:
OrderedDict of np.recarray
- read_HDF5_to_ap_tables(filepath: str, keys: List[str] | None = None, slice_dict: dict[str, slice | int] | None = None) -> Mapping
Reads astropy.table.Table objects into an OrderedDict TableDict-like object from an hdf5 file.
- Parameters:
filepath (str) – Path to input file
keys (list or None) – A list of which datasets to read in.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
- Returns:
tables – Keys will be ‘paths’, values will be tables
- Return type:
OrderedDict of astropy.table.Table
- read_HDF5_group(filepath: str, groupname: str | None = None, read_slice: slice | int | None = None)
Read and return the requested group and file object from an hdf5 file. If no group is provided, returns the h5py.File object twice.
- Parameters:
filepath (str) – File in question
groupname (str or None) – The name or path to the desired group.
read_slice (slice or int or None) – Slice of data to read
- Returns:
grp (h5py.Group or h5py.File) – The requested group
infp (h5py.File) – The input file (returned so that the user can explicitly close the file)
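Because the file object is returned so that the caller can close it, one convenient pattern is to wrap the call in a context manager. The sketch below is an assumption about usage, not part of tables_io: open_hdf5_group and its reader parameter are hypothetical, and the import path is inferred from this module's name.

```python
from contextlib import contextmanager


@contextmanager
def open_hdf5_group(filepath, groupname=None, reader=None):
    """Yield the requested group and close the file afterwards."""
    if reader is None:
        # Imported lazily so the sketch can be exercised without tables_io.
        from tables_io.io_utils.read import read_HDF5_group as reader
    grp, infp = reader(filepath, groupname)
    try:
        yield grp
    finally:
        # infp is the file object listed under Returns.
        infp.close()
```

The reader parameter exists only so the wrapper can be tested with a stand-in; in normal use it would be left as None.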
- read_HDF5_group_to_dict(hg, start: int | None = None, end: int | None = None)
Reads numpy.array objects from an open hdf5 file object. If given a dataset, returns a numpy.array of that dataset. If given a group, it will read numpy.array objects into an OrderedDict for all of the keys in that group. If start and end are provided, it will only read in the given slice [start:end] of all the datasets.
- Parameters:
hg (hdf5 object) – The hdf5 object to read in, either a dataset or a group.
start (int or None) – Starting row of dataset(s) to read.
end (int or None) – Ending row of dataset(s) to read.
- Returns:
tables – Keys will be ‘paths’, values will be arrays in the case of an OrderedDict.
- Return type:
OrderedDict of numpy.array or a numpy.array
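The dataset-versus-group branching and the [start:end] slicing described above can be mimicked with plain Python stand-ins. In the sketch below a dict plays the role of a group and a list plays the role of a dataset; group_to_dict is a hypothetical name, and the real function operates on h5py objects and returns numpy arrays. Nested groups are not handled, since this section does not say how they are treated.

```python
from collections import OrderedDict


def group_to_dict(hg, start=None, end=None):
    """Mimic the dataset/group branching with plain Python stand-ins."""
    if isinstance(hg, dict):
        # Stand-in for a group: slice every dataset it contains.
        return OrderedDict((key, hg[key][start:end]) for key in hg)
    # Stand-in for a single dataset.
    return hg[start:end]
```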
- read_HDF5_group_names(filepath: str, parent_groupname: str | None = None) -> List[str]
Read and return the list of group names from one level of an hdf5 file.
- Parameters:
filepath (str) – File in question
parent_groupname (str or None) – For hdf5 files, the parent groupname. All group names under this will be returned. If None, return the top level group names.
- Returns:
names – The names of the groups in the file
- Return type:
list of str
- read_HDF5_to_dicts(filepath: str, keys: List[str] | None = None, slice_dict: dict[str, slice | int] | None = None) -> Mapping
Reads numpy.array objects into an OrderedDict from an hdf5 file. If a list of keys is given, will only read those specific datasets.
- Parameters:
filepath (str) – Path to input file
keys (list or None) – A list of which tables to read from the file.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
- Returns:
dicts – The data
- Return type:
OrderedDict, (str, OrderedDict, (str, numpy.array) )
- read_HDF5_dataset_to_array(dataset, start: int | None = None, end: int | None = None) -> numpy.array
Reads all or part of an hdf5 dataset into a numpy.array.
- Parameters:
dataset (h5py.Dataset) – The input dataset
start (int or None) – Starting row
end (int or None) – Ending row
- Returns:
out – Something that pandas can handle
- Return type:
numpy.array
- read_H5_to_dataframe(filepath: str, key: str | None = None, read_slice: slice | int | None = None)
Reads pandas.DataFrame objects from an ‘h5’ file (a pandas hdf5 file).
- Parameters:
filepath (str) – Path to input file
key (str or None) – The key in the hdf5 file
read_slice (slice or int or None) – Slice of data to read
- Returns:
df – The dataframe
- Return type:
pandas.DataFrame
- read_H5_to_dataframes(filepath: str, keys: List[str] | None = None, slice_dict: dict[str, slice | int] | None = None) -> Mapping
Opens an h5 (pandas hdf5) file and returns an OrderedDict of pandas.DataFrame objects.
- Parameters:
filepath (str) – Path to input file
keys (list or None) – A list of which tables to read.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
- Returns:
tab – The data
- Return type:
OrderedDict (str : pandas.DataFrame)
Notes
We are using the file suffix ‘h5’ to specify ‘hdf5’ files written from DataFrames using pandas. They have a different structure than ‘hdf5’ files written with h5py or astropy.table.
- read_pq_to_dataframe(filepath: str, columns: List[str] | None = None, read_slice: slice | int | None = None, **kwargs)
Reads a pandas.DataFrame object from a parquet file.
- Parameters:
filepath (str) – Path to input file
columns (list (str) or None) – Names of the columns to read, None will read all the columns
read_slice (slice or int or None) – Slice of data to read
**kwargs (additional arguments to pass to the native file reader)
- Returns:
df – The data frame
- Return type:
pandas.DataFrame
- read_pq_to_dataframes(filepath: str, keys: List[str] | None = None, allow_missing_keys: bool = False, columns: List[str] | Mapping | None = None, slice_dict: dict[str, slice | int] | None = None, **kwargs) -> Mapping
Reads pandas.DataFrame objects from a parquet file.
- Parameters:
filepath (str) – Path to input file
keys (list) – Keys for the input objects. Used to complete filepaths
allow_missing_keys (bool) – If False will raise FileNotFoundError if a key is missing
columns (dict of list (str), list (str), or None) – Names of the columns to read.
- If a dictionary, keys are the table keys and values are lists of string column names. For each keyed table, only the columns in the value list will be loaded. If the key is not found, all columns will be loaded.
- If a list, only the columns in the list will be loaded.
- If None, all columns will be read.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
**kwargs (additional arguments to pass to the native file reader)
- Returns:
tables – Keys will be taken from keys
- Return type:
OrderedDict of pandas.DataFrame
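The three accepted forms of the columns argument (dictionary, list, or None) can be resolved per table key as in the sketch below. columns_for_key is a hypothetical helper written for illustration, following the rules in the parameter description above; it is not the library's code. A return value of None stands for "load all columns".

```python
def columns_for_key(columns, key):
    """Return the column names to load for one keyed table, or None for all."""
    if columns is None:
        return None  # read every column
    if isinstance(columns, dict):
        # A key missing from the dictionary means all columns are loaded.
        return columns.get(key)
    return list(columns)  # the same list applies to every table
```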
- read_pq_to_dict(filepath: str, columns: List[str] | None = None, read_slice: slice | int | None = None, **kwargs) -> Mapping
Open a parquet file and return an OrderedDict of numpy.array objects
- Parameters:
filepath (str) – Path to input file
columns (list (str) or None) – Names of the columns to read, None will read all the columns
read_slice (slice or int or None) – Slice of data to read
**kwargs (additional arguments to pass to the native file reader)
- Returns:
tab – The data
- Return type:
OrderedDict (str : numpy.array)
- read_H5_to_dict(filepath: str, groupname: str | None = None, read_slice: slice | int | None = None) -> Mapping
Opens an h5 file and returns an OrderedDict of numpy.array objects.
- Parameters:
filepath (str) – Path to input file
groupname (str or None) – The name of the group with the data
read_slice (slice or int or None) – Slice of data to read
- Returns:
tab – The data
- Return type:
OrderedDict (str : numpy.array)
Notes
We are using the file suffix ‘h5’ to specify ‘hdf5’ files written from DataFrames using pandas. They have a different structure than ‘hdf5’ files written with h5py or astropy.table.
- read_HDF5_to_dict(filepath: str, groupname: str | None = None, read_slice: slice | int | None = None) -> Mapping
Reads in h5py hdf5 data and returns a dictionary of all of the keys.
- Parameters:
filepath (str) – Path to input file
groupname (str or None) – The groupname for the data
read_slice (slice or int or None) – Slice of data to read
- Returns:
tab – The data
- Return type:
OrderedDict (str : numpy.array)
Notes
We are using the file suffix ‘hdf5’ to specify ‘hdf5’ files written with h5py or astropy.table. They have a different structure than ‘h5’ files written with pandas.
- read_HDF5_to_table(filepath: str, key: str | None = None, read_slice: slice | int | None = None)
Reads pyarrow.Table objects from an hdf5 file.
- Parameters:
filepath (str) – Path to input file
key (str or None) – The key in the hdf5 file
read_slice (slice or int or None) – Slice of data to read
- Returns:
table – The table
- Return type:
pyarrow.Table
- read_HDF5_to_tables(filepath: str, keys: List[str] | None = None, slice_dict: dict[str, slice | int] | None = None) -> Mapping
Opens an HDF5 file and returns an OrderedDict of pyarrow.Table objects.
- Parameters:
filepath (str) – Path to input file
keys (list or None) – Which tables to read
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
- Returns:
tab – The data
- Return type:
OrderedDict (str : pyarrow.Table)
- read_pq_to_table(filepath: str, columns: List[str] | None = None, read_slice: slice | int | None = None, **kwargs)
Reads a pyarrow.Table object from a parquet file.
- Parameters:
filepath (str) – Path to input file
columns (list (str) or None) – Names of the columns to read, None will read all the columns
read_slice (slice or int or None) – Slice of data to read
**kwargs (additional arguments to pass to the native file reader)
- Returns:
table – The table
- Return type:
pyarrow.Table
- read_pq_to_tables(filepath: str, keys: List[str] | None = None, allow_missing_keys: bool = False, columns: List[str] | Mapping | None = None, slice_dict: dict[str, slice | int] | None = None, **kwargs) -> Mapping
Reads pyarrow.Table objects from a parquet file into an OrderedDict.
- Parameters:
filepath (str) – Path to input file
keys (list) – Keys for the input objects. Used to complete filepaths
allow_missing_keys (bool) – If False will raise FileNotFoundError if a key is missing. By default False.
columns (dict of list (str), list (str), or None) – Names of the columns to read.
- If a dictionary, keys are the table keys and values are lists of string column names. For each keyed table, only the columns in the value list will be loaded. If the key is not found, all columns will be loaded.
- If a list, only the columns in the list will be loaded.
- If None, all columns will be read.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
**kwargs (additional arguments to pass to the native file reader)
- Returns:
tables – Keys will be taken from keys
- Return type:
OrderedDict of pyarrow.Table
- try_parse(val) -> numpy.array | list | dict | str
Tries to parse a string into a numpy array or a JSON object. This function attempts to convert a string representation of a numpy array or a JSON object into the corresponding object.
- Parameters:
val (str) – The string to parse
- Returns:
val – If the string is a valid numpy array or JSON object, it returns the parsed object. If parsing fails, it returns the original string.
- Return type:
numpy.array or list or dict or str
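The "parse if possible, otherwise return the input unchanged" behavior can be sketched with the standard library alone. The example below covers only the JSON half of what try_parse is described as doing; the numpy-array handling is left out, and try_parse_json is a hypothetical name, not the library function.

```python
import json


def try_parse_json(val):
    """Return the parsed JSON value, or the original input if parsing fails."""
    try:
        return json.loads(val)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return val
```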
- read_csv_to_dataframes(filepath: str, keys: List[str] | None = None, allow_missing_keys: bool = False, columns: List[str] | Mapping | None = None, slice_dict: dict[str, slice | int] | None = None, **kwargs) -> Mapping
Reads pandas.DataFrame objects from a csv file into an OrderedDict.
- Parameters:
filepath (str) – Path to input file
keys (list) – Keys for the input objects. Used to complete filepaths
allow_missing_keys (bool) – If False will raise FileNotFoundError if a key is missing. By default False.
columns (dict of list (str), list (str), or None) – Names of the columns to read.
- If a dictionary, keys are the table keys and values are lists of string column names. For each keyed table, only the columns in the value list will be loaded. If the key is not found, all columns will be loaded.
- If a list, only the columns in the list will be loaded.
- If None, all columns will be read.
slice_dict (dict[str, slice | int] or None) – If provided, specifies which slices to read from which tables
**kwargs (additional arguments to pass to the native file reader)
- Returns:
tables – Keys will be taken from keys
- Return type:
OrderedDict of pandas.DataFrame
- read_native_error_message(filepath: str, fType: int, fmt: str | None, keys: List[str] | None, allow_missing_keys: bool, **kwargs) -> str
Generates an error message to be printed out if a file cannot be read in by read_native.
- Parameters:
filepath (str) – Full path of the file to load
fmt (str or None) – File format, if None it will be taken from the file extension.
keys (list or None) – This argument is required for reading multiple associated parquet files. The keys should be the unique identifiers for each dataset or file.
allow_missing_keys (bool, by default False) – If False will raise FileNotFoundError if a key is missing from the given file.
**kwargs (additional arguments to pass to the native file reader)
- Returns:
The error message string.
- Return type:
str