tables_io.io_utils.read

IO Read Functions for tables_io

Functions

`read`(filepath[, tType, fmt, keys, allow_missing_keys, ...])	Reads in a given file to either a Table-like format if there is one table within the file, or a TableDict-like format if there are multiple tables or files.
`read_native`(filepath[, fmt, keys, allow_missing_keys, ...])	Reads in a file to its corresponding default tabular format.
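`io_open`(filepath[, fmt])	Returns the file object. This allows you to open large files without reading the whole file into memory.
`check_columns`(filepath, columns_to_check[, fmt, ...])	Reads the column names from the file and ensures that it contains at least the columns specified in a provided list.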
`read_fits_to_ap_tables`(→ Mapping)	Reads astropy.table.Table objects into an OrderedDict TableDict-like object from a FITS file.
`read_fits_to_recarrays`(→ Mapping)	Reads np.recarray objects into an OrderedDict TableDict-like object from a FITS file.
`read_HDF5_to_ap_tables`(→ Mapping)	Reads astropy.table.Table objects into an OrderedDict TableDict-like object from an hdf5 file.
`read_HDF5_group`(filepath[, groupname, read_slice])	Read and return the requested group and file object from an hdf5 file. If no group is provided, returns the h5py.File object twice.
`read_HDF5_group_to_dict`(hg[, start, end])	Reads numpy.array objects from an open hdf5 file object. If given a dataset, returns a numpy.array of that dataset.
`read_HDF5_group_names`(→ List[str])	Read and return the list of group names from one level of an hdf5 file.
`read_HDF5_to_dicts`(→ Mapping)	Reads numpy.array objects into an OrderedDict from an hdf5 file. If a list of keys is given, will only read those specific datasets.
`read_HDF5_dataset_to_array`(→ numpy.array)	Reads all or part of an hdf5 dataset into a numpy.array.
`read_H5_to_dataframe`(filepath[, key, read_slice])	Reads pandas.DataFrame objects from an 'h5' file (a pandas hdf5 file).
`read_H5_to_dataframes`(→ Mapping)	Opens an h5 (pandas hdf5) file and returns an OrderedDict of pandas.DataFrame objects.
`read_pq_to_dataframe`(filepath[, columns, read_slice])	Reads a pandas.DataFrame object from a parquet file.
`read_pq_to_dataframes`(→ Mapping)	Reads pandas.DataFrame objects from a parquet file.
`read_pq_to_dict`(→ Mapping)	Open a parquet file and return an OrderedDict of numpy.array objects
`read_H5_to_dict`(→ Mapping)	Opens an h5 file and returns an OrderedDict of numpy.array objects.
`read_HDF5_to_dict`(→ Mapping)	Read in h5py hdf5 data, return a dictionary of all of the keys
`read_HDF5_to_table`(filepath[, key, read_slice])	Reads pyarrow.Table objects from an hdf5 file.
`read_HDF5_to_tables`(→ Mapping)	Opens an HDF5 file and returns an OrderedDict of pyarrow.Table objects.
`read_pq_to_table`(filepath[, columns, read_slice])	Reads a pyarrow.Table object from a parquet file.
`read_pq_to_tables`(→ Mapping)	Reads pyarrow.Table objects from a parquet file into an OrderedDict.
`try_parse`(→ Union[numpy.array, list, dict, str])	Tries to parse a string into a numpy array or a JSON object.
`read_csv_to_dataframes`(→ Mapping)	Reads pandas.DataFrame objects from a csv file into an OrderedDict.
`read_native_error_message`(→ str)	Generates an error message to be printed out if a file cannot be read in by read_native.

Module Contents

read(filepath[, tType, fmt, keys, allow_missing_keys, ...])

Reads in a given file to either a Table-like format if there is one table within the file, or a TableDict-like format if there are multiple tables or files. Uses read_native() to read the file.

The TableDict-like format is an OrderedDict of Table-like objects. The Table-like objects currently supported are: astropyTable, numpyRecarray, numpyDict (dict of numpy arrays), pandasDataFrame, and pyarrowTable.

If given just the filepath, the function will read any tables in the file to its default Table-like format in memory. If given a specific tabular type, the function will read in the file to the default type and then convert to the requested type.

The keys argument is required when reading in multi-dataset parquet files, to specify which dataset files to read in. Otherwise, the only required argument is the filepath.

Accepted tabular types:

Format string	Format integer
“astropyTable”	0
“numpyDict”	1
“numpyRecarray”	2
“pandasDataFrame”	3
“pyarrowTable”	4
“jsonString”	5
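
As a minimal sketch of how the two columns above relate, the mapping can be written as a plain dictionary. The names `TABULAR_TYPES` and `normalize_ttype` are hypothetical helpers for illustration and are not part of tables_io; only the string/integer pairs come from the table.

```python
# Hypothetical lookup built from the table above; not the tables_io implementation.
TABULAR_TYPES = {
    "astropyTable": 0,
    "numpyDict": 1,
    "numpyRecarray": 2,
    "pandasDataFrame": 3,
    "pyarrowTable": 4,
    "jsonString": 5,
}


def normalize_ttype(tType):
    """Return the integer code for a tabular type given as str, int or None."""
    if tType is None:
        return None  # the caller would fall back to the file's default type
    if isinstance(tType, str):
        if tType not in TABULAR_TYPES:
            raise KeyError(f"Unknown tabular type: {tType!r}")
        return TABULAR_TYPES[tType]
    if tType in TABULAR_TYPES.values():
        return tType
    raise KeyError(f"Unknown tabular type: {tType!r}")
```

With this sketch, `normalize_ttype("pandasDataFrame")` and `normalize_ttype(3)` resolve to the same code.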

Parameters:

filepath (str) – Full path to the file to load
tType (int, str or None) – Table type, if None the default table type will be used.
fmt (str or None) – File format, if None it will be taken from the file extension.
keys (list or None) – This argument is required for reading multiple associated parquet files. The keys should be the unique identifiers for each dataset or file.
allow_missing_keys (bool, by default False) – If False, raises FileNotFoundError if a key is missing from the given file.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.
kwargs – Additional arguments to pass to the native file reader

Returns:

data – The table read from the file, or the tables keyed by name if there are several

Return type:

Table-like or OrderedDict ( str -> Table-like )

Example

For a single Table-like object, we can read it in as follows:

>>> import tables_io
>>> df = tables_io.read('filename.h5')
>>> print(df)
   col1  col2
0     1     3
1     2     4

Notice that it has been automatically read in as the default tabular type for h5 files, a pandasDataFrame.

For a TableDict-like object, we read it in as follows:

>>> table_dict = tables_io.read('filename.hdf5', tType='astropyTable')
>>> table_dict
OrderedDict({'tab_1': <Table length=2>
  x     y
int64 int64
----- -----
    2     1
    4     3, 'tab_2': <Table length=2>
  a     b
int64 int64
----- -----
    5     3
    7     4})

Notice that the resulting OrderedDict has astropyTable objects as the values.
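
The difference between the two examples above (a bare table versus an OrderedDict of tables) can be sketched in plain Python. This is only an illustration of the behaviour described for read, not the library's code; `unwrap_single_table` is a hypothetical name, and dicts of lists stand in for Table-like objects.

```python
from collections import OrderedDict


def unwrap_single_table(tables):
    """Return the lone table if there is exactly one, else the whole mapping."""
    if len(tables) == 1:
        return next(iter(tables.values()))
    return tables


# Plain dicts of lists stand in for Table-like objects here.
one = OrderedDict(data={"col1": [1, 2], "col2": [3, 4]})
many = OrderedDict(tab_1={"x": [2, 4]}, tab_2={"a": [5, 7]})

single_result = unwrap_single_table(one)   # the table itself
multi_result = unwrap_single_table(many)   # still an OrderedDict of tables
```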

read_native(filepath[, fmt, keys, allow_missing_keys, ...])

Reads in a file to its corresponding default tabular format.

The format of the file is either given by fmt, or determined based on the suffix of the file path. This determines what tabular format the file is read in as. In all cases, the data from the file is returned as an OrderedDict or TableDict-like object, with str keys and Table-like values. The Table-like values can be astropyTable, numpyRecarray, numpyDict (dict of numpy arrays), pandasDataFrame, and pyarrowTable.

Parameters:

filepath (str) – Full path of the file to load
fmt (str or None) – File format, if None it will be taken from the file extension.
keys (list or None) – This argument is required for reading multiple associated parquet files. The keys should be the unique identifiers for each dataset or file.
allow_missing_keys (bool, by default False) – If False, raises FileNotFoundError if a key is missing from the given file.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.
kwargs – Additional arguments to pass to the native file reader

Returns:

data – The data

Return type:

OrderedDict ( str -> Table-like )

Example

Reading in a file that is in NUMPY_HDF5 format:

>>> import tables_io
>>> tab = tables_io.read_native('filename.hdf5')
>>> print(tab)
OrderedDict({'tab_1': OrderedDict({'col_1': array([0., 2.]), 'col_2': array([2., 3.])}),
'tab_2': OrderedDict({'col_a': array([1., 1.]), 'col_b': array([3., 3.])})})
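
The slice_dict parameter accepts several shapes (a per-table dict, a single slice or int, or None). A rough sketch of how one value per table might be selected is shown below. `slice_for_table` is a hypothetical helper, and the fallback for tables missing from the dict is an assumption, not documented behaviour; how an int is interpreted is left to the reader function.

```python
def slice_for_table(slice_dict, name):
    """Return the slice spec (slice, int or None) that applies to one table."""
    if isinstance(slice_dict, dict):
        # Assumption: a table missing from the dict is read in full (None).
        return slice_dict.get(name)
    # A bare slice, int or None applies to every table alike.
    return slice_dict


per_table = {"tab_1": slice(0, 10)}
spec_tab_1 = slice_for_table(per_table, "tab_1")
spec_tab_2 = slice_for_table(per_table, "tab_2")
spec_shared = slice_for_table(slice(5), "tab_2")
```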

io_open(filepath: str, fmt: str | None = None, **kwargs)

Returns the file object. This allows you to open large files without reading the whole file into memory.

It opens the file object with different packages depending on the file type. It uses astropy to open FITS files (astropy.io.fits.open()), h5py for any HDF5 files (h5py.File()), or pyarrow parquet for any parquet files (pyarrow.parquet.ParquetFile()). You can specify which file type you are supplying via the fmt argument, or it will automatically determine the file type from its suffix.

If the given file is not one of the supported types, it will raise a TypeError.

Parameters:

filepath (str) – The path to the file to load.
fmt (str or None) – The file format, if None it will be taken from the file extension.

Return type:

File object. One of pyarrow.parquet.ParquetFile, h5py.File or astropy.io.fits.HDUList.

Example

For example, to read in a sample fits file:

>>> import tables_io
>>> hdul = tables_io.io_open("./data/test.fits", "fits")
>>> hdul.info()
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       4   ()
  1  DF            1 BinTableHDU     37   10R x 14C   [K, E, E, E, E, E, E, E, E, E, E, E, E, D]
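
The suffix-based dispatch described for io_open can be sketched as follows. This is an illustration under stated assumptions: `SUFFIX_TO_OPENER` and `pick_opener` are hypothetical names, the suffix table is a small made-up subset, and the function only names the opener rather than calling it. The suffixes tables_io actually recognizes may differ.

```python
from pathlib import Path

# Hypothetical suffix table for illustration only.
SUFFIX_TO_OPENER = {
    "fits": "astropy.io.fits.open",
    "hdf5": "h5py.File",
    "pq": "pyarrow.parquet.ParquetFile",
}


def pick_opener(filepath, fmt=None):
    """Name the opener for a file, using fmt if given, else the file suffix."""
    key = fmt if fmt is not None else Path(filepath).suffix.lstrip(".")
    try:
        return SUFFIX_TO_OPENER[key]
    except KeyError:
        # Mirrors the documented behaviour: unsupported types raise TypeError.
        raise TypeError(f"Unsupported file type {key!r} for {filepath}") from None
```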

check_columns(filepath: str, columns_to_check: List[str], fmt: str | None = None, parent_groupname: str | None = None, **kwargs)

Reads the column names from the file and ensures that it contains at least the columns specified in a provided list. If not, an error will be raised.

For FITS files, columns across all extensions will be checked at one time.
For HDF5 files, only columns within a single level of the specified parent_groupname will be checked.

Note: If more columns are available in the file than specified in the list, the file will still pass the check.

Parameters:

filepath (str) – File name for the file to read. If there’s no suffix, it will be applied based on the object type.
columns_to_check (list) – A list of columns to be compared with the data
fmt (str or None) – The input file format. If None, this will use io_open.
parent_groupname (str or None) – For hdf5 files, the groupname for the data
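
The check itself amounts to a subset test: every requested column must be present, and extra columns in the file are ignored. A minimal sketch follows, assuming the column names have already been read from the file. `assert_has_columns` is a hypothetical helper, and the choice of KeyError is an assumption, since the error type is not stated above.

```python
def assert_has_columns(available, columns_to_check):
    """Raise if any required column is absent; extra columns are fine."""
    available_set = set(available)
    missing = [col for col in columns_to_check if col not in available_set]
    if missing:
        # KeyError is an assumption; the real function may raise something else.
        raise KeyError(f"Missing columns: {missing}")


file_columns = ["id", "ra", "dec", "mag_g"]
assert_has_columns(file_columns, ["ra", "dec"])  # passes: extras are allowed
```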

read_fits_to_ap_tables(...) → Mapping

Reads astropy.table.Table objects into an OrderedDict TableDict-like object from a FITS file. If a list of keys is given, will read only those tables.

Parameters:

filepath (str) – Path to input file
keys (list or None) – A list of which tables to read, in lower case.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.

Returns:

tables – Keys will be HDU names, values will be tables

Return type:

OrderedDict of astropy.table.Table

read_fits_to_recarrays(...) → Mapping

Reads np.recarray objects into an OrderedDict TableDict-like object from a FITS file. If a list of keys is given, will read only those tables.

Parameters:

filepath (str) – Path to input file
keys (list or None) – A list of which HDU names to read, in lower case.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.

Returns:

tables – Keys will be HDU names, values will be tables

Return type:

OrderedDict of np.recarray

read_HDF5_to_ap_tables(...) → Mapping

Reads astropy.table.Table objects into an OrderedDict TableDict-like object from an hdf5 file.

Parameters:

filepath (str) – Path to input file
keys (list or None) – A list of which datasets to read in.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.

Returns:

tables – Keys will be ‘paths’, values will be tables

Return type:

OrderedDict of astropy.table.Table

read_HDF5_group(filepath: str, groupname: str | None = None, read_slice: slice | int | None = None)

Read and return the requested group and file object from an hdf5 file. If no group is provided, returns the h5py.File object twice.

Parameters:

filepath (str) – File in question
groupname (str or None) – The name or path to the desired group.
read_slice (slice or int or None) – Slice of data to read

Returns:

grp (h5py.Group or h5py.File) – The requested group
infp (h5py.File) – The input file (returned so that the user can explicitly close the file)

read_HDF5_group_to_dict(hg, start: int | None = None, end: int | None = None)

Reads numpy.array objects from an open hdf5 file object. If given a dataset, returns a numpy.array of that dataset. If given a group, it will read numpy.array objects into an OrderedDict for all of the keys in that group. If start and end are provided, it will only read in the given slice [start:end] of all the datasets.

Parameters:

hg (hdf5 object) – The hdf5 object to read in, either a dataset or a group.
start (int or None) – Starting row of dataset(s) to read.
end (int or None) – Ending row of dataset(s) to read.

Returns:

tables – Keys will be ‘paths’, values will be arrays in the case of an OrderedDict.

Return type:

OrderedDict of numpy.array or a numpy.array
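
The effect of start and end on a group can be pictured as the same row slice applied to every dataset. The sketch below uses plain lists in place of hdf5 datasets, since both support `[start:end]` slicing; `slice_group` is a hypothetical name and this is not the library's implementation.

```python
from collections import OrderedDict


def slice_group(group, start=None, end=None):
    """Apply the same [start:end] row slice to every dataset in a group."""
    return OrderedDict((name, data[start:end]) for name, data in group.items())


# Lists stand in for hdf5 datasets here.
group = {"col_1": [0.0, 2.0, 4.0, 6.0], "col_2": [2.0, 3.0, 5.0, 7.0]}
rows_1_to_3 = slice_group(group, start=1, end=3)
everything = slice_group(group)  # start=None, end=None reads all rows
```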

read_HDF5_group_names(filepath: str, parent_groupname: str | None = None) → List[str]

Read and return the list of group names from one level of an hdf5 file.

Parameters:

filepath (str) – File in question
parent_groupname (str or None) – For hdf5 files, the parent groupname. All group names under this will be returned. If None, return the top level group names.

Returns:

names – The names of the groups in the file

Return type:

list of str

read_HDF5_to_dicts(...) → Mapping

Reads numpy.array objects into an OrderedDict from an hdf5 file. If a list of keys is given, will only read those specific datasets.

Parameters:

filepath (str) – Path to input file
keys (list or None) – A list of which tables to read from the file.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.

Returns:

dicts – The data

Return type:

OrderedDict (str : OrderedDict (str : numpy.array))

read_HDF5_dataset_to_array(dataset, start: int | None = None, end: int | None = None) → numpy.array

Reads all or part of an hdf5 dataset into a numpy.array.

Parameters:

dataset (h5py.Dataset) – The input dataset
start (int or None) – Starting row
end (int or None) – Ending row

Returns:

out – The requested rows of the dataset, as something that pandas can handle

Return type:

numpy.array

read_H5_to_dataframe(filepath: str, key: str | None = None, read_slice: slice | int | None = None)

Reads pandas.DataFrame objects from an ‘h5’ file (a pandas hdf5 file).

Parameters:

filepath (str) – Path to input file
key (str or None) – The key in the hdf5 file
read_slice (slice or int or None) – Slice of data to read

Returns:

df – The dataframe

Return type:

pandas.DataFrame

read_H5_to_dataframes(...) → Mapping

Opens an h5 (pandas hdf5) file and returns an OrderedDict of pandas.DataFrame objects.

Parameters:

filepath (str) – Path to input file
keys (list or None) – A list of which tables to read.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.

Returns:

tab – The data

Return type:

OrderedDict (str : pandas.DataFrame)

Notes

We are using the file suffix ‘h5’ to specify ‘hdf5’ files written from DataFrames using pandas. They have a different structure than ‘hdf5’ files written with h5py or astropy.table.

read_pq_to_dataframe(filepath: str, columns: List[str] | None = None, read_slice: slice | int | None = None, **kwargs)

Reads a pandas.DataFrame object from a parquet file.

Parameters:

filepath (str) – Path to input file
columns (list (str) or None) – Names of the columns to read, None will read all the columns
read_slice (slice or int or None) – Slice of data to read
**kwargs – Additional arguments to pass to the native file reader

Returns:

df – The data frame

Return type:

pandas.DataFrame

read_pq_to_dataframes(...) → Mapping

Reads pandas.DataFrame objects from a parquet file.

Parameters:

filepath (str) – Path to input file
keys (list) – Keys for the input objects. Used to complete filepaths
allow_missing_keys (bool) – If False will raise FileNotFoundError if a key is missing
columns (dict of list (str), list (str), or None) – Names of the columns to read:
- If a dictionary, its keys are the table keys and its values are lists of column names. For each keyed table, only the columns in its list are loaded; if a key is not found in the dictionary, all columns of that table are loaded.
- If a list, only the columns in the list are loaded.
- If None, all columns are read.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.
**kwargs – Additional arguments to pass to the native file reader

Returns:

tables – Keys will be taken from keys

Return type:

OrderedDict of pandas.DataFrame
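
The three accepted forms of the columns argument described above can be resolved per key roughly as follows. `columns_for_key` is a hypothetical helper written for illustration; returning None stands for "load all columns".

```python
def columns_for_key(columns, key):
    """Pick the column list to load for one keyed table (None means all)."""
    if columns is None:
        return None
    if isinstance(columns, dict):
        # A key absent from the dictionary falls back to loading all columns.
        return columns.get(key)
    # A plain list applies to every keyed table.
    return list(columns)


per_key = {"train": ["ra", "dec"]}
train_columns = columns_for_key(per_key, "train")
test_columns = columns_for_key(per_key, "test")
```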

read_pq_to_dict(filepath: str, columns: List[str] | None = None, read_slice: slice | int | None = None, **kwargs) → Mapping

Open a parquet file and return an OrderedDict of numpy.array objects

Parameters:

filepath (str) – Path to input file
columns (list (str) or None) – Names of the columns to read, None will read all the columns
read_slice (slice or int or None) – Slice of data to read
**kwargs – Additional arguments to pass to the native file reader

Returns:

tab – The data

Return type:

OrderedDict (str : numpy.array)

read_H5_to_dict(filepath: str, groupname: str | None = None, read_slice: slice | int | None = None) → Mapping

Opens an h5 file and returns an OrderedDict of numpy.array objects.

Parameters:

filepath (str) – Path to input file
groupname (str or None) – The name of the group with the data
read_slice (slice or int or None) – Slice of data to read

Returns:

tab – The data

Return type:

OrderedDict (str : numpy.array)

Notes

We are using the file suffix ‘h5’ to specify ‘hdf5’ files written from DataFrames using pandas. They have a different structure than ‘hdf5’ files written with h5py or astropy.table.

read_HDF5_to_dict(filepath: str, groupname: str | None = None, read_slice: slice | int | None = None) → Mapping

Read in h5py hdf5 data, return a dictionary of all of the keys

Parameters:

filepath (str) – Path to input file
groupname (str or None) – The groupname for the data
read_slice (slice or int or None) – Slice of data to read

Returns:

tab – The data

Return type:

OrderedDict (str : numpy.array)

Notes

We are using the file suffix ‘hdf5’ to specify ‘hdf5’ files written with h5py or astropy.table. They have a different structure than ‘h5’ files written with pandas.

read_HDF5_to_table(filepath: str, key: str | None = None, read_slice: slice | int | None = None)

Reads pyarrow.Table objects from an hdf5 file.

Parameters:

filepath (str) – Path to input file
key (str or None) – The key in the hdf5 file
read_slice (slice or int or None) – Slice of data to read

Returns:

table – The table

Return type:

pyarrow.Table

read_HDF5_to_tables(...) → Mapping

Opens an HDF5 file and returns an OrderedDict of pyarrow.Table objects.

Parameters:

filepath (str) – Path to input file
keys (list or None) – Which tables to read
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.

Returns:

tab – The data

Return type:

OrderedDict (str : pyarrow.Table)

read_pq_to_table(filepath: str, columns: List[str] | None = None, read_slice: slice | int | None = None, **kwargs)

Reads a pyarrow.Table object from a parquet file.

Parameters:

filepath (str) – Path to input file
columns (list (str) or None) – Names of the columns to read, None will read all the columns
read_slice (slice or int or None) – Slice of data to read
**kwargs – Additional arguments to pass to the native file reader

Returns:

table – The table

Return type:

pyarrow.Table

read_pq_to_tables(...) → Mapping

Reads pyarrow.Table objects from a parquet file into an OrderedDict.

Parameters:

filepath (str) – Path to input file
keys (list) – Keys for the input objects. Used to complete filepaths
allow_missing_keys (bool) – If False will raise FileNotFoundError if a key is missing. By default False.
columns (dict of list (str), list (str), or None) – Names of the columns to read:
- If a dictionary, its keys are the table keys and its values are lists of column names. For each keyed table, only the columns in its list are loaded; if a key is not found in the dictionary, all columns of that table are loaded.
- If a list, only the columns in the list are loaded.
- If None, all columns are read.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.
**kwargs – Additional arguments to pass to the native file reader

Returns:

tables – Keys will be taken from keys

Return type:

OrderedDict of pyarrow.Table

try_parse(val) → numpy.array | list | dict | str

Tries to parse a string into a numpy array or a JSON object. This function attempts to convert a string representation of a numpy array or a JSON object into the corresponding object.

Parameters:

val (str) – The string to parse

Returns:

val – If the string is a valid numpy array or JSON object, it returns the parsed object. If parsing fails, it returns the original string.

Return type:

numpy.array or list or dict or str
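
The "parse or return the original string" pattern can be sketched with the standard library alone. This covers only the JSON half of the behaviour; the numpy-array branch is omitted, and `parse_json_or_keep` is a hypothetical name rather than the library function.

```python
import json


def parse_json_or_keep(val):
    """Return the decoded JSON value, or the original input if decoding fails."""
    try:
        return json.loads(val)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return val


parsed_dict = parse_json_or_keep('{"a": [1, 2]}')
parsed_list = parse_json_or_keep("[1, 2, 3]")
untouched = parse_json_or_keep("not json")
```

Note that this simplified version also decodes bare JSON scalars such as `"42"`, which the documented return types do not mention.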

read_csv_to_dataframes(...) → Mapping

Reads pandas.DataFrame objects from a csv file into an OrderedDict.

Parameters:

filepath (str) – Path to input file
keys (list) – Keys for the input objects. Used to complete filepaths
allow_missing_keys (bool) – If False will raise FileNotFoundError if a key is missing. By default False.
columns (dict of list (str), list (str), or None) – Names of the columns to read:
- If a dictionary, its keys are the table keys and its values are lists of column names. For each keyed table, only the columns in its list are loaded; if a key is not found in the dictionary, all columns of that table are loaded.
- If a list, only the columns in the list are loaded.
- If None, all columns are read.
slice_dict (dict[str, slice | int] or slice or int or None) – If provided, specifies which slices to read from which tables.
**kwargs – Additional arguments to pass to the native file reader

Returns:

tables – Keys will be taken from keys

Return type:

OrderedDict of pandas.DataFrame

read_native_error_message(filepath: str, fType: int, fmt: str | None, keys: List[str] | None, allow_missing_keys: bool, **kwargs) → str

Generates an error message to be printed out if a file cannot be read in by read_native.

Parameters:

filepath (str) – Full path of the file to load
fmt (str or None) – File format, if None it will be taken from the file extension.
keys (list or None) – This argument is required for reading multiple associated parquet files. The keys should be the unique identifiers for each dataset or file.
allow_missing_keys (bool, by default False) – If False, raises FileNotFoundError if a key is missing from the given file.
**kwargs – Additional arguments to pass to the native file reader

Returns:

The error message string.

Return type:

str