API Reference

A simple, intuitive, pandas-based database.

Perfect for handling data such as time series, images, or any Python objects alongside their metadata. This tool encapsulates a pandas DataFrame containing metadata and Python objects. It provides an intuitive data and metadata retrieval syntax through keyword arguments.

The Database class initializes a database with a pandas DataFrame containing metadata and various data types such as time series or images. Use Database.add_data_field() to incorporate a new dictionary mapping metadata columns in the DataFrame to arbitrary Python objects. Utilize Database.__call__() for metadata retrieval, specifying criteria in Python keyword arguments.

class datanest.Mapping(df: DataFrame)

Create a dictionary map between any two columns of a dataframe.

__call__(left_col_name: str, right_col_name: str, row_selector: Callable[[Any, Any], bool] | None = None) → dict: Call self as a function.
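As a rough illustration of what a dictionary map between two columns means, the sketch below builds one from plain Python rows instead of a DataFrame. It is not the datanest implementation: the helper column_map is hypothetical, and it assumes that row_selector receives the left and right values of each row.

```python
from typing import Any, Callable, Optional

# Stand-in for DataFrame rows; the column names are illustrative.
rows = [
    {"participant_id": 1, "age": 34.0},
    {"participant_id": 2, "age": 52.5},
    {"participant_id": 3, "age": 58.0},
]


def column_map(
    rows: list,
    left_col_name: str,
    right_col_name: str,
    row_selector: Optional[Callable[[Any, Any], bool]] = None,
) -> dict:
    """Map left-column values to right-column values, optionally filtering rows.

    Assumption: row_selector is called with (left_value, right_value).
    """
    result = {}
    for row in rows:
        left, right = row[left_col_name], row[right_col_name]
        if row_selector is None or row_selector(left, right):
            result[left] = right
    return result


id_to_age = column_map(rows, "participant_id", "age")
over_50 = column_map(rows, "participant_id", "age", lambda pid, age: age > 50)
```

Here id_to_age covers every row, while over_50 keeps only the rows accepted by the selector.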

exception datanest.ReservedSuffixCollisionWarning

Issued when a DataFrame column name collides with a reserved query suffix.

See Database.__call__() for the suffix-reservation rule.

class datanest.Database(data: DataFrame | str | Path, *, on_reserved_suffix: str = 'warn')

Manipulate data (signals/images/arbitrary Python objects) and their metadata stored in a CSV/pandas DataFrame.

A Database instance encapsulates a pandas DataFrame, and is designed for use with dictionary-like storage of data structures, such as time series data. It facilitates working with data (e.g., time series/images, etc.) alongside the metadata stored in a pandas DataFrame. This tool provides an intuitive and flexible way to retrieve relevant data from data structures based on metadata organized in the DataFrame. It was developed to address performance issues when storing arbitrary data objects, including NumPy arrays, in a pandas DataFrame.

Parameters:

data (Union[pd.DataFrame, str, Path]) –
- (str) Path to a CSV or Excel file. It is read as a pandas DataFrame. Make sure openpyxl is installed when working with Excel files.
- (pd.DataFrame) Pass an already loaded pandas DataFrame.
on_reserved_suffix (str, keyword-only) – How to react when a column name collides with a reserved query suffix (_lim / _has / _any); see Database.__call__(). A collision exists only when both <base> and <base><suffix> are present as columns, in which case kwargs like db(<base><suffix>=...) are ambiguous between the suffix predicate and literal equality on the suffixed column. One of:
- "warn" (default): issue a ReservedSuffixCollisionWarning.
- "raise": raise ValueError.
- "ignore": silence the check.
- "rename": auto-rename colliding columns via the mapping _lim → _limits, _has → _contains, _any → _options; raises if a rename target already exists.

Variables:

data_fields (list) – Names of data dictionaries added using the Database.add_data_field() method.
data_key_names (dict) – Mapping from the names of the data fields contained in Database.data_fields to the names of columns. Each data field, e.g., heart_rate_data, maps a column from the DataFrame data, e.g., participant_id, to the time series containing heart rate values. data_key_names stores {'heart_rate_data': 'participant_id'}, meaning that heart_rate_data is indexed using participant_id.

Examples

import datanest
# Load metadata from CSV file with columns: participant_id (int), age (float), surgery_performed (bool), notes (str)
db = datanest.Database(r'C:\data\participant_data.csv')
# Alternatively, generate the example database
db = datanest.get_example_database()
# Add heart rate data to the database, indexed by participant_id
db.add_data_field('heart_rate', load_heart_rate_data(), 'participant_id')
# Retrieve heart rate time series data for participant 3
db.heart_rate(participant_id=3)
# Retrieve heart rate time series data from participants aged 50 to 60
db.heart_rate(age_lim=(50, 60))
# Retrieve heart rate time series data from participants where the notes string contains the word interesting
db.heart_rate(notes_has='interesting')

__call__(*args, **kwargs) → DataFrame

Select rows from the metadata in the DataFrame. It provides an intuitive syntax based on Python keyword arguments (kwargs).

Keywords in kwargs can be any column name in the underlying DataFrame. Special keywords have the format <column_name>_<suffix>, where the suffix can be 'any', 'lim', or 'has'.

The any suffix is useful to specify OR conditions. For example, participant_id_any=(1,3) retrieves rows whose participant_id matches either 1 or 3.
The lim suffix is useful to specify limits. For example, age_lim=(40,60) retrieves rows where age is between 40 and 60, both included.
The has suffix is useful when working with entries that contain strings. For example, notes_has='interesting' retrieves all rows where the word interesting is present in the notes entry.

The suffixes _lim / _has / _any are reserved: if a column named <base> exists alongside another named <base><suffix>, the kwarg <base><suffix>=v is ambiguous and the suffix branch wins. Database flags this at construction time — see on_reserved_suffix on Database. Fix by renaming the column at the source, or by passing on_reserved_suffix='rename' to auto-rename to the readable form (_lim → _limits, _has → _contains, _any → _options).
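The collision rule above can be sketched on a plain list of column names. This is only an illustration of the rule as described here, not the check that Database performs internally; the helper names find_collisions and rename_plan and the example columns are hypothetical.

```python
RESERVED_SUFFIXES = ("_lim", "_has", "_any")
READABLE_FORM = {"_lim": "_limits", "_has": "_contains", "_any": "_options"}


def find_collisions(columns):
    """Return columns of the form <base><suffix> whose <base> is also a column."""
    present = set(columns)
    collisions = []
    for name in columns:
        for suffix in RESERVED_SUFFIXES:
            if name.endswith(suffix) and name[: -len(suffix)] in present:
                collisions.append(name)
    return collisions


def rename_plan(columns):
    """Build an old-name -> new-name mapping, refusing to overwrite a column."""
    plan = {}
    for name in find_collisions(columns):
        base, suffix = name[:-4], name[-4:]  # every reserved suffix has 4 characters
        target = base + READABLE_FORM[suffix]
        if target in columns:
            raise ValueError(f"rename target {target!r} already exists")
        plan[name] = target
    return plan


# score_any is not a collision because there is no column named score.
columns = ["participant_id", "age", "age_lim", "notes", "notes_has", "score_any"]
collisions = find_collisions(columns)
plan = rename_plan(columns)
```

With these columns, only age_lim and notes_has collide, and the plan renames them to age_limits and notes_contains.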

Arguments can be any column name of the underlying DataFrame containing boolean values. For example, passing an argument 'surgery_performed' is equivalent to passing a keyword argument surgery_performed=True.

Returns: A DataFrame containing relevant rows.
Return type: pd.DataFrame

Examples

# Returns row where participant_id is 1
db(participant_id=1)
# Returns rows for participants whose id is 1 or 4
db(participant_id_any=(1, 4))
# Returns rows of participants between ages 40 and 60
db(age_lim=(40,60))
# Returns rows of participants between ages 40 and 60 who have had surgery
db(age_lim=(40,60), surgery_performed=True)
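To make the query semantics concrete, here is a minimal pure-Python sketch that applies the same rules to a list of row dictionaries. It mirrors the behaviour described in this section (equality, _any, _lim with both limits included, _has as a substring test, positional boolean columns) but is not how Database.__call__() is implemented. The rows and the helpers matches, select, and ids are made up for illustration.

```python
rows = [
    {"participant_id": 1, "age": 45.0, "surgery_performed": True, "notes": "an interesting case"},
    {"participant_id": 2, "age": 62.0, "surgery_performed": False, "notes": "routine"},
    {"participant_id": 3, "age": 60.0, "surgery_performed": False, "notes": ""},
    {"participant_id": 4, "age": 38.0, "surgery_performed": True, "notes": "interesting gait"},
]


def matches(row, *args, **kwargs):
    """Return True if a row satisfies every positional flag and keyword criterion."""
    # Positional arguments name boolean columns that must be True.
    if not all(row[name] is True for name in args):
        return False
    for key, value in kwargs.items():
        base, suffix = key[:-4], key[-4:]
        if suffix == "_any" and base in row:
            ok = row[base] in value          # OR over the listed values
        elif suffix == "_lim" and base in row:
            low, high = value
            ok = low <= row[base] <= high    # both limits included
        elif suffix == "_has" and base in row:
            ok = value in row[base]          # substring test
        else:
            ok = row[key] == value           # plain equality on a column
        if not ok:
            return False
    return True


def select(rows, *args, **kwargs):
    """Keep the rows that satisfy all criteria."""
    return [row for row in rows if matches(row, *args, **kwargs)]


def ids(selected):
    """Convenience helper for inspecting results."""
    return [row["participant_id"] for row in selected]


between_40_and_60 = ids(select(rows, age_lim=(40, 60)))
had_surgery = ids(select(rows, "surgery_performed"))
```

Criteria combine with AND, so select(rows, age_lim=(40, 60), surgery_performed=True) keeps only rows that satisfy both.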

get_df() → DataFrame: Get the underlying DataFrame.

__getitem__(key: Any): Convenience method to access DataFrame columns. Use the square brackets on a database object as if you would use them on the underlying DataFrame containing the metadata.

get(data_field_name: str, hdr: DataFrame = None, ret_type=list, isolate_single: bool = False, *args, **kwargs) → list | dict | Any

Core method to retrieve data structures stored with the Database.add_data_field() method. In practice, methods generated by Database.add_data_field() call this method. Note that those generated methods use different defaults: ret_type is dict and isolate_single is True.

Parameters:

data_field_name (str) – Name of the data field (e.g. heart_rate).
hdr (pd.DataFrame, optional) – A DataFrame containing the rows of interest. Typically the output of Database.__call__(). Defaults to None.
ret_type (type, optional) – Either list or dict. For example, the former would return a list of heart rate data, and the latter would return a dictionary of {participant_id: participant_heart_rate_data} for the queried entries. Defaults to list.
isolate_single (bool, optional) – If the query results in only one data entry, then return just that data entry. Defaults to False.

Returns:

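A list or a dict of data entries, depending on ret_type. When isolate_single is True and the query matches a single entry, that entry is returned on its own, so the return type is Any because any data type can be stored in a data field.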

Return type:

Union[list, dict, Any]

Example

Consider a database of motion capture data where the metadata contains values for cadence and percentage of preferred speed.

hdr = db(cadence=160, speedp=100)
db.get('ot', hdr)
# OR, use the shorter version
db.get('ot', cadence=160, speedp=100)
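The effect of ret_type and isolate_single can be sketched with a plain dictionary standing in for a data field. This is a simplified illustration of the documented return shapes, not the library's code; get_entries and the heart_rate values are hypothetical.

```python
# Hypothetical data field keyed by participant_id.
heart_rate = {1: [61, 63, 60], 2: [72, 75, 71], 3: [58, 57, 59]}


def get_entries(data, keys, ret_type=list, isolate_single=False):
    """Collect the data entries for the selected keys as a list or a dict."""
    selected = {key: data[key] for key in keys}
    if isolate_single and len(selected) == 1:
        # A single match is returned bare rather than wrapped in a container.
        return next(iter(selected.values()))
    if ret_type is dict:
        return selected
    return list(selected.values())


as_list = get_entries(heart_rate, [1, 3])
as_dict = get_entries(heart_rate, [1, 3], ret_type=dict)
single = get_entries(heart_rate, [2], isolate_single=True)
```

Without isolate_single, a one-entry result still comes back wrapped in a list or dict.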

add_data_field(name: str, data: dict, data_key_name: str = 'id')

Add a data field to the database.

Example

Consider the following example:

db.add_data_field(name="heart_rate", data=heart_rate_data, data_key_name="participant_id")
# retrieve heart rate data from participants between 40 and 60 years of age who have not had surgery.
db.heart_rate(age_lim=(40,60), surgery_performed=False)

This method will set db._heart_rate = heart_rate_data, and create a method db.heart_rate which can retrieve specific heart_rate_data entries based on queries related to the metadata stored in the header. See Database.__call__() to learn more about query construction.

Parameters:

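name (str) – Name of the data field (e.g. heart_rate). It must not already be present in the database, and it must not be 'records'.
data (dict) – Dictionary mapping values of the data_key_name column to arbitrary Python objects (e.g. participant_id values to heart rate time series).
data_key_name (str, optional) – Name of the metadata column whose values are used as the keys of data. Defaults to 'id'.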
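One way to picture how a data field becomes both a private attribute and a query method is the toy class below. It is a sketch under simplifying assumptions (rows are plain dictionaries and queries support equality only). TinyDatabase is hypothetical and much smaller than the real Database.

```python
class TinyDatabase:
    """Toy stand-in that only tracks metadata rows and data fields."""

    def __init__(self, rows):
        self.rows = rows
        self.data_fields = []
        self.data_key_names = {}

    def add_data_field(self, name, data, data_key_name="id"):
        if name == "records" or hasattr(self, name):
            raise ValueError(f"{name!r} is not available as a data field name")
        setattr(self, "_" + name, data)  # raw dictionary, e.g. self._heart_rate
        self.data_fields.append(name)
        self.data_key_names[name] = data_key_name

        def accessor(**criteria):
            # Equality-only matching keeps the sketch short.
            keys = [
                row[data_key_name]
                for row in self.rows
                if all(row[col] == val for col, val in criteria.items())
            ]
            return {key: data[key] for key in keys}

        setattr(self, name, accessor)  # query method, e.g. self.heart_rate(...)


db = TinyDatabase([
    {"participant_id": 1, "surgery_performed": True},
    {"participant_id": 2, "surgery_performed": False},
])
db.add_data_field("heart_rate", {1: [61, 63], 2: [72, 75]}, "participant_id")
no_surgery = db.heart_rate(surgery_performed=False)
```

The closure keeps a reference to data and data_key_name, which is why each generated method knows which column indexes its field.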

records(hdr: DataFrame = None, *args, **kwargs) → list[MutableMapping[Hashable, Any]]

Returns records similar to pandas.DataFrame.to_dict(orient='records'). It will include all the entries from the fields in Database.data_fields.

Parameters: hdr (pd.DataFrame, optional) – A DataFrame containing the rows of interest. Typically the output of Database.__call__(). Defaults to None.
Returns: list[MutableMapping[Hashable, Any]]
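As an illustration of the record shape, the sketch below merges metadata rows with the matching data field entries using plain dictionaries. The function build_records and the sample values are hypothetical. Using None for a missing entry is an assumption, not documented behaviour.

```python
rows = [
    {"participant_id": 1, "age": 45.0},
    {"participant_id": 2, "age": 62.0},
]
data_fields = {"heart_rate": {1: [61, 63], 2: [72, 75]}}
data_key_names = {"heart_rate": "participant_id"}


def build_records(rows, data_fields, data_key_names):
    """Return one dict per row, extended with the entry from each data field."""
    out = []
    for row in rows:
        record = dict(row)  # copy so the metadata row is left untouched
        for field, data in data_fields.items():
            key = row[data_key_names[field]]
            record[field] = data.get(key)  # None when no entry exists (assumption)
        out.append(record)
    return out


result = build_records(rows, data_fields, data_key_names)
```

Each record carries the metadata columns plus one key per data field.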

class datanest.DatabaseContainer

Build hierarchical relationships between databases to enable flexible data retrieval based on metadata stored at multiple levels.

Variables:

_db (dict) – database_name (str): database (Database)
_parents – parent_name

Example

dbc = DatabaseContainer()
dbc.add("subject", subject_db)  # add the top level database first
dbc.add("trial", trial_db, "subject", lambda trial_id: trial_id[:2])  # last argument is a function that converts trial_id to subject_id

Each database has columns (metadata) and data_fields. Each database is identified by a name <db_name>, and must have a column <db_name>_id (e.g. a database with name trial must have a column called trial_id).

Within each container, there can only be one top-level database, and this should be added first. Each parent database can have multiple child databases, and each child in turn can be a parent to other databases.

Reserved-suffix collisions (_lim / _has / _any shadowing real column names) are detected at the Database boundary; see Database.__call__(). The container does not re-check the post-rename column space — overlapping-column renames in add() almost never produce new collisions in practice.

property all_db_names: list[str]: List of database names in the container.

property all_column_names: list[str]: Names of all columns in all databases.

property all_data_fields: list[str]: Names of data fields in all databases. For example, "heart_rate", added through Database.add_data_field()

get_db_name_of_column(column_name: str) → str: Get which database a column is in.

get_db_name_of_data_field(data_field: str) → str: Get the database name containing the data field.

get_heritage(db_name: str) → list: Return parent, grandparent, … For example,

"trial" -> ["subject"]
"action" -> ["trial", "subject"]
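The ancestor lookup can be pictured as walking a child-to-parent dictionary until the top-level database is reached. The sketch below assumes such a dictionary with None marking the top level. The actual internal structure of the container is not documented here, so the parents mapping and the heritage function are hypothetical.

```python
# Assumed layout: each database name maps to its parent, None for the top level.
parents = {"subject": None, "trial": "subject", "action": "trial"}


def heritage(db_name, parents):
    """Return the chain of ancestors, nearest first."""
    chain = []
    parent = parents[db_name]
    while parent is not None:
        chain.append(parent)
        parent = parents[parent]
    return chain


trial_heritage = heritage("trial", parents)
action_heritage = heritage("action", parents)
```

The top-level database has no ancestors, so heritage("subject", parents) is an empty list.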

add(child_name: str, db: Database, parent_name: str = None, child_to_parent_id: Callable = None) → None

Add a database to the container.

Parameters:

child_name (str) – Name of the database inside the container. Use a singular word, e.g. "trial" instead of "trials"
db (Database) – The database to be added to the container.
parent_name (str) – Name of the parent in the container. Set this to None for the top level database (default).
child_to_parent_id (Callable) – A function that maps a row in the child to that in a parent. For example, if subject_id = (1,1), and trial_id = (1, 1, 4), child_to_parent_id = lambda trial_id: trial_id[:2]

__call__(*args, **kwargs)

Generalize data search and retrieval across databases created at different levels (e.g. subject, trial, action). A temporary database is created to execute a search using attributes from multiple levels.

See class docstring for examples.
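Conceptually, a search across levels can be thought of as joining each child row with its parent row through child_to_parent_id and then filtering the combined rows. The sketch below shows that idea with plain dictionaries. It is not the container's implementation, and the rows and column names (age, speed) are made up.

```python
subject_rows = [
    {"subject_id": (1, 1), "age": 34},
    {"subject_id": (1, 2), "age": 58},
]
trial_rows = [
    {"trial_id": (1, 1, 4), "speed": 1.2},
    {"trial_id": (1, 2, 1), "speed": 1.5},
    {"trial_id": (1, 2, 2), "speed": 0.9},
]


def child_to_parent_id(trial_id):
    """Map a trial id to the id of the subject it belongs to."""
    return trial_id[:2]


# Index the parent rows by id, then attach parent metadata to every child row.
subjects_by_id = {row["subject_id"]: row for row in subject_rows}
flat_rows = [
    {**subjects_by_id[child_to_parent_id(row["trial_id"])], **row}
    for row in trial_rows
]

# A query can now mix a subject-level column (age) with a trial-level column (speed).
matching_trials = [
    row["trial_id"] for row in flat_rows if row["age"] == 58 and row["speed"] > 1.0
]
```

Each flattened row carries both subject_id and trial_id, which is what lets a single query combine criteria from both levels.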

datanest.get_example_database() → Database

Generate an example database.

Returns: Database

datanest.get_example_data() → MutableMapping[int, Any]

Get example data for adding to a data field.

Example

db = datanest.get_example_database()
db.add_data_field('heart_rate', datanest.get_example_data(), 'participant_id')
db.heart_rate(age_lim=(40,50), surgery_performed=False)

Returns: Fake time and heart rate values encapsulated in a Python object
Return type: MutableMapping[int, Any]

dill-backed file-cache decorators.

Two decorators that skip recomputation by storing the wrapped function’s return value on disk:

cache_me_if_you_can() — bypass the wrapped function entirely if its cache file already exists.
cache_me_if_you_can_incremental() — accumulate results across calls (e.g. building up a per-trial dictionary one trial at a time).

Both vary the cache filename per call via an optional suffix callable, which receives the wrapped function’s (*args, **kwargs) and returns a string inserted between the file’s stem and extension.
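The filename rule can be sketched with pathlib. This helper is illustrative only; suffixed_path is a hypothetical name, and the real decorators may build the path differently.

```python
from pathlib import Path


def suffixed_path(cache_fname, suffix, *args, **kwargs):
    """Insert the suffix callable's return value between the stem and the extension."""
    path = Path(cache_fname)
    if suffix is None:
        return path
    return path.with_name(path.stem + suffix(*args, **kwargs) + path.suffix)


per_tag = suffixed_path("results.pkl", lambda *a, **kw: "_" + a[0], "alpha")
unchanged = suffixed_path("results.pkl", None, "alpha")
```

Here per_tag names results_alpha.pkl, and unchanged stays results.pkl because no suffix callable was given.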

datanest.cache.cache_me_if_you_can(cache_fname: str | Path, *, suffix: Callable[[...], str] | None = None, verbose: bool = False)

Decorator: skip recomputation if cache_fname exists on disk.

Parameters:

cache_fname – Path to the cache file (dill-serialized).
suffix – Optional callable invoked with the wrapped function’s (*args, **kwargs) at call time; its return value is inserted between the file’s stem and its extension. Use this to vary the cache file by call-time inputs.
verbose – If True, print create/load messages.

Example

>>> @cache_me_if_you_can("results.pkl", suffix=lambda *a, **kw: "_" + a[0])
... def expensive(tag):
...     return heavy_computation(tag)
>>> expensive("alpha")  # computes, writes results_alpha.pkl
>>> expensive("alpha")  # loads from results_alpha.pkl
>>> expensive("beta")   # computes, writes results_beta.pkl
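For readers curious about the mechanics, a decorator with this behaviour could look roughly like the sketch below. It uses the standard-library pickle module as a stand-in for dill and omits the verbose option. The name cache_to_file is hypothetical, and this is not the datanest source.

```python
import functools
import pickle
import tempfile
from pathlib import Path


def cache_to_file(cache_fname, *, suffix=None):
    """Skip the wrapped call when its cache file already exists."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = Path(cache_fname)
            if suffix is not None:
                path = path.with_name(path.stem + suffix(*args, **kwargs) + path.suffix)
            if path.exists():
                with path.open("rb") as f:
                    return pickle.load(f)   # bypass the wrapped function entirely
            result = func(*args, **kwargs)
            with path.open("wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


calls = []  # records every real computation
cache_dir = Path(tempfile.mkdtemp())


@cache_to_file(cache_dir / "results.pkl", suffix=lambda *a, **kw: "_" + a[0])
def expensive(tag):
    calls.append(tag)
    return tag.upper()


first = expensive("alpha")   # computes and writes results_alpha.pkl
second = expensive("alpha")  # loads from results_alpha.pkl
third = expensive("beta")    # computes and writes results_beta.pkl
```

Because the file's existence is the only check, changing the function body does not invalidate an existing cache file in this sketch.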

datanest.cache.cache_me_if_you_can_incremental(cache_fname: str | Path, return_name: str, return_default: Any, *, suffix: Callable[[...], str] | None = None, verbose: bool = False, force_save: bool = False)

Decorator: incrementally accumulate results in a dill-cached object.

Save the current state, and avoid repeating computations (e.g. when adding files to a database one after the other and extracting metrics from them). The wrapped function receives the running accumulator via the keyword named in return_name; it returns the (mutated) accumulator.

Parameters:

cache_fname – Path to the cache file (dill-serialized).
return_name – Name of the keyword argument injected into the wrapped function carrying the running accumulator.
return_default – Initial accumulator value when no cache exists.
suffix – Optional callable invoked with the wrapped function’s (*args, **kwargs) at call time; its return value is inserted between the file’s stem and its extension.
verbose – If True, print create/add messages.
force_save – If True, always rewrite the cache file even when the accumulator dict’s key set is unchanged.

Example

>>> @cache_me_if_you_can_incremental(
...     "trials.pkl", return_name="ret", return_default={})
... def process(trial_id, ret=None):
...     if trial_id not in ret:
...         ret[trial_id] = expensive_compute(trial_id)
...     return ret
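The accumulate-and-save pattern can be sketched as follows. The sketch uses pickle in place of dill and always rewrites the cache file, ignoring the force_save and suffix options. The name accumulate_to_file is hypothetical, and this is not the library's code.

```python
import copy
import functools
import pickle
import tempfile
from pathlib import Path


def accumulate_to_file(cache_fname, return_name, return_default):
    """Inject a file-backed accumulator into the wrapped function and save its result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = Path(cache_fname)
            if path.exists():
                with path.open("rb") as f:
                    accumulator = pickle.load(f)
            else:
                # Copy so the default object is never mutated.
                accumulator = copy.deepcopy(return_default)
            kwargs[return_name] = accumulator
            result = func(*args, **kwargs)
            with path.open("wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


computed = []  # records which trials were actually computed
cache_file = Path(tempfile.mkdtemp()) / "trials.pkl"


@accumulate_to_file(cache_file, return_name="ret", return_default={})
def process(trial_id, ret=None):
    if trial_id not in ret:
        computed.append(trial_id)
        ret[trial_id] = trial_id ** 2
    return ret


process(1)
process(2)
final = process(1)  # trial 1 is already in the accumulator, so nothing is recomputed
```

The accumulator survives between calls only through the cache file, so deleting the file resets the state to return_default.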