API Reference
A simple, intuitive, pandas-based database.
Well suited to handling data such as time series, images, or any Python objects alongside their metadata. This tool encapsulates a pandas DataFrame containing metadata and Python objects. It provides an intuitive syntax for retrieving data and metadata through keyword arguments.
The Database class initializes a database with a pandas DataFrame containing metadata and various data types such as time series or images.
Use Database.add_data_field() to incorporate a new dictionary mapping metadata columns in the DataFrame to arbitrary Python objects.
Use Database.__call__() to retrieve metadata, specifying criteria as Python keyword arguments.
- class datanest.Mapping(df: DataFrame)
Create a dictionary map between any two columns of a dataframe.
- __call__(left_col_name: str, right_col_name: str, row_selector: Callable[[Any, Any], bool] | None = None) dict
Call self as a function.
- exception datanest.ReservedSuffixCollisionWarning
Issued when a DataFrame column name collides with a reserved query suffix.
See Database.__call__() for the suffix-reservation rule.
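The collision condition (both <base> and <base><suffix> present as columns) can be illustrated with plain Python. This is a minimal sketch, not datanest's own check: find_suffix_collisions and RESERVED_SUFFIXES are hypothetical names introduced here for illustration.

```python
RESERVED_SUFFIXES = ("_lim", "_has", "_any")


def find_suffix_collisions(columns):
    """Return (base, suffixed) pairs where both names are present as columns."""
    present = set(columns)
    collisions = []
    for name in columns:
        for suffix in RESERVED_SUFFIXES:
            base = name[: -len(suffix)]
            # A collision needs a non-empty base that is itself a column.
            if name.endswith(suffix) and base and base in present:
                collisions.append((base, name))
    return collisions


# "age_lim" collides because "age" also exists; "speed_lim" does not,
# because there is no "speed" column.
example_collisions = find_suffix_collisions(["age", "age_lim", "notes", "speed_lim"])
```

A column such as speed_lim on its own is harmless under this rule, since a kwarg speed_lim=... can only mean literal equality on that column.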
- class datanest.Database(data: DataFrame | str | Path, *, on_reserved_suffix: str = 'warn')
Manipulate data (signals/images/arbitrary Python objects) and their metadata stored in a CSV/pandas DataFrame.
A Database instance encapsulates a pandas DataFrame, and is designed for use with dictionary-like storage of data structures, such as time series data. It facilitates working with data (e.g., time series/images, etc.) alongside the metadata stored in a pandas DataFrame. This tool provides an intuitive and flexible way to retrieve relevant data from data structures based on metadata organized in the DataFrame. It was developed to address performance issues when storing arbitrary data objects, including NumPy arrays, in a pandas DataFrame.
- Parameters:
data (Union[pd.DataFrame, str, Path]) –
(str or Path) Path to a CSV or Excel file. It is read as a pandas DataFrame. Make sure openpyxl is installed when working with Excel files.
(pd.DataFrame) Pass an already loaded pandas DataFrame.
on_reserved_suffix (str, keyword-only) – How to react when a column name collides with a reserved query suffix (_lim/_has/_any); see Database.__call__(). One of "warn" (default; issue a ReservedSuffixCollisionWarning), "raise" (raise ValueError), "ignore" (silence the check), or "rename" (auto-rename colliding columns via the mapping _lim→_limits, _has→_contains, _any→_options; raises if a rename target already exists). A collision exists only when both <base> and <base><suffix> are present as columns, in which case kwargs like db(<base><suffix>=...) are ambiguous between the suffix predicate and literal equality on the suffixed column.
- Variables:
data_fields (list) – Names of data dictionaries added using the Database.add_data_field() method.
data_key_names (dict) – Mapping from the names of the data fields contained in Database.data_fields to the names of columns. Each data field, e.g., heart_rate_data, maps a column from the DataFrame data, e.g., participant_id, to the time series containing heart rate values. data_key_names stores {'heart_rate_data': 'participant_id'}, meaning that heart_rate_data is indexed using participant_id.
Examples
    import datanest

    # Load metadata from a CSV file with columns:
    # participant_id (int), age (float), surgery_performed (bool), notes (str)
    db = datanest.Database(r'C:\data\participant_data.csv')
    # Alternatively, start from the bundled example database
    db = datanest.get_example_database()

    # Add heart rate data to the database, indexed by participant_id
    db.add_data_field('heart_rate', load_heart_rate_data(), 'participant_id')

    # Retrieve heart rate time series data for participant 3
    db.heart_rate(participant_id=3)

    # Retrieve heart rate time series data from participants aged 50 to 60
    db.heart_rate(age_lim=(50, 60))

    # Retrieve heart rate time series data from participants whose notes
    # contain the word "interesting"
    db.heart_rate(notes_has='interesting')
- __call__(*args, **kwargs) DataFrame
Select rows from the metadata in the DataFrame. It provides an intuitive python kwargs (keyword arguments) based syntax.
Keywords in kwargs can be any column name in the underlying DataFrame. Special keywords have the format <column_name>_<suffix>, where the suffix can be either ‘any’, ‘lim’, or ‘has’.
The any suffix is useful to specify "or" conditions; for example, participant_id_any=(1, 3) retrieves rows whose participant_id matches either 1 or 3.
The lim suffix is useful to specify limits, for example, age_lim=(40,60) retrieves rows where age is between 40 and 60, both included.
The has suffix is useful when working with entries that hold strings; for example, notes_has='interesting' retrieves all rows where the word interesting is present in the notes entry.
The suffixes _lim/_has/_any are reserved: if a column named <base> exists alongside another named <base><suffix>, the kwarg <base><suffix>=v is ambiguous and the suffix branch wins. Database flags this at construction time; see on_reserved_suffix on Database. Fix by renaming the column at the source, or by passing on_reserved_suffix='rename' to auto-rename to the readable form (_lim→_limits, _has→_contains, _any→_options).
Arguments can be any column name of the underlying DataFrame containing boolean values. For example, passing an argument 'surgery_performed' is equivalent to passing a keyword argument surgery_performed=True.
- Returns:
A DataFrame containing relevant rows.
- Return type:
pd.DataFrame
Examples
    # Returns the row where participant_id is 1
    db(participant_id=1)

    # Returns rows for participants with id 1 and 4
    db(participant_id_any=(1, 4))

    # Returns rows of participants between ages 40 and 60
    db(age_lim=(40, 60))

    # Returns rows of participants between ages 40 and 60 who have had surgery
    db(age_lim=(40, 60), surgery_performed=True)
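The selection rules described for __call__ can be mimicked on plain dictionaries to make the semantics concrete. This is a minimal sketch under stated assumptions, not datanest's implementation: select_rows and the sample rows are hypothetical, and the metadata is held as a list of dicts rather than a DataFrame.

```python
def select_rows(rows, *flags, **criteria):
    """Filter a list of dict rows using datanest-style keyword criteria."""
    columns = set(rows[0]) if rows else set()

    def matches(row):
        # Positional arguments name boolean columns that must be True.
        if not all(bool(row[flag]) for flag in flags):
            return False
        for key, wanted in criteria.items():
            base, suffix = key[:-4], key[-4:]
            # The suffix branch is tried first, so it wins over a literal column.
            if suffix == "_any" and base in columns:
                ok = row[base] in wanted
            elif suffix == "_lim" and base in columns:
                low, high = wanted
                ok = low <= row[base] <= high  # both limits included
            elif suffix == "_has" and base in columns:
                ok = wanted in row[base]  # substring test
            else:
                ok = row[key] == wanted  # plain equality
            if not ok:
                return False
        return True

    return [row for row in rows if matches(row)]


rows = [
    {"participant_id": 1, "age": 45.0, "surgery_performed": True, "notes": "interesting gait"},
    {"participant_id": 3, "age": 62.0, "surgery_performed": False, "notes": "none"},
    {"participant_id": 4, "age": 58.0, "surgery_performed": True, "notes": "nothing interesting"},
]


def ids(selected):
    return [row["participant_id"] for row in selected]
```

With these rows, select_rows(rows, "surgery_performed") and select_rows(rows, surgery_performed=True) select the same participants, mirroring the equivalence stated above.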
- get_df() DataFrame
Get the underlying DataFrame.
- __getitem__(key: Any)
Convenience method to access DataFrame columns. Use the square brackets on a database object as if you would use them on the underlying DataFrame containing the metadata.
- get(data_field_name: str, hdr: DataFrame = None, ret_type=list, isolate_single: bool = False, *args, **kwargs) list | dict | Any
Core method to retrieve data structures stored with the Database.add_data_field() method. In practice, methods generated by Database.add_data_field() will use this method. Note that the defaults used by Database.add_data_field() are ret_type=dict and isolate_single=True.
- Parameters:
data_field_name (str) – Name of the data field. Example - heart_rate
hdr (pd.DataFrame, optional) – A DataFrame containing the rows of interest. Typically the output of Database.__call__(). Defaults to None.
ret_type (type, optional) – Either list or dict. For example, the former would return a list of heart rate data, and the latter would return a dictionary of {participant_id: participant_heart_rate_data} for the queried entries. Defaults to list.
isolate_single (bool, optional) – If the query results in only one data entry, then return just that data entry. Defaults to False.
- Returns:
When isolate_single is set to True, then return type is Any because any data type can be stored in a data field.
- Return type:
Union[list, dict, Any]
Example
Consider a database of motion capture data where the metadata contains values for cadence and percentage of preferred speed.
    hdr = db(cadence=160, speedp=100)
    db.get('ot', hdr)

    # OR, use the shorter version
    db.get('ot', cadence=160, speedp=100)
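The effect of ret_type and isolate_single can be sketched on a plain dictionary. The helper below is hypothetical (get_entries is not part of datanest) and assumes the query has already been reduced to a list of keys; silently skipping keys with no stored entry is also an assumption.

```python
def get_entries(data, keys, ret_type=list, isolate_single=False):
    """Collect the entries of `data` indexed by `keys`, as a list or a dict."""
    # Assumption: keys without a stored entry are skipped.
    selected = {key: data[key] for key in keys if key in data}
    if isolate_single and len(selected) == 1:
        # A single match is returned bare, whatever its type.
        return next(iter(selected.values()))
    if ret_type is dict:
        return selected
    return list(selected.values())


heart_rate = {1: [60, 62], 3: [70, 71], 4: [80, 79]}

as_list = get_entries(heart_rate, [1, 4])
as_dict = get_entries(heart_rate, [1, 4], ret_type=dict)
single = get_entries(heart_rate, [3], isolate_single=True)
```

This also shows why the return type is Any when isolate_single is True: the bare entry can be any object stored in the data field.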
- add_data_field(name: str, data: dict, data_key_name: str = 'id')
Add a data field to the database.
Example
Consider the following example:
    db.add_data_field(name="heart_rate", data=heart_rate_data, data_key_name="participant_id")

    # Retrieve heart rate data from participants between 40 and 60 years of age
    # who have not had surgery.
    db.heart_rate(age_lim=(40, 60), surgery_performed=False)
This method will set db._heart_rate = heart_rate_data, and create a method db.heart_rate which can retrieve specific heart_rate_data entries based on queries related to the metadata stored in the header. See Database.__call__() to learn more about query construction.
- Parameters:
name (str) – Name of the data field (e.g. heart_rate). Should not be present in the database, and it should not be ‘records’
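data (dict) – Dictionary mapping values of the data_key_name column to arbitrary Python objects (e.g., time series).
data_key_name (str, optional) – Name of the metadata column whose values index data. Defaults to 'id'.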
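The pattern of storing the dictionary under a private attribute and generating an accessor method can be sketched with setattr. TinyDatabase below is a hypothetical stand-in, not datanest's Database: it keeps metadata as a list of dicts, supports only plain-equality criteria, and always returns a dict (it does not model isolate_single).

```python
class TinyDatabase:
    """Hypothetical stand-in holding metadata rows as a list of dicts."""

    def __init__(self, rows):
        self.rows = rows
        self.data_fields = []
        self.data_key_names = {}

    def add_data_field(self, name, data, data_key_name="id"):
        if name == "records" or hasattr(self, name):
            raise ValueError(f"{name!r} is reserved or already present")
        setattr(self, "_" + name, data)  # e.g. self._heart_rate
        self.data_fields.append(name)
        self.data_key_names[name] = data_key_name

        def accessor(**criteria):
            # Plain-equality matching only; the real query syntax is richer.
            keys = [
                row[data_key_name]
                for row in self.rows
                if all(row[col] == val for col, val in criteria.items())
            ]
            return {key: data[key] for key in keys if key in data}

        setattr(self, name, accessor)  # e.g. self.heart_rate(...)


tiny = TinyDatabase([
    {"participant_id": 1, "age": 45},
    {"participant_id": 3, "age": 62},
])
tiny.add_data_field("heart_rate", {1: [60, 61], 3: [70, 72]}, "participant_id")
selected = tiny.heart_rate(participant_id=3)
```

Because the accessor becomes an attribute of the object, a field name that clashes with an existing attribute has to be rejected, which is consistent with the restriction on name above.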
- records(hdr: DataFrame = None, *args, **kwargs) list[MutableMapping[Hashable, Any]]
Returns records similar to pandas.DataFrame.to_dict(orient='records'). It will include all the entries from the fields in Database.data_fields.
- Parameters:
hdr (pd.DataFrame, optional) – A DataFrame containing the rows of interest. Typically the output of Database.__call__(). Defaults to None.
- Returns:
list[MutableMapping[Hashable, Any]]
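One plausible shape for such records is one dictionary per metadata row, extended with a key per data field. The sketch below is an assumption about the layout, not a description of datanest's output: build_records is hypothetical, and using the data field name as the record key and None for missing entries are both illustrative choices.

```python
def build_records(rows, data_fields, data_key_names):
    """Return one dict per row, extended with that row's data-field entries."""
    records = []
    for row in rows:
        record = dict(row)  # copy so the metadata rows stay untouched
        for field, data in data_fields.items():
            key = row[data_key_names[field]]
            record[field] = data.get(key)  # None when no entry exists (assumption)
        records.append(record)
    return records


metadata_rows = [
    {"participant_id": 1, "age": 45},
    {"participant_id": 3, "age": 62},
]
records = build_records(
    metadata_rows,
    data_fields={"heart_rate": {1: [60, 61]}},
    data_key_names={"heart_rate": "participant_id"},
)
```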
- class datanest.DatabaseContainer
Build hierarchical relationships between databases to enable flexible data retrieval based on metadata stored at multiple levels.
- Variables:
_db (dict) – database_name (str) → database (Database)
_parents – parent_name
Example
    dbc = DatabaseContainer()
    dbc.add("subject", subject_db)  # add the top level database first
    # The last argument is a function that converts trial_id to subject_id
    dbc.add("trial", trial_db, "subject", lambda trial_id: trial_id[:2])
Each database has columns (metadata) and data_fields. Each database is identified by a name <db_name>, and must have a column <db_name>_id (e.g. a database with name trial must have a column called trial_id).
Within each container, there can only be one top-level database, and this should be added first. Each parent database can have multiple child databases, and each child in turn can be a parent to other databases.
Reserved-suffix collisions (_lim/_has/_any shadowing real column names) are detected at the Database boundary; see Database.__call__(). The container does not re-check the post-rename column space; overlapping-column renames in add() almost never produce new collisions in practice.
- property all_db_names: list[str]
List of database names in the container.
- property all_column_names: list[str]
Names of all columns in all databases.
- property all_data_fields: list[str]
Names of data fields in all databases. For example, “heart_rate”, added through Database.add_data_field()
- get_db_name_of_column(column_name: str) str
Get which database a column is in.
- get_db_name_of_data_field(data_field: str) str
Get the database name containing the data field.
- get_heritage(db_name: str) list
Return parent, grandparent, … For example,
"trial" -> ["subject"]
"action" -> ["trial", "subject"]
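The ancestor lookup amounts to walking a child-to-parent mapping upward until the top-level database is reached. A minimal sketch, assuming the hierarchy is stored as a plain dict (the parents mapping and the free function below are hypothetical, not the container's internals):

```python
def get_heritage(parents, db_name):
    """Walk a child -> parent mapping upward, nearest ancestor first."""
    heritage = []
    current = parents.get(db_name)
    while current is not None:
        heritage.append(current)
        current = parents.get(current)
    return heritage


# The top-level database has no parent.
parents = {"subject": None, "trial": "subject", "action": "trial"}
```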
- add(child_name: str, db: Database, parent_name: str = None, child_to_parent_id: Callable = None) None
Add a database to the container.
- Parameters:
child_name (str) – Name of the database inside the container. Use a singular word, e.g. “trial” instead of “trials”
db (Database) – The database to be added to the container.
parent_name (str) – Name of the parent in the container. Set this to None for the top level database (default).
child_to_parent_id (Callable) – A function that maps the ID of a row in the child to the ID of the corresponding row in the parent. For example, if subject_id = (1, 1) and trial_id = (1, 1, 4), use child_to_parent_id = lambda trial_id: trial_id[:2]
- __call__(*args, **kwargs)
Generalize data search and retrieval across databases created at different levels (e.g. subject, trial, action). A temporary database is created to execute a search using attributes from multiple levels.
See class docstring for examples.
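Searching across levels can be pictured as attaching each parent's metadata to its child rows, then filtering the merged rows. The sketch below illustrates that idea on plain dictionaries; attach_parent_metadata and the sample data are hypothetical, and the container's actual merging strategy may differ.

```python
subjects = {(1, 1): {"age": 45}, (1, 2): {"age": 62}}
trials = [
    {"trial_id": (1, 1, 4), "cadence": 160},
    {"trial_id": (1, 2, 1), "cadence": 140},
]


def child_to_parent_id(trial_id):
    # The first two elements of a trial_id identify the subject.
    return trial_id[:2]


def attach_parent_metadata(children, parents_by_id, to_parent_id, child_id_col):
    """Return child rows extended with the metadata of their parent row."""
    merged = []
    for child in children:
        parent_id = to_parent_id(child[child_id_col])
        # Child values take precedence if a column name appears at both levels.
        merged.append({**parents_by_id[parent_id], **child})
    return merged


merged_trials = attach_parent_metadata(trials, subjects, child_to_parent_id, "trial_id")
# A subject-level attribute (age) can now select trial-level rows.
older_trials = [row["trial_id"] for row in merged_trials if row["age"] > 50]
```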
- datanest.get_example_data() MutableMapping[int, Any]
Get example data for adding to a data field.
Example
    db = datanest.get_example_database()
    db.add_data_field('heart_rate', datanest.get_example_data(), 'participant_id')
    db.heart_rate(age_lim=(40, 50), surgery_performed=False)
- Returns:
Fake time and heart rate values encapsulated in a Python object.
- Return type:
MutableMapping[int, Any]
dill-backed file-cache decorators.
Two decorators that skip recomputation by storing the wrapped function’s return value on disk:
cache_me_if_you_can(): bypass the wrapped function entirely if its cache file already exists.
cache_me_if_you_can_incremental(): accumulate results across calls (e.g. building up a per-trial dictionary one trial at a time).
Both vary the cache filename per call via an optional suffix callable, which receives the wrapped function's (*args, **kwargs) and returns a string inserted between the file's stem and extension.
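The filename rule can be reproduced with pathlib. This is a minimal sketch of the naming scheme only; suffixed_cache_path is a hypothetical helper, not a datanest function.

```python
from pathlib import Path


def suffixed_cache_path(cache_fname, suffix=None, args=(), kwargs=None):
    """Insert suffix(*args, **kwargs) between the file's stem and extension."""
    path = Path(cache_fname)
    if suffix is None:
        return path
    tag = suffix(*args, **(kwargs or {}))
    return path.with_name(path.stem + tag + path.suffix)


def tag_from_first_arg(*a, **kw):
    return "_" + a[0]


alpha_path = suffixed_cache_path("cache/results.pkl", tag_from_first_arg, args=("alpha",))
```

Here alpha_path is cache/results_alpha.pkl, so each distinct first argument gets its own cache file in the same directory.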
- datanest.cache.cache_me_if_you_can(cache_fname: str | Path, *, suffix: Callable[[...], str] | None = None, verbose: bool = False)
Decorator: skip recomputation if cache_fname exists on disk.
- Parameters:
cache_fname – Path to the cache file (dill-serialized).
suffix – Optional callable invoked with the wrapped function's (*args, **kwargs) at call time; its return value is inserted between the file's stem and its extension. Use this to vary the cache file by call-time inputs.
verbose – If True, print create/load messages.
Example
    >>> @cache_me_if_you_can("results.pkl", suffix=lambda *a, **kw: "_" + a[0])
    ... def expensive(tag):
    ...     return heavy_computation(tag)
    >>> expensive("alpha")  # computes, writes results_alpha.pkl
    >>> expensive("alpha")  # loads from results_alpha.pkl
    >>> expensive("beta")   # computes, writes results_beta.pkl
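The skip-if-cached behaviour can be sketched as an ordinary decorator factory. This is an illustration of the pattern, not datanest's code: cache_to_file is a hypothetical name, and the standard-library pickle module is swapped in for dill so the example has no third-party dependency.

```python
import functools
import pickle
import tempfile
from pathlib import Path


def cache_to_file(cache_fname, *, suffix=None):
    """Skip the wrapped call when its cache file already exists."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = Path(cache_fname)
            if suffix is not None:
                path = path.with_name(path.stem + suffix(*args, **kwargs) + path.suffix)
            if path.exists():
                with path.open("rb") as fh:
                    return pickle.load(fh)  # cache hit: func is not called
            result = func(*args, **kwargs)
            with path.open("wb") as fh:
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator


calls = []  # records each real computation
cache_dir = Path(tempfile.mkdtemp())


@cache_to_file(cache_dir / "results.pkl", suffix=lambda *a, **kw: "_" + a[0])
def expensive(tag):
    calls.append(tag)
    return tag.upper()


first = expensive("alpha")   # computes and writes results_alpha.pkl
second = expensive("alpha")  # loads from disk, no recomputation
third = expensive("beta")    # different suffix, so it computes again
```

Note that without a suffix callable every call shares one cache file, so a second call with different arguments would return the first call's result.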
- datanest.cache.cache_me_if_you_can_incremental(cache_fname: str | Path, return_name: str, return_default: Any, *, suffix: Callable[[...], str] | None = None, verbose: bool = False, force_save: bool = False)
Decorator: incrementally accumulate results in a dill-cached object.
Save the current state, and avoid repeating computations (e.g. when adding files to a database one after the other and extracting metrics from them). The wrapped function receives the running accumulator via the keyword named in return_name; it returns the (mutated) accumulator.
- Parameters:
cache_fname – Path to the cache file (dill-serialized).
return_name – Name of the keyword argument injected into the wrapped function carrying the running accumulator.
return_default – Initial accumulator value when no cache exists.
suffix – Optional callable invoked with the wrapped function's (*args, **kwargs) at call time; its return value is inserted between the file's stem and its extension.
verbose – If True, print create/add messages.
force_save – If True, always rewrite the cache file even when the accumulator dict’s key set is unchanged.
Example
    >>> @cache_me_if_you_can_incremental(
    ...     "trials.pkl", return_name="ret", return_default={})
    ... def process(trial_id, ret=None):
    ...     if trial_id not in ret:
    ...         ret[trial_id] = expensive_compute(trial_id)
    ...     return ret
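The accumulate-across-calls pattern can be sketched in the same way. Again this is an illustration under assumptions rather than datanest's implementation: accumulate_in_file is hypothetical, pickle stands in for dill, and the sketch rewrites the cache file on every call instead of only when the accumulator's key set changes.

```python
import copy
import functools
import pickle
import tempfile
from pathlib import Path


def accumulate_in_file(cache_fname, return_name, return_default):
    """Inject a file-backed accumulator into the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = Path(cache_fname)
            if path.exists():
                with path.open("rb") as fh:
                    accumulator = pickle.load(fh)
            else:
                # Copy so the default object itself is never mutated.
                accumulator = copy.deepcopy(return_default)
            kwargs[return_name] = accumulator
            result = func(*args, **kwargs)
            with path.open("wb") as fh:
                pickle.dump(result, fh)  # always saved in this sketch
            return result
        return wrapper
    return decorator


computed = []  # records which trials were actually computed
cache_file = Path(tempfile.mkdtemp()) / "trials.pkl"


@accumulate_in_file(cache_file, return_name="ret", return_default={})
def process(trial_id, ret=None):
    if trial_id not in ret:
        computed.append(trial_id)
        ret[trial_id] = trial_id ** 2  # stand-in for an expensive computation
    return ret


process(1)
process(2)
final = process(1)  # trial 1 is already in the accumulator
```

The wrapped function is still called every time; the saving comes from its own membership check against the accumulator, which is why the example tests trial_id not in ret before computing.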