datanest

src PyPI - Version Documentation Status GitHub license

A simple, intuitive, pandas-based database.

Perfect for handling data such as time series, images, or any Python objects alongside their metadata. This tool encapsulates a pandas DataFrame containing metadata and Python objects. It provides an intuitive data and metadata retrieval syntax through keyword-arguments.


Installation

pip install datanest

Usage

datanest.Database is the core class that wraps a pandas.DataFrame object. Even before adding any data fields using the add_data_field method, the database can already be used to query rows from the encapsulated DataFrame with an intuitive keyword argument syntax.

import datanest

# Load example DataFrame with columns: 
# participant_id (int), age (float), surgery_performed (bool), notes (str)
db = datanest.get_example_database()

# Retrieve all metadata
db()

# Retrieve metadata for participant 3
db(participant_id=3)

# Retrieve metadata for participants aged 50 to 60 who have not had surgery
db(age_lim=(50, 60), surgery_performed=True)

# Retrieve metadata for participants where the notes string contains the word interesting
db(notes_has='interesting')

The add_data_field method can be used to add arbitrary python objects to the database, and we can retrieve relevant data entries using the same keyword argument syntax.

# Add heart rate data to the database, indexed by participant_id
db.add_data_field('heart_rate', datanest.get_example_data(), 'participant_id')

# Retrieve all heart rate time series data
db.heart_rate()

# Retrieve heart rate time series data for participant 3
db.heart_rate(participant_id=3)

# Retrieve heart rate time series for participants aged 50 to 60
db.heart_rate(age_lim=(50, 60))

# Retrieve heart rate time series for participants where the notes string contains the word interesting
db.heart_rate(notes_has='interesting')

Payload types — anything goes

The value side of add_data_field is unconstrained: datanest does not inspect the objects you attach, only the keys that index them. So a payload dict can hold NumPy arrays, images, custom dataclasses, pysampled.Data signals — anything that makes sense for your project.

import numpy as np

db = datanest.get_example_database()
db.add_data_field(
    "eeg",
    {pid: np.random.randn(1000) for pid in db()["participant_id"]},
    "participant_id",
)
db.eeg(age_lim=(50, 60))   # {participant_id: ndarray} for the matching rows

In the lab where datanest originated, time-series payloads are typically pysampled.Data objects, but this is a use convention, not a hard dependency — datanest does not import pysampled and works with whatever payload type you choose.

Hierarchical data: DatabaseContainer

When metadata lives at multiple levels (e.g. subjecttrialaction), wrap a set of Database instances in a DatabaseContainer. Each level is added with a key-derivation function that maps a child id to its parent id. The container provides the same keyword-argument query syntax as Database, resolving the right level automatically.

import datanest

dbc = datanest.DatabaseContainer()
dbc.add("subject", subject_db)
dbc.add("trial", trial_db, "subject", lambda trial_id: trial_id[:2])
dbc.add("action", action_db, "trial", lambda action_id: action_id[:3])

# Query at any level — subject metadata filters trials and actions too
dbc(subject=3)                       # all subject-3 trials/actions
dbc(action_phase='extension')        # subset of action rows
dbc.heart_rate(age_lim=(50, 60))     # data field added at any level

DatabaseContainer uses the same _lim / _has / _any suffix conventions as Database. Add data fields to the child databases directly (trial_db.add_data_field(...)); the container makes them queryable at any level via dbc.<field>(...).

Caching expensive computations

datanest.cache_me_if_you_can and cache_me_if_you_can_incremental are dill-backed file-cache decorators. Use them to skip recomputation when loading or summarizing data fields is expensive. Both accept an optional suffix callable that receives the wrapped function’s (*args, **kwargs) and returns a string inserted between the cache file’s stem and extension — handy for one cache file per input.

from datanest import cache_me_if_you_can, cache_me_if_you_can_incremental

# One cache file per subject (e.g. heart_rate_s03.pkl)
@cache_me_if_you_can("heart_rate.pkl", suffix=lambda subject_id: f"_s{subject_id:02d}")
def load_heart_rate(subject_id):
    return expensive_load(subject_id)            # runs once per subject_id

# Build up a {trial_id: metrics} dict across many calls, persisting on disk
@cache_me_if_you_can_incremental("trial_metrics.pkl", return_name="ret", return_default={})
def summarize_trial(trial_id, ret=None):
    if trial_id not in ret:
        ret[trial_id] = compute_metrics(trial_id)
    return ret

License

datanest is distributed under the terms of the MIT license.

Contact

Praneeth Namburi

Project Link: https://github.com/praneethnamburi/datanest

Acknowledgments

This tool was developed as part of the ImmersionToolbox initiative at the MIT.nano Immersion Lab. Thanks to NCSOFT for supporting this initiative.