DatasetCollection¶
Overview¶
DatasetCollection coordinates multiple datasets and combines them into
outputs used by model elements.
Use Cases¶
Combine multiple data sources into one attribute pipeline.
Centralize dataset orchestration for reproducible transformations.
Examples¶
The code below shows an example of how to implement a subclass of the
DatasetCollection abstract class. Please read the docstrings carefully as
they contain detailed information on required methods and syntax.
from __future__ import annotations
from typing import TYPE_CHECKING
if TYPE_CHECKING:
from pathlib import Path
from zen_creator.datasets.datasets.dataset import Dataset
from zen_creator.elements.element import Element
from zen_creator.datasets.dataset_collections.dataset_collection import (
DatasetCollection,
)
from zen_creator.datasets.datasets import TemplateDataset
from zen_creator.utils.attribute import Attribute
class TemplateDatasetCollection(DatasetCollection):
"""
Template class for dataset collections. This template is designed as a
starting point for users wishing to implement a new dataset collection.
Please read the docstrings and comments carefully for notes on how to use
the template.
A dataset collection groups multiple datasets and exposes methods that return
Attribute objects for elements. Even when multiple datasets are available,
data processing should primarily happen in dataset classes and this class
should be used as a readable map to those datasets.
All dataset collections must inherit from the DatasetCollection class and
implement the required abstract methods. In this template, only one dataset
(`TemplateDataset`) is used for demonstration purposes.
All methods and properties that need to be implemented are marked with a
`TODO` comment. You can search for `TODO` in this file to quickly find all
places where you need to make changes.
"""
name = "template_dataset_collection"
def __init__(self, source_path: Path | str | None = None):
super().__init__(source_path=source_path)
def _get_data(self) -> dict[str, Dataset]:
"""
Return all datasets belonging to this collection.
This method is used to set the self.data property when the dataset
collection is constructed.
`TODO`: This method must be implemented. It should return a dictionary
where keys are dataset names and values are Dataset objects.
"""
return {
"template_dataset": TemplateDataset(self.source_path),
# Add more datasets here if needed
# "dataset2_name": Dataset2(),
# "dataset3_name": Dataset3(),
}
def get_max_load(self, element: Element, **kwargs) -> Attribute:
"""
Function for creating max_load attribute.
Functions for other attributes should follow the same naming
convention i.e. get_<attribute_name>.
This function uses information from self.data and returns an object
of class Attribute. Any internal functions which are called by this
function should begin with an underscore to clearly mark them as
internal.
Additional keyword arguments can be added to the function signature if
needed. These can be helpful if the dataset collection has multiple
configurations and/or settings which control the result.
"""
dataset = self.data["template_dataset"]
# validate to ensure dataset is correct type
if not isinstance(dataset, TemplateDataset):
raise TypeError(
"Expected 'template_dataset' entry to be a TemplateDataset, got "
f"{type(dataset).__name__}."
)
attr = dataset.get_max_load(element=element, **kwargs)
return attr
def _max_load_unit(self):
"""
Helper function for creating the 'max_load' attribute.
All helper functions should begin with an underscore to clearly mark
them as internal.
"""
return "MW"
Summary
Initialize a DatasetCollection instance. |
Constructors
- DatasetCollection.__init__(source_path: Path | str | None = None)¶
Initialize a DatasetCollection instance.
- Parameters:
source_path (Path | str | None) – Path to the source data directory.
Member Reference
- class zen_creator.DatasetCollection(*args: Any, **kwargs: Any)
Bases:
ABCCombined dataset for various data.
- property data: Dict[str, Dataset]
Dictionary of datasets.
Each key is the dataset name and each value is the dataset object.
- property metadata: Dict[str, MetaData]
Metadata for all datasets in the collection.
- Returns:
Dictionary mapping dataset names to metadata.
- Return type:
Dict[str, MetaData]
- name: str