Dataset
Overview
A Dataset represents a single raw data source together with the
transformation logic used to produce model attributes from it.
Use Cases
Encapsulate loading and preprocessing of one source dataset.
Expose attribute-ready outputs to elements and dataset collections.
Examples
The code below shows an example of how to implement a subclass of the
Dataset abstract class. Please read the docstrings carefully as they
contain detailed information on required methods and syntax.
import numbers
from pathlib import Path

import pandas as pd

from zen_creator.elements.element import Element
from zen_creator.utils.attribute import Attribute

from .dataset import Dataset
from .metadata import MetaData, SourceInformation


class TemplateDataset(Dataset[pd.DataFrame]):
    """
    Template class for datasets. This template is designed as a starting point
    for users wishing to implement a new dataset. Please read the
    docstrings and comments carefully for notes on how to use the template.

    All datasets must inherit from the Dataset class and implement the required
    abstract methods. These methods are called during the construction of the
    dataset object to set the metadata, path, and data properties of the
    dataset. Each of these methods is marked with a `TODO` comment to indicate
    that it must be implemented. You can search for `TODO` in this file to
    quickly find all the places where you need to make changes.

    The Dataset class takes a generic type parameter which specifies the
    return type of the data property. Please set this to the appropriate type
    for your dataset (e.g., pd.DataFrame, Dict[str, pd.DataFrame], etc.) and
    adjust the return type of _set_data() accordingly. In this template, we
    have set it to pd.DataFrame for demonstration purposes.

    All Datasets are singleton objects, meaning that they are constructed only
    once regardless of how many times they are instantiated. Datasets can be
    large and expensive to load, so we want to avoid loading them multiple
    times. The first time a dataset is instantiated, it is constructed and
    loaded as normal. The second time it is instantiated, the existing instance
    is returned instead of constructing a new one. This means that the
    constructor and the methods called during construction (i.e., the methods
    marked with `TODO` comments) are called only once, even if the dataset is
    instantiated multiple times. The raw data is therefore loaded once, and
    subsequent instantiations of the dataset use the already loaded data.
    """

    name = "template_dataset"

    def __init__(self, source_path: Path | str | None = None):
        super().__init__(source_path=source_path)

    def _set_metadata(self) -> MetaData:
        """
        Return citation metadata for the dataset.

        This method is used to set the self.metadata property when the
        dataset is constructed.

        `TODO`: This method must be implemented. It should return a MetaData
        object containing citation information for the dataset.
        """
        return MetaData(
            name=self.name,
            title=(
                "Technology lifetimes and availability data for energy "
                "system modeling"
            ),
            author=["Reliability and Risk Engineering Lab"],
            publication="Journal of Reliability and Risk Engineering",
            publication_year=2026,
            url="https://example.com/dataset.csv",
        )

    def _set_path(self) -> Path | None:
        """
        Return the path to the dataset file.

        This method is used to set the self.path property when the dataset is
        constructed.

        `TODO`: This method must be implemented. It should return a Path object
        pointing to the location of the dataset. It should use the
        self.source_path argument passed to the constructor to determine the
        location of the raw dataset files.
        """
        return Path(".")

    def _set_data(self) -> pd.DataFrame:
        """
        Load the dataset from self.path.

        This should be implemented to load the dataset from self.path and return
        it as a pandas DataFrame or a dictionary of pandas DataFrames. The exact
        implementation will depend on the format of the dataset (e.g., CSV,
        Excel, etc.) and the structure of the data. Any preprocessing steps
        (e.g., handling missing values, renaming columns, etc.) should also be
        included in this method.

        The method is used to set the self.data property when the dataset is
        constructed. It therefore cannot take any input arguments, but can
        access self.path and any other properties of the dataset.

        `TODO`: This method must be implemented.
        """
        # A real implementation would load the dataset from self.path;
        # here we return a dummy dataset for demonstration purposes.
        data = pd.DataFrame(
            {"max_load": [100, 150, 200, 250], "availability_import": [1, 2, 3, 4]},
            index=[
                "template_conversion_technology",
                "template_storage_technology",
                "template_transport_technology",
                "template_retrofitting_technology",
            ],
        )
        return data

    # -------- methods ------------------------
    def get_max_load(self, element: Element, **kwargs) -> Attribute:
        """
        Create the max_load attribute.

        Functions for other attributes should follow the same naming
        convention, i.e., get_<attribute_name>.

        This function uses information from self.data and returns an object
        of class Attribute. Any internal functions which are called by this
        function should begin with an underscore to clearly mark them as
        internal.

        Additional keyword arguments can be added to the function signature if
        needed. These can be helpful if, for example, the dataset has multiple
        configurations and/or settings which control the result. In this case,
        the relevant settings can be passed as keyword arguments to the function.
        """
        # Look up the value for this element by its name (the DataFrame index).
        default_value = self.data.at[element.name, "max_load"]
        if not isinstance(default_value, numbers.Real):
            raise ValueError(
                "Expected numeric value for max_load, got type "
                f"{type(default_value).__name__}"
            )
        attr = Attribute("max_load", element)
        attr.set_data(
            default_value=float(default_value),
            unit=self._max_load_unit(),
            source=SourceInformation(
                description="Description of how max_load was determined.",
                metadata=self.metadata,
            ),
        )
        return attr

    def _max_load_unit(self) -> str:
        """
        Helper function for creating the 'max_load' attribute.

        All helper functions should begin with an underscore to clearly mark
        them as internal.
        """
        return "MW"
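
The singleton behaviour described in the class docstring can be illustrated with a small standalone sketch. It is not taken from zen_creator: SingletonMeta and DemoDataset are hypothetical names, and a metaclass is only one common way to get "construct once, reuse afterwards" semantics. How the Dataset base class achieves this internally is not shown on this page.

```python
from __future__ import annotations

from typing import Any


class SingletonMeta(type):
    """Hypothetical metaclass: build one instance per class, then reuse it."""

    _instances: dict[type, Any] = {}

    def __call__(cls, *args: Any, **kwargs: Any) -> Any:
        # Only the first call runs __new__/__init__; later calls skip both
        # and return the cached instance.
        if cls not in SingletonMeta._instances:
            SingletonMeta._instances[cls] = super().__call__(*args, **kwargs)
        return SingletonMeta._instances[cls]


class DemoDataset(metaclass=SingletonMeta):
    """Stand-in for a Dataset subclass; counts how often loading happens."""

    load_count = 0

    def __init__(self, source_path: str | None = None) -> None:
        self.source_path = source_path
        DemoDataset.load_count += 1  # stands in for an expensive load


first = DemoDataset("data/raw")
second = DemoDataset("somewhere/else")  # in this sketch, arguments are ignored on reuse

print(first is second)         # True
print(DemoDataset.load_count)  # 1
print(second.source_path)      # data/raw
```

One consequence worth noting in this sketch is that arguments passed on later instantiations have no effect, because the constructor does not run again. Whether the real Dataset class treats a differing source_path the same way is an assumption to verify against the library.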
Summary
Dataset.__init__: Initialize a Dataset instance.
Constructors
- Dataset.__init__(source_path: str | Path | None)
Initialize a Dataset instance.
- Parameters:
source_path (str | Path | None) – Path to the source data directory.
Member Reference
- class zen_creator.Dataset(*args: Any, **kwargs: Any)
Bases: ABC, Generic[T]
Abstract base class for datasets.
Subclasses must implement the internal abstract hooks _set_metadata(), _set_path(), and _set_data() to provide metadata, path, and data.
- property data: T
The dataset data.
- Returns:
The dataset as a DataFrame or dict of DataFrames.
- Return type:
T
- property metadata: MetaData
Citation metadata for the dataset.
- Returns:
Citation metadata object.
- Return type:
MetaData
- name: str
- property path: Path
The file path to the dataset.
- Returns:
The path to the dataset file.
- Return type:
Path
- Raises:
ValueError – If the path has not been set or does not exist.
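
The contract in the member reference (abstract hooks, a generic data type, and a path property that raises ValueError) can be sketched roughly as follows. This is an illustrative stand-in, not the zen_creator implementation: SketchDataset, LifetimeDataset, and the private attributes _path and _data are hypothetical, the metadata hook and the singleton behaviour are left out for brevity, and the order in which the hooks run is an assumption.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Generic, TypeVar

T = TypeVar("T")


class SketchDataset(ABC, Generic[T]):
    """Hypothetical stand-in for the base class, not the real implementation."""

    name: str

    def __init__(self, source_path: str | Path | None = None) -> None:
        self.source_path = Path(source_path) if source_path is not None else None
        # Hooks run once, during construction (assumed order: path, then data).
        self._path = self._set_path()
        self._data = self._set_data()

    @abstractmethod
    def _set_path(self) -> Path | None: ...

    @abstractmethod
    def _set_data(self) -> T: ...

    @property
    def path(self) -> Path:
        # Mirrors the documented contract: unset or missing path -> ValueError.
        if self._path is None or not self._path.exists():
            raise ValueError(
                f"Path for dataset '{self.name}' is not set or does not exist"
            )
        return self._path

    @property
    def data(self) -> T:
        return self._data


class LifetimeDataset(SketchDataset[dict[str, int]]):
    """Concrete subclass: T is bound to dict[str, int] for this dataset."""

    name = "lifetimes"

    def _set_path(self) -> Path | None:
        return self.source_path

    def _set_data(self) -> dict[str, int]:
        return {"template_conversion_technology": 25}  # dummy value


ds = LifetimeDataset()  # no source_path given, so the path stays unset
print(ds.data)
try:
    ds.path
except ValueError as err:
    print(f"ValueError: {err}")
```

Because SketchDataset declares abstract methods, instantiating it directly raises TypeError; only subclasses that implement every hook can be constructed.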