Example 3: Hierarchical Seasonal Selection¶
This notebook demonstrates hierarchical candidate generation — features are computed at daily resolution, but the selection operates at the monthly level with seasonal constraints.
Why hierarchical? Evaluating months by their daily composition gives a finer-grained quality signal than month-level statistics alone, and enforcing "one month per season" guarantees seasonal coverage, which pure optimization might otherwise sacrifice for aggregate fidelity.
Key concepts:
- GroupQuotaHierarchicalCombiGen: constrained candidate generation with seasonal quotas
- Pareto front visualization: understanding trade-offs between objectives
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio; pio.renderers.default = 'notebook_connected'
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)
Problem context with daily slicing¶
We slice at the day level (365 candidate periods). Features are computed per day, which gives the objective functions much more granular data to work with compared to month-level features.
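To see what day-level slicing produces, here is a rough sketch with plain pandas (an illustration of the concept, not the internals of `rep.TimeSlicer`): a year of hourly data collapses to 365 daily candidate periods.

```python
import pandas as pd

# Conceptual stand-in for TimeSlicer(unit="day"): map each hourly
# timestamp to its daily period and collect the unique periods.
idx = pd.date_range("2015-01-01", "2015-12-31 23:00", freq="h")
daily_slices = idx.to_period("D").unique()
print(len(daily_slices))  # → 365
```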
child_slicer = rep.TimeSlicer(unit="day")
context = rep.ProblemContext(df_raw=df_raw, slicer=child_slicer)
print(f"{len(context.get_unique_slices())} daily slices")
365 daily slices
feature_engineer = rep.StandardStatsFeatureEngineer()
context = feature_engineer.run(context)
print(f"Features computed for {len(context.df_features)} daily periods")
Features computed for 365 daily periods
Objectives: Wasserstein + Correlation¶
Two complementary fidelity metrics:
- Wasserstein: are the value distributions of each variable preserved?
- Correlation: are the dependencies between variables preserved?
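To make the two metrics concrete, here is a minimal sketch on toy data (assuming scipy is available; the exact formulas inside `WassersteinFidelity` and `CorrelationFidelity` may differ from this illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
full = pd.DataFrame({"wind": rng.normal(size=1000),
                     "solar": rng.normal(size=1000)})
subset = full.iloc[:300]  # stand-in for a candidate selection

# Wasserstein: per-variable distance between value distributions,
# averaged over variables (lower = distributions better preserved)
w = np.mean([wasserstein_distance(full[c], subset[c]) for c in full.columns])

# Correlation: mean absolute deviation between correlation matrices
# (lower = cross-variable dependencies better preserved)
c = np.abs(full.corr() - subset.corr()).to_numpy().mean()
print(round(w, 3), round(c, 3))
```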
The ParetoMaxMinStrategy picks the combination that is Pareto-optimal and maximizes the worst-performing objective — a robust, balanced choice.
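The selection logic can be sketched on toy scores (an assumed reading of `ParetoMaxMinStrategy`, not its actual code; since both objectives here are distances, lower is better, so "maximize the worst objective" becomes "minimize the largest score"):

```python
# Three hypothetical candidate combinations with two objective scores each
scores = {
    "A": {"wasserstein": 0.10, "correlation": 0.30},
    "B": {"wasserstein": 0.20, "correlation": 0.15},
    "C": {"wasserstein": 0.25, "correlation": 0.35},  # dominated by B
}

def dominated(a, b):
    """True if b is at least as good as a everywhere and better somewhere."""
    return all(b[k] <= a[k] for k in a) and any(b[k] < a[k] for k in a)

# Pareto front: candidates not dominated by any other candidate
pareto = {n: s for n, s in scores.items()
          if not any(dominated(s, t) for m, t in scores.items() if m != n)}

# Max-min (here: min-max over distances): pick the Pareto member whose
# worst objective is least bad
best = min(pareto, key=lambda n: max(pareto[n].values()))
print(best)  # → 'B': its worst score (0.20) beats A's worst (0.30)
```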
objective_set = rep.ObjectiveSet({
'wasserstein': (0.5, rep.WassersteinFidelity()),
'correlation': (0.5, rep.CorrelationFidelity()),
})
policy = rep.ParetoMaxMinStrategy()
Hierarchical combination generator¶
This is where the magic happens. GroupQuotaHierarchicalCombiGen does two things:
- Seasonal quotas: enforces exactly 1 month per season (winter, spring, summer, fall) — so the 4 selected months are structurally diverse
- Hierarchical evaluation: each candidate "month" is expanded to its constituent days for scoring
With 3 months per season and 1 pick each, we get $3^4 = 81$ candidate combinations — far fewer than the unconstrained $\binom{12}{4} = 495$.
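The count is easy to verify directly (month names here are only labels for illustration; season boundaries are whatever the generator's season mapping defines):

```python
from itertools import product
from math import comb

# One pick from each season's 3 candidate months
seasons = {
    "winter": ["Dec", "Jan", "Feb"],
    "spring": ["Mar", "Apr", "May"],
    "summer": ["Jun", "Jul", "Aug"],
    "fall":   ["Sep", "Oct", "Nov"],
}
quota_combis = list(product(*seasons.values()))
print(len(quota_combis), comb(12, 4))  # → 81 495
```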
combi_gen = rep.GroupQuotaHierarchicalCombiGen.from_slicers_with_seasons(
parent_k=4,
dt_index=df_raw.index,
child_slicer=child_slicer,
group_quota={'winter': 1, 'spring': 1, 'summer': 1, 'fall': 1}
)
days = context.get_unique_slices()
print(f"{combi_gen.count(days)} candidate combinations")
print("Each = 4 months (1 per season), evaluated on ~120 days total")
81 candidate combinations
Each = 4 months (1 per season), evaluated on ~120 days total
Run the workflow¶
search_algorithm = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(objective_set, policy, combi_gen)
representation_model = rep.KMedoidsClustersizeRepresentation()
workflow = rep.Workflow(feature_engineer, search_algorithm, representation_model)
experiment = rep.RepSetExperiment(context, workflow)
result = experiment.run()
Iterating over combinations: 100%|██████████| 81/81 [00:00<00:00, 204.13it/s]
# Identify which months were selected
selected_months = sorted({day.asfreq('M') for day in result.selection})
print(f"Selected months: {selected_months}")
print(f"Total days in selection: {len(result.selection)}")
print(f"Scores: {result.scores}")
Selected months: [Period('2015-01', 'M'), Period('2015-03', 'M'), Period('2015-08', 'M'), Period('2015-09', 'M')]
Total days in selection: 123
Scores: {'wasserstein': 0.14336076099229295, 'correlation': 0.04274265308227279}
Pareto front analysis¶
The scatter plot shows all 81 evaluated combinations in objective space. The Pareto front (highlighted) contains the non-dominated solutions — no other combination is better on both objectives simultaneously. The selected combination is marked.
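The highlighted front can be identified with a simple dominance filter. A sketch on random stand-in scores (both objectives are distances, so lower is better on each axis; this is the generic definition, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(42)
pts = rng.random((81, 2))  # stand-ins for the 81 (wasserstein, correlation) scores

def pareto_mask(points):
    """Boolean mask of non-dominated points (minimization on all axes)."""
    mask = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        others = np.delete(points, i, axis=0)
        # p is dominated if some other point is <= everywhere and < somewhere
        mask[i] = not np.any(np.all(others <= p, axis=1)
                             & np.any(others < p, axis=1))
    return mask

front = pts[pareto_mask(pts)]
print(len(front), "non-dominated of", len(pts))
```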
fig = diag.ParetoScatter2D(
objective_x='wasserstein', objective_y='correlation'
).plot(search_algorithm=search_algorithm, selected_combination=result.selection)
fig.update_layout(title='Pareto Front: Wasserstein vs Correlation')
fig.show()
fig = diag.ParetoParallelCoordinates().plot(search_algorithm=search_algorithm)
fig.update_layout(title='Pareto Front: Parallel Coordinates')
fig.show()
Score contributions and weights¶
fig = diag.ScoreContributionBars().plot(result.scores, normalize=True)
fig.update_layout(title='Score Component Contributions (Normalized)')
fig.show()
fig = diag.ResponsibilityBars().plot(result.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights')
fig.show()
Distribution fidelity per variable¶
ECDF overlays for each variable show how well the selection reproduces the full-year distributions.
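A minimal sketch of the quantity being overlaid (the assumed idea behind `DistributionOverlayECDF`, on synthetic data): the empirical CDF is the fraction of observations at or below each value, and a good selection keeps its curve close to the full-year curve.

```python
import numpy as np

def ecdf(values):
    """Sorted values and the fraction of observations <= each value."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

full = np.random.default_rng(1).normal(size=365)   # toy full-year daily values
sel = full[:123]                                   # stand-in for the 123 selected days

x_full, y_full = ecdf(full)
x_sel, y_sel = ecdf(sel)

# Largest vertical gap between the two curves (smaller = better fidelity)
gap = np.abs(np.interp(x_full, x_sel, y_sel) - y_full).max()
print(round(gap, 2))
```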
selected_indices = child_slicer.get_indices_for_slice_combi(df_raw.index, result.selection)
df_selection = df_raw.loc[selected_indices]
for var in df_raw.columns:
fig = diag.DistributionOverlayECDF().plot(df_raw[var], df_selection[var])
fig.update_layout(title=f'ECDF Overlay: {var}')
fig.show()
Feature space with selection¶
cols = list(context.df_features.columns[:2])
fig = diag.FeatureSpaceScatter2D().plot(
context.df_features, x=cols[0], y=cols[1], selection=result.selection
)
fig.update_layout(title='Feature Space with Selection')
fig.show()