Example 1: Getting Started¶
A minimal end-to-end workflow that selects 4 representative months from a year of hourly time-series data.
This example walks through the five pillars of the energy-repset framework:
| Pillar | Component | Choice in this example |
|---|---|---|
| F — Feature Space | How periods are compared | Statistical summaries (mean, std, min, max, quantiles, ramps) |
| O — Objective | What "representative" means | Wasserstein distance (marginal distribution fidelity) |
| S — Selection Space | What we pick from | All 4-of-12 monthly combinations (495 candidates) |
| R — Representation | How selected periods stand in for the year | Uniform weights (each month = 1/4 of the year) |
| A — Search Algorithm | How we find the best selection | Exhaustive generate-and-test with weighted-sum policy |
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio

pio.renderers.default = 'notebook_connected'
Load data¶
One year of hourly time series with four variables: electricity demand (load), onshore wind (onwind), offshore wind (offwind), and solar capacity factors (solar).
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)
df_raw
| variable | load | onwind | offwind | solar |
|---|---|---|---|---|
| 2015-01-01 00:00:00 | 41.151 | 0.1566 | 0.7030 | 0.0 |
| 2015-01-01 01:00:00 | 40.135 | 0.1659 | 0.6875 | 0.0 |
| 2015-01-01 02:00:00 | 39.106 | 0.1746 | 0.6535 | 0.0 |
| 2015-01-01 03:00:00 | 38.765 | 0.1745 | 0.6803 | 0.0 |
| 2015-01-01 04:00:00 | 38.941 | 0.1826 | 0.7272 | 0.0 |
| ... | ... | ... | ... | ... |
| 2015-12-31 19:00:00 | 47.719 | 0.1388 | 0.4434 | 0.0 |
| 2015-12-31 20:00:00 | 45.911 | 0.1211 | 0.4023 | 0.0 |
| 2015-12-31 21:00:00 | 45.611 | 0.1082 | 0.4171 | 0.0 |
| 2015-12-31 22:00:00 | 43.762 | 0.1026 | 0.4716 | 0.0 |
| 2015-12-31 23:00:00 | 41.905 | 0.0975 | 0.5239 | 0.0 |
8760 rows × 4 columns
Define the problem context¶
The TimeSlicer divides the year into candidate periods — here, 12 calendar months. The ProblemContext bundles the raw data and slicing logic into a single object that flows through the entire pipeline.
slicer = rep.TimeSlicer(unit="month")
context = rep.ProblemContext(df_raw=df_raw, slicer=slicer)
print(f"Candidate slices: {context.get_unique_slices()}")
Candidate slices: [Period('2015-01', 'M'), Period('2015-02', 'M'), Period('2015-03', 'M'), Period('2015-04', 'M'), Period('2015-05', 'M'), Period('2015-06', 'M'), Period('2015-07', 'M'), Period('2015-08', 'M'), Period('2015-09', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'), Period('2015-12', 'M')]
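Conceptually, month-based slicing amounts to assigning each hourly timestamp to its calendar period. A minimal plain-pandas sketch of the same idea (with a synthetic stand-in for the real data; `TimeSlicer`'s internals may differ):

```python
import pandas as pd

# Synthetic stand-in for the hourly 2015 data: one year of hourly timestamps.
idx = pd.date_range("2015-01-01", "2015-12-31 23:00", freq="h")
df = pd.DataFrame({"load": range(len(idx))}, index=idx)

# Each hour maps to its calendar month; the unique periods are the
# candidate slices that a month-based slicer would enumerate.
slices = df.index.to_period("M").unique()
print(len(slices))  # 12
```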
Pillar F: Feature engineering¶
Before we can compare months, we need a numerical representation. StandardStatsFeatureEngineer computes a set of statistical summaries (mean, std, min, max, quantiles, ramp rates) per variable per month. This transforms each month into a fixed-length feature vector.
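The exact statistics `StandardStatsFeatureEngineer` computes are internal to the library, but the idea can be sketched with plain pandas: group the hours by month and reduce each variable to a fixed set of summaries. Feature names and the choice of statistics below are illustrative assumptions, and synthetic data stands in for the real time series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame({"load": rng.normal(45, 5, 8760),
                   "solar": rng.uniform(0, 1, 8760)}, index=idx)

def month_features(sub: pd.DataFrame) -> pd.Series:
    """Summarize one month of hourly data as a flat feature vector."""
    feats = {}
    for col in sub.columns:
        s = sub[col]
        feats[f"{col}_mean"] = s.mean()
        feats[f"{col}_std"] = s.std()
        feats[f"{col}_min"] = s.min()
        feats[f"{col}_max"] = s.max()
        feats[f"{col}_q25"] = s.quantile(0.25)
        feats[f"{col}_q75"] = s.quantile(0.75)
        feats[f"{col}_ramp_mean"] = s.diff().abs().mean()  # mean hourly ramp
    return pd.Series(feats)

# One row per month, one column per (variable, statistic) pair.
features = df.groupby(df.index.to_period("M")).apply(month_features)
print(features.shape)  # (12, 14): 12 months x (2 variables x 7 stats)
```

Once every month is a fixed-length vector like this, comparing months (or a subset of months against the whole year) becomes a purely numerical problem.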
feature_engineer = rep.StandardStatsFeatureEngineer()
Pillar O: Objective¶
We use a single score component: Wasserstein fidelity. It measures how well the marginal distributions of the selected months match those of the full year; a lower distance means a better match.
With only one objective, the selection policy is straightforward — just pick the combination with the best score.
objective_set = rep.ObjectiveSet({
'wasserstein': (1.0, rep.WassersteinFidelity()),
})
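The metric underlying `WassersteinFidelity` can be sketched with SciPy's one-dimensional Wasserstein (earth mover's) distance, here comparing a candidate block of hours against the full year for a single synthetic variable. How the library aggregates this across the four variables is an assumption left out of the sketch:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
year = rng.normal(45, 5, 8760)   # full-year hourly values for one variable
subset = year[:2920]             # a candidate ~4-month block of hours

# Distance between the subset's marginal distribution and the full year's;
# 0 would mean the two empirical distributions coincide exactly.
score = wasserstein_distance(subset, year)
print(score)
```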
Pillars S + A: Selection space and search¶
ExhaustiveCombiGen enumerates all $\binom{12}{4} = 495$ ways to pick 4 months from 12. Each candidate is scored by the objective, and WeightedSumPolicy (trivial here with one component) picks the winner.
k = 4
combi_gen = rep.ExhaustiveCombiGen(k=k)
policy = rep.WeightedSumPolicy()
search_algorithm = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
objective_set, policy, combi_gen
)
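The size of the selection space is easy to verify with the standard library; enumerating 4-of-12 subsets is exactly what `itertools.combinations` does (month labels below are illustrative strings, not the library's `Period` objects):

```python
from itertools import combinations
from math import comb

months = [f"2015-{m:02d}" for m in range(1, 13)]

# Exhaustive generate-and-test: every 4-of-12 subset is a candidate.
candidates = list(combinations(months, 4))
print(len(candidates), comb(12, 4))  # 495 495
```

At this scale exhaustive search is cheap; for finer slicing units (weeks, days) the combinatorics explode and a non-exhaustive search algorithm becomes necessary.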
Pillar R: Representation model¶
With uniform weights, each selected month represents exactly 1/4 of the year. This is the simplest representation model: there is no cluster assignment and no weight optimization, so the full burden falls on the selected months themselves being intrinsically representative.
representation_model = rep.UniformRepresentationModel()
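Uniform representation reduces to a constant weight of 1/k per selected period. A minimal sketch (month labels are illustrative; the library presumably builds an analogous mapping from `Period` objects):

```python
selected = ["2015-01", "2015-02", "2015-05", "2015-06"]

# Each selected period stands in for an equal share of the year.
weights = {m: 1.0 / len(selected) for m in selected}
print(weights)  # each month gets 0.25
```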
Run the workflow¶
workflow = rep.Workflow(feature_engineer, search_algorithm, representation_model)
experiment = rep.RepSetExperiment(context, workflow)
result = experiment.run()
Iterating over combinations: 100%|██████████| 495/495 [00:02<00:00, 215.34it/s]
Inspect results¶
print(f"Selected months: {result.selection}")
print(f"Weights: {result.weights}")
print(f"Wasserstein score: {result.scores['wasserstein']:.4f}")
Selected months: (Period('2015-01', 'M'), Period('2015-02', 'M'), Period('2015-05', 'M'), Period('2015-06', 'M'))
Weights: {Period('2015-01', 'M'): 0.25, Period('2015-02', 'M'): 0.25, Period('2015-05', 'M'): 0.25, Period('2015-06', 'M'): 0.25}
Wasserstein score: 0.0684
Diagnostic: responsibility weights¶
The bar chart below shows the weight assigned to each selected month. The dashed line marks the uniform reference of 1/k = 0.25; with uniform representation, every bar coincides with it.
fig = diag.ResponsibilityBars().plot(result.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights (Uniform)')
fig.show()