# Example 2: Feature Space Exploration
This notebook explores how feature engineering and objective choice shape the selection process. We:
- Start with statistical features and examine why dimensionality reduction is needed
- Use PCA to compress the feature space, guided by variance analysis
- Run two experiments with different objectives to see how they influence the result
- Compare the two selections side-by-side in feature space
Key concepts introduced:
- Manual feature engineering chain (stats then PCA)
- PCA variance analysis to choose the number of components
- Multi-objective selection with CentroidBalance vs DiversityReward
- SelectionComparisonScatterMatrix for comparing multiple selections
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio; pio.renderers.default = 'notebook_connected'
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)
slicer = rep.TimeSlicer(unit="month")
context = rep.ProblemContext(df_raw=df_raw, slicer=slicer)
print(f"{len(context.get_unique_slices())} candidate monthly slices")
12 candidate monthly slices
## Statistical features: a first look
We begin with StandardStatsFeatureEngineer, which computes summary statistics (mean, std, quantiles, ramp rates, etc.) for each variable and slice. This gives us a multi-dimensional profile for every candidate month.
stats_eng = rep.StandardStatsFeatureEngineer()
context_stats = stats_eng.run(context)
print(f"{context_stats.df_features.shape[1]} statistical features for {context_stats.df_features.shape[0]} slices")
context_stats.df_features.head(3)
38 statistical features for 12 slices
| | mean__load | mean__onwind | mean__offwind | mean__solar | std__load | std__onwind | std__offwind | std__solar | q10__load | q10__onwind | ... | ramp_std__load | ramp_std__onwind | ramp_std__offwind | ramp_std__solar | corr__load__onwind | corr__load__offwind | corr__load__solar | corr__onwind__offwind | corr__onwind__solar | corr__offwind__solar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01 | 0.790427 | 1.623676 | 1.227180 | -1.400283 | 0.552545 | 1.875624 | 1.933594 | -1.648768 | 0.623443 | 0.404962 | ... | 0.484976 | -0.042419 | 0.637891 | -1.853069 | -0.161411 | -1.234681 | -0.919195 | 0.790412 | 0.522001 | 0.458110 |
| 2015-02 | 1.454759 | -0.178153 | 0.032890 | -0.779126 | -1.266598 | -0.386469 | 0.387692 | -0.710218 | 1.967376 | 0.269301 | ... | -0.110690 | -0.511887 | 0.024703 | -0.629509 | -0.948116 | -1.024592 | -0.810737 | -1.246019 | 0.262220 | 1.687519 |
| 2015-03 | 0.845602 | 0.381202 | 0.172626 | -0.018479 | -1.705158 | 0.914114 | 0.375525 | 0.222239 | 1.098170 | -0.158519 | ... | -0.465317 | 0.921492 | 0.768165 | 0.394653 | 0.160418 | -0.492397 | 0.132615 | 0.368639 | -0.497747 | -1.376189 |
3 rows × 38 columns
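The exact feature set is defined by the library, but the core recipe can be sketched with plain pandas on synthetic data: group the hourly frame by month and compute a few summary statistics per variable. The statistic choices and column names below are illustrative, not StandardStatsFeatureEngineer's exact output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", "2015-12-31 23:00", freq="h")
df = pd.DataFrame({"load": rng.normal(size=len(idx)),
                   "solar": rng.normal(size=len(idx))}, index=idx)

def slice_stats(df):
    """Per-month summary statistics: one row per monthly slice."""
    rows = {}
    for period, g in df.groupby(df.index.to_period("M")):
        feats = {}
        for var in g.columns:
            s = g[var]
            feats[f"mean__{var}"] = s.mean()
            feats[f"std__{var}"] = s.std()
            feats[f"q10__{var}"] = s.quantile(0.10)
            feats[f"ramp_std__{var}"] = s.diff().std()  # hour-to-hour ramps
        rows[period] = feats
    return pd.DataFrame.from_dict(rows, orient="index")

feats = slice_stats(df)
print(feats.shape)  # 12 slices x (4 stats x 2 variables)
```

With four statistics and two variables this yields 8 features per slice; the library's 38 features follow the same pattern with more statistics and cross-variable correlations.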
## How do the months compare on average values?
A scatter matrix of just the mean features shows how months relate in terms of their average load, onshore wind, offshore wind, and solar generation. This is a useful starting point, but we want fidelity across all statistical dimensions — not just means.
mean_cols = [c for c in context_stats.df_features.columns if c.startswith('mean__')]
fig = diag.FeatureSpaceScatterMatrix().plot(context_stats.df_features, dimensions=mean_cols)
fig.update_layout(title='Scatter Matrix: Monthly Means')
fig.show()
## The curse of dimensionality
With 12 data points and dozens of features, many of which are highly correlated, distance-based comparisons become unreliable. The feature correlation heatmap reveals substantial redundancy:
fig = diag.FeatureCorrelationHeatmap().plot(context_stats.df_features, method='pearson')
fig.update_layout(title='Feature Correlation Matrix (Statistical Features)')
fig.show()
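The redundancy the heatmap shows can also be quantified directly: count feature pairs whose absolute Pearson correlation exceeds a threshold. A small sketch on synthetic features (names, noise level, and the 0.9 threshold are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=12)
# Three nearly collinear features plus one independent one.
feats = pd.DataFrame({
    "mean__load": base,
    "q10__load": base + 0.05 * rng.normal(size=12),
    "q90__load": base - 0.05 * rng.normal(size=12),
    "mean__solar": rng.normal(size=12),
})

corr = feats.corr(method="pearson").abs()
# Upper triangle only, so each pair is counted once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
redundant = (corr.values[mask] > 0.9).sum()
print(f"{redundant} of {mask.sum()} feature pairs have |r| > 0.9")
```

With 12 samples and dozens of such near-duplicate dimensions, Euclidean distances between months are dominated by a few repeated signals, which is exactly why the next step compresses the space.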
## Dimensionality reduction with PCA
PCA projects the correlated features onto orthogonal axes ordered by variance explained. This addresses the curse of dimensionality while retaining the essential structure.
pca_full = rep.PCAFeatureEngineer()
context_full_pca = pca_full.run(context_stats)
fig = diag.PCAVarianceExplained(pca_full).plot(show_cumulative=True)
fig.update_layout(title='PCA Variance Explained')
fig.show()
The cumulative curve shows a clear bend around 4 components — beyond that, each additional PC adds very little. Four PCs capture the essential structure while keeping the feature space compact.
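The same variance analysis can be reproduced without the library: PCA is an SVD of the centered data, and the smallest number of components whose cumulative explained-variance ratio crosses a chosen threshold (90% here, an illustrative choice) gives a principled cut-off:

```python
import numpy as np

rng = np.random.default_rng(2)
# 12 samples in a 10-D space with ~3 dominant directions plus small noise.
latent = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 10))
X = latent + 0.05 * rng.normal(size=(12, 10))

# PCA via SVD of the centered data: squared singular values are
# proportional to the variance captured by each principal component.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var_ratio = s**2 / np.sum(s**2)
cumvar = np.cumsum(var_ratio)
n_keep = int(np.searchsorted(cumvar, 0.90) + 1)
print(f"{n_keep} components reach {cumvar[n_keep - 1]:.1%} cumulative variance")
```

In the notebook the bend in the curve, rather than a fixed threshold, motivates keeping 4 components; both are standard heuristics for the same trade-off.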
fig = diag.FeatureSpaceScatterMatrix().plot(
context_full_pca.df_features, dimensions=['pc_0', 'pc_1', 'pc_2', 'pc_3']
)
fig.update_layout(title='Scatter Matrix: First 4 Principal Components')
fig.show()
## Narrowing the feature space
Based on the variance analysis, we fix the feature space at 4 PCs and create a dedicated feature context used by both experiments below.
pca_4 = rep.PCAFeatureEngineer(n_components=4)
context_4pc = pca_4.run(context_stats)
context_4pc.df_features
| | pc_0 | pc_1 | pc_2 | pc_3 |
|---|---|---|---|---|
| 2015-01 | 6.184021 | 0.964440 | -0.653615 | 0.223279 |
| 2015-02 | 1.934236 | -3.160402 | -1.649667 | 2.709401 |
| 2015-03 | 1.081686 | 0.569398 | -3.265472 | 0.251567 |
| 2015-04 | -4.156230 | 1.690578 | -0.116559 | -0.684499 |
| 2015-05 | -3.367315 | 3.007706 | 1.534711 | 0.538564 |
| 2015-06 | -4.609243 | 0.098181 | 0.590968 | -0.152467 |
| 2015-07 | -2.183370 | 2.267731 | -1.127886 | 0.872675 |
| 2015-08 | -5.217145 | -1.117851 | -0.008774 | -0.607658 |
| 2015-09 | -0.651842 | -0.811346 | 0.338338 | -0.044285 |
| 2015-10 | -1.720184 | -4.710915 | 1.411115 | -1.246839 |
| 2015-11 | 6.424006 | 0.514382 | -1.390549 | -2.695662 |
| 2015-12 | 6.281379 | 0.688098 | 4.337392 | 0.835923 |
## Experiment A: Balanced selection
Our first objective set combines:
- Wasserstein fidelity: marginal distribution similarity
- Correlation fidelity: preservation of cross-variable dependencies
- Centroid balance: penalises selections whose feature centroid deviates from the data centroid
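The centroid-balance idea is easy to state outside the library: compare the mean feature vector of the selected slices with the mean over all slices and penalise the distance. A minimal sketch assuming Euclidean distance (the library's exact scoring may differ):

```python
import numpy as np

rng = np.random.default_rng(3)
features = rng.normal(size=(12, 4))  # 12 monthly slices, 4 PCs

def centroid_distance(features, selected_rows):
    """Euclidean distance between the selection centroid and the data centroid."""
    sel_centroid = features[list(selected_rows)].mean(axis=0)
    full_centroid = features.mean(axis=0)
    return float(np.linalg.norm(sel_centroid - full_centroid))

# Selecting every slice reproduces the data centroid exactly.
print(centroid_distance(features, range(12)))  # 0.0
print(centroid_distance(features, [0, 3, 8]))
```

A selection of extreme months on the same side of the cloud scores poorly here, which is what makes this objective pull towards central, "typical" months.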
The ParetoMaxMinStrategy picks the Pareto-optimal combination that maximises the worst objective — a balanced, conservative choice.
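The max-min rule itself can be sketched independently of the library: enumerate all k-subsets, score each on every objective, and keep the subset whose worst objective score is highest. The two toy objectives below (matching the full data's mean and std of a single feature) are illustrative; real objective sets normalise scores to comparable scales before taking the minimum:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
values = rng.normal(size=12)  # one scalar feature per candidate slice

def score_mean(sel):
    """Higher is better: negative deviation of the subset mean from the full mean."""
    return -abs(values[list(sel)].mean() - values.mean())

def score_std(sel):
    """Higher is better: negative deviation of the subset std from the full std."""
    return -abs(values[list(sel)].std() - values.std())

subsets = list(itertools.combinations(range(12), 3))  # C(12, 3) = 220 candidates
# Max-min: keep the subset whose worst objective is best.
best = max(subsets, key=lambda s: min(score_mean(s), score_std(s)))
print(best)
```

Note that C(12, 3) = 220 is exactly the combination count reported by the progress bars in this notebook.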
obj_balanced = rep.ObjectiveSet({
'wasserstein': (0.5, rep.WassersteinFidelity()),
'correlation': (0.5, rep.CorrelationFidelity()),
'centroid_balance': (0.5, rep.CentroidBalance()),
})
k = 3
search_a = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
obj_balanced,
rep.ParetoMaxMinStrategy(),
rep.ExhaustiveCombiGen(k=k),
)
repr_model = rep.KMedoidsClustersizeRepresentation()
workflow_a = rep.Workflow(pca_4, search_a, repr_model)
exp_a = rep.RepSetExperiment(context_4pc, workflow_a)
result_a = exp_a.run()
print(f"Selected months (A): {result_a.selection}")
print(f"Weights (A): {result_a.weights}")
print(f"Scores (A): {result_a.scores}")
Iterating over combinations: 100%|██████████| 220/220 [00:01<00:00, 201.66it/s]
Selected months (A): (Period('2015-01', 'M'), Period('2015-04', 'M'), Period('2015-09', 'M'))
Weights (A): {Period('2015-01', 'M'): 0.25, Period('2015-04', 'M'): 0.4166666666666667, Period('2015-09', 'M'): 0.3333333333333333}
Scores (A): {'wasserstein': 0.20025433806827211, 'correlation': 0.0231217738568774, 'centroid_balance': 0.7982187924943351}
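The weights are all multiples of 1/12, which is consistent with a cluster-size rule: assign every candidate slice to its nearest selected slice in feature space and weight each selected slice by the share of slices it represents. The sketch below is an assumption about how KMedoidsClustersizeRepresentation behaves, not a confirmed reading of its implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
features = rng.normal(size=(12, 4))   # 12 slices, 4 PCs
selected = [0, 3, 8]                  # indices of the chosen slices (illustrative)

# Assign each slice to its nearest selected slice (Euclidean distance).
dists = np.linalg.norm(features[:, None, :] - features[selected][None, :, :], axis=2)
assignment = dists.argmin(axis=1)

# Weight = share of slices represented by each selected slice.
counts = np.bincount(assignment, minlength=len(selected))
weights = counts / counts.sum()
print(dict(zip(selected, weights)))
```

Under this rule the weights always sum to 1 and each selected slice receives at least 1/12, since it is nearest to itself.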
## Experiment B: Diversity-focused selection
We replace CentroidBalance with DiversityReward, which favours selections that are maximally spread out in feature space (large pairwise distances). This can pull the selection towards extreme months rather than central ones.
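A common way to formalise such a reward is the total pairwise distance within the selection. Whether DiversityReward uses the sum, the minimum, or another aggregate is not shown here, so the sketch below assumes the sum:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
features = rng.normal(size=(12, 4))  # 12 slices, 4 PCs

def diversity(features, sel):
    """Sum of pairwise Euclidean distances within the selection."""
    return sum(
        float(np.linalg.norm(features[i] - features[j]))
        for i, j in itertools.combinations(sel, 2)
    )

central = diversity(features, [0, 1, 2])
# The most spread-out triple maximises this reward by construction.
best = max(itertools.combinations(range(12), 3),
           key=lambda s: diversity(features, s))
print(best, diversity(features, best))
```

Maximising pairwise spread favours points on the hull of the feature cloud, which is why this objective tends to pick extreme rather than central months.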
obj_diverse = rep.ObjectiveSet({
'wasserstein': (0.5, rep.WassersteinFidelity()),
'correlation': (0.5, rep.CorrelationFidelity()),
'diversity': (0.5, rep.DiversityReward()),
})
search_b = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
obj_diverse,
rep.ParetoMaxMinStrategy(),
rep.ExhaustiveCombiGen(k=k),
)
workflow_b = rep.Workflow(pca_4, search_b, rep.KMedoidsClustersizeRepresentation())
exp_b = rep.RepSetExperiment(context_4pc, workflow_b)
result_b = exp_b.run()
print(f"Selected months (B): {result_b.selection}")
print(f"Weights (B): {result_b.weights}")
print(f"Scores (B): {result_b.scores}")
Iterating over combinations: 100%|██████████| 220/220 [00:01<00:00, 216.62it/s]
Selected months (B): (Period('2015-07', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'))
Weights (B): {Period('2015-07', 'M'): 0.5833333333333334, Period('2015-10', 'M'): 0.16666666666666666, Period('2015-11', 'M'): 0.25}
Scores (B): {'wasserstein': 0.17951351867468845, 'correlation': 0.04652147452533339, 'diversity': 9.132970566687343}
## Comparing the two selections
How does the objective shape the result? Let's compare which months each experiment chose.
set_a = set(result_a.selection)
set_b = set(result_b.selection)
print(f"Experiment A (Balanced): {result_a.selection}")
print(f"Experiment B (Diverse): {result_b.selection}")
print(f"Overlap: {set_a & set_b or 'none'}")
print(f"Only in A: {set_a - set_b or 'none'}")
print(f"Only in B: {set_b - set_a or 'none'}")
Experiment A (Balanced): (Period('2015-01', 'M'), Period('2015-04', 'M'), Period('2015-09', 'M'))
Experiment B (Diverse): (Period('2015-07', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'))
Overlap: none
Only in A: {Period('2015-09', 'M'), Period('2015-01', 'M'), Period('2015-04', 'M')}
Only in B: {Period('2015-11', 'M'), Period('2015-07', 'M'), Period('2015-10', 'M')}
## Side-by-side in feature space
The SelectionComparisonScatterMatrix plots both selections on top of the full feature space. Distinct markers and colours make it easy to spot where the objectives push the selection.
fig = diag.SelectionComparisonScatterMatrix().plot(
context_4pc.df_features,
selections={
'A: Balanced': result_a.selection,
'B: Diverse': result_b.selection,
},
dimensions=['pc_0', 'pc_1', 'pc_2', 'pc_3'],
)
fig.update_layout(title='Selection Comparison in PCA Feature Space')
fig.show()
fig = diag.ResponsibilityBars().plot(result_a.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights — Experiment A (Balanced)')
fig.show()
fig = diag.ResponsibilityBars().plot(result_b.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights — Experiment B (Diverse)')
fig.show()
## Distribution fidelity (ECDF grid)
The ECDF grid shows — for every variable at once — how well the selection's marginal distribution tracks the full year. Gaps indicate value ranges that the selection under- or over-represents.
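The visual gap can be quantified as the largest vertical distance between the two ECDFs, i.e. the two-sample Kolmogorov-Smirnov statistic. A self-contained sketch on synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(7)
full = rng.normal(size=1000)   # stand-in for a full year of one variable
subset = full[:250]            # stand-in for the hours of the selected slices

def ecdf_gap(a, b):
    """Largest vertical gap between the ECDFs of a and b (KS statistic)."""
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(Fa - Fb)))

print(f"max ECDF gap: {ecdf_gap(full, subset):.3f}")
```

A gap near 0 means the selection's marginal distribution tracks the full year closely over the whole value range, not just in the mean.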
selected_idx_a = context.slicer.get_indices_for_slice_combi(context.df_raw.index, result_a.selection)
df_sel_a = context.df_raw.loc[selected_idx_a]
fig = diag.DistributionOverlayECDFGrid().plot(context.df_raw, df_sel_a)
fig.update_layout(title='Distribution Fidelity — Experiment A (Balanced)')
fig.update_xaxes(matches=None)
fig.update_yaxes(matches=None)
fig.show()
## Correlation preservation
The heatmap shows the difference between the correlation matrix of the selection and the full year. Values near zero (light) mean the selection preserves that cross-variable relationship well.
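The same difference matrix is straightforward to compute with pandas. The sketch below uses synthetic data with one injected cross-variable dependency; the month choice and column names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame(rng.normal(size=(8760, 3)), index=idx,
                  columns=["load", "onwind", "solar"])
df["solar"] += 0.5 * df["load"]            # inject a cross-variable dependency

sel = df[df.index.month.isin([1, 4, 9])]   # stand-in for the selected months
corr_diff = sel.corr(method="pearson") - df.corr(method="pearson")
print(corr_diff.round(3))
```

The diagonal is zero by construction (every correlation matrix has ones on the diagonal), so only the off-diagonal entries carry information.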
fig = diag.CorrelationDifferenceHeatmap().plot(
context.df_raw, df_sel_a, method='pearson', show_lower_only=True
)
fig.update_layout(title='Correlation Difference — Experiment A (Balanced)')
fig.show()
## Diurnal profiles
Average hourly shape (hour 0-23) of each variable, comparing the full year with the selection. Good matches indicate the selection captures typical within-day patterns.
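The overlay reduces to a groupby-hour mean for each frame. A sketch on a synthetic solar-like signal (the daily shape and month choice are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
# Synthetic solar-like signal: a clipped daily sine shape plus noise.
signal = np.clip(np.sin((idx.hour - 6) / 12 * np.pi), 0, None)
df = pd.DataFrame({"solar": signal + 0.1 * rng.normal(size=8760)}, index=idx)

sel = df[df.index.month.isin([1, 4, 9])]   # stand-in for the selected months
full_profile = df.groupby(df.index.hour)["solar"].mean()
sel_profile = sel.groupby(sel.index.hour)["solar"].mean()
print((full_profile - sel_profile).abs().max())
```

Because the synthetic daily shape is season-independent, the two profiles agree up to noise; with real solar data, a selection biased towards winter months would show a visibly flatter midday peak.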
fig = diag.DiurnalProfileOverlay().plot(
context.df_raw, df_sel_a, variables=['load', 'onwind', 'offwind', 'solar']
)
fig.update_layout(title='Diurnal Profiles — Experiment A (Balanced)')
fig.show()