# Example 2: Feature Space Exploration
This notebook explores how feature engineering and objective choice shape the selection process. We:
- Start with statistical features and examine why dimensionality reduction is needed
- Use PCA to compress the feature space, guided by variance analysis
- Run two experiments with different objectives to see how they influence the result
- Compare the two selections side-by-side in feature space
Key concepts introduced:
- Manual feature engineering chain (stats then PCA)
- PCA variance analysis to choose the number of components
- Multi-objective selection with CentroidBalance vs DiversityReward
- SelectionComparisonScatterMatrix for comparing multiple selections
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio; pio.renderers.default = 'notebook_connected'
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)
slicer = rep.TimeSlicer(unit="month")
context = rep.ProblemContext(df_raw=df_raw, slicer=slicer)
print(f"{len(context.get_unique_slices())} candidate monthly slices")
12 candidate monthly slices
## Statistical features: a first look
We begin with StandardStatsFeatureEngineer, which computes summary statistics (mean, std, quantiles, ramp rates, etc.) for each variable and slice. This gives us a multi-dimensional profile for every candidate month.
stats_eng = rep.StandardStatsFeatureEngineer()
context_stats = stats_eng.run(context)
print(f"{context_stats.df_features.shape[1]} statistical features for {context_stats.df_features.shape[0]} slices")
context_stats.df_features.head(3)
38 statistical features for 12 slices
| | mean__load | mean__onwind | mean__offwind | mean__solar | std__load | std__onwind | std__offwind | std__solar | q10__load | q10__onwind | ... | ramp_std__load | ramp_std__onwind | ramp_std__offwind | ramp_std__solar | corr__load__onwind | corr__load__offwind | corr__load__solar | corr__onwind__offwind | corr__onwind__solar | corr__offwind__solar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01 | 0.790427 | 1.623676 | 1.227180 | -1.400283 | 0.552545 | 1.875624 | 1.933594 | -1.648768 | 0.623443 | 0.404962 | ... | 0.484976 | -0.042419 | 0.637891 | -1.853069 | -0.161411 | -1.234681 | -0.919195 | 0.790412 | 0.522001 | 0.458110 |
| 2015-02 | 1.454759 | -0.178153 | 0.032890 | -0.779126 | -1.266598 | -0.386469 | 0.387692 | -0.710218 | 1.967376 | 0.269301 | ... | -0.110690 | -0.511887 | 0.024703 | -0.629509 | -0.948116 | -1.024592 | -0.810737 | -1.246019 | 0.262220 | 1.687519 |
| 2015-03 | 0.845602 | 0.381202 | 0.172626 | -0.018479 | -1.705158 | 0.914114 | 0.375525 | 0.222239 | 1.098170 | -0.158519 | ... | -0.465317 | 0.921492 | 0.768165 | 0.394653 | 0.160418 | -0.492397 | 0.132615 | 0.368639 | -0.497747 | -1.376189 |
3 rows × 38 columns
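The exact feature set is defined by the library, but the core recipe can be sketched with plain pandas on synthetic data: group the hourly frame by month and compute a few summary statistics per variable. The statistic choices and column names below are illustrative, not StandardStatsFeatureEngineer's exact output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", "2015-12-31 23:00", freq="h")
df = pd.DataFrame({"load": rng.normal(size=len(idx)),
                   "solar": rng.normal(size=len(idx))}, index=idx)

def slice_stats(df):
    """Per-month summary statistics: one row per monthly slice."""
    rows = {}
    for period, g in df.groupby(df.index.to_period("M")):
        feats = {}
        for var in g.columns:
            s = g[var]
            feats[f"mean__{var}"] = s.mean()
            feats[f"std__{var}"] = s.std()
            feats[f"q10__{var}"] = s.quantile(0.10)
            feats[f"ramp_std__{var}"] = s.diff().std()  # hour-to-hour ramps
        rows[period] = feats
    return pd.DataFrame.from_dict(rows, orient="index")

feats = slice_stats(df)
print(feats.shape)  # 12 slices x (4 stats x 2 variables)
```

With four statistics and two variables this yields 8 features per slice; the library's 38 features follow the same pattern with more statistics and cross-variable correlations.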
## How do the months compare on average values?
A scatter matrix of just the mean features shows how months relate in terms of their average load, onshore wind, offshore wind, and solar generation. This is a useful starting point, but we want fidelity across all statistical dimensions — not just means.
mean_cols = [c for c in context_stats.df_features.columns if c.startswith('mean__')]
fig = diag.FeatureSpaceScatterMatrix().plot(context_stats.df_features, dimensions=mean_cols)
fig.update_layout(title='Scatter Matrix: Monthly Means')
fig.show()
## The curse of dimensionality
With 12 data points and dozens of features, many of which are highly correlated, distance-based comparisons become unreliable. The feature correlation heatmap reveals substantial redundancy:
fig = diag.FeatureCorrelationHeatmap().plot(context_stats.df_features, method='pearson')
fig.update_layout(title='Feature Correlation Matrix (Statistical Features)')
fig.show()
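The redundancy the heatmap shows can also be quantified directly: count feature pairs whose absolute Pearson correlation exceeds a threshold. A small sketch on synthetic features (names, noise level, and the 0.9 threshold are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=12)
# Three nearly collinear features plus one independent one.
feats = pd.DataFrame({
    "mean__load": base,
    "q10__load": base + 0.05 * rng.normal(size=12),
    "q90__load": base - 0.05 * rng.normal(size=12),
    "mean__solar": rng.normal(size=12),
})

corr = feats.corr(method="pearson").abs()
# Upper triangle only, so each pair is counted once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
redundant = (corr.values[mask] > 0.9).sum()
print(f"{redundant} of {mask.sum()} feature pairs have |r| > 0.9")
```

With 12 samples and dozens of such near-duplicate dimensions, Euclidean distances between months are dominated by a few repeated signals, which is exactly why the next step compresses the space.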
## Dimensionality reduction with PCA
PCA projects the correlated features onto orthogonal axes ordered by variance explained. This addresses the curse of dimensionality while retaining the essential structure.
pca_full = rep.PCAFeatureEngineer()
context_full_pca = pca_full.run(context_stats)
fig = diag.PCAVarianceExplained(pca_full).plot(show_cumulative=True)
fig.update_layout(title='PCA Variance Explained')
fig.show()
The cumulative curve shows a clear bend around 4 components — beyond that, each additional PC adds very little. Four PCs capture the essential structure while keeping the feature space compact.
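The same variance analysis can be reproduced without the library: PCA is an SVD of the centered data, and the smallest number of components whose cumulative explained-variance ratio crosses a chosen threshold (90% here, an illustrative choice) gives a principled cut-off:

```python
import numpy as np

rng = np.random.default_rng(2)
# 12 samples in a 10-D space with ~3 dominant directions plus small noise.
latent = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 10))
X = latent + 0.05 * rng.normal(size=(12, 10))

# PCA via SVD of the centered data: squared singular values are
# proportional to the variance captured by each principal component.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var_ratio = s**2 / np.sum(s**2)
cumvar = np.cumsum(var_ratio)
n_keep = int(np.searchsorted(cumvar, 0.90) + 1)
print(f"{n_keep} components reach {cumvar[n_keep - 1]:.1%} cumulative variance")
```

In the notebook the bend in the curve, rather than a fixed threshold, motivates keeping 4 components; both are standard heuristics for the same trade-off.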
fig = diag.FeatureSpaceScatterMatrix().plot(
context_full_pca.df_features, dimensions=['pc_0', 'pc_1', 'pc_2', 'pc_3']
)
fig.update_layout(title='Scatter Matrix: First 4 Principal Components')
fig.show()
## Narrowing the feature space
Based on the variance analysis, we fix the feature space at 4 PCs and create a dedicated feature context used by both experiments below.
pca_4 = rep.PCAFeatureEngineer(n_components=4)
context_4pc = pca_4.run(context_stats)
context_4pc.df_features
| | pc_0 | pc_1 | pc_2 | pc_3 |
|---|---|---|---|---|
| 2015-01 | 6.184021 | 0.964440 | -0.653615 | 0.223279 |
| 2015-02 | 1.934236 | -3.160402 | -1.649667 | 2.709401 |
| 2015-03 | 1.081686 | 0.569398 | -3.265472 | 0.251567 |
| 2015-04 | -4.156230 | 1.690578 | -0.116559 | -0.684499 |
| 2015-05 | -3.367315 | 3.007706 | 1.534711 | 0.538564 |
| 2015-06 | -4.609243 | 0.098181 | 0.590968 | -0.152467 |
| 2015-07 | -2.183370 | 2.267731 | -1.127886 | 0.872675 |
| 2015-08 | -5.217145 | -1.117851 | -0.008774 | -0.607658 |
| 2015-09 | -0.651842 | -0.811346 | 0.338338 | -0.044285 |
| 2015-10 | -1.720184 | -4.710915 | 1.411115 | -1.246839 |
| 2015-11 | 6.424006 | 0.514382 | -1.390549 | -2.695662 |
| 2015-12 | 6.281379 | 0.688098 | 4.337392 | 0.835923 |
## Experiment A: Balanced selection
Our first objective set combines:
- Wasserstein fidelity: marginal distribution similarity
- Correlation fidelity: preservation of cross-variable dependencies
- Centroid balance: penalises selections whose feature centroid deviates from the data centroid
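The centroid-balance idea is easy to state outside the library: compare the mean feature vector of the selected slices with the mean over all slices and penalise the distance. A minimal sketch assuming Euclidean distance (the library's exact scoring may differ):

```python
import numpy as np

rng = np.random.default_rng(3)
features = rng.normal(size=(12, 4))  # 12 monthly slices, 4 PCs

def centroid_distance(features, selected_rows):
    """Euclidean distance between the selection centroid and the data centroid."""
    sel_centroid = features[list(selected_rows)].mean(axis=0)
    full_centroid = features.mean(axis=0)
    return float(np.linalg.norm(sel_centroid - full_centroid))

# Selecting every slice reproduces the data centroid exactly.
print(centroid_distance(features, range(12)))  # 0.0
print(centroid_distance(features, [0, 3, 8]))
```

A selection of extreme months on the same side of the cloud scores poorly here, which is what makes this objective pull towards central, "typical" months.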
The ParetoMaxMinStrategy picks the Pareto-optimal combination that maximises the worst objective — a balanced, conservative choice.
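The max-min rule itself can be sketched independently of the library: enumerate all k-subsets, score each on every objective, and keep the subset whose worst objective score is highest. The two toy objectives below (matching the full data's mean and std of a single feature) are illustrative; real objective sets normalise scores to comparable scales before taking the minimum:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
values = rng.normal(size=12)  # one scalar feature per candidate slice

def score_mean(sel):
    """Higher is better: negative deviation of the subset mean from the full mean."""
    return -abs(values[list(sel)].mean() - values.mean())

def score_std(sel):
    """Higher is better: negative deviation of the subset std from the full std."""
    return -abs(values[list(sel)].std() - values.std())

subsets = list(itertools.combinations(range(12), 3))  # C(12, 3) = 220 candidates
# Max-min: keep the subset whose worst objective is best.
best = max(subsets, key=lambda s: min(score_mean(s), score_std(s)))
print(best)
```

Note that C(12, 3) = 220 is exactly the combination count reported by the progress bars in this notebook.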
obj_balanced = rep.ObjectiveSet({
'wasserstein': (0.5, rep.WassersteinFidelity()),
'correlation': (0.5, rep.CorrelationFidelity()),
'centroid_balance': (0.5, rep.CentroidBalance()),
})
k = 3
search_a = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
obj_balanced,
rep.ParetoMaxMinStrategy(),
rep.ExhaustiveCombiGen(k=k),
)
repr_model = rep.KMedoidsClustersizeRepresentation()
workflow_a = rep.Workflow(pca_4, search_a, repr_model)
exp_a = rep.RepSetExperiment(context_4pc, workflow_a)
result_a = exp_a.run()
print(f"Selected months (A): {result_a.selection}")
print(f"Weights (A): {result_a.weights}")
print(f"Scores (A): {result_a.scores}")
Iterating over combinations: 100%|██████████| 220/220 [00:01<00:00, 201.66it/s]
Selected months (A): (Period('2015-01', 'M'), Period('2015-04', 'M'), Period('2015-09', 'M'))
Weights (A): {Period('2015-01', 'M'): 0.25, Period('2015-04', 'M'): 0.4166666666666667, Period('2015-09', 'M'): 0.3333333333333333}
Scores (A): {'wasserstein': 0.20025433806827211, 'correlation': 0.0231217738568774, 'centroid_balance': 0.7982187924943351}
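The weights are all multiples of 1/12, which is consistent with a cluster-size rule: assign every candidate slice to its nearest selected slice in feature space and weight each selected slice by the share of slices it represents. The sketch below is an assumption about how KMedoidsClustersizeRepresentation behaves, not a confirmed reading of its implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
features = rng.normal(size=(12, 4))   # 12 slices, 4 PCs
selected = [0, 3, 8]                  # indices of the chosen slices (illustrative)

# Assign each slice to its nearest selected slice (Euclidean distance).
dists = np.linalg.norm(features[:, None, :] - features[selected][None, :, :], axis=2)
assignment = dists.argmin(axis=1)

# Weight = share of slices represented by each selected slice.
counts = np.bincount(assignment, minlength=len(selected))
weights = counts / counts.sum()
print(dict(zip(selected, weights)))
```

Under this rule the weights always sum to 1 and each selected slice receives at least 1/12, since it is nearest to itself.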
## Experiment B: Diversity-focused selection
We replace CentroidBalance with DiversityReward, which favours selections that are maximally spread out in feature space (large pairwise distances). This can pull the selection towards extreme months rather than central ones.
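A common way to formalise such a reward is the total pairwise distance within the selection. Whether DiversityReward uses the sum, the minimum, or another aggregate is not shown here, so the sketch below assumes the sum:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
features = rng.normal(size=(12, 4))  # 12 slices, 4 PCs

def diversity(features, sel):
    """Sum of pairwise Euclidean distances within the selection."""
    return sum(
        float(np.linalg.norm(features[i] - features[j]))
        for i, j in itertools.combinations(sel, 2)
    )

central = diversity(features, [0, 1, 2])
# The most spread-out triple maximises this reward by construction.
best = max(itertools.combinations(range(12), 3),
           key=lambda s: diversity(features, s))
print(best, diversity(features, best))
```

Maximising pairwise spread favours points on the hull of the feature cloud, which is why this objective tends to pick extreme rather than central months.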
obj_diverse = rep.ObjectiveSet({
'wasserstein': (0.5, rep.WassersteinFidelity()),
'correlation': (0.5, rep.CorrelationFidelity()),
'diversity': (0.5, rep.DiversityReward()),
})
search_b = rep.ObjectiveDrivenCombinatorialSearchAlgorithm(
obj_diverse,
rep.ParetoMaxMinStrategy(),
rep.ExhaustiveCombiGen(k=k),
)
workflow_b = rep.Workflow(pca_4, search_b, rep.KMedoidsClustersizeRepresentation())
exp_b = rep.RepSetExperiment(context_4pc, workflow_b)
result_b = exp_b.run()
print(f"Selected months (B): {result_b.selection}")
print(f"Weights (B): {result_b.weights}")
print(f"Scores (B): {result_b.scores}")
Iterating over combinations: 100%|██████████| 220/220 [00:01<00:00, 216.62it/s]
Selected months (B): (Period('2015-07', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'))
Weights (B): {Period('2015-07', 'M'): 0.5833333333333334, Period('2015-10', 'M'): 0.16666666666666666, Period('2015-11', 'M'): 0.25}
Scores (B): {'wasserstein': 0.17951351867468845, 'correlation': 0.04652147452533339, 'diversity': 9.132970566687343}
## Comparing the two selections
How does the objective shape the result? Let's compare which months each experiment chose.
set_a = set(result_a.selection)
set_b = set(result_b.selection)
print(f"Experiment A (Balanced): {result_a.selection}")
print(f"Experiment B (Diverse): {result_b.selection}")
print(f"Overlap: {set_a & set_b or 'none'}")
print(f"Only in A: {set_a - set_b or 'none'}")
print(f"Only in B: {set_b - set_a or 'none'}")
Experiment A (Balanced): (Period('2015-01', 'M'), Period('2015-04', 'M'), Period('2015-09', 'M'))
Experiment B (Diverse): (Period('2015-07', 'M'), Period('2015-10', 'M'), Period('2015-11', 'M'))
Overlap: none
Only in A: {Period('2015-09', 'M'), Period('2015-01', 'M'), Period('2015-04', 'M')}
Only in B: {Period('2015-11', 'M'), Period('2015-07', 'M'), Period('2015-10', 'M')}
## Side-by-side in feature space
The SelectionComparisonScatterMatrix plots both selections on top of the full feature space. Distinct markers and colours make it easy to spot where the objectives push the selection.
fig = diag.SelectionComparisonScatterMatrix().plot(
context_4pc.df_features,
selections={
'A: Balanced': result_a.selection,
'B: Diverse': result_b.selection,
},
dimensions=['pc_0', 'pc_1', 'pc_2', 'pc_3'],
)
fig.update_layout(title='Selection Comparison in PCA Feature Space')
fig.show()
fig = diag.ResponsibilityBars().plot(result_a.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights — Experiment A (Balanced)')
fig.show()
fig = diag.ResponsibilityBars().plot(result_b.weights, show_uniform_reference=True)
fig.update_layout(title='Responsibility Weights — Experiment B (Diverse)')
fig.show()
## Distribution fidelity (ECDF grid)
The ECDF grid shows — for every variable at once — how well the selection's marginal distribution tracks the full year. Gaps indicate value ranges that the selection under- or over-represents.
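The visual gap can be quantified as the largest vertical distance between the two ECDFs, i.e. the two-sample Kolmogorov-Smirnov statistic. A self-contained sketch on synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(7)
full = rng.normal(size=1000)   # stand-in for a full year of one variable
subset = full[:250]            # stand-in for the hours of the selected slices

def ecdf_gap(a, b):
    """Largest vertical gap between the ECDFs of a and b (KS statistic)."""
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(Fa - Fb)))

print(f"max ECDF gap: {ecdf_gap(full, subset):.3f}")
```

A gap near 0 means the selection's marginal distribution tracks the full year closely over the whole value range, not just in the mean.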
selected_idx_a = context.slicer.get_indices_for_slice_combi(context.df_raw.index, result_a.selection)
df_sel_a = context.df_raw.loc[selected_idx_a]
fig = diag.DistributionOverlayECDFGrid().plot(context.df_raw, df_sel_a)
fig.update_layout(title='Distribution Fidelity — Experiment A (Balanced)')
fig.update_xaxes(matches=None)
fig.update_yaxes(matches=None)
fig.show()
## Correlation preservation
The heatmap shows the difference between the correlation matrix of the selection and the full year. Values near zero (light) mean the selection preserves that cross-variable relationship well.
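The same difference matrix is straightforward to compute with pandas. The sketch below uses synthetic data with one injected cross-variable dependency; the month choice and column names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame(rng.normal(size=(8760, 3)), index=idx,
                  columns=["load", "onwind", "solar"])
df["solar"] += 0.5 * df["load"]            # inject a cross-variable dependency

sel = df[df.index.month.isin([1, 4, 9])]   # stand-in for the selected months
corr_diff = sel.corr(method="pearson") - df.corr(method="pearson")
print(corr_diff.round(3))
```

The diagonal is zero by construction (every correlation matrix has ones on the diagonal), so only the off-diagonal entries carry information.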
fig = diag.CorrelationDifferenceHeatmap().plot(
context.df_raw, df_sel_a, method='pearson', show_lower_only=True
)
fig.update_layout(title='Correlation Difference — Experiment A (Balanced)')
fig.show()
## Diurnal profiles
Average hourly shape (hour 0-23) of each variable, comparing the full year with the selection. Good matches indicate the selection captures typical within-day patterns.
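The overlay reduces to a groupby-hour mean for each frame. A sketch on a synthetic solar-like signal (the daily shape and month choice are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
# Synthetic solar-like signal: a clipped daily sine shape plus noise.
signal = np.clip(np.sin((idx.hour - 6) / 12 * np.pi), 0, None)
df = pd.DataFrame({"solar": signal + 0.1 * rng.normal(size=8760)}, index=idx)

sel = df[df.index.month.isin([1, 4, 9])]   # stand-in for the selected months
full_profile = df.groupby(df.index.hour)["solar"].mean()
sel_profile = sel.groupby(sel.index.hour)["solar"].mean()
print((full_profile - sel_profile).abs().max())
```

Because the synthetic daily shape is season-independent, the two profiles agree up to noise; with real solar data, a selection biased towards winter months would show a visibly flatter midday peak.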
fig = diag.DiurnalProfileOverlay().plot(
context.df_raw, df_sel_a, variables=['load', 'onwind', 'offwind', 'solar']
)
fig.update_layout(title='Diurnal Profiles — Experiment A (Balanced)')
fig.show()