Example 6: K-Medoids Clustering¶
K-medoids clustering is a constructive (Workflow Type 2) algorithm that partitions the candidate periods in feature space into $k$ clusters and selects the medoid of each cluster as a representative period. Unlike k-means, which produces synthetic centroids, k-medoids always selects actual data points, making it a natural fit for representative period selection.
Key properties:
- Internal objective: minimizes within-cluster sum of squares (WCSS)
- Weights: pre-computed as cluster-size fractions ($w_j = n_j / N$)
- No external ObjectiveSet needed: the algorithm has its own built-in objective
- Fast: converges in a few iterations for typical problem sizes
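The assign/update loop behind k-medoids can be sketched in a few lines of NumPy. This is a minimal illustration on toy 1-D data, not the `KMedoidsSearch` implementation; the deterministic initial medoid indices are an assumption chosen for reproducibility:

```python
import numpy as np

def k_medoids(X, init_idx, n_iter=20):
    """Plain alternating k-medoids: assign each point to its nearest medoid,
    then re-pick each medoid as the member minimizing total in-cluster distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = np.array(init_idx)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assignment step
        new_medoids = medoids.copy()
        for j in range(len(medoids)):
            members = np.where(labels == j)[0]
            # update step: the member with the smallest summed distance to its cluster
            new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids stopped moving
        medoids = new_medoids
    weights = np.bincount(labels, minlength=len(medoids)) / len(X)  # w_j = n_j / N
    return medoids, labels, weights

# Seven toy "periods" with one feature each; three obvious groups.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.9], [10.0]])
medoids, labels, weights = k_medoids(X, init_idx=[0, 3, 5])
print(medoids, weights)  # medoids are indices of actual data points
```

Note that the returned weights are already the cluster-size fractions $w_j = n_j / N$, which is exactly why no separate weighting step is needed downstream.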
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio; pio.renderers.default = 'notebook_connected'
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)
Monthly K-Medoids¶
We select 4 representative months from 12 using k-medoids clustering on statistical features. The algorithm partitions the 12 months into 4 clusters and picks the medoid of each.
context = rep.ProblemContext(df_raw=df_raw, slicer=rep.TimeSlicer(unit="month"))
workflow = rep.Workflow(
feature_engineer=rep.StandardStatsFeatureEngineer(),
search_algorithm=rep.KMedoidsSearch(k=4, random_state=42),
)
experiment = rep.RepSetExperiment(context, workflow)
result = experiment.run()
print(f"Selection: {result.selection}")
print(f"WCSS: {result.scores['wcss']:.4f}")
print(f"Weights: { {str(k): round(v, 3) for k, v in result.weights.items()} }")
Selection: (Period('2015-07', 'M'), Period('2015-01', 'M'), Period('2015-09', 'M'), Period('2015-06', 'M'))
WCSS: 115.3572
Weights: {'2015-07': 0.167, '2015-01': 0.25, '2015-09': 0.333, '2015-06': 0.25}
if 'cluster_info' in result.diagnostics:
print("Cluster membership:")
for info in result.diagnostics['cluster_info']:
print(f" Cluster {info['cluster']}: medoid={info['medoid']}, "
f"size={info['size']}, members={info['members']}")
Cluster membership:
Cluster 0: medoid=2015-07, size=2, members=[Period('2015-05', 'M'), Period('2015-07', 'M')]
Cluster 1: medoid=2015-01, size=3, members=[Period('2015-01', 'M'), Period('2015-11', 'M'), Period('2015-12', 'M')]
Cluster 2: medoid=2015-09, size=4, members=[Period('2015-02', 'M'), Period('2015-03', 'M'), Period('2015-09', 'M'), Period('2015-10', 'M')]
Cluster 3: medoid=2015-06, size=3, members=[Period('2015-04', 'M'), Period('2015-06', 'M'), Period('2015-08', 'M')]
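As a sanity check, the responsibility weights printed earlier are exactly the cluster-size fractions $w_j = n_j / N$ implied by this membership listing; a quick recomputation by hand:

```python
# Cluster sizes read off the membership listing above; N = 12 months.
sizes = {'2015-07': 2, '2015-01': 3, '2015-09': 4, '2015-06': 3}
N = sum(sizes.values())
weights = {medoid: round(n / N, 3) for medoid, n in sizes.items()}
print(weights)  # {'2015-07': 0.167, '2015-01': 0.25, '2015-09': 0.333, '2015-06': 0.25}
```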
fig = diag.ResponsibilityBars().plot(result.weights, show_uniform_reference=True)
fig.update_layout(title='K-Medoids: Responsibility Weights (Cluster Fractions)')
fig.show()
feature_ctx = experiment.feature_context
cols = list(feature_ctx.df_features.columns[:2])
fig = diag.FeatureSpaceScatter2D().plot(
feature_ctx.df_features, x=cols[0], y=cols[1], selection=result.selection
)
fig.update_layout(title='K-Medoids: Feature Space (First Two Features)')
fig.show()
slicer = rep.TimeSlicer(unit="month")
selected_idx = slicer.get_indices_for_slice_combi(df_raw.index, result.selection)
df_sel = df_raw.loc[selected_idx]
fig = diag.DistributionOverlayECDF().plot(df_raw['load'], df_sel['load'])
fig.update_layout(title='K-Medoids: ECDF Overlay (Load)')
fig.show()
Effect of k¶
More clusters yield a lower WCSS (tighter clusters), but every additional representative reduces compression. Let's compare k = 3, 4, and 6.
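The monotone WCSS decrease is easy to see on toy data: with more medoids, every point sits closer to its nearest representative. A minimal sketch with hand-picked medoid values (not the library's search):

```python
import numpy as np

X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0])  # toy 1-D feature values

def wcss(X, medoid_vals):
    # squared distance from each point to its nearest medoid, summed
    d2 = (X[:, None] - np.asarray(medoid_vals)[None, :]) ** 2
    return d2.min(axis=1).sum()

for medoids in ([5.0], [0.1, 9.9], [0.1, 5.0, 9.9]):
    print(f"k={len(medoids)}: WCSS={wcss(X, medoids):.2f}")
```

Each added medoid captures one more of the three natural groups, so WCSS drops sharply until k matches the number of groups, then flattens.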
results_by_k = {}
for k in [3, 4, 6]:
wf = rep.Workflow(
feature_engineer=rep.StandardStatsFeatureEngineer(),
search_algorithm=rep.KMedoidsSearch(k=k, random_state=42),
)
res = rep.RepSetExperiment(context, wf).run()
results_by_k[k] = res
print(f"{'k':>3} {'WCSS':>10} {'Selection'}")
print("-" * 50)
for k, res in results_by_k.items():
print(f"{k:>3} {res.scores['wcss']:>10.4f} {res.selection}")
k WCSS Selection
--------------------------------------------------
3 144.0684 (Period('2015-06', 'M'), Period('2015-01', 'M'), Period('2015-09', 'M'))
4 115.3572 (Period('2015-07', 'M'), Period('2015-01', 'M'), Period('2015-09', 'M'), Period('2015-06', 'M'))
6 62.8189 (Period('2015-07', 'M'), Period('2015-11', 'M'), Period('2015-03', 'M'), Period('2015-06', 'M'), Period('2015-10', 'M'), Period('2015-12', 'M'))
Summary¶
K-medoids clustering is a good default when you want a standard, fast, well-understood clustering-based selection without additional constraints. It works well for monthly or weekly slicing and produces cluster-size-proportional weights automatically.
For contiguous temporal segments, use CTPC instead. For multi-day subsequences, use the Snippet algorithm.