Example 6: K-Medoids Clustering¶
K-medoids clustering is a constructive (Workflow Type 2) algorithm that partitions the candidate periods in feature space into $k$ clusters and selects the medoid of each cluster as a representative period. Unlike k-means, which produces synthetic centroids, k-medoids always selects actual data points, making it a natural fit for representative period selection.
Key properties:
- Internal objective: minimizes within-cluster sum of squares (WCSS)
- Weights: pre-computed as cluster-size fractions ($w_j = n_j / N$)
- No external ObjectiveSet needed: the algorithm has its own built-in objective
- Fast: converges in a few iterations for typical problem sizes
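The assign/update loop behind k-medoids can be sketched in a few lines of NumPy. This is a minimal illustration on toy 1-D data, not the `KMedoidsSearch` implementation; the deterministic initial medoid indices are an assumption chosen for reproducibility:

```python
import numpy as np

def k_medoids(X, init_idx, n_iter=20):
    """Plain alternating k-medoids: assign each point to its nearest medoid,
    then re-pick each medoid as the member minimizing total in-cluster distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = np.array(init_idx)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assignment step
        new_medoids = medoids.copy()
        for j in range(len(medoids)):
            members = np.where(labels == j)[0]
            # update step: the member with the smallest summed distance to its cluster
            new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids stopped moving
        medoids = new_medoids
    weights = np.bincount(labels, minlength=len(medoids)) / len(X)  # w_j = n_j / N
    return medoids, labels, weights

# Seven toy "periods" with one feature each; three obvious groups.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.9], [10.0]])
medoids, labels, weights = k_medoids(X, init_idx=[0, 3, 5])
print(medoids, weights)  # medoids are indices of actual data points
```

Note that the returned weights are already the cluster-size fractions $w_j = n_j / N$, which is exactly why no separate weighting step is needed downstream.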
import pandas as pd
import energy_repset as rep
import energy_repset.diagnostics as diag
import plotly.io as pio; pio.renderers.default = 'notebook_connected'
url = "https://tubcloud.tu-berlin.de/s/pKttFadrbTKSJKF/download/time-series-lecture-2.csv"
df_raw = pd.read_csv(url, index_col=0, parse_dates=True).rename_axis('variable', axis=1)
df_raw = df_raw.drop('prices', axis=1)
Monthly K-Medoids¶
We select 4 representative months from 12 using k-medoids clustering on statistical features. The algorithm partitions the 12 months into 4 clusters and picks the medoid of each.
context = rep.ProblemContext(df_raw=df_raw, slicer=rep.TimeSlicer(unit="month"))
workflow = rep.Workflow(
feature_engineer=rep.StandardStatsFeatureEngineer(),
search_algorithm=rep.KMedoidsSearch(k=4, random_state=42),
)
experiment = rep.RepSetExperiment(context, workflow)
result = experiment.run()
print(f"Selection: {result.selection}")
print(f"WCSS: {result.scores['wcss']:.4f}")
print(f"Weights: { {str(k): round(v, 3) for k, v in result.weights.items()} }")
Selection: (Period('2015-07', 'M'), Period('2015-01', 'M'), Period('2015-09', 'M'), Period('2015-06', 'M'))
WCSS: 115.3572
Weights: {'2015-07': 0.167, '2015-01': 0.25, '2015-09': 0.333, '2015-06': 0.25}
if 'cluster_info' in result.diagnostics:
print("Cluster membership:")
for info in result.diagnostics['cluster_info']:
print(f" Cluster {info['cluster']}: medoid={info['medoid']}, "
f"size={info['size']}, members={info['members']}")
Cluster membership:
Cluster 0: medoid=2015-07, size=2, members=[Period('2015-05', 'M'), Period('2015-07', 'M')]
Cluster 1: medoid=2015-01, size=3, members=[Period('2015-01', 'M'), Period('2015-11', 'M'), Period('2015-12', 'M')]
Cluster 2: medoid=2015-09, size=4, members=[Period('2015-02', 'M'), Period('2015-03', 'M'), Period('2015-09', 'M'), Period('2015-10', 'M')]
Cluster 3: medoid=2015-06, size=3, members=[Period('2015-04', 'M'), Period('2015-06', 'M'), Period('2015-08', 'M')]
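As a sanity check, the responsibility weights printed earlier are exactly the cluster-size fractions $w_j = n_j / N$ implied by this membership listing; a quick recomputation by hand:

```python
# Cluster sizes read off the membership listing above; N = 12 months.
sizes = {'2015-07': 2, '2015-01': 3, '2015-09': 4, '2015-06': 3}
N = sum(sizes.values())
weights = {medoid: round(n / N, 3) for medoid, n in sizes.items()}
print(weights)  # {'2015-07': 0.167, '2015-01': 0.25, '2015-09': 0.333, '2015-06': 0.25}
```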
fig = diag.ResponsibilityBars().plot(result.weights, show_uniform_reference=True)
fig.update_layout(title='K-Medoids: Responsibility Weights (Cluster Fractions)')
fig.show()
feature_ctx = experiment.feature_context
cols = list(feature_ctx.df_features.columns[:2])
fig = diag.FeatureSpaceScatter2D().plot(
feature_ctx.df_features, x=cols[0], y=cols[1], selection=result.selection
)
fig.update_layout(title='K-Medoids: Feature Space (First Two Features)')
fig.show()
slicer = rep.TimeSlicer(unit="month")
selected_idx = slicer.get_indices_for_slice_combi(df_raw.index, result.selection)
df_sel = df_raw.loc[selected_idx]
fig = diag.DistributionOverlayECDF().plot(df_raw['load'], df_sel['load'])
fig.update_layout(title='K-Medoids: ECDF Overlay (Load)')
fig.show()
Effect of k¶
More clusters yield a lower WCSS (tighter clusters), but every additional representative reduces compression. Let's compare k = 3, 4, and 6.
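The monotone WCSS decrease is easy to see on toy data: with more medoids, every point sits closer to its nearest representative. A minimal sketch with hand-picked medoid values (not the library's search):

```python
import numpy as np

X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0])  # toy 1-D feature values

def wcss(X, medoid_vals):
    # squared distance from each point to its nearest medoid, summed
    d2 = (X[:, None] - np.asarray(medoid_vals)[None, :]) ** 2
    return d2.min(axis=1).sum()

for medoids in ([5.0], [0.1, 9.9], [0.1, 5.0, 9.9]):
    print(f"k={len(medoids)}: WCSS={wcss(X, medoids):.2f}")
```

Each added medoid captures one more of the three natural groups, so WCSS drops sharply until k matches the number of groups, then flattens.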
results_by_k = {}
for k in [3, 4, 6]:
wf = rep.Workflow(
feature_engineer=rep.StandardStatsFeatureEngineer(),
search_algorithm=rep.KMedoidsSearch(k=k, random_state=42),
)
res = rep.RepSetExperiment(context, wf).run()
results_by_k[k] = res
print(f"{'k':>3} {'WCSS':>10} {'Selection'}")
print("-" * 50)
for k, res in results_by_k.items():
print(f"{k:>3} {res.scores['wcss']:>10.4f} {res.selection}")
k WCSS Selection
--------------------------------------------------
3 144.0684 (Period('2015-06', 'M'), Period('2015-01', 'M'), Period('2015-09', 'M'))
4 115.3572 (Period('2015-07', 'M'), Period('2015-01', 'M'), Period('2015-09', 'M'), Period('2015-06', 'M'))
6 62.8189 (Period('2015-07', 'M'), Period('2015-11', 'M'), Period('2015-03', 'M'), Period('2015-06', 'M'), Period('2015-10', 'M'), Period('2015-12', 'M'))
Summary¶
K-medoids clustering is a good default when you want a standard, fast, well-understood clustering-based selection without additional constraints. It works well for monthly or weekly slicing and produces cluster-size-proportional weights automatically.
For contiguous temporal segments, use CTPC instead. For multi-day subsequences, use the Snippet algorithm.