Home > Back-end >  Dataframe missing the time label after assigning clusters
Dataframe missing the time label after assigning clusters

Time:03-10

I have the below data sample

date,00:00:00,00:15:00,00:30:00,00:45:00,01:00:00,01:15:00,01:30:00,01:45:00,02:00:00,event
2008-01-01,115.87869701,115.37569504,79.9510802,123.68891355,110.89528693, 112.15190765,110.1277647,76.16662078,100.39338951,A
2008-01-02,104.29757522,89.11652179,91.80890697,109.91423556,112.91809129,114.91459611,117.50170579,111.08030786,81.5893157,B
2008-01-02,81.16506701,97.13170328,89.25478466,93.51884481,107.11447296,120.40638709,116.1653649,79.8861492,111.99530301,C
2008-01-02,121.98507602,105.20973701,84.46996209,96.2210916,107.65437228,121.4604217,120.96638889,117.94695867,94.33309319,D
2008-01-02,82.5839125,104.3308685,98.32658468,101.79562494,86.02883206,90.61788466,109.89027977,107.89093632,101.64082595,E
2008-01-02,100.68446746,89.90700858,115.97450984,112.85364917,100.76204374,87.49141078,81.69930821,79.78106694,99.97354515,F
2008-01-02,98.49917234,112.93161335,85.30015915,120.59233515,102.15602621,84.9536008,116.98786228,107.95753105,112.75693735,G
2008-01-02,76.5186262,111.22137123,102.20065099,88.4490991,84.67584098,86.00205813,95.02734271,114.29076806,102.62969032,H
2008-01-02,93.27785451,122.90242719,123.27263927,102.83454346,87.84973282,95.38098403,88.03719802,108.68335342,97.6581398,I
2008-01-02,119.589143,94.15858259,94.32809506,120.5637488,120.43827996,79.66190052,100.40782173,89.90362719,80.46005726,J

I want to assign clusters to the data and have the final output in the below format

Expected output

time       date    00:00:00    00:15:00    00:30:00    00:45:00    01:00:00    01:15:00    01:30:00    01:45:00    02:00:00  cluster_num
0     2008-01-01   115.878697  115.375695  79.951080   123.688914  110.895287  112.151908  110.127765  76.166621   100.393390          0
1     2008-01-02   97.622322   102.989982  98.326255   105.193686  101.066410  97.876583   105.187030  101.935633  98.115212           1

I have tried the below and the current output does not return 'time' label in the first row

import pandas as pd
import numpy as np
from datetime import datetime
from scipy.cluster.vq import kmeans, vq, whiten
from scipy.spatial.distance import cdist
from sklearn import metrics

#read data
df = pd.read_csv('df.csv', index_col=0)
df = df.drop(['event'], axis=1)

#stack the data
df = df.stack()
df.index = pd.to_datetime([' '.join(i) for i in df.index])
df = df.rename_axis('event_timestamp').reset_index(name='value')
df.index = df.event_timestamp
df = df.drop(['event_timestamp'], axis=1)
df.columns = ['value']

#normalize the df
df_norm = (df - df.mean()) / (df.max() - df.min())
df['time'] = df.index.map(lambda t: t.time())
df['date'] = df.index.map(lambda t: t.date())
df_norm['time'] = df_norm.index.map(lambda t: t.time())
df_norm['date'] = df_norm.index.map(lambda t: t.date())

#pivot data
df_daily = pd.pivot_table(df, values='value', index='date', columns='time', aggfunc='mean')
df_daily_norm = pd.pivot_table(df_norm, values='value', index='date', columns='time', aggfunc='mean')

#assign clusters to daily data
df_daily_matrix_norm = np.matrix(df_daily_norm.dropna())
centers, _ = kmeans(df_daily_matrix_norm, 2)
cluster, _ = vq(df_daily_matrix_norm, centers)
clusterdf = pd.DataFrame(cluster, columns=['cluster_num'])
dailyclusters = pd.concat([df_daily.dropna().reset_index(), clusterdf], axis=1)
print(dailyclusters)

Current output

       date        00:00:00    00:15:00    00:30:00    00:45:00    01:00:00    01:15:00    01:30:00    01:45:00    02:00:00  cluster_num
0     2008-01-01   115.878697  115.375695  79.951080   123.688914  110.895287  112.151908  110.127765  76.166621   100.393390          0
1     2008-01-02   97.622322   102.989982  98.326255   105.193686  101.066410  97.876583   105.187030  101.935633  98.115212           1

What do I need to do to get the desired output with the 'time' label.

CodePudding user response:

simply add the name to the index:

dailyclusters.index.name = "time"

CodePudding user response:

Use:

dailyclusters = df_daily.dropna().assign(cluster_num=cluster).reset_index()
print(dailyclusters)

# Output
time        date    00:00:00    00:15:00   00:30:00    00:45:00    01:00:00    01:15:00    01:30:00    01:45:00    02:00:00  cluster_num
0     2008-01-01  115.878697  115.375695  79.951080  123.688914  110.895287  112.151908  110.127765   76.166621  100.393390            1
1     2008-01-02   97.622322  102.989982  98.326255  105.193686  101.066410   97.876583  105.187030  101.935633   98.115212            0
  • Related