I have the below data sample
date,00:00:00,00:15:00,00:30:00,00:45:00,01:00:00,01:15:00,01:30:00,01:45:00,02:00:00,event
2008-01-01,115.87869701,115.37569504,79.9510802,123.68891355,110.89528693, 112.15190765,110.1277647,76.16662078,100.39338951,A
2008-01-02,104.29757522,89.11652179,91.80890697,109.91423556,112.91809129,114.91459611,117.50170579,111.08030786,81.5893157,B
2008-01-02,81.16506701,97.13170328,89.25478466,93.51884481,107.11447296,120.40638709,116.1653649,79.8861492,111.99530301,C
2008-01-02,121.98507602,105.20973701,84.46996209,96.2210916,107.65437228,121.4604217,120.96638889,117.94695867,94.33309319,D
2008-01-02,82.5839125,104.3308685,98.32658468,101.79562494,86.02883206,90.61788466,109.89027977,107.89093632,101.64082595,E
2008-01-02,100.68446746,89.90700858,115.97450984,112.85364917,100.76204374,87.49141078,81.69930821,79.78106694,99.97354515,F
2008-01-02,98.49917234,112.93161335,85.30015915,120.59233515,102.15602621,84.9536008,116.98786228,107.95753105,112.75693735,G
2008-01-02,76.5186262,111.22137123,102.20065099,88.4490991,84.67584098,86.00205813,95.02734271,114.29076806,102.62969032,H
2008-01-02,93.27785451,122.90242719,123.27263927,102.83454346,87.84973282,95.38098403,88.03719802,108.68335342,97.6581398,I
2008-01-02,119.589143,94.15858259,94.32809506,120.5637488,120.43827996,79.66190052,100.40782173,89.90362719,80.46005726,J
I want to assign clusters to the data and have the final output in the below format
Expected output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
I have tried the below and the current output does not return 'time' label in the first row
import pandas as pd
import numpy as np
from datetime import datetime
from scipy.cluster.vq import kmeans, vq, whiten
from scipy.spatial.distance import cdist
from sklearn import metrics
#read data
df = pd.read_csv('df.csv', index_col=0)
df = df.drop(['event'], axis=1)
#stack the data
df = df.stack()
df.index = pd.to_datetime([' '.join(i) for i in df.index])
df = df.rename_axis('event_timestamp').reset_index(name='value')
df.index = df.event_timestamp
df = df.drop(['event_timestamp'], axis=1)
df.columns = ['value']
#normalize the df
df_norm = (df - df.mean()) / (df.max() - df.min())
df['time'] = df.index.map(lambda t: t.time())
df['date'] = df.index.map(lambda t: t.date())
df_norm['time'] = df_norm.index.map(lambda t: t.time())
df_norm['date'] = df_norm.index.map(lambda t: t.date())
#pivot data
df_daily = pd.pivot_table(df, values='value', index='date', columns='time', aggfunc='mean')
df_daily_norm = pd.pivot_table(df_norm, values='value', index='date', columns='time', aggfunc='mean')
#assign clusters to daily data
df_daily_matrix_norm = np.matrix(df_daily_norm.dropna())
centers, _ = kmeans(df_daily_matrix_norm, 2)
cluster, _ = vq(df_daily_matrix_norm, centers)
clusterdf = pd.DataFrame(cluster, columns=['cluster_num'])
dailyclusters = pd.concat([df_daily.dropna().reset_index(), clusterdf], axis=1)
print(dailyclusters)
Current output
date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
What do I need to do to get the desired output with the 'time' label.
CodePudding user response:
simply add the name to the index:
dailyclusters.index.name = "time"
CodePudding user response:
Use:
dailyclusters = df_daily.dropna().assign(cluster_num=cluster).reset_index()
print(dailyclusters)
# Output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 1
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 0