I want to rearrange my example dataframe (df.csv) below based on the date column. Each row represents an hour's data for instance for both dates 2002-01-01 and 2002-01-02, there is 5 rows respectively, each representing 1 hour.
date,symbol
2002-01-01,A
2002-01-01,A
2002-01-01,A
2002-01-01,B
2002-01-01,A
2002-01-02,B
2002-01-02,B
2002-01-02,A
2002-01-02,A
2002-01-02,A
My expected output is as below .
date,hour1, hour2, hour3, hour4, hour5
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
I have tried the below as explained here: https://pandas.pydata.org/docs/user_guide/reshaping.html, but it doesnt work in my case because the symbol column contains duplicates.
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
pivoted = df.pivot(index="date", columns="symbol")
print(pivoted)
The data does not have the timestamps but only the date. However, each row for the same date represents an hourly interval, for instance the output could also be represented as below:
date,01:00, 02:00, 03:00, 04:00, 05:00
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
where the hour1 represent 01:00, hour2 represent 02:00...etc
CodePudding user response:
I'd have an approach for you, I guess it not the most elegant way since I have to rename both index and columns but it does the job.
new_cols = ['01:00', '02:00', '03:00', '04:00', '05:00']
df1 = df.loc[df['date']=='2002-01-01', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-01'])
df2 = df.loc[df['date']=='2002-01-02', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-02'])
result = pd.concat([df1,df2])
print(result)
Output:
01:00 02:00 03:00 04:00 05:00
2002-01-01 A A A B A
2002-01-02 B B A A A
Lets see how @mozway does the magic like always....then you can forget my answer ;)
CodePudding user response:
You had the correct pivot
approach, but you were missing a column 'time', so let's split the datetime into date and time:
s = pd.to_datetime(df['date'])
df['date'] = s.dt.date
df['time'] = s.dt.time
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
Alternatively for having a HH:MM time, use df['time'] = s.dt.strftime('%H:%M')