I have a dataframe with two columns: date and value.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = ['2020-03-01 00:00:00','2020-03-01 00:00:15', '2020-03-01 00:00:30', '2020-03-02 00:00:00','2020-03-02 00:00:15', '2020-03-02 00:00:30' , '2020-03-03 00:00:15', '2020-03-03 00:00:30', '2020-03-05 00:00:00', '2020-03-05 00:00:30']
df['value'] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4]
df
date value
0 2020-03-01 00:00:00 1
1 2020-03-01 00:00:15 2
2 2020-03-01 00:00:30 3
3 2020-03-02 00:00:00 4
4 2020-03-02 00:00:15 5
5 2020-03-02 00:00:30 6
6 2020-03-03 00:00:15 1
7 2020-03-03 00:00:30 2
8 2020-03-05 00:00:00 3
9 2020-03-05 00:00:30 4
In the date column, some days are missing entirely (I want all consecutive days, like 1-2-3-4-..., but in this example there is no data for 2020-03-04, so it should appear as NaN). First I want to build this df, which shows me the days whose data I don't have:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.0 2.0 3.0
1 2020-03-02 4.0 5.0 6.0
2 2020-03-03 NaN 1.0 2.0
3 2020-03-04 NaN NaN NaN
4 2020-03-05 3.0 NaN 4.0
Then replace the NaN values with the mean of each column, like:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.000000 2.000000 3.000000
1 2020-03-02 4.000000 5.000000 6.000000
2 2020-03-03 2.666667 1.000000 2.000000
3 2020-03-04 2.666667 2.666667 2.666667
4 2020-03-05 3.000000 5.000000 4.000000
And then build a df with a single row, like this (the column names are not important):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 4 5 6 2.67 1 2 2.67 2.67 2.67 3 5 4
I have been trying pivot and groupby, but I could not solve it, especially the missing date. Could you please help me with that?
CodePudding user response:
You can use resample():
df['date'] = pd.to_datetime(df['date'])
dfx = df.set_index('date').resample('15S').first()
Resampling gives a row for every 15-second interval over the whole time span, but we only need the times between 00:00:00 and 00:00:30 of each day:
dfx = dfx.between_time("00:00:00", "00:00:30").reset_index()
print(dfx)
'''
date value
0 2020-03-01 00:00:00 1.0
1 2020-03-01 00:00:15 2.0
2 2020-03-01 00:00:30 3.0
3 2020-03-02 00:00:00 4.0
4 2020-03-02 00:00:15 5.0
5 2020-03-02 00:00:30 6.0
6 2020-03-03 00:00:00 NaN
7 2020-03-03 00:00:15 1.0
8 2020-03-03 00:00:30 2.0
9 2020-03-04 00:00:00 NaN
10 2020-03-04 00:00:15 NaN
11 2020-03-04 00:00:30 NaN
12 2020-03-05 00:00:00 3.0
13 2020-03-05 00:00:15 NaN
14 2020-03-05 00:00:30 4.0
'''
Then convert the times into columns using crosstab:
dfx = pd.crosstab(dfx['date'].dt.date, dfx['date'].dt.time, values=dfx['value'], aggfunc='sum', dropna=False)
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.0 2.0 3.0
2020-03-02 4.0 5.0 6.0
2020-03-03 0.0 1.0 2.0
2020-03-04 0.0 0.0 0.0
2020-03-05 3.0 0.0 4.0
'''
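As an aside, since the question mentions pivot: pivot_table can build essentially the same day-by-time table as the crosstab call. A minimal sketch, assuming df['date'] has already been converted with pd.to_datetime as above (df_times and pivoted are just illustrative names):
df_times = df.set_index('date').resample('15S').first()                 # fill in the 15-second grid
df_times = df_times.between_time("00:00:00", "00:00:30").reset_index()  # keep the three times per day
pivoted = df_times.pivot_table(index=df_times['date'].dt.date,          # one row per day
                               columns=df_times['date'].dt.time,        # one column per time of day
                               values='value', aggfunc='sum', dropna=False)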
In the crosstab output, the cells with 0 are times that are not in the data set (summing a missing value gives 0). I replace them with NaN and fill them with the column means:
dfx = dfx.replace(0, np.nan)
for i in dfx.columns:
    dfx[i] = dfx[i].fillna(dfx[i].mean())
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.000000 2.000000 3.00
2020-03-02 4.000000 5.000000 6.00
2020-03-03 2.666667 1.000000 2.00
2020-03-04 2.666667 2.666667 3.75
2020-03-05 3.000000 2.666667 4.00
'''
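The loop above can also be written as a single call, since fillna accepts a Series of per-column fill values and dfx.mean() is exactly that:
dfx = dfx.fillna(dfx.mean())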
I did not fully understand what you want at the last stage; if you describe it in more detail, I will edit my answer.
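If the last stage is simply all the filled values read out day by day into a single row, here is a rough sketch under that assumption (out is just an illustrative name):
out = pd.DataFrame([dfx.to_numpy().ravel()])  # flatten the day-by-time table row by row into one row
out.columns = range(1, out.shape[1] + 1)      # number the columns 1, 2, 3, ...; rename as you like
print(out)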