Convert a data frame with date and value to a df with one row


I have a dataframe with two columns, date and value:

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = ['2020-03-01 00:00:00', '2020-03-01 00:00:15', '2020-03-01 00:00:30',
              '2020-03-02 00:00:00', '2020-03-02 00:00:15', '2020-03-02 00:00:30',
              '2020-03-03 00:00:15', '2020-03-03 00:00:30',
              '2020-03-05 00:00:00', '2020-03-05 00:00:30']
df['value'] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4]
df

date    value
0   2020-03-01 00:00:00     1
1   2020-03-01 00:00:15     2
2   2020-03-01 00:00:30     3
3   2020-03-02 00:00:00     4
4   2020-03-02 00:00:15     5
5   2020-03-02 00:00:30     6
6   2020-03-03 00:00:15     1
7   2020-03-03 00:00:30     2
8   2020-03-05 00:00:00     3
9   2020-03-05 00:00:30     4

In the date column some data is missing. I want all the days (1, 2, 3, 4, ...), but in this example there is no data at all for 2020-03-04, so that day should show up as NaN. First I want to build this df, which shows me the days whose data I am missing:

day     00:00:00    00:00:15    00:00:30
0   2020-03-01  1.0     2.0     3.0
1   2020-03-02  4.0     5.0     6.0
2   2020-03-03  NaN     1.0     2.0
3   2020-03-04  NaN     NaN     NaN
4   2020-03-05  3.0     NaN     4.0

Then replace the NaN values with the column means, like:

day     00:00:00    00:00:15    00:00:30
0   2020-03-01  1.000000    2.000000    3.000000
1   2020-03-02  4.000000    5.000000    6.000000
2   2020-03-03  2.666667    1.000000    2.000000
3   2020-03-04  2.666667    2.666667    2.666667
4   2020-03-05  3.000000    5.000000    4.000000

And then build one df with a single row, like below (the column names are not important):

    1   2   3   4   5   6       7   8   9       10      11      12     13   14
0   1   2   4   5   6   2.67    1   2   2.67    2.67    2.67    3      5    4

I have been working with pivot and groupby, but I could not solve it, especially handling the missing date. Could you please help me with that?

CodePudding user response:

You can use resample():

df['date'] = pd.to_datetime(df['date'])
# Resample to a fixed 15-second grid; missing timestamps become NaN rows.
dfx = df.set_index('date').resample('15S').first()

Resampling at 15 seconds fills in every slot across the whole day, but we only need the values between 00:00:00 and 00:00:30:

dfx = dfx.between_time("00:00:00", "00:00:30").reset_index()
print(dfx)
'''
    date    value
0   2020-03-01 00:00:00 1.0
1   2020-03-01 00:00:15 2.0
2   2020-03-01 00:00:30 3.0
3   2020-03-02 00:00:00 4.0
4   2020-03-02 00:00:15 5.0
5   2020-03-02 00:00:30 6.0
6   2020-03-03 00:00:00 NaN
7   2020-03-03 00:00:15 1.0
8   2020-03-03 00:00:30 2.0
9   2020-03-04 00:00:00 NaN
10  2020-03-04 00:00:15 NaN
11  2020-03-04 00:00:30 NaN
12  2020-03-05 00:00:00 3.0
13  2020-03-05 00:00:15 NaN
14  2020-03-05 00:00:30 4.0

'''

Then I convert the times into columns using crosstab:

dfx = pd.crosstab(dfx['date'].dt.date, dfx['date'].dt.time,
                  values=dfx['value'], aggfunc='sum', dropna=False)
print(dfx)
'''
date        00:00:00  00:00:15  00:00:30
date                                    
2020-03-01       1.0       2.0       3.0
2020-03-02       4.0       5.0       6.0
2020-03-03       0.0       1.0       2.0
2020-03-04       0.0       0.0       0.0
2020-03-05       3.0       0.0       4.0
'''
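As an aside, roughly the same wide table can be built straight from df with pivot plus a reindex over the full range of days, which leaves real NaN for the missing day instead of 0 (the names full_days and wide below are only illustrative):

full_days = pd.date_range(df['date'].dt.normalize().min(),
                          df['date'].dt.normalize().max(), freq='D')
# One row per calendar day, one column per time of day; days with no
# data at all (2020-03-04 here) come back as all-NaN rows.
wide = (df.assign(day=df['date'].dt.date, time=df['date'].dt.time)
          .pivot(index='day', columns='time', values='value')
          .reindex(full_days.date))

The crosstab route instead replaces the zeros explicitly, as follows.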

The 0 values are times that are not in the data set. I replace them with NaN and then fill them with the column means:

dfx = dfx.replace(0, np.nan)
# Fill each column's NaN with that column's mean.
for i in dfx.columns:
    dfx[i] = dfx[i].fillna(dfx[i].mean())
print(dfx)
'''
date        00:00:00  00:00:15  00:00:30
date                                    
2020-03-01  1.000000  2.000000      3.00
2020-03-02  4.000000  5.000000      6.00
2020-03-03  2.666667  1.000000      2.00
2020-03-04  2.666667  2.666667      3.75
2020-03-05  3.000000  2.666667      4.00
'''
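For reference, the per-column loop can be collapsed into a single call: DataFrame.mean() is computed per column (skipping NaN) and fillna broadcasts those means column-wise, so this should give the same result:

# Equivalent to the loop above.
dfx = dfx.fillna(dfx.mean())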

I did not fully understand what you want at the last stage; if you describe it in more detail, I will edit my answer.
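If the last stage is simply reading the filled day-by-time table row by row into a single-row frame, a minimal sketch would be the following (it assumes the one-row output is just the flattened values of dfx with throwaway 0..n-1 column names; note the flattened 5x3 table has 15 values, one more than the 14 shown in the question):

# Flatten the filled table row by row (day 1's three values, then day 2's, ...).
row = pd.DataFrame([dfx.to_numpy().ravel()])
print(row)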
