I have a CSV-file containing the following data structure:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
Using pandas in Python, I would like to repeat the 2nd row to fill the missing 5-minute intervals, inserting the new rows right after the 2nd row. Eventually, it should look like this:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:40:00,64.8741
2015-01-02,09:45:00,64.8741
2015-01-02,09:50:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
2015-01-02,10:05:00,64.815
I have the following code:
df = pd.read_csv("csv.file", header=0, names=['date', 'minute', 'price'])
for i in range(len(df)):
    if i != len(df)-1:
        next_i = i + 1
        if df.loc[next_i, 'date'] == df.loc[i, 'date'] and df.loc[i, 'minute'] != "16:00:00":
            now = int(df.loc[i, "minute"][:2] + df.loc[i, "minute"][3:5])
            future = int(df.loc[next_i, "minute"][:2] + df.loc[next_i, "minute"][3:5])
            while now + 5 != future and df.loc[next_i, "minute"][3:5] != "00" and df.loc[next_i, "minute"][3:5] != "60":
                newminutes = str(int(df.loc[i, "minute"][3:5]) + 5*a)
                newtime = df.loc[next_i, "minute"][:2] + ":" + newminutes + ":00"
                df.loc[next_i-0.5] = [df.loc[next_i, 'date'], newtime , df.loc[i, 'price']]
                df = df.sort_index().reset_index(drop=True)
                now = int(newtime[:2] + newtime[3:5])
                future = int(df.loc[next_i + 1, "minute"][:2] + df.loc[next_i + 1, "minute"][3:5])
However, it's not working.
CodePudding user response:
One way is to create the full index you need, left merge onto it, and forward fill.
First, make sure you have a proper timestamp column:
df['ts'] = pd.to_datetime(df[0] + ' ' + df[1])
df = df[['ts', 2]]
You should get something like this:
|   | ts | 2 |
|---|---|---|
| 0 | 2015-01-02 09:30:00 | 64.815 |
| 1 | 2015-01-02 09:35:00 | 64.8741 |
| 2 | 2015-01-02 09:55:00 | 65.0255 |
| 3 | 2015-01-02 10:00:00 | 64.9269 |
Then create the date range index:
new_df = pd.DataFrame(index=pd.date_range(start=df['ts'].min(),
end=df['ts'].max(), freq='5min'))
Then left merge onto it and forward fill:
new_df.merge(df, left_index=True, right_on='ts', how='left').ffill().reset_index(drop=True)
|   | ts | 2 |
|---|---|---|
| 0 | 2015-01-02 09:30:00 | 64.815 |
| 1 | 2015-01-02 09:35:00 | 64.8741 |
| 2 | 2015-01-02 09:40:00 | 64.8741 |
| 3 | 2015-01-02 09:45:00 | 64.8741 |
| 4 | 2015-01-02 09:50:00 | 64.8741 |
| 5 | 2015-01-02 09:55:00 | 65.0255 |
| 6 | 2015-01-02 10:00:00 | 64.9269 |
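The steps above can be put together as one runnable sketch. It assumes the file has no header row, so it is read with header=None and the columns get the integer labels 0, 1 and 2 used in this answer. The inline csv_text and the names grid and result are only for illustration; in practice you would pass your file path to read_csv.

```python
import io
import pandas as pd

# Sample rows from the question, standing in for the real file (assumption:
# the file has no header row).
csv_text = """2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
"""

# header=None keeps the first data row and labels the columns 0, 1, 2.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Combine the date and time strings into a single timestamp column.
df['ts'] = pd.to_datetime(df[0] + ' ' + df[1])
df = df[['ts', 2]]

# Empty frame whose index is the full 5-minute grid from first to last timestamp.
grid = pd.DataFrame(index=pd.date_range(start=df['ts'].min(),
                                        end=df['ts'].max(), freq='5min'))

# Left merge onto the grid, then forward fill the prices of the new rows.
result = (grid.merge(df, left_index=True, right_on='ts', how='left')
              .ffill()
              .reset_index(drop=True))
print(result)
```

Note that ffill() is used here in place of fillna(method='ffill'), since the method argument of fillna is deprecated in recent pandas versions.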
CodePudding user response:
I see there is an extra row in the expected output: 2015-01-02,10:05:00,64.815.
To accommodate that as well, you can reindex using pd.date_range.
Creating data
data = {
'date' : ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
'time' : ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
'val' : [64.815, 64.8741, 65.0255, 64.9269]
}
df = pd.DataFrame(data)
Creating datetime column for reindexing
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df.set_index('datetime', inplace=True)
Generating output
df.resample('5min').asfreq().reindex(pd.date_range('2015-01-02 09:30:00', '2015-01-02 10:05:00', freq='5 min')).ffill().reset_index(drop=True)
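Output
This gives the output below. Note that the date and time columns are forward filled along with val, so the inserted rows repeat the previous row's time, and the last row carries 64.9269 rather than the 64.815 shown in the question.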
         date      time      val
0  2015-01-02  09:30:00  64.8150
1  2015-01-02  09:35:00  64.8741
2  2015-01-02  09:35:00  64.8741
3  2015-01-02  09:35:00  64.8741
4  2015-01-02  09:35:00  64.8741
5  2015-01-02  09:55:00  65.0255
6  2015-01-02  10:00:00  64.9269
7  2015-01-02  10:00:00  64.9269
However, if that was a typo and you don't want the last row, you can do this:
df.resample('5min').asfreq().reindex(pd.date_range(df.index[0], df.index[-1], freq='5 min')).ffill().reset_index(drop=True)
which gives:
         date      time      val
0  2015-01-02  09:30:00  64.8150
1  2015-01-02  09:35:00  64.8741
2  2015-01-02  09:35:00  64.8741
3  2015-01-02  09:35:00  64.8741
4  2015-01-02  09:35:00  64.8741
5  2015-01-02  09:55:00  65.0255
6  2015-01-02  10:00:00  64.9269
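In the outputs above, the inserted rows still show 09:35:00 in the time column, because the string columns are forward filled and the datetime index is dropped. If you need the new rows to carry their own times (09:40:00, 09:45:00, ...), one possible variation is to resample only the val column and rebuild date and time from the resulting index. This is a minimal sketch under that assumption, not the only way to do it; the names filled and out are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
    'time': ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val': [64.815, 64.8741, 65.0255, 64.9269],
})

# Use the combined timestamp as the index so the frame can be resampled.
df = df.set_index(pd.to_datetime(df['date'] + ' ' + df['time']))

# Upsample only the value column to a 5-minute grid and forward fill it,
# so the original date/time strings are not copied into the new rows.
filled = df['val'].resample('5min').ffill()

# Rebuild the three-column layout from the new index.
out = pd.DataFrame({
    'date': filled.index.strftime('%Y-%m-%d'),
    'time': filled.index.strftime('%H:%M:%S'),
    'val': filled.to_numpy(),
})

# Same layout as the input file: no header, no index column.
print(out.to_csv(header=False, index=False))
```

This covers the range from the first to the last timestamp in the data, so it does not produce the extra 10:05:00 row from the question.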