A csv file with 600 thousand lines looks something like this.
| Time | value |
| --------- | -------------- |
| 2022-07-05 10:00:23 | 2 |
| 2022-07-05 10:01:11 | 5 |
| 2022-07-05 10:01:16 | 1 |
| 2022-07-05 10:01:33 | 2 |
| 2022-07-05 10:04:26 | 15 |
| 2022-07-05 10:05:57 | 1 |
| 2022-07-05 10:05:57 | 1 |
| 2022-07-05 10:05:57 | 3 |
And so on until 23:59:59
I'm trying something like
def V_t_5_Min():
df = pd.DataFrame(pd.read_csv(Ex_Csv, usecols = ["Time", 'value']))
a = df['Time'] = pd.to_datetime(df['Time'])
b = df['value']
c = pd.Series(a).dt.round("5T", 0)
df = pd.DataFrame(c)
df['value'] = b
c = df.groupby('Time')
df1 = pd.DataFrame(c.sum())
df1.to_csv('Datasets\CSV\V_5_min.csv')
But I get terrible rounding and incorrect calculations of the value
The desired result for an interval of 5 minutes
| Time | value |
| --------- | -------------- |
| 2022-07-05 10:05:00 | 30 |
and so on for every 5 minutes without rounding
CodePudding user response:
A bit of a code review first
def V_t_5_Min():
df = pd.DataFrame(pd.read_csv(Ex_Csv, usecols = ["Time", 'value']))
a = df['Time'] = pd.to_datetime(df['Time'])
b = df['value']
c = pd.Series(a).dt.round("5T", 0)
df = pd.DataFrame(c)
df['value'] = b
c = df.groupby('Time')
df1 = pd.DataFrame(c.sum())
df1.to_csv('Datasets\CSV\V_5_min.csv')
What you could try to do - play around with the custom_round
to achieve the desired rounding.
def custom_round(x):
if x.minute % 10 == 5:
# deal with the 10:05:xx case
x = x.replace(minute=x.minute - 1)
x = x.ceil('5T')
return x
# data = [[e.strip() for e in line.split('|') if e.strip()]for line in data_.split("\n") if '---' not in line]
# df_ = pd.DataFrame(data[1:], columns = data[0])
df_ = pd.DataFrame(pd.read_csv(Ex_Csv, usecols = ["Time", 'value']))
df_['Time'] = df_['Time'].apply(pd.to_datetime)
df_['rounded_time'] = df_['Time'].apply(custom_round)
df_
this is the result from running the code on your data
Time value rounded_time
0 2022-07-05 10:00:23 2 2022-07-05 10:05:00
1 2022-07-05 10:01:11 5 2022-07-05 10:05:00
2 2022-07-05 10:01:16 1 2022-07-05 10:05:00
3 2022-07-05 10:01:33 2 2022-07-05 10:05:00
4 2022-07-05 10:04:26 15 2022-07-05 10:05:00
5 2022-07-05 10:05:57 1 2022-07-05 10:05:00
6 2022-07-05 10:05:57 1 2022-07-05 10:05:00
7 2022-07-05 10:05:57 3 2022-07-05 10:05:00
CodePudding user response:
You could use the dt.ceil()
function to round up to the next 5 second interval and then use a groupby to sum matching rows:
import pandas as pd
df = pd.DataFrame(pd.read_csv('input.csv', usecols=["Time", 'value'], parse_dates=[0]))
df['Ceil'] = df['Time'].dt.ceil("5T", 0)
print(df.groupby('Ceil').agg('sum'))
Giving you:
value
Ceil
2022-07-05 10:05:00 25
2022-07-05 10:10:00 5