Python pandas sort columns in csv by time with an interval of 5 minutes without rounding and sum-CodePudding

A csv file with 600 thousand lines looks something like this.


| Time | value  |
| --------- | -------------- |
| 2022-07-05 10:00:23   | 2      |
| 2022-07-05 10:01:11    | 5         |
| 2022-07-05 10:01:16   | 1      |
| 2022-07-05 10:01:33    | 2         |
| 2022-07-05 10:04:26   | 15      |
| 2022-07-05 10:05:57    | 1         |
| 2022-07-05 10:05:57   | 1      |
| 2022-07-05 10:05:57    | 3         |

And so on until 23:59:59

I'm trying something like

def V_t_5_Min():
    df = pd.DataFrame(pd.read_csv(Ex_Csv,  usecols = ["Time", 'value']))
    a = df['Time'] = pd.to_datetime(df['Time'])
    b = df['value']
    c = pd.Series(a).dt.round("5T", 0)
    df = pd.DataFrame(c)
    df['value'] = b
    c = df.groupby('Time')
    df1 = pd.DataFrame(c.sum())
    df1.to_csv('Datasets\CSV\V_5_min.csv')

But I get terrible rounding and incorrect calculations of the value

The desired result for an interval of 5 minutes

| Time | value  |
| --------- | -------------- |
| 2022-07-05 10:05:00   | 30      |

and so on for every 5 minutes without rounding

CodePudding user response：

A bit of a code review first

def V_t_5_Min():
    df = pd.DataFrame(pd.read_csv(Ex_Csv,  usecols = ["Time", 'value']))
    a = df['Time'] = pd.to_datetime(df['Time'])
    b = df['value']
    c = pd.Series(a).dt.round("5T", 0)
    df = pd.DataFrame(c)
    df['value'] = b
    c = df.groupby('Time')
    df1 = pd.DataFrame(c.sum())
    df1.to_csv('Datasets\CSV\V_5_min.csv')

What you could try to do - play around with the custom_round to achieve the desired rounding.

def custom_round(x):
    if x.minute % 10 == 5:
        # deal with the 10:05:xx case
        x = x.replace(minute=x.minute - 1)

    x = x.ceil('5T')
    return x


# data = [[e.strip() for e in line.split('|') if e.strip()]for line in data_.split("\n") if '---' not in line]
# df_ = pd.DataFrame(data[1:], columns = data[0])
df_ = pd.DataFrame(pd.read_csv(Ex_Csv,  usecols = ["Time", 'value']))
df_['Time'] = df_['Time'].apply(pd.to_datetime)
df_['rounded_time'] = df_['Time'].apply(custom_round)
df_

this is the result from running the code on your data

    Time    value   rounded_time
0   2022-07-05 10:00:23 2   2022-07-05 10:05:00
1   2022-07-05 10:01:11 5   2022-07-05 10:05:00
2   2022-07-05 10:01:16 1   2022-07-05 10:05:00
3   2022-07-05 10:01:33 2   2022-07-05 10:05:00
4   2022-07-05 10:04:26 15  2022-07-05 10:05:00
5   2022-07-05 10:05:57 1   2022-07-05 10:05:00
6   2022-07-05 10:05:57 1   2022-07-05 10:05:00
7   2022-07-05 10:05:57 3   2022-07-05 10:05:00

CodePudding user response：

You could use the dt.ceil() function to round up to the next 5 second interval and then use a groupby to sum matching rows:

import pandas as pd

    
df = pd.DataFrame(pd.read_csv('input.csv',  usecols=["Time", 'value'], parse_dates=[0]))
df['Ceil'] = df['Time'].dt.ceil("5T", 0)

print(df.groupby('Ceil').agg('sum'))

Giving you:

                     value
Ceil                      
2022-07-05 10:05:00     25
2022-07-05 10:10:00      5