Split up duration while upsampling dataframe


How do I split up a duration while upsampling a DataFrame, as in the example below? And can I replace the for loop with, e.g., the groupby function?

I want to use pandas to transform data like this:

  activity name         time started           time ended
0       Bedtime  2021-10-25 00:00:00  2021-10-25 08:25:42
1        videos  2021-10-25 08:25:42  2021-10-25 08:51:54
2       Commute  2021-10-25 08:51:54  2021-10-25 09:29:34

into this:

time started        Bedtime         videos           Commute                   
2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00  0 days
2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12  0 days 00:08:06
...

And I get this far:

import pandas as pd

df = pd.DataFrame({
    'activity name': ['Bedtime', 'videos', 'Commute'],
    'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54"],
    'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34"]})

# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])

# calculating the duration
df['duration'] = df['time ended'] - df['time started']

# changing the index
df.index = df['time started']
df = df.drop(columns=['time started', 'time ended'])

# one timedelta column per activity
for a in df['activity name'].unique():
    df[a] = (df['activity name'] == a) * df['duration']

df = df.drop(columns=['activity name', 'duration'])
df.resample('H').first()
                            Bedtime          videos Commute
time started
2021-10-25 00:00:00 0 days 08:25:42 0 days 00:00:00  0 days
2021-10-25 01:00:00             NaT             NaT     NaT
2021-10-25 02:00:00             NaT             NaT     NaT
2021-10-25 03:00:00             NaT             NaT     NaT
2021-10-25 04:00:00             NaT             NaT     NaT
2021-10-25 05:00:00             NaT             NaT     NaT
2021-10-25 06:00:00             NaT             NaT     NaT
2021-10-25 07:00:00             NaT             NaT     NaT
2021-10-25 08:00:00 0 days 00:00:00 0 days 00:26:12  0 days

CodePudding user response:

Try this:

import pandas as pd
from io import StringIO

txtfile = StringIO(
    """  activity name         time started           time ended
0       Bedtime  2021-10-25 00:00:00  2021-10-25 08:25:42
1        videos  2021-10-25 08:25:42  2021-10-25 08:51:54
2       Commute  2021-10-25 08:51:54  2021-10-25 09:29:34"""
)

df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")

df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
    pd.to_datetime
)
df_e = df.assign(
    date=[
        pd.date_range(s, e, freq="s")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")

df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
)

print(df_out)

Output:

activity name               Bedtime         Commute          videos
date                                                               
2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00             NaT 0 days 00:29:35             NaT
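Note that the per-hour totals come out one second high per interval (e.g. 0 days 00:25:43 for Bedtime instead of 00:25:42), because pd.date_range includes both endpoints. A minimal sketch of the counting idea, on a made-up 90-minute interval:

```python
import pandas as pd

# a single made-up interval, expanded to one timestamp per second
secs = pd.date_range("2021-10-25 08:30:00", "2021-10-25 10:00:00", freq="s")

# counting the per-second stamps in each hourly bucket gives the seconds spent there
per_hour = secs.to_series().groupby(pd.Grouper(freq="h")).count()
print(per_hour)
# 08:00 -> 1800, 09:00 -> 3600, 10:00 -> 1 (the inclusive right endpoint)
```

Passing `closed`/`inclusive` options to pd.date_range, or simply ignoring the extra second, are both ways to deal with this.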

To address @DerekO's comment:

import pandas as pd
from io import StringIO

txtfile = StringIO(
    """  activity name         time started           time ended
0       Bedtime  2021-10-25 00:00:00  2021-10-25 08:25:42
1        videos  2021-10-25 08:25:42  2021-10-25 08:51:54
2       Commute  2021-10-25 08:51:54  2021-10-25 09:29:34
3       Bedtime  2021-10-25 11:00:00  2021-10-25 13:04:31"""
)

df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")

df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
    pd.to_datetime
)
df_e = df.assign(
    date=[
        pd.date_range(s, e, freq="s")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")

df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
    .sort_index()
)

print(df_out)

Output:

activity name               Bedtime         Commute          videos
date                                                               
2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 08:00:00 0 days 00:25:43 0 days 00:08:06 0 days 00:26:13
2021-10-25 09:00:00             NaT 0 days 00:29:35             NaT
2021-10-25 11:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 12:00:00 0 days 01:00:00             NaT             NaT
2021-10-25 13:00:00 0 days 00:04:32             NaT             NaT

CodePudding user response:

Although I agree that using groupby and resample would be best, I couldn't make such a solution work. You can instead brute force the problem by creating a new DataFrame for every row of your original DataFrame, and concatenating them together.

The way it works is that we use pd.date_range to create a DatetimeIndex between the floor of the start and end times, and the exact start and end times are inserted into that DatetimeIndex as well. The differences between consecutive datetimes in this DatetimeIndex then become the values of your new DataFrame.
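That floor/insert/diff step, sketched on the single Commute interval from the question (standalone variables for illustration, not the loop below):

```python
import pandas as pd

start = pd.Timestamp("2021-10-25 08:51:54")
end = pd.Timestamp("2021-10-25 09:29:34")

# hourly grid spanning the interval, floored to the hour
hours = pd.date_range(start.floor("h"), end.floor("h"), freq="h")

# insert the exact start and end times into the grid
all_times = hours.union([start, end])

# difference between consecutive timestamps = time spent in each slice;
# shift(-1) drops the leading NaT that diff() produces
durations = all_times.to_series().diff().shift(-1)
print(durations)
# 08:00:00 -> 0 days 00:51:54 (before the activity started; handled later),
# 08:51:54 -> 0 days 00:08:06, 09:00:00 -> 0 days 00:29:34, 09:29:34 -> NaT
```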

To try to make my solution as robust as possible, I added two additional rows to your original DataFrame with a repeated category, and tested situations where the starting time falls exactly on the hour versus partway through the hour.

import pandas as pd

df = pd.DataFrame({
    'activity name': ['Bedtime', 'videos', 'Commute', 'Work', 'Commute'],
    'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00"],
    'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00", "2021-10-25 18:01:00"]})

# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])

## column names with spaces can't be accessed by name when using itertuples to iterate through the df
df.columns = [col.replace(" ", "_") for col in df.columns]

Starting df:

>>> df
  activity_name        time_started          time_ended
0       Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1        videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2       Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3          Work 2021-10-25 09:29:34 2021-10-25 17:00:00
4       Commute 2021-10-25 17:00:00 2021-10-25 18:01:00

## we use the start and end times to determine what date range we create
start_time = df['time_started'].min().floor('h')
end_time = df['time_ended'].max().ceil('h')

## setup an empty DataFrame to hold the final result
new_columns = list(df.activity_name.unique())
df_new = pd.DataFrame(columns=new_columns)

for row in df.itertuples(index=True):
    new_row = {}
    daterange_start = row.time_started.floor('1h')
    daterange_end = row.time_ended.floor('1h')
    datetimes_index = pd.date_range(daterange_start, daterange_end, freq='1h')

    all_datetimes = datetimes_index.union([row.time_started, row.time_ended])

    ## take the difference and shift by -1 to drop the first NaT
    new_row[row.activity_name] = all_datetimes.to_series().diff().shift(-1)

    ## if the first row starts in the middle of an hour, we don't want the difference
    ## between the beginning of the hour and the time in that row
    if (row.Index == 0) & (row.time_started > daterange_start):
        df_new = pd.concat([df_new, pd.DataFrame(new_row)])[1:]
    else:
        df_new = pd.concat([df_new, pd.DataFrame(new_row)])

df_new.index.name = 'time_started'
df_new.reset_index(inplace=True)

Result:

>>> df_new
          time_started         Bedtime          videos         Commute            Work
0  2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT             NaT
1  2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT             NaT
2  2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT             NaT
3  2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT             NaT
4  2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT             NaT
5  2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT             NaT
6  2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT             NaT
7  2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT             NaT
8  2021-10-25 08:00:00 0 days 00:25:42             NaT             NaT             NaT
9  2021-10-25 08:25:42             NaT             NaT             NaT             NaT
10 2021-10-25 08:00:00             NaT 0 days 00:25:42             NaT             NaT
11 2021-10-25 08:25:42             NaT 0 days 00:26:12             NaT             NaT
12 2021-10-25 08:51:54             NaT             NaT             NaT             NaT
13 2021-10-25 08:00:00             NaT             NaT 0 days 00:51:54             NaT
14 2021-10-25 08:51:54             NaT             NaT 0 days 00:08:06             NaT
15 2021-10-25 09:00:00             NaT             NaT 0 days 00:29:34             NaT
16 2021-10-25 09:29:34             NaT             NaT             NaT             NaT
17 2021-10-25 09:00:00             NaT             NaT             NaT 0 days 00:29:34
18 2021-10-25 09:29:34             NaT             NaT             NaT 0 days 00:30:26
19 2021-10-25 10:00:00             NaT             NaT             NaT 0 days 01:00:00
20 2021-10-25 11:00:00             NaT             NaT             NaT 0 days 01:00:00
21 2021-10-25 12:00:00             NaT             NaT             NaT 0 days 01:00:00
22 2021-10-25 13:00:00             NaT             NaT             NaT 0 days 01:00:00
23 2021-10-25 14:00:00             NaT             NaT             NaT 0 days 01:00:00
24 2021-10-25 15:00:00             NaT             NaT             NaT 0 days 01:00:00
25 2021-10-25 16:00:00             NaT             NaT             NaT 0 days 01:00:00
26 2021-10-25 17:00:00             NaT             NaT             NaT             NaT
27 2021-10-25 17:00:00             NaT             NaT 0 days 01:00:00             NaT
28 2021-10-25 18:00:00             NaT             NaT 0 days 00:01:00             NaT
29 2021-10-25 18:01:00             NaT             NaT             NaT             NaT

For each activity we created a new DataFrame, obtaining the differences between times with all_datetimes.to_series().diff().shift(-1), which leaves a NaT at each change of activity. These aren't useful, so we will drop any rows where the activities are all NaT.

We then drop duplicate timestamps in the time_started column and keep the first value of these duplicates, and take the floor of all timestamps in the time_started column:

df_new = df_new.dropna(subset=new_columns, how='all').drop_duplicates(subset=['time_started'], keep='first')
df_new['time_started'] = df_new['time_started'].dt.floor('1h')

Result:

>>> df_new
          time_started         Bedtime          videos         Commute            Work
0  2021-10-25 00:00:00 0 days 01:00:00             NaT             NaT             NaT
1  2021-10-25 01:00:00 0 days 01:00:00             NaT             NaT             NaT
2  2021-10-25 02:00:00 0 days 01:00:00             NaT             NaT             NaT
3  2021-10-25 03:00:00 0 days 01:00:00             NaT             NaT             NaT
4  2021-10-25 04:00:00 0 days 01:00:00             NaT             NaT             NaT
5  2021-10-25 05:00:00 0 days 01:00:00             NaT             NaT             NaT
6  2021-10-25 06:00:00 0 days 01:00:00             NaT             NaT             NaT
7  2021-10-25 07:00:00 0 days 01:00:00             NaT             NaT             NaT
8  2021-10-25 08:00:00 0 days 00:25:42             NaT             NaT             NaT
11 2021-10-25 08:00:00             NaT 0 days 00:26:12             NaT             NaT
14 2021-10-25 08:00:00             NaT             NaT 0 days 00:08:06             NaT
15 2021-10-25 09:00:00             NaT             NaT 0 days 00:29:34             NaT
18 2021-10-25 09:00:00             NaT             NaT             NaT 0 days 00:30:26
19 2021-10-25 10:00:00             NaT             NaT             NaT 0 days 01:00:00
20 2021-10-25 11:00:00             NaT             NaT             NaT 0 days 01:00:00
21 2021-10-25 12:00:00             NaT             NaT             NaT 0 days 01:00:00
22 2021-10-25 13:00:00             NaT             NaT             NaT 0 days 01:00:00
23 2021-10-25 14:00:00             NaT             NaT             NaT 0 days 01:00:00
24 2021-10-25 15:00:00             NaT             NaT             NaT 0 days 01:00:00
25 2021-10-25 16:00:00             NaT             NaT             NaT 0 days 01:00:00
27 2021-10-25 17:00:00             NaT             NaT 0 days 01:00:00             NaT
28 2021-10-25 18:00:00             NaT             NaT 0 days 00:01:00             NaT

Now we fill all NaT values with pd.Timedelta(0), then we can group by the values in the time_started column and sum them together:

df_new = df_new.fillna(pd.Timedelta(0)).groupby("time_started").sum().reset_index()

Final result:

>>> df_new
          time_started         Bedtime          videos         Commute            Work
0  2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
1  2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
2  2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
3  2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
4  2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
5  2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
6  2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
7  2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
8  2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06 0 days 00:00:00
9  2021-10-25 09:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:29:34 0 days 00:30:26
10 2021-10-25 10:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
11 2021-10-25 11:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
12 2021-10-25 12:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
13 2021-10-25 13:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
14 2021-10-25 14:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
15 2021-10-25 15:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
16 2021-10-25 16:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
17 2021-10-25 17:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00 0 days 00:00:00
18 2021-10-25 18:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 0 days 00:00:00