How do I split up a duration while upsampling a DataFrame, as in the example below? And can I replace the for loop with, e.g., the groupby function?
I want to use pandas to transform data like this:
activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
into this:
time started Bedtime videos Commute
2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00
2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06
...
And I got this far:
import pandas as pd
df = pd.DataFrame({
    'activity name': ['Bedtime', 'videos', 'Commute'],
    'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54"],
    'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34"]})
# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])
# calculating the duration
df['duration'] = df['time ended'] - df['time started']
# changing the index
df.index = df['time started']
df = df.drop(columns=['time started', 'time ended'])
for a in df['activity name'].unique():
    df[a] = (df['activity name'] == a) * df['duration']
df = df.drop(columns=['activity name', 'duration'])
df.resample('H').first()
                             Bedtime          videos          Commute
time started
2021-10-25 00:00:00  0 days 08:25:42 0 days 00:00:00  0 days 00:00:00
2021-10-25 01:00:00 NaT NaT NaT
2021-10-25 02:00:00 NaT NaT NaT
2021-10-25 03:00:00 NaT NaT NaT
2021-10-25 04:00:00 NaT NaT NaT
2021-10-25 05:00:00 NaT NaT NaT
2021-10-25 06:00:00 NaT NaT NaT
2021-10-25 07:00:00 NaT NaT NaT
2021-10-25 08:00:00 0 days 00:00:00 0 days 00:26:12 0 days 00:00:00
CodePudding user response:
Try this:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
    date=[
        # expand each interval to one row per second; the half-open
        # [start, end) range keeps boundary seconds from being counted twice
        pd.date_range(s, e, freq="s", inclusive="left")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")
df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
)
print(df_out)
Output:
activity name                Bedtime          Commute           videos
date
2021-10-25 00:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 01:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 02:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 03:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 04:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 05:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 06:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 07:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 08:00:00  0 days 00:25:42  0 days 00:08:06  0 days 00:26:12
2021-10-25 09:00:00              NaT  0 days 00:29:34              NaT
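One detail worth noting: pd.date_range includes both endpoints by default, so a per-second expansion credits each boundary second to both adjacent activities (a naive version reports 0 days 00:25:43 for Bedtime instead of 0 days 00:25:42). Passing inclusive="left" (available in pandas 1.4+) makes each range half-open. A minimal check:

```python
import pandas as pd

# default: both endpoints included -> 61 stamps for a 60-second interval
both = pd.date_range("2021-10-25 00:00:00", "2021-10-25 00:01:00", freq="s")

# half-open [start, end): exactly 60 stamps, one per elapsed second
left = pd.date_range("2021-10-25 00:00:00", "2021-10-25 00:01:00",
                     freq="s", inclusive="left")

print(len(both), len(left))  # 61 60
```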
To address @DerekO's comment:
import pandas as pd
from io import StringIO
txtfile = StringIO(
""" activity name time started time ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Bedtime 2021-10-25 11:00:00 2021-10-25 13:04:31"""
)
df = pd.read_csv(txtfile, sep=r"\s\s+", engine="python")
df[["time started", "time ended"]] = df[["time started", "time ended"]].apply(
pd.to_datetime
)
df_e = df.assign(
    date=[
        # expand each interval to one row per second; the half-open
        # [start, end) range keeps boundary seconds from being counted twice
        pd.date_range(s, e, freq="s", inclusive="left")
        for s, e in zip(df["time started"], df["time ended"])
    ]
).explode("date")
df_out = (
    df_e.groupby(["activity name", pd.Grouper(key="date", freq="H")])["activity name"]
    .count()
    .unstack(0)
    .apply(pd.to_timedelta, unit="s")
    .sort_index()
)
print(df_out)
Output:
activity name                Bedtime          Commute           videos
date
2021-10-25 00:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 01:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 02:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 03:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 04:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 05:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 06:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 07:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 08:00:00  0 days 00:25:42  0 days 00:08:06  0 days 00:26:12
2021-10-25 09:00:00              NaT  0 days 00:29:34              NaT
2021-10-25 11:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 12:00:00  0 days 01:00:00              NaT              NaT
2021-10-25 13:00:00  0 days 00:04:31              NaT              NaT
CodePudding user response:
Although I agree that a groupby/resample solution would be best, I couldn't make one work. You can instead brute-force the problem by creating a new DataFrame for every row of your original DataFrame and concatenating them together.
The way it works is that we use pd.date_range to create a DatetimeIndex between the floor of the start time and the floor of the end time, and then insert the exact start and end times into that DatetimeIndex as well. The differences between consecutive datetimes in this index become the values of your new DataFrame.
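A small standalone sketch of that mechanism for a single activity interval (timestamps taken from the example data):

```python
import pandas as pd

# one activity running 08:25:42 -> 09:29:34
start = pd.Timestamp("2021-10-25 08:25:42")
end = pd.Timestamp("2021-10-25 09:29:34")

# hourly grid between the floored endpoints, plus the exact start/end times
grid = pd.date_range(start.floor("1h"), end.floor("1h"), freq="1h")
all_times = grid.union([start, end])

# consecutive differences = time spent in each slice;
# shift(-1) drops the leading NaT that diff() produces
per_slice = all_times.to_series().diff().shift(-1)
print(per_slice)
```

The entries at 08:00:00 and 09:00:00 carry the time spent in those hours (00:25:42 is the grid-to-start gap, 00:29:34 the hour-to-end gap); the last entry is NaT.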
To try to make my solution as robust as possible, I added two additional rows to your original DataFrame with a repeated category, and tested situations where the starting time falls exactly on the hour versus ahead of the hour.
import pandas as pd

df = pd.DataFrame({
    'activity name': ['Bedtime', 'videos', 'Commute', 'Work', 'Commute'],
    'time started': ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00"],
    'time ended': ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34", "2021-10-25 17:00:00", "2021-10-25 18:01:00"]})
# converting strings to datetime
df['time ended'] = pd.to_datetime(df['time ended'])
df['time started'] = pd.to_datetime(df['time started'])
## column names with spaces can't be accessed by name when using itertuples to iterate through the df
df.columns = [col.replace(" ", "_") for col in df.columns]
Starting df:
>>> df
activity_name time_started time_ended
0 Bedtime 2021-10-25 00:00:00 2021-10-25 08:25:42
1 videos 2021-10-25 08:25:42 2021-10-25 08:51:54
2 Commute 2021-10-25 08:51:54 2021-10-25 09:29:34
3 Work 2021-10-25 09:29:34 2021-10-25 17:00:00
4 Commute 2021-10-25 17:00:00 2021-10-25 18:01:00
## set up the output columns in the order the activities first appear
new_columns = list(df.activity_name.unique())
frames = []
for row in df.itertuples(index=True):
    new_row = {}
    daterange_start = row.time_started.floor('1h')
    daterange_end = row.time_ended.floor('1h')
    datetimes_index = pd.date_range(daterange_start, daterange_end, freq='1h')
    all_datetimes = datetimes_index.union([row.time_started, row.time_ended])
    ## take the difference and shift by -1 to drop the first NaT
    new_row[row.activity_name] = all_datetimes.to_series().diff().shift(-1)
    ## if the first row starts in the middle of an hour, we don't want the
    ## difference between the top of the hour and that row's actual start time
    if (row.Index == 0) and (row.time_started > daterange_start):
        frames.append(pd.DataFrame(new_row)[1:])
    else:
        frames.append(pd.DataFrame(new_row))
## DataFrame.append was removed in pandas 2.0, so concatenate instead
df_new = pd.concat(frames).reindex(columns=new_columns)
df_new.index.name = 'time_started'
df_new.reset_index(inplace=True)
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
9 2021-10-25 08:25:42 NaT NaT NaT NaT
10 2021-10-25 08:00:00 NaT 0 days 00:25:42 NaT NaT
11 2021-10-25 08:25:42 NaT 0 days 00:26:12 NaT NaT
12 2021-10-25 08:51:54 NaT NaT NaT NaT
13 2021-10-25 08:00:00 NaT NaT 0 days 00:51:54 NaT
14 2021-10-25 08:51:54 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
16 2021-10-25 09:29:34 NaT NaT NaT NaT
17 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:29:34
18 2021-10-25 09:29:34 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
26 2021-10-25 17:00:00 NaT NaT NaT NaT
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
29 2021-10-25 18:01:00 NaT NaT NaT NaT
For each activity we created a new DataFrame of the differences between consecutive times with all_datetimes.to_series().diff().shift(-1), which leaves a NaT at each change of activity. These aren't useful, so we drop any rows where the activities are all NaT.
We then drop duplicate timestamps in the time_started column (keeping the first of each duplicate) and take the floor of all timestamps in the time_started column:
df_new = df_new.dropna(subset=new_columns, how='all').drop_duplicates(subset=['time_started'], keep='first')
df_new['time_started'] = df_new['time_started'].dt.floor('1h')
Result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 NaT NaT NaT
1 2021-10-25 01:00:00 0 days 01:00:00 NaT NaT NaT
2 2021-10-25 02:00:00 0 days 01:00:00 NaT NaT NaT
3 2021-10-25 03:00:00 0 days 01:00:00 NaT NaT NaT
4 2021-10-25 04:00:00 0 days 01:00:00 NaT NaT NaT
5 2021-10-25 05:00:00 0 days 01:00:00 NaT NaT NaT
6 2021-10-25 06:00:00 0 days 01:00:00 NaT NaT NaT
7 2021-10-25 07:00:00 0 days 01:00:00 NaT NaT NaT
8 2021-10-25 08:00:00 0 days 00:25:42 NaT NaT NaT
11 2021-10-25 08:00:00 NaT 0 days 00:26:12 NaT NaT
14 2021-10-25 08:00:00 NaT NaT 0 days 00:08:06 NaT
15 2021-10-25 09:00:00 NaT NaT 0 days 00:29:34 NaT
18 2021-10-25 09:00:00 NaT NaT NaT 0 days 00:30:26
19 2021-10-25 10:00:00 NaT NaT NaT 0 days 01:00:00
20 2021-10-25 11:00:00 NaT NaT NaT 0 days 01:00:00
21 2021-10-25 12:00:00 NaT NaT NaT 0 days 01:00:00
22 2021-10-25 13:00:00 NaT NaT NaT 0 days 01:00:00
23 2021-10-25 14:00:00 NaT NaT NaT 0 days 01:00:00
24 2021-10-25 15:00:00 NaT NaT NaT 0 days 01:00:00
25 2021-10-25 16:00:00 NaT NaT NaT 0 days 01:00:00
27 2021-10-25 17:00:00 NaT NaT 0 days 01:00:00 NaT
28 2021-10-25 18:00:00 NaT NaT 0 days 00:01:00 NaT
Now we fill all NaT with pd.Timedelta(0), then we can group by the values in the time_started column and sum them together:
df_new = df_new.fillna(pd.Timedelta(0)).groupby("time_started").sum().reset_index()
Final result:
>>> df_new
time_started Bedtime videos Commute Work
0 2021-10-25 00:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
1 2021-10-25 01:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
2 2021-10-25 02:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
3 2021-10-25 03:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
4 2021-10-25 04:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
5 2021-10-25 05:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
6 2021-10-25 06:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
7 2021-10-25 07:00:00 0 days 01:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00
8 2021-10-25 08:00:00 0 days 00:25:42 0 days 00:26:12 0 days 00:08:06 0 days 00:00:00
9 2021-10-25 09:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:29:34 0 days 00:30:26
10 2021-10-25 10:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
11 2021-10-25 11:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
12 2021-10-25 12:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
13 2021-10-25 13:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
14 2021-10-25 14:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
15 2021-10-25 15:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
16 2021-10-25 16:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00
17 2021-10-25 17:00:00 0 days 00:00:00 0 days 00:00:00 0 days 01:00:00 0 days 00:00:00
18 2021-10-25 18:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:01:00 0 days 00:00:00
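Whichever approach you use, a quick sanity check is that the per-hour columns should sum back to each activity's total duration. A minimal sketch on the original three-row data:

```python
import pandas as pd

# the original three activities
df = pd.DataFrame({
    'activity_name': ['Bedtime', 'videos', 'Commute'],
    'time_started': pd.to_datetime(
        ["2021-10-25 00:00:00", "2021-10-25 08:25:42", "2021-10-25 08:51:54"]),
    'time_ended': pd.to_datetime(
        ["2021-10-25 08:25:42", "2021-10-25 08:51:54", "2021-10-25 09:29:34"]),
})

# total duration per activity; summing each upsampled column must reproduce these
totals = (df['time_ended'] - df['time_started']).groupby(df['activity_name']).sum()
print(totals['Bedtime'])  # 0 days 08:25:42
```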