I have the following dataframe in Python:
ID | country_ID | visit_time |
---|---|---|
0 | ESP | 10 days 12:03:00 |
0 | ESP | 5 days 02:03:00 |
0 | ENG | 5 days 10:02:00 |
1 | ENG | 3 days 08:05:03 |
1 | ESP | 1 days 03:02:00 |
1 | ENG | 2 days 07:01:03 |
2 | ENG | 0 days 12:01:02 |
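For reproducibility, the table above can be rebuilt roughly as follows. This is a minimal sketch: the variable name `df` and the use of `pd.to_timedelta` to parse the strings are assumptions, chosen so that `visit_time` ends up with a timedelta dtype.

```python
import pandas as pd

# Rebuild the sample data; visit_time is parsed into timedelta64 values.
df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 1, 2],
    "country_ID": ["ESP", "ESP", "ENG", "ENG", "ESP", "ENG", "ENG"],
    "visit_time": pd.to_timedelta([
        "10 days 12:03:00",
        "5 days 02:03:00",
        "5 days 10:02:00",
        "3 days 08:05:03",
        "1 days 03:02:00",
        "2 days 07:01:03",
        "0 days 12:01:02",
    ]),
})
print(df.dtypes)
```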
For each ID, I want to calculate the standard deviation of visit_time within each country_ID group and store the results in two columns, std_visit_ESP and std_visit_ENG:
- std_visit_ESP: standard deviation of visit_time for rows with country_ID = ESP, per ID.
- std_visit_ENG: standard deviation of visit_time for rows with country_ID = ENG, per ID.
The desired output is:
ID | std_visit_ESP | std_visit_ENG |
---|---|---|
0 | 2 days 17:00:00 | 0 days 00:00:00 |
1 | 0 days 00:00:00 | 0 days 12:32:00 |
2 | NaT | 0 days 00:00:00 |
With the groupby method for the mean, you can specify the parameter numeric_only = False, but the std method of groupby does not include this option.
My idea is to convert the timedelta to seconds, calculate the standard deviation and then convert it back to timedelta. Here is an example:
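from datetime import timedelta
import numpy as np
import pandas as pd
# Convert each timedelta to seconds
td1 = timedelta(days=10, hours=12, minutes=3).total_seconds()
td2 = timedelta(days=5, hours=2, minutes=3).total_seconds()
arr = [td1, td2]
# np.std computes the population standard deviation (ddof=0) by default
std_seconds = np.std(arr)
# Convert the result back to a timedelta
show_s = pd.to_timedelta(std_seconds, unit='s')
print(show_s)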
I don't know how to use this with groupby to get the desired result. I am grateful for your help.
CodePudding user response:
If I understand correctly, this should work for you (note that it groups by country_ID only, across all IDs):
stddevs = (
    df['visit_time'].dt.total_seconds()
    .groupby([df['country_ID']])
    .std()
    .apply(lambda x: pd.Timedelta(seconds=x))
)
Output:
>>> stddevs
country_ID
ENG 2 days 01:17:43.835702
ESP 4 days 16:40:16.598773
Name: visit_time, dtype: timedelta64[ns]
To format the result as a single row with prefixed column names:
stddevs = (
    df['visit_time'].dt.total_seconds()
    .groupby([df['country_ID']])
    .std()
    .apply(lambda x: pd.Timedelta(seconds=x))
    .to_frame().T
    .add_prefix('std_visit_')
    .reset_index(drop=True)
    .rename_axis(None, axis=1)
)
Output:
>>> stddevs
            std_visit_ENG          std_visit_ESP
0 2 days 01:17:43.835702 4 days 16:40:16.598773
CodePudding user response:
Use GroupBy.std and pd.to_timedelta, grouping by both ID and country_ID:
std_visit = (
    pd.to_timedelta(
        df['visit_time'].dt.total_seconds()
        .groupby([df['ID'], df['country_ID']])
        .std(),
        unit='s',  # lowercase 's'; the uppercase 'S' alias is deprecated in recent pandas
    )
    .unstack()
    .fillna(pd.Timedelta(days=0))
)
print(std_visit)
country_ID                       ENG                       ESP
ID
0                    0 days 00:00:00 3 days 19:55:25.973595304
1          0 days 17:43:29.315934274           0 days 00:00:00
2                    0 days 00:00:00           0 days 00:00:00
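The values above differ from the desired table in the question because GroupBy.std defaults to the sample standard deviation (ddof=1), while np.std in the question's snippet uses ddof=0. A possible variant, assuming the population standard deviation is what is wanted, passes ddof=0 and skips the fillna step so that ID/country pairs with no rows stay NaT. The sample data is rebuilt here so the sketch runs on its own; the name `std_visit` is only illustrative.

```python
import pandas as pd

# Sample data from the question (rebuilt so this block is self-contained).
df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 1, 2],
    "country_ID": ["ESP", "ESP", "ENG", "ENG", "ESP", "ENG", "ENG"],
    "visit_time": pd.to_timedelta([
        "10 days 12:03:00", "5 days 02:03:00", "5 days 10:02:00",
        "3 days 08:05:03", "1 days 03:02:00", "2 days 07:01:03",
        "0 days 12:01:02",
    ]),
})

# Population std (ddof=0) of the visit time in seconds, per (ID, country_ID).
std_seconds = (
    df["visit_time"].dt.total_seconds()
    .groupby([df["ID"], df["country_ID"]])
    .std(ddof=0)
)

std_visit = (
    pd.to_timedelta(std_seconds, unit="s")  # seconds back to timedelta
    .unstack()                              # one column per country; missing pairs become NaT
    .add_prefix("std_visit_")
    .rename_axis(None, axis=1)              # drop the "country_ID" columns label
    .reset_index()
)

# Match the column order of the desired table.
std_visit = std_visit[["ID", "std_visit_ESP", "std_visit_ENG"]]
print(std_visit)
```

With ddof=0, a group containing a single visit has a standard deviation of zero rather than NaN, so only combinations that never occur (here ID 2 with ESP) remain missing.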