Calculate standard deviation columns for timedelta elements

I have the following dataframe in Python:

ID	country_ID	visit_time
0	ESP	10 days 12:03:00
0	ESP	5 days 02:03:00
0	ENG	5 days 10:02:00
1	ENG	3 days 08:05:03
1	ESP	1 days 03:02:00
1	ENG	2 days 07:01:03
2	ENG	0 days 12:01:02

For each ID, I want to calculate the standard deviation of visit_time within each country_ID group. The result should have two columns, std_visit_ESP and std_visit_ENG:

std_visit_ESP: standard deviation of visit_time where country_ID = ESP, for each ID.
std_visit_ENG: standard deviation of visit_time where country_ID = ENG, for each ID.

Expected output:

ID	std_visit_ESP	std_visit_ENG
0	2 days 17:00:00	0 days 00:00:00
1	0 days 00:00:00	0 days 12:32:00
2	NaT	0 days 00:00:00

When computing the mean with groupby, you can pass numeric_only=False, but the groupby std method does not accept this parameter.

My idea is to convert the timedelta to seconds, calculate the standard deviation and then convert it back to timedelta. Here is an example:

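    from datetime import timedelta

    import numpy as np
    import pandas as pd

    # Convert each timedelta to seconds so that np.std can handle it
    td1 = timedelta(days=10, hours=12, minutes=3).total_seconds()
    td2 = timedelta(days=5, hours=2, minutes=3).total_seconds()
    arr = [td1, td2]
    std_seconds = np.std(arr)  # standard deviation in seconds
    show_s = pd.to_timedelta(std_seconds, unit='s')
    print(show_s)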

I don't know how to use this with groupby to get the desired result. I am grateful for your help.
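
One detail worth noting about the snippet above: np.std defaults to the population standard deviation (ddof=0), whereas pandas' Series.std and GroupBy.std default to the sample standard deviation (ddof=1). The expected values in the table above (for example 2 days 17:00:00 for ID 0 and ESP) correspond to ddof=0. A small sketch of the difference, using only the two ESP visits of ID 0 (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# The two ESP visits of ID 0 from the question
visits = pd.Series(pd.to_timedelta(["10 days 12:03:00", "5 days 02:03:00"]))
seconds = visits.dt.total_seconds()

# NumPy default: population standard deviation (ddof=0)
population_std = pd.to_timedelta(np.std(seconds.to_numpy()), unit="s")

# pandas default: sample standard deviation (ddof=1)
sample_std = pd.to_timedelta(seconds.std(), unit="s")

# pandas matches NumPy once ddof=0 is passed explicitly
matched_std = pd.to_timedelta(seconds.std(ddof=0), unit="s")

print(population_std)  # 2 days 17:00:00
print(sample_std)      # larger, roughly 3 days 19:55:26
print(matched_std)     # same as population_std
```

Whichever definition is wanted, passing ddof explicitly keeps the NumPy and pandas results consistent.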

CodePudding user response：

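If I understand correctly, this should work for you. Note that it groups by country_ID only, so it returns one standard deviation per country across all IDs: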

stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x))

Output:

>>> stddevs
country_ID
ENG   2 days 01:17:43.835702
ESP   4 days 16:40:16.598773
Name: visit_time, dtype: timedelta64[ns]

Formatting:

stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x)).to_frame().T.add_prefix('std_visit_').reset_index(drop=True).rename_axis(None, axis=1)

Output:

>>> stddevs
           std_visit_ENG          std_visit_ESP
0 2 days 01:17:43.835702 4 days 16:40:16.598773

CodePudding user response：

Use GroupBy.std and pd.to_timedelta:

std_visits = \
    pd.to_timedelta(
        df['visit_time'].dt.total_seconds()
                        .groupby([df['ID'], df['country_ID']]).std(),
        unit='s').unstack().fillna(pd.Timedelta(days=0))
# fillna replaces NaT (groups with fewer than two visits, or none) with a zero timedelta
print(std_visits)

Output:

country_ID                       ENG                       ESP
ID                                                            
0                    0 days 00:00:00 3 days 19:55:25.973595304
1          0 days 17:43:29.315934274           0 days 00:00:00
2                    0 days 00:00:00           0 days 00:00:00
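
Both answers rely on the default ddof=1, and the second one fills every missing value with zero, so neither reproduces the table in the question exactly. The following is a minimal sketch, not a definitive solution, that assumes the asker wants the population standard deviation (ddof=0, as np.std computes) and NaT where an ID has no visits for a country. The sample DataFrame is rebuilt by hand from the question, and the names std_seconds and result are illustrative:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 1, 2],
    "country_ID": ["ESP", "ESP", "ENG", "ENG", "ESP", "ENG", "ENG"],
    "visit_time": pd.to_timedelta([
        "10 days 12:03:00", "5 days 02:03:00", "5 days 10:02:00",
        "3 days 08:05:03", "1 days 03:02:00", "2 days 07:01:03",
        "0 days 12:01:02",
    ]),
})

# Standard deviation in seconds per (ID, country_ID) group.
# ddof=0 is an assumption: it gives 0 for single-visit groups instead of NaN.
std_seconds = (
    df["visit_time"].dt.total_seconds()
    .groupby([df["ID"], df["country_ID"]])
    .std(ddof=0)
)

# Back to timedelta, then one column per country.
# ID/country pairs that never occur stay NaT because fillna is not applied.
result = (
    pd.to_timedelta(std_seconds, unit="s")
    .unstack()
    .add_prefix("std_visit_")
    .rename_axis(None, axis=1)
    .reset_index()
)
result = result[["ID", "std_visit_ESP", "std_visit_ENG"]]
print(result)
```

With the sample data this should give 2 days 17:00:00 for ID 0 in ESP, 0 days 12:32:00 for ID 1 in ENG, and NaT for ID 2 in ESP. Dropping ddof=0 returns to the sample standard deviation used in the answers above.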