From the following pandas DataFrame:
country_ID date ID visit_ENG visit_FRA visit_ESP visit_time
ENG 2022-02-04 16:30:21 3 1 0 0 0 days 01:00:00
ENG 2022-02-04 16:40:21 3 1 0 0 0 days 00:05:00
ENG 2022-02-06 16:35:21 3 1 0 0 1 days 19:55:00
ENG 2022-02-04 10:10:21 2 1 0 0 NaT
ESP 2022-02-04 15:10:21 2 1 1 1 NaT
ENG 2022-02-04 12:35:21 1 1 0 0 NaT
ENG 2022-02-04 16:10:21 1 0 0 0 0 days 03:35:00
ESP 2022-02-04 14:23:21 0 0 0 1 NaT
ESP 2022-02-04 15:27:21 0 1 0 0 0 days 01:04:00
FRA 2022-02-04 15:35:21 0 0 0 0 NaT
ENG 2022-02-04 16:35:21 0 0 0 0 0 days 00:35:00
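For reproducibility, the frame can be rebuilt roughly as follows (the exact dtypes are assumed: date parsed as datetime, the visit flags as ints, visit_time as timedelta):
import pandas as pd

df = pd.DataFrame({
    'country_ID': ['ENG', 'ENG', 'ENG', 'ENG', 'ESP', 'ENG', 'ENG', 'ESP', 'ESP', 'FRA', 'ENG'],
    'date': pd.to_datetime([
        '2022-02-04 16:30:21', '2022-02-04 16:40:21', '2022-02-06 16:35:21',
        '2022-02-04 10:10:21', '2022-02-04 15:10:21', '2022-02-04 12:35:21',
        '2022-02-04 16:10:21', '2022-02-04 14:23:21', '2022-02-04 15:27:21',
        '2022-02-04 15:35:21', '2022-02-04 16:35:21']),
    'ID': [3, 3, 3, 2, 2, 1, 1, 0, 0, 0, 0],
    'visit_ENG': [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
    'visit_FRA': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    'visit_ESP': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    'visit_time': pd.to_timedelta(['01:00:00', '00:05:00', '1 days 19:55:00', None, None,
                                   None, '03:35:00', None, '01:04:00', None, '00:35:00']),
})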
I have extracted the following information into separate subsets, using the code below:
- Sum of visit columns for each ID.
visit_df = pd.DataFrame()
visit_df['visit_ENG'] = df.groupby('ID')['visit_ENG'].sum()
visit_df['visit_FRA'] = df.groupby('ID')['visit_FRA'].sum()
visit_df['visit_ESP'] = df.groupby('ID')['visit_ESP'].sum()
print(visit_df)
visit_ENG visit_FRA visit_ESP
ID
0 1 0 1
1 1 0 0
2 2 1 1
3 3 0 0
- Mean visit_time of each country_ID group for each ID.
subset_avg = (
    pd.to_timedelta(subset_out['visit_time'].dt.total_seconds()
                    .groupby([subset_out['ID'], subset_out['country_ID']])
                    .mean(), unit='s')
    .fillna(pd.Timedelta(days=0)).unstack().add_prefix('avg_visit_')
)
subset_avg.columns.names = ['']
print(subset_avg)
avg_visit_ESP avg_visit_ENG
ID
0 0 days 01:04:00 0 days 00:35:00
1 NaT 0 days 03:35:00
3 NaT 0 days 15:00:00
- Standard deviation of visit_time of each country_ID group for each ID.
subset_std = (
    pd.to_timedelta(df['visit_time'].dt.total_seconds()
                    .groupby([df['ID'], df['country_ID']])
                    .std(), unit='s')
    .fillna(pd.Timedelta(days=0)).unstack().add_prefix('std_visit_')
)
subset_std.columns.names = ['']
print(subset_std)
std_visit_ESP std_visit_ENG
ID
0 0 days 00:00:00 0 days 00:00:00
1 NaT 0 days 00:00:00
3 NaT 1 days 01:02:48.342559903
I would like to build a joint table with all the extracted data. I need to concatenate the tables above (visit_df, subset_avg, subset_std), turn the ID index into a column, and fill in the values that do not appear in the tables as NaT, as shown below:
ID | visit_ENG | visit_FRA | visit_ESP | avg_visit_ESP | avg_visit_ENG | std_visit_ESP | std_visit_ENG |
---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 days 01:04:00 | 0 days 00:35:00 | 0 days 00:00:00 | 0 days 00:00:00 |
1 | 1 | 0 | 0 | NaT | 0 days 03:35:00 | NaT | 0 days 00:00:00 |
2 | 2 | 1 | 1 | NaT | NaT | NaT | NaT |
3 | 3 | 0 | 0 | NaT | 0 days 15:00:00 | NaT | 1 days 01:02:48.342559903 |
If there is a more efficient way to build this result with other pandas commands, please let me know.
CodePudding user response:
Are you looking for pd.concat? Along axis=1 it aligns the three frames on their shared ID index:
>>> pd.concat([visit_df, subset_avg, subset_std], axis=1).reset_index()
ID visit_ENG visit_FRA visit_ESP avg_visit_ESP avg_visit_ENG std_visit_ESP std_visit_ENG
0 0 1 0 1 0 days 01:04:00 0 days 00:35:00 0 days 00:00:00 0 days 00:00:00
1 1 1 0 0 NaT 0 days 03:35:00 NaT 0 days 00:00:00
2 2 2 1 1 NaN NaN NaN NaN
3 3 3 0 0 NaT 0 days 15:00:00 NaT 1 days 01:02:48.342559903
Tip for visit_df:
visit_df = df.groupby('ID')[['visit_ENG', 'visit_FRA', 'visit_ESP']].sum()
print(visit_df)
# Output
visit_ENG visit_FRA visit_ESP
ID
0 1 0 1
1 1 0 0
2 2 1 1
3 3 0 0
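Along the same lines, the mean and standard deviation can be computed in a single groupby pass. A sketch, assuming the full df from the question (unlike the subsets above, it keeps every country_ID column and leaves missing combinations as NaT):
# Work in seconds so std is supported, then convert back to timedeltas.
seconds = df['visit_time'].dt.total_seconds()
stats = (
    seconds.groupby([df['ID'], df['country_ID']])
           .agg(['mean', 'std'])              # one pass for both statistics
           .apply(pd.to_timedelta, unit='s')  # each column back to timedelta
           .unstack('country_ID')             # country level -> columns
)
# Flatten the (statistic, country) MultiIndex into the column names used above.
stats.columns = [f"{'avg' if s == 'mean' else 'std'}_visit_{c}" for s, c in stats.columns]
result = pd.concat([visit_df, stats], axis=1).reset_index()
print(result)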