This is my first question here, as I couldn't figure it out from related answers. I have a list of dataframes, "df_list", with one dataframe per user. Each dataframe basically looks like this:
- |User |Timestamp |Timestamp2 |check |in_out
0 |0001 |2022-01-07 07:40:44 |2022-01-07 17:32:18 |0 days 09:51:34 |in
1 |0001 |2022-01-07 17:32:18 |2022-01-10 17:31:50 |2 days 23:59:32 |out
2 |0001 |2022-01-10 17:31:50 |2022-01-11 07:43:48 |0 days 14:11:58 |in
Data:
data = {'User': [1, 1, 1],
        'Timestamp': ['2022-01-07 07:40:44',
                      '2022-01-07 17:32:18',
                      '2022-01-10 17:31:50'],
        'Timestamp2': ['2022-01-07 17:32:18',
                       '2022-01-10 17:31:50',
                       '2022-01-11 07:43:48'],
        'check': ['0 days 09:51:34', '2 days 23:59:32', '0 days 14:11:58'],
        'in_out': ['in', 'out', 'in']}
I would like to go through all the dataframes in my df_list and, inside each df, add one empty row below every row where the 'check' value is > 15 hours.
Desired Output
- |User |Timestamp |Timestamp2 |check |in_out
0 |0001 |2022-01-07 07:40:44 |2022-01-07 17:32:18 |0 days 09:51:34 |in
1 |0001 |2022-01-07 17:32:18 |2022-01-10 17:31:50 |2 days 23:59:32 |out
2 |0001
3 |0001 |2022-01-10 17:31:50 |2022-01-11 07:43:48 |0 days 14:11:58 |in
Attempt:
What I tried is a for-loop over the list, with a second loop over the rows of each dataframe. Inside it I check my condition with an if-clause, create a new blank line with the required index, and then concat the pieces of the dataframe so that the empty line is included.
for dfx in df_list:
    for i in dfx.index.values.tolist():
        if (pd.Timedelta(dfx["check"].loc[i]) > pd.Timedelta("15 hours")) == True:
            line = pd.DataFrame({'ID': '', 'Timestamp': None,'Timestamp2': '','check': None}, index=[i+1])
            dfx = pd.concat([dfx.iloc[:i+1], line, dfx.iloc[i:]]).reset_index(drop=True)
I also found the question "Add empty row if a condition is met Pandas" here, but I can't get that running either:
for i in dfy.index.values.tolist():
    indices = dfy.loc[pd.Timedelta(((dfy["check"].loc[i]) > pd.Timedelta("15 hours"))== True)].index.tolist()
ERROR: Value must be Timedelta, string, integer, float, timedelta or convertible, not bool
In the next step I want to fill the new row with either the 'Timestamp' value from the previous line or that value minus 10 hours, depending on the value of the in_out column, but I thought I should figure out one problem at a time.
Answer:
You can create a boolean mask using the "check" column and add a row using Index.repeat and reindexing. Then sort_index and blank out the duplicated values:
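import numpy as np
import pandas as pd
msk = df['check'] > pd.Timedelta(hours=15)  # True where "check" exceeds 15 hours
df = pd.concat([df.reindex(df[msk].index.repeat(2)), df[~msk]]).sort_index()
df.loc[df.duplicated(['Timestamp','check']), df.columns.drop('User')] = np.nan  # blank the repeated copy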
Note that this assumes "check" has a timedelta dtype.
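If "check" is still stored as strings, as in the sample data dict from the question, it can be converted with pd.to_timedelta before the mask is built. A minimal sketch, assuming the column names from the question (the timestamp conversion is optional and only there so the blank row shows NaT):

```python
import pandas as pd

# Sample data from the question; every value starts out as a plain string.
data = {'User': [1, 1, 1],
        'Timestamp': ['2022-01-07 07:40:44', '2022-01-07 17:32:18', '2022-01-10 17:31:50'],
        'Timestamp2': ['2022-01-07 17:32:18', '2022-01-10 17:31:50', '2022-01-11 07:43:48'],
        'check': ['0 days 09:51:34', '2 days 23:59:32', '0 days 14:11:58'],
        'in_out': ['in', 'out', 'in']}
df = pd.DataFrame(data)

# Convert the string columns to proper datetime / timedelta dtypes.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Timestamp2'] = pd.to_datetime(df['Timestamp2'])
df['check'] = pd.to_timedelta(df['check'])

# The comparison now works element-wise and yields a boolean mask.
msk = df['check'] > pd.Timedelta(hours=15)
print(msk.tolist())  # [False, True, False]
```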
Output:
   User           Timestamp          Timestamp2           check in_out
0     1 2022-01-07 07:40:44 2022-01-07 17:32:18 0 days 09:51:34     in
1     1 2022-01-07 17:32:18 2022-01-10 17:31:50 2 days 23:59:32    out
1     1                 NaT                 NaT             NaT    NaN
2     1 2022-01-10 17:31:50 2022-01-11 07:43:48 0 days 14:11:58     in
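To apply the same idea to every dataframe in df_list and get a clean 0..n index as in the desired output, something like the following might work. It is only a sketch: add_blank_rows is a hypothetical helper name, it repeats rows by position rather than by label, and it assumes each frame has the "User" and "check" columns from the question. Note that rebinding dfx inside a for loop does not change df_list, so the list is rebuilt here.

```python
import numpy as np
import pandas as pd


def add_blank_rows(df, threshold="15 hours"):
    # Hypothetical helper: returns a new frame with a blank row after every
    # row whose "check" exceeds the threshold; "User" is kept on blank rows.
    check = pd.to_timedelta(df["check"])  # also accepts strings like "0 days 09:51:34"
    msk = (check > pd.Timedelta(threshold)).to_numpy()
    # Row positions, with each flagged row listed twice.
    pos = np.repeat(np.arange(len(df)), msk.astype(int) + 1)
    out = df.iloc[pos].reset_index(drop=True)
    # The second copy of a flagged row is the one to blank out.
    is_copy = np.zeros(len(pos), dtype=bool)
    is_copy[1:] = pos[1:] == pos[:-1]
    out.loc[is_copy, out.columns.drop("User")] = np.nan
    return out


# Sample frame from the question, wrapped in a one-element list.
df = pd.DataFrame({
    "User": [1, 1, 1],
    "Timestamp": ["2022-01-07 07:40:44", "2022-01-07 17:32:18", "2022-01-10 17:31:50"],
    "Timestamp2": ["2022-01-07 17:32:18", "2022-01-10 17:31:50", "2022-01-11 07:43:48"],
    "check": ["0 days 09:51:34", "2 days 23:59:32", "0 days 14:11:58"],
    "in_out": ["in", "out", "in"],
})
df_list = [df]

# Build a new list instead of reassigning the loop variable.
df_list = [add_blank_rows(d) for d in df_list]
print(df_list[0])
```

The blank row lands at position 2, directly below the 'out' row, and the original frames are left unchanged.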