So I'm working on a Python script using pandas where I sometimes get output that looks like this:
Time index | Value 1 | Value 2 |
---|---|---|
19:30 | 5 | 0 |
21:30 | 4 | 0 |
21:31 | "NaN" | 1 |
23:30 | 4 | 0 |
Basically the 21:30/21:31 values correspond to each other but are split across two rows due to faulty sensors or rounding errors. Does anyone know of a way to merge two rows when their time indexes are very similar (within a minute of one another)? The result would look as follows:
Time index | Value 1 | Value 2 |
---|---|---|
19:30 | 5 | 0 |
21:30 | 4 | 1 |
23:30 | 4 | 0 |
Thanks for the help in advance!
CodePudding user response:
You can build a custom grouper and aggregate with first/max:
import pandas as pd

# start a new group whenever the gap to the previous row exceeds 2 minutes
m = ~pd.to_datetime(df['Time index']).diff().le('2min')
# aggregate every column with max, except "Time index" which keeps the first value
agg_f = {c: 'max' for c in df}
agg_f['Time index'] = 'first'
# groupby aggregation
out = df.groupby(m.cumsum(), as_index=False).agg(agg_f)
NB. This assumes NaN is a real NaN; if it arrives as the string "NaN", replace it as a first step: df = df.replace({'"NaN"': pd.NA})
output:
Time index Value 1 Value 2
0 19:30 5.0 0
1 21:30 4.0 1
2 23:30 4.0 0
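Putting the pieces together, here is a self-contained sketch. The sample frame and the string-"NaN" cleanup are assumptions based on the question's table, and the grouping is done on the raw key array to keep the example deterministic:

```python
import pandas as pd

# Sample frame mirroring the question; "NaN" arrives as a quoted string here
df = pd.DataFrame({
    'Time index': ['19:30', '21:30', '21:31', '23:30'],
    'Value 1': [5, 4, '"NaN"', 4],
    'Value 2': [0, 0, 1, 0],
})

# Step 1: turn the quoted string into a real missing value, then make the
# column numeric so max() can skip the NaN
df = df.replace({'"NaN"': float('nan')})
df['Value 1'] = pd.to_numeric(df['Value 1'])

# Step 2: start a new group whenever the gap to the previous row exceeds 2 min
m = ~pd.to_datetime(df['Time index']).diff().le(pd.Timedelta('2min'))

# Step 3: aggregate each group with max, keeping the first timestamp as label
agg_f = {c: 'max' for c in df}
agg_f['Time index'] = 'first'
out = df.groupby(m.cumsum().values).agg(agg_f).reset_index(drop=True)
print(out)
```

The diff of the first row is NaT, which compares as False and so correctly opens the first group.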
CodePudding user response:
Code:
# one [Value 1, Value 2] slot per unique time, initialised to zero;
# rows only merge when their "Time index" values are identical
df_dic = {row['Time index']: [0, 0] for _, row in df.iterrows()}
# walk the rows and keep the per-time maximum of each value column
# (assumes numeric values with a real NaN: NaN > x is False, so NaNs are skipped)
for _, row in df.iterrows():
    vals = row.tolist()
    for j in range(1, len(vals)):
        if vals[j] > df_dic[vals[0]][j - 1]:
            df_dic[vals[0]][j - 1] = vals[j]
# convert the dict back to a DataFrame
pd.DataFrame([[k, *v] for k, v in df_dic.items()], columns=df.columns)
Output:
Time index Value 1 Value 2
0 19:30 5.0 0.0
1 21:30 4.0 1.0
2 23:30 4.0 0.0
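For completeness, a runnable sketch of this dict/loop approach. Note that the dict is keyed on the exact time string, so 21:30 and 21:31 would otherwise stay separate rows; flooring the times to 2-minute bins first is an added assumption here, to make near-identical stamps share a key:

```python
import pandas as pd

df = pd.DataFrame({
    'Time index': ['19:30', '21:30', '21:31', '23:30'],
    'Value 1': [5.0, 4.0, float('nan'), 4.0],
    'Value 2': [0, 0, 1, 0],
})

# Snap near-identical times onto one key (assumption: flooring to 2-minute
# bins is acceptable); without this, 21:30 and 21:31 remain separate keys
df['Time index'] = (pd.to_datetime(df['Time index'])
                      .dt.floor('2min').dt.strftime('%H:%M'))

# one [Value 1, Value 2] slot per unique time, initialised to zero
df_dic = {t: [0, 0] for t in df['Time index']}

# keep the per-time maximum of each value column; NaN > x is False, so NaNs
# never overwrite a stored value
for row in df.itertuples(index=False):
    for j, v in enumerate(row[1:]):
        if v > df_dic[row[0]][j]:
            df_dic[row[0]][j] = v

out = pd.DataFrame([[k, *v] for k, v in df_dic.items()], columns=df.columns)
print(out)
```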