Hi I have the following code :
df_bla = pd.DataFrame({"timestamp" : [1, 2, 2, 3], "values": [4,5,6,7]})
df_events = pd.DataFrame({"timestamp" : [2,4], "d": [18,11]})
df_bla.merge(df_events, on = 'timestamp', how = 'outer')
I get the following result
timestamp values d
0 1 4.0 NaN
1 2 5.0 18.0
2 2 6.0 18.0
3 3 7.0 NaN
4 4 NaN 11.0
My problem is the 18
value duplicated twice.
What I want is that the items from df_events
, won't be duplicated/propagated and will be joined to the first/last/I don't care item.
We can assume the key I'm joining by is sorted (It probably will simplify the answer)
Meaning I want to have
timestamp values d
0 1 4.0 NaN
1 2 5.0 18.0
2 2 6.0 NaN
3 3 7.0 NaN
4 4 NaN 11.0
I incentive is that df_events, contain real life events, and there total number must be correct.
CodePudding user response:
You can mark the duplicated as NaN after merge
out = df_bla.merge(df_events, on = 'timestamp', how = 'outer')
m = out.duplicated(['timestamp', 'd'])
out['d'] = out['d'].mask(m, np.nan)
# or
out.loc[m, 'd'] = np.nan
print(out)
timestamp values d
0 1 4.0 NaN
1 2 5.0 18.0
2 2 6.0 NaN
3 3 7.0 NaN
4 4 NaN 11.0
CodePudding user response:
Merge only the first occurrences of rows having the same timestamp in df_bla, and then concatenate this with the other occurences:
pd.concat( [
df_bla.drop_duplicates('timestamp').merge(df_events, on='timestamp',how='outer'),
df_bla[(df_bla.duplicated('timestamp',keep='first'))]
], axis=0 ).sort_values(['timestamp','values'])
timestamp values d
0 1 4.0 NaN
1 2 5.0 18.0
2 2 6.0 NaN
2 3 7.0 NaN
3 4 NaN 11.0
The keep='first' is default and can be omitted, but included for clarity here.
CodePudding user response:
Hey although this already has answer, i would still like to propose my solution i would love some critic wether it is applyable or not.
for a,b in enumerate(zip(df['timestamp'],df['d'])):
if df['timestamp'][a] == df['timestamp'][a 1]:
df['d'][a] = np.nan
else:
break
print(df)
timestamp values d
0 1 4.0 NaN
1 2 5.0 NaN
2 2 6.0 18.0
3 3 7.0 NaN
4 4 NaN 11.0