Home > Software design >  Pandas dataframe merge without duplicating items
Pandas dataframe merge without duplicating items

Time:05-27

Hi I have the following code :

df_bla = pd.DataFrame({"timestamp" : [1, 2, 2, 3], "values": [4,5,6,7]})
df_events = pd.DataFrame({"timestamp" : [2,4], "d": [18,11]})

df_bla.merge(df_events, on = 'timestamp', how = 'outer')

I get the following result

timestamp   values  d
0       1   4.0     NaN
1       2   5.0     18.0
2       2   6.0     18.0
3       3   7.0     NaN
4       4   NaN     11.0

My problem is the 18 value duplicated twice.

What I want is that the items from df_events, won't be duplicated/propagated and will be joined to the first/last/I don't care item. We can assume the key I'm joining by is sorted (It probably will simplify the answer)

Meaning I want to have

timestamp   values  d
0       1   4.0     NaN
1       2   5.0     18.0
2       2   6.0     NaN
3       3   7.0     NaN
4       4   NaN     11.0

I incentive is that df_events, contain real life events, and there total number must be correct.

CodePudding user response:

You can mark the duplicated as NaN after merge

out = df_bla.merge(df_events, on = 'timestamp', how = 'outer')
m = out.duplicated(['timestamp', 'd'])

out['d'] = out['d'].mask(m, np.nan)
# or
out.loc[m, 'd'] = np.nan
print(out)

   timestamp  values     d
0          1     4.0   NaN
1          2     5.0  18.0
2          2     6.0   NaN
3          3     7.0   NaN
4          4     NaN  11.0

CodePudding user response:

Merge only the first occurrences of rows having the same timestamp in df_bla, and then concatenate this with the other occurences:

pd.concat( [
    df_bla.drop_duplicates('timestamp').merge(df_events, on='timestamp',how='outer'),
    df_bla[(df_bla.duplicated('timestamp',keep='first'))] 
], axis=0 ).sort_values(['timestamp','values'])


   timestamp  values     d
0          1     4.0   NaN
1          2     5.0  18.0
2          2     6.0   NaN
2          3     7.0   NaN
3          4     NaN  11.0

The keep='first' is default and can be omitted, but included for clarity here.

CodePudding user response:

Hey although this already has answer, i would still like to propose my solution i would love some critic wether it is applyable or not.

for a,b in enumerate(zip(df['timestamp'],df['d'])):
if df['timestamp'][a] == df['timestamp'][a 1]:
    df['d'][a] = np.nan
else:
    break

print(df)
      timestamp  values     d
0          1     4.0   NaN
1          2     5.0  NaN
2          2     6.0  18.0
3          3     7.0   NaN
4          4     NaN  11.0
  • Related