Home > Mobile >  Optimize Pandas assignation based on second dataframe
Optimize Pandas assignation based on second dataframe

Time:04-06

I have two dataframes, df and df2. I want to include on df an extra column on df called names. This column should contain the name of the row of df that is defined on df2 based on the conjunction of the group names of df and df2 and when this condition is met: (df[val2] >= df2[val] >= df[val2]).

Therefore, given this inputs:

df

  val1 val2 group
0   1   2   a1
1   3   4   a1
2   10  12  a1
3   23  25  a1
4   1   2   a2
5   3   4   a2
6   10  12  a2
7   23  25  a2

df2

   val name group
0   1   abc a1
1   4   def a1
2   12  ghi a1
3   26  jkl a1
4   2   p1  a2
5   4   p2  a2
6   12  p3  a2
7   26  p4  a2

The desired output would be:

df

  val1 val2 group name
0   1   2   a1  abc
1   3   4   a1  def
2   10  12  a1  ghi
3   23  25  a1  
4   1   2   a2  p1
5   3   4   a2  p2
6   10  12  a2  p3
7   23  25  a2

As said, I manage to obtain the correct result, but my code is not efficient at all (in this small example, it does not really matter, but when running this snippet on an example with thousands of registers, you can tell that it is quite slow). Is there any way to boost this operation? PS: bellow is my current code with a working example:

d = {'val1': [1, 3, 10, 23, 1, 3, 10, 23], 'val2': [2, 4, 12, 25, 2, 4, 12, 25], 'group': ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"]}
d2 = {'val': [1, 4, 12, 26, 2, 4, 12, 26], 'name': ['abc', 'def', 'ghi', 'jkl', 'p1', 'p2', 'p3', 'p4'], 'group': ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)

names = set(df["group"])
df["name"] = ""

for name in names:
    df_it = df[df["group"] == name]
    df2_it = df2[df2["group"] == name]
    for it in df_it.index:
        df2_it_2 = df2_it[(df2["val"] >= \
                           df_it["val1"][it]) &
                          (df2["val"] <=
                           df_it["val2"][it])]
        
        if not df2_it_2.empty:
            df["name"][it] = df2_it_2["name"].iloc[0]

CodePudding user response:

You can perform a merge_asof to have val ≥ val1, then mask the "name"s for which the merged val is greater than val2:

(pd.merge_asof(df.reset_index().sort_values(by='val1'),
               df2.sort_values(by='val'),
               direction='forward',
               by='group', left_on='val1', right_on='val')
   .set_index('index').sort_index()
   .assign(name=lambda d: d['name'].mask(d['val'].gt(d['val2'])))
   .drop(columns='val')
)

Another option using a second join, this might be more flexible for many columns to add:

cols = ['name']
df.join(
 pd.merge_asof(df.reset_index().sort_values(by='val1'),
               df2.sort_values(by='val'),
               direction='forward',
               by='group', left_on='val1', right_on='val')
   .loc[lambda d: d['val'].le(d['val2']), ['index'] cols]
   .set_index('index')
)

output:

       val1  val2 group name
index                       
0         1     2    a1  abc
1         3     4    a1  def
2        10    12    a1  ghi
3        23    25    a1  NaN
4         1     2    a2   p1
5         3     4    a2   p2
6        10    12    a2   p3
7        23    25    a2  NaN
  • Related