Quickly assign values between DataFrame based on multiple keys-CodePudding

I have two dataframes. One, named population has two columns randomly ordered positions. The other, named keyFrame, has two columns of ordered keys and a column of attributes ('attr') associated with the pair of keys.

I use the below code to:

Create an empty column in population.
Iterate over each row in keyFrame (the iterable dataframe is not being altered).
Assign the rows 'attr' value to populations 'assignment' where either position1 == key1 & position2 == key2 OR where position1 == key2 & position2 == key1.

This works perfectly, but is extremely slow in my actual code. the population dataframe is >500k rows in actuality and the keyFrame dataframe has >1500 values.

Question: Is there a way to assign the 'attr' values from keyFrame to population where the keys match (interchangeably) all at once?

# Sample code for you to test! Thank you!    
import pandas as pd
import numpy as np

population = pd.DataFrame(data={'position1': [1, 6, 1, 1, 1, 7, 1, 8, 16],
                                'position2': [5, 1, 15, 9, 17, 1, 2, 1, 1]})
keyFrame = pd.DataFrame(data={'key1': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                              'key2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17],
                              'attr': [0.79, 0.65, 0.99, 0.03, 0.58, 0.19, 0.53,
                                       0.76, 0.49, 0.46, 0.25, 0.11, 0.22, 0.38, 0.94]})

population['assignment'] = np.NaN  # Step 1
for index, row in keyFrame.iterrows():  # Step 2
    # Step 3
    population['assignment'].loc[((population['position1'] == row['key1']) & (
                population['position2'] == row['key2'])) | (
                (population['position1'] == row['key2']) & (
                population['position2'] == row['key1']))] = row['attr']

P.S. I am aware many questions exist that are similar to this, but they either don't fully match my use case or they don't solve the issue in a more efficient manner.

CodePudding user response：

One approach could be as follows:

population['assignment'] = pd.DataFrame(population[['position1','position2']]\
    .apply(sorted, axis=1).tolist()).merge(keyFrame, left_on=[0,1], 
                                           right_on=['key1','key2'], how='left')\
                                           ['attr']

   position1  position2  assignment
0          1          5        0.03
1          6          1        0.58
2          1         15        0.22
3          1          9        0.76
4          1         17        0.94
5          7          1        0.19
6          1          2        0.79
7          8          1        0.53
8         16          1        0.38

Explanation

First, we use .apply(sorted, axis=1) to cols position1, position2 to get sorted lists. e.g. [1,5], [1,6] etc.
Next, we apply .tolist() inside the pd.DataFrame constructor to get two columns again (col names will default to [0,1]).
Sorted, we can use df.merge with left_on=[0,1] and right_on=['key1','key2'], select only attr, and assign it to a new column for the df population.

CodePudding user response：

Maybe like this (almost the same as @ouroboros1 proposed):

population['key'] = [tuple(sorted([p1, p2])) for p1, p2 in zip(population.position1, population.position2)]
keyFrame['key'] = [tuple(sorted([k1, k2])) for k1, k2 in zip(keyFrame.key1, keyFrame.key2)]
population['assignment'] = population.merge(keyFrame[['key', 'attr']], on='key')['attr']
population.drop(columns=['key'], inplace=True)

Result:

   position1  position2  assignment
0          1          5        0.03
1          6          1        0.58
2          1         15        0.22
3          1          9        0.76
4          1         17        0.94
5          7          1        0.19
6          1          2        0.79
7          8          1        0.53
8         16          1        0.38

CodePudding user response：

One method would be to merge both ways and then combine the results.

# smol var names make me happy.
p_df = population
k_df = keyFrame

df1 = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
df2 = p_df.merge(k_df, left_on=['position2', 'position1'], right_on=['key1', 'key2'])
df = pd.concat([df1, df2], ignore_index=True)
print(df)

Output:

   position1  position2  key1  key2  attr
0          1          5     1     5  0.03
1          1         15     1    15  0.22
2          1          9     1     9  0.76
3          1         17     1    17  0.94
4          1          2     1     2  0.79
5          6          1     1     6  0.58
6          7          1     1     7  0.19
7          8          1     1     8  0.53
8         16          1     1    16  0.38

If we want to pre-sort things...

p_df = p_df.assign(position1=p_df.min(axis=1), position2=p_df.max(axis=1))
df = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
print(df)

(Same Output as above)