Home > database >  Quickly assign values between DataFrame based on multiple keys
Quickly assign values between DataFrame based on multiple keys

Time:09-28

I have two dataframes. One, named population has two columns randomly ordered positions. The other, named keyFrame, has two columns of ordered keys and a column of attributes ('attr') associated with the pair of keys.

I use the below code to:

  1. Create an empty column in population.
  2. Iterate over each row in keyFrame (the iterable dataframe is not being altered).
  3. Assign the rows 'attr' value to populations 'assignment' where either position1 == key1 & position2 == key2 OR where position1 == key2 & position2 == key1.

This works perfectly, but is extremely slow in my actual code. the population dataframe is >500k rows in actuality and the keyFrame dataframe has >1500 values.

Question: Is there a way to assign the 'attr' values from keyFrame to population where the keys match (interchangeably) all at once?

# Sample code for you to test! Thank you!    
import pandas as pd
import numpy as np

population = pd.DataFrame(data={'position1': [1, 6, 1, 1, 1, 7, 1, 8, 16],
                                'position2': [5, 1, 15, 9, 17, 1, 2, 1, 1]})
keyFrame = pd.DataFrame(data={'key1': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                              'key2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17],
                              'attr': [0.79, 0.65, 0.99, 0.03, 0.58, 0.19, 0.53,
                                       0.76, 0.49, 0.46, 0.25, 0.11, 0.22, 0.38, 0.94]})

population['assignment'] = np.NaN  # Step 1
for index, row in keyFrame.iterrows():  # Step 2
    # Step 3
    population['assignment'].loc[((population['position1'] == row['key1']) & (
                population['position2'] == row['key2'])) | (
                (population['position1'] == row['key2']) & (
                population['position2'] == row['key1']))] = row['attr']

P.S. I am aware many questions exist that are similar to this, but they either don't fully match my use case or they don't solve the issue in a more efficient manner.

CodePudding user response:

One approach could be as follows:

population['assignment'] = pd.DataFrame(population[['position1','position2']]\
    .apply(sorted, axis=1).tolist()).merge(keyFrame, left_on=[0,1], 
                                           right_on=['key1','key2'], how='left')\
                                           ['attr']

   position1  position2  assignment
0          1          5        0.03
1          6          1        0.58
2          1         15        0.22
3          1          9        0.76
4          1         17        0.94
5          7          1        0.19
6          1          2        0.79
7          8          1        0.53
8         16          1        0.38

Explanation

  • First, we use .apply(sorted, axis=1) to cols position1, position2 to get sorted lists. e.g. [1,5], [1,6] etc.
  • Next, we apply .tolist() inside the pd.DataFrame constructor to get two columns again (col names will default to [0,1]).
  • Sorted, we can use df.merge with left_on=[0,1] and right_on=['key1','key2'], select only attr, and assign it to a new column for the df population.

CodePudding user response:

Maybe like this (almost the same as @ouroboros1 proposed):

population['key'] = [tuple(sorted([p1, p2])) for p1, p2 in zip(population.position1, population.position2)]
keyFrame['key'] = [tuple(sorted([k1, k2])) for k1, k2 in zip(keyFrame.key1, keyFrame.key2)]
population['assignment'] = population.merge(keyFrame[['key', 'attr']], on='key')['attr']
population.drop(columns=['key'], inplace=True)

Result:

   position1  position2  assignment
0          1          5        0.03
1          6          1        0.58
2          1         15        0.22
3          1          9        0.76
4          1         17        0.94
5          7          1        0.19
6          1          2        0.79
7          8          1        0.53
8         16          1        0.38

CodePudding user response:

One method would be to merge both ways and then combine the results.

# smol var names make me happy.
p_df = population
k_df = keyFrame

df1 = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
df2 = p_df.merge(k_df, left_on=['position2', 'position1'], right_on=['key1', 'key2'])
df = pd.concat([df1, df2], ignore_index=True)
print(df)

Output:

   position1  position2  key1  key2  attr
0          1          5     1     5  0.03
1          1         15     1    15  0.22
2          1          9     1     9  0.76
3          1         17     1    17  0.94
4          1          2     1     2  0.79
5          6          1     1     6  0.58
6          7          1     1     7  0.19
7          8          1     1     8  0.53
8         16          1     1    16  0.38

If we want to pre-sort things...

p_df = p_df.assign(position1=p_df.min(axis=1), position2=p_df.max(axis=1))
df = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
print(df)

(Same Output as above)

  • Related