I have two dataframes. One, named population has two columns randomly ordered positions. The other, named keyFrame, has two columns of ordered keys and a column of attributes ('attr'
) associated with the pair of keys.
I use the below code to:
- Create an empty column in population.
- Iterate over each row in keyFrame (the iterable dataframe is not being altered).
- Assign the rows
'attr'
value to populations'assignment'
where eitherposition1 == key1 & position2 == key2
OR whereposition1 == key2 & position2 == key1
.
This works perfectly, but is extremely slow in my actual code. the population dataframe is >500k rows in actuality and the keyFrame dataframe has >1500 values.
Question: Is there a way to assign the 'attr'
values from keyFrame to population where the keys match (interchangeably) all at once?
# Sample code for you to test! Thank you!
import pandas as pd
import numpy as np
population = pd.DataFrame(data={'position1': [1, 6, 1, 1, 1, 7, 1, 8, 16],
'position2': [5, 1, 15, 9, 17, 1, 2, 1, 1]})
keyFrame = pd.DataFrame(data={'key1': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'key2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17],
'attr': [0.79, 0.65, 0.99, 0.03, 0.58, 0.19, 0.53,
0.76, 0.49, 0.46, 0.25, 0.11, 0.22, 0.38, 0.94]})
population['assignment'] = np.NaN # Step 1
for index, row in keyFrame.iterrows(): # Step 2
# Step 3
population['assignment'].loc[((population['position1'] == row['key1']) & (
population['position2'] == row['key2'])) | (
(population['position1'] == row['key2']) & (
population['position2'] == row['key1']))] = row['attr']
P.S. I am aware many questions exist that are similar to this, but they either don't fully match my use case or they don't solve the issue in a more efficient manner.
CodePudding user response:
One approach could be as follows:
population['assignment'] = pd.DataFrame(population[['position1','position2']]\
.apply(sorted, axis=1).tolist()).merge(keyFrame, left_on=[0,1],
right_on=['key1','key2'], how='left')\
['attr']
position1 position2 assignment
0 1 5 0.03
1 6 1 0.58
2 1 15 0.22
3 1 9 0.76
4 1 17 0.94
5 7 1 0.19
6 1 2 0.79
7 8 1 0.53
8 16 1 0.38
Explanation
- First, we use
.apply(sorted, axis=1)
to colsposition1, position2
to get sorted lists. e.g.[1,5], [1,6]
etc. - Next, we apply
.tolist()
inside thepd.DataFrame
constructor to get two columns again (col names will default to[0,1]
). - Sorted, we can use
df.merge
withleft_on=[0,1]
andright_on=['key1','key2']
, select onlyattr
, and assign it to a new column for the dfpopulation
.
CodePudding user response:
Maybe like this (almost the same as @ouroboros1 proposed):
population['key'] = [tuple(sorted([p1, p2])) for p1, p2 in zip(population.position1, population.position2)]
keyFrame['key'] = [tuple(sorted([k1, k2])) for k1, k2 in zip(keyFrame.key1, keyFrame.key2)]
population['assignment'] = population.merge(keyFrame[['key', 'attr']], on='key')['attr']
population.drop(columns=['key'], inplace=True)
Result:
position1 position2 assignment
0 1 5 0.03
1 6 1 0.58
2 1 15 0.22
3 1 9 0.76
4 1 17 0.94
5 7 1 0.19
6 1 2 0.79
7 8 1 0.53
8 16 1 0.38
CodePudding user response:
One method would be to merge both ways and then combine the results.
# smol var names make me happy.
p_df = population
k_df = keyFrame
df1 = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
df2 = p_df.merge(k_df, left_on=['position2', 'position1'], right_on=['key1', 'key2'])
df = pd.concat([df1, df2], ignore_index=True)
print(df)
Output:
position1 position2 key1 key2 attr
0 1 5 1 5 0.03
1 1 15 1 15 0.22
2 1 9 1 9 0.76
3 1 17 1 17 0.94
4 1 2 1 2 0.79
5 6 1 1 6 0.58
6 7 1 1 7 0.19
7 8 1 1 8 0.53
8 16 1 1 16 0.38
If we want to pre-sort things...
p_df = p_df.assign(position1=p_df.min(axis=1), position2=p_df.max(axis=1))
df = p_df.merge(k_df, left_on=['position1', 'position2'], right_on=['key1', 'key2'])
print(df)
(Same Output as above)