I have two similar dataframes to the below:
import pandas as pd
num1 = ["1111 2222", "3333", "4444 5555 6666", "7777 8888", "9999"]
num2 = ["A1", "A2", "A3", "A4", "A5"]
linkage = pd.DataFrame({"num1":num1, "num2":num2})
num1 = ["2222", "3333", "5555", "8888", "9999"]
num2 = ['none', 'none', 'none', 'none', 'none']
df = pd.DataFrame({"num1":num1, "num2":num2})
Linkage:
num1 | num2 |
---|---|
1111 2222 | A1 |
3333 | A2 |
4444 5555 6666 | A3 |
7777 8888 | A4 |
9999 | A5 |
df:
num1 | num2 |
---|---|
2222 | none |
3333 | none |
5555 | none |
8888 | none |
9999 | none |
I want to place the "num2" value from the linkage dataframe in the second dataframe based on if the "num1" value from the second dataframe is one of the "num1" values in the linkage dataframe. The code I currently have is:
df.num2 = [linkage.num2[i] for y in df.num1 for i, x in enumerate(linkage.num1) if y in x]
Which yields what I want:
num1 | num2 |
---|---|
2222 | A1 |
3333 | A2 |
5555 | A3 |
8888 | A4 |
9999 | A5 |
But the code is noticeably slower the larger the dataframes get. CPU times: total: 516 ms
Wall time: 519 ms
Is there a better method of using linkage dataframes?
CodePudding user response:
split
the string and explode
, then use this to map
the data:
mapper = (linkage.assign(num1=linkage['num1'].str.split())
.explode('num1')
.set_index('num1')['num2']
)
df['num2'] = df['num1'].map(mapper)
output:
num1 num2
0 2222 A1
1 3333 A2
2 5555 A3
3 8888 A4
4 9999 A5
intermediate mapper
:
num1
1111 A1
2222 A1
3333 A2
4444 A3
5555 A3
6666 A3
7777 A4
8888 A4
9999 A5
Name: num2, dtype: object
CodePudding user response:
You can make use of pd.Series.str.extract
to capture groupings and assign the matches only and use pd.merge
to join the corresponding num2
values of your linkage
data frame:
pd.merge(df.drop(columns="num2"), linkage.assign(num1=linkage["num1"].str.extract(f'({"|".join(df["num1"].unique())})')), on=["num1"], how="left")
Output:
num1 num2
0 2222 A1
1 3333 A2
2 5555 A3
3 8888 A4
4 9999 A5