Faster way to use linkage dataframes with other dataframes

I have two similar dataframes to the below:

import pandas as pd

num1 = ["1111 2222", "3333", "4444 5555 6666", "7777 8888", "9999"]
num2 = ["A1", "A2", "A3", "A4", "A5"] 
linkage = pd.DataFrame({"num1":num1, "num2":num2})
num1 = ["2222", "3333", "5555", "8888", "9999"]
num2 = ['none', 'none', 'none', 'none', 'none']
df = pd.DataFrame({"num1":num1, "num2":num2})

Linkage:

num1	num2
1111 2222	A1
3333	A2
4444 5555 6666	A3
7777 8888	A4
9999	A5

df:

num1	num2
2222	none
3333	none
5555	none
8888	none
9999	none

I want to place the "num2" value from the linkage dataframe in the second dataframe based on if the "num1" value from the second dataframe is one of the "num1" values in the linkage dataframe. The code I currently have is:

df.num2 = [linkage.num2[i] for y in df.num1 for i, x in enumerate(linkage.num1) if y in x]

Which yields what I want:

num1	num2
2222	A1
3333	A2
5555	A3
8888	A4
9999	A5

But the code is noticeably slower the larger the dataframes get. CPU times: total: 516 ms Wall time: 519 ms Is there a better method of using linkage dataframes?

CodePudding user response：

split the string and explode, then use this to map the data:

mapper = (linkage.assign(num1=linkage['num1'].str.split())
                 .explode('num1')
                 .set_index('num1')['num2']
          )

df['num2'] = df['num1'].map(mapper)

output:

   num1 num2
0  2222   A1
1  3333   A2
2  5555   A3
3  8888   A4
4  9999   A5

intermediate mapper:

num1
1111    A1
2222    A1
3333    A2
4444    A3
5555    A3
6666    A3
7777    A4
8888    A4
9999    A5
Name: num2, dtype: object

CodePudding user response：

You can make use of pd.Series.str.extract to capture groupings and assign the matches only and use pd.merge to join the corresponding num2 values of your linkage data frame:

pd.merge(df.drop(columns="num2"), linkage.assign(num1=linkage["num1"].str.extract(f'({"|".join(df["num1"].unique())})')), on=["num1"], how="left")

Output:

    num1    num2
0   2222    A1
1   3333    A2
2   5555    A3
3   8888    A4
4   9999    A5