Pandas: Creating a new column based on if part of a string is anywhere in another column-CodePudding

Let's suppose we have two dataframes:

df1 = pd.DataFrame({
0: 'ETERNITON',
1: 'CIELOON',
2: 'M.DIASBRANCOON',
3: 'IRBBRASIL REON',
4: '01/00 ATACADÃO S.A ON',
5: 'AMBEV S/A ON',
6: '01/00 RUMO S.A. ON',
7: 'COGNA ONON',
8: 'CURY S/A'}.items(), columns=['index', 'name']).set_index('index')

df2 = pd.DataFrame({'name': {0: 'ALLIARON', 1: 'M.DIASBRANCOON', 2: 'AMBEVS/AON', 3: 'CIELOON',
  4: 'AESBRASILON', 5: 'BRASILAGROON', 6: 'IRBBRASILREON', 7: 'ATACADÃOS.AON', 8: 'ALPARGATASON',
  9: 'RUMOS.A.ON', 10: 'COGNAONON'},
 'yf_ticker': {0: 'AALR3.SA', 1: 'MDIA3.SA', 2: 'ABEV3.SA', 3: 'CIEL3.SA', 4: 'AESB3.SA',
  5: 'AGRO3.SA', 6: 'IRBR3.SA', 7: 'CRFB3.SA', 8: 'ALPA3.SA', 9: 'RAIL3.SA', 10: 'COGN3.SA'}})

I'd like to create a new column ('ticker') in df1 using the column 'yf_ticker' from df2. If a name/string in df2['yf_ticker'] is in df1['name'] (even if it is not an exactly match), then add the yf_ticker from df2 to that row in df1['ticker']. To make it clear, the expected output would be something like:

print(df1)
name                    ticker
ETERNITON               Missing or N/A or Nan
CIELOON                 CIEL3.SA
M.DIASBRANCOON          MDIA3.SA
IRBBRASIL REON          IRBR3.SA
01/00 ATACADÃO S.A ON   CRFB3.SA
AMBEV S/A ON            ABEV3.SA
01/00 RUMO S.A. ON      RAIL3.SA
COGNA ONON              COGN3.SA
CURY S/A                Missing or N/A or Nan

The solution I tried:


df1['name'] = df1['name'].str.replace(" ","")

for i in range(len(df1)):
    for j in range(len(df2)):
        if df2.iloc[j,0] in df1.iloc[i,0]:
            df1.loc[i, 'ticker'] = df2.iloc[j,1]

Although it worked, it seems to me that such for loop for a larger dataset is inefficient. Is there a faster (or 'vectorized') way to do that?

CodePudding user response：

I suggest fuzzy matching on the name columns, then get the yf_ticker from the matching row. Here is an example with python's built-in difflib:

import difflib

df1['yf_ticker'] = df1['name'].apply(lambda x: df2.loc[df2['name'] == y[0], 'yf_ticker'].iloc[0] if (y := (difflib.get_close_matches(x, df2.name))) else None)

Output:

index	name	yf_ticker
0	ETERNITON
1	CIELOON	CIEL3.SA
2	M.DIASBRANCOON	MDIA3.SA
3	IRBBRASIL REON	IRBR3.SA
4	01/00 ATACADÃO S.A ON	CRFB3.SA
5	AMBEV S/A ON	ABEV3.SA
6	01/00 RUMO S.A. ON	RAIL3.SA
7	COGNA ONON	COGN3.SA
8	CURY S/A