I have two big dataframes, with 125953 and 174808 rows, and the full check takes around 30 minutes; I want to speed it up. Here is a sample of how my dataframes look:
   color material
0    red     wood
1   blue    metal
2  green  plastic

            name          description
0  my blue color  it is a great color
1      red chair       made with wood
2      green rod      made with metal
I want to check every cell in data against every word from the parameter dataframe, cell by cell. This is my current code:
import pandas as pd
import time

data = pd.read_csv('x.csv', converters={i: str for i in range(200)})
parameter = pd.read_excel('y.xlsx', sheet_name="Tags")

def extractData(i):
    for n in i:
        for row in parameter.columns:
            print(n.apply(lambda color: [c for c in parameter[row].tolist()
                                         if str(c) != 'nan' and c in color]))

s = time.time()
extractData([data[row] for row in data.columns[3:4]])
e = time.time()
print(e - s)
The output:

            name          description attribute attribute2
0  my blue color  it is a great color      blue
1      red chair       made with wood       red       wood
2      green rod      made with metal     green      metal
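For anyone reproducing this, the two sample frames above can be rebuilt as a toy sketch (the real inputs are x.csv and y.xlsx with ~126k and ~175k rows; the frame and column names below just mirror the samples):

```python
import pandas as pd

# Toy reconstruction of the samples above.
parameter = pd.DataFrame({'color': ['red', 'blue', 'green'],
                          'material': ['wood', 'metal', 'plastic']})
data = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                     'description': ['it is a great color', 'made with wood',
                                     'made with metal']})
print(parameter.shape, data.shape)
```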
CodePudding user response:
You could craft a regex from the first dataframe, then use it to search the values:
import re

regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))
# 'red|wood|blue|metal|green|plastic'

out = (df2.stack().str.extractall(f'({regex})')[0]
          .groupby(level=[0, 1]).agg(list)
          .unstack(fill_value=[]))
Output:

  description    name
0          []  [blue]
1      [wood]   [red]
2     [metal] [green]
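A self-contained run of this approach, with df1/df2 built from the question's samples (names chosen here for illustration):

```python
import re
import pandas as pd

# df1 holds the parameter words, df2 holds the text to search.
df1 = pd.DataFrame({'color': ['red', 'blue', 'green'],
                    'material': ['wood', 'metal', 'plastic']})
df2 = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                    'description': ['it is a great color', 'made with wood',
                                    'made with metal']})

# One alternation pattern over every distinct parameter word.
regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))

# Find all matches per cell, collect them into a list per (row, column).
out = (df2.stack().str.extractall(f'({regex})')[0]
          .groupby(level=[0, 1]).agg(list)
          .unstack(fill_value=[]))
print(out)
```

One caveat: the pattern matches substrings, so 'red' would also hit inside a word like 'bored'; wrap the group as rf'\b({regex})\b' if only whole words should count.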
Alternative output:
import re

regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))
# 'red|wood|blue|metal|green|plastic'

out = (df2
       .stack()
       .str.extractall(f'({regex})')
       .reset_index()
       .assign(col=lambda d: d.groupby('level_0').cumcount().add(1))
       .pivot(index='level_0', columns='col', values=0)  # keyword args required on pandas >= 2.0
       .add_prefix('attribute').rename_axis(index=None, columns=None))
Output:

  attribute1 attribute2
0       blue        NaN
1        red       wood
2      green      metal
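The alternative can likewise be run end to end on the toy frames (df1/df2 are illustrative stand-ins for the real data; note that pandas 2.0 made DataFrame.pivot keyword-only):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'color': ['red', 'blue', 'green'],
                    'material': ['wood', 'metal', 'plastic']})
df2 = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                    'description': ['it is a great color', 'made with wood',
                                    'made with metal']})

regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))

# Number the matches per original row (cumcount), then pivot them
# into attribute1, attribute2, ... columns.
out = (df2
       .stack()
       .str.extractall(f'({regex})')
       .reset_index()
       .assign(col=lambda d: d.groupby('level_0').cumcount().add(1))
       .pivot(index='level_0', columns='col', values=0)
       .add_prefix('attribute').rename_axis(index=None, columns=None))
print(out)
```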
CodePudding user response:
I blew up parameter to 90k distinct values and data to 180k rows; the whole thing runs in a matter of seconds.
# Split all words into lists.
data = data.apply(lambda x: x.str.split())

# Create sets for each parameter.
colors = set(parameter.color)
materials = set(parameter.material)

# Use set logic of `intersection` to find matching words.
data['color'] = data.name.apply(lambda x: colors.intersection(x))
data['material'] = data.description.apply(lambda x: materials.intersection(x))

# Change the lists back to strings.
data = data.apply(lambda x: x.str.join(' '))
Summarized, that is:

for param, col in zip(parameter.columns, data.columns):
    words = set(parameter[param])
    data[param] = (data[col].str.split()
                            .apply(lambda x: ' '.join(words.intersection(x))))
Output:

                 name          description  color material
0       my blue color  it is a great color   blue
1           red chair       made with wood    red     wood
2           green rod      made with metal  green    metal
3       my blue color  it is a great color   blue
4           red chair       made with wood    red     wood
...               ...                  ...    ...      ...
179995      red chair       made with wood    red     wood
179996      green rod      made with metal  green    metal
179997  my blue color  it is a great color   blue
179998      red chair       made with wood    red     wood
179999      green rod      made with metal  green    metal

[180000 rows x 4 columns]
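The summarized loop runs as-is on the question's toy frames; a minimal end-to-end sketch (it assumes parameter.columns and data.columns pair up in order, as in the sample):

```python
import pandas as pd

parameter = pd.DataFrame({'color': ['red', 'blue', 'green'],
                          'material': ['wood', 'metal', 'plastic']})
data = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                     'description': ['it is a great color', 'made with wood',
                                     'made with metal']})

# One pass per parameter column: split the text into words,
# intersect with the parameter set, and join the hits back.
for param, col in zip(parameter.columns, data.columns):
    words = set(parameter[param])
    data[param] = (data[col].str.split()
                            .apply(lambda x: ' '.join(words.intersection(x))))
print(data)
```

Unlike the regex answer, this only matches whole space-separated words, and when a cell contains several hits the set iteration order makes their join order arbitrary.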