I have two big dataframes, with 125953 and 174808 rows, and the full check takes around 30 minutes; I want to speed it up. Here is a sample of how my dataframes look:
   color material
0    red     wood
1   blue    metal
2  green  plastic

            name          description
0  my blue color  it is a great color
1      red chair       made with wood
2      green rod      made with metal
I want to check every cell in data against every word from the parameter dataframe, cell by cell. This is my current code:
import pandas as pd
import time

data = pd.read_csv('x.csv', converters={i: str for i in range(200)})
parameter = pd.read_excel('y.xlsx', sheet_name="Tags")

def extractData(i):
    for n in i:
        for row in parameter.columns:
            print(n.apply(lambda color: [c for c in parameter[row].tolist()
                                         if str(c) != 'nan' and c in color]))

s = time.time()
extractData([data[row] for row in data.columns[3:4]])
e = time.time()
print(e - s)
The output:

            name          description attribute attribute2
0  my blue color  it is a great color      blue
1      red chair       made with wood       red       wood
2      green rod      made with metal     green      metal
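For anyone reproducing this, the two sample frames above can be rebuilt as a toy sketch (the real inputs are x.csv and y.xlsx with ~126k and ~175k rows; the frame and column names below just mirror the samples):

```python
import pandas as pd

# Toy reconstruction of the samples above.
parameter = pd.DataFrame({'color': ['red', 'blue', 'green'],
                          'material': ['wood', 'metal', 'plastic']})
data = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                     'description': ['it is a great color', 'made with wood',
                                     'made with metal']})
print(parameter.shape, data.shape)
```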
CodePudding user response:
You could craft a regex from the first dataframe, then use it to search the values:
import re

regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))
# 'red|wood|blue|metal|green|plastic'

out = (df2.stack().str.extractall(f'({regex})')[0]
          .groupby(level=[0, 1]).agg(list)
          .unstack(fill_value=[]))
Output:

  description    name
0          []  [blue]
1      [wood]   [red]
2     [metal] [green]
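A self-contained run of this approach, with df1/df2 built from the question's samples (names chosen here for illustration):

```python
import re
import pandas as pd

# df1 holds the parameter words, df2 holds the text to search.
df1 = pd.DataFrame({'color': ['red', 'blue', 'green'],
                    'material': ['wood', 'metal', 'plastic']})
df2 = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                    'description': ['it is a great color', 'made with wood',
                                    'made with metal']})

# One alternation pattern over every distinct parameter word.
regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))

# Find all matches per cell, collect them into a list per (row, column).
out = (df2.stack().str.extractall(f'({regex})')[0]
          .groupby(level=[0, 1]).agg(list)
          .unstack(fill_value=[]))
print(out)
```

One caveat: the pattern matches substrings, so 'red' would also hit inside a word like 'bored'; wrap the group as rf'\b({regex})\b' if only whole words should count.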
Alternative output:
import re

regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))
# 'red|wood|blue|metal|green|plastic'

out = (df2
       .stack()
       .str.extractall(f'({regex})')
       .reset_index()
       .assign(col=lambda d: d.groupby('level_0').cumcount().add(1))
       .pivot(index='level_0', columns='col', values=0)  # keyword args required on pandas >= 2.0
       .add_prefix('attribute').rename_axis(index=None, columns=None))
Output:

  attribute1 attribute2
0       blue        NaN
1        red       wood
2      green      metal
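The alternative can likewise be run end to end on the toy frames (df1/df2 are illustrative stand-ins for the real data; note that pandas 2.0 made DataFrame.pivot keyword-only):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'color': ['red', 'blue', 'green'],
                    'material': ['wood', 'metal', 'plastic']})
df2 = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                    'description': ['it is a great color', 'made with wood',
                                    'made with metal']})

regex = '|'.join(map(re.escape, df1.stack().drop_duplicates()))

# Number the matches per original row (cumcount), then pivot them
# into attribute1, attribute2, ... columns.
out = (df2
       .stack()
       .str.extractall(f'({regex})')
       .reset_index()
       .assign(col=lambda d: d.groupby('level_0').cumcount().add(1))
       .pivot(index='level_0', columns='col', values=0)
       .add_prefix('attribute').rename_axis(index=None, columns=None))
print(out)
```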
CodePudding user response:
I blew up parameter to 90k distinct values and data to 180k rows; the whole thing runs in a matter of seconds.
# Split all words into lists.
data = data.apply(lambda x: x.str.split())

# Create sets for each parameter.
colors = set(parameter.color)
materials = set(parameter.material)

# Use set logic of `intersection` to find matching words.
data['color'] = data.name.apply(lambda x: colors.intersection(x))
data['material'] = data.description.apply(lambda x: materials.intersection(x))

# Change the lists back to strings.
data = data.apply(lambda x: x.str.join(' '))
Summarized, that is:

for param, col in zip(parameter.columns, data.columns):
    words = set(parameter[param])
    data[param] = (data[col].str.split()
                            .apply(lambda x: ' '.join(words.intersection(x))))
Output:

                 name          description  color material
0       my blue color  it is a great color   blue
1           red chair       made with wood    red     wood
2           green rod      made with metal  green    metal
3       my blue color  it is a great color   blue
4           red chair       made with wood    red     wood
...               ...                  ...    ...      ...
179995      red chair       made with wood    red     wood
179996      green rod      made with metal  green    metal
179997  my blue color  it is a great color   blue
179998      red chair       made with wood    red     wood
179999      green rod      made with metal  green    metal

[180000 rows x 4 columns]
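The summarized loop runs as-is on the question's toy frames; a minimal end-to-end sketch (it assumes parameter.columns and data.columns pair up in order, as in the sample):

```python
import pandas as pd

parameter = pd.DataFrame({'color': ['red', 'blue', 'green'],
                          'material': ['wood', 'metal', 'plastic']})
data = pd.DataFrame({'name': ['my blue color', 'red chair', 'green rod'],
                     'description': ['it is a great color', 'made with wood',
                                     'made with metal']})

# One pass per parameter column: split the text into words,
# intersect with the parameter set, and join the hits back.
for param, col in zip(parameter.columns, data.columns):
    words = set(parameter[param])
    data[param] = (data[col].str.split()
                            .apply(lambda x: ' '.join(words.intersection(x))))
print(data)
```

Unlike the regex answer, this only matches whole space-separated words, and when a cell contains several hits the set iteration order makes their join order arbitrary.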