I'm trying to filter a DataFrame using a list of words. The problem is that some of the words may not appear in the strings, but the rest of the list is still useful.
This DataFrame is a catalog that I get from a web-scraping process. Each row holds a different, unique product.
The list I'm using comes from another process and may contain words that are not useful because they do not appear in the string, and I can modify it.
Is there a way to ignore the words that do not appear in the string, so that a word is skipped if the string does not contain it?
For example, suppose we have the following DataFrame:
import pandas as pd

mycolumn = ['Products']
products = ['Kitadol 500 mg x 24 Comprimidos',
            'Paracetamol 500 mg',
            'Prestat 75 mg x 40 Comprimidos',
            'Pedialyte 60 Manzana x 500 mL Solución Oral',
            'Panadol Niños 100mg/Ml Gotas 15ml']
df = pd.DataFrame(products, columns=mycolumn)
And I have the following list of words:
list_words = ['PARACETAMOL','KITADOL','500','MG','LIB']
My final table needs to contain two products:
- Kitadol 500 mg x 24 Comprimidos
- Paracetamol 500 mg
I wonder if someone knows how to deal with this or can give me some ideas. Kind regards and thanks!
CodePudding user response:
If it is possible to specify exactly which words each match requires (here each match needs all 3 values of a tuple), use Series.str.findall with re.I to ignore case and, in a list comprehension, test whether the number of unique values found equals the length of each tuple:
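import re
import numpy as np

tups = [('PARACETAMOL','500','MG'), ('KITADOL','500','MG')]
# One boolean mask per tuple: True where the number of unique matches equals the tuple length
L = [df['Products'].str.findall("|".join(i), flags=re.I).apply(lambda x: len(set(x))) == len(i)
     for i in tups]
# Keep the rows that satisfy at least one tuple
df = df[np.logical_or.reduce(L)]
print(df)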
Products
0 Kitadol 500 mg x 24 Comprimidos
1 Paracetamol 500 mg
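Note that the "|".join pattern matches substrings (for example, MG would also be found inside 100mg/Ml) and does not escape regex metacharacters. A minimal sketch of the same idea using whole-token matching instead, assuming words are separated by whitespace; the names groups and matches_any_group are hypothetical and not part of the answer above:

```python
import pandas as pd

products = [
    "Kitadol 500 mg x 24 Comprimidos",
    "Paracetamol 500 mg",
    "Prestat 75 mg x 40 Comprimidos",
    "Pedialyte 60 Manzana x 500 mL Solución Oral",
    "Panadol Niños 100mg/Ml Gotas 15ml",
]
df = pd.DataFrame(products, columns=["Products"])

# Hypothetical groups: every word in a group must be present for a row to match.
groups = [{"PARACETAMOL", "500", "MG"}, {"KITADOL", "500", "MG"}]

def matches_any_group(text):
    # Upper-case, whitespace-separated tokens of one product name.
    tokens = set(text.upper().split())
    # True if at least one group is fully contained in the tokens.
    return any(group <= tokens for group in groups)

result = df[df["Products"].map(matches_any_group)]
print(result)
```

Set containment avoids building a regex at all, at the cost of missing words that are glued to other characters (such as 100mg/Ml).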
CodePudding user response:
Your question is not clear, so I am making the following assumptions. Going by your output, you want to find medications, and words that describe quantities or units have to be excluded. My attempt is below:
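# Keep only the alphabetic words that are not units
s = [x for x in list_words if x.isalpha() and x not in ['MG', 'ML', 'X']]
# Keep rows whose upper-cased tokens share at least one word with s
df[df['Products'].str.upper().str.split().map(set).apply(lambda x: len(x.intersection(set(s)))) > 0]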
Products
0 Kitadol 500 mg x 24 Comprimidos
1 Paracetamol 500 mg
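Another possible reading of the question is that words missing from a string should simply not count against it. A minimal sketch under that assumption: count how many words of list_words each product contains and keep the rows with the highest count. The rule "keep the best-scoring rows" is my assumption, not something stated in the question, and the names tokens and hits are hypothetical:

```python
import pandas as pd

products = [
    "Kitadol 500 mg x 24 Comprimidos",
    "Paracetamol 500 mg",
    "Prestat 75 mg x 40 Comprimidos",
    "Pedialyte 60 Manzana x 500 mL Solución Oral",
    "Panadol Niños 100mg/Ml Gotas 15ml",
]
df = pd.DataFrame(products, columns=["Products"])
list_words = ["PARACETAMOL", "KITADOL", "500", "MG", "LIB"]
wanted = set(list_words)

# Upper-case, whitespace-separated tokens of each product name.
tokens = df["Products"].str.upper().str.split().map(set)

# Number of list words found in each row; absent words (here LIB) add nothing.
hits = tokens.map(lambda t: len(t & wanted))

# Assumption: the rows with the most matching words are the wanted ones.
result = df[hits == hits.max()]
print(result)
```

With this data the first two products each match three words, while the others match one or none, so only those two rows are kept.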