Filter out entries of datasets based on string matching

I'm working with a dataframe of chemical formulas (str objects). Example:

formula

Na0.2Cl0.4O0.7Rb1
Hg0.04Mg0.2Ag2O4
Rb0.2AgO
...

I want to filter it based on specified elements. For example, I want the output to keep only the formulas that contain the elements 'Na', 'Cl' and 'Rb', so the desired output would be:

formula

Na0.2Cl0.4O0.7Rb1

What I've tried to do is the following:

 for i, formula in enumerate(df['formula'])

    if ('Na' and 'Cl' and 'Rb' not in formula):
       
          df = df.drop(index=i)

but it does not seem to work.

CodePudding user response:

Your requirements are unclear, but I'll assume you want to filter based on a set of elements.
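As for why the loop in the question does not do what was intended: apart from the missing colon after the `for` line, `'Na' and 'Cl' and 'Rb' not in formula` does not test all three strings. Non-empty strings are truthy, so the expression reduces to `'Rb' not in formula`. A small sketch of the difference, using one formula from the example data (the variable names are only for illustration):

```python
formula = 'Rb0.2AgO'
elements = ('Na', 'Cl', 'Rb')

# What the original condition evaluates: 'Na' and 'Cl' are truthy,
# so only the last operand, 'Rb' not in formula, decides the result.
original = ('Na' and 'Cl' and 'Rb' not in formula)

# Intended check: the formula should be dropped if any element is missing.
missing_any = any(el not in formula for el in elements)

print(original)     # False, so the loop would keep this row
print(missing_any)  # True, so this row should have been dropped
```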

Keeping formulas where all elements from the set are used:

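s = {'Na','Cl','Rb'}
# Alternation of the wanted symbols, e.g. '(Na|Cl|Rb)' (set order is arbitrary)
regex = f'({"|".join(s)})'
mask = (
 df['formula']
 .str.extractall(regex)[0]               # one row per match, indexed by (row, match)
 .groupby(level=0).nunique().eq(len(s))  # True if every wanted symbol was found
)

# Rows without any match are absent from mask, so select by index
df.loc[mask[mask].index]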

output:

             formula
0  Na0.2Cl0.4O0.7Rb1

Keeping formulas where only elements from the set are used:

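s = {'Na','Cl','Rb'}

mask = (df['formula']
 .str.extractall('([A-Z][a-z]*)')[0]  # every element symbol in each formula
 .isin(s)                             # is the symbol one of the wanted ones?
 .groupby(level=0).all()              # True only if all symbols are wanted
)

df[mask]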

output: no rows for this dataset
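Both readings can also be expressed by first parsing each formula into its set of element symbols, which avoids partial matches such as 'N' inside 'Na'. This is only a sketch, assuming every element symbol is one uppercase letter followed by optional lowercase letters; the helper `elements_of` and the names `wanted`, `contains_all` and `only_wanted` are made up for illustration:

```python
import re
import pandas as pd

df = pd.DataFrame({'formula': ['Na0.2Cl0.4O0.7Rb1',
                               'Hg0.04Mg0.2Ag2O4',
                               'Rb0.2AgO']})
wanted = {'Na', 'Cl', 'Rb'}

def elements_of(formula):
    """Return the set of element symbols found in a formula string."""
    return set(re.findall(r'[A-Z][a-z]*', formula))

# One set of symbols per row, e.g. {'Na', 'Cl', 'O', 'Rb'} for the first row.
parsed = df['formula'].map(elements_of)

# Formulas that contain every wanted element (other elements are allowed).
contains_all = df[parsed.map(wanted.issubset)]

# Formulas made up exclusively of wanted elements.
only_wanted = df[parsed.map(lambda found: found.issubset(wanted))]

print(contains_all)
print(only_wanted)
```

With the example data, `contains_all` keeps only the first row and `only_wanted` is empty, because that row also contains O.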

CodePudding user response:

You can use str.contains with an "or" pattern (|) to keep rows that match at least one of several strings:

df[df['formula'].str.contains("Na|Cl|Rb", na=False)]

Or, if you want to match all of them, you can use str.contains with a lookahead pattern:

df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')]
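The lookahead pattern does not have to be typed by hand; it can be generated from a list of elements. A rough sketch, assuming plain substring matching is acceptable (a single-letter symbol such as 'N' would also match inside 'Na'); `required`, `pattern` and `result` are illustrative names:

```python
import re
import pandas as pd

df = pd.DataFrame({'formula': ['Na0.2Cl0.4O0.7Rb1',
                               'Hg0.04Mg0.2Ag2O4',
                               'Rb0.2AgO']})
required = ['Na', 'Cl', 'Rb']

# One lookahead per element: each must be found somewhere in the string.
pattern = '^' + ''.join(f'(?=.*{re.escape(el)})' for el in required)

# na=False treats missing values as non-matching instead of propagating NaN.
mask = df['formula'].str.contains(pattern, na=False)
result = df[mask]

print(pattern)
print(result)
```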