I'm working with a dataframe of chemical formulas (str objects). Example:
formula
Na0.2Cl0.4O0.7Rb1
Hg0.04Mg0.2Ag2O4
Rb0.2AgO
...
I want to filter it based on specified elements. For example, I want to produce an output which only contains the elements 'Na', 'Cl', 'Rb',
so the desired output should be:
formula
Na0.2Cl0.4O0.7Rb1
What I've tried to do is the following:
for i, formula in enumerate(df['formula']):
    if ('Na' and 'Cl' and 'Rb' not in formula):
        df = df.drop(index=i)
but it does not seem to work.
CodePudding user response:
Your requirements are unclear, but I assume you want to filter based on a set of elements.
Keeping formulas where all elements from the set are used:
s = {'Na','Cl','Rb'}
regex = f'({"|".join(s)})'
mask = (
    df['formula']
    .str.extractall(regex)[0]
    .groupby(level=0).nunique().eq(len(s))
)
df.loc[mask[mask].index]
output:
             formula
0  Na0.2Cl0.4O0.7Rb1
Keeping formulas where only elements from the set are used:
s = {'Na','Cl','Rb'}
mask = (
    df['formula']
    .str.extractall('([A-Z][a-z]*)')[0]
    .isin(s)
    .groupby(level=0).all()
)
df[mask]
output: no rows for this dataset
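If the regex pipelines are hard to follow, the same two interpretations can be sketched with plain Python sets. This is only an illustration on the three sample rows from the question: the helper name symbols is made up here, and it assumes every element symbol is one uppercase letter followed by optional lowercase letters.

```python
import re
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {"formula": ["Na0.2Cl0.4O0.7Rb1", "Hg0.04Mg0.2Ag2O4", "Rb0.2AgO"]}
)
wanted = {"Na", "Cl", "Rb"}


def symbols(formula):
    # Hypothetical helper: the set of element symbols in one formula,
    # assuming a symbol is an uppercase letter plus optional lowercase letters.
    return set(re.findall(r"[A-Z][a-z]*", formula))


found = df["formula"].map(symbols)

# Interpretation 1: every wanted element is present (extra elements allowed).
contains_all = df[found.map(wanted.issubset)]

# Interpretation 2: no element outside the wanted set is present.
only_wanted = df[found.map(wanted.issuperset)]

print(contains_all)
print(only_wanted)
```

On the sample data, contains_all keeps only the first row and only_wanted is empty, because the first formula also contains O.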
CodePudding user response:
You can use str.contains with an or (|) condition in the pattern to keep rows that match at least one of the elements:
df[df['formula'].str.contains("Na|Cl|Rb", na=False)]
Or you can use a lookahead pattern with str.contains if you want to match all of them:
df[df['formula'].str.contains(r'^(?=.*Na)(?=.*Cl)(?=.*Rb)')]
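As for why the loop in the question does not behave as expected: in Python, not in binds tighter than and, and non-empty strings are truthy, so the condition effectively tests only whether 'Rb' is missing. The sketch below shows this on one sample formula and then builds an explicit mask with all(); the variable names are just for illustration, and it uses plain substring checks, which is fine for these three symbols but would be unsafe for one-letter symbols such as N.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {"formula": ["Na0.2Cl0.4O0.7Rb1", "Hg0.04Mg0.2Ag2O4", "Rb0.2AgO"]}
)
elements = ["Na", "Cl", "Rb"]

formula = "Rb0.2AgO"  # contains Rb, but neither Na nor Cl

# Parsed as: 'Na' and 'Cl' and ('Rb' not in formula).
# The first two operands are truthy, so only the Rb test decides the result.
original_test = "Na" and "Cl" and "Rb" not in formula

# Explicit check: is at least one of the required elements missing?
missing_any = not all(el in formula for el in elements)

print(original_test)  # the loop would keep this row
print(missing_any)    # but the row should be dropped

# Boolean mask instead of dropping rows inside a loop.
mask = df["formula"].apply(lambda f: all(el in f for el in elements))
filtered = df[mask]
print(filtered)
```

Building a mask also avoids calling drop row by row, which only works in the original loop because the positions from enumerate happen to match the default index labels.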