Question is a bit of a mouthful but here's what I'm working with:
I have a dataframe like this:
CO2 | INVALIDS |
---|---|
35.3 | [SO2] |
2.5 | [CO2, SO2, NO2] |
(Index is in DatetimeIndex
format but it's inconvenient to show using GitHub table markdown format)
Essentially, I would like to extract those rows in which (in this case) 'CO2' is contained in 'INVALIDS' column.
So here I'd extract the second row and I'd be left with:
CO2 | INVALIDS |
---|---|
2.5 | [CO2, SO2, NO2] |
I tried doing it by using this approach:
df.loc[df.columns[~df.columns.isin(['INVALIDS'])][0] in df['INVALIDS']]
However, this returns a list of True
/ False
values which gives the following exception:
'False: boolean label can not be used without a boolean index'
I think I could do this with iterrows
approach but surely there's a better (vectorized) way.
Thanks.
CodePudding user response:
A vectorized solution would be possible with Series.explode
and using the index of the booleans with .loc
:
m = df["INVALIDS"].explode().eq("CO2")
df.loc[m[m].index]
CO2 INVALIDS
1 2.5 [CO2, SO2, NO2]
Edit to match the column names:
m = df["INVALIDS"].explode().isin(df.columns)
df.loc[m[m].index]
CO2 INVALIDS
1 2.5 [CO2, SO2, NO2]
CodePudding user response:
As you have lists, you must loop. Use a list comprehension with any
for boolean indexing:
out = df[[any('CO2' in x for x in l) for l in df['INVALIDS']]]
output:
CO2 INVALIDS
1 2.5 [CO2, SO2, NO2]