Home > Enterprise >  How to check if a column name is contained in an array in a different column and extract only those
How to check if a column name is contained in an array in a different column and extract only those

Time:08-08

Question is a bit of a mouthful but here's what I'm working with:

I have a dataframe like this:

CO2 INVALIDS
35.3 [SO2]
2.5 [CO2, SO2, NO2]

(Index is in DatetimeIndex format but it's inconvenient to show using GitHub table markdown format)

Essentially, I would like to extract those rows in which (in this case) 'CO2' is contained in 'INVALIDS' column.

So here I'd extract the second row and I'd be left with:

CO2 INVALIDS
2.5 [CO2, SO2, NO2]

I tried doing it by using this approach:

df.loc[df.columns[~df.columns.isin(['INVALIDS'])][0] in df['INVALIDS']]

However, this returns a list of True / False values which gives the following exception:

'False: boolean label can not be used without a boolean index'

I think I could do this with iterrows approach but surely there's a better (vectorized) way.

Thanks.

CodePudding user response:

A vectorized solution would be possible with Series.explode and using the index of the booleans with .loc:

m = df["INVALIDS"].explode().eq("CO2")
df.loc[m[m].index]
   CO2         INVALIDS
1  2.5  [CO2, SO2, NO2]

Edit to match the column names:

m = df["INVALIDS"].explode().isin(df.columns)
df.loc[m[m].index]
   CO2         INVALIDS
1  2.5  [CO2, SO2, NO2]

CodePudding user response:

As you have lists, you must loop. Use a list comprehension with any for boolean indexing:

out = df[[any('CO2' in x for x in l) for l in df['INVALIDS']]]

output:

   CO2         INVALIDS
1  2.5  [CO2, SO2, NO2]
  • Related