I have a list of taxonomic classifications that are not uniform across their levels within the list and I want to filter all the columns of a data-frame against each item of the list to produce a singular sub-data-frame.
An example list would be
['Sk1','Sub1','Family 3','Clade C']
my actual list has 23 items.
An example data-frame that looks like this and my actual dataframe has 1100 rows:
Species | Super kingdom | Subkingdom | Clade | Class | Family |
---|---|---|---|---|---|
Species 1 | SK1 | Sub1 | Clade A | Class I | Family 1 |
Species 2 | SK2 | Sub2 | Clade B | Class II | Family 2 |
Species 3 | SK3 | Sub3 | Clade C | Class III | Family 3 |
Species 4 | Sk4 | Sub4 | Clade D | Class IV | Family 4 |
So filtering the frame against the list against the list should give:
Species | Super kingdom | Subkingdom | Clade | Class | Family |
---|---|---|---|---|---|
Species 1 | SK1 | Sub1 | Clade A | Class I | Family 1 |
Species 3 | SK3 | Sub3 | Clade C | Class III | Family 3 |
The closest I have come to some solutions is to use np and pandas as:
mask = np.column_stack([taxonframe[col].str.contains(Query, na=False) for col in taxonframe])
taxonframe.loc[mask.any(axis=1)]
But this one would need me to loop though the list of queries and make a dataframe for each item in the list and somehow merge 23 dataframes (This one I am trying to avoid).
OR
I could use itertuples()
on each row and do a string search on each row as a list and then convert them back to a new dataframe-if there is nothing else I guess I can work with this. Is there an elegant pandas method to do this?
CodePudding user response:
Assuming other columns cannot contain the same strings from other columns (e.g. Clade
column cannot contain Family 3
, etc.), you can use isin
any
to create a boolean mask to filter df
:
out = df[df.isin(['Sk1','Sub1','Family 3','Clade C']).any(axis=1)]
Output:
Species Super kingdom Subkingdom Clade Class Family
0 Species 1 SK1 Sub1 Clade A Class I Family 1
2 Species 3 SK3 Sub3 Clade C Class III Family 3