Subset rows based on multiple iterative columns in Pandas-CodePudding

I need to subset rows of df based on several columns (c1 through c100 columns) which have strings. Subset rows which equal a particular value for any of the 100 columns. Minimal example with 3 columns is:

df = pd.DataFrame({"": [0,1,2,3,4,5,6,7,8],
                     "c1": ["abc1", "", "dfg", "abc1", "","dfg","ghj","abc1","abc1"], 
                     "c2": ["abc1", "abc1", "dfg", "dfg", "","dfg","","ghj","abc1"], 
                     "c3": ["abc1", "", "dfg", "dfg", "dfg","dfg","abc1","ghj","abc1"]})

        c1      c2      c3
0   0   abc1    abc1    abc1
1   1           abc1    
2   2   dfg     dfg     dfg
3   3   abc1    dfg     dfg
4   4                   dfg
5   5   dfg     dfg     dfg
6   6   ghj             abc1
7   7   abc1    ghj     ghj
8   8   abc1    abc1    abc1

The loc command gives us the answer but too many 'or' conditions to write. I am looking for something iterative over 100 columns. Also what if I want to check for a list of strings instead of just 1 string. For example ['abc1', 'ghj'], instead of just 'abc1'.

df.loc[(df['c1'] == "abc1") | (df['c2'] == "abc1") | (df['c3'] == "abc1")]

CodePudding user response：

1. Searching for a single string

You can just check each "cell" with df.eq('abc1') or df=='abc1' and then take an OR over columns using .any(1) to check if each row has at least 1 instance of 'abc1'. Then use that boolean series for indexing the original data frame to get the rows you need.

output = df[df.eq('abc1').any(1)]

print(output)

        c1    c2    c3
0  0  abc1  abc1  abc1
1  1        abc1      
3  3  abc1   dfg   dfg
6  6   ghj        abc1
7  7  abc1   ghj   ghj
8  8  abc1  abc1  abc1

If you want to check only a specific set of columns then use df[list_of_columns] == 'abc1' instead.

2. Searching for multiple strings

For multiple strings use df.isin([list-of-strings]). Here is an example of how that would work.

output = df[df.isin(['abc1', 'dfg']).any(1)]

print(output)

        c1    c2    c3
0  0  abc1  abc1  abc1
1  1        abc1      
2  2   dfg   dfg   dfg
3  3  abc1   dfg   dfg
4  4               dfg
5  5   dfg   dfg   dfg
6  6   ghj        abc1
7  7  abc1   ghj   ghj
8  8  abc1  abc1  abc1

3. Searching for multiple strings across specific columns

If you want to search for multiple strings in a list of specific columns, you can just modify the previous .isin or .eq method with the list of columns filtered, and then continue the same method. Here is an example for the above script working for just 2 columns c1 and c3

output = df[df[['c1','c3']].isin(['abc1','ghj']).any(1)]

print(output)

        c1    c2    c3
0  0  abc1  abc1  abc1
3  3  abc1   dfg   dfg
6  6   ghj        abc1
7  7  abc1   ghj   ghj
8  8  abc1  abc1  abc1

References

Read more here on the official documentation -