I need to subset rows of df based on several columns (c1 through c100 columns) which have strings. Subset rows which equal a particular value for any of the 100 columns. Minimal example with 3 columns is:
df = pd.DataFrame({"": [0,1,2,3,4,5,6,7,8],
"c1": ["abc1", "", "dfg", "abc1", "","dfg","ghj","abc1","abc1"],
"c2": ["abc1", "abc1", "dfg", "dfg", "","dfg","","ghj","abc1"],
"c3": ["abc1", "", "dfg", "dfg", "dfg","dfg","abc1","ghj","abc1"]})
c1 c2 c3
0 0 abc1 abc1 abc1
1 1 abc1
2 2 dfg dfg dfg
3 3 abc1 dfg dfg
4 4 dfg
5 5 dfg dfg dfg
6 6 ghj abc1
7 7 abc1 ghj ghj
8 8 abc1 abc1 abc1
The loc command gives us the answer but too many 'or' conditions to write. I am looking for something iterative over 100 columns. Also what if I want to check for a list of strings instead of just 1 string. For example ['abc1', 'ghj'], instead of just 'abc1'.
df.loc[(df['c1'] == "abc1") | (df['c2'] == "abc1") | (df['c3'] == "abc1")]
CodePudding user response:
1. Searching for a single string
You can just check each "cell" with df.eq('abc1')
or df=='abc1'
and then take an OR
over columns using .any(1)
to check if each row has at least 1 instance of 'abc1'. Then use that boolean series for indexing the original data frame to get the rows you need.
output = df[df.eq('abc1').any(1)]
print(output)
c1 c2 c3
0 0 abc1 abc1 abc1
1 1 abc1
3 3 abc1 dfg dfg
6 6 ghj abc1
7 7 abc1 ghj ghj
8 8 abc1 abc1 abc1
If you want to check only a specific set of columns then use df[list_of_columns] == 'abc1'
instead.
2. Searching for multiple strings
For multiple strings use df.isin([list-of-strings])
. Here is an example of how that would work.
output = df[df.isin(['abc1', 'dfg']).any(1)]
print(output)
c1 c2 c3
0 0 abc1 abc1 abc1
1 1 abc1
2 2 dfg dfg dfg
3 3 abc1 dfg dfg
4 4 dfg
5 5 dfg dfg dfg
6 6 ghj abc1
7 7 abc1 ghj ghj
8 8 abc1 abc1 abc1
3. Searching for multiple strings across specific columns
If you want to search for multiple strings in a list of specific columns, you can just modify the previous .isin
or .eq
method with the list of columns filtered, and then continue the same method. Here is an example for the above script working for just 2 columns c1
and c3
output = df[df[['c1','c3']].isin(['abc1','ghj']).any(1)]
print(output)
c1 c2 c3
0 0 abc1 abc1 abc1
3 3 abc1 dfg dfg
6 6 ghj abc1
7 7 abc1 ghj ghj
8 8 abc1 abc1 abc1
References
Read more here on the official documentation -