Pandas: Filter rows by multiple occurrences of specific substrings in column cells-CodePudding

Assuming the dataFrame:

df = pd.DataFrame({
    'column_1': ['Apple', 'Apple Apple', 'Apple Apple', 'Peach Peach', 'Banana', 'Banana Banana'],
    'column_2': ['Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value']
})

which gives:

        column_1    column_2
0          Apple  Some value
1    Apple Apple  Some value
2    Apple Apple  Some value
3    Peach Peach  Some value
4         Banana  Some value
5  Banana Banana  Some value

How do I filter the rows (reassign the df) to only include rows where either the substrings 'Apple' or 'Banana' occur more than once in each string in each cell of column_1?

Note that I'd like to specify the filter to only look for multiple occurrences of 'Apple' & 'Banana' (not programmatically look for all multiple occurrences of any substring, e.g. 'Peach Peach' should not be included).

I.e. the filtering should result in:

        column_1    column_2
1    Apple Apple  Some value
2    Apple Apple  Some value
5  Banana Banana  Some value

CodePudding user response：

Check Below answer (kept column Apple_Banana_count for sanity check):

import pandas as pd
import re
df = pd.DataFrame({
    'column_1': ['Apple', 'Apple Apple', 'Apple Apple', 'Peach Peach', 'Banana', 'Banana Banana'],
    'column_2': ['Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value']
})

df.assign(Apple_Banana_count =  df['column_1'].apply(lambda x: len([m.start() for m in re.finditer('Apple|Banana', x)]))).\
query('Apple_Banana_count > 1')

Output:

Updated code as per the Op's comment - Leaving unfiltered for Sense Check (You can filter on these columns to get the required result):

import pandas as pd

df = pd.DataFrame({
    'column_1': ['Apple', 'Apple Apple', 'Apple Apple', 'Peach Peach', 'Banana', 'Banana Banana', 'Apple Banana','Banana blash blah Apple '],
    'column_2': ['Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value']
})

df.assign(Apple_Banana_count =  df['column_1'].apply(lambda x: len([m.start() for m in re.finditer('Apple|Banana', x)])),
Apple_and_Banana=  df['column_1'].apply(lambda x: len([m.start() for m in re.finditer('(?=.*Apple)(?=.*Banana)', x)]))
)

OutPut:

CodePudding user response：

Keeping the rows where 'Apple' or occurs more than once amounts to keeping the rows with at least 2 times 'Apple' and therefore containing 'Apple Apple'. This can be done with

words_to_keep = 'Apple', 'Banana'
words_to_keep = [' '.join([word] * 2) for word in words_to_keep]
# repeat the words 2 times
df = df.loc[df['column_1'].str.contains('|'.join(words_to_keep))]
# keep only the rows containing the repeated words using regex

CodePudding user response：

If the string to match is only Apple and Bananas you can use regex with pandas.Series.str.count.

df[df['column_1'].str.count('Apple|Banana') > 1]

This gives us the expected output

        column_1    column_2
1    Apple Apple  Some value
2    Apple Apple  Some value
5  Banana Banana  Some value

EDIT: As @ouroboros1, suggested the above won't work in case OP has Apple Banana in the row.

Here is the case where I have tried to cover such cases.

df = pd.DataFrame({
    'column_1': ['Apple Banana','Apple', 'Apple Apple', 'Apple Apple', 'Peach Peach', 'Banana', 'Banana Banana', 'Apple Banana Apple'],
    'column_2': ['Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value', 'Some value']
})

             column_1    column_2
0        Apple Banana  Some value
1               Apple  Some value
2         Apple Apple  Some value
3         Apple Apple  Some value
4         Peach Peach  Some value
5              Banana  Some value
6       Banana Banana  Some value
7  Apple Banana Apple  Some value

We can use extractall to get our pattern group in multiple columns then we apply count to get all not nan values count then we use reindex to get the dataframe containing false value for those rows that did not have a match and filled those value with False. We then select only those values containing atleast one True value. Then we filter our original dataframe with True values in the index.

df[(df['column_1'].str.extractall('(Apple)|(Banana)').groupby(level=0).agg('count')>1).reindex(df.index, fill_value=False).any(axis=1)]

This gives is the expected output

             column_1    column_2
2         Apple Apple  Some value
3         Apple Apple  Some value
6       Banana Banana  Some value
7  Apple Banana Apple  Some value

CodePudding user response：

This appears to work for my purposes (thanks to @Himanshuman for pointing in this direction):

import re

df = df[
        (df['column_1'].str.count(r'apple', re.I) > 1) | \
        (df['column_1'].str.count(r'banana', re.I) > 1)
        ]

It uses the bitwise OR operator (|) between the conditions [with \ as line break character] and will give the desired result):

                             column_1    column_2
1                         Apple Apple  Some value
2                         Apple Apple  Some value
5                       Banana Banana  Some value
6                      Apple is Apple  Some value
7  Banana is not Apple but is Banana   Some value