How can I filter a dataframe based on the values in two different columns?

I have a dataframe with three columns: Email, Date, and Region. Email is just an email address, Date is a datetime (stored with string dtype), and Region is either Region 1 or Region 2. I am trying to filter the dataframe three different ways:

1. Only emails that have at least one Region 1 row and at least one Region 2 row.
2. Only emails whose rows are all Region 1.
3. Only emails whose rows are all Region 2.

For example, if [email protected] has a row where Region is "Region 1" and another row where Region is "Region 2", then that email belongs to the first case. If all of its rows are "Region 1" or all are "Region 2", then it belongs to the second or third case, respectively.

I know how to filter the dataframe just based off of a column value, but I don't know how to bring in the Email element. I am also doing this in Google Colab, if that changes anything, but I can use a Jupyter Notebook if need be.

CodePudding user response:

Here's a solution that uses groupby to build the set of emails seen in each region. You can then use set operations to get the Region1-only emails, the Region2-only emails, or the emails that appear in both regions.

import pandas as pd

df = pd.DataFrame({
    'email':[
        '[email protected]','[email protected]','[email protected]',
        '[email protected]','[email protected]','[email protected]',
        '[email protected]','[email protected]','[email protected]',
        '[email protected]','[email protected]','[email protected]',
    ],
    'date':[
        1,2,3,
        1,2,3,
        1,2,3,
        1,2,3,
    ],
    'region':[
        'Region1','Region1','Region1',
        'Region2','Region1','Region1',
        'Region1','Region2','Region2',
        'Region2','Region2','Region2',
    ],
})

# group by region to get the set of emails seen in each region
emails_per_region = df.groupby('region')['email'].apply(set)

# set operations to get the email sets for R1-only, R2-only, and both
r1_only_emails = emails_per_region.loc['Region1'].difference(emails_per_region.loc['Region2'])
r2_only_emails = emails_per_region.loc['Region2'].difference(emails_per_region.loc['Region1'])
both_r_emails = emails_per_region.loc['Region1'].intersection(emails_per_region.loc['Region2'])

# filter the table to only the rows whose email appears in both R1 and R2 (for example)
both_r_df = df[df['email'].isin(both_r_emails)]

print(both_r_df)

output:

           email  date   region
3  [email protected]     1  Region2
4  [email protected]     2  Region1
5  [email protected]     3  Region1
6  [email protected]     1  Region1
7  [email protected]     2  Region2
8  [email protected]     3  Region2
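
The same pattern should cover the other two cases from the question as well. Below is a minimal sketch of all three filters together. The addresses (a@example.com and so on) are hypothetical placeholders used only for illustration, and the use of Series.get with an empty-set default is my own assumption for handling a region that never appears in the data (plain .loc would raise a KeyError there).

```python
import pandas as pd

# Hypothetical placeholder addresses, used only for illustration.
df = pd.DataFrame({
    'email': (['a@example.com'] * 3 + ['b@example.com'] * 3
              + ['c@example.com'] * 3 + ['d@example.com'] * 3),
    'date': [1, 2, 3] * 4,
    'region': [
        'Region1', 'Region1', 'Region1',
        'Region2', 'Region1', 'Region1',
        'Region1', 'Region2', 'Region2',
        'Region2', 'Region2', 'Region2',
    ],
})

# One set of emails per region; .get() falls back to an empty set
# if a region never appears in the data (assumption, see above).
emails_per_region = df.groupby('region')['email'].apply(set)
r1_emails = emails_per_region.get('Region1', set())
r2_emails = emails_per_region.get('Region2', set())

# One filtered DataFrame per case from the question.
both_r_df = df[df['email'].isin(r1_emails & r2_emails)]   # case 1
r1_only_df = df[df['email'].isin(r1_emails - r2_emails)]  # case 2
r2_only_df = df[df['email'].isin(r2_emails - r1_emails)]  # case 3

print(r1_only_df)
print(r2_only_df)
```

The set operators & and - are shorthand for the intersection and difference calls used above, so the logic is unchanged. Each filter keeps every row for a matching email, not just one row per email.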