Home > other >  How to subset pandas df based on two columns?
How to subset pandas df based on two columns?

Time:10-27

I need to subset rows of df based on two columns (c1 and c2 columns) which have strings. I need to be able to return rows where one value in c1 is associated with only 2 different values in c2. col4-6 are irrelevant for subsetting and just need to be returned.

Code to recreate df 
df = pd.DataFrame({"": [0,1,2,3,4,5,6,7,8],
                     "c1": ["ABC", "ABC", "dfg", "dfg", "dfg","dfg","ghj","ghj","ghj"], 
                     "c2": ["delta", "delta", "alpha", "bravo", "alpha","bravo","bravo","delta","alpha"], 
                     "c3": [1, 2, 2, 3, 5,6,3,3,3], 
                     "col4": [786, 787, 777, 775, 767,715,772,712,712], 
                     "col5": [10, 11, 13, 12, 13,12,14,12,12], 
                     "col6": [1,2,4, 3, 4,3, 5, 8,8]})
 df
     c1      c2     c3   col4 col5 col6
 0   ABC     delta   1    786  10   1
 1   ABC     delta   2    787  11   2
 2   dfg     alpha   2    777  13   4
 3   dfg     bravo   3    775  12   3
 4   dfg     alpha   5    767  13   4
 5   dfg     bravo   6    715  12   3
 6   ghj     bravo   3    772  14   5
 7   ghj     delta   3    712  12   8
 8   ghj     alpha   3    712  12   8

Answer df should be:

 finaldf
     c1      c2     c3   col4 col5 col6
 2   dfg     alpha   2    777  13   4
 3   dfg     bravo   3    775  12   3
 4   dfg     alpha   5    767  13   4
 5   dfg     bravo   6    715  12   3

What if the rows where one value in c1 is associated with 2 and 3 different values in c2 is of interest like in df below?

 finaldf
     c1      c2     c3   col4 col5 col6
 2   dfg     alpha   2    777  13   4
 3   dfg     bravo   3    775  12   3
 4   dfg     alpha   5    767  13   4
 5   dfg     bravo   6    715  12   3
 6   ghj     bravo   3    772  14   5
 7   ghj     delta   3    712  12   8
 8   ghj     alpha   3    712  12   8

I think some kind of groupby and transform operation could help achieve this.

CodePudding user response:

Get the index where your condition is true:

cond = (df.groupby("c1")["c2"]
        .nunique().where(lambda x: x == 2)
        .dropna().index)

and then use .loc

df.loc[df["c1"].isin(cond)]

CodePudding user response:

here is one way to do it

#using transform, return a boolean on unique values matching the count
# and filter the dataframe
df[df.groupby('c1')['c2'].transform(lambda x: x.nunique()==2)]
c1  c2  c3  col4    col5    col6
2   2   dfg     alpha   2   777     13  4
3   3   dfg     bravo   3   775     12  3
4   4   dfg     alpha   5   767     13  4
5   5   dfg     bravo   6   715     12  3
df[df.groupby('c1')['c2'].transform(lambda x: x.nunique()>=2)]
c1  c2  c3  col4    col5    col6
2   2   dfg     alpha   2   777     13  4
3   3   dfg     bravo   3   775     12  3
4   4   dfg     alpha   5   767     13  4
5   5   dfg     bravo   6   715     12  3
6   6   ghj     bravo   3   772     14  5
7   7   ghj     delta   3   712     12  8
8   8   ghj     alpha   3   712     12  8
  • Related