Home > Back-end >  How to subset rows of df based on unique values?
How to subset rows of df based on unique values?

Time:10-28

I need to drop rows which do not match the criteria of equal values in nunique. Every value in "lot" column is associated with two values in "shipment" column. For every value of "lot", number of unique "cargotype" values may/may not be different for each shipment. I want to drop the rows for "lot" which have unequal "cargotype" values for 2 shipments. col4-6 are irrelevant for subsetting and just need to be returned.

Code to recreate df 
df = pd.DataFrame({"": [0,1,2,3,4,5,6,7,8,9,10],
         "lot": ["dfg", "dfg", "dfg","dfg","ghj","ghj","ghj","abc","abc","abc","abc"], 
          "shipment": ["a", "b", "a","b","c","d","d","e","f","e","e"], 
"cargotype": ["adam", "chris", "bob","tom","chris","hanna","chris","charlie","king","su","min"], 
        "col4": [ 777, 775, 767,715,772,712,712, 123, 122, 121,120], 
         "col5": [ 13, 12, 13,12,14,12,12, 15, 16, 17,18], 
         "col6": [4, 3, 4,3, 5, 8,8, 7,7,0,0]})
 df
     lot   shipment  cargotype   col4 col5 col6
 0   dfg     a       adam        777  13   4
 1   dfg     b       chris       775  12   3
 2   dfg     a       bob         767  13   4
 3   dfg     b       tom         715  12   3
 4   ghj     c       chris       772  14   5
 5   ghj     d       hanna       712  12   8
 6   ghj     d       chris       712  12   8
 7   abc     e       charlie     123  15   7
 8   abc     f       king        122  16   7
 9   abc     e       su          121  17   0
10   abc     e       min         120  18   0

To check uniqueness in "cargotype" column, I use

pd.DataFrame((df.groupby(["lot","shipment"])["cargotype"].nunique()))

lot    shipment    cargotype
abc    e           3
       f           1
dfg    a           2
       b           2
ghj    c           1
       d           2

Answer df should be:

 finaldf
     lot   shipment  cargotype   col4 col5 col6
 0   dfg     a       adam        777  13   4
 1   dfg     b       chris       775  12   3
 2   dfg     a       bob         767  13   4
 3   dfg     b       tom         715  12   3

Only "dfg" lot remains because unique "cargotype" values for 2 "shipments" are equal to each other. Thank you!

CodePudding user response:

Don't ask me how, but this creates your desired outcome

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)

df1 = df.groupby(["lot","shipment"])["cargotype"].nunique().unstack().apply(squeeze_nan, axis=1).dropna(how='all', axis=1)
lot = df1[df1['a'] == df1['b']].index
print(df[df['lot'].isin(lot)])

Caveat: not sure if this will work when a lot has more than 2 types of shipment values

  • Related