I need to drop rows which do not match the criteria of equal values in nunique. Every value in "lot" column is associated with two values in "shipment" column. For every value of "lot", number of unique "cargotype" values may/may not be different for each shipment. I want to drop the rows for "lot" which have unequal "cargotype" values for 2 shipments. col4-6 are irrelevant for subsetting and just need to be returned.
Code to recreate df
df = pd.DataFrame({"": [0,1,2,3,4,5,6,7,8,9,10],
"lot": ["dfg", "dfg", "dfg","dfg","ghj","ghj","ghj","abc","abc","abc","abc"],
"shipment": ["a", "b", "a","b","c","d","d","e","f","e","e"],
"cargotype": ["adam", "chris", "bob","tom","chris","hanna","chris","charlie","king","su","min"],
"col4": [ 777, 775, 767,715,772,712,712, 123, 122, 121,120],
"col5": [ 13, 12, 13,12,14,12,12, 15, 16, 17,18],
"col6": [4, 3, 4,3, 5, 8,8, 7,7,0,0]})
df
lot shipment cargotype col4 col5 col6
0 dfg a adam 777 13 4
1 dfg b chris 775 12 3
2 dfg a bob 767 13 4
3 dfg b tom 715 12 3
4 ghj c chris 772 14 5
5 ghj d hanna 712 12 8
6 ghj d chris 712 12 8
7 abc e charlie 123 15 7
8 abc f king 122 16 7
9 abc e su 121 17 0
10 abc e min 120 18 0
To check uniqueness in "cargotype" column, I use
pd.DataFrame((df.groupby(["lot","shipment"])["cargotype"].nunique()))
lot shipment cargotype
abc e 3
f 1
dfg a 2
b 2
ghj c 1
d 2
Answer df should be:
finaldf
lot shipment cargotype col4 col5 col6
0 dfg a adam 777 13 4
1 dfg b chris 775 12 3
2 dfg a bob 767 13 4
3 dfg b tom 715 12 3
Only "dfg" lot remains because unique "cargotype" values for 2 "shipments" are equal to each other. Thank you!
CodePudding user response:
Don't ask me how, but this creates your desired outcome
def squeeze_nan(x):
original_columns = x.index.tolist()
squeezed = x.dropna()
squeezed.index = [original_columns[n] for n in range(squeezed.count())]
return squeezed.reindex(original_columns, fill_value=np.nan)
df1 = df.groupby(["lot","shipment"])["cargotype"].nunique().unstack().apply(squeeze_nan, axis=1).dropna(how='all', axis=1)
lot = df1[df1['a'] == df1['b']].index
print(df[df['lot'].isin(lot)])
Caveat: not sure if this will work when a lot has more than 2 types of shipment values