I have a data frame with 70k rows and two columns: Col1 contains a bill-of-materials (BOM) name and a customer, and Col2 contains a part number (which is part of the BOM).
Col1 Col2
0 TUR, Cust1 1001
1 GAR, Cust2 1001
2 FOR, Cust3 1001
3 ERB, Cust1 1002
4 PNR, Cust1 1002
5 DUL, Cust2 1003
6 COC, Cust3 1003
7 ETM, Cust1 1004
8 ROW, Cust3 1005
9 HON, Cust3 1005
When searching for Cust1, I want a third column (Col3) that tells me whether the part number is purchased exclusively for that customer. Something like this:
Col1 Col2 Col3
0 TUR, Cust1 1001 false
1 GAR, Cust2 1001 false
2 FOR, Cust3 1001 false
3 ERB, Cust1 1002 true
4 PNR, Cust1 1002 true
5 DUL, Cust2 1003 false
6 COC, Cust3 1003 false
7 ETM, Cust1 1004 true
8 ROW, Cust3 1005 false
9 HON, Cust3 1005 false
I already tried to extract duplicates with df.duplicated and to evaluate the customer name with str.contains, but without a satisfying result. Is there a smarter solution that I don't know about? I am new to Python and getting nowhere with this problem.
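(For reproducibility, the sample frame above can be constructed like so; column names as in the question:)

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    "Col1": ["TUR, Cust1", "GAR, Cust2", "FOR, Cust3", "ERB, Cust1", "PNR, Cust1",
             "DUL, Cust2", "COC, Cust3", "ETM, Cust1", "ROW, Cust3", "HON, Cust3"],
    "Col2": [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1005, 1005],
})
print(df)
```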
CodePudding user response:
Extract the customer from Col1, then groupby the part number (Col2) and count the number of unique customers, which should be equal to 1:
df['Cust'] = df['Col1'].str.split(', ').str[-1]  # customer name after the comma
df['Col3'] = df.groupby('Col2')['Cust'].transform('nunique').eq(1)  # exactly one customer per part?
If you are only interested in checking the duplicates for one customer at a time, here is a simpler version:
m = df['Col1'].str.split(', ').str[-1] == 'Cust1' # is cust1?
df['Col3'] = m.groupby(df['Col2']).transform('all') # are all cust1 per part?
Result
Col1 Col2 Cust Col3
0 TUR, Cust1 1001 Cust1 False
1 GAR, Cust2 1001 Cust2 False
2 FOR, Cust3 1001 Cust3 False
3 ERB, Cust1 1002 Cust1 True
4 PNR, Cust1 1002 Cust1 True
5 DUL, Cust2 1003 Cust2 False
6 COC, Cust3 1003 Cust3 False
7 ETM, Cust1 1004 Cust1 True
8 ROW, Cust3 1005 Cust3 True
9 HON, Cust3 1005 Cust3 True
(Note that part 1005 is True with the nunique approach because it is bought exclusively by Cust3; the per-customer version gives False for it.)
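Taken together, here is a runnable sketch of the per-customer variant on the question's sample data (the frame construction is added here for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["TUR, Cust1", "GAR, Cust2", "FOR, Cust3", "ERB, Cust1", "PNR, Cust1",
             "DUL, Cust2", "COC, Cust3", "ETM, Cust1", "ROW, Cust3", "HON, Cust3"],
    "Col2": [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1005, 1005],
})

# Is each row's customer Cust1?
m = df["Col1"].str.split(", ").str[-1] == "Cust1"
# A part is exclusive to Cust1 only if every row with that part number is Cust1
df["Col3"] = m.groupby(df["Col2"]).transform("all")
print(df)
```

This matches the expected output in the question, including False for part 1005 (purchased only by Cust3, not Cust1).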
CodePudding user response:
One approach could be as follows:
df['Col3'] = df.Col1.str.extract(r',\s(.*)$')[0]
# or: df['Col1'].str.split(', ', expand=True)[1]
df['Col3'] = df.Col2.map(
    df.drop_duplicates(subset=['Col2', 'Col3'])['Col2'].value_counts().eq(1)
)
print(df)
Col1 Col2 Col3
0 TUR, Cust1 1001 False
1 GAR, Cust2 1001 False
2 FOR, Cust3 1001 False
3 ERB, Cust1 1002 True
4 PNR, Cust1 1002 True
5 DUL, Cust2 1003 False
6 COC, Cust3 1003 False
7 ETM, Cust1 1004 True
8 ROW, Cust3 1005 True
9 HON, Cust3 1005 True
Explanation

- Use `Series.str.extract` to get the customer names from `Col1` and assign them to a new column, `Col3`. (Alternatively, use `Series.str.split`.)
- Next, use `df.drop_duplicates` with the `subset` parameter set to `['Col2', 'Col3']`. E.g. for a duplicate like `1002, Cust1`, we keep only the first occurrence.
- Now, select `Col2` and apply `Series.value_counts`. The result at this stage for `df.drop_duplicates(subset=['Col2', 'Col3'])['Col2'].value_counts()` looks as follows:

1001    3
1003    2
1002    1
1004    1
1005    1
Name: Col2, dtype: int64

- The values from `Col2` with `1` as the count (i.e. `1002`, `1004`, `1005`) occur for only one customer, so we can chain `Series.eq` to get back `True` for these values only, and `False` for the others.
- Finally, pass this result to `Series.map`, applied to `Col2`, to match the correct booleans to each row. Overwrite `Col3` with the result.
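For completeness, a self-contained sketch of this approach on the question's sample data. Note that part 1005 comes out True here: this approach flags parts bought by exactly one customer (in this case Cust3), not specifically by Cust1.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["TUR, Cust1", "GAR, Cust2", "FOR, Cust3", "ERB, Cust1", "PNR, Cust1",
             "DUL, Cust2", "COC, Cust3", "ETM, Cust1", "ROW, Cust3", "HON, Cust3"],
    "Col2": [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1005, 1005],
})

# Extract the customer name after the comma into Col3
df["Col3"] = df["Col1"].str.extract(r",\s(.*)$")[0]
# One row per (part, customer) pair, then count customers per part;
# eq(1) is True for parts that occur with exactly one customer
exclusive = df.drop_duplicates(subset=["Col2", "Col3"])["Col2"].value_counts().eq(1)
# Map the per-part boolean back onto every row
df["Col3"] = df["Col2"].map(exclusive)
print(df)
```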