I have a data frame with 70k rows and two columns: Col1 contains a bill-of-materials (BOM) name and a customer, and Col2 contains a part number (which is part of the BOM).
Col1 Col2
0 TUR, Cust1 1001
1 GAR, Cust2 1001
2 FOR, Cust3 1001
3 ERB, Cust1 1002
4 PNR, Cust1 1002
5 DUL, Cust2 1003
6 COC, Cust3 1003
7 ETM, Cust1 1004
8 ROW, Cust3 1005
9 HON, Cust3 1005
When searching for Cust1, I want a third column (Col3) that tells me whether the part number is purchased exclusively for that customer. Something like this:
Col1 Col2 Col3
0 TUR, Cust1 1001 false
1 GAR, Cust2 1001 false
2 FOR, Cust3 1001 false
3 ERB, Cust1 1002 true
4 PNR, Cust1 1002 true
5 DUL, Cust2 1003 false
6 COC, Cust3 1003 false
7 ETM, Cust1 1004 true
8 ROW, Cust3 1005 false
9 HON, Cust3 1005 false
I already tried to extract duplicates with df.duplicated and to evaluate the customer name with str.contains, but without a satisfying result. Is there a smarter solution that I don't know about? I am new to Python and getting nowhere with this problem.
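(For reproducibility, the sample frame above can be constructed like so; column names as in the question:)

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    "Col1": ["TUR, Cust1", "GAR, Cust2", "FOR, Cust3", "ERB, Cust1", "PNR, Cust1",
             "DUL, Cust2", "COC, Cust3", "ETM, Cust1", "ROW, Cust3", "HON, Cust3"],
    "Col2": [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1005, 1005],
})
print(df)
```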
CodePudding user response:
Extract the customer from Col1, then groupby the part number (Col2) and count the number of unique customers, which should be equal to 1:
df['Cust'] = df['Col1'].str.split(', ').str[-1]  # customer name after the comma
df['Col3'] = df.groupby('Col2')['Cust'].transform('nunique').eq(1)  # exactly one customer per part?
If you are only interested in checking the duplicates for one customer at a time, here is a simpler version:
m = df['Col1'].str.split(', ').str[-1] == 'Cust1' # is cust1?
df['Col3'] = m.groupby(df['Col2']).transform('all') # are all cust1 per part?
Result
Col1 Col2 Cust Col3
0 TUR, Cust1 1001 Cust1 False
1 GAR, Cust2 1001 Cust2 False
2 FOR, Cust3 1001 Cust3 False
3 ERB, Cust1 1002 Cust1 True
4 PNR, Cust1 1002 Cust1 True
5 DUL, Cust2 1003 Cust2 False
6 COC, Cust3 1003 Cust3 False
7 ETM, Cust1 1004 Cust1 True
8 ROW, Cust3 1005 Cust3 True
9 HON, Cust3 1005 Cust3 True
(Note that part 1005 is True with the nunique approach because it is bought exclusively by Cust3; the per-customer version gives False for it.)
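Taken together, here is a runnable sketch of the per-customer variant on the question's sample data (the frame construction is added here for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["TUR, Cust1", "GAR, Cust2", "FOR, Cust3", "ERB, Cust1", "PNR, Cust1",
             "DUL, Cust2", "COC, Cust3", "ETM, Cust1", "ROW, Cust3", "HON, Cust3"],
    "Col2": [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1005, 1005],
})

# Is each row's customer Cust1?
m = df["Col1"].str.split(", ").str[-1] == "Cust1"
# A part is exclusive to Cust1 only if every row with that part number is Cust1
df["Col3"] = m.groupby(df["Col2"]).transform("all")
print(df)
```

This matches the expected output in the question, including False for part 1005 (purchased only by Cust3, not Cust1).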
CodePudding user response:
One approach could be as follows:
df['Col3'] = df.Col1.str.extract(r',\s(.*)$')[0]
# or: df['Col1'].str.split(', ', expand=True)[1]
df['Col3'] = df.Col2.map(
    df.drop_duplicates(subset=['Col2', 'Col3'])['Col2'].value_counts().eq(1)
)
print(df)
Col1 Col2 Col3
0 TUR, Cust1 1001 False
1 GAR, Cust2 1001 False
2 FOR, Cust3 1001 False
3 ERB, Cust1 1002 True
4 PNR, Cust1 1002 True
5 DUL, Cust2 1003 False
6 COC, Cust3 1003 False
7 ETM, Cust1 1004 True
8 ROW, Cust3 1005 True
9 HON, Cust3 1005 True
Explanation

- Use `Series.str.extract` to get the customer names from `Col1` and assign them to a new column, `Col3`. (Alternatively, use `Series.str.split`.)
- Next, use `df.drop_duplicates` with the `subset` parameter set to `['Col2', 'Col3']`. E.g. for a duplicate like `1002, Cust1`, we keep only the first occurrence.
- Now, select `Col2` and apply `Series.value_counts`. The result at this stage for `df.drop_duplicates(subset=['Col2', 'Col3'])['Col2'].value_counts()` looks as follows:

1001    3
1003    2
1002    1
1004    1
1005    1
Name: Col2, dtype: int64

- The values from `Col2` with `1` as the count (i.e. `1002`, `1004`, `1005`) occur for only one customer, so we can chain `Series.eq` to get back `True` for these values only, and `False` for the others.
- Finally, pass this result to `Series.map`, applied to `Col2`, to match the correct booleans to each row. Overwrite `Col3` with the result.
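For completeness, a self-contained sketch of this approach on the question's sample data. Note that part 1005 comes out True here: this approach flags parts bought by exactly one customer (in this case Cust3), not specifically by Cust1.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["TUR, Cust1", "GAR, Cust2", "FOR, Cust3", "ERB, Cust1", "PNR, Cust1",
             "DUL, Cust2", "COC, Cust3", "ETM, Cust1", "ROW, Cust3", "HON, Cust3"],
    "Col2": [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1005, 1005],
})

# Extract the customer name after the comma into Col3
df["Col3"] = df["Col1"].str.extract(r",\s(.*)$")[0]
# One row per (part, customer) pair, then count customers per part;
# eq(1) is True for parts that occur with exactly one customer
exclusive = df.drop_duplicates(subset=["Col2", "Col3"])["Col2"].value_counts().eq(1)
# Map the per-part boolean back onto every row
df["Col3"] = df["Col2"].map(exclusive)
print(df)
```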