Append Index to Duplicate Pairs-CodePudding

I have a list of transactions that contain a header (H) and detail (D) row for each transaction. There is an Invoice Number column that must match for the H and D rows, but sometimes there are duplicates of that Invoice Number if there are several transactions on the same day. What I would like to do is add an index to the end of the Invoice Number for any duplicates. See below for a visual representation of my data frame and what I am trying to accomplish. I have used letters as the index in the example to help see the difference but a number index would work too, just trying to eliminate duplicate naming.

I'm not sure if I have to iterate over each record or if there is something that can be done as a whole.

Original File

ID	type	invoice number	amount
01	H	BOB12192022	10.00
01	D	BOB12192022	10.00
02	H	JOE12182022	30.00
02	D	JOE12182022	30.00
03	H	BOB12192022	15.00
03	D	BOB12192022	15.00
04	H	BOB12192022	75.00
04	D	BOB12192022	25.00
04	D	BOB12192022	25.00
04	D	BOB12192022	25.00

Expected Result

ID	type	invoice number	amount
01	H	BOB12192022A	10.00
01	D	BOB12192022A	10.00
02	H	JOE12182022	30.00
02	D	JOE12182022	30.00
03	H	BOB12192022B	15.00
03	D	BOB12192022B	15.00
04	H	BOB12192022C	75.00
04	D	BOB12192022C	25.00
04	D	BOB12192022C	25.00
04	D	BOB12192022C	25.00

CodePudding user response：

You should add a column indicating a duplicated. It will be much easier. Otherwise there will be a trouble when tracing back the name for calculation

df['dup'] = df.groupby(['type, invoice_number'])['invoice_number'].cumcount()

You can easily identify any invoice is duplicated if dup value is not 0

CodePudding user response：

You can use groupby.transform() with a bit of arithmetic and integer-to-character conversion:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "type": ["H", "D", "H", "D", "H", "D"],
        "invoice": ["ABC", "ABC", "DEF", "DEF", "ABC", "ABC"],
        "amount": ["10", "10", "20", "20", "10", "10"],
    }
)

df["suffix"] = df.groupby("invoice")["type"].transform(
    lambda x: ["_"   chr(65   int(np.floor(i / 2))) for i in range(len(x))]
)
df["invoice"] = df["invoice"]   df["suffix"]

Result:

  type invoice amount suffix
0    H   ABC_A     10     _A
1    D   ABC_A     10     _A
2    H   DEF_A     20     _A
3    D   DEF_A     20     _A
4    H   ABC_B     10     _B
5    D   ABC_B     10     _B

Note this approach assumes you have 2 entries per transaction.

CodePudding user response：

Update:

df['invoice number'] = (df['invoice number']   
                        (df[df.duplicated(['type', 'invoice number'], keep=False)].groupby('ID').ngroup())
                              .map(dict(enumerate(ascii_uppercase)))).fillna(df['invoice number'])

Output:

   ID type invoice number  amount
0   1    H   BOB12192022A    10.0
1   1    D   BOB12192022A    10.0
2   2    H    JOE12182022    30.0
3   2    D    JOE12182022    30.0
4   3    H   BOB12192022B    15.0
5   3    D   BOB12192022B    15.0
6   4    H   BOB12192022C    75.0
7   4    D   BOB12192022C    25.0
8   4    D   BOB12192022C    25.0
9   4    D   BOB12192022C    25.0

Try this:

from string import ascii_uppercase

df['invoice number']  = np.where(df.groupby(['invoice number'])['invoice number'].transform('count') > 2,
                                 df.groupby(['type', 'invoice number']).cumcount().map(dict(enumerate(ascii_uppercase))), '')

Output:

  type invoice number  amount
0    H   BOB12192022A    10.0
1    D   BOB12192022A    10.0
2    H    JOE12182022    30.0
3    D    JOE12182022    30.0
4    H   BOB12192022B    15.0
5    D   BOB12192022B    15.0
6    H   BOB12192022C    50.0
7    D   BOB12192022C    50.0