Home > OS >  Append Index to Duplicate Pairs
Append Index to Duplicate Pairs

Time:12-20

I have a list of transactions that contain a header (H) and detail (D) row for each transaction. There is an Invoice Number column that must match for the H and D rows, but sometimes there are duplicates of that Invoice Number if there are several transactions on the same day. What I would like to do is add an index to the end of the Invoice Number for any duplicates. See below for a visual representation of my data frame and what I am trying to accomplish. I have used letters as the index in the example to help see the difference but a number index would work too, just trying to eliminate duplicate naming.

I'm not sure if I have to iterate over each record or if there is something that can be done as a whole.

Original File

ID type invoice number amount
01 H BOB12192022 10.00
01 D BOB12192022 10.00
02 H JOE12182022 30.00
02 D JOE12182022 30.00
03 H BOB12192022 15.00
03 D BOB12192022 15.00
04 H BOB12192022 75.00
04 D BOB12192022 25.00
04 D BOB12192022 25.00
04 D BOB12192022 25.00

Expected Result

ID type invoice number amount
01 H BOB12192022A 10.00
01 D BOB12192022A 10.00
02 H JOE12182022 30.00
02 D JOE12182022 30.00
03 H BOB12192022B 15.00
03 D BOB12192022B 15.00
04 H BOB12192022C 75.00
04 D BOB12192022C 25.00
04 D BOB12192022C 25.00
04 D BOB12192022C 25.00

CodePudding user response:

You should add a column indicating a duplicated. It will be much easier. Otherwise there will be a trouble when tracing back the name for calculation

df['dup'] = df.groupby(['type, invoice_number'])['invoice_number'].cumcount()

You can easily identify any invoice is duplicated if dup value is not 0

CodePudding user response:

You can use groupby.transform() with a bit of arithmetic and integer-to-character conversion:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "type": ["H", "D", "H", "D", "H", "D"],
        "invoice": ["ABC", "ABC", "DEF", "DEF", "ABC", "ABC"],
        "amount": ["10", "10", "20", "20", "10", "10"],
    }
)

df["suffix"] = df.groupby("invoice")["type"].transform(
    lambda x: ["_"   chr(65   int(np.floor(i / 2))) for i in range(len(x))]
)
df["invoice"] = df["invoice"]   df["suffix"]

Result:

  type invoice amount suffix
0    H   ABC_A     10     _A
1    D   ABC_A     10     _A
2    H   DEF_A     20     _A
3    D   DEF_A     20     _A
4    H   ABC_B     10     _B
5    D   ABC_B     10     _B

Note this approach assumes you have 2 entries per transaction.

CodePudding user response:

Update:

df['invoice number'] = (df['invoice number']   
                        (df[df.duplicated(['type', 'invoice number'], keep=False)].groupby('ID').ngroup())
                              .map(dict(enumerate(ascii_uppercase)))).fillna(df['invoice number'])

Output:

   ID type invoice number  amount
0   1    H   BOB12192022A    10.0
1   1    D   BOB12192022A    10.0
2   2    H    JOE12182022    30.0
3   2    D    JOE12182022    30.0
4   3    H   BOB12192022B    15.0
5   3    D   BOB12192022B    15.0
6   4    H   BOB12192022C    75.0
7   4    D   BOB12192022C    25.0
8   4    D   BOB12192022C    25.0
9   4    D   BOB12192022C    25.0

Try this:

from string import ascii_uppercase

df['invoice number']  = np.where(df.groupby(['invoice number'])['invoice number'].transform('count') > 2,
                                 df.groupby(['type', 'invoice number']).cumcount().map(dict(enumerate(ascii_uppercase))), '')

Output:

  type invoice number  amount
0    H   BOB12192022A    10.0
1    D   BOB12192022A    10.0
2    H    JOE12182022    30.0
3    D    JOE12182022    30.0
4    H   BOB12192022B    15.0
5    D   BOB12192022B    15.0
6    H   BOB12192022C    50.0
7    D   BOB12192022C    50.0
  • Related