I have a list of transactions that contain a header (H) and detail (D) row for each transaction. There is an Invoice Number column that must match for the H and D rows, but sometimes there are duplicates of that Invoice Number if there are several transactions on the same day. What I would like to do is add an index to the end of the Invoice Number for any duplicates. See below for a visual representation of my data frame and what I am trying to accomplish. I have used letters as the index in the example to help see the difference but a number index would work too, just trying to eliminate duplicate naming.
I'm not sure if I have to iterate over each record or if there is something that can be done as a whole.
Original File
ID | type | invoice number | amount |
---|---|---|---|
01 | H | BOB12192022 | 10.00 |
01 | D | BOB12192022 | 10.00 |
02 | H | JOE12182022 | 30.00 |
02 | D | JOE12182022 | 30.00 |
03 | H | BOB12192022 | 15.00 |
03 | D | BOB12192022 | 15.00 |
04 | H | BOB12192022 | 75.00 |
04 | D | BOB12192022 | 25.00 |
04 | D | BOB12192022 | 25.00 |
04 | D | BOB12192022 | 25.00 |
Expected Result
ID | type | invoice number | amount |
---|---|---|---|
01 | H | BOB12192022A | 10.00 |
01 | D | BOB12192022A | 10.00 |
02 | H | JOE12182022 | 30.00 |
02 | D | JOE12182022 | 30.00 |
03 | H | BOB12192022B | 15.00 |
03 | D | BOB12192022B | 15.00 |
04 | H | BOB12192022C | 75.00 |
04 | D | BOB12192022C | 25.00 |
04 | D | BOB12192022C | 25.00 |
04 | D | BOB12192022C | 25.00 |
CodePudding user response:
You should add a column indicating a duplicated. It will be much easier. Otherwise there will be a trouble when tracing back the name for calculation
df['dup'] = df.groupby(['type, invoice_number'])['invoice_number'].cumcount()
You can easily identify any invoice
is duplicated if dup
value is not 0
CodePudding user response:
You can use groupby.transform()
with a bit of arithmetic and integer-to-character conversion:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"type": ["H", "D", "H", "D", "H", "D"],
"invoice": ["ABC", "ABC", "DEF", "DEF", "ABC", "ABC"],
"amount": ["10", "10", "20", "20", "10", "10"],
}
)
df["suffix"] = df.groupby("invoice")["type"].transform(
lambda x: ["_" chr(65 int(np.floor(i / 2))) for i in range(len(x))]
)
df["invoice"] = df["invoice"] df["suffix"]
Result:
type invoice amount suffix
0 H ABC_A 10 _A
1 D ABC_A 10 _A
2 H DEF_A 20 _A
3 D DEF_A 20 _A
4 H ABC_B 10 _B
5 D ABC_B 10 _B
Note this approach assumes you have 2 entries per transaction.
CodePudding user response:
Update:
df['invoice number'] = (df['invoice number']
(df[df.duplicated(['type', 'invoice number'], keep=False)].groupby('ID').ngroup())
.map(dict(enumerate(ascii_uppercase)))).fillna(df['invoice number'])
Output:
ID type invoice number amount
0 1 H BOB12192022A 10.0
1 1 D BOB12192022A 10.0
2 2 H JOE12182022 30.0
3 2 D JOE12182022 30.0
4 3 H BOB12192022B 15.0
5 3 D BOB12192022B 15.0
6 4 H BOB12192022C 75.0
7 4 D BOB12192022C 25.0
8 4 D BOB12192022C 25.0
9 4 D BOB12192022C 25.0
Try this:
from string import ascii_uppercase
df['invoice number'] = np.where(df.groupby(['invoice number'])['invoice number'].transform('count') > 2,
df.groupby(['type', 'invoice number']).cumcount().map(dict(enumerate(ascii_uppercase))), '')
Output:
type invoice number amount
0 H BOB12192022A 10.0
1 D BOB12192022A 10.0
2 H JOE12182022 30.0
3 D JOE12182022 30.0
4 H BOB12192022B 15.0
5 D BOB12192022B 15.0
6 H BOB12192022C 50.0
7 D BOB12192022C 50.0