I have a data frame like this. Invoices are transformed into a table where each row has a single product for each customer. This table consists of millions of rows -
customer | invoice | products |
---|---|---|
a | INVC001 | xx |
a | INVC001 | yy |
a | INVC002 | zz |
a | INVC003 | aa |
b | INVC004 | xx |
b | INVC004 | aa |
b | INVC004 | bb |
b | INVC005 | cc |
Now I want to create a column where I want to rank each customers invoices. The table I want
customer | invoice | products | rank |
---|---|---|---|
a | INVC001 | xx | 1 |
a | INVC001 | yy | 1 |
a | INVC002 | zz | 2 |
a | INVC003 | aa | 3 |
b | INVC004 | xx | 1 |
b | INVC004 | aa | 1 |
b | INVC004 | bb | 1 |
b | INVC005 | cc | 2 |
I tried to use cumcount, with grouping customer and invoices with this code df.groupby(['customer','invoice'])['invoice'].cumcount() 1
, also tried nggroup()
but couldnot generate my expected output.
Is there any numpy/pandas way to create a table like this? Or, is there any optimized way to do this. Will appreciate the help.
CodePudding user response:
Use factorize
in GroupBy.transform
:
df['rank'] = df.groupby('customer')['invoice'].transform(lambda x: pd.factorize(x)[0]) 1
For me rank
failed:
df['rank'] = df.groupby('customer')['invoice'].rank(method='dense')
DataError: No numeric types to aggregate
Solution should be like mentioned @pavel:
df['rank'] = df.groupby('customer')['invoice'].rank(method='dense', numeric_only=False)