I have a dataframe with this structure
df = pd.DataFrame([
(1000, 'Frank', 'Hugo'),
(100, 'Peter', 'Frank'),
(10000, 'Hugo', 'James'),
(300, 'James', 'Jennifer'),
(250, 'Frank', 'Hugo'),
(550, 'Peter', 'Frank'),
(10000, 'Frank', 'Hugo')
], columns=['Amount', 'Payer', 'Payee'])
And, basing upon the information contained in dataframe (approx 2M rows) I should create a column with the information on how many transactions the payer did previously to that same payee. The final goal should be to put a flag on every new transaction between two users. The output I should get is this :
df = pd.DataFrame([
(1000, 'Frank', 'Hugo', 0),
(100, 'Peter', 'Frank', 0),
(10000, 'Hugo', 'James', 0),
(300, 'James', 'Jennifer', 0),
(250, 'Frank', 'Hugo', 1),
(550, 'Peter', 'Frank', 1),
(10000, 'Frank', 'Hugo', 2)
], columns=['Amount', 'Payer', 'Payee', 'Previous_transactions'])
I really can't figure it out, I just got stuck with
count = df['Payee'].groupby(df['Payer']).value_counts()
but I feel I am far away from the solution, since this only generates a list of a different length.
Thanks in advantage :)
CodePudding user response:
# Group the data by payer and payee and sort the values by the amount
df = df.groupby(['Payer', 'Payee']).sort_values('Amount')
# Shift the values in the 'Amount' column by 1 to get the previous number of transactions
df['Previous_transactions'] = df['Amount'].shift(1)
# Drop the rows with missing values
df.dropna(inplace=True)
# Reset the index
df.reset_index(drop=True, inplace=True)
CodePudding user response:
You can use cumcount
:
df['Previous_transactions']=df.groupby('Payer').cumcount()
'''
| | Amount | Payer | Payee | x | Previous_transactions |
|---:|---------:|:--------|:---------|----:|------------------------:|
| 0 | 1000 | Frank | Hugo | 0 | 0 |
| 1 | 100 | Peter | Frank | 0 | 0 |
| 2 | 10000 | Hugo | James | 0 | 0 |
| 3 | 300 | James | Jennifer | 0 | 0 |
| 4 | 250 | Frank | Hugo | 1 | 1 |
| 5 | 550 | Peter | Frank | 1 | 1 |
| 6 | 10000 | Frank | Hugo | 2 | 2 |
'''