i have a dataframe which contains a client code, the number of the contract and the products from the contract.
Something like this :
client_code | contract_number | product |
---|---|---|
AAAA | 1000 | Water |
AAAA | 1000 | Soda |
AAAA | 1000 | Food |
BACD | 1001 | Water |
BACD | 1001 | Soda |
DAMR | 1002 | Food |
And I want to add a column which contains a count to see how many products are on a contract from 1 to n . Something like this:
client_code | contract_number | product | count |
---|---|---|---|
AAAA | 1000 | Water | 1 |
AAAA | 1000 | Soda | 2 |
AAAA | 1000 | Food | 3 |
BACD | 1001 | Water | 1 |
BACD | 1001 | Soda | 2 |
DAMR | 1002 | Food | 1 |
I've tried with a for loop but it's too slow ( like an hour ).
df['count']=0
k=1
for i in range(0,len(df)-1):
if(df.at[i,'contract_number']==df.at[i 1,'contract_number']):
df.at[i,'count']=k
k=k 1
else :
df.at[i,'count']=k
k=1
df.at[i 1,'count']=k
Any ideea how to speed up my code ?
PS : My data frame contains 500.000 lines .
Thank you !
CodePudding user response:
IIUC, you want a cumulative count by each client_code
(or probably contract_number
) - you can do that with the cumcount
function:
df.loc[:, 'count'] = df.groupby('client_code').cumcount() 1
Output
client_code contract_number product count
0 AAAA 1000 Water 1
1 AAAA 1000 Soda 2
2 AAAA 1000 Food 3
3 BACD 1001 Water 1
4 BACD 1001 Soda 2
5 DAMR 1002 Food 1