I have a dataframe that looks like this:
info_version commits commitdates
18558 17.1.3 42 2017-07-14
20783 17.1.3 57 2017-07-14
20782 17.2.2 57 2017-09-27
18557 17.2.2 42 2017-09-27
18556 17.2.3 42 2017-10-30
20781 17.2.3 57 2017-10-30
20780 17.2.4 57 2017-11-27
18555 17.2.4 42 2017-11-27
20779 17.2.5 57 2018-01-10
I have a trivial issue, but somehow I am not able to find the right function. I want a running count of the commits, starting from the first value (42) and continuing to the last row. My desired output is something like this:
info_version commits commitdates Commit_growth
18558 17.1.3 42 2017-07-14 42
20783 17.1.3 57 2017-07-14 109
20782 17.2.2 57 2017-09-27 166
18557 17.2.2 42 2017-09-27 208
18556 17.2.3 42 2017-10-30 250
20781 17.2.3 57 2017-10-30 307
20780 17.2.4 57 2017-11-27 364
18555 17.2.4 42 2017-11-27 406
20779 17.2.5 57 2018-01-10 463
This is what I tried so far:
data2 = data1[['info_version', 'commits', 'commitdates']].sort_values(by='info_version', ascending=True)
sum_row = data2.sum(axis=0)
But this gives me only the overall total for each column. This seems like it should be easy, but I am a bit stuck.
CodePudding user response:
A simple .cumsum() should suffice, because it looks like the df is already sorted by info_version:
data1['Commit_growth'] = data1['commits'].cumsum()
Here is the example code:
import pandas as pd
data1 = pd.DataFrame({
    'info_version': ['17.1.3', '17.1.3', '17.2.2', '17.2.2', '17.2.3', '17.2.3', '17.2.4', '17.2.4', '17.2.5'],
    'commits': [42, 57, 57, 42, 42, 57, 57, 42, 57],
    'commitdates': ['2017-07-14', '2017-07-14', '2017-09-27', '2017-09-27', '2017-10-30', '2017-10-30', '2017-11-27', '2017-11-27', '2018-01-10'],
})
data1['Commit_growth'] = data1['commits'].cumsum()
print(data1)
OUTPUT:
info_version commits commitdates Commit_growth
0 17.1.3 42 2017-07-14 42
1 17.1.3 57 2017-07-14 99
2 17.2.2 57 2017-09-27 156
3 17.2.2 42 2017-09-27 198
4 17.2.3 42 2017-10-30 240
5 17.2.3 57 2017-10-30 297
6 17.2.4 57 2017-11-27 354
7 17.2.4 42 2017-11-27 396
8 17.2.5 57 2018-01-10 453
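To connect this with the attempt in the question: sum() collapses the column to a single total, whereas cumsum() keeps one running total per row, so the last Commit_growth value should equal that total. A minimal sketch, assuming the same commits values as above (the names commits, total and running are only illustrative):

```python
import pandas as pd

# Same commit counts as in the example above
commits = pd.Series([42, 57, 57, 42, 42, 57, 57, 42, 57], name="commits")

total = commits.sum()       # a single number for the whole column
running = commits.cumsum()  # one running total per row, in row order

print(total)
print(running.tolist())
```

The final entry of running matches total, which is a quick way to sanity-check a cumulative column.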
CodePudding user response:
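You can use sort_values with cumsum, but the output is different from yours (42 + 57 is 99, not 109, so every value after the first row of your expected column is 10 too high):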
data1["commitdates"] = pd.to_datetime(data1["commitdates"])  # only if not parsed yet
data2 = (
    data1
    .loc[:, ["info_version", "commits", "commitdates"]]
    .sort_values(by=["info_version", "commitdates"])
    .assign(Commit_growth=lambda x: x["commits"].cumsum())
)
print(data2)
# Output:
info_version commits commitdates Commit_growth
18558 17.1.3 42 2017-07-14 42
20783 17.1.3 57 2017-07-14 99
20782 17.2.2 57 2017-09-27 156
18557 17.2.2 42 2017-09-27 198
18556 17.2.3 42 2017-10-30 240
20781 17.2.3 57 2017-10-30 297
20780 17.2.4 57 2017-11-27 354
18555 17.2.4 42 2017-11-27 396
20779 17.2.5 57 2018-01-10 453
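For anyone who wants to run this without the original data, here is a self-contained sketch of the same sort-then-cumsum approach. It assumes the frame from the question, with the index labels copied from the post; the construction of data1 is my reconstruction, not the asker's actual code.

```python
import pandas as pd

# Rebuild the question's frame; the index labels are assumed from the post
data1 = pd.DataFrame(
    {
        "info_version": ["17.1.3", "17.1.3", "17.2.2", "17.2.2", "17.2.3",
                         "17.2.3", "17.2.4", "17.2.4", "17.2.5"],
        "commits": [42, 57, 57, 42, 42, 57, 57, 42, 57],
        "commitdates": ["2017-07-14", "2017-07-14", "2017-09-27", "2017-09-27",
                        "2017-10-30", "2017-10-30", "2017-11-27", "2017-11-27",
                        "2018-01-10"],
    },
    index=[18558, 20783, 20782, 18557, 18556, 20781, 20780, 18555, 20779],
)
data1["commitdates"] = pd.to_datetime(data1["commitdates"])

# Sort first, then accumulate: cumsum follows the row order it is given
data2 = (
    data1
    .sort_values(by=["info_version", "commitdates"])
    .assign(Commit_growth=lambda x: x["commits"].cumsum())
)
print(data2)
```

The original index labels are carried through the sort, so each running total stays attached to its source row.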