Count value column iteratively for rows within column

I have a dataframe that looks like this:

      info_version  commits commitdates
18558       17.1.3       42  2017-07-14
20783       17.1.3       57  2017-07-14
20782       17.2.2       57  2017-09-27
18557       17.2.2       42  2017-09-27
18556       17.2.3       42  2017-10-30
20781       17.2.3       57  2017-10-30
20780       17.2.4       57  2017-11-27
18555       17.2.4       42  2017-11-27
20779       17.2.5       57  2018-01-10

I have a trivial issue, but somehow I can't find the right function. I want a running count of the commits, starting from the first value (42) and accumulating down to the last row. My desired output is something like this:

      info_version  commits commitdates  Commit_growth
18558       17.1.3       42  2017-07-14             42
20783       17.1.3       57  2017-07-14            109
20782       17.2.2       57  2017-09-27            166
18557       17.2.2       42  2017-09-27            208
18556       17.2.3       42  2017-10-30            250
20781       17.2.3       57  2017-10-30            307
20780       17.2.4       57  2017-11-27            364
18555       17.2.4       42  2017-11-27            406
20779       17.2.5       57  2018-01-10            463

This is what I tried so far:

data2 = data1[['info_version', 'commits', 'commitdates']].sort_values(by='info_version', ascending=True)
sum_row = data2.sum(axis=0)

But this only gives me the total for the whole column, not a running count. This seems like it should be easy, but I am a bit stuck.

CodePudding user response：

A simple .cumsum() should suffice,
because it looks like the df is already sorted by info_version:

data1['Commit_growth'] = data1['commits'].cumsum()

Here is the example code:

import pandas as pd

data1 = pd.DataFrame({
    'info_version': ['17.1.3', '17.1.3', '17.2.2', '17.2.2', '17.2.3', '17.2.3', '17.2.4', '17.2.4', '17.2.5'],
    'commits': [42, 57, 57, 42, 42, 57, 57, 42, 57],
    'commitdates': ['2017-07-14', '2017-07-14', '2017-09-27', '2017-09-27', '2017-10-30', '2017-10-30', '2017-11-27', '2017-11-27', '2018-01-10'],
})

data1['Commit_growth'] = data1['commits'].cumsum()
print(data1)

OUTPUT:

  info_version  commits commitdates  Commit_growth
0       17.1.3       42  2017-07-14             42
1       17.1.3       57  2017-07-14             99
2       17.2.2       57  2017-09-27            156
3       17.2.2       42  2017-09-27            198
4       17.2.3       42  2017-10-30            240
5       17.2.3       57  2017-10-30            297
6       17.2.4       57  2017-11-27            354
7       17.2.4       42  2017-11-27            396
8       17.2.5       57  2018-01-10            453
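
For context on why the attempt in the question returned only a total: sum() collapses the column to a single number, while cumsum() keeps one running value per row. A minimal sketch using the commits values from the question (the names total and running are only for illustration):

```python
import pandas as pd

# Commit counts copied from the question's dataframe
commits = pd.Series([42, 57, 57, 42, 42, 57, 57, 42, 57], name="commits")

total = commits.sum()       # a single scalar for the whole column
running = commits.cumsum()  # one running total per row, same length as the input

print(total)
print(running.tolist())

# The last running value equals the overall total
assert running.iloc[-1] == total
```

So the total from sum() is just the last entry of the cumulative column.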

CodePudding user response：

You can use sort_values with cumsum, but the output differs from the one in your question (your second row shows 109, whereas 42 + 57 = 99):

# Parse the dates first (only needed if they are not datetimes yet)
data1["commitdates"] = pd.to_datetime(data1["commitdates"])

data2 = (
    data1
    .loc[:, ["info_version", "commits", "commitdates"]]
    .sort_values(by=["info_version", "commitdates"])
    .assign(Commit_growth=lambda x: x["commits"].cumsum())
)

print(data2)

# Output:

          info_version  commits commitdates  Commit_growth
    18558       17.1.3       42  2017-07-14             42
    20783       17.1.3       57  2017-07-14             99
    20782       17.2.2       57  2017-09-27            156
    18557       17.2.2       42  2017-09-27            198
    18556       17.2.3       42  2017-10-30            240
    20781       17.2.3       57  2017-10-30            297
    20780       17.2.4       57  2017-11-27            354
    18555       17.2.4       42  2017-11-27            396
    20779       17.2.5       57  2018-01-10            453
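
To see where the two outputs part ways, the expected column from the question can be compared with the computed cumsum. This is only a quick check on the numbers as posted (desired is typed in by hand from the question, and the variable names are just for illustration):

```python
import pandas as pd

# Values copied from the question: the commits column and the desired Commit_growth
commits = pd.Series([42, 57, 57, 42, 42, 57, 57, 42, 57])
desired = pd.Series([42, 109, 166, 208, 250, 307, 364, 406, 463])

# Row-by-row difference between the desired column and the real running total
diff = desired - commits.cumsum()
print(diff.tolist())
```

The difference is 0 for the first row and a constant 10 for every row after it, which suggests the 109 in the second row is an addition slip that carries through the rest of the expected column.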