I have a dataframe that looks like this:
api_spec_id commitdates commits Year-Month API Age info_version
84 2014-12-15 110 2014-12 110 6.0.1
84 2014-11-06 33 2014-11 33 6.0.2
84 2014-10-15 110 2014-10 110 6.0.3
84 2014-12-02 110 2014-12 110 6.0.5
84 2014-11-19 33 2014-11 33 7.0.2
api_spec_id is the id of each API in the dataframe. The same API can have different versions under the same id, because the version changes with every commit date.
For api_spec_id = 84, I want to count how many versions there are in total; in this example there are 5.
My desired output is:
api_spec_id commitdates commits Year-Month API Age info_version Total_versions
84 2014-12-15 110 2014-12 110 6.0.1 5
84 2014-11-06 33 2014-11 33 6.0.2 5
84 2014-10-15 110 2014-10 110 6.0.3 5
84 2014-12-02 110 2014-12 110 6.0.5 5
84 2014-11-19 33 2014-11 33 7.0.2 5
I tried value_counts(), sum(), and a few other solutions from similar questions here on Stack Overflow, but none of them gave me the numbers I am after. What would be the best way to go about this? Any guidance would be really helpful.
Answer:
You can use DataFrame.groupby together with transform('nunique') for this:
df['Total_versions'] = df.groupby('api_spec_id').info_version.transform('nunique')
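It counts the number of unique values in the column 'info_version' for each 'api_spec_id'. Because transform is used instead of a plain aggregation, the count is broadcast back to every row of the group, so it can be assigned directly as a new column.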
Output:
api_spec_id commitdates commits Year-Month API_Age info_version Total_versions
0 84 2014-12-15 110 2014-12 110 6.0.1 5
1 84 2014-11-06 33 2014-11 33 6.0.2 5
2 84 2014-10-15 110 2014-10 110 6.0.3 5
3 84 2014-12-02 110 2014-12 110 6.0.5 5
4 84 2014-11-19 33 2014-11 33 7.0.2 5
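For anyone who wants to reproduce this, here is a minimal self-contained sketch. The rows for api_spec_id 84 are copied from the question (only the columns needed for the count are kept); the two rows for api_spec_id 85 are made up for illustration, to show that the count is computed per id and that a version repeated across commits is counted once.

```python
import pandas as pd

# Rows for api_spec_id 84 come from the question.
# Rows for api_spec_id 85 are hypothetical, added only to show per-group behaviour.
df = pd.DataFrame({
    "api_spec_id": [84, 84, 84, 84, 84, 85, 85],
    "commitdates": ["2014-12-15", "2014-11-06", "2014-10-15",
                    "2014-12-02", "2014-11-19",
                    "2015-01-10", "2015-02-03"],
    "info_version": ["6.0.1", "6.0.2", "6.0.3", "6.0.5", "7.0.2",
                     "1.0.0", "1.0.0"],
})

# nunique counts distinct versions per api_spec_id;
# transform returns a result aligned with the original rows.
df["Total_versions"] = (
    df.groupby("api_spec_id")["info_version"].transform("nunique")
)

print(df)
```

If you only need one row per id rather than a new column, df.groupby("api_spec_id")["info_version"].nunique() gives the same counts as a Series indexed by api_spec_id.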