pandas dataframe groupby columns and aggregate on custom function

Time:09-13

I am trying to group a dataframe by certain columns and then, for each group, pass the values of one column as a list to a custom function or lambda and get back a single aggregated result.

Here's a df:

orgid       appid   p   type    version
-------------------------------------------------
24e78b      4ef36d  1   None    3.3.7
24e78b      4ef36d  2   None    3.4.1
24e78b      4ef36d  1   None    3.3.7-beta-1
24e78b      4ef36d  1   None    3.4.0-mvn.1
24e78b      4ef36d  2   None    3.4.0-beta.5
24e78b      4ef36d  1   None    3.4.0-beta.1
24e78b      4ef36d  1   None    3.4.0
24e78b      4ef36d  1   None    3.3.5

I have an expression that takes a list of versions and returns the maximum version as a string:

>> versions = ['3.4.0-mvn.1', '3.4.0-beta.1', '3.4.0', '3.3.7-beta-1', '3.3.7', '3.3.5', '3.4.0-beta-1']
>> str(max(map(semver.VersionInfo.parse, versions)))
'3.4.0'
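
For reuse in an aggregation, the expression above could be wrapped in a small helper. This is only a sketch: the name max_version is made up for illustration, and it assumes the semver package is installed and that every entry is a valid semantic version string.

```python
import semver


def max_version(versions):
    """Return the highest semantic version in `versions` as a string.

    `max_version` is a hypothetical helper name, not part of semver.
    Any iterable of version strings works, including a pandas Series.
    """
    return str(max(semver.VersionInfo.parse(v) for v in versions))


versions = ['3.4.0-mvn.1', '3.4.0-beta.1', '3.4.0',
            '3.3.7-beta-1', '3.3.7', '3.3.5', '3.4.0-beta-1']
print(max_version(versions))  # 3.4.0
```

A release such as 3.4.0 sorts above its own pre-releases (3.4.0-beta.1, 3.4.0-mvn.1), which is why plain string comparison is not enough here.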

Now I want to group the dataframe, pass each group's version series to this function as a list, and get back a single version string per group.

I tried:

>> g = df.groupby(['orgid', 'appid', 'p', 'type'])
>> g['version'].apply(lambda x: str(max(map(semver.VersionInfo.parse, x.tolist()))))
Series([], Name: version, dtype: float64)

I get an empty series.

Expected output:

orgid       appid   p   type    version
24e78b      4ef36d  1   None    3.4.0
24e78b      4ef36d  2   None    3.4.1

I also looked at the post "Pandas group by multiple custom aggregate function on multiple columns", but couldn't get it right.

CodePudding user response:

Try:

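import semver

# Parse each version string into a comparable VersionInfo object
# (this overwrites the original string column).
df["version"] = df["version"].apply(semver.VersionInfo.parse)
# Take the per-group maximum; as_index=False keeps the keys as columns.
out = df.groupby(["orgid", "appid", "p", "type"], as_index=False).max()

print(out)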

Prints:

    orgid   appid  p  type version
0  24e78b  4ef36d  1  None   3.4.0
1  24e78b  4ef36d  2  None   3.4.1

CodePudding user response:

import semver

# Parse inside the aggregation, so df["version"] itself stays unchanged.
out = (df.groupby(['orgid', 'appid', 'p', 'type'], as_index=False)['version']
         .agg(lambda x: max(semver.VersionInfo.parse(v) for v in x)))
print(out)

# Output:

    orgid   appid  p  type version
0  24e78b  4ef36d  1  None   3.4.0
1  24e78b  4ef36d  2  None   3.4.1

CodePudding user response:

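This happens because the type column, which is one of your groupby keys, contains None values. By default, groupby drops every row whose key has a missing value (dropna=True), so here all groups are discarded and the result is empty.

Try: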

df = df.fillna('None')

Run this before calling df.groupby(...) and it should work.
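
To see the effect of the None keys in isolation, here is a minimal sketch that rebuilds the question's dataframe by hand (assuming type holds real None values rather than the string "None") and counts rows per group. Besides fillna, it also shows dropna=False, which is an alternative not mentioned above and needs pandas 1.1 or later.

```python
import pandas as pd

# Hand-built copy of the question's data; "type" holds real None values.
df = pd.DataFrame({
    "orgid": ["24e78b"] * 8,
    "appid": ["4ef36d"] * 8,
    "p": [1, 2, 1, 1, 2, 1, 1, 1],
    "type": [None] * 8,
    "version": ["3.3.7", "3.4.1", "3.3.7-beta-1", "3.4.0-mvn.1",
                "3.4.0-beta.5", "3.4.0-beta.1", "3.4.0", "3.3.5"],
})
keys = ["orgid", "appid", "p", "type"]

# Default (dropna=True): rows with a missing key are dropped, so no groups remain.
dropped = df.groupby(keys).size()
print(len(dropped))  # 0

# Option 1: replace the missing keys with a placeholder string first.
filled = df.fillna("None").groupby(keys).size()
print(filled.tolist())  # [6, 2]

# Option 2: keep missing keys as their own group.
kept = df.groupby(keys, dropna=False).size()
print(kept.tolist())  # [6, 2]
```

Either option can then be combined with the version aggregation from the other answers in place of size().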
