I will try to be right to the point. So I have a df with 100 columns, and I want to groupby two of them while getting the sum of another 2 columns. For that matter I already used the groupby agg function. The problem is that while doing that I still want to keep the remaining 96 columns and by that I'd like to keep the first occurrence of each value for those 96. For that I was thinking something such as dropping duplicates and keep = 'first'.
I searched for a method that could do that in one action but at this point I'm open to any recommendations you might have.
Note: I didn't include a sample df and a desired output for a reason as I'd like to try it myself based on a recommendation, I'm not trying to get a solution directly from someone.
Thanks in advance,
CodePudding user response:
I think that using two separate operations on the groupby object and join them afterwards is clearer than a one-liner. Here is a minimal example, grouping on 1 column:
df = pd.DataFrame(
[
("bird", "Falconiformes", 389.0, 5.5, 1),
("bird", "Psittaciformes", 24.0, 4.5, 2),
("mammal", "Carnivora", 80.2, 33.3, 1),
("mammal", "Primates", np.nan, 33.7, 2),
("mammal", "Carnivora", 58, 23, 3),
],
index=["falcon", "parrot", "lion", "monkey", "leopard"],
columns=("class", "family", "max_speed", "height", "order"),
)
print(df, "\n")
grouped = df.groupby('class')
df_sum = grouped[['max_speed', 'height']].agg(sum)
df_first = grouped['order'].first()
df_out = pd.concat([df_sum, df_first], axis=1)
print(df_out)
Output:
class family max_speed height order
falcon bird Falconiformes 389.0 5.5 1
parrot bird Psittaciformes 24.0 4.5 2
lion mammal Carnivora 80.2 33.3 1
monkey mammal Primates NaN 33.7 2
leopard mammal Carnivora 58.0 23.0 3
max_speed height order
class
bird 413.0 10.0 1
mammal 138.2 90.0 1