How to group a dataframe by a column while combining strings and summing floats-CodePudding

I am having trouble grouping my df by ZIP Code, and other answers I've found don't seem to work for my data frame, or I am not good enough to adjust the code for my needs.

I have a table that looks like the following:

    ID      Name       City               ZIP          LAT   LNG     Sum1
0   100     Muffin     Parkwoods          99101        48    117     100
1   101     Cake       Victoria Village   12512        41    74      250
2   102     Donut      Parkwoods          99101        48    117     150
3   103     Milk       Victoria Village   12512        41    74      75
4   104     Cookies    Wharf              44101        41    81      25
5   105     Candy      Wharf              44101        41    81      115

I am hoping to create the following output:

    ZIP     ID          Name              City              LAT   LNG     Sum1
0   99101   100, 102    Muffin, Donut     Parkwoods         48    117     250
1   12512   101, 103    Cake, Milk        Victoria Village  41    74      325
2   44101   104, 105    Cookies Candy     Wharf             48    117     140

I am thinking that I could apply this process to only the columns I need combined and then merge it back with the full data after the fact.

For example, ZIP will give me the city, lat, and lng columns so I do not need to worry about these in my grouping.

The issue I am having is combining the string columns (such as ID and name) together, and then summing the float columns. Any help would be greatly appreciated.

CodePudding user response：

You can use the groupby() function in pandas to group the data by ZIP, and then use the agg() function to aggregate the data for each group. Here is an example of how you might achieve the desired result:

import pandas as pd

# Create the original DataFrame
data = {'ID': [100, 101, 102, 103, 104, 105],
        'Name': ['Muffin', 'Cake', 'Donut', 'Milk', 'Cookies', 'Candy'],
        'City': ['Parkwoods', 'Victoria Village', 'Parkwoods', 'Victoria Village', 'Wharf', 'Wharf'],
        'ZIP': [99101, 12512, 99101, 12512, 44101, 44101],
        'LAT': [48, 41, 48, 41, 41, 41],
        'LNG': [117, 74, 117, 74, 81, 81],
        'Sum1': [100, 250, 150, 75, 25, 115]}
df = pd.DataFrame(data)

# Group the data by ZIP and aggregate the ID, Name, City, LAT, and LNG columns
df = df.groupby('ZIP').agg({'ID': lambda x: ', '.join(map(str, x)),
                           'Name': ', '.join,
                            'City': ', '.join,
                            'LAT': 'first',
                            'LNG': 'first',
                            'Sum1': 'sum'}).reset_index()

# Rearrange the columns as desired
df = df[['ZIP', 'ID', 'Name', 'City', 'LAT', 'LNG', 'Sum1']]

The keys of the dictionary passed as argument to agg() are column names, and the values are the aggregation function to be applied to that column.

In this example:

'ID': ', '.join applies the join function to the 'ID' column, which concatenates all the elements of the 'ID' column with a ', ' separator.
'Name': ', '.join applies the join function to the 'Name' column.
'City': ', '.join applies the join function to the 'City' column.
'LAT': 'first': Selects the first element in each group of 'LAT' column
'LNG': 'first' : Selects the first element in each group of 'LNG' column
'Sum1': 'sum' applies the sum function to the 'Sum1' column, which sums all the values in each group of 'Sum1' column.

Eventually, if you print out the dataframe, you get to the following,

     ZIP        ID            Name                                City  LAT  LNG  Sum1
0  12512  101, 103      Cake, Milk  Victoria Village, Victoria Village   41   74   325
1  44101  104, 105  Cookies, Candy                        Wharf, Wharf   41   81   140
2  99101  100, 102   Muffin, Donut                Parkwoods, Parkwoods   48  117   250

CodePudding user response：

You can use .groupby() with .agg(), predefining a lambda for the desired string operation:

string_lambda = lambda x: ", ".join(map(str, x))
df = df.groupby("ZIP").agg({
    "ID": string_lambda,
    "Name": string_lambda,
    "City": "first",
    "LAT": "first",
    "LNG": "first",
    "Sum1": "sum"
})
print(df)

This outputs:

             ID            Name              City  LAT  LNG  Sum1
ZIP
12512  101, 103      Cake, Milk  Victoria Village   41   74   325
44101  104, 105  Cookies, Candy             Wharf   41   81   140
99101  100, 102   Muffin, Donut         Parkwoods   48  117   250