I am having trouble grouping my df by ZIP Code, and other answers I've found don't seem to work for my data frame, or I am not good enough to adjust the code for my needs.
I have a table that looks like the following:
ID Name City ZIP LAT LNG Sum1
0 100 Muffin Parkwoods 99101 48 117 100
1 101 Cake Victoria Village 12512 41 74 250
2 102 Donut Parkwoods 99101 48 117 150
3 103 Milk Victoria Village 12512 41 74 75
4 104 Cookies Wharf 44101 41 81 25
5 105 Candy Wharf 44101 41 81 115
I am hoping to create the following output:
ZIP ID Name City LAT LNG Sum1
0 99101 100, 102 Muffin, Donut Parkwoods 48 117 250
1 12512 101, 103 Cake, Milk Victoria Village 41 74 325
2 44101 104, 105 Cookies Candy Wharf 48 117 140
I am thinking that I could apply this process to only the columns I need combined and then merge it back with the full data after the fact.
For example, ZIP will give me the city, lat, and lng columns so I do not need to worry about these in my grouping.
The issue I am having is combining the string columns (such as ID and name) together, and then summing the float columns. Any help would be greatly appreciated.
CodePudding user response:
You can use the groupby()
function in pandas to group the data by ZIP, and then use the agg()
function to aggregate the data for each group. Here is an example of how you might achieve the desired result:
import pandas as pd
# Create the original DataFrame
data = {'ID': [100, 101, 102, 103, 104, 105],
'Name': ['Muffin', 'Cake', 'Donut', 'Milk', 'Cookies', 'Candy'],
'City': ['Parkwoods', 'Victoria Village', 'Parkwoods', 'Victoria Village', 'Wharf', 'Wharf'],
'ZIP': [99101, 12512, 99101, 12512, 44101, 44101],
'LAT': [48, 41, 48, 41, 41, 41],
'LNG': [117, 74, 117, 74, 81, 81],
'Sum1': [100, 250, 150, 75, 25, 115]}
df = pd.DataFrame(data)
# Group the data by ZIP and aggregate the ID, Name, City, LAT, and LNG columns
df = df.groupby('ZIP').agg({'ID': lambda x: ', '.join(map(str, x)),
'Name': ', '.join,
'City': ', '.join,
'LAT': 'first',
'LNG': 'first',
'Sum1': 'sum'}).reset_index()
# Rearrange the columns as desired
df = df[['ZIP', 'ID', 'Name', 'City', 'LAT', 'LNG', 'Sum1']]
The keys of the dictionary passed as argument to agg()
are column names, and the values are the aggregation function to be applied to that column.
In this example:
- 'ID': ', '.join applies the join function to the 'ID' column, which concatenates all the elements of the 'ID' column with a ', ' separator.
- 'Name': ', '.join applies the join function to the 'Name' column.
- 'City': ', '.join applies the join function to the 'City' column.
- 'LAT': 'first': Selects the first element in each group of 'LAT' column
- 'LNG': 'first' : Selects the first element in each group of 'LNG' column
- 'Sum1': 'sum' applies the sum function to the 'Sum1' column, which sums all the values in each group of 'Sum1' column.
Eventually, if you print out the dataframe, you get to the following,
ZIP ID Name City LAT LNG Sum1
0 12512 101, 103 Cake, Milk Victoria Village, Victoria Village 41 74 325
1 44101 104, 105 Cookies, Candy Wharf, Wharf 41 81 140
2 99101 100, 102 Muffin, Donut Parkwoods, Parkwoods 48 117 250
CodePudding user response:
You can use .groupby()
with .agg()
, predefining a lambda for the desired string operation:
string_lambda = lambda x: ", ".join(map(str, x))
df = df.groupby("ZIP").agg({
"ID": string_lambda,
"Name": string_lambda,
"City": "first",
"LAT": "first",
"LNG": "first",
"Sum1": "sum"
})
print(df)
This outputs:
ID Name City LAT LNG Sum1
ZIP
12512 101, 103 Cake, Milk Victoria Village 41 74 325
44101 104, 105 Cookies, Candy Wharf 41 81 140
99101 100, 102 Muffin, Donut Parkwoods 48 117 250