I have a dataframe that contains corn production values (maiz_a) across the U.S. The data is organized by latitude and longitude (y and x), but also by state and county. Many of the latitude/longitude points fall within the same county, which produces many duplicate entries for the same state and county, but with different values. Here is an example:
x y maiz_a maiz_pct name_adm1 name_adm2
62748 -89.875000 41.625000 49642.6 0.000155 Illinois Whiteside
29499 -98.041667 40.791667 49621.1 0.000309 Nebraska Hamilton
28101 -98.208333 40.791667 48866.9 0.000462 Nebraska Hamilton
24910 -98.041667 40.958333 48741.3 0.000614 Nebraska Hamilton
28615 -97.958333 40.958333 48538.8 0.000765 Nebraska Hamilton
24697 -97.875000 40.791667 48536.3 0.000916 Nebraska Hamilton
64127 -89.791667 41.625000 48448.2 0.001067 Illinois Whiteside
Notice how many rows have the name_adm1/name_adm2 combination "Nebraska" and "Hamilton". I would like to drop the duplicates, summing maiz_a and maiz_pct across all of said duplicates into one line. Additionally, the newly formatted dataframe should only contain the first instance of x and y, and obviously should retain name_adm1 and name_adm2.
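For reference, here is a minimal sketch that rebuilds the sample dataframe shown above (values and index copied straight from the example table):
import pandas as pd

df = pd.DataFrame(
    {
        'x': [-89.875000, -98.041667, -98.208333, -98.041667, -97.958333, -97.875000, -89.791667],
        'y': [41.625000, 40.791667, 40.791667, 40.958333, 40.958333, 40.791667, 41.625000],
        'maiz_a': [49642.6, 49621.1, 48866.9, 48741.3, 48538.8, 48536.3, 48448.2],
        'maiz_pct': [0.000155, 0.000309, 0.000462, 0.000614, 0.000765, 0.000916, 0.001067],
        'name_adm1': ['Illinois', 'Nebraska', 'Nebraska', 'Nebraska', 'Nebraska', 'Nebraska', 'Illinois'],
        'name_adm2': ['Whiteside', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Whiteside'],
    },
    index=[62748, 29499, 28101, 24910, 28615, 24697, 64127],
)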
CodePudding user response:
Use groupby and agg. This allows you to specify how you want to aggregate each column:
df.groupby(['name_adm1', 'name_adm2'], as_index=False).agg(
{'x': 'first', 'y': 'first', 'maiz_a': 'sum', 'maiz_pct': 'sum'}
)
name_adm1 name_adm2 x y maiz_a maiz_pct
0 Illinois Whiteside -89.875000 41.625000 98090.8 0.001222
1 Nebraska Hamilton -98.041667 40.791667 244304.4 0.003066
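If you're on pandas 0.25 or newer, the same aggregation can also be written with named aggregation, which is a bit more explicit about the output columns (the same call, just a different spelling):
df.groupby(['name_adm1', 'name_adm2'], as_index=False).agg(
    x=('x', 'first'),          # keep the first coordinate in each group
    y=('y', 'first'),
    maiz_a=('maiz_a', 'sum'),  # sum production across the group
    maiz_pct=('maiz_pct', 'sum'),
)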
CodePudding user response:
Try using groupby and sum:
compacted_df = df.groupby(['name_adm1', 'name_adm2']).sum().reset_index()
Output:
>>> compacted_df
name_adm1 name_adm2 x y maiz_a maiz_pct
0 Illinois Whiteside -179.666667 83.250000 98090.8 0.001222
1 Nebraska Hamilton -490.125000 204.291667 244304.4 0.003066
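Note that sum also adds up x and y (e.g. -89.875000 + -89.791667 = -179.666667 for Whiteside), which doesn't match the requirement of keeping the first coordinates. If you only need the production totals, restrict the sum to those columns (a sketch in the same style as above):
compacted_df = df.groupby(['name_adm1', 'name_adm2'])[['maiz_a', 'maiz_pct']].sum().reset_index()
Otherwise, use the agg approach from the first answer to keep the first x and y per group.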