I have the two following dataframes.
df1:
code name region
0 AFG Afghanistan Middle East
1 NLD Netherlands Western Europe
2 AUT Austria Western Europe
3 IRQ Iraq Middle East
4 USA United States North America
5 CAD Canada North America
df2:
code year gdp per capita
0 AFG 2010 547.35
1 NLD 2010 44851.27
2 AUT 2010 3577.10
3 IRQ 2010 4052.06
4 USA 2010 52760.00
5 CAD 2010 41155.32
6 AFG 2015 578.47
7 NLD 2015 45175.23
8 AUT 2015 3952.80
9 IRQ 2015 4688.32
10 USA 2015 56863.37
11 CAD 2015 43635.10
I want to return the code
, year
, gdp per capita
, and average
(gdp per capita per region per year) for 2015 for countries with gdp above average for their region (should be NLD, IRQ, USA).
The result should look something like this:
code year gdp per capita average
3 NLD 2015 45175.23 24564.015
7 IRQ 2015 4688.32 2633.395
9 USA 2015 56863.37 50249.235
I wanted to try this in Python because I recently completed an introductory course to SQL and was amazed at the simplicity of the solution in SQL. While I managed to make it work in Python, it seems overly complicated to me. Is there any way to achieve the same result with less code or without the need for .groupby and helper columns? Please see my solution below.
data = pd.merge(df1, df2, how="inner", on="code")
grouper = data.groupby(["region", "year"])["gdp per capita"].mean().reset_index()
for i in range(len(data)):
average = (grouper.loc[(grouper["year"] == data.loc[i, "year"]) & (grouper["region"] == data.loc[i, "region"]), "gdp per capita"]).to_list()[0]
data.loc[i, "average"] = average
result = data.loc[(data["year"] == 2015) & (data["gdp per capita"] > data["average"]), ["code", "year", "gdp per capita", "average"]]
print(result)
CodePudding user response:
Loops are basically never the right answer when it comes to pandas.
# This is your join and where clause.
df = df1.merge(df2, on='code')[lambda x: x.year.eq(2015)]
# This is your aggregate function.
df['average'] = df.groupby(['region'])['gdp per capita'].transform('mean')
# This is your select and having clause.
out = df[df['gdp per capita'].gt(df['average'])][['code', 'year', 'gdp per capita', 'average']]
print(out)
Output:
code year gdp per capita average
3 NLD 2015 45175.23 24564.015
7 IRQ 2015 4688.32 2633.395
9 USA 2015 56863.37 50249.235