Groupby with apply

Time:05-02

I can't figure out this error: AttributeError: 'numpy.int64' object has no attribute 'startswith'. It is raised by the line year_indices = [x for x in sum_series if x.startswith("emissions")]

import pandas as pd

URL = "https://www.epa.gov/sites/default/files/2021-03/state_tier1_caps.xlsx"

df = pd.read_excel(URL, sheet_name="State_Trends", header=1)

def minmaxyear(subdf):
    sum_series = subdf.sum()
    year_indices = [x for x in sum_series if x.startswith("emissions")]
    minyear = sum_series[year_indices].astype(float).idxmin()
    maxyear = sum_series[year_indices].astype(float).idxmax()
    return pd.Series({"min year": minyear[-2:], "max year": maxyear[-2:]})

df_years = df[df["Pollutant"]=="CO"].groupby("State").apply(minmaxyear)
print(df_years)

CodePudding user response:

Since x is an integer, you need to convert it to str before calling a string method such as startswith on it:

year_indices = [x for x in sum_series if str(x).startswith("emissions")]

CodePudding user response:

@lytseeker's answer does get rid of the error you were getting, but the filter then matches nothing, so you end up with an empty series. That means the lines computing minyear and maxyear would throw an error.

This is because the comprehension iterates over the values of sum_series, not over its index labels.

So do this instead.

year_indices = [x for x in sum_series.index if x.startswith("emissions")]

After making this change, I was able to get this output:

       min year max year
State                  
AK          07       02
AL          21       90
AR          21       90
AZ          21       90
CA          21       90
CO          16       90
CT          21       90
DC          17       90
DE          21       90
FL          17       01
GA          17       90
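
To see the difference on a small scale, here is a minimal sketch. The Series below is a made-up stand-in for subdf.sum() (its labels and numbers are assumptions, not values from the EPA sheet). It shows that looping over a Series visits the values, while looping over its .index visits the labels:

```python
import pandas as pd

# Hypothetical stand-in for subdf.sum(): labels are column names, values are totals.
sum_series = pd.Series({"tier": 3, "emissions90": 11, "emissions21": 3})

# Iterating the Series itself walks over the values.
values = [x for x in sum_series]

# Iterating .index walks over the labels.
labels = [x for x in sum_series.index]

# Wrapping in str() avoids the AttributeError, but no value starts with "emissions".
from_values = [x for x in sum_series if str(x).startswith("emissions")]

# Filtering the labels picks out the emissions columns.
from_index = [x for x in sum_series.index if x.startswith("emissions")]

print(values)
print(labels)
print(from_values)
print(from_index)
```

The third list comes out empty, which is why the str() workaround alone is not enough.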

CodePudding user response:

You can use a boolean mask built from the index of the Series:

def minmaxyear(subdf):
    sum_series = subdf.sum()
    year = sum_series[sum_series.index.str.startswith("emissions")].astype(float)
    minyear = year.idxmin()
    maxyear = year.idxmax()
    return pd.Series({"min year": minyear[-2:], "max year": maxyear[-2:]})

df_years = df[df["Pollutant"]=="CO"].groupby("State").apply(minmaxyear)
print(df_years.head())

      min year max year
State                  
AK          07       02
AL          21       90
AR          21       90
AZ          21       90
CA          21       90
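
If you want to try the mask approach without downloading the spreadsheet, the sketch below runs it on a tiny made-up frame. The column names only mimic the "emissionsYY" pattern, and the numbers are invented. Selecting the value columns after groupby and passing numeric_only=True are my own additions to keep the example independent of how a given pandas version treats the grouping column:

```python
import pandas as pd

# Hypothetical toy data, not taken from the EPA file.
df = pd.DataFrame({
    "State": ["AK", "AK", "AL", "AL"],
    "tier": [1, 2, 1, 2],
    "emissions90": [5.0, 6.0, 9.0, 8.0],
    "emissions02": [7.0, 8.0, 4.0, 3.0],
    "emissions21": [1.0, 2.0, 2.0, 1.0],
})

def minmaxyear(subdf):
    # Column totals for this state; the index holds the column names.
    sum_series = subdf.sum(numeric_only=True)
    # Boolean mask over the labels, True where the name starts with "emissions".
    mask = sum_series.index.str.startswith("emissions")
    year = sum_series[mask].astype(float)
    # idxmin/idxmax return labels; the last two characters are the year.
    return pd.Series({"min year": year.idxmin()[-2:], "max year": year.idxmax()[-2:]})

value_cols = ["tier", "emissions90", "emissions02", "emissions21"]
result = df.groupby("State")[value_cols].apply(minmaxyear)
print(result)
```

For AK the totals are 11, 15 and 3, so the minimum falls in "21" and the maximum in "02"; for AL they are 17, 7 and 3, giving "21" and "90".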