Home > Software design >  Filter certain column by year
Filter certain column by year

Time:05-27

I have data that I am grouping by and after that I want to filter it by current year.

df['PERIOD'] = pd.to_datetime(df['PERIOD'], format='%Y%m', errors='coerce').dt.strftime('%Y/%m')
groupedResult = df.groupby('PERIOD', dropna=True)['ACTUALS'].sum().reset_index()

I have tried to add this line

groupedResult = groupedResult[groupedResult['PERIOD'].dt.year == datetime.today().year]

but getting

AttributeError: Can only use .dt accessor with datetimelike values

What I am doing wrong?

Example dataset:

import pandas as pd

data = {'PERIOD':['202201','202201','202201','202201','202202','202202','202203'], 'ACTUALS':[10,20,30,40,50,60,70]}
    
df = pd.DataFrame(data)
print("BEFORE:")

CodePudding user response:

What I am doing wrong?

Here

df['PERIOD'] = pd.to_datetime(df['PERIOD'], format='%Y%m', errors='coerce').dt.strftime('%Y/%m')

you are converting PERIOD which is string to datetimelike, which is in turn converted (formatted) into string, thus you

AttributeError: Can only use .dt accessor with datetimelike values

as PERIOD hold string at that moment.

Consider converting PERIOD into datetime-like without further conversion into string. You might find pd.Grouper useful for this task, consider following example which calculate sum of ACTUAL for each Month (freq='M')

import pandas as pd

data = {'PERIOD':['202201','202201','202201','202201','202202','202202','202203'], 'ACTUALS':[10,20,30,40,50,60,70]}
    
df = pd.DataFrame(data)
df['PERIOD'] = pd.to_datetime(df['PERIOD'],format='%Y%m')
df.set_index('PERIOD', inplace=True)
total = df.groupby(pd.Grouper(freq='M')).sum()
print(total)

output

            ACTUALS
PERIOD
2022-01-31      100
2022-02-28      110
2022-03-31       70

CodePudding user response:

This is because you are comparing two different types of data. In your dataframe, the column 'PERIOD' is an object. But you are using a datetime to filter it. You need to convert your 'PERIOD' column to a datetime column after creating the dataframe, not before and don't turn it to a string after. This will work:

import pandas as pd
from datetime import datetime
data = {'PERIOD':['202201','202201','202201','202201','202202','202202','202203'], 'ACTUALS':[10,20,30,40,50,60,70]}
df = pd.DataFrame(data)
df['PERIOD'] = pd.to_datetime(df['PERIOD'], format='%Y%m', errors='coerce')
groupedResult = df.groupby('PERIOD', dropna=True)['ACTUALS'].sum().reset_index()
  • Related