Filtering large data set by year


I'm working with a very large dataset that I need to filter by year. I read the text file as a CSV:

df1 = pd.read_csv(filename,
                  sep="\t",
                  error_bad_lines=False,
                  usecols=['ID', 'Date', 'Value1', 'Value2'])

And convert the Date column to a date:

df1['Date'] = pd.to_datetime(df1['Date'], errors='coerce')

I also convert all nulls to zeroes:

df2=df1.fillna(0)

At this point, my 'Date' column has dtype "object", and the dates are formatted like this:

2018-02-09 00:00:00

However, I'm not sure how to filter by year. When I try this code:

df3 = df2[df2['Date'].dt.year == 2018]

I get this error:

AttributeError: Can only use .dt accessor with datetimelike values

I think some dates may have been read in as null values, but I'm not sure that's the case, and I'm not sure how to convert them to dates (a zero date is fine).
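One way to check that hypothesis is to look at the column's dtype and null count before and after the `fillna(0)` step. A minimal sketch with made-up data (not the actual file) showing why the `.dt` accessor stops working:

```python
import pandas as pd

# Hypothetical data: one good date, one unparseable value.
df = pd.DataFrame({'Date': ['2018-02-09', 'not a date']})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

print(df['Date'].dtype)          # datetime64[ns] -- .dt works here
print(df['Date'].isna().sum())   # rows that failed to parse became NaT

# Filling NaT with the integer 0 mixes Timestamps and ints,
# so pandas falls back to an object-dtype column and .dt breaks.
df = df.fillna(0)
print(df['Date'].dtype)          # object
```

This suggests the `AttributeError` comes from the `fillna(0)` step, not from the filtering code itself.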

Is my code to filter the data set correct? How can I get around this attribute error?

Thanks!

CodePudding user response:

You could also tell `read_csv` to parse `Date` while reading. As @ALollz mentioned, you have some NaN values in `Date`, and replacing them with 0 changes the column's dtype to object. If you just want to filter by year, the code below should work. To filter by year/month use `'%Y-%m'`, and for year/month/day use `'%Y-%m-%d'`.

df1 = pd.read_csv(filename,
                  sep="\t",
                  error_bad_lines=False,
                  usecols=['ID', 'Date', 'Value1', 'Value2'],
                  parse_dates=['Date'])

df_filtered = df1[df1['Date'].dt.strftime('%Y') == '2018']
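Alternatively, you can keep `Date` as a datetime column and apply `fillna(0)` only to the value columns; then `.dt.year` works directly and unparseable dates (NaT) are simply excluded by the comparison. A sketch using made-up rows in place of your file:

```python
import pandas as pd

# Hypothetical data standing in for the tab-separated file.
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Date': ['2018-02-09', '2017-05-01', 'bad date'],
    'Value1': [10.0, None, 3.0],
    'Value2': [None, 2.0, 4.0],
})
df1['Date'] = pd.to_datetime(df1['Date'], errors='coerce')

# Fill zeros only where zeros make sense; 'Date' stays datetime64.
df1[['Value1', 'Value2']] = df1[['Value1', 'Value2']].fillna(0)

# NaT rows give NaN for dt.year, so `== 2018` is False for them
# and they drop out of the result.
df3 = df1[df1['Date'].dt.year == 2018]
print(df3)  # only the 2018-02-09 row remains
```

This avoids the object-dtype problem entirely, at the cost of leaving failed dates as NaT rather than a "zero date".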