pandas isn't recognising my datetime column

I exported a Postgres table as a tab-separated CSV, like so:

\copy (select * from mytable) to 'labels.csv' csv DELIMITER E'\t' header

The head of the file looks like this:

user_id  session_id   start_time           mode
  2       715      2016-04-01 01:07:49+01   car
  2       716      2016-04-01 03:09:53+01   car
  2      1082      2016-04-02 13:05:16+01   car
  2      1090      2016-04-02 15:16:32+01   car

I read this into pandas and tried to remove the timezone info like this:

import pandas as pd
df = pd.read_csv('labels.csv', sep='\t', parse_dates=['start_time'])
df['start_time'] = df['start_time'].dt.tz_localize(None)

But this gives the error:

AttributeError: Can only use .dt accessor with datetimelike values

df.head() gives:

user_id  session_id     start_time              mode
0    2         715  2016-04-01 01:07:49+01:00     car
1    2         716  2016-04-01 03:09:53+01:00     car
2    2        1082  2016-04-02 13:05:16+01:00     car
3    2        1090  2016-04-02 15:16:32+01:00     car
4    2        1601  2016-04-04 13:56:13+01:00     foot

However,

df.info()
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   user_id              5374 non-null   int64 
 1   session_id           5374 non-null   int64 
 2   start_time           5374 non-null   object
 3   transportation_mode  5374 non-null   object
dtypes: int64(3), object(2)

CodePudding user response：

See the docs for pd.read_csv:

parse_dates : bool or list of int or names or list of lists or dict, default False

...

If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. To parse an index or column with a mixture of timezones, specify date_parser to be a partially-applied pd.to_datetime with utc=True. See Parsing a CSV with mixed timezones for more.
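The mixed-timezone case the docs mention can be sketched roughly as follows. This is only an illustration: the sample rows and their offsets are made up, and it assumes the real file contains more than one UTC offset, which the question does not confirm.

```python
import io
import pandas as pd

# Hypothetical sample shaped like the question's file, with two different UTC offsets.
raw = (
    "user_id\tsession_id\tstart_time\tmode\n"
    "2\t715\t2016-03-26 01:07:49+00:00\tcar\n"
    "2\t716\t2016-04-01 03:09:53+01:00\tcar\n"
)

# Read the column as plain strings first, then parse it explicitly.
df = pd.read_csv(io.StringIO(raw), sep="\t")

# utc=True converts every offset to UTC, so the column ends up with one tz-aware dtype.
df["start_time"] = pd.to_datetime(df["start_time"], utc=True)

# Drop the timezone; the remaining naive values are UTC wall-clock times.
df["start_time"] = df["start_time"].dt.tz_localize(None)

print(df.dtypes)
```

If local wall-clock times are wanted instead of UTC, calling .dt.tz_convert with the relevant zone name before .dt.tz_localize(None) should do that.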

You likely have an unparseable date in your data. Try converting the column with pandas.to_datetime after reading the file. By default it raises an error on a value it cannot parse, which should point you to the bad value:

df["start_time"] = pd.to_datetime(df["start_time"])
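If the error message alone does not make the offending rows obvious, one possible way to list them is to parse with errors="coerce" and compare against the original strings. This is a sketch on a made-up Series; the "--" entry is a hypothetical stand-in for whatever bad value the real file contains.

```python
import pandas as pd

# Hypothetical column: "--" stands in for an unparseable entry.
s = pd.Series(["2016-04-01 01:07:49", "--", "2016-04-02 13:05:16"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(s, errors="coerce")

# Entries that were present originally but became NaT are the ones that failed to parse.
bad = s[parsed.isna() & s.notna()]
print(bad)
```

The index of bad gives the row positions to inspect in the original file.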

Once you identify the issue, you can then handle the value in your code. Something like:

# explicitly handle known invalid values
df["start_time"] = df["start_time"].replace({"--": pd.NaT})
df["start_time"] = pd.to_datetime(df["start_time"])