Is pandas.read_spss misreading datetime into unix?-CodePudding

I have a sav file with a datetime column in %m/%d/%Y string format. When I read it in with pd.read_spss(), which doesn't seem to have any datetime-related arguments, it ends up in what looks like unix time, except that the time would be a few centuries from now with unique values including 13778726400, 13841884800, etc.

When I feed the read column into pd.to_datetime, however, it's not interpreted as the date I would expect, but rather a few seconds after the original unix date in 1970:

pd.to_datetime(df.col)

0      1970-01-01 00:00:13.778726400
1      1970-01-01 00:00:13.841798400
2      1970-01-01 00:00:13.778726400
3      1970-01-01 00:00:13.778035200
4      1970-01-01 00:00:13.841798400

Why is the datetime column getting read in in this weird format, and why is pd.to_datetime not able to convert it back?

(My current workaround is just to set date columns to string manually in SPSS. Then pyreadstat/pandas.read_spss has no trouble reading it as a string and pandas.to_dateime can transform.)

CodePudding user response：

Dates, times and datetimes are always stored in SPSS as a number and then you add a format for displaying. SPSS continuously adds new formats while removes others. New formats have to be added manually to the pyreadstat code, while old formats stay in the code for backward compatibility. So the problem is you have found a new Date/datetime/time format that is not registered in pyreadstat.

Another workaround would have been to open the file in SPSS and store it as a date/datetime/time, but with a different format pyreadstat would recognise, for example DATE11, DATETIME20 etc (the current list that pyreadstat accepts is [https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_parser.pyx#L52-L54])

The best when this is found is to submit a github issue describing the new format found for it to be added. I just added a few I found in the most recent SPSS documentation, and hopefully your problem should be solved in the next release (already available on dev). If not, please submit an issue with a reproducible example.

The numbers SPSS uses to store the dates are not unix time, but either the number of seconds (in the case of datetimes or time) or days (in the case of dates) since 1582-10-14 (the start of the Gregorian Calendar. So you would need something like this to calculate it manually:

from datetime import datetime, date, timedelta
origin = date(1582, 10, 14)
myspssvalue = 13778726400
delta = timedelta(seconds=myspssvalue) # or days=myspssvalue if date
python_date = origin   delta
print(python_date)
#datetime.date(2019, 6, 1)

Also, if you would assume this number is number of seconds since 1970:

>>> datetime.fromtimestamp(13778726400)
datetime.datetime(2406, 8, 19, 2, 0)

What pandas is doing is thinking that the number you are giving is the number of nanoseconds since 1970-01-01 (it transforms it to datetime64[ns]) and that is why you get a date very close to 1970