I am pretty new to Pandas and am having trouble with Pandas/time series analysis and Daylight Saving Time.
I have a 1-minute frequency txt file with NY Daylight Saving Time data.
When I use pytz to localize and convert to UTC and then downsample to 2Hr, 4H, all data and times match for rows during DST, but do not match for rows during Standard Time (mid-November to mid-March). What I need is for it all to match.
So what I (believe) I need is to somehow normalize the Standard Time rows. As it is, when downsampled to, say, 2H, it is obvious that the bin times change from even hours (midnight, 2am, 4am, 6am, etc.) to odd hours (1pm, 3pm, 5pm, etc.). If that can get addressed, then converting to UTC and/or other time zones and resampling (all of which I think I have somewhat figured out) should work.
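To illustrate the even/odd hour shift with a minimal sketch (the dates and dummy values below are hypothetical, not from my file):
import pandas as pd

# Hypothetical 1-minute indexes, one in winter (EST) and one in summer (EDT)
winter = pd.date_range("2021-01-04", periods=240, freq="1min", tz="America/New_York")
summer = pd.date_range("2021-07-05", periods=240, freq="1min", tz="America/New_York")
for idx in (winter, summer):
    s = pd.Series(1, index=idx).tz_convert("UTC")
    # 2H bins are anchored on even UTC hours; show the bin edges back in NY local time
    bins = s.resample("2H").count().index.tz_convert("America/New_York")
    print(bins[:3])  # odd local hours in January, even local hours in July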
The closest I have been able to get so far is to:
Take the 1-min data, localize to US/Eastern, convert to UTC. When resampled to 2Hr, this matches (all even hours) but is incorrect for dates that fall during Standard Time.
I then tried:
Take the 1-min data, localize to America/New_York, convert to UTC.
This matches the entire year in resampled 2H and is CORRECT for Standard Time, but not for DST.
The code is below, and you'll note that I have commented out several of the blocks of code. That is because I have tried many different combinations of things to try to work this out myself.
Should I be using something other than pytz? or? Thanks for the help!
import pandas as pd
import datetime
import pytz
from pytz import all_timezones

colnames = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
df = pd.read_csv("/Users/aiki/Desktop/GC_1min_full.txt", sep=',', names=colnames, header=None, index_col='Date', parse_dates=True)
#Make this naive NY data TZ aware: America/New_York handles DST, US/Eastern does not.
df = df.tz_localize('America/New_York')
#Convert the NY TZ aware data to UTC
df = df.tz_convert(tz= 'UTC')
#Make this naive NY data TZ aware
#df = df.tz_localize('US/Eastern')
#Make this UTC NY localized (again)
#df = df.tz_convert('US/Eastern')
#Convert this data to central time
#df = df.tz_convert(tz='America/Chicago')
Adding more info here after guidance from other posters:
My data source says: "All data is in US Eastern Timezone (ie EST/EDT depending on the time of year)".
If I read in the original data, df.index is datetime but has no TZ info; just to check, I run print(df.index.tz) and get None. This means I have TZ-naive data that the provider says is in EST/EDT local time.
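For reference, that check in code (nothing assumed beyond the df loaded above):
print(df.index.dtype)  # e.g. datetime64[ns]
print(df.index.tz)     # None for tz-naive data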
Since my original post I've learned that:
1 - EST is UTC -5 hours. This is pytz (US/Eastern).
2 - But pytz (America/New_York) is, essentially, EST in the winter and EDT in the summer. And so, importantly, America/New_York deals with DST.
(I believe this is right, please correct if wrong)
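In case it helps anyone checking the same thing, a quick way to see which UTC offset pytz applies for a given zone on a given date (the sample dates are arbitrary, one winter and one summer):
import pandas as pd

for tz in ("US/Eastern", "America/New_York"):
    jan = pd.Timestamp("2021-01-15 12:00").tz_localize(tz)
    jul = pd.Timestamp("2021-07-15 12:00").tz_localize(tz)
    print(tz, jan.utcoffset(), jul.utcoffset())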
After (many) more tries, what I now know is:
EST Convert - Localize to America/New_York, Convert to UTC, Do Resample, Convert to America/Chicago.
With validation material set to Chicago tz, this works for EST, but not DST.
#Make this naive NY data TZ aware using America/New_York which handles DST.
df = df.tz_localize(tz= 'America/New_York')
df = df.tz_convert(tz= 'UTC')
(do resample code, etc)
df = df.tz_convert(tz= 'America/Chicago')
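For reference, a hypothetical version of the "(do resample code, etc)" step above for these OHLCV columns, done while the index is still in UTC; the aggregation rules are my assumption, not something from the original post:
# Hypothetical 2H downsample of the 1-minute bars on the UTC index
ohlcv_2h = df.resample("2H").agg(
    {"Open": "first", "High": "max", "Low": "min",
     "Close": "last", "Volume": "sum"}
).dropna(how="all")  # drop empty bins while the market is closed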
DST Convert - Localize to US/Eastern, Convert to US/Central, Do Resample.
With validation material set to Chicago tz, this works for DST but not EST.
df = df.tz_localize('US/Eastern')
df = df.tz_convert(tz= 'US/Central')
(do resample code, etc)
This does not solve the question of how to get the whole year in one shot, but I can live with a two-part solution that loses some data if I have to. Not ideal, but time is finite...
Thank you all for your good ideas- I do appreciate it very much. If you have any other thoughts, I am all ears.
Addressing Ultra909's comments below:
1- Yes, done per my method(s) above, my resampled DST and my (separate) EST data match my charting platform, and also match another public facing charting system.
2- Data is 1-min data, and since it's market related, it has no data (or timestamps) during the actual (missing/ambiguous time) 0200 DST/EST switch. So for me, it's hard to tell how it's being handled (see the tz_localize sketch after the data excerpt below).
DST in 2021 starts Sunday 2021-03-14 @ 0200. 2021-03-12 is a Friday (NY); 2021-03-14 is a Sunday (NY).
This from the data:
2021-03-12 16:59:00,65.57,65.60,65.55,65.56,28
#(Market reopens Sunday @ 1800, so we have timestamps.)
2021-03-14 18:00:00,65.56,65.75,65.54,65.68,238
#(So...the actual DST change is not visible here...)
DST in 2021 ends Sunday 2021-11-07. 2021-11-05 is a Friday (NY); 2021-11-07 is a Sunday (NY).
2021-11-05 16:59:00,81.15,81.22,81.13,81.17,96
2021-11-07 18:00:00,81.13,81.65,81.05,81.60,974
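For completeness, if rows ever did land inside the skipped or repeated 02:00 hour, tz_localize has arguments to say how to treat them. A sketch (the specific choices here are my assumptions, not something my data appears to need):
df = df.tz_localize(
    'America/New_York',
    ambiguous='infer',           # repeated hour at the fall-back switch
    nonexistent='shift_forward'  # missing hour at the spring-forward switch
)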
It could be me (so very likely). Or it could be the data. Either way, I think I've gone as far as I can. Should anyone know of data providers where this isn't an issue, I'd be grateful to know about it.
Also, if I continue on this path, I'll need to split out Dec 1 - Feb 28 for every year I have. I'll lose the 2 months involved in the DST/EST switch, but it will have to do. So if anyone has thoughts on how to do that programmatically in Pandas en masse, that would be great.
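One possible way to do that split across every year at once, as a sketch (the mask/variable names are mine):
# Separate the Dec-Feb window (months 12, 1, 2) from the rest, across all years
winter_mask = df.index.month.isin([12, 1, 2])
df_winter = df[winter_mask]   # the rows to set aside
df_rest = df[~winter_mask]    # everything else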
Thank you again for all your input!
CodePudding user response:
So if I understand correctly, the data is in Eastern Standard Time (GMT-5) without any daylight savings?
Then the way I would solve it is to add 5:00:00 to the index across the board and then localise as UTC.
ix = df.index + pd.Timedelta(hours=5)
df_utc = df.set_index(ix).tz_localize("UTC")
You can then tz_convert(..) further, if desired.
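For example (a sketch continuing from df_utc above; the target zone is just an illustration):
# View the now-UTC data in another zone, e.g. Chicago
df_chicago = df_utc.tz_convert("America/Chicago")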