I have a dataset containing Open, High, Low, Close prices at 1-minute intervals over one year, like below:
Open High Low Close
Date_time
2020-01-01 17:00:00 1.12120 1.12121 1.12117 1.12120
2020-01-01 17:01:00 1.12106 1.12135 1.12106 1.12135
2020-01-01 17:02:00 1.12136 1.12139 1.12136 1.12139
2020-01-01 17:03:00 1.12135 1.12135 1.12120 1.12122
2020-01-01 17:04:00 1.12122 1.12125 1.12122 1.12125
There are no NaN or Null values in this dataset, which I checked with:
df.isnull().sum()
output:
Open 0
High 0
Low 0
Close 0
dtype: int64
I want to resample this to 15-minute intervals, using:
dff = df.resample('15min').agg({
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last'})
the output will be:
Open High Low Close
Date_time
2020-01-01 17:00:00 1.12120 1.12139 1.12106 1.12127
2020-01-01 17:15:00 1.12127 1.12154 1.12127 1.12146
2020-01-01 17:30:00 1.12147 1.12156 1.12138 1.12154
2020-01-01 17:45:00 1.12155 1.12166 1.12143 1.12143
2020-01-01 18:00:00 1.12143 1.12218 1.12142 1.12202
which looks OK, but when I run:
dff.isnull().sum()
the output is:
Open 10121
High 10121
Low 10121
Close 10121
dtype: int64
which means that almost half of the dataset is Null. I searched a lot but didn't find anything useful. Is there another way to resample the dataset into 15-minute intervals? And what is causing the Null values in the dataset?
More info: the size of dff is:
dff.count()
output:
Open 24919
High 24919
Low 24919
Close 24919
dtype: int64
Update: I just used
dff = df.groupby(pd.Grouper(freq='15Min')).agg({"Open": "first",
                                                "Low": "min",
                                                "High": "max",
                                                "Close": "last"})
but still half of the dataset is Null.
I attached the dataset CSV file here.
Update #2: thanks to @not_speshal, I found out that the size of df is (372335, 4) but the size of df.resample("1min").last() is (372275, 4), which I think means that I don't have data for some of the 1-minute intervals. How can I fix this?
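For reference, one way to see which 1-minute timestamps are absent (a sketch, assuming df is the 1-minute frame indexed by Date_time as above) is to compare the actual index with a complete 1-minute range:

import pandas as pd

# build the full 1-minute range spanning the data and diff it against the actual index
full_range = pd.date_range(df.index.min(), df.index.max(), freq="1min")
missing = full_range.difference(df.index)
print(len(missing))   # how many 1-minute bars are absent
print(missing[:10])   # the first few missing timestamps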
CodePudding user response:
Your CSV file doesn't have data for many days (probably when the markets are closed?). resample creates a row for every 15-minute interval between the first and last timestamp, so intervals with no underlying data come back as all-NaN rows (24,919 non-null + 10,121 null = 35,040 bins in total). You can instead resample and then keep only the rows whose timestamps exist in the original data, like so:
import pandas as pd

df = pd.read_csv("EURUSD_2020.csv")
# combine the separate Date and time columns into a single datetime index
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["time"])
df = df.drop(["Date", "time"], axis=1).set_index("datetime").astype("float")
# resample to 15-minute bars, then keep only the bins whose start time exists in the original index
dff = df.resample("15min").agg({'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'})
dff = dff[dff.index.isin(df.index)]
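If the goal is simply to discard the empty 15-minute bins, dropping the all-NaN rows should work as well, since an empty bin aggregates to NaN in every column. Note this is not identical to the isin() filter above: it also keeps bins that contain data even when their exact start minute is missing from df.

# alternative sketch: drop only the 15-minute bins that have no underlying 1-minute data at all
dff = df.resample("15min").agg({'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'})
dff = dff.dropna(how="all")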