I have a DF like so:
timestamp open high low close volume trade_count vwap symbol
0 2021-11-02 13:30:00 00:00 12.000 12.055 12.00 12.0500 47208 62 12.003693 AMCR
1 2021-11-02 13:31:00 00:00 12.040 12.070 12.04 12.0500 20448 117 12.055675 AMCR
2 2021-11-02 13:32:00 00:00 12.050 12.050 12.03 12.0343 6755 64 12.040067 AMCR
3 2021-11-02 13:33:00 00:00 12.035 12.035 12.01 12.0300 9186 95 12.021537 AMCR
4 2021-11-02 13:34:00 00:00 12.035 12.035 12.02 12.0200 1258 16 12.028696 AMCR
... ... ... ... ... ... ... ... ... ...
429000 2022-10-21 19:59:00 00:00 29.850 29.940 29.85 29.9200 35847 298 29.894083 YETI
429001 2022-10-21 20:00:00 00:00 29.890 29.890 29.89 29.8900 88226 15 29.890000 YETI
429002 2022-10-21 20:08:00 00:00 29.890 29.890 29.89 29.8900 4266 1 29.890000 YETI
429003 2022-10-21 22:06:00 00:00 30.060 30.060 30.06 30.0600 1200 3 30.060000 YETI
429004 2022-10-21 22:44:00 00:00 30.060 30.060 30.06 30.0600 800 1 30.060000 YETI
[429005 rows x 9 columns]
Setting timestamp column to datetime:
df['timestamp'] = pd.to_datetime(df['timestamp'])
Group the minute bars by individual days
seperate_days = df.groupby(['symbol', df['timestamp'].dt.date])
I then create an empty DF:
new_df = pd.DataFrame()
And look through the different days, manipulating the data and appending to the empty df:
for name, ohlc in seperate_days:
ohlc.index = ohlc['timestamp']
# ohlc['cum_volume'] = ohlc.cumsum('volume')
ohlc['cum_volume'] = ohlc["volume"].cumsum()
ohlc['cum_volume_is_>_25k'] = np.where(ohlc['cum_volume'] > 25000, True, False)
# return True omly when cum_volume is above 25k in pre market
ohlc['cum_volume_is_>_25k'][~ohlc['cum_volume_is_>_25k'].index.isin(open_hours_index)] = False
# mask for premarket
open_hours_indices = ohlc.index.indexer_between_time('04:00', '09:29')
open_hours_index = ohlc.index[open_hours_indices]
new_df = new_df.append(ohlc)
Which gives:
timestamp open high low close ... trade_count vwap symbol cum_volume cum_volume_is_>_25k
timestamp ...
2021-11-02 13:30:00 00:00 2021-11-02 13:30:00 00:00 12.000 12.055 12.00 12.0500 ... 62 12.003693 AMCR 47208 False
2021-11-02 13:31:00 00:00 2021-11-02 13:31:00 00:00 12.040 12.070 12.04 12.0500 ... 117 12.055675 AMCR 67656 False
2021-11-02 13:32:00 00:00 2021-11-02 13:32:00 00:00 12.050 12.050 12.03 12.0343 ... 64 12.040067 AMCR 74411 False
2021-11-02 13:33:00 00:00 2021-11-02 13:33:00 00:00 12.035 12.035 12.01 12.0300 ... 95 12.021537 AMCR 83597 False
2021-11-02 13:34:00 00:00 2021-11-02 13:34:00 00:00 12.035 12.035 12.02 12.0200 ... 16 12.028696 AMCR 84855 False
... ... ... ... ... ... ... ... ... ... ... ...
2022-10-21 19:59:00 00:00 2022-10-21 19:59:00 00:00 29.850 29.940 29.85 29.9200 ... 298 29.894083 YETI 688656 False
2022-10-21 20:00:00 00:00 2022-10-21 20:00:00 00:00 29.890 29.890 29.89 29.8900 ... 15 29.890000 YETI 776882 False
2022-10-21 20:08:00 00:00 2022-10-21 20:08:00 00:00 29.890 29.890 29.89 29.8900 ... 1 29.890000 YETI 781148 False
2022-10-21 22:06:00 00:00 2022-10-21 22:06:00 00:00 30.060 30.060 30.06 30.0600 ... 3 30.060000 YETI 782348 False
2022-10-21 22:44:00 00:00 2022-10-21 22:44:00 00:00 30.060 30.060 30.06 30.0600 ... 1 30.060000 YETI 783148 False
[429005 rows x 11 columns]
When I save it to my sql db:
new_df.to_sql("ohlc_minutes_filtered", conn, if_exists='replace')
Sadly I get the following error:
Traceback (most recent call last):
File "/home/dan/Documents/code/wolfhound/filter_minutes.py", line 40, in <module>
new_df.to_sql("ohlc_minutes_filtered", conn, if_exists='replace')
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 2951, in to_sql
return sql.to_sql(
File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 698, in to_sql
return pandas_sql.to_sql(
File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 2192, in to_sql
table.create()
File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 843, in create
self._execute_create()
File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 1876, in _execute_create
conn.execute(stmt)
sqlite3.OperationalError: duplicate column name: timestamp
How can I fix this?
CodePudding user response:
The name of new_df index is timestamp
.
print(new_df.index.name)
And also there is a column name timestamp
. Update the name of the index or update the column timestamp
to a different value before running to_sql
.
new_df.index.name = 'index' # for example