sqlite3.OperationalError: duplicate column name: timestamp-CodePudding

I have a DF like so:

                        timestamp    open    high    low    close  volume  trade_count       vwap symbol
0       2021-11-02 13:30:00 00:00  12.000  12.055  12.00  12.0500   47208           62  12.003693   AMCR
1       2021-11-02 13:31:00 00:00  12.040  12.070  12.04  12.0500   20448          117  12.055675   AMCR
2       2021-11-02 13:32:00 00:00  12.050  12.050  12.03  12.0343    6755           64  12.040067   AMCR
3       2021-11-02 13:33:00 00:00  12.035  12.035  12.01  12.0300    9186           95  12.021537   AMCR
4       2021-11-02 13:34:00 00:00  12.035  12.035  12.02  12.0200    1258           16  12.028696   AMCR
...                           ...     ...     ...    ...      ...     ...          ...        ...    ...
429000  2022-10-21 19:59:00 00:00  29.850  29.940  29.85  29.9200   35847          298  29.894083   YETI
429001  2022-10-21 20:00:00 00:00  29.890  29.890  29.89  29.8900   88226           15  29.890000   YETI
429002  2022-10-21 20:08:00 00:00  29.890  29.890  29.89  29.8900    4266            1  29.890000   YETI
429003  2022-10-21 22:06:00 00:00  30.060  30.060  30.06  30.0600    1200            3  30.060000   YETI
429004  2022-10-21 22:44:00 00:00  30.060  30.060  30.06  30.0600     800            1  30.060000   YETI

[429005 rows x 9 columns]

Setting timestamp column to datetime:

df['timestamp'] = pd.to_datetime(df['timestamp'])

Group the minute bars by individual days

seperate_days = df.groupby(['symbol', df['timestamp'].dt.date])

I then create an empty DF:

    new_df = pd.DataFrame()

And look through the different days, manipulating the data and appending to the empty df:

    for name, ohlc in seperate_days:
        ohlc.index = ohlc['timestamp']
    
        # ohlc['cum_volume'] = ohlc.cumsum('volume')
        ohlc['cum_volume'] = ohlc["volume"].cumsum()
        ohlc['cum_volume_is_>_25k'] = np.where(ohlc['cum_volume'] > 25000, True, False)
        # return True omly when cum_volume is above 25k in pre market 
        ohlc['cum_volume_is_>_25k'][~ohlc['cum_volume_is_>_25k'].index.isin(open_hours_index)] = False
    
        # mask for premarket
        open_hours_indices = ohlc.index.indexer_between_time('04:00', '09:29')
        open_hours_index = ohlc.index[open_hours_indices]
    
        new_df = new_df.append(ohlc)

Which gives:

                                          timestamp    open    high    low    close  ...  trade_count       vwap  symbol cum_volume  cum_volume_is_>_25k
timestamp                                                                            ...                                                                
2021-11-02 13:30:00 00:00 2021-11-02 13:30:00 00:00  12.000  12.055  12.00  12.0500  ...           62  12.003693    AMCR      47208                False
2021-11-02 13:31:00 00:00 2021-11-02 13:31:00 00:00  12.040  12.070  12.04  12.0500  ...          117  12.055675    AMCR      67656                False
2021-11-02 13:32:00 00:00 2021-11-02 13:32:00 00:00  12.050  12.050  12.03  12.0343  ...           64  12.040067    AMCR      74411                False
2021-11-02 13:33:00 00:00 2021-11-02 13:33:00 00:00  12.035  12.035  12.01  12.0300  ...           95  12.021537    AMCR      83597                False
2021-11-02 13:34:00 00:00 2021-11-02 13:34:00 00:00  12.035  12.035  12.02  12.0200  ...           16  12.028696    AMCR      84855                False
...                                             ...     ...     ...    ...      ...  ...          ...        ...     ...        ...                  ...
2022-10-21 19:59:00 00:00 2022-10-21 19:59:00 00:00  29.850  29.940  29.85  29.9200  ...          298  29.894083    YETI     688656                False
2022-10-21 20:00:00 00:00 2022-10-21 20:00:00 00:00  29.890  29.890  29.89  29.8900  ...           15  29.890000    YETI     776882                False
2022-10-21 20:08:00 00:00 2022-10-21 20:08:00 00:00  29.890  29.890  29.89  29.8900  ...            1  29.890000    YETI     781148                False
2022-10-21 22:06:00 00:00 2022-10-21 22:06:00 00:00  30.060  30.060  30.06  30.0600  ...            3  30.060000    YETI     782348                False
2022-10-21 22:44:00 00:00 2022-10-21 22:44:00 00:00  30.060  30.060  30.06  30.0600  ...            1  30.060000    YETI     783148                False

[429005 rows x 11 columns]

When I save it to my sql db:

new_df.to_sql("ohlc_minutes_filtered", conn, if_exists='replace')

Sadly I get the following error:

Traceback (most recent call last):
  File "/home/dan/Documents/code/wolfhound/filter_minutes.py", line 40, in <module>
    new_df.to_sql("ohlc_minutes_filtered", conn, if_exists='replace')
  File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 2951, in to_sql
    return sql.to_sql(
  File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 698, in to_sql
    return pandas_sql.to_sql(
  File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 2192, in to_sql
    table.create()
  File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 843, in create
    self._execute_create()
  File "/home/dan/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 1876, in _execute_create
    conn.execute(stmt)
sqlite3.OperationalError: duplicate column name: timestamp

How can I fix this?

CodePudding user response：

The name of new_df index is timestamp.

print(new_df.index.name)

And also there is a column name timestamp. Update the name of the index or update the column timestamp to a different value before running to_sql.

new_df.index.name = 'index'  # for example