Pandas changing value of columns before saving to mysql

Time:12-29

I've got a mysql table defined as follows:

CREATE TABLE series_info (
    series_id INT NOT NULL PRIMARY KEY,
    s_area VARCHAR(40) NOT NULL,
    s_code VARCHAR(10) NOT NULL,
    dataset_version VARCHAR(20) NOT NULL
);

In my Python module, I'm trying to save a Pandas DataFrame into the table with the following code:

# ...
# insert 3 records in df_series_info (see output)
# ...

conn = sqlalchemy.create_engine(
    f"mysql+mysqlconnector://{DATABASER_USER}:{DATABASER_PASSWORD}@{DATABASE_HOST}/{DATABASE}")

print(df_series_info.dtypes)
print(df_series_info.head())
df_series_info.to_sql(name='series_info', con=conn, if_exists='append', index=False)

But I get the following output:

series_id           int64
s_area             object
s_code             object
dataset_version    object
dtype: object

        series_id    s_area  s_code     dataset_version
0  7179385449897108  AFRG    170202     2022-12-29-SYN-01
0  6756567457160340  AFRG    170203     2022-12-29-SYN-01
0  7692362580057236  AFRG    160305     2022-12-29-SYN-01

Traceback (most recent call last):
File "D:\Projects\AIGreenFleet\dev\aigreenfleet-series-data-generator\.venv\lib\site-packages\mysql\connector\connection_cext.py", line 565, in cmd_query
    self._cmysql.query(
_mysql_connector.MySQLInterfaceError: Duplicate entry '2147483647' for key 'PRIMARY'
.
.
.
The above exception was the direct cause of the following exception:
.
.
.
sqlalchemy.exc.IntegrityError: (mysql.connector.errors.IntegrityError) 1062 (23000): Duplicate entry '2147483647' for key 'PRIMARY'
[SQL: INSERT INTO series_info (series_id, s_area, s_code, dataset_version) VALUES (%(series_id)s, %(s_area)s, %(s_code)s, %(dataset_version)s)]
[parameters: ({'series_id': '2952743774805140', 's_area': 'AFRG', 's_code': 170202, 'dataset_version': '2022-12-29-SYN-01'}, {'series_id': '9080550407298196', 's_area': 'AFRG', 's_code': 170203, 'dataset_version': '2022-12-29-SYN-01'}, {'series_id': '2972479528656020', 's_area': 'AFRG', 's_code': 160305, 'dataset_version': '2022-12-29-SYN-01'})]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

Now, as you can see from the output of print(df_series_info.head()), the value in the series_id field is definitely not 2147483647.

If I remove the PRIMARY KEY clause from the table creation statement, the code works (but puts '2147483647' as the value of series_id in every row).

It seems that pandas or the MySQL connector is converting the value of the series_id column before saving it to the database.

Any idea what I might be doing wrong?

CodePudding user response:

The problem is that your series_id values are too large to fit in a MySQL INT column: a signed INT tops out at 2147483647, which is exactly the value that ends up being stored. Change your create statement so the series_id column uses BIGINT instead:

CREATE TABLE series_info (
    series_id BIGINT NOT NULL PRIMARY KEY,
    s_area VARCHAR(40) NOT NULL,
    s_code VARCHAR(10) NOT NULL,
    dataset_version VARCHAR(20) NOT NULL
);
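Alternatively, the column type can be forced from the pandas side via the dtype argument of to_sql. The sketch below (my own example, not the asker's code) first confirms that the sample series_id values exceed the signed-INT ceiling, then writes them with an explicit SQLAlchemy BigInteger type; it uses an in-memory SQLite engine only so the snippet runs standalone, whereas the real code would keep the mysql+mysqlconnector URL:

```python
import pandas as pd
import sqlalchemy
from sqlalchemy.types import BigInteger

INT32_MAX = 2**31 - 1  # 2147483647, the maximum of a signed MySQL INT

df = pd.DataFrame({
    "series_id": [7179385449897108, 6756567457160340, 7692362580057236],
    "s_area": ["AFRG", "AFRG", "AFRG"],
})

# All three sample ids overflow a signed 32-bit INT.
print(all(v > INT32_MAX for v in df["series_id"]))  # True

# dtype={"series_id": BigInteger()} makes pandas create/bind the
# column as BIGINT instead of leaving the type to the driver.
engine = sqlalchemy.create_engine("sqlite://")
df.to_sql("series_info", con=engine, index=False,
          dtype={"series_id": BigInteger()})

# The large values round-trip unchanged.
out = pd.read_sql("SELECT series_id FROM series_info", con=engine)
print(out["series_id"].tolist())
```

With the table already created as BIGINT on the MySQL side, the dtype argument is not strictly required, but it keeps the schema explicit when pandas is the one creating the table.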