Data Frame .to_sql() causes error when INSERTING into SQL Server database

Time:08-28

I am attempting to generate an INSERT statement into a SQL Server database from a DataFrame:

data_frame.to_sql(table, engine, if_exists='append', index=False)

The DataFrame is created from a JSON response.

I am having issues inserting string values that contain single quotes. The cause appears to be that DataFrame fields are double-quoted whenever there is a single quote anywhere in the field. Example of the behavior:

INSERT INTO my_table(column1, column2, column3) VALUES ('value1', "value2's", 'value3');

Because value2's has a single quote in it, .to_sql() surrounds the field with double quotes. SQL Server then treats the value as a column name because it is double-quoted, and the insert fails.

How can I get .to_sql() to instead format the INSERT statement like below:

INSERT INTO my_table(column1, column2, column3) VALUES ('value1', 'value2''s', 'value3');
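(For reference, doubling the quote is how T-SQL escapes a single quote inside a string literal; as a plain string operation in Python, that transformation would be:)

```python
# T-SQL escapes a single quote inside a string literal by doubling it.
value = "value2's"
escaped = value.replace("'", "''")
print(escaped)  # value2''s
```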

The error message .to_sql() generates:

The identifier that starts with 'value2's' is too long. Maximum length is 128.

The error message above suggests that the database is treating the DataFrame value as a column name. With the default SET QUOTED_IDENTIFIER ON, SQL Server parses double-quoted tokens as identifiers rather than string literals, which produces exactly this error.

Libraries (some may not be relevant to this issue):

import pyodbc
import sqlalchemy
from sqlalchemy import MetaData
import pandas as pd
from sqlalchemy.engine import URL

Any help is appreciated!

CodePudding user response:

How have you determined that the error message applies to the df.to_sql() statement?

Consider the following example:

import os
import pandas as pd
from sqlalchemy import Column, MetaData, Table, VARCHAR, create_engine
import urllib

# Connection setup...
conn_params = urllib.parse.quote_plus(
    "Driver=ODBC Driver 18 for SQL Server;"
    f"Server=tcp:{os.getenv('DB_SERVER')},{os.getenv('DB_PORT')};"
    f"Database={os.getenv('DB_NAME')};"
    f"Uid={os.getenv('DB_USER')};"
    f"Pwd={{{os.getenv('DB_PASSWORD')}}};"
    "Encrypt=yes;"
    "TrustServerCertificate=yes;"
)
url = f"mssql+pyodbc:///?odbc_connect={conn_params}"
engine = create_engine(
    url,
    isolation_level="REPEATABLE READ"
)

# Database setup...
m = MetaData()
t = Table(
    "my_table",
    m,
    Column("column1", VARCHAR(20)),
    Column("column2", VARCHAR(20)),
    Column("column3", VARCHAR(20)),
)
m.create_all(engine)

# Dataframe setup...
df = pd.DataFrame(data={
    "column1": [1, 2, 3],
    "column2": ["Sam", "Daniel", "Jack"],
    "column3": ["Carter", "Jackson", "O'Neill"]
})

# Data insertion...
with engine.begin() as conn:
    conn.execute(t.delete())
    engine.echo = True
    df.to_sql(t.name, engine, if_exists="append", index=False)
    for row in conn.execute(t.select()).fetchall():
        print(row)

Executing this on the console gives us the following output:

# python3 demo.py
2022-08-27 05:43:11,069 INFO sqlalchemy.engine.Engine SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max))
2022-08-27 05:43:11,069 INFO sqlalchemy.engine.Engine [cached since 0.01563s ago] ('BASE TABLE', 'my_table', 'dbo')
2022-08-27 05:43:11,074 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2022-08-27 05:43:11,076 INFO sqlalchemy.engine.Engine INSERT INTO my_table (column1, column2, column3) VALUES (?, ?, ?)
2022-08-27 05:43:11,076 INFO sqlalchemy.engine.Engine [generated in 0.00060s] ((1, 'Sam', 'Carter'), (2, 'Daniel', 'Jackson'), (3, 'Jack', "O'Neill"))
2022-08-27 05:43:11,079 INFO sqlalchemy.engine.Engine COMMIT
('1', 'Sam', 'Carter')
('2', 'Daniel', 'Jackson')
('3', 'Jack', "O'Neill")

Notice that the O'Neill string has been wrapped in double quotes, i.e. "O'Neill", but this is purely how Python's repr displays the parameters in SQLAlchemy's echo output. The actual data has been inserted into the table correctly, as shown with sqlcmd:
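You can see that the double quotes come from Python itself, not from the SQL that was sent: repr switches to double quotes for any string that contains a single quote (and no double quote).

```python
# Python's repr prefers single quotes, but switches to double quotes
# when the string itself contains a single quote.
print(repr("Carter"))   # 'Carter'
print(repr("O'Neill"))  # "O'Neill"
```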

# /opt/mssql-tools18/bin/sqlcmd \
        -S $DB_SERVER \
        -U $DB_USER \
        -P $DB_PASSWORD \
        -C \
        -d $DB_NAME \
        -Q "select * from dbo.my_table"
column1              column2              column3             
-------------------- -------------------- --------------------
1                    Sam                  Carter              
2                    Daniel               Jackson             
3                    Jack                 O'Neill             

(3 rows affected)
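The underlying point is that df.to_sql() issues a parameterized INSERT (the VALUES (?, ?, ?) in the echo output above), so the driver binds the values and no quoting or escaping of the data ever appears in the SQL text. A minimal sketch of the same idea, using Python's built-in sqlite3 only so it runs without a SQL Server instance:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (column1 TEXT, column2 TEXT, column3 TEXT)")

# The values are bound as parameters; the single quote needs no escaping.
conn.execute(
    "INSERT INTO my_table (column1, column2, column3) VALUES (?, ?, ?)",
    ("value1", "value2's", "value3"),
)
print(conn.execute("SELECT * FROM my_table").fetchone())
# ('value1', "value2's", 'value3')
```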

CodePudding user response:

I think the answer provided by "AlwaysLearning" is correct. This is not strictly an answer, but an alternative approach to the same problem, i.e. moving a DataFrame into an ODBC database.

from arrow_odbc import insert_into_table
import pyarrow as pa
import pandas as pd

# Placeholder ODBC connection string; substitute your own driver/server details.
connection_string = "Driver={ODBC Driver 18 for SQL Server};Server=my_server;Database=my_db;"


def dataframe_to_table(df: pd.DataFrame):
    # Convert the DataFrame to an Arrow table, then stream it in record batches.
    table = pa.Table.from_pandas(df)
    reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches())
    insert_into_table(
        connection_string=connection_string,
        user="user",
        password="password",
        chunk_size=100,
        table="my_table",
        reader=reader,
    )

Full disclosure: I am the author of arrow-odbc so my views may be biased.
