I am attempting to generate an INSERT statement for a SQL Server database from a pandas DataFrame:
data_frame.to_sql(table, engine, if_exists='append', index=False)
The DataFrame is created from a JSON response.
I am having issues inserting string values that contain single quotes into the database. The cause appears to be that the DataFrame fields are double quoted whenever a single quote appears anywhere in the field. Example of the behavior:
INSERT INTO my_table(column1, column2, column3) VALUES ('value1', "value2's", 'value3');
Because value2's contains a single quote, .to_sql() surrounds the field with double quotes instead. However, SQL Server then interprets the double-quoted value as a column identifier, so the insert fails.
How can I get .to_sql() to instead format the INSERT statement like below:
INSERT INTO my_table(column1, column2, column3) VALUES ('value1', 'value2''s', 'value3');
The error message .to_sql() generates:
The identifier that starts with 'value2's' is too long. Maximum length is 128.
The error message above suggests that the database is treating the DataFrame value as a column name; SQL Server typically raises this error when a value is double quoted (making it an identifier) rather than single quoted (a string literal).
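For comparison, a plain parameterized INSERT through pyodbc should not hit this problem, because the driver binds the values rather than splicing quoted literals into the statement. A minimal sketch (the connection string below is a placeholder, not my real one):
import pyodbc
# Placeholder connection string -- substitute real server/database values.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=my_server;Database=my_db;Uid=my_user;Pwd=my_password;"
)
cursor = conn.cursor()
# The ? markers let the ODBC driver handle all quoting, so a value such as
# "value2's" needs no manual escaping.
cursor.execute(
    "INSERT INTO my_table (column1, column2, column3) VALUES (?, ?, ?)",
    ("value1", "value2's", "value3"),
)
conn.commit()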
Libraries (some may not be relevant to this issue):
import pyodbc
import sqlalchemy
from sqlalchemy import MetaData
import pandas as pd
from sqlalchemy.engine import URL
Any help is appreciated!
CodePudding user response:
How have you determined that the error message applies to the df.to_sql() statement?
Consider the following example:
import os
import pandas as pd
from sqlalchemy import Column, MetaData, Table, VARCHAR, create_engine
import urllib.parse
# Connection setup...
conn_params = urllib.parse.quote_plus(
    "Driver=ODBC Driver 18 for SQL Server;"
    f"Server=tcp:{os.getenv('DB_SERVER')},{os.getenv('DB_PORT')};"
    f"Database={os.getenv('DB_NAME')};"
    f"Uid={os.getenv('DB_USER')};"
    f"Pwd={{{os.getenv('DB_PASSWORD')}}};"
    "Encrypt=yes;"
    "TrustServerCertificate=yes;"
)
url = f"mssql+pyodbc:///?odbc_connect={conn_params}"
engine = create_engine(
    url,
    isolation_level="REPEATABLE READ",
)
# Database setup...
m = MetaData()
t = Table(
    "my_table",
    m,
    Column("column1", VARCHAR(20)),
    Column("column2", VARCHAR(20)),
    Column("column3", VARCHAR(20)),
)
m.create_all(engine)
# Dataframe setup...
df = pd.DataFrame(data={
    "column1": [1, 2, 3],
    "column2": ["Sam", "Daniel", "Jack"],
    "column3": ["Carter", "Jackson", "O'Neill"],
})
# Data insertion...
with engine.begin() as conn:
    conn.execute(t.delete())
    engine.echo = True
    df.to_sql(t.name, engine, if_exists="append", index=False)
    for row in conn.execute(t.select()).fetchall():
        print(row)
Executing this on the console gives us the following output:
# python3 demo.py
2022-08-27 05:43:11,069 INFO sqlalchemy.engine.Engine SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
FROM [INFORMATION_SCHEMA].[TABLES]
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max))
2022-08-27 05:43:11,069 INFO sqlalchemy.engine.Engine [cached since 0.01563s ago] ('BASE TABLE', 'my_table', 'dbo')
2022-08-27 05:43:11,074 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2022-08-27 05:43:11,076 INFO sqlalchemy.engine.Engine INSERT INTO my_table (column1, column2, column3) VALUES (?, ?, ?)
2022-08-27 05:43:11,076 INFO sqlalchemy.engine.Engine [generated in 0.00060s] ((1, 'Sam', 'Carter'), (2, 'Daniel', 'Jackson'), (3, 'Jack', "O'Neill"))
2022-08-27 05:43:11,079 INFO sqlalchemy.engine.Engine COMMIT
('1', 'Sam', 'Carter')
('2', 'Daniel', 'Jackson')
('3', 'Jack', "O'Neill")
Notice that the O'Neill string has been wrapped in double quotes, i.e. "O'Neill", but this is purely how SQLAlchemy's echo renders the parameter tuples (Python's repr(); see the short demonstration after the sqlcmd output below). The actual data has been inserted into the table correctly, as shown with sqlcmd:
# /opt/mssql-tools18/bin/sqlcmd \
-S $DB_SERVER \
-U $DB_USER \
-P $DB_PASSWORD \
-C \
-d $DB_NAME \
-Q "select * from dbo.my_table"
column1 column2 column3
-------------------- -------------------- --------------------
1 Sam Carter
2 Daniel Jackson
3 Jack O'Neill
(3 rows affected)
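The double quotes come from Python's repr(): SQLAlchemy's echo logs each parameter tuple with repr(), and repr() switches to double quotes for a string that contains a single quote. Quick confirmation in any Python shell:
print(repr("O'Neill"))  # contains a single quote, so repr uses double quotes
print(repr("Carter"))   # no quote character, so repr uses single quotes
This prints "O'Neill" and 'Carter', matching the presentation in the echo output above.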
CodePudding user response:
I think the answer provided by "AlwaysLearning" is correct. This is not strictly an answer, but an alternative way to approach the same problem, i.e. getting a DataFrame into an ODBC database.
from arrow_odbc import insert_into_table
import pyarrow as pa
import pandas
def dataframe_to_table(df, connection_string):
    # Convert the DataFrame to an Arrow table, then stream it batch by batch.
    table = pa.Table.from_pandas(df)
    reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches())
    insert_into_table(
        connection_string=connection_string,
        user="user",
        password="password",
        chunk_size=100,
        table="my_table",
        reader=reader,
    )
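A hypothetical call, with a placeholder connection string:
df = pandas.DataFrame(data={
    "column1": ["1"],
    "column2": ["Jack"],
    "column3": ["O'Neill"],
})
# Placeholder ODBC connection string -- substitute real values.
dataframe_to_table(df, "Driver={ODBC Driver 18 for SQL Server};Server=my_server;Database=my_db;")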
Full disclosure: I am the author of arrow-odbc, so my views may be biased.