Home > Mobile >  pandas.DataFrame.to_sql - column order of source csv file and destination table
pandas.DataFrame.to_sql - column order of source csv file and destination table

Time:12-29

In a notebook inside Azure Databricks, I am loading data from a CSV file to an Azure SQL database table by using pandas.DataFrame.to_sql. The column order in both the CSV file and the SQL table is exactly the same. But their names are different.

Question: would pandas.DataFrame.to_sql still load the data correctly to the corresponding columns? For example, if CSV file has columns F_Name, L_Name, Age, Gender - and the SQL table has columns (in same order) as fname, lname, person_age, sex, will the data still be loaded in the same order (that is, first name column values in CSV file will go to fist name column of sql table, last name column value of CSV file will go to last name column of sql table etc...).

Remarks: The csv file has about 150 columns, and SQL table has exactly same number of columns. The csv file is loaded from an external source on regular basis with same number of columns in exact same order but the column names sometimes very slightly column1 vs. coluumn_1, .... etc. So we don't have control over their column names (although the order is always the same first name, last name, age,...)

Code snippet

import pandas as pd
import sqlalchemy as sq
.............

pandasDFrame = pd.read_csv('/dbfs/FileStore/tables/MyFile.csv', header='infer')

pandasDFrame .to_sql(name='MySQLTable', con=engine, if_exists='replace', method='multi', chunksize=1000, index=False,\
dtype={'fname': sq.VARCHAR(15),\
     'lname': sq.VARCHAR(15),\                                                                                      
     `person_age`: sq.varchar(3),\
     `sex` : sq.varchar(10)})

CodePudding user response:

Consider retrieving the destination table columns and assign it to your data frame before appending data frame to table:

with engine.begin() as conn:
    result = conn.execute("SELECT TOP 0 * FROM MySQLTable")
    cols = [col for col in result.keys()]

pandasDFrame.columns = cols

pandasDFrame.to_sql(
    name = 'MySQLTable', 
    con = engine, 
    if_exists = 'replace', 
    method = 'multi', 
    chunksize = 1000, 
    index = False,
    dtype = {
        'fname': sq.VARCHAR(15),
        'lname': sq.VARCHAR(15),
        'person_age': sq.varchar(3),
        'sex': sq.varchar(10)
    }
)
  • Related