Home > OS >  MYSQL LOAD DATA INFILE statement works in workbench, but not python
MYSQL LOAD DATA INFILE statement works in workbench, but not python

Time:04-26

I am experimenting with MySQL and data analytics using the Fannie Mae mortgage data. To do this, I've created two tables, ("perf" and "acq") and several python functions. I start by dropping the collection and tables if they exist, then create the collection (mortgage_analysis) and the two tables. Then I build a list of files that correspond to the number of years of analysis I want to perform. All of that works just fine.

I then use the following function to load the tables with data from the perf and acq text files from Fannie Mae. The same function is used to load both tables. It works every time with the "perf" table and NEVER works with the "acq" table. If I take the SQL statements and execute them in the mySQL workbench, the statements work every time. I'm stumped and could use some help.

The SQL statement that works in the workbench but not in Python is:

LOAD DATA  INFILE '/Users/<my user name>/github/mortgage-analysis-example/data/text/acq/Acquisition_2000Q1.txt' 
INTO TABLE acq 
FIELDS TERMINATED BY '|' 
LINES TERMINATED BY '\n' 
(loan_id, orig_channel, seller_name, orig_interest_rate, orig_upb, orig_loan_term, 
 orig_date, first_pay_date, orig_ltv, orig_cltv, num_borrowers, dti, 
 borrower_credit_score, first_home_buyer, loan_purpose, property_type, num_units, 
 occupancy_status, property_state, zip, mortgage_insurance_percent, product_type, 
 coborrow_credit_score, mortgage_insurance_type, relocation_mortgage_ind);

The python function to load this is:

def loadTable(env_in, template, file_list):
    # env_in: (i know, uck global variable, holds info for this notebook common to all functions
    # template: SQL template file
    # file_list: python list element with fully qualified file names to use with SQL statement 
    env = env_in # environment info
    from mysql.connector import Error
    _file = open(env["base_path"] env["path_separator"] template, "r")
    _template = _file.readlines()
    try:
        conn = mysql.connector.connect(host=env["mySQL"]["host"],user=env["mySQL"]["user"], passwd=env['pw'])
        if conn.is_connected():
            print('Connected to MySQL database')
    except Error as e:
            print(e)
    cursor = conn.cursor()
    cursor.execute("USE mortgage_analysis;")
    cursor.execute("SET SESSION sql_mode = '';")
    print("starting table load")
    t0 = time.time()
    res = []
    for _file in file_list:
        _sql = _template[0].format(_file)
        print(f"\n{_sql}\n")
        try:
            res = cursor.execute(_sql)
            warn = cursor.fetchwarnings()
            #print(f"warn: {warn}")
        except Error as e:
            print(f"{_sql} \n{e}")

    t1 = time.time()
    print(f"Years: {env['years']} Table load time: {t1-t0}") 
    conn.close
    return env

No errors are identified (try always works) and no warnings are generated (fetchwarnings is always empty).

The SQL statements used to create the two tables are:

DROP TABLE IF EXISTS acq;
CREATE TABLE acq (id DOUBLE AUTO_INCREMENT, loan_id DOUBLE, orig_channel VARCHAR(255), seller_name VARCHAR(255), orig_interest_rate DOUBLE, orig_upb DOUBLE, orig_loan_term DOUBLE, orig_date VARCHAR(255), first_pay_date VARCHAR(255), orig_ltv DOUBLE, orig_cltv DOUBLE, num_borrowers DOUBLE, dti DOUBLE, borrower_credit_score DOUBLE, first_home_buyer VARCHAR(255), loan_purpose VARCHAR(255), property_type VARCHAR(255), num_units DOUBLE, occupancy_status VARCHAR(255), property_state VARCHAR(255), zip DOUBLE, mortgage_insurance_percent DOUBLE, product_type VARCHAR(255), coborrow_credit_score DOUBLE, mortgage_insurance_type DOUBLE, relocation_mortgage_ind VARCHAR(255), PRIMARY KEY (id));
DROP TABLE IF EXISTS perf;
CREATE TABLE perf (id DOUBLE AUTO_INCREMENT, loan_id DOUBLE, monthly_reporting_period VARCHAR(255), servicer VARCHAR(255), interest_rate DECIMAL(6,3), current_actual_upb DECIMAL(12,2), loan_age DOUBLE, remaining_months_to_legal_maturity DOUBLE, adj_remaining_months_to_maturity DOUBLE, maturity_date VARCHAR(255), msa DOUBLE, current_loan_delinquency_status DOUBLE, mod_flag VARCHAR(255), zero_balance_code VARCHAR(255), zero_balance_effective_date VARCHAR(255), last_paid_installment_date VARCHAR(255), foreclosed_after VARCHAR(255), disposition_date VARCHAR(255), foreclosure_costs DOUBLE, prop_preservation_and_reair_costs DOUBLE, asset_recovery_costs DOUBLE, misc_holding_expenses DOUBLE, holding_taxes DOUBLE, net_sale_proceeds DOUBLE, credit_enhancement_proceeds DOUBLE, repurchase_make_whole_proceeds DOUBLE, other_foreclosure_proceeds DOUBLE, non_interest_bearing_upb DOUBLE, principal_forgiveness_upb VARCHAR(255), repurchase_make_whole_proceeds_flag VARCHAR(255), foreclosure_principal_write_off_amount VARCHAR(255), servicing_activity_indicator VARCHAR(255), PRIMARY KEY (id));

CodePudding user response:

I tested the code and I had to make a few changes to get it to work.

Don't change sql_mode. I did not get any errors, I was able to load the data without compromising the sql_mode.

I used test data:

1|2|name1|3|4|4|date|date|5|6|7|8|9|buyer|purpose|type|10|status|state|11|12|type|13|14|ind
1|2|name2|3|4|4|date|date|5|6|7|8|9|buyer|purpose|type|10|status|state|11|12|type|13|14|ind

I urge you to choose more appropriate data types. You should use FLOAT or DOUBLE almost never in MySQL, unless you're storing scientific measurements or something. Definitely not for units of currency, or integers.

I would not use VARCHAR to store dates. MySQL has DATE and DATETIME, and these ensure dates are well-formed so you can do comparisons, sorting, date arithmetic, etc.

If you have errors that suggest you should relax the sql_mode and permit invalid queries or invalid data, I would recommend you fix the data instead. Even if you get data loaded, there's a risk it will become garbage if you allow non-strict SQL mode.

Code changes:

Instead of using format() to try to insert the filename into the query template, I used a query parameter. Delete the line with _template[0].format(_file), and instead use:

res = cursor.execute(_template, [_file])

But the template has to put the placeholder in without quotes:

Correct:

LOAD DATA INFILE %s INTO TABLE...

Incorrect:

LOAD DATA INFILE '%s' INTO TABLE...

Finally, data changes in Python are not committed by default. That is, you can insert data, then when you use conn.close the uncommitted changes are discarded. So I added a line:

conn.commit()

I put this in the try/catch block after executing the SQL.

This was successful in loading the data. Note I had to make assumptions about your input data, since you did not share a sample. I don't know if your file is actually well-formed with the proper field separators and line separators. But I assumed it was, since you said it worked in MySQL Workbench.

  • Related