How to insert only new values in SQLite DB from Python and Pandas DF?-CodePudding

Im trying to "insert" new data/values to my SQLite DB using Python script and data in Pandas DF. The first insert works well, with either df.to_sql() or with cursor.execute("SQL STATEMENT"). But my issue is when trying to import/insert new data.

Problem I either get duplicate values or DB values simply gets replaced when I do a new insert. Using one of the data columns as PK won't work as none of them is truly Unique.

Goal What Im looking for is to only import new values, like; if the row with all its values DONT exist INSERT otherwise IGNORE.

Question What would be the proper way of inserting new values into the DB, when some of the data in the insert are Duplicates and should Not be inserted and some of the data is New data and Should be inserted?

Project background, current code, and What I've done

"Version 1" of the code is working, I'm getting the data as expected and It gets inserted into the DB as expected. I'm fetching FX rates thru an API, normalizing JSON, and creating a Pandas DF. I do some data structure and remove and add some columns. Then import it to DB. (Not sure that the current workflow is best practice, any advice on that would also be appreciated)

# Call API...Get data...Save as JSON...
url = f'https://min-api.cryptocompare.com/data/{timeframe}?fsym={coin}&tsym={fx_converter}&limit={limiter}'
data = json.loads(requests.get(url).text)

# Normalize JSON...Create DataFrame
df = pd.json_normalize(data, ['Data'])

Some additional DF structure-code, don't seem relevant for the Q

# Database Connection
cnxn = sqlite3.connect("fx_rates.db")
cursor = cnxn.cursor()

# Create table 
table = f""" CREATE TABLE IF NOT EXISTS {coin} 
    (
        time                INTEGER NOT NULL,
        high                REAL,
        low                 REAL,
        open                REAL,
        volumefrom          INTEGER,
        volumeto            INTEGER,
        close               REAL,
        conversionType      TEXT,
        conversionSymbol    TEXT,
        date                TEXT
    )"""

cursor.execute(table)
cnxn.commit()

# Insert data 
col = tuple(df.columns)

for i, value in df.iterrows():
    cursor.execute(
    f"""
        INSERT OR IGNORE INTO {coin}{col} 
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, 
    (
        value['time'],
        value['high'],
        value['low'],
        value['open'],
        value['volumefrom'],
        value['volumeto'],
        value['close'],
        value['conversionType'],
        value['conversionSymbol'],
        value['date']
    )
)

cnxn.commit()
cnxn.close()

If I run the code again now it gets duplicate values and rows. Neither of the columns is Unique and could be PK. Tried using df.to_sql(sql, if_exists='append' / 'replace' but that's not correct as it would just add duplicates or simply delete all old values and put in the new load.

Using some sort of Insert and Select seems to be an option but I can't figure out How to code that. Been looking for something like this, but dont know how to use the DF as a 'table'?

""" 
    SELECT date, conversionSymbol
    FROM table1 s
    WHERE NOT EXISTS 
        (
            SELECT 1
            FROM table2 d
            WHERE d.open = s.open
        )
"""

Feels to me that I have missed something fundamental and it's really easy to solve. That there is a "simple" way to use SQL statements to get this done. Or that the solution is not possible when the data is in Pandas and perhaps I need to use some ORM like SQLalchemy, I have no experience in that.

CodePudding user response：

You can create primary key that is based on several columns See here Sqlite primary key on multiple columns

Or you can define a unique index based on several columns. It's more or less equivalent.