How do I improve the performance of python insert into Postgres


  • Using execute: 40 inserts per minute
  • Using executemany: 41 inserts per minute
  • Using extras.execute_values: 42 inserts per minute
import datetime
from typing import Any

import psycopg2
from psycopg2 import extras


def save_return_to_postgres(record_to_insert) -> Any:

    insert_query = """INSERT INTO pricing.xxxx (description,code,unit,price,created_date,updated_date)
             VALUES %s returning id"""

    records = (record_to_insert[2], record_to_insert[1], record_to_insert[3],
               record_to_insert[4], record_to_insert[0], datetime.datetime.now())

    # df = df[["description","code","unit","price","created_date","updated_date"]]

    
    try:
        
        conn = psycopg2.connect(database = 'xxxx',
                        user = 'xxxx',
                        password = 'xxxxx',
                        host= 'xxxx',
                        port='xxxx',
                        connect_timeout = 10)

        print("Connection Opened with Postgres")

        cursor = conn.cursor()
        extras.execute_values(cursor, insert_query, [records])
        
        conn.commit()

        # print(record_to_insert)

    finally:
        
        if conn:
            cursor.close()
            conn.close()
            print("Connection to postgres was successfully closed")

valores = df.values

for valor in valores:
    save_return_to_postgres(valor)
    print(valor)

CodePudding user response:

I don't know how many rows per INSERT Postgres can take, but many SQL-based databases can accept multiple rows in a single INSERT statement.

So instead of running

for insert_query in queries:
   sql_execute(insert_query)

Try making several inserts at once in a single command (test it in pure SQL first to see if it works):

insert_list=[]
for insert_query in queries:
    insert_list.append(insert_query)
sql_execute(insert_list)
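For Postgres specifically, "several inserts at once" means one INSERT statement carrying several VALUES groups. Below is a minimal, untested psycopg2 sketch of that idea, reusing the pricing.xxxx table and placeholder connection settings from the question; the widget rows are invented purely for illustration.

import psycopg2

rows = [
    ("Widget A", "A100", "EA", 9.99),
    ("Widget B", "B200", "EA", 12.50),
]

conn = psycopg2.connect(database='xxxx', user='xxxx', password='xxxxx',
                        host='xxxx', port='xxxx', connect_timeout=10)
with conn, conn.cursor() as cursor:
    # One (%s, %s, %s, %s, now(), now()) group per row, all in a single statement,
    # so the whole batch costs one round trip instead of len(rows) round trips.
    groups = ", ".join(["(%s, %s, %s, %s, now(), now())"] * len(rows))
    sql = ("INSERT INTO pricing.xxxx "
           "(description, code, unit, price, created_date, updated_date) "
           "VALUES " + groups)
    cursor.execute(sql, [value for row in rows for value in row])  # flattened parameters
conn.close()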

I had a similar issue and this link helped me: https://www.sqlservertutorial.net/sql-server-basics/sql-server-insert-multiple-rows/ (of course mine was not Postgres, but the idea is the same: reduce network time by running multiple inserts in one command).

Hope it helps.

CodePudding user response:

Use execute_batch or execute_values and use them over the entire record set. Right now you are not getting the batch benefits of execute_values because you are inserting a single record at a time. You are slowing things down further by opening and closing a connection for each record, which is a time- and resource-expensive operation. The code below is untested, as I don't have the actual data and am assuming what df.values contains.

import datetime

import psycopg2
from psycopg2 import extras

insert_query = """INSERT INTO pricing.xxxx (description,code,unit,price,created_date,updated_date)
             VALUES %s returning id"""
# execute_batch query
# insert_query = """INSERT INTO pricing.xxxx (description,code,unit,price,created_date,updated_date)
#              VALUES (%s, %s, %s, %s, %s, %s) returning id"""

valores = df.values
# Create a list of lists to pass to the query as one batch instead of row by row.
records = [[record_to_insert[2], record_to_insert[1], record_to_insert[3],
            record_to_insert[4], record_to_insert[0], datetime.datetime.now()]
           for record_to_insert in valores]

conn = None
try:
    conn = psycopg2.connect(database = 'xxxx',
                    user = 'xxxx',
                    password = 'xxxxx',
                    host= 'xxxx',
                    port='xxxx',
                    connect_timeout = 10)

    print("Connection Opened with Postgres")

    cursor = conn.cursor()
    # Pass the whole list of rows so execute_values can batch them itself.
    extras.execute_values(cursor, insert_query, records)
    # execute_batch
    # extras.execute_batch(cursor, insert_query, records)

    conn.commit()

finally:
    if conn:
        cursor.close()
        conn.close()
        print("Connection to postgres was successfully closed")

For more information see Fast execution helpers. Note that both execute_values and execute_batch take a page_size argument with a default value of 100; this is the batch size for the operation. For large data sets you can reduce the time further by increasing page_size, which sends bigger batches and cuts the number of server round trips.
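For example, here is a sketch of the same call with a larger page size, reusing cursor, insert_query and records from the snippet above (the value 1000 is an arbitrary illustration, and fetch=True simply collects the ids produced by the returning id clause):

# Bigger pages mean fewer round trips; tune page_size to your data volume.
returned_ids = extras.execute_values(cursor, insert_query, records,
                                     page_size=1000, fetch=True)
print(len(returned_ids), "rows inserted")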
