Home > Mobile >  pandas read_sql_query with params matching multiple columns
pandas read_sql_query with params matching multiple columns

Time:11-27

I'm trying to query a table using pandas.read_sql_query, where I want to match multiple columns to python lists passed in as param arguments. Running into various psycopg2 errors when trying to accomplish this.

Ideally, I would provide a reproducible example, but unfortunately, that's not possible here due to the SQL connection requirement. If there is some way to provide a reproducible example, please let me know and I will edit the code below. Assume that the entries of col1 are strings and those of col2 are numeric values.

Note that I'm trying to ensure that each row of col1 and col2 matches the corresponding combination of list1 and list2, so it would not be possible to do separate where clauses for each, i.e., where col1 = any(%(list1)s) and col2 = any(%(list2)s).

First, I tried passing the lists as separate parameters and then combining them into an array within the SQL query:

import pandas as pd
list1 = ['a', 'b', 'c']
list2 = [1,2,3]

pd.read_sql_query(
    """
    select * from table
    where (col1, col2) = any(array[(%(list1)s, %(list2)s)])
    """,
    con = conn,
    params = {'list1': list1, 'list2':list2}

)

When I try this, I get Datatype Mismatch: cannot compare dissimilar columns of type text and text[] at column 1.

Also tried the following variant, where I passed a list of lists into param:

pd.read_sql_query(
    """
    select * from table
    where (col1, col2) = any(%(arr)s)
    """,
    con = conn,
    params = {'arr': [[x,y] for x,y in zip(list1,list2)]}

)

Here, I got DataError: (psycopg2.errors.InvalidTextRepresentation) in valid input syntax for integer: "a".

Tried a few other minor variants of the above, but every attempt threw some kind of error. So, what's the syntax needed in order to accomplish this?

EDIT: Including a reproducible example:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

list1 = ["a", "b", "c"]
list2 = [1, 2, 3]
n = 100

engine = create_engine(
    "postgresql psycopg2://postgres:password@localhost:5432/database"
)

np.random.seed(2022)
df = pd.DataFrame(
    {
        "col1": np.random.choice(list1, n, replace=True),
        "col2": np.random.choice(list2, n, replace=True),
    }
)

# write table to database
df.to_sql("toy_table", engine, if_exists="replace", index=False)

# query with where any
df_query = pd.read_sql_query(
    """
    select * from toy_table
    where col1 = any(%(list1)s) and col2=any(%(list2)s)
    """,
    con=engine,
    params={"list1": list1, "list2": list2},
)

# expected output
rows = [(x, y) for x, y in zip(list1, list2)]
df_expected = df.loc[df.apply(lambda x: tuple(x.values) in rows, axis=1)]

# Throws assertion error
assert df_expected.equals(df_query)

CodePudding user response:

To make a comparison of exact pairs you could convert your array to a dictionary then to JSON, taking advantage of PostgreSQL JSON functions like this:

#combine lists into a dictionary then convert to json
json1 = json.dumps(dict(zip(list1, list2)))

then query request should be

df = pd.read_sql_query(
            """
            select * 
            from "table"
            where concat('{"',col1,'":',col2,'}')::jsonb <@ %s::jsonb
            """,
            con = conn,
            params = (json1,)
        )

Tested with python 3.10.6

CodePudding user response:

Found a solution that works for any number of input lists:

df_query = pd.read_sql_query(
    """
    select * from toy_table
    join (select unnest(%(list1)s) as list1, unnest(%(list2)s) as list2) as tmp
    on col1 = list1 and col2 = list2
    """,
    con=engine,
    params={"list1": list1, "list2": list2},
)

# expected output
rows = [(x, y) for x, y in zip(list1, list2)]
df_expected = df.loc[df.apply(lambda x: tuple(x.values) in rows, axis=1)]

assert df_query.equals(df_expected)
  • Related