I need to read data from a numeric Postgres table and create a DataFrame accordingly.
The default way pandas does this is with DataFrame.from_records:
df = DataFrame.from_records(data,
                            columns=columns,
                            coerce_float=coerce_float)
When data looks like:
[(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)...]
And columns looks like:
['a', 'b']
The problem is that the generated DataFrame ignores the original Postgres types: double precision and real.
As I use huge DataFrames and my data is mostly real, I'd like to explicitly specify the column types.
So I tried:
df = DataFrame.from_records(np.array(data, dtype=columns),
                            coerce_float=coerce_float)
When data is the same, but columns looks like:
[('a', 'float64'), ('b', 'float32')]
(types are extracted from Postgres as a part of query and converted to Numpy dtypes)
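For reference, that conversion step is roughly the following (a sketch; pg_column_types is a hypothetical list of (column name, Postgres type name) pairs pulled from the catalog alongside the query):

# Hypothetical mapping from Postgres type names to NumPy dtypes
PG_TO_NUMPY = {'double precision': 'float64', 'real': 'float32'}
columns = [(name, PG_TO_NUMPY[pg_type]) for name, pg_type in pg_column_types]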
This approach works, but DataFrame construction is 2-3 times slower (for 2M-row DataFrames it takes several seconds), because np.array generation is for some reason very slow. In real life I have 10-200 columns, mostly float32.
What is the fastest way to construct a DataFrame with specified column types?
CodePudding user response:
If you already know the column names and their types, you can set them when building the frame. Note that the DataFrame constructor's dtype argument only accepts a single dtype for all columns, so per-column types are applied afterwards with astype:
df = pd.DataFrame(data, columns=column_list).astype({'a': 'float64', 'b': 'float32'})
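If the slow part is materializing the intermediate structured array (as in the question), another option is to type each column individually and hand pandas a dict of already-typed arrays; a minimal sketch, assuming rows arrive as tuples and the (name, dtype) pairs from the question:

import numpy as np
import pandas as pd

# Example data in the question's shape: row tuples plus (name, dtype) pairs
data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]
columns = [('a', 'float64'), ('b', 'float32')]

# Transpose the rows once with zip(*data), then build each column as a
# typed NumPy array; this skips the per-record parsing done by
# np.array(data, dtype=columns)
arrays = {name: np.asarray(col, dtype=dt)
          for (name, dt), col in zip(columns, zip(*data))}
df = pd.DataFrame(arrays)
print(df.dtypes)  # a: float64, b: float32

Since each array already has its final dtype, pandas should not need an extra conversion pass when assembling the frame.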
CodePudding user response:
Try connecting to the PostgreSQL database and reading directly into a pandas DataFrame, in case you haven't tried this yet:
import pandas as pd
import psycopg2 as pg

# Open a connection and let read_sql build the DataFrame straight from the query
connection = pg.connect("dbname='dbname' user='pguser' host='127.0.0.1' port='15432' password='password'")
df = pd.read_sql('select * from table', connection)
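One caveat: read_sql infers dtypes from the Python values psycopg2 returns, so real columns still come back as float64. If memory is the concern, they can be downcast after loading (a sketch, reusing the column name from the question):

df = df.astype({'b': 'float32'})  # downcast the former real column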