Efficient way to create DataFrame with different column types


I need to read data from a numeric Postgres table and create a DataFrame accordingly.

The default way pandas does this is with DataFrame.from_records:

df = DataFrame.from_records(data,
                            columns=columns,
                            coerce_float=coerce_float)

When data looks like:

[(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)...] 

And columns looks like:

['a', 'b']

The problem is that the generated DataFrame ignores the original Postgres types: double precision and real.
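As a quick check with the two sample rows above, the inferred dtypes can be printed directly; both columns come back as float64 regardless of the Postgres column types:

    import pandas as pd

    data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]
    df = pd.DataFrame.from_records(data, columns=['a', 'b'], coerce_float=True)
    # Both 'a' and 'b' are inferred as float64, even though 'b' was a real in Postgres.
    print(df.dtypes)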

As I work with huge DataFrames and my data is mostly real, I'd like to specify the column types explicitly.

So I tried:

df = DataFrame.from_records(np.array(data, dtype=columns),
                            coerce_float=coerce_float)

Here data is the same as before, but columns looks like:

[('a', 'float64'), ('b', 'float32')]

(the types are extracted from Postgres as part of the query and converted to NumPy dtypes)
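For illustration only, one possible sketch of that extraction step, using a hypothetical helper (not part of the original question) that reads the declared types from information_schema and maps the two types mentioned here to NumPy dtype strings:

    # Hypothetical helper: map Postgres type names to NumPy dtype strings and
    # read them per column from information_schema, in column order.
    PG_TO_NUMPY = {'double precision': 'float64', 'real': 'float32'}

    def numpy_dtypes(cursor, table_name):
        cursor.execute(
            "select column_name, data_type from information_schema.columns "
            "where table_name = %s order by ordinal_position", (table_name,))
        return [(name, PG_TO_NUMPY.get(pg_type, 'float64'))
                for name, pg_type in cursor.fetchall()]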

This approach works, but constructing the DataFrame is 2-3 times slower (for 2M-row DataFrames it takes several seconds), because the np.array conversion is for some reason very slow. In real life I have 10-200 columns, mostly float32.
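To make the comparison concrete, a minimal timing sketch (with synthetic data standing in for the query result; the numbers will vary by machine) might look like this:

    import numpy as np
    import pandas as pd
    from timeit import timeit

    # Synthetic stand-in for the query result: 2M rows of (double precision, real),
    # delivered as a list of tuples the way the DB driver returns them.
    n_rows = 2_000_000
    rows = list(zip(np.random.rand(n_rows).tolist(),
                    np.random.rand(n_rows).astype(np.float32).tolist()))
    columns = ['a', 'b']
    dtypes = [('a', 'float64'), ('b', 'float32')]

    # Untyped path: from_records infers float64 for both columns.
    t_plain = timeit(lambda: pd.DataFrame.from_records(rows, columns=columns), number=1)

    # Typed path from the question: build a structured array first, then the DataFrame.
    t_typed = timeit(lambda: pd.DataFrame.from_records(np.array(rows, dtype=dtypes)), number=1)

    print(f"plain: {t_plain:.2f}s  typed: {t_typed:.2f}s")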

What is the fastest way to construct a DataFrame with specified column types?

CodePudding user response:

If you already know the column names and their types, the following pattern will produce a DataFrame with the specified dtypes. Note that the dtype argument of the DataFrame constructor only accepts a single dtype, so the per-column types have to be applied with astype:

    pd.DataFrame(data, columns=columnList).astype(dict(zip(columnList, dtypeList)))
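Applied to the sample data from the question, a minimal sketch (column names and target types taken from the question) would be:

    import numpy as np
    import pandas as pd

    data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]
    columns = ['a', 'b']

    # Construct first (everything is inferred as float64), then cast 'b' down to float32.
    df = pd.DataFrame(data, columns=columns).astype({'a': np.float64, 'b': np.float32})
    print(df.dtypes)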

CodePudding user response:

Try connecting to the PostgreSQL database and reading directly into a pandas DataFrame. Not sure if you have already tried this.

import pandas as pd
import psycopg2 as pg

# Open a connection and let pandas build the DataFrame straight from the query.
connection = pg.connect("dbname='dbname' user='pguser' host='127.0.0.1' port='15432' password='password'")
df = pd.read_sql('select * from table', connection)
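Note that read_sql typically still infers float64 for the real columns, because the driver hands back plain Python floats; if float32 storage matters, the relevant columns would still need to be downcast afterwards (for example with astype, as in the previous answer).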