I need to read data from a numeric Postgres table and create a DataFrame accordingly.
The default way pandas does this is with DataFrame.from_records:
df = DataFrame.from_records(data,
                            columns=columns,
                            coerce_float=coerce_float)
When data looks like:
[(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)...]
And columns looks like:
['a', 'b']
The problem is that the generated DataFrame ignores the original Postgres types: double precision and real.
As I use huge DataFrames and my data is mostly real, I'd like to explicitly specify the column types.
So I tried:
df = DataFrame.from_records(np.array(data, dtype=columns),
                            coerce_float=coerce_float)
When data is the same, but columns looks like:
[('a', 'float64'), ('b', 'float32')]
(types are extracted from Postgres as a part of query and converted to Numpy dtypes)
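For reference, that conversion step is roughly the following (a sketch; pg_column_types is a hypothetical list of (column name, Postgres type name) pairs pulled from the catalog alongside the query):

# Hypothetical mapping from Postgres type names to NumPy dtypes
PG_TO_NUMPY = {'double precision': 'float64', 'real': 'float32'}
columns = [(name, PG_TO_NUMPY[pg_type]) for name, pg_type in pg_column_types]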
This approach works, but DataFrame construction is 2-3 times slower (for 2M-row DataFrames it takes several seconds), because np.array generation is for some reason very slow. In real life I have 10-200 columns, mostly float32.
What is the fastest way to construct a DataFrame with specified column types?
CodePudding user response:
If you already know the column names and their types, you can set them when building the frame. Note that the DataFrame constructor's dtype argument only accepts a single dtype for all columns, so per-column types are applied afterwards with astype:
df = pd.DataFrame(data, columns=column_list).astype({'a': 'float64', 'b': 'float32'})
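If the slow part is materializing the intermediate structured array (as in the question), another option is to type each column individually and hand pandas a dict of already-typed arrays; a minimal sketch, assuming rows arrive as tuples and the (name, dtype) pairs from the question:

import numpy as np
import pandas as pd

# Example data in the question's shape: row tuples plus (name, dtype) pairs
data = [(0.16275345863180396, 0.16275346), (0.6356328878675244, 0.6356329)]
columns = [('a', 'float64'), ('b', 'float32')]

# Transpose the rows once with zip(*data), then build each column as a
# typed NumPy array; this skips the per-record parsing done by
# np.array(data, dtype=columns)
arrays = {name: np.asarray(col, dtype=dt)
          for (name, dt), col in zip(columns, zip(*data))}
df = pd.DataFrame(arrays)
print(df.dtypes)  # a: float64, b: float32

Since each array already has its final dtype, pandas should not need an extra conversion pass when assembling the frame.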
CodePudding user response:
Try connecting to the PostgreSQL database and reading directly into a pandas DataFrame, in case you haven't tried this yet:
import pandas as pd
import psycopg2 as pg

# Open a connection and let read_sql build the DataFrame straight from the query
connection = pg.connect("dbname='dbname' user='pguser' host='127.0.0.1' port='15432' password='password'")
df = pd.read_sql('select * from table', connection)
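One caveat: read_sql infers dtypes from the Python values psycopg2 returns, so real columns still come back as float64. If memory is the concern, they can be downcast after loading (a sketch, reusing the column name from the question):

df = df.astype({'b': 'float32'})  # downcast the former real column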