Pandas: declare dtypes before loading data


I have a problem with RAM usage: I fetch quite a lot of data from the database and pour it into a pandas DataFrame, where I do a groupby to list, something the database is not very good at.

The thing is, since I fetch around 40 columns, pandas is not very good at determining the dtypes for each column. I would love to specify the dtype for each column separately, so pandas does not use so much memory by falling back to the object dtype everywhere. I know I can transform the DataFrame afterwards, but that does not solve the RAM overreach.

        import pandas as pd
        import numpy as np

        # Just a sample SQL query
        sql = "select premise_id, parent_id, addr_ward FROM table;"

        # This is a list of tuples from the database
        rows = safe_call_db_read(db.conn, sql)
        logger.info("Db fetched dataframe")

        dtype = {
            'premise_id': np.int64,
            'parent_id': np.int64,
            'addr_ward': object
        }

        data_frame = pd.DataFrame(data=rows, dtype=dtype)

This fails, of course, because only one dtype is allowed as the parameter, throwing this:

TypeError: object of type 'type' has no len()

This SUCKS.

Is there some way of declaring the dtype for each column before actually loading the data, so that each column is stored optimally, saving me some RAM?

Maybe by creating an empty DataFrame, declaring the dtype for each column, and then appending the rows?

CodePudding user response:

You might want to try the pandas function read_sql_query to read the SQL query directly into a DataFrame; you can pass the dtype dict you created, exactly as you made it, as its dtype argument.

The only extra thing you need is to create a connection to your database beforehand, for example through sqlite3.
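
A minimal sketch of what that could look like, assuming an SQLite database file (here a hypothetical premises.db) and the table and columns from the question; note that the dtype argument of read_sql_query requires pandas >= 1.3:

    import sqlite3

    import numpy as np
    import pandas as pd

    # Hypothetical database file; swap in your real connection.
    conn = sqlite3.connect("premises.db")

    sql = "select premise_id, parent_id, addr_ward FROM table;"

    dtype = {
        'premise_id': np.int64,
        'parent_id': np.int64,
        'addr_ward': object
    }

    # read_sql_query accepts a per-column dtype mapping (pandas >= 1.3),
    # so each column lands in the DataFrame with the requested dtype.
    data_frame = pd.read_sql_query(sql, conn, dtype=dtype)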

CodePudding user response:

I would try pandas.DataFrame.from_records, which has a coerce_float option and whose documentation says it is useful for SQL result sets. As @maxxel_ pointed out, reading directly from the SQL database is easiest, because pandas can use the SQL definitions to get the datatypes, but it seems from your code that you have a subroutine with extra handling.

Here's an example copied from the documentation; it shows the dtype defined independently for each column:

>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
...                 dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
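
Applied to your case, a sketch along these lines might work. It assumes rows really is a list of plain tuples (the sample data below is a stand-in), and it does build an intermediate NumPy structured array, so there is a temporary copy:

    import numpy as np
    import pandas as pd

    # Stand-in for the tuples returned by safe_call_db_read().
    rows = [(1, 10, 'North'), (2, 10, 'South'), (3, 11, None)]

    # One (name, dtype) pair per column; 'O' keeps addr_ward as
    # Python objects, since it may hold strings or None.
    record_dtype = np.dtype([
        ('premise_id', 'i8'),
        ('parent_id', 'i8'),
        ('addr_ward', 'O'),
    ])

    records = np.array(rows, dtype=record_dtype)

    # from_records preserves the per-field dtypes, so no column
    # falls back to object unless it was declared that way.
    data_frame = pd.DataFrame.from_records(records)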