I have a Raspberry Pi 3B with 1 GB of RAM on which I am running a Telegram bot. The bot uses a database that I store in CSV format, containing about 100k rows and four columns:
- The first two are used for searching.
- The third is a result. These first three columns use about 20-30 MB of RAM, which is manageable.
- The last column is the real problem: it pushes RAM usage up to 180 MB, which is impossible for the RPi to handle. This column is also used for searching, but I only need it sometimes.
I started by loading the whole dataframe with read_csv at the start of the script and then letting the script poll, but as the database grows I have realized that this is too much for the RPi.
What do you think is the best way to do this? Thanks!
CodePudding user response:
Setting the dtype according to the data can reduce memory usage a lot. With read_csv you can directly set the dtype for each column:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
Example:
import pandas as pd

df = pd.read_csv(my_csv, dtype={0: 'int8', 1: 'float32', 2: 'category'}, header=None)
See the next section for some dtype examples (with an existing df). To change the dtype on an existing dataframe you can use astype on the respective column. Use df.info() to check the dataframe's memory usage before and after the change.
Some examples:
# Columns with integer values
# check first that the min & max of that column fit in the respective int range so you don't lose information
df['column_with_integers'] = df['column_with_integers'].astype('int8')
# Columns with float values
# mind potential loss of calculation precision, so don't go too low
df['column_with_floats'] = df['column_with_floats'].astype('float32')
# Columns with categorical values (strings)
# e.g. when the rows repeat the same strings
# like 'TeamRed', 'TeamBlue', 'TeamYellow' spread over 10k rows
df['Team_Name'] = df['Team_Name'].astype('category')
# Change boolean-like string columns to actual booleans
df['Yes_or_No'] = df['Yes_or_No'].map({'yes':True, 'no':False})
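To check the effect of those conversions, compare the memory usage before and after, as mentioned above (a small sketch; the file name is a placeholder and the column names are the ones from the examples):
import pandas as pd

df = pd.read_csv('database.csv')

# Memory usage with the default dtypes ('deep' also counts the strings inside object columns)
df.info(memory_usage='deep')

# Apply the conversions, then check again
df['column_with_integers'] = df['column_with_integers'].astype('int8')
df['Team_Name'] = df['Team_Name'].astype('category')
df.info(memory_usage='deep')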
Kudos to Medallion Data Science and his YouTube video Efficient Pandas Dataframes in Python - Make your code run fast with these tricks!, where I learned these tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets - Use efficient datatypes
CodePudding user response:
Not sure if this is a good idea in your case, but it might be worth trying.
The Dask package was designed to allow Pandas-like data analysis on dataframes that are too big to fit in memory (RAM), among other things. It does this by only loading chunks of the complete dataframe into memory at a time.
However, I'm not sure it was designed for machines like the Raspberry Pi (I'm not even sure there is a distribution for it).
The good thing is that Dask slots almost seamlessly into an existing Pandas script, so it might not be too much effort to try it out.
Here is a simple example of what that could look like (a minimal sketch; the file name and column names are just placeholders):
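import dask.dataframe as dd

# Read the CSV lazily; nothing is loaded into RAM yet
ddf = dd.read_csv('database.csv')

# Filter on the two search columns; only now are chunks read and processed,
# and only the matching rows come back as a regular Pandas dataframe
result = ddf[(ddf['key1'] == 'foo') & (ddf['key2'] == 'bar')].compute()
print(result)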
If you try it let me know if it works, I'm also interested.