Caching data: numpy vs pandas vs MySQL

I am currently processing time series data stored in h5 files, each file containing one hour of data.

In order to move towards real time processing, I would like to process time series data, one second at a time. The plan is to aggregate one second of data, process the data, clear the cache and repeat.

My first idea was to do this using numpy arrays or a pandas dataframe, but a colleague suggested caching the data in a MySQL database instead.
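
For reference, here is a rough sketch of the kind of per-second loop I have in mind, with the acquisition and processing steps replaced by placeholders:

import time
import numpy as np
import pandas as pd

def get_new_samples(n=1000):
    # Placeholder for the real acquisition code: simulate one second of samples
    now = time.time()
    return pd.DataFrame({'timestamp': now + np.linspace(0, 1, n, endpoint=False),
                         'value': np.random.randn(n)})

def process(df):
    # Placeholder for the actual numpy/pandas processing
    return df['value'].mean()

for _ in range(3):                 # in practice: while True
    second = get_new_samples()     # aggregate one second of data
    result = process(second)      # process it
    del second                     # clear the cache and repeat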

In order to benchmark the performance of each approach, I ran a simple timing exercise, trying to access 1,000 samples:

Method    Execution time
Pandas    1.36 µs
Numpy     790 ns
MySQL     552 ns

The code used to obtain these results is detailed below.

From this limited exercise, it looks like the MySQL approach is the winner. But since most of the processing relies on numpy and pandas functions anyway, I am not sure it would make much sense to cache the data in a database before writing it to a numpy array or a pandas dataframe.

So here's my question: apart from improved performance, what are the benefits of using a MySQL database to cache data?


Benchmark

import pandas as pd
import numpy as np
import mysql.connector
from timeit import timeit

Pandas dataframe:

df = pd.DataFrame()
df['test'] = np.arange(1,1000)
%timeit df['test']

This returns 1.36 µs ± 26.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Numpy array:

%timeit np.arange(1,1000)

This returns 790 ns ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

MySQL database:

cnx = mysql.connector.connect(user='root', password='',
                              host='127.0.0.1',
                              database='mydb')

try:
    cursor = cnx.cursor()
    cursor.execute("""
        select * from dummy_data
    """)
    %timeit result_mysql = [item[0] for item in cursor.fetchall()]
finally:
    cnx.close()

This yields 552 ns ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

CodePudding user response:

This answer has two parts: the timings themselves, and the actual approach.


Timings

The first thing that should stand out here is that there is no way that a list comprehension should beat the initialisation of a numpy array. This is immediately suspicious:

%timeit np.arange(1,1000)
790 ns ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit result_mysql = [item[0] for item in cursor.fetchall()]
552 ns ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

It's not impossible that Python might beat numpy on something like this, but the probability is vanishingly small (some regression in a new numpy release?).

The issue here is that cursor.fetchall() consumes the cursor's result set, which can only be read once. The first of the 1,000,000 timing loops is slow because it actually fetches the rows, but it exhausts the cursor, so the remaining 999,999 iterations are effectively result_mysql = [item[0] for item in ()].

This becomes more obvious when you run something like this:

import numpy as np

def create_array():
    a = np.arange(1, 1000)


def list_comp(tups):
    a = [item[0] for item in tups]


test = [(x,) for x in range(999)]

# Results

%timeit create_array()
2.12 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit list_comp(test)
61.7 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
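
To get a fairer number for MySQL, the query has to be re-executed inside the timed call so that every loop actually fetches rows. A minimal sketch, re-using the same dummy_data table and credentials from the question:

import mysql.connector
from timeit import timeit

cnx = mysql.connector.connect(user='root', password='',
                              host='127.0.0.1',
                              database='mydb')

def fetch_mysql():
    # Re-run the query on every call so that each timed loop does real work
    cursor = cnx.cursor()
    cursor.execute("select * from dummy_data")
    result = [item[0] for item in cursor.fetchall()]
    cursor.close()
    return result

# timeit.timeit instead of %timeit so the snippet also runs outside IPython
print(timeit(fetch_mysql, number=1000), "seconds for 1000 round trips")

cnx.close()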

Actual Approach

I don't really understand your use of a cache here. There are a couple of things that don't add up for me:

  1. You throw the intermediate data away - that's not a cache.
  2. You don't re-use any pre-allocated memory (less important in Python than in compiled languages, but it can still make a difference) - that's not a cache either (see the sketch after this list).
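
A cache in that second sense would look more like a pre-allocated buffer that is overwritten every second. A minimal sketch, with the fill and processing steps as placeholders:

import numpy as np

BUF_SIZE = 1000
buf = np.empty(BUF_SIZE)            # allocated once, re-used every second

def fill_buffer(buf):
    # Placeholder for the acquisition code: overwrite the buffer in place
    buf[:] = np.random.randn(BUF_SIZE)

for _ in range(3):                  # in practice: while True
    fill_buffer(buf)                # no fresh allocation per second
    result = buf.mean()             # the per-second processing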

Unless pulling in all the data at once and aggregating it afterwards would blow up your RAM, I don't see why you're processing it one second at a time. Both pandas and MySQL are better suited to bulk analysis that is then reduced down to per-second results. Even setting aside the flawed timings from your own investigation, MySQL might actually beat pandas in speed, especially if the data is too large to hold in memory.
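
For example, if an hour of samples does fit in memory, pandas can load it once and reduce it to per-second results in a single vectorised pass. A sketch, with the file name and column names assumed rather than taken from your actual schema:

import pandas as pd

# Load a whole hour of samples in one go (file name and columns are assumptions)
df = pd.read_hdf('one_hour_of_data.h5')
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# Reduce to one row per second in a single vectorised operation
per_second = (df.set_index('timestamp')
                .resample('1s')['value']
                .agg(['mean', 'std', 'count']))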

Bottom line - these speed tests are not suitable on their own to determine what's most appropriate for your actual application. Just don't use MySQL as interim storage on a per-second basis.
