Reading Redis Timeseries is slower than Pandas with CSV

Time:07-07

I'm using Redis Timeseries to read time-series data that was previously stored in a CSV file.

The problem: reading the data from the Redis server is far slower than reading the same data from the CSV with Pandas.

Here is an MWE that reproduces the issue. I generate random data consisting of a Unix timestamp and a number, then write the same data to both the CSV file and Redis so I can measure the READING time (I'm not concerned about writing in this scenario).

import csv
import random
import time
from datetime import datetime, timedelta
from random import randrange

import redis
import pandas as pd


def random_date(start, end):
    """Return a random datetime between start and end."""
    delta = end - start
    int_delta = delta.days * 24 * 60 * 60 + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)

with open('justcsv.csv', mode='w', newline='') as file:
    file_writer = csv.writer(
        file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    # Init Redis
    r = redis.Redis(host="localhost", port=6379)
    r.flushall()
    # Create the key
    r_tsname = "TESTKEY"
    label = { "label" : r_tsname}
    key_name = "TESTKEY1"
    r.ts().create(key_name, labels=label)

    # Init random timestamp between two datetime
    d1 = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')
    d2 = datetime.strptime('1/1/2022 4:50 AM', '%m/%d/%Y %I:%M %p')
    
    # For loop cycle
    for x in range(30000):
        dt = random_date(d1, d2)
        timestamp = int(dt.timestamp())
        random_number = round(random.uniform(1.5, 1000.9), 2)
        # write current row in CSV
        file_writer.writerow([timestamp, random_number])
        # write current row in REDIS
        r.ts().add(key_name, timestamp, random_number)
    
# READ data from CSV with Pandas and benchmark it
start_csv = time.time()
df = pd.read_csv('justcsv.csv') # benchmark
end_csv = time.time()
print("CSV READING TIME IS: " + str(end_csv - start_csv))

# READ data from Redis and benchmark
thelabel = "label=" + "TESTKEY"
mrange_filters = [ thelabel ]
start_redis = time.time()
full_range = r.ts().range("TESTKEY1", "-", "+") # benchmark
end_redis = time.time()
print("REDIS READING TIME IS: " + str(end_redis - start_redis))

Benchmark result:

10000 iterations - slower x2
CSV READING TIME IS: 0.0124
REDIS READING TIME IS: 0.052

20000 iterations - slower x4
CSV READING TIME IS: 0.025
REDIS READING TIME IS: 0.102

30000 iterations - slower x10
CSV READING TIME IS: 0.0139
REDIS READING TIME IS: 0.153

I used the latest Docker image from: https://hub.docker.com/r/redislabs/redistimeseries

My observations:

  1. From what I understood, Redis should be extremely fast at this task, not least because it provides a built-in data structure for dealing with timestamps;
  2. Redis should also be faster because it serves data from memory, whereas the CSV file is read from disk;
  3. the time gap with respect to CSV rapidly increases as the data size grows;
  4. even querying the data via redis-cli doesn't change the elapsed time.

My questions:

  1. Why is Redis slower (and so slow)?
  2. Am I missing something?
  3. Is there a way to fix this?

CodePudding user response:

This is not a fair comparison!

  • For Pandas, you are:

    • writing a CSV (text file)
    • reading the CSV back (text file)
  • For RedisTimeSeries, you are

    • adding the samples using TS.ADD. Redis needs to create the time series / find the existing time series, convert the data from its textual representation to a binary representation, find where to add the sample within the time series (Redis cannot assume that the timestamps arrive in order), compress it, index it, and store it on disk if persistence is enabled.
    • Then, when you load the data, Redis needs to: parse your TS.RANGE query, find the time series based on its key, locate the data within the time series, decompress it, and convert it from binary representation back to textual representation (a costly operation). Much more work!

    On top of that, there is the client-server communication, of course (TCP packetization, RESP encoding/decoding, etc.)!
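If the write side ever becomes a concern too, the per-sample round-trip of TS.ADD can be amortized with TS.MADD, which accepts many (key, timestamp, value) tuples per command. A minimal sketch, assuming a local Redis with the RedisTimeSeries module loaded; the key name, batch size, and timestamp base are illustrative:

```python
import time
from random import uniform


def chunks(seq, n):
    """Yield successive n-sized slices of seq."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def batched_write(r, key, samples, batch=1000):
    """Send samples via TS.MADD: one round-trip per batch
    instead of one TS.ADD round-trip per sample."""
    for part in chunks(samples, batch):
        r.ts().madd([(key, ts, val) for ts, val in part])


if __name__ == "__main__":
    # redis-py is imported here so the pure helpers above
    # can be used/tested without a Redis server available.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.flushall()
    r.ts().create("TESTKEY1", labels={"label": "TESTKEY"})

    # 30k monotonic timestamps, like the MWE but pre-built client-side.
    samples = [(1_200_000_000 + i, round(uniform(1.5, 1000.9), 2))
               for i in range(30_000)]

    start = time.time()
    batched_write(r, "TESTKEY1", samples)
    print("batched write time:", time.time() - start)

    # The read is already a single round-trip, but Redis still pays
    # for decompressing and RESP-encoding every sample.
    start = time.time()
    full_range = r.ts().range("TESTKEY1", "-", "+")
    print("read time:", time.time() - start, "samples:", len(full_range))
```

This doesn't make TS.RANGE beat `pd.read_csv` on a cold local file, but it does remove most of the per-sample command overhead from the ingest path.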
