Why is Polars datetime subsetting slow?


I'm looking for a faster way to subset a Polars DataFrame by datetime. I tried two different approaches, and I believe there must be a faster way, since the same subsetting was faster in pandas:

import datetime

start_date = datetime.datetime(2009, 8, 3, 0, 0)
end_date = datetime.datetime(2009, 11, 3, 0, 0)


# - 1st
%%timeit
df_window = df_stock.filter(
    (df_stock["date_time"] >= start_date)
    & (df_stock["date_time"] <= end_date)
)
# 2.61 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# - 2nd
%%timeit
df_window = df_stock.filter(
    (pl.col("date_time") >= start_date) & (pl.col("date_time") <= end_date)
)
# 12.5 ms ± 801 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas subsetting by datetime index is faster:

df_test = df_stock.to_pandas()
df_test.set_index('date_time', inplace=True)

%%timeit
df_test['2009-8-3':'2009-8-3']
# 1.02 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So why is pandas faster than Polars in this case? Is there a faster way of subsetting by datetime in Polars?

Environment:

  • Spyder version: 5.4.0 (conda)
  • Python version: 3.8.13 64-bit
  • Operating System: Linux 5.15.79.1-microsoft-standard-WSL2

CodePudding user response:

Could this just be an issue of benchmarking a relatively small query? My benchmarking shows the exact opposite.

Data

Let's start with a good amount of data, so that the subsetting process represents the largest share of the work. I'll also shuffle the data, so that issues of "sortedness" are not involved.

import time
import polars as pl
from datetime import datetime

df_stock = pl.DataFrame(
    {
        "date_time": pl.date_range(
            low=datetime(2000, 1, 1, 0, 0),
            high=datetime(2023, 1, 1, 0, 0),
            interval="1s",
        ).shuffle(seed=0)
    }
)
df_stock
shape: (725846401, 1)
┌─────────────────────┐
│ date_time           │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2022-01-02 21:46:35 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2003-12-16 18:57:19 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2018-06-20 14:31:08 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2020-04-09 07:18:38 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2013-07-12 00:39:09 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-09-08 17:18:59 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-08-18 16:00:18 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-02-06 05:13:05 │
└─────────────────────┘

725 million records should be a good start.

Test Cases

I've put together the three cases from your code into separate tests.

Test #1

start_date = datetime(2009, 8, 3, 0, 0)
end_date = datetime(2009, 11, 3, 0, 0)

start = time.perf_counter()
_ = df_stock.filter(
    (df_stock["date_time"] >= start_date) & (df_stock["date_time"] <= end_date)
)
print(time.perf_counter() - start)

Test #2

start = time.perf_counter()
_ = df_stock.filter(
    (pl.col("date_time") >= start_date) & (pl.col("date_time") <= end_date)
)
print(time.perf_counter() - start)

Test #3

df_test = df_stock.to_pandas()
df_test.set_index("date_time", inplace=True)
start = time.perf_counter()
_ = df_test["2009-8-3":"2009-11-3"]
print(time.perf_counter() - start)

Here are the timings for three runs of the above on my system (a 32-core system).

Test #1: 1.1206, 1.1199, 1.1202

Test #2: 0.8342, 0.8215, 0.8227

Test #3: 4.6082, 4.5932, 4.6185

On my system, the Pandas indexing runs the slowest, by far. I could run this more than three times, but the pattern is quite clear.

Can you run a larger test on your system? Perhaps there is something else at play here...

For reference, version information on my test environment ...

---Version info---
Polars: 0.15.7
Index type: UInt32
Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Python: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.24.0
fsspec: 2022.11.0
connectorx: 0.3.1
xlsx2csv: 0.8
matplotlib: 3.6.2