Efficient way to update a single element of a Polars DataFrame?

Time:01-03

Polars DataFrame does not currently provide a method to update the value of a single cell. Instead, we have to use the method DataFrame.apply or DataFrame.apply_at_idx, which updates a whole column / Series. This can be very expensive when an algorithm repeatedly updates a few elements of some columns. Why is DataFrame designed this way? Looking into the code, it seems to me that Series does provide inner mutability via the method Series._get_inner_mut?

Answer:

As of polars >= 0.15.9, mutating an element of any data backed by numbers (numeric data, dates, and durations) has constant complexity, O(1), as long as the data is not shared.

If the data is shared, we must first copy it so that we become the sole owner.

import polars as pl
import matplotlib.pyplot as plt
from time import perf_counter

ns = []           # series lengths
ts = []           # write time when we are the only owner
ts_shared = []    # write time when the memory is shared
clone_times = []  # time taken by clone()

for n in [1_000, 100_000, 1_000_000, 10_000_000, 100_000_000]:
    s = pl.zeros(n)
    ns.append(n)

    # we are the only owner,
    # so the mutation happens in place
    t0 = perf_counter()
    s[10] = 10
    ts.append(perf_counter() - t0)

    # clone is free: it does not copy the data
    t0 = perf_counter()
    s2 = s.clone()
    clone_times.append(perf_counter() - t0)

    # now there are two owners of the memory;
    # we write to it, so all the data must be copied first
    t0 = perf_counter()
    s2[11] = 11
    ts_shared.append(perf_counter() - t0)

plt.plot(ns, ts_shared, label="writing to shared memory")
plt.plot(ns, ts, label="writing to owned memory")
plt.plot(ns, clone_times, label="clone time")
plt.xlabel("series length")
plt.ylabel("time (s)")
plt.legend()
plt.show()

In Rust this dispatches to set_at_idx2, which has not been released yet. Note that when you use the lazy engine, all of this is done implicitly for you.
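
The ownership rule described in this answer (write in place when you are the only owner, copy first when the memory is shared) can be illustrated with a small pure-Python toy model. This is only a sketch of the idea, not how Polars is implemented: the names SharedBuffer, CowSeries and shares_memory_with are hypothetical and exist only for this illustration.

```python
from array import array


class SharedBuffer:
    """A block of float64 values plus a count of the handles that point at it."""

    def __init__(self, values):
        self.data = array("d", values)
        self.owners = 1


class CowSeries:
    """Hypothetical copy-on-write handle (toy model, not the Polars API)."""

    def __init__(self, values=None, _buffer=None):
        self._buf = _buffer if _buffer is not None else SharedBuffer(values)

    def clone(self):
        # O(1): share the existing buffer and record the extra owner.
        self._buf.owners += 1
        return CowSeries(_buffer=self._buf)

    def __getitem__(self, idx):
        return self._buf.data[idx]

    def __setitem__(self, idx, value):
        if self._buf.owners > 1:
            # Shared memory: give up our share and copy all the data (O(n)),
            # so that this handle becomes the sole owner of the new buffer.
            self._buf.owners -= 1
            self._buf = SharedBuffer(self._buf.data)
        # Sole owner: a plain in-place write (O(1)).
        self._buf.data[idx] = value

    def shares_memory_with(self, other):
        return self._buf is other._buf


s = CowSeries([0.0] * 1000)
s[10] = 10.0                # only one owner: written in place

s2 = s.clone()              # cheap: both handles point at the same buffer
shared_before_write = s.shares_memory_with(s2)

s2[11] = 11.0               # shared: s2 copies the data before writing
shared_after_write = s.shares_memory_with(s2)
```

After the last write, s and s2 no longer share a buffer, and the value written through s2 is not visible through s. That is the same behaviour the benchmark above measures as the extra cost of writing to shared memory.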
