Polars DataFrame does not currently provide a method to update the value of a single cell. Instead, we have to use the method DataFrame.apply or DataFrame.apply_at_idx, which updates a whole column / Series. This can be very expensive in situations where an algorithm repeatedly updates a few elements of some columns. Why is DataFrame designed this way? Looking into the code, it seems to me that Series does provide inner mutability via the method Series._get_inner_mut?
CodePudding user response:
As of polars >= 0.15.9, mutating any data backed by numbers (that is, numeric data, dates, and durations) has constant complexity O(1) if the data is not shared.
If the data is shared, we must first copy it so that we become the sole owner.
import polars as pl
import matplotlib.pyplot as plt
from time import time

ts = []
ts_shared = []
clone_times = []
ns = []

for n in [1e3, 1e5, 1e6, 1e7, 1e8]:
    s = pl.zeros(int(n))

    t0 = time()
    # we are the only owner
    # so mutation is inplace
    s[10] = 10
    # time
    t = time() - t0
    # store datapoints
    ts.append(t)
    ns.append(n)

    # clone is free
    t0 = time()
    s2 = s.clone()
    t = time() - t0
    clone_times.append(t)

    # now there are two owners of the memory
    # we write to it so we must copy all the data first
    t0 = time()
    s2[11] = 11
    t = time() - t0
    ts_shared.append(t)

plt.plot(ns, ts_shared, label="writing to shared memory")
plt.plot(ns, ts, label="writing to owned memory")
plt.plot(ns, clone_times, label="clone time")
plt.legend()
In Rust, this dispatches to set_at_idx2, but that is not released yet. Note that when using the lazy engine, this is all done implicitly for you.
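To make the ownership rule above concrete, here is a toy model of copy-on-write in plain Python. It is only a sketch of the idea (cheap clone, in-place write when there is a single owner, full copy before writing when the storage is shared) and not how polars implements it. The class CowBuffer and its methods are hypothetical names invented for this illustration, and unlike a real reference count, the owner counter here is not decremented when an object is garbage collected.

```python
class CowBuffer:
    """Toy copy-on-write buffer (hypothetical; not the polars implementation)."""

    def __init__(self, values):
        # _data is the (possibly shared) storage.
        # _owners is a one-element list used as a shared owner counter.
        self._data = list(values)
        self._owners = [1]

    def clone(self):
        # O(1): share the storage and bump the owner count; nothing is copied.
        other = CowBuffer.__new__(CowBuffer)
        other._data = self._data
        other._owners = self._owners
        self._owners[0] += 1
        return other

    def __setitem__(self, idx, value):
        if self._owners[0] > 1:
            # Storage is shared: detach by copying every value first, O(n).
            self._owners[0] -= 1
            self._data = list(self._data)
            self._owners = [1]
        # We are the sole owner here, so the write is in place, O(1).
        self._data[idx] = value

    def __getitem__(self, idx):
        return self._data[idx]

    def is_shared_with(self, other):
        # True when both buffers point at the same underlying storage.
        return self._data is other._data


s = CowBuffer([0.0] * 1000)
s[10] = 10.0                           # sole owner: in-place write
s2 = s.clone()                         # cheap: storage is now shared
shared_before = s.is_shared_with(s2)
s2[11] = 11.0                          # shared: s2 copies first, then writes
shared_after = s.is_shared_with(s2)
```

After the write to s2, the two buffers no longer share storage, so s keeps its original value at index 11 and later writes to either buffer are in place again.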