I came across a rather puzzling result: pandas `at` assignment is almost 10 times slower than appending to a list and converting it to a DataFrame. Why is this? I understand that `at` deals with a somewhat more complicated structure, but a 10x slowdown seems extreme; it makes the clumsy list-append approach the more efficient option. I have seen many posts about the slowness of `loc`, but `loc` does a lot more than `at`, so in a way one can accept its sluggishness.
import pandas as pd
import numpy as np
import time
dt = pd.date_range('2010-01-01','2021-01-01', freq='H')
s = pd.Series(data = np.random.randint(0,1000, len(dt)), index = np.random.choice(dt, size=len(dt), replace=False))
d = pd.Series(data = np.nan, index=dt)
t = time.time()
for i in s.iteritems():   # note: iteritems() was removed in pandas 2.0; items() is its replacement
    d.at[i[0]] = np.random.randint(10)   # element-wise label assignment via .at
print(time.time()-t)
m = []
t = time.time()
for i in s.iteritems():
    m.append((i[0], np.random.randint(10)))
m = pd.DataFrame(m)
m.columns = ['date_time', 'data']
m.set_index('date_time', inplace=True)
print(time.time()-t)
Output:
4.053529500961304
0.3882014751434326
CodePudding user response:
Well, there is no need to use `at` here, and pandas isn't built for fast element-wise writes. `list.append` always appends at the end of the list, whereas `at` first has to locate the target label. You are also doing label-indexed assignment on every iteration, so base Python data structures will certainly be faster.
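To make the "base Python" point concrete, here is a quick sketch (not from the original answer; timings will vary) that does the indexed assignment into a plain dict and converts once at the end:

import pandas as pd
import numpy as np
import time

dt = pd.date_range('2010-01-01', '2021-01-01', freq='H')

t = time.time()
out = {}                             # plain dict: __setitem__ runs at the C level
for ts in dt:
    out[ts] = np.random.randint(10)  # same per-element work, no pandas indexing machinery
d = pd.Series(out)                   # one conversion at the end
print(time.time() - t)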
The fastest would be bare NumPy here:
import pandas as pd
import numpy as np
import time
dt = pd.date_range('2010-01-01','2021-01-01', freq='H')
s = pd.Series(data = np.random.randint(0,1000, len(dt)), index = np.random.choice(dt, size=len(dt), replace=False))
d = pd.Series(data = np.nan, index=dt)
t = time.time()
d[:] = np.random.randint(0, 10, len(dt))
print(d)
print(time.time()-t)
m = []
t = time.time()
for i in s.iteritems():
    m.append((i[0], np.random.randint(10)))
m = pd.DataFrame(m)
m.columns = ['date_time', 'load']
m.set_index('date_time', inplace=True)
print(m)
print(time.time()-t)
Output:
2010-01-01 00:00:00 5
2010-01-01 01:00:00 5
2010-01-01 02:00:00 6
2010-01-01 03:00:00 6
2010-01-01 04:00:00 8
..
2020-12-31 20:00:00 7
2020-12-31 21:00:00 6
2020-12-31 22:00:00 4
2020-12-31 23:00:00 2
2021-01-01 00:00:00 1
Freq: H, Length: 96433, dtype: int32
0.03117227554321289
load
date_time
2018-04-14 04:00:00 6
2014-03-03 23:00:00 2
2010-03-11 20:00:00 1
2017-06-27 16:00:00 2
2020-11-08 11:00:00 6
... ...
2016-12-03 08:00:00 9
2020-01-12 16:00:00 5
2014-08-08 10:00:00 5
2012-05-23 04:00:00 8
2010-11-05 06:00:00 2
[96433 rows x 1 columns]
0.4729738235473633
As you can see, NumPy is roughly 10 times faster than the list approach even for this small dataset. NumPy is implemented in large part in C, which is why it is so quick. I printed both results just to show that the outputs are equivalent.
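If you also need the shuffled-index result from the second loop, the same vectorized idea applies. A sketch (the column name `load` is taken from the output above):

import pandas as pd
import numpy as np

dt = pd.date_range('2010-01-01', '2021-01-01', freq='H')
idx = np.random.choice(dt, size=len(dt), replace=False)  # shuffled timestamps, as in the question
# one vectorized draw replaces the whole append loop
m = pd.DataFrame({'load': np.random.randint(0, 10, len(dt))},
                 index=pd.DatetimeIndex(idx, name='date_time'))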
CodePudding user response:
`pandas` will always lose to a `list` for these sorts of simple operations. Much of `pandas` is written in Python and executes a ton of code at the interpreter level; in contrast, `list.append` is written entirely in C: you just have to resolve the method and you are into the C layer.
Both operations are constant time
Note, both operations are (amortized) constant time, but the constant factors for `pandas.DataFrame.at.__setitem__` are much higher than those for `list.append`. Consider the following timings:
from timeit import timeit
import pandas as pd, matplotlib.pyplot as plt
setup = """
import pandas as pd
d = pd.Series(float('nan'), index=range({}))
def use_append(s):
m = []
for i,v in s.iteritems():
m.append((i, 42))
return pd.DataFrame(m)
def use_at(s):
for i,v in s.iteritems():
s[i] = 42
return pd.DataFrame(s)
"""
N = range(0, 50_000, 1000)
appends = [timeit("use_append(d)", setup.format(n), number=5) for n in N]
ats = [timeit("use_at(d)", setup.format(n), number=5) for n in N]
pd.DataFrame(dict(appends=appends, ats=ats), index=N).plot()
plt.savefig("list_append-vs-df_at.png")
As the plot saved above shows, the overall cost grows linearly with the number of elements in both cases; only the constant factors (the slopes) differ.
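You can also measure the per-call constants directly. A rough sketch (absolute numbers are machine-dependent; the point is the ratio):

from timeit import timeit
import pandas as pd

s = pd.Series(0.0, index=range(10_000))
m = []

# one .at assignment vs. one list.append per call
t_at = timeit("s.at[500] = 42", globals=globals(), number=100_000)
t_append = timeit("m.append((500, 42))", globals=globals(), number=100_000)
print(f".at per call:    {t_at / 100_000:.2e} s")
print(f"append per call: {t_append / 100_000:.2e} s")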
`list.append` implementation
To reiterate, `list.append` is much faster because it is entirely implemented in C. Here is the source code:
static int
app1(PyListObject *self, PyObject *v)
{
    Py_ssize_t n = PyList_GET_SIZE(self);

    assert (v != NULL);
    if (n == PY_SSIZE_T_MAX) {
        PyErr_SetString(PyExc_OverflowError,
            "cannot add more objects to list");
        return -1;
    }

    if (list_resize(self, n + 1) < 0)
        return -1;

    Py_INCREF(v);
    PyList_SET_ITEM(self, n, v);
    return 0;
}

int
PyList_Append(PyObject *op, PyObject *newitem)
{
    if (PyList_Check(op) && (newitem != NULL))
        return app1((PyListObject *)op, newitem);
    PyErr_BadInternalCall();
    return -1;
}
As you can see, there is a thin wrapper function, and the main function, `app1`, executes the classic dynamic-array append: resize the internal buffer if necessary, then set the item at the corresponding index of the internal array (see here for the `PyList_SET_ITEM` macro definition, which, as you will see, just assigns into the array at a given index).
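As a rough Python analogue of what `app1` does (a sketch of the pattern only; CPython's actual growth policy over-allocates by about 12.5%, not doubling):

class ToyList:
    """Toy dynamic array mirroring app1's resize-then-set pattern."""
    def __init__(self):
        self._buf = [None] * 4   # over-allocated internal buffer
        self._n = 0              # number of slots actually in use

    def append(self, v):
        if self._n == len(self._buf):                        # buffer full: grow it, like list_resize
            self._buf = self._buf + [None] * len(self._buf)  # crude doubling for illustration
        self._buf[self._n] = v                               # set slot n, like PyList_SET_ITEM
        self._n += 1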
Pandas `at` implementation
Alternatively, consider *just a little bit* of what happens when you do `some_series.at[key] = value`.
First, `some_series.at` creates a `pandas.core.indexing._AtIndexer` object; then, on that `_AtIndexer` object, `__setitem__` is called:
def __setitem__(self, key, value):
    if self.ndim == 2 and not self._axes_are_unique:
        # GH#33041 fall back to .loc
        if not isinstance(key, tuple) or not all(is_scalar(x) for x in key):
            raise ValueError("Invalid call for scalar access (setting)!")
        self.obj.loc[key] = value
        return

    return super().__setitem__(key, value)
As we see here, it actually falls back to `loc` if the axes aren't unique. But suppose they are: it then makes an expensive call to `super().__setitem__(key, value)`, taking us to `_ScalarAccessIndexer.__setitem__`:
def __setitem__(self, key, value):
    if isinstance(key, tuple):
        key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    else:
        # scalar callable may return tuple
        key = com.apply_if_callable(key, self.obj)
    if not isinstance(key, tuple):
        key = _tuplify(self.ndim, key)
    key = list(self._convert_key(key))
    if len(key) != self.ndim:
        raise ValueError("Not enough indexers for scalar access (setting)!")

    self.obj._set_value(*key, value=value, takeable=self._takeable)
Here, it handles the special case where the key is a callable (i.e. a function), does some basic bookkeeping on the type of the key, and checks its length. If all goes well, it delegates to `self.obj._set_value(*key, value=value, takeable=self._takeable)`, yet another Python function call. For completeness' sake, let's look at `pd.Series._set_value`:
def _set_value(self, label, value, takeable: bool = False):
    """
    Quickly set single value at passed label.

    If label is not contained, a new object is created with the label
    placed at the end of the result index.

    Parameters
    ----------
    label : object
        Partial indexing with MultiIndex not allowed.
    value : object
        Scalar value.
    takeable : interpret the index as indexers, default False
    """
    if not takeable:
        try:
            loc = self.index.get_loc(label)
        except KeyError:
            # set using a non-recursive method
            self.loc[label] = value
            return
    else:
        loc = label

    self._set_values(loc, value)
Again, more bookkeeping. Note that if we actually were to grow the pandas object (set a label that doesn't exist yet), it falls back to `loc`...
def _set_values(self, key, value) -> None:
    if isinstance(key, (Index, Series)):
        key = key._values

    self._mgr = self._mgr.setitem(indexer=key, value=value)
    self._maybe_update_cacher()
But if the size doesn't change, it finally delegates to the block manager, the internal guts of pandas objects: `self._mgr.setitem(indexer=key, value=value)`... and even that isn't the end of the call chain! But I think you get the idea...
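To see the two paths of `_set_value` from the outside, here is a small sketch (behavior as of recent pandas versions):

import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'b'])
s.at['a'] = 42.0   # label exists: index.get_loc succeeds, in-place _set_values path
s.at['c'] = 99.0   # label missing: KeyError is caught, falls back to the slower .loc path and grows s
print(s)           # a    42.0, b    2.0, c    99.0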