I have a df that I've read from sql:
id stock_id symbol date open high low close volume
0 1 35 ABSI 2022-09-28 3.06 3.33 3.0400 3.27 217040
1 2 35 ABSI 2022-09-29 3.19 3.19 3.0300 3.12 187309
2 3 35 ABSI 2022-09-30 3.11 3.27 3.0700 3.13 196566
3 4 35 ABSI 2022-10-03 3.16 3.16 2.8600 2.97 310441
4 5 35 ABSI 2022-10-04 3.04 3.37 2.9600 3.27 361082
.. ... ... ... ... ... ... ... ... ...
383 384 16 VVI 2022-10-03 31.93 33.85 31.3050 33.60 151357
384 385 16 VVI 2022-10-04 34.41 35.46 34.1900 35.39 105773
385 386 16 VVI 2022-10-05 34.67 35.30 34.5000 34.86 59605
386 387 16 VVI 2022-10-06 34.80 35.14 34.3850 34.50 55323
387 388 16 VVI 2022-10-07 33.99 33.99 33.3409 33.70 45187
[388 rows x 9 columns]
I then try and get the average of the last 5 days and add it to a new column:
df['volume_5_day'] = df.groupby('stock_id')['volume'].rolling(5).mean()
Which gives me the following error:
Traceback (most recent call last):
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 11003, in _reindex_for_setitem
reindexed_value = value.reindex(index)._values
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/series.py", line 4672, in reindex
return super().reindex(**kwargs)
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 4966, in reindex
return self._reindex_axes(
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 4981, in _reindex_axes
new_index, indexer = ax.reindex(
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 4237, in reindex
target = self._wrap_reindex_result(target, indexer, preserve_names)
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 2520, in _wrap_reindex_result
target = MultiIndex.from_tuples(target)
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 204, in new_meth
return meth(self_or_cls, *args, **kwargs)
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/indexes/multi.py", line 559, in from_tuples
arrays = list(lib.tuples_to_object_array(tuples).T)
File "pandas/_libs/lib.pyx", line 2930, in pandas._libs.lib.tuples_to_object_array
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/dan/Documents/code/wolfhound/add_indicators_daily.py", line 10, in <module>
df['volume_10_day'] = df.groupby('stock_id')['volume'].rolling(1).mean()
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3655, in __setitem__
self._set_item(key, value)
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3832, in _set_item
value = self._sanitize_column(value)
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 4535, in _sanitize_column
return _reindex_for_setitem(value, self.index)
File "/home/dan/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 11010, in _reindex_for_setitem
raise TypeError(
TypeError: incompatible index of inserted column with frame index
Any ideas what's going wrong here? Previously this worked and now it's throwing an error - and I can't seem to figure out why
CodePudding user response:
Chain Series.to_numpy
to add the values as a np.array
and make sure to add sort=False
inside df.groupby
:
df['volume_5_day'] = df.groupby('stock_id', sort=False)['volume']\
.rolling(5).mean().to_numpy()
print(df)
id stock_id symbol date ... low close volume volume_5_day
0 1 35 ABSI 2022-09-28 ... 3.0400 3.27 217040 NaN
1 2 35 ABSI 2022-09-29 ... 3.0300 3.12 187309 NaN
2 3 35 ABSI 2022-09-30 ... 3.0700 3.13 196566 NaN
3 4 35 ABSI 2022-10-03 ... 2.8600 2.97 310441 NaN
4 5 35 ABSI 2022-10-04 ... 2.9600 3.27 361082 254487.6
383 384 16 VVI 2022-10-03 ... 31.3050 33.60 151357 NaN
384 385 16 VVI 2022-10-04 ... 34.1900 35.39 105773 NaN
385 386 16 VVI 2022-10-05 ... 34.5000 34.86 59605 NaN
386 387 16 VVI 2022-10-06 ... 34.3850 34.50 55323 NaN
387 388 16 VVI 2022-10-07 ... 33.3409 33.70 45187 83449.0
Your initial approach fails, because the df.groupby
method that you are using, returns a pd.Series
with a different index than your df
. E.g.:
print(df.groupby('stock_id')['volume'].rolling(5).mean().index)
MultiIndex([(16, 383),
(16, 384),
(16, 385),
(16, 386),
(16, 387),
(35, 0),
(35, 1),
(35, 2),
(35, 3),
(35, 4)],
names=['stock_id', None])
So, it is saying it is unable to map this onto:
print(df.index)
Int64Index([0, 1, 2, 3, 4, 383, 384, 385, 386, 387], dtype='int64')
With a np.array
you don't have this problem. You could also have used:
df['volume_5_day'] = df.groupby('stock_id', as_index=False)['volume']\
.rolling(5).mean()['volume']
In this case, you don't need to add sort=False
, as it will match correctly on the index values.