This is an observation from "Most pythonic way to concatenate pandas cells with conditions". I am not able to understand why the third solution takes so much more memory than the first one.
If I don't sample, the third solution does not raise any error, so clearly something is weird.
To emulate a large dataframe I tried to resample, but I never expected to run into this kind of error.
Background
Pretty self explanatory, one line, looks pythonic
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import numpy as np
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df
Speeds
%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I don't sample, there is no error and the outputs match exactly.
Error (updated; only happens when I sample from the dataframe)
%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64
For a smaller input (sample size 100) we get a different error, one that points to a problem with mismatched sizes. But what is going on with the memory allocation and the sampling?
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-57c5b10090b2> in <module>
1 df['final_target'] = df['city']
----> 2 df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
~/anaconda3/lib/python3.8/site-packages/pandas/core/ops/methods.py in f(self, other)
99 # we are updating inplace so we want to ignore is_copy
100 self._update_inplace(
--> 101 result.reindex_like(self, copy=False), verify_is_copy=False
102 )
103
I reran them from scratch each time.
Update
This is part of what I figured out:
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df
city arr final_target
1 paris 12 paris_12
0 paris 11 paris_11
2 dallas 22 dallas
2 dallas 22 dallas
3 miami 15 miami
3 miami 15 miami
2 dallas 22 dallas
1 paris 12 paris_12
0 paris 11 paris_11
3 miami 15 miami
Indices are repeated when sampling with replacement.
So resetting the index resolves the problem, even though df.arr and the df.loc selection still have different sizes. Alternatively, replacing the right-hand side with
df.loc[df['city'] == 'paris', 'arr'].astype(str)
will solve it, just as 2e0byo pointed out. Still, can someone explain how .loc works, and why memory explodes when the indices have duplicates in them and don't match?
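A minimal sketch of the two observations above. The fixed random_state and the column name 'out' are my additions for illustration, not part of the original code. It checks that the sampled index has duplicates, then resets the index and applies the same mask on both sides of the assignment:

```python
import io

import pandas as pd

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""

# Drawing 10 rows from 5 with replacement must repeat at least one index label.
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True, random_state=0)
had_duplicates = not df.index.is_unique

# Give every row its own label again.
df = df.reset_index(drop=True)

# Use the same boolean mask on both sides so the two Series line up row for row.
mask = df['city'] == 'paris'
df['out'] = df['city']  # 'out' is a hypothetical column name
df.loc[mask, 'out'] = df.loc[mask, 'city'] + '_' + df.loc[mask, 'arr'].astype(str)

matches = bool((df['out'] == df['final_target']).all())
```

With a unique index and matching masks there is nothing for pandas to align, so no large intermediate arrays are needed.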
CodePudding user response:
@2e0byo hit the nail on the head saying pandas' algorithm is "inefficient" in this case.
As far as .loc goes, it's not really doing anything remarkable. Its use here is analogous to indexing a numpy array with a boolean array of the same shape, with an added dict-key-like access to a specific column. That is, df['city'] == 'paris' is itself a boolean Series, with the same number of rows and the same index as df. df.loc[df['city'] == 'paris'] then gives a dataframe consisting of only the rows that are True in df['city'] == 'paris' (those that have 'paris' in the 'city' column). Adding the additional argument 'final_target' then just returns only the 'final_target' column of those rows, instead of all three (and because it only has one column, it's technically a Series object; the same goes for df['arr']).
The memory explosion happens when pandas actually tries to add the two Series. As @2e0byo pointed out, it has to reshape the Series to do this, and it does this by calling the first Series' align() method. During the align operation, the function pandas.core.reshape.merge.get_join_indexers() calls pandas._libs.join.full_outer_join() (line 155) with three arguments: left, right, and max_groups (point of clarification: these are their names inside the function full_outer_join). left and right are integer arrays containing the indexes of the two Series objects (the values in the index column), and max_groups is the maximum number of unique elements in either left or right (in our case, that's five, corresponding to the five original rows in s).
full_outer_join immediately turns and calls pandas._libs.algos.groupsort_indexer() (line 194), once with left and max_groups as arguments and once with right and max_groups. groupsort_indexer returns two arrays, generically indexer and counts (for the invocation with left, these are called left_sorter and left_count, and correspondingly for right). counts has length max_groups + 1, and each element (excepting the first one, which is unused) contains the count of how many times the corresponding index group appears in the input array. So for our case, with max_groups = 5, the count arrays have shape (6,), and elements 1-5 represent the number of times the 5 unique index values appear in left and right.
The other array, indexer, is constructed so that indexing the original input array with it returns all the elements grouped in ascending order - hence "sorter." After having done this for both left and right, full_outer_join chops up the two sorters and strings them up across from each other. full_outer_join returns two arrays of the same size, left_idx and right_idx - these are the arrays that get really big and throw the error. The order of elements in the sorters determines the order they appear in the final two output arrays, and the count arrays determine how often each one appears.
Since left goes first, its elements stay together - in left_idx, the first left_count[1] elements in left_sorter are repeated right_count[1] times each (aaabbbccc...). At the same place in right_idx, the first right_count[1] elements are repeated in a row left_count[1] times (abcabcabc...). (Conveniently, since the 0 row in s is a 'paris' row, left_count[1] and right_count[1] are always equal, so you get x amount of repeats x amount of times to start off.) Then the next left_count[2] elements of left_sorter are repeated right_count[2] times, and so on... If any of the counts elements are zero, the corresponding spots in the idx arrays are filled with -1, to be masked later (as in, right_count[i] = 0 means elements in right_idx are -1, and vice versa - this is always the case for left_count[3] and left_count[4], because rows 2 and 3 in s are non-'paris').
In the end, the _idx arrays have a number of elements equal to N_elements, which can be calculated as follows:
left_nonzero = (left_count[1:] != 0)
right_nonzero = (right_count[1:] != 0)
left_repeats = left_count[1:]*left_nonzero + np.ones(len(left_count)-1)*(1 - left_nonzero)
right_repeats = right_count[1:]*right_nonzero + np.ones(len(right_count)-1)*(1 - right_nonzero)
N_elements = sum(left_repeats*right_repeats)
The corresponding elements of the count arrays are multiplied together (with all the zeros replaced with ones), and added together to get N_elements.
You can see this figure grows pretty quickly (O(n^2)). For an original dataframe with 1,000,000 sampled rows, each one appearing about equally often, the count arrays look something like:
left_count = array([0, 2e5, 2e5, 0, 0, 2e5])
right_count = array([0, 2e5, 2e5, 2e5, 2e5, 2e5])
for a total length of about 1.2e11. In general, for an initial sample of N rows (df = pd.read_csv(io.StringIO(s)).sample(N, replace=True)), the final size is approximately 0.12*N**2.
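As a rough check of that size estimate, here is a plain-Python sketch. The helper name joined_length is made up for illustration, and the idealised counts of exactly 200,000 per index value are an assumption (a real sample will be slightly uneven). It also converts the length into the memory an int64 array of that size would need:

```python
def joined_length(left_count, right_count):
    """Length of left_idx/right_idx given the two count arrays (slot 0 unused)."""
    total = 0
    for lc, rc in zip(left_count[1:], right_count[1:]):
        if lc and rc:
            total += lc * rc   # every left row is paired with every right row
        else:
            total += lc + rc   # unmatched rows appear once, paired with -1
    return total


# Idealised counts for N = 1,000,000 rows sampled evenly from 5 index values.
left_count = [0, 200_000, 200_000, 0, 0, 200_000]
right_count = [0, 200_000, 200_000, 200_000, 200_000, 200_000]

n_elements = joined_length(left_count, right_count)
gib = n_elements * 8 / 2**30   # 8 bytes per int64 element

print(n_elements, round(gib, 1))
```

The result lands close to the 892 GiB reported in the MemoryError, and that is for just one of the two index arrays.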
An Example
It's probably helpful to look at a small example to see what full_outer_join and groupsort_indexer are trying to do when they make those ginormous arrays. We'll start with a small sample of only 10 rows, and follow the various arrays to the final output, left_idx and right_idx. We'll start by defining the initial dataframe:
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df['final_target'] = df['city'] # this line doesn't change much, but meh
which looks like:
city arr final_target
3 miami 15 miami
1 paris 11 paris
0 paris 12 paris
0 paris 12 paris
0 paris 12 paris
1 paris 11 paris
2 dallas 22 dallas
3 miami 15 miami
2 dallas 22 dallas
4 paris 16 paris
df.loc[df['city'] == 'paris', 'final_target'] looks like:
1 paris
0 paris
0 paris
0 paris
1 paris
4 paris
and df['arr'].astype(str):
3 15
1 11
0 12
0 12
0 12
1 11
2 22
3 15
2 22
4 16
Then, in the call to full_outer_join, our arguments look like:
left = array([1,0,0,0,1,4]) # indexes of df.loc[df['city'] == 'paris', 'final_target']
right = array([3,1,0,0,0,1,2,3,2,4]) # indexes of df['arr'].astype(str)
max_groups = 5 # the max number of unique elements in either left or right
The function call groupsort_indexer(left, max_groups) returns the following two arrays:
left_sorter = array([1, 2, 3, 0, 4, 5])
left_count = array([0, 3, 2, 0, 0, 1])
left_count holds the number of appearances of each unique value in left - the first element is unused, but then there are 3 zeros, 2 ones, 0 twos, 0 threes, and 1 four in left.
left_sorter is such that left[left_sorter] = array([0, 0, 0, 1, 1, 4]) - all in order.
Now right: groupsort_indexer(right, max_groups) returns
right_sorter = array([2, 3, 4, 1, 5, 6, 8, 0, 7, 9])
right_count = array([0, 3, 2, 2, 2, 1])
Once again, right_count contains the number of times each index value appears: the unused first element, and then 3 zeros, 2 ones, 2 twos, 2 threes, and 1 four (note that elements 1, 2, and 5 of both count arrays are the same: these are the rows in s with 'city' = 'paris'). Also, right[right_sorter] = array([0, 0, 0, 1, 1, 2, 2, 3, 3, 4]).
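To make the sorter/count pair concrete, here is a plain-Python sketch of a counting sort that reproduces the arrays above. It is my own simplified stand-in written from the description in this answer, not the actual Cython code in pandas._libs.algos:

```python
def groupsort_indexer(labels, ngroups):
    """Counting-sort sketch returning (sorter, counts) for integer labels.

    counts has length ngroups + 1; slot 0 is reserved and stays 0 here.
    """
    counts = [0] * (ngroups + 1)
    for lab in labels:
        counts[lab + 1] += 1

    # where[g + 1] is the next free output position for label g.
    where = [0] * (ngroups + 1)
    for g in range(1, ngroups + 1):
        where[g] = where[g - 1] + counts[g - 1]

    sorter = [0] * len(labels)
    for pos, lab in enumerate(labels):
        sorter[where[lab + 1]] = pos
        where[lab + 1] += 1
    return sorter, counts


left = [1, 0, 0, 0, 1, 4]
right = [3, 1, 0, 0, 0, 1, 2, 3, 2, 4]

left_sorter, left_count = groupsort_indexer(left, 5)
right_sorter, right_count = groupsort_indexer(right, 5)

print(left_sorter, left_count)    # [1, 2, 3, 0, 4, 5] [0, 3, 2, 0, 0, 1]
print(right_sorter, right_count)  # [2, 3, 4, 1, 5, 6, 8, 0, 7, 9] [0, 3, 2, 2, 2, 1]
```

Because equal labels keep their original relative order, indexing the input with the sorter gives the grouped, ascending arrays quoted above.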
With both count arrays calculated, we can calculate what size the idx arrays will be (a bit simpler with actual numbers than with the formula above):
N_total = 3*3 + 2*2 + 2 + 2 + 1*1 = 18
3 is element 1 for both counts arrays, so we can expect something like [1,1,1,2,2,2,3,3,3] to start left_idx, since [1,2,3] starts left_sorter, and [2,3,4,2,3,4,2,3,4] to start right_idx, since right_sorter begins with [2,3,4]. Then we have twos, so [0,0,4,4] for left_idx and [1,5,1,5] for right_idx. Then left_count has two zeros, and right_count has two twos, so next go four -1's in left_idx and the next four elements in right_sorter go into right_idx: [6,8,0,7]. Both count arrays finish with a one, so one each of the last elements in the sorters go in the idx arrays: 5 for left_idx and 9 for right_idx, leaving:
left_idx = array([1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 0, 4, 4,-1, -1, -1, -1, 5])
right_idx = array([2, 3, 4, 2, 3, 4, 2, 3, 4, 1, 5, 1, 5, 6, 8, 0 , 7, 9])
which is indeed 18 elements.
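The walk-through above can be replayed in a few lines of plain Python. This is only a sketch of the pairing logic as I've described it (the function body is my own, not the pandas Cython source), fed with the sorter and count arrays from this example:

```python
def full_outer_join(left_sorter, left_count, right_sorter, right_count):
    """Pair up positions group by group; -1 marks a missing partner."""
    left_idx, right_idx = [], []
    lpos = rpos = 0
    for lc, rc in zip(left_count[1:], right_count[1:]):
        lgroup = left_sorter[lpos:lpos + lc]
        rgroup = right_sorter[rpos:rpos + rc]
        if lc == 0:
            # Label only present on the right.
            left_idx += [-1] * rc
            right_idx += rgroup
        elif rc == 0:
            # Label only present on the left.
            left_idx += lgroup
            right_idx += [-1] * lc
        else:
            # Label on both sides: every left row meets every right row.
            for l in lgroup:
                for r in rgroup:
                    left_idx.append(l)
                    right_idx.append(r)
        lpos += lc
        rpos += rc
    return left_idx, right_idx


left_idx, right_idx = full_outer_join(
    [1, 2, 3, 0, 4, 5], [0, 3, 2, 0, 0, 1],
    [2, 3, 4, 1, 5, 6, 8, 0, 7, 9], [0, 3, 2, 2, 2, 1],
)
print(len(left_idx))  # 18
```

The nested loop in the last branch is where the quadratic growth comes from: a label that appears k times on each side contributes k*k entries.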
With both index arrays the same shape, pandas can construct two Series of the same shape from our original ones to do any operations it needs to, and then it can mask these arrays to get back sorted indexes. Using a simple boolean filter to look at how we just sorted left and right with the outputs, we get:
left[left_idx[left_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 4])
right[right_idx[right_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4])
After going back up through all the function calls and modules, the result of the addition at this point is:
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
0 paris_12
1 paris_11
1 paris_11
1 paris_11
1 paris_11
2 NaN
2 NaN
3 NaN
3 NaN
4 paris_16
which is result in the line result = op(self, other) in pandas.core.generic.NDFrame._inplace_method (line 11066), with op = pandas.core.series.Series.__add__ and self and other the two Series from before that we're adding.
So, as far as I can tell, pandas basically tries to perform the operation for every combination of identically-indexed rows (like, any and all rows with index 1 in the first Series should be operated with all rows with index 1 in the other Series). If one of the Series has indexes that the other one doesn't, those rows get masked out. It just so happens in this case that every row with the same index is identical. It works (albeit redundantly) as long as you don't need to do anything in place - the trouble for the small dataframes arises after this, when pandas tries to reindex this result back into the shape of the original dataframe df.
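A tiny sketch of that "every combination" behaviour, using two made-up Series (the values and index labels are my own, chosen so that both indexes contain duplicates and are not sorted). I'm assuming the alignment goes through the same non-unique join path described above:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=[1, 0, 0])
b = pd.Series([10, 20, 30], index=[0, 2, 0])

total = a + b

# Label 0 appears twice on each side, so it yields 2 * 2 = 4 rows.
# Labels 1 and 2 each appear on one side only and come out as NaN.
pairs_for_zero = sorted(total.loc[0].tolist())
n_missing = int(total.isna().sum())
```

Three rows plus three rows gives six rows here; with hundreds of thousands of repeats per label, the same rule produces the enormous arrays from the question.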
The split (the line that smaller dataframes make it past, but larger ones don't) is that line result = op(self, other) from above. Later in the same function (called, note, _inplace_method), the program exits at self._update_inplace(result.reindex_like(self, copy=False), verify_is_copy=False). It tries to reindex result so it looks like self, so it can replace self with result (self is the original Series, the first one in the addition, df.loc[df['city'] == 'paris', 'final_target']). And this is where the smaller case fails, because, obviously, result has a bunch of repeated indexes, and pandas doesn't want to lose any information when it deletes some of them.
One Last Thing
It's probably worth mentioning that this behaviour isn't particular to the addition operation here. It happens any time you try an arithmetic operation on two large dataframes with a lot of repeated indexes - for example, try just defining a second dataframe the exact same way as the first, df2 = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True), and then try running df.arr*df2.arr. You'll get the same memory error.
Interestingly, logical and comparison operators have protections against doing this - they require identical indexes, and check for it before calling their operator method.
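For the comparison case, a small sketch with two made-up Series shows the check firing before any alignment is attempted (this only demonstrates ==, not the logical operators):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=[1, 0, 0])
b = pd.Series([1, 2, 3], index=[0, 2, 0])

try:
    a == b
    raised = False
except ValueError as exc:
    # pandas refuses to compare Series whose labels are not identical.
    raised = True
    message = str(exc)
```

Comparing the underlying arrays instead, for example a.to_numpy() == b.to_numpy(), sidesteps the index entirely when positional comparison is what you want.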
I did all my stuff in pandas 1.2.4, python 3.7.10, but I've given links to the pandas Github, which is currently in version 1.3.3. As far as I can tell, the differences don't affect the results.
CodePudding user response:
I could certainly be wrong about this, but isn't it because df["arr"] has a different shape from df.loc[df["city"] == "paris"]? So something funny is happening in Pandas' internal resampling.
If I explicitly truncate the dataframe myself it works:
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += "_" + df.loc[df['city'] == 'paris', 'arr'].astype(str)
In which case, the answer would be 'because internally pandas has an algorithm for reshaping dataframes when adding different sizes which is inefficient in this case'.
I don't know if that qualifies as an answer as I've not looked more deeply into pandas.