Explosion of memory when using pandas .loc with unmatching indices, assignment giving duplicate axis

This is an observation from "Most pythonic way to concatenate pandas cells with conditions". I am not able to understand why the third solution takes so much more memory than the first one.

If I don't sample, the third solution does not give a runtime error, so clearly something is weird.
I resampled to emulate a large dataframe, but never expected to run into this kind of error.

Background

Pretty self explanatory, one line, looks pythonic

df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import numpy as np
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df

Speeds

%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If I don't sample, there is no error and the outputs match exactly.

Error (updated; only happens when I sample from the dataframe)

%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)

MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64

For a smaller input (sample size 100) we get a different error, pointing to a problem with mismatched sizes. But what's up with the memory allocation and the sampling?

ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-57c5b10090b2> in <module>
      1 df['final_target'] = df['city']
----> 2 df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)

~/anaconda3/lib/python3.8/site-packages/pandas/core/ops/methods.py in f(self, other)
     99             # we are updating inplace so we want to ignore is_copy
    100             self._update_inplace(
--> 101                 result.reindex_like(self, copy=False), verify_is_copy=False
    102             )
    103

I rerun them from scratch each time

Update

This is part of what I figured

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df

    city    arr final_target
1   paris   12  paris_12
0   paris   11  paris_11
2   dallas  22  dallas
2   dallas  22  dallas
3   miami   15  miami
3   miami   15  miami
2   dallas  22  dallas
1   paris   12  paris_12
0   paris   11  paris_11
3   miami   15  miami

Indices are repeated when sampling with replacement.
Resetting the index resolved the problem, even though df['arr'] and the df.loc selection still have different sizes. Replacing the right-hand side with df.loc[df['city'] == 'paris', 'arr'].astype(str) also solves it, just as 2e0byo pointed out.
Still, can someone explain how .loc works, and why memory explodes when the indices contain duplicates and don't match?
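For reference, here is a minimal sketch of the index-reset workaround described above. The sample size of 1000 and the random_state are arbitrary choices made only so the snippet is small and repeatable, and the final_target column is left out of the CSV because it gets recomputed anyway.

```python
import io
import pandas as pd

s = """city,arr
paris,11
paris,12
dallas,22
miami,15
paris,16"""

# Sampling with replacement repeats index labels 0..4 many times.
df = pd.read_csv(io.StringIO(s)).sample(1000, replace=True, random_state=0)

# Give every row a unique label again before doing aligned arithmetic.
df = df.reset_index(drop=True)

df['final_target'] = df['city']
mask = df['city'] == 'paris'
df.loc[mask, 'final_target'] += '_' + df['arr'].astype(str)
```

With unique labels, each row on the left aligns with exactly one row on the right, so nothing gets multiplied out.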

CodePudding user response：

@2e0byo hit the nail on the head saying pandas' algorithm is "inefficient" in this case.

As far as .loc goes, it's not really doing anything remarkable. Its use here is analogous to indexing a numpy array with a boolean array of the same shape, with an added dict-key-like access to a specific column. That is, df['city'] == 'paris' is itself a boolean Series with the same number of rows and the same index as df. df.loc[df['city'] == 'paris'] then gives a dataframe consisting of only the rows that are true in df['city'] == 'paris' (that have 'paris' in the 'city' column). Adding the additional argument 'final_target' then returns only the 'final_target' column of those rows, instead of all three (and because it has only one column, it's a Series object - the same goes for df['arr']).

The memory explosion happens when pandas actually tries to add the two Series. As @2e0byo pointed out, it has to reshape the Series to do this, and it does this by calling the first Series' align() method. During the align operation, the function pandas.core.reshape.merge.get_join_indexers() calls pandas._libs.join.full_outer_join() (line 155) with three arguments: left, right, and max_groups (point of clarification: these are their names inside the function full_outer_join). left and right are integer arrays containing the indexes of the two Series objects (the values in the index column), and max_groups is the maximum number of unique elements in either left or right (in our case, that's five, corresponding to the five original rows in s).

full_outer_join immediately turns and calls pandas._libs.algos.groupsort_indexer() (line 194), once with left and max_groups as arguments and once with right and max_groups. groupsort_indexer returns two arrays - generically, indexer and counts (for the invocation with left, these are called left_sorter and left_count, and correspondingly for right). counts has length max_groups + 1, and each element (excepting the first one, which is unused) contains the count of how many times the corresponding index group appears in the input array. So for our case, with max_groups = 5, the count arrays have shape (6,), and elements 1-5 represent the number of times the 5 unique index values appear in left and right.
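To make the counts/indexer pair concrete, here is a rough pure-Python sketch of a counting sort that produces both arrays. It is my own re-creation of the idea, not the actual Cython source, and the name groupsort_indexer_sketch is made up. It assumes the labels are already integers in range(ngroups), which happens to be true for the index values 0-4 in this question.

```python
def groupsort_indexer_sketch(labels, ngroups):
    """Counting sort over integer labels in range(ngroups).

    Returns (sorter, counts): reading labels in the order given by sorter
    yields them in ascending order, and counts[g + 1] is the number of
    times label g occurs (counts[0] stays unused here).
    """
    counts = [0] * (ngroups + 1)
    for g in labels:
        counts[g + 1] += 1

    # starts[g + 1] is the first output slot reserved for label g.
    starts = [0] * (ngroups + 1)
    for i in range(1, ngroups + 1):
        starts[i] = starts[i - 1] + counts[i - 1]

    sorter = [0] * len(labels)
    for pos, g in enumerate(labels):
        slot = g + 1
        sorter[starts[slot]] = pos  # original position goes into its group
        starts[slot] += 1
    return sorter, counts


left_sorter, left_count = groupsort_indexer_sketch([1, 0, 0, 0, 1, 4], 5)
```

Because positions are written in the order they are read, equal labels keep their original relative order, which is what makes the result usable as a stable "sorter".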

The other array, indexer, is constructed so that indexing the original input array with it returns all the elements grouped in ascending order - hence "sorter." After having done this for both left and right, full_outer_join chops up the two sorters and strings them up across from each other. full_outer_join returns two arrays of the same size, left_idx and right_idx - these are the arrays that get really big and throw the error. The order of elements in the sorters determines the order they appear in the final two output arrays, and the count arrays determine how often each one appears. Since left goes first, its elements stay together - in left_idx, the first left_count[1] elements in left_sorter are repeated right_count[1] times each (aaabbbccc...). At the same place in right_idx, the first right_count[1] elements are repeated in a row left_count[1] times (abcabcabc...). (Conveniently, since the 0 row in s is a 'paris' row, left_count[1] and right_count[1] are always equal, so you get x amount of repeats x amount of times to start off). Then the next left_count[2] elements of left_sorter are repeated right_count[2] times, and so on... If any of the counts elements are zero, the corresponding spots in the idx arrays are filled with -1, to be masked later (as in, right_count[i] = 0 means elements in right_idx are -1, and vice versa - this is always the case for left_count[3] and left_count[4], because rows 2 and 3 in s are non-'paris').
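The pairing described above can be sketched in plain Python as follows. Again, this is a simplified re-creation under my own naming (sorter_and_counts and full_outer_join_sketch are hypothetical helpers), with integer labels in range(ngroups) assumed; it only illustrates how the two output arrays are laid out.

```python
def sorter_and_counts(labels, ngroups):
    # Stable argsort plus per-label counts; counts[0] is unused padding.
    sorter = sorted(range(len(labels)), key=lambda i: labels[i])
    counts = [0] * (ngroups + 1)
    for g in labels:
        counts[g + 1] += 1
    return sorter, counts


def full_outer_join_sketch(left, right, ngroups):
    left_sorter, left_count = sorter_and_counts(left, ngroups)
    right_sorter, right_count = sorter_and_counts(right, ngroups)

    left_idx, right_idx = [], []
    lpos = rpos = 0  # read positions inside the two sorters
    for g in range(1, ngroups + 1):
        lc, rc = left_count[g], right_count[g]
        lgroup = left_sorter[lpos:lpos + lc]
        rgroup = right_sorter[rpos:rpos + rc]
        if rc == 0:
            # Label only on the left: pad the right side with -1.
            left_idx += lgroup
            right_idx += [-1] * lc
        elif lc == 0:
            # Label only on the right: pad the left side with -1.
            left_idx += [-1] * rc
            right_idx += rgroup
        else:
            # Label on both sides: every left row meets every right row.
            for l in lgroup:
                left_idx += [l] * rc   # aaabbbccc...
                right_idx += rgroup    # abcabcabc...
        lpos += lc
        rpos += rc
    return left_idx, right_idx


left_idx, right_idx = full_outer_join_sketch(
    [1, 0, 0, 0, 1, 4], [3, 1, 0, 0, 0, 1, 2, 3, 2, 4], 5)
```

The inner loop over a shared label is where the lc * rc blow-up comes from: each group contributes the product of its two counts to the output length.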

In the end, the _idx arrays have an amount of elements equal to N_elements, which can be calculated as follows:

left_nonzero = (left_count[1:] != 0)
right_nonzero = (right_count[1:] != 0)
# replace every zero count with one so it contributes a factor of 1
left_repeats = left_count[1:]*left_nonzero + np.ones(len(left_count)-1)*(1 - left_nonzero)
right_repeats = right_count[1:]*right_nonzero + np.ones(len(right_count)-1)*(1 - right_nonzero)
N_elements = sum(left_repeats*right_repeats)

The corresponding elements of the count arrays are multiplied together (with all the zeros replaced with ones), and added together to get N_elements.

You can see this figure grows pretty quickly (O(n^2)). For an original dataframe with 1,000,000 sampled rows, each one appearing about equally often, the count arrays look something like:

left_count = array([0, 2e5, 2e5, 0, 0, 2e5])
right_count = array([0, 2e5, 2e5, 2e5, 2e5, 2e5])

for a total length of about 1.2e11. In general for an initial sample N (df = pd.read_csv(io.StringIO(s)).sample(N, replace=True)), the final size is approximately 0.12*N**2
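As a sanity check on that estimate, here is a small sketch that plugs idealised counts (every label appearing exactly N/5 times, which real sampling only approximates) into the same sum and converts the result to bytes for an int64 array. The helper name n_elements is my own.

```python
def n_elements(left_count, right_count):
    # Skip the unused first slot; a zero count contributes a factor of 1.
    return sum(max(l, 1) * max(r, 1)
               for l, r in zip(left_count[1:], right_count[1:]))


N = 1_000_000
per_label = N // 5  # idealised: each of the 5 original rows drawn equally often

left_count = [0, per_label, per_label, 0, 0, per_label]  # 'paris' rows only
right_count = [0] + [per_label] * 5                      # every row

total = n_elements(left_count, right_count)
gib = total * 8 / 2**30  # 8 bytes per int64 element

print(f"{total:.3e} elements, about {gib:.0f} GiB")
```

The result lands in the same ballpark as the 892 GiB in the MemoryError; the small difference comes from the real counts not being exactly equal.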

An Example

It's probably helpful to look at a small example to see what full_outer_join and groupsort_indexer are trying to do when they make those ginormous arrays. We'll start with a small sample of only 10 rows, and follow the various arrays to the final output, left_idx and right_idx. We'll start by defining the initial dataframe:

df = pd.read_csv(io.StringIO(s)).sample(10, replace=True)
df['final_target'] = df['city'] # this line doesn't change much, but meh

which looks like:

     city  arr final_target
3   miami   15        miami
1   paris   11        paris
0   paris   12        paris
0   paris   12        paris
0   paris   12        paris
1   paris   11        paris
2  dallas   22       dallas
3   miami   15        miami
2  dallas   22       dallas
4   paris   16        paris

df.loc[df['city'] == 'paris', 'final_target'] looks like:

1    paris
0    paris
0    paris
0    paris
1    paris
4    paris

and df['arr'].astype(str) looks like:

3    15
1    11
0    12
0    12
0    12
1    11
2    22
3    15
2    22
4    16

Then, in the call to full_outer_join, our arguments look like:

left = array([1,0,0,0,1,4])            # indexes of df.loc[df['city'] == 'paris', 'final_target']
right = array([3,1,0,0,0,1,2,3,2,4])   # indexes of df['arr'].astype(str)
max_groups = 5                         # the max number of unique elements in either left or right

The function call groupsort_indexer(left, max_groups) returns the following two arrays:

left_sorter = array([1, 2, 3, 0, 4, 5])
left_count = array([0, 3, 2, 0, 0, 1])

left_count holds the number of appearances of each unique value in left - the first element is unused, but then there are 3 zeros, 2 ones, 0 twos, 0 threes, and 1 four in left.

left_sorter is such that left[left_sorter] = array([0, 0, 0, 1, 1, 4]) - all in order.

Now right: groupsort_indexer(right, max_groups) returns

right_sorter = array([2, 3, 4, 1, 5, 6, 8, 0, 7, 9])
right_count = array([0, 3, 2, 2, 2, 1])

Once again, right_count contains the number of times each value appears: the unused first element, and then 3 zeros, 2 ones, 2 twos, 2 threes, and 1 four (note that elements 1, 2, and 5 of both count arrays are the same: these are the rows in s with 'city' = 'paris'). Also, right[right_sorter] = array([0, 0, 0, 1, 1, 2, 2, 3, 3, 4])

With both count arrays calculated, we can calculate what size the idx arrays will be (a bit simpler with actual numbers than with the formula above):

N_total = 3*3 + 2*2 + 2 + 2 + 1*1 = 18

3 is element 1 for both count arrays, so we can expect something like [1,1,1,2,2,2,3,3,3] to start left_idx, since [1,2,3] starts left_sorter, and [2,3,4,2,3,4,2,3,4] to start right_idx, since right_sorter begins with [2,3,4]. Element 2 is a two in both count arrays, so next come [0,0,4,4] for left_idx and [1,5,1,5] for right_idx. Then left_count has two zeros where right_count has two twos, so four -1's go into left_idx and the next four elements of right_sorter go into right_idx: [6,8,0,7]. Both count arrays finish with a one, so the last element of each sorter goes into its idx array: 5 for left_idx and 9 for right_idx, leaving:

left_idx  = array([1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 0, 4, 4, -1, -1, -1, -1, 5])
right_idx = array([2, 3, 4, 2, 3, 4, 2, 3, 4, 1, 5, 1, 5,  6,  8,  0,  7, 9])

which is indeed 18 elements.

With both index arrays the same shape, pandas can construct two Series of the same shape from our original ones to do any operations it needs to, and then it can mask these arrays to get back sorted indexes. Using a simple boolean filter to look at how we just sorted left and right with the outputs, we get:

left[left_idx[left_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 4])
right[right_idx[right_idx != -1]] = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4])

After going back up through all the function calls and modules, the result of the addition at this point is:

0    paris_12
0    paris_12
0    paris_12
0    paris_12
0    paris_12
0    paris_12
0    paris_12
0    paris_12
0    paris_12
1    paris_11
1    paris_11
1    paris_11
1    paris_11
2         NaN
2         NaN
3         NaN
3         NaN
4    paris_16

which is the value of result in the line result = op(self, other) in pandas.core.generic.NDFrame._inplace_method (line 11066), with op = pandas.core.series.Series.__add__ and self and other the two Series from before that we're adding.

So, as far as I can tell, pandas basically tries to perform the operation for every combination of identically-indexed rows (like, any and all rows with index 1 in the first Series should be operated with all rows index 1 in the other Series). If one of the Series has indexes that the other one doesn't, those rows get masked out. It just so happens in this case that every row with the same index is identical. It works (albeit redundantly) as long as you don't need to do anything in place - the trouble for the small dataframes arises after this when pandas tries to reindex this result back into the shape of the original dataframe df.
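A tiny illustration of that every-combination behaviour, using two made-up numeric Series (a and b are just example names) so the pairing is easy to read off:

```python
import pandas as pd

a = pd.Series([1, 2], index=[0, 0])            # label 0 twice
b = pd.Series([10, 20, 30], index=[0, 0, 1])   # label 0 twice, label 1 once

total = a + b  # aligned on labels before adding

# Label 0 yields 2 * 2 = 4 rows; label 1 exists only in b, so it becomes NaN.
print(total)
```

Two rows on each side already give four rows for the shared label, which is the same multiplication that gets out of hand with hundreds of thousands of repeats.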

The split (the line that smaller dataframes make it past, but larger ones don't) is that line result = op(self, other) from above. Later in the same function (called, note, _inplace_method), the program exits at self._update_inplace(result.reindex_like(self, copy=False), verify_is_copy=False). It tries to reindex result so it looks like self, so it can replace self with result (self is the original Series, the first one in the addition, df.loc[df['city'] == 'paris', 'final_target']). And this is where the smaller case fails, because, obviously, result has a bunch of repeated indexes, and pandas doesn't want to lose any information when it deletes some of them.
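The reindexing failure can be reproduced on its own with two small stand-in Series (result and target below are illustrative, not the real intermediate objects). The exact wording of the message differs between pandas versions, so the sketch only captures it rather than assuming specific text.

```python
import pandas as pd

# Stand-ins: 'result' has more rows per label than 'target' does.
result = pd.Series([11, 21, 12, 22], index=[0, 0, 0, 0])
target = pd.Series([1, 2], index=[0, 0])

message = None
try:
    # pandas cannot decide which of the duplicated rows to keep.
    result.reindex_like(target)
except ValueError as exc:
    message = str(exc)

print(message)
```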

One Last Thing

It's probably worth mentioning that this behaviour isn't particular to the addition operation here. It happens any time you try an arithmetic operation on two large dataframes with a lot of repeated indexes - for example, try just defining a second dataframe the exact same way as the first, df2 = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True), and then try running df.arr*df2.arr. You'll get the same memory error.

Interestingly, logical and comparison operators have protections against doing this - they require identical indexes, and check for it before calling their operator method.

I did all my stuff in pandas 1.2.4, python 3.7.10, but I've given links to the pandas Github, which is currently in version 1.3.3. As far as I can tell, the differences don't affect the results.

CodePudding user response：

I could certainly be wrong about this, but isn't it because df["arr"] has a different shape from df.loc[df["city"] == "paris"]? So something funny is happening in Pandas' internal resampling.

If I explicitly truncate the dataframe myself it works:

df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += "_" + df.loc[df['city'] == 'paris', 'arr'].astype(str)

In which case, the answer would be 'because internally pandas has an algorithm for reshaping dataframes when adding different sizes which is inefficient in this case'.

I don't know if that qualifies as an answer as I've not looked more deeply into pandas.