Add column to dataframe that has each row's duplicate count value takes too long

I've read Stack Overflow posts on how to create a column that holds, for each row of a pandas DataFrame, the number of duplicates that row has. Without using any other libraries, I tried writing a function that does this, and it works on small DataFrame objects; however, it takes far too long on larger ones and consumes too much memory.

This is the function:

def count_duplicates(dataframe):
    # For each row, convert the whole DataFrame to a list of lists and count
    # how many times the row appears, minus 1 for the row itself.
    function = lambda x: dataframe.to_numpy().tolist().count(x.to_list()) - 1
    return dataframe.apply(function, axis=1)

I ran dir() on the NumPy array returned by DataFrame.to_numpy, and I didn't see a method quite like list.count. The reason this takes so long is that, for each row, it has to compare that row against every row in the array. I'd like a much more efficient way to do this, even if it means not using a pandas DataFrame. I feel like there should be a simple way to do this with NumPy, but I'm just not familiar enough with it. I've been testing different approaches for a while, and most of them end in errors. I'll keep experimenting, but I figured the community might know a better way.
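One direction I've been sketching along those lines (untested at scale; the helper name count_duplicates_np is just a placeholder) uses np.unique with return_inverse and return_counts:

import numpy as np
import pandas as pd

def count_duplicates_np(dataframe):
    # Distinct rows, the index of each original row's distinct row,
    # and how many times each distinct row occurs.
    _, inverse, counts = np.unique(
        dataframe.to_numpy(), axis=0, return_inverse=True, return_counts=True
    )
    # counts[inverse] is each row's group size; subtract 1 to exclude the row itself.
    # ravel() guards against NumPy versions where inverse comes back 2-D.
    return pd.Series(counts[np.ravel(inverse)] - 1, index=dataframe.index)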

Thank you for your help.

Here is an example DataFrame:

   one  two
0    1    1
1    2    2
2    3    3
3    1    1

I'd use it like this:

d['duplicates'] = count_duplicates(d)

The resulting DataFrame is:

   one  two  duplicates
0    1    1           1
1    2    2           0
2    3    3           0
3    1    1           1

The problem is that the actual DataFrame will have 1.4 million rows, and each lambda call takes an average of 0.148558 seconds. Multiplied across 1.4 million rows, that comes to roughly 208,000 seconds, or about 57.8 hours. I need a much faster way to accomplish this.
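For reference, the back-of-the-envelope math (using the numbers above):

per_row = 0.148558            # measured average seconds per lambda call
n_rows = 1_400_000
total = per_row * n_rows      # ~207,981 seconds
print(total / 3600)           # ~57.8 hours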

Thank you again.

I updated the function, which speeds things up considerably:

def _counter(series_to_count, list_of_lists):
    # Count how many times this row appears in the precomputed list,
    # minus 1 for the row itself.
    return list_of_lists.count(series_to_count.to_list()) - 1

def count_duplicates(dataframe):
    # Convert the DataFrame to a list of lists once, rather than once per row.
    df_list = dataframe.to_numpy().tolist()
    return dataframe.apply(_counter, args=(df_list,), axis=1)

This takes only 29.487 seconds. The bottleneck was converting the DataFrame to a list inside every function call; doing the conversion once up front removes that cost.

I'm still interested in optimizing this. I'd like to get this down to 2-3 seconds if at all possible. It may not be, but I'd like to make sure it is as fast as possible.
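Another idea I'm sketching (untested on the full data; standard library only, and the name count_duplicates_hashed is just a placeholder) is to hash each row once with collections.Counter instead of calling list.count per row:

from collections import Counter
import pandas as pd

def count_duplicates_hashed(dataframe):
    # Turn each row into a hashable tuple, count all rows in one pass,
    # then look up each row's count and subtract 1 for the row itself.
    rows = [tuple(row) for row in dataframe.to_numpy()]
    counts = Counter(rows)
    return pd.Series([counts[row] - 1 for row in rows], index=dataframe.index)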

Thank you again.

CodePudding user response:

Here is a vectorized way to do this. For 1.4 million rows with an average of 140 duplicates per row, it takes under 0.05 seconds; when there are no duplicates at all, it takes about 0.4 seconds.

d['duplicates'] = d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
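The groupby(...).transform('size') call computes the size of each (one, two) group and broadcasts it back onto every row, so subtracting 1 leaves the number of other identical rows. For your example, the intermediate group sizes look like this (shown only for illustration):

sizes = d.groupby(['one', 'two'], sort=False)['one'].transform('size')
# 0    2
# 1    1
# 2    1
# 3    2
# Name: one, dtype: int64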

On your example:

>>> d
   one  two  duplicates
0    1    1           1
1    2    2           0
2    3    3           0
3    1    1           1

Speed

Relatively high rate of duplicates:
n = 1_400_000
d = pd.DataFrame(np.random.randint(0, 100, size=(n, 2)), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 48.3 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# how many duplicates on average?
>>> (d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1).mean()
139.995841

# (as expected: n / 100**2)

No duplicates:
n = 1_400_000
d = pd.DataFrame(np.arange(2 * n).reshape(-1, 2), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 389 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)