How can I speed up this iteration?


I have a dataframe with over ten million rows and two columns, 'left_index' and 'right_index'. 'left_index' is the index of a value and 'right_index' contains the indexes of rows that are possible matches. The problem is that the dataframe contains duplicate matches (e.g. (0, 1) and (1, 0)). I want to filter it and keep only one combination of each match.

I'm using a list here for an example.

In: [(0,1), (1,0), (3,567)]

Out: [(0,1), (3, 567)]

The code below produces what I want, but it is very slow. Is there a faster way to solve this?

lst2 = []
for i in lst1:
    if i in lst2:
        lst1.remove(i)
    else:
        lst2.append((i[1], i[0]))
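For reference, a much faster pure-Python variant of the same idea keeps a set of already-seen pairs (set membership is O(1), versus O(n) for a list) and builds a new list instead of mutating `lst1` while iterating over it. This is a sketch using the example data from the question:

```python
lst1 = [(0, 1), (1, 0), (3, 567)]

seen = set()   # normalized (sorted) pairs already kept
result = []
for a, b in lst1:
    key = (a, b) if a <= b else (b, a)  # (0, 1) and (1, 0) both normalize to (0, 1)
    if key not in seen:
        seen.add(key)
        result.append((a, b))

print(result)  # [(0, 1), (3, 567)]
```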

CodePudding user response:

I believe Pandas saves you from using a loop here.

import pandas as pd

df = pd.DataFrame([
    [(0, 0), (0, 0), 123],
    [(0, 0), (0, 1), 234],
    [(1, 0), (0, 1), 345],
    [(1, 1), (0, 1), 456],
], columns=['left_index', 'right_index', 'value'])

print(df)
  left_index right_index  value
0     (0, 0)      (0, 0)    123
1     (0, 0)      (0, 1)    234
2     (1, 0)      (0, 1)    345
3     (1, 1)      (0, 1)    456

df['left_index_set'] = df['left_index'].apply(set)
df['right_index_set'] = df['right_index'].apply(set)

I am not sure what you need after this point. If you want to filter out the duplicates, you can do the following.

df = df[df['left_index_set'] != df['right_index_set']]

df_final1= df[['left_index', 'right_index', 'value']]

print(df_final1)
  left_index right_index  value
1     (0, 0)      (0, 1)    234
3     (1, 1)      (0, 1)    456

However, if you do not want to filter the dataframe but to modify it in place:

df.loc[df['left_index_set'] != df['right_index_set'], 'right_index'] = None     # None, '' or what you want. It's up to you 
df_final2 = df[['left_index', 'right_index', 'value']]

print(df_final2)
  left_index right_index  value
0     (0, 0)      (0, 0)    123
1     (0, 0)        None    234
2     (1, 0)      (0, 1)    345
3     (1, 1)        None    456
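Note that in the question the two columns hold single integers rather than tuples. Under that assumption, the same set-comparison idea can be written with sorted pairs and `duplicated` (the column names match the question; the small dataframe is just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'left_index': [0, 1, 3], 'right_index': [1, 0, 567]})

# Sort each (left, right) pair so (0, 1) and (1, 0) become identical rows,
# then keep only the first occurrence of each sorted pair.
pair = np.sort(df[['left_index', 'right_index']].to_numpy(), axis=1)
df_unique = df[~pd.DataFrame(pair).duplicated().to_numpy()]

print(df_unique)  # keeps the (0, 1) and (3, 567) rows
```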

CodePudding user response:

You mention the data is in a dataframe and the question is tagged pandas, so we can use numpy vectorization to do this work for us.

First, since you did not provide a way to create the data, I generated a dataframe matching your description using:

import numpy as np
import pandas


def build_dataframe():
    def rand_series():
        """Create series of 1 million random integers in range [0, 9999]."""
        return (np.random.rand(1000000) * 10000).astype('int')

    data = pandas.DataFrame({
        'left_index': rand_series(),
        'right_index': rand_series()
    })
    return data.set_index('left_index')

data = build_dataframe()

Since (0,1) is the same as (1,0) per your requirements, let's create an index that has the values sorted for us. First, create two new columns holding the minimum and maximum of the left and right index:

data['min_index'] = np.minimum(data.index, data.right_index)
data['max_index'] = np.maximum(data.index, data.right_index)
print(data)
           right_index  min_index  max_index
left_index                                   
4270                438        438       4270
1277               9378       1277       9378
20                 7080         20       7080
4646               6623       4646       6623
3280               4481       3280       4481
...                 ...        ...        ...
3656               2492       2492       3656
2345                210        210       2345
9241               1934       1934       9241
369                8362        369       8362
5251               6047       5251       6047

[1000000 rows x 3 columns]

Then we can reset the index to these two new columns (really we just want a multi-index, and this is one way of getting it for us).

data = data.reset_index().set_index(keys=['min_index', 'max_index'])
print(data)
                     left_index  right_index
min_index max_index                         
438       4270             4270          438
1277      9378             1277         9378
20        7080               20         7080
4646      6623             4646         6623
3280      4481             3280         4481
...                         ...          ...
2492      3656             3656         2492
210       2345             2345          210
1934      9241             9241         1934
369       8362              369         8362
5251      6047             5251         6047

[1000000 rows x 2 columns]

Then we just want the unique values of the index. This is the operation that takes the most time, but it should still be significantly faster than the naive implementation using lists.

unique = data.index.unique()
print (unique)
MultiIndex([( 438, 4270),
            (1277, 9378),
            (  20, 7080),
            (4646, 6623),
            (3280, 4481),
            (4410, 9367),
            (1864, 7881),
            ( 516, 3287),
            (1678, 6946),
            (1253, 7890),
            ...
            (6669, 9527),
            (1095, 8866),
            ( 455, 7800),
            (2862, 8587),
            (8221, 9808),
            (2492, 3656),
            ( 210, 2345),
            (1934, 9241),
            ( 369, 8362),
            (5251, 6047)],
           names=['min_index', 'max_index'], length=990197)
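To actually drop the duplicate rows (rather than just list the unique pairs), one option is to filter on `Index.duplicated`, which marks every occurrence of a multi-index entry after the first. A minimal self-contained sketch of the same min/max construction on a tiny frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'left_index': [4270, 1277, 438],
                     'right_index': [438, 9378, 4270]})
data['min_index'] = np.minimum(data.left_index, data.right_index)
data['max_index'] = np.maximum(data.left_index, data.right_index)
data = data.set_index(['min_index', 'max_index'])

# Index.duplicated marks repeats of a multi-index entry after the first,
# so inverting it keeps one row per (min, max) pair.
deduped = data[~data.index.duplicated(keep='first')]
print(deduped)  # the (438, 4270) pair appears only once
```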

CodePudding user response:

Using numpy to keep the first occurrence of each non-unique row:

import numpy as np

lst1 = [(1, 0), (0, 1), (2, 5), (3, 567), (5, 2)]
arr = np.array(lst1)

# Sort within each row so (1, 0) and (0, 1) compare equal, then take the
# index of the first occurrence of each unique sorted row.
result = arr[np.unique(np.sort(arr), return_index=True, axis=0)[1]]

>>> result
array([[  1,   0],
       [  2,   5],
       [  3, 567]])