I have a dataframe with over ten million rows and two columns, 'left_index' and 'right_index'. 'left_index' is the index of a value, and 'right_index' contains the indexes of rows that are a possible match.
The problem is that this contains duplicate matches (Ex: 0,1 and 1,0).
I want to filter this dataframe and only keep one combination of each match.
I'm using a list here for an example.
In: [(0,1), (1,0), (3,567)]
Out: [(0,1), (3, 567)]
The code below produces what I want, but it is very slow. Is there a faster way to solve this?
lst2 = []
for i in lst1:
    if i in lst2:
        lst1.remove(i)
    else:
        lst2.append((i[1], i[0]))
CodePudding user response:
I believe pandas lets you avoid writing an explicit loop.
import pandas as pd
df = pd.DataFrame([
    [(0, 0), (0, 0), 123],
    [(0, 0), (0, 1), 234],
    [(1, 0), (0, 1), 345],
    [(1, 1), (0, 1), 456],
], columns=['left_index', 'right_index', 'value'])
print(df)
  left_index right_index  value
0     (0, 0)      (0, 0)    123
1     (0, 0)      (0, 1)    234
2     (1, 0)      (0, 1)    345
3     (1, 1)      (0, 1)    456
df['left_index_set'] = df['left_index'].apply(set)
df['right_index_set'] = df['right_index'].apply(set)
I am not sure what you need after this point. If you want to filter out the duplicates, you can do the following:
# Keep rows whose two index sets differ; df itself is left unfiltered.
mask = df['left_index_set'] != df['right_index_set']
df_final1 = df.loc[mask, ['left_index', 'right_index', 'value']]
print(df_final1)
  left_index right_index  value
1     (0, 0)      (0, 1)    234
3     (1, 1)      (0, 1)    456
However, if you want to modify the unfiltered dataframe rather than filter it:
df.loc[df['left_index_set'] != df['right_index_set'], 'right_index'] = None # None, '' or what you want. It's up to you
df_final2 = df[['left_index', 'right_index', 'value']]
print(df_final2)
  left_index right_index  value
0     (0, 0)      (0, 0)    123
1     (0, 0)        None    234
2     (1, 0)      (0, 1)    345
3     (1, 1)        None    456
CodePudding user response:
You mention the data is in a dataframe and tagged the question pandas, so we can use numpy to do this work for us using vectorization.
First, since you did not provide a way to create the data, I generated a dataframe per your description using:
import numpy as np
import pandas

def build_dataframe():
    def rand_series():
        """Create series of 1 million random integers in range [0, 9999]."""
        return (np.random.rand(1000000) * 10000).astype('int')

    data = pandas.DataFrame({
        'left_index': rand_series(),
        'right_index': rand_series()
    })
    return data.set_index('left_index')

data = build_dataframe()
Since (0,1) is the same as (1,0) per your requirements, let's just create an index that has the values sorted for us. First, create two new columns with the minimum and maximum of the left and right index:
data['min_index'] = np.minimum(data.index, data.right_index)
data['max_index'] = np.maximum(data.index, data.right_index)
print(data)
            right_index  min_index  max_index
left_index
4270                438        438       4270
1277               9378       1277       9378
20                 7080         20       7080
4646               6623       4646       6623
3280               4481       3280       4481
...                 ...        ...        ...
3656               2492       2492       3656
2345                210        210       2345
9241               1934       1934       9241
369                8362        369       8362
5251               6047       5251       6047
[1000000 rows x 3 columns]
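To make the normalisation step concrete, here is a minimal sketch that applies the same np.minimum / np.maximum idea to the three pairs from the question. The names pairs, lo, hi and normalised are only illustrative and do not appear in the original answer.

```python
import numpy as np

# Example pairs from the question, one (left_index, right_index) pair per row.
pairs = np.array([(0, 1), (1, 0), (3, 567)])

# Put the smaller value first so mirrored pairs become identical rows.
lo = np.minimum(pairs[:, 0], pairs[:, 1])
hi = np.maximum(pairs[:, 0], pairs[:, 1])
normalised = np.column_stack((lo, hi))

print(normalised.tolist())  # [[0, 1], [0, 1], [3, 567]]
```

After this step (0,1) and (1,0) are the same row, so any ordinary duplicate-removal routine can treat them as one match.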
Then we can reset the index to these two new columns (really we just want a multi-index, and this is one way of getting it for us).
data = data.reset_index().set_index(keys=['min_index', 'max_index'])
print(data)
                     left_index  right_index
min_index max_index
438       4270             4270          438
1277      9378             1277         9378
20        7080               20         7080
4646      6623             4646         6623
3280      4481             3280         4481
...                         ...          ...
2492      3656             3656         2492
210       2345             2345          210
1934      9241             9241         1934
369       8362              369         8362
5251      6047             5251         6047
[1000000 rows x 2 columns]
Then we just want the unique values of the index. This is the operation that takes the most time, but should still be significantly faster than the naive implementation using lists.
unique = data.index.unique()
print(unique)
MultiIndex([( 438, 4270),
(1277, 9378),
( 20, 7080),
(4646, 6623),
(3280, 4481),
(4410, 9367),
(1864, 7881),
( 516, 3287),
(1678, 6946),
(1253, 7890),
...
(6669, 9527),
(1095, 8866),
( 455, 7800),
(2862, 8587),
(8221, 9808),
(2492, 3656),
( 210, 2345),
(1934, 9241),
( 369, 8362),
(5251, 6047)],
names=['min_index', 'max_index'], length=990197)
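The unique index above gives the distinct pairs, but not the original rows. If you need to keep one original row per match, one possible variation (a sketch, not part of the steps above) is to build the same min/max key and use pandas' DataFrame.duplicated to drop the repeats. The small df here is made-up stand-in data, and key and deduped are hypothetical names.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data (made-up values for illustration).
df = pd.DataFrame({
    'left_index': [0, 1, 3, 567, 2],
    'right_index': [1, 0, 567, 3, 2],
})

# Order-independent key: smaller value first, larger value second.
key = pd.DataFrame({
    'min_index': np.minimum(df['left_index'], df['right_index']),
    'max_index': np.maximum(df['left_index'], df['right_index']),
})

# duplicated() flags every repeat of a key after its first occurrence,
# so negating it keeps the first row seen for each unordered pair.
deduped = df[~key.duplicated()]
print(deduped)
```

This keeps the rows in their original order and leaves 'left_index' and 'right_index' as they were, which may matter if the direction of the first match is meaningful.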
CodePudding user response:
Using numpy to keep only the first occurrence of each pair, regardless of the order within the pair:
import numpy as np
lst1 = [(1,0), (0,1), (2, 5), (3,567), (5,2)]
arr = np.array(lst1)
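# Sort within each row so (1, 0) and (0, 1) compare equal, then take the
# index of the first occurrence of every unique row.
result = arr[np.unique(np.sort(arr, axis=1), return_index=True, axis=0)[1]]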
>>> result
array([[  1,   0],
       [  2,   5],
       [  3, 567]])
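If the data is still a plain list, as in the question's example, the same idea can be sketched without numpy. The loop in the question is slow mainly because `i in lst2` and `lst1.remove(i)` each scan a list. Removing items while iterating can also skip elements. A set of normalised pairs avoids both problems. The function name drop_mirrored_pairs is hypothetical, chosen only for this sketch.

```python
def drop_mirrored_pairs(pairs):
    """Return pairs with mirrored duplicates removed, keeping the first seen."""
    seen = set()
    result = []
    for a, b in pairs:
        # Normalise so (a, b) and (b, a) map to the same key.
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            result.append((a, b))
    return result


lst1 = [(0, 1), (1, 0), (3, 567)]
print(drop_mirrored_pairs(lst1))  # [(0, 1), (3, 567)]
```

Set membership tests take constant time on average, so this makes a single pass over the list and builds a new list instead of mutating the input.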