I have a list full of pandas DataFrames. Is there a way to remove duplicates from it? Here is some example code:
import pandas as pd
import numpy as np
if __name__ == '__main__':
    data1 = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
    df1 = pd.DataFrame.from_dict(data1, orient='index', columns=['A', 'B', 'C', 'D'])
    data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
    df2 = pd.DataFrame.from_dict(data2)
    df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                       columns=['a', 'b', 'c'])
    data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
                    dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
    df4 = pd.DataFrame(data, columns=['c', 'a'])
    l_input = [df1, df2, df1, df3, df4, df4, df1, df3]
    # l_aim = [df1, df2, df3, df4]
The duplicates should be removed from the input list l_input in the example, so that l_aim is the result.
CodePudding user response:
Try df.equals():
out = []
while l_input:
    d = l_input.pop()
    if any(d.equals(df) for df in l_input):
        continue
    out.append(d)

print(*out[::-1], sep="\n\n")
Prints:
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

   c  a
0  3  1
1  6  4
2  9  7
CodePudding user response:
If these are literally the same objects (as in your example), then you could use their ids:
out = list(dict((id(df), df) for df in l_input).values())
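Note that id() only collapses entries that reference the very same object; an equal copy of a frame is kept. A quick check of what "same objects" means here (df1_copy is just a hypothetical name for this illustration):

df1_copy = df1.copy()                                    # equal content, but a different object
list({id(df): df for df in [df1, df1_copy]}.values())    # keeps both frames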
If not, you could use equals():
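For example, a minimal sketch that keeps the first occurrence of each distinct frame (one possible way to apply equals, not necessarily the fastest):

out = []
for df in l_input:
    # keep df only if no already-kept frame compares equal to it
    if not any(df.equals(kept) for kept in out):
        out.append(df)
print(out)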
Output:
[       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d,
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d,
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9,
   c  a
0  3  1
1  6  4
2  9  7]
CodePudding user response:
An efficient method to find the duplicates in linear time is to compute a hash of each dataframe. You can't do it with the built-in Python hash() function (DataFrames are unhashable), but there is a helper function in pandas: pandas.util.hash_pandas_object.
The function computes a hash per row, so you need to aggregate them into a single value. sum could be used, but it might lead to collisions. Here I opted for a concatenation of all the hashes. If you have huge dataframes this might consume a lot of memory (in that case, maybe hash the list of hashes).
hashes = [pd.util.hash_pandas_object(d).astype(str).str.cat(sep='-')
          for d in l_input]

# identify which positions are duplicates
dups = pd.Series(hashes).duplicated()
Output:
0    False
1    False
2     True
3    False
4    False
5     True
6     True
7     True
dtype: bool
To filter the unique dataframes:
out = [d for d, h in zip(l_input, dups) if not h]
Variant with a hash of the hashes:
def df_hash(df):
    s = pd.util.hash_pandas_object(df)
    return hash(tuple(s))

hashes = [df_hash(d) for d in l_input]
dups = pd.Series(hashes).duplicated()
out = [d for d, h in zip(l_input, dups) if not h]