Python: is there a way to remove duplicates from a list of pandas dataframes?


I have a list full of pandas dataframes. Is there a way to remove duplicates from it? Here is some example code:

import pandas as pd
import numpy as np

if __name__ == '__main__':
    data1 = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
    
    df1 = pd.DataFrame.from_dict(data1, orient='index', columns=['A', 'B', 'C', 'D'])
    data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}

    df2 = pd.DataFrame.from_dict(data2)
    df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                       columns=['a', 'b', 'c'])
    data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
                    dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
    df4 = pd.DataFrame(data, columns=['c', 'a'])

    l_input = [df1, df2, df1, df3, df4, df4, df1, df3]
    # l_aim = [df1, df2, df3, df4]

The duplicates in the input list l_input should be removed, and l_aim should be the result.

CodePudding user response:

Try df.equals():

out = []
while l_input:
    # take the last dataframe; keep it only if no equal one remains in the list
    d = l_input.pop()
    if any(d.equals(df) for df in l_input):
        continue
    out.append(d)

# the list was built back to front, so reverse it to restore the original order
print(*out[::-1], sep="\n\n")

Prints:

       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

   c  a
0  3  1
1  6  4
2  9  7

CodePudding user response:

If these are literally the same objects (as in your example), then you could use their ids:

out = list(dict((id(df), df) for df in l_input).values())

If not, you could use equals:
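For example, a minimal sketch that keeps the first occurrence of each frame:

out = []
for df in l_input:
    # keep df only if no content-equal dataframe has been collected yet
    if not any(df.equals(seen) for seen in out):
        out.append(df)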

Output:

[       A  B  C  D
 row_1  3  2  1  0
 row_2  a  b  c  d,
    col_1 col_2
 0      3     a
 1      2     b
 2      1     c
 3      0     d,
    a  b  c
 0  1  2  3
 1  4  5  6
 2  7  8  9,
    c  a
 0  3  1
 1  6  4
 2  9  7]

CodePudding user response:

An efficient way to find the duplicates in linear time is to compute a hash of each dataframe. You can't do it with Python's built-in hash function, but pandas provides a helper: pandas.util.hash_pandas_object.

The function computes a hash per row, so you need to aggregate the result into a single value. sum could be used, but it might lead to collisions. Here I opted for a concatenation of all the row hashes. If you have huge dataframes this might consume a lot of memory (in that case, maybe hash the list of hashes).

hashes = [pd.util.hash_pandas_object(d).astype(str).str.cat(sep='-')
          for d in l_input]

# identify duplicates by position in the list
dups = pd.Series(hashes).duplicated()

Output:

0    False
1    False
2     True
3    False
4    False
5     True
6     True
7     True
dtype: bool

To filter the unique dataframes:

out = [d for d, h in zip(l_input, dups) if not h]

Variant with a hash of the hashes:

def df_hash(df):
    # hash each row, then reduce the row hashes to a single value
    s = pd.util.hash_pandas_object(df)
    return hash(tuple(s))

hashes = [df_hash(d) for d in l_input]

dups = pd.Series(hashes).duplicated()

out = [d for d, h in zip(l_input, dups) if not h]
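
Putting it together, a minimal sketch of a reusable helper built on df_hash above (the name dedupe_frames is just illustrative):

def dedupe_frames(frames):
    # hash every frame, mark duplicates, and keep only the first occurrences
    hashes = [df_hash(d) for d in frames]
    keep = ~pd.Series(hashes).duplicated()
    return [d for d, k in zip(frames, keep) if k]

l_aim = dedupe_frames(l_input)  # [df1, df2, df3, df4]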