If I have a dataframe like the following:
df = pd.DataFrame({'val':['a','b','c','d','e','f','g','h'],
'cat':['C','D','D','C','D','D','D','C'],
'num':[1,2,2,1,2,2,2,1],
'cat2':['X','Y','Y','X','Y','Y','Y','X']})
giving:
val cat num cat2
0 a C 1 X
1 b D 2 Y
2 c D 2 Y
3 d C 1 X
4 e D 2 Y
5 f D 2 Y
6 g D 2 Y
7 h C 1 X
You'll notice that we can determine the columns num and cat2 to be redundant, because the values in cat, num and cat2 always match across the rows: C == 1 == X and D == 2 == Y.
I'd like to identify the columns that are redundant, to ultimately discard them and keep just one representation, like below. num or cat2 instead of cat would be fine there too.
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
I can't think of a solution that doesn't involve nested loops that get increasingly expensive with more columns, and I suspect there might be a clever way to address it. Other questions I've seen about redundant data usually deal with values that are exactly equal.
Thanks!
CodePudding user response:
You can check with factorize, then drop_duplicates:
out = df.loc[:, df.transform(lambda x: x.factorize()[0]).T.drop_duplicates().T.columns]
Out[56]:
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
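For readability, the same one-liner can be unpacked into intermediate steps (a sketch; the codes and keep names are mine, not part of the original answer):
# factorize every column so that equivalent columns get identical integer codes
codes = df.transform(lambda x: x.factorize()[0])
# transpose so columns become rows, drop the duplicate rows, and keep the surviving column labels
keep = codes.T.drop_duplicates().T.columns
out = df.loc[:, keep]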
CodePudding user response:
You can generate the power set of the dataframe's columns and check, for each subset of columns, whether every row is duplicated within that subset. If so, the columns in that subset are redundant.
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

redundant_cols = list(filter(lambda cols: (len(cols) > 1) and (df.duplicated(cols, keep=False).all()),
                             powerset(df.columns)))
print(redundant_cols)
[('cat', 'num'), ('cat', 'cat2'), ('num', 'cat2'), ('cat', 'num', 'cat2')]
The last entry, which is also the longest, contains all of the redundant columns.
out = df.drop(columns=list(redundant_cols[-1][1:]))
print(out)
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
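To see why the duplicated check captures redundancy, it may help to look at a single subset from the power set (a quick illustration, not part of the original answer):
# every (cat, num) combination occurs more than once, so keep=False marks all rows as duplicates
print(df.duplicated(['cat', 'num'], keep=False).all())   # True
# (val, cat) combinations are unique per row, so that subset is not redundant
print(df.duplicated(['val', 'cat'], keep=False).all())   # False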
CodePudding user response:
You can use scikit-learn's LabelEncoder to generate an encoded version of every column. That will make it easy to compare all the columns. Using LabelEncoder I generated this version of your dataframe:
val cat num cat2
0 0 0 0 0
1 1 1 1 1
2 2 1 1 1
3 3 0 0 0
4 4 1 1 1
5 5 1 1 1
6 6 1 1 1
7 7 0 0 0
The full code to compare and find the same columns and remove them is given below:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'val':['a','b','c','d','e','f','g','h'],
'cat':['C','D','D','C','D','D','D','C'],
'num':[1,2,2,1,2,2,2,1],
'cat2':['X','Y','Y','X','Y','Y','Y','X']})
# LabelEncoder will encode all unique values to numbers, which makes the columns easy to compare
temp_df = pd.DataFrame()
le = LabelEncoder()
for column_name in df.columns:
    temp_df[column_name] = le.fit_transform(df[column_name])

# now compare the columns to see which ones are the same
# this double for loop can definitely be improved, it makes redundant comparisons (see the sketch after this block)
duplicate_columns = []
for column_name in df.columns:
    for compare_column_name in df.columns:
        if column_name != compare_column_name and column_name not in duplicate_columns:
            if temp_df[column_name].equals(temp_df[compare_column_name]):
                duplicate_columns.append(compare_column_name)

# now you have the duplicate columns, just remove them
df.drop(duplicate_columns, axis=1, inplace=True)
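As noted in the comment above, the double loop makes redundant comparisons. One way it could be tightened is to compare each column only against the columns that come after it and to skip columns already marked as duplicates (a sketch that replaces the comparison loop above, reusing the same temp_df):
duplicate_columns = set()
columns = list(temp_df.columns)
for i, column_name in enumerate(columns):
    if column_name in duplicate_columns:
        continue
    for compare_column_name in columns[i + 1:]:
        if compare_column_name not in duplicate_columns and temp_df[column_name].equals(temp_df[compare_column_name]):
            duplicate_columns.add(compare_column_name)

df.drop(columns=list(duplicate_columns), inplace=True)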
CodePudding user response:
This is really a duplicate of this one, where a correlation matrix is used and a custom threshold can be passed to filter similar columns (in this case .95, but you could also use .999).
The only thing you are missing is converting your text to numeric values, which you could do with factorize, category codes, or any of the other many methods from pandas or sklearn, etc.
import numpy as np
import pandas as pd
df = pd.DataFrame({'val':['a','b','c','d','e','f','g','h'],
'cat':['C','D','D','C','D','D','D','C'],
'num':[1,2,2,1,2,2,2,1],
'cat2':['X','Y','Y','X','Y','Y','Y','X']})
dfc = df.astype('category').apply(lambda x: x.cat.codes).corr().abs()
upper = dfc.where(np.triu(np.ones(dfc.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
df.drop(to_drop, axis=1, inplace=True)
print(df)
Output
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
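If you prefer factorize over category codes for the numeric conversion, the encoding step could equally be written as below (a sketch, equivalent for this data; the rest of the answer stays the same):
dfc = df.apply(lambda s: pd.factorize(s)[0]).corr().abs()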
CodePudding user response:
For a faster way based on pandas.factorize, hash the resulting array and use it as a dictionary key:
df[{hash(pd.factorize(df[c])[0].data.tobytes()): c for c in df.columns[::-1]}.values()]
NB. If you have duplicated column names, use instead:
df.iloc[:, list({hash(pd.factorize(df.iloc[:, i])[0].data.tobytes()): i
for i in range(df.shape[1])}.values())]
output:
cat val
0 C a
1 D b
2 D c
3 C d
4 D e
5 D f
6 D g
7 C h
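Unpacked into explicit steps, the dictionary trick works like this (a sketch; seen is my own name, not part of the original answer):
# hash the factorized codes of each column; redundant columns collide on the same key
seen = {}
for c in df.columns[::-1]:          # reversed, so the leftmost column of each group wins
    key = hash(pd.factorize(df[c])[0].data.tobytes())
    seen[key] = c                   # later assignments overwrite earlier ones
out = df[list(seen.values())]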
Execution speed:
# factorize hash
737 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# factorize drop_duplicates
2.25 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On 20 times more columns:
# factorize hash
6.53 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# factorize drop_duplicates
13 ms ± 781 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)