If I have a dataframe like the following:
df = pd.DataFrame({'val':['a','b','c','d','e','f','g','h'],
'cat':['C','D','D','C','D','D','D','C'],
'num':[1,2,2,1,2,2,2,1],
'cat2':['X','Y','Y','X','Y','Y','Y','X']})
giving:
val cat num cat2
0 a C 1 X
1 b D 2 Y
2 c D 2 Y
3 d C 1 X
4 e D 2 Y
5 f D 2 Y
6 g D 2 Y
7 h C 1 X
You'll notice that we can determine the columns num and cat2 to be redundant, because the values in cat, num and cat2 always match across the rows: C == 1 == X and D == 2 == Y.
I'd like to identify the columns that are redundant, to ultimately discard them and keep just one representation, like below. num or cat2 instead of cat would be fine there too.
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
I can't think of a solution that doesn't involve nested loops that get increasingly expensive with more columns, and I suspect there might be a clever way to address it. Other questions I've seen about redundant data usually deal with values that are exactly equal.
Thanks!
CodePudding user response:
You can check with factorize, then drop_duplicates:
out = df.loc[:, df.transform(lambda x: x.factorize()[0]).T.drop_duplicates().T.columns]
Out[56]:
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
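For readability, the same one-liner can be unpacked into intermediate steps (a sketch; the codes and keep names are mine, not part of the original answer):
# factorize every column so that equivalent columns get identical integer codes
codes = df.transform(lambda x: x.factorize()[0])
# transpose so columns become rows, drop the duplicate rows, and keep the surviving column labels
keep = codes.T.drop_duplicates().T.columns
out = df.loc[:, keep]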
CodePudding user response:
You can generate the power set of the dataframe's columns and check, for each subset of columns, whether every row is duplicated within that subset. If so, the columns in that subset are redundant.
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

redundant_cols = list(filter(lambda cols: (len(cols) > 1) and (df.duplicated(cols, keep=False).all()),
                             powerset(df.columns)))
print(redundant_cols)
[('cat', 'num'), ('cat', 'cat2'), ('num', 'cat2'), ('cat', 'num', 'cat2')]
The last entry, which is also the longest, contains all of the redundant columns.
out = df.drop(columns=list(redundant_cols[-1][1:]))
print(out)
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
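To see why the duplicated check captures redundancy, it may help to look at a single subset from the power set (a quick illustration, not part of the original answer):
# every (cat, num) combination occurs more than once, so keep=False marks all rows as duplicates
print(df.duplicated(['cat', 'num'], keep=False).all())   # True
# (val, cat) combinations are unique per row, so that subset is not redundant
print(df.duplicated(['val', 'cat'], keep=False).all())   # False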
CodePudding user response:
You can use scikit-learn's LabelEncoder to generate an encoded version of every column. That will make it easy to compare all the columns. Using LabelEncoder I generated this version of your dataframe:
val cat num cat2
0 0 0 0 0
1 1 1 1 1
2 2 1 1 1
3 3 0 0 0
4 4 1 1 1
5 5 1 1 1
6 6 1 1 1
7 7 0 0 0
The full code to compare and find the same columns and remove them is given below:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'val':['a','b','c','d','e','f','g','h'],
'cat':['C','D','D','C','D','D','D','C'],
'num':[1,2,2,1,2,2,2,1],
'cat2':['X','Y','Y','X','Y','Y','Y','X']})
# LabelEncoder will encode all unique values to numbers, which makes the columns easy to compare
temp_df = pd.DataFrame()
le = LabelEncoder()
for column_name in df.columns:
    temp_df[column_name] = le.fit_transform(df[column_name])

# now compare the columns to see which ones are the same
# this double for loop can definitely be improved, it makes redundant comparisons (see the sketch after this block)
duplicate_columns = []
for column_name in df.columns:
    for compare_column_name in df.columns:
        if column_name != compare_column_name and column_name not in duplicate_columns:
            if temp_df[column_name].equals(temp_df[compare_column_name]):
                duplicate_columns.append(compare_column_name)

# now you have the duplicate columns, just remove them
df.drop(duplicate_columns, axis=1, inplace=True)
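As noted in the comment above, the double loop makes redundant comparisons. One way it could be tightened is to compare each column only against the columns that come after it and to skip columns already marked as duplicates (a sketch that replaces the comparison loop above, reusing the same temp_df):
duplicate_columns = set()
columns = list(temp_df.columns)
for i, column_name in enumerate(columns):
    if column_name in duplicate_columns:
        continue
    for compare_column_name in columns[i + 1:]:
        if compare_column_name not in duplicate_columns and temp_df[column_name].equals(temp_df[compare_column_name]):
            duplicate_columns.add(compare_column_name)

df.drop(columns=list(duplicate_columns), inplace=True)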
CodePudding user response:
This is really a duplicate of this one, where a correlation matrix is used and a custom threshold can be passed to filter similar columns (in this case .95, but you could also use .999).
The only thing you are missing is converting your text to numeric values, which you could do with factorize, category codes, or any of the other many methods from pandas or sklearn, etc.
import numpy as np
import pandas as pd
df = pd.DataFrame({'val':['a','b','c','d','e','f','g','h'],
'cat':['C','D','D','C','D','D','D','C'],
'num':[1,2,2,1,2,2,2,1],
'cat2':['X','Y','Y','X','Y','Y','Y','X']})
dfc = df.astype('category').apply(lambda x: x.cat.codes).corr().abs()
upper = dfc.where(np.triu(np.ones(dfc.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
df.drop(to_drop, axis=1, inplace=True)
print(df)
Output
val cat
0 a C
1 b D
2 c D
3 d C
4 e D
5 f D
6 g D
7 h C
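If you prefer factorize over category codes for the numeric conversion, the encoding step could equally be written as below (a sketch, equivalent for this data; the rest of the answer stays the same):
dfc = df.apply(lambda s: pd.factorize(s)[0]).corr().abs()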
CodePudding user response:
For a faster way based on pandas.factorize, hash the resulting array and use it as a dictionary key:
df[{hash(pd.factorize(df[c])[0].data.tobytes()): c for c in df.columns[::-1]}.values()]
NB. If you have duplicated column names, use instead:
df.iloc[:, list({hash(pd.factorize(df.iloc[:, i])[0].data.tobytes()): i
for i in range(df.shape[1])}.values())]
output:
cat val
0 C a
1 D b
2 D c
3 C d
4 D e
5 D f
6 D g
7 C h
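Unpacked into explicit steps, the dictionary trick works like this (a sketch; seen is my own name, not part of the original answer):
# hash the factorized codes of each column; redundant columns collide on the same key
seen = {}
for c in df.columns[::-1]:          # reversed, so the leftmost column of each group wins
    key = hash(pd.factorize(df[c])[0].data.tobytes())
    seen[key] = c                   # later assignments overwrite earlier ones
out = df[list(seen.values())]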
Execution speed:
# factorize hash
737 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# factorize drop_duplicates
2.25 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On 20 times more columns:
# factorize hash
6.53 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# factorize drop_duplicates
13 ms ± 781 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)