Pyspark or Pandas, count the number of identical value among columns

Time:06-17

A  B  C  D  #_identical  value
1  1  1  2  3            1
3  3  1  2  2            3
4  4  4  4  4            4
1  2  1  2  2            [1, 2]

Here A, B, C and D are columns of values, '#_identical' is the number of times the most frequent value occurs in the row, and 'value' is the repeated value (or values, when there is a tie).

CodePudding user response:

Here is one approach using a custom function:

import pandas as pd

def count(s):
    # Count occurrences of each value in the row, keep only repeated ones.
    c = s.value_counts()
    c = c[c > 1]
    # The remaining counts are all equal to the top multiplicity,
    # so any one of them is the '#_identical' figure.
    return pd.Series({'#_identical': c.unique().tolist()[0],
                      'value': c.index.to_list()
                     })
df.join(df.apply(count, axis=1))

output:

   A  B  C  D  #_identical   value
0  1  1  1  2            3     [1]
1  3  3  1  2            2     [3]
2  4  4  4  4            4     [4]
3  1  2  1  2            2  [1, 2]
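Note that `count` assumes every row contains at least one repeated value; on a row of all-distinct values such as `[1, 2, 3, 4]`, `c.unique().tolist()[0]` would raise an `IndexError`. A minimal guarded sketch (the fallback values `0` and `[]`, and the name `count_safe`, are my own choices, not from the original answer):

```python
import pandas as pd

def count_safe(s):
    # Count occurrences of each value in the row, keep only repeated ones.
    c = s.value_counts()
    c = c[c > 1]
    if c.empty:
        # No value appears more than once: report 0 and an empty list.
        return pd.Series({'#_identical': 0, 'value': []})
    # value_counts sorts counts in descending order, so the first
    # entry is the top multiplicity.
    return pd.Series({'#_identical': c.iloc[0],
                      'value': c.index.to_list()})

df = pd.DataFrame({'A': [1, 3, 4, 1], 'B': [1, 3, 4, 2],
                   'C': [1, 1, 4, 1], 'D': [2, 2, 4, 2]})
out = df.join(df.apply(count_safe, axis=1))
```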

CodePudding user response:

You can use Counter:

from collections import Counter

# Count the values in each row.
df['counter'] = df.apply(Counter, axis=1)
# Keep the value(s) tied for the highest count in the row...
df['value'] = df['counter'].apply(lambda x: [key for key in x.keys() if x[key] == max(x.values())])
# ...and the corresponding counts, deduplicated so ties give one figure.
df['#_identical'] = df['counter'].apply(lambda x: [x[key] for key in x.keys() if x[key] == max(x.values())])
df['#_identical'] = df['#_identical'].apply(lambda x: list(set(x)))
df.drop(['counter'], axis=1, inplace=True)

output of print(df):

   A  B  C  D   value #_identical
0  1  1  1  2     [1]         [3]
1  3  3  1  2     [3]         [2]
2  4  4  4  4     [4]         [4]
3  1  2  1  2  [1, 2]         [2]

You can convert the lists to scalars if you like, but with real data you will probably want to keep lists, since a row can have several tied values (as in the last row).
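The core of this approach, finding the maximum multiplicity in a row and the values that reach it, can be sketched with `Counter` alone, independent of pandas (the helper name `row_mode` is mine):

```python
from collections import Counter

def row_mode(row):
    # Count each value, then keep the ones tied for the highest count.
    counts = Counter(row)
    top = max(counts.values())
    values = [v for v, c in counts.items() if c == top]
    return top, values

# row_mode([1, 1, 1, 2]) → (3, [1])
# row_mode([1, 2, 1, 2]) → (2, [1, 2])
```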
