Home > Back-end >  Filtering pandas dataframe based on repeated column values - Python
Filtering pandas dataframe based on repeated column values - Python

Time:11-17

So, I have a data frame of this type:

               Name   1   2   3   4   5  
               Alex  10  40  20  11  50
               Alex  10  60  20  11  60
                Sam  30  15  50  15  60
                Sam  30  12  50  15  43 
               John  50  18 100   8  32
               John  50  15 100   8  21
                    

I am trying to keep only the columns that have repeated values for all unique row values. For example, in this case, I want to keep columns 1,3,4 because they have repeated values for each 'duplicate' row. But I want to keep the column only if the values are repeated for EACH pair of names - so, the whole column should consist of pairs of same values. Any ideas of how to do that?

CodePudding user response:

Using a simple list inside agg:

cond = df.groupby('Name').agg(list).applymap(lambda x: len(x) != len(set(x)))

dupe_cols = cond.columns[cond.all()]

CodePudding user response:

this is the easiest way I can think of

from collections import Counter

import pandas as pd

data = [[   'Name',   1,   2,   3,   4,   5],
[   'Alex',  10,  40,  20,  11,  50],
[   'Alex',  10,  60,  20,  11,  60],
[    'Sam',  30,  15,  50,  15,  60],
[    'Sam',  30,  12,  50,  15,  43],
[   'John',  50,  18, 100,   8,  32],
[   'John',  50,  15, 100,   8,  21]]

df = pd.DataFrame(data)

vals = []
for row in range(0,len(df)):
    tmp = Counter(df.iloc[row])
    if 2 not in tmp.values():
        vals.append(row)
        
ndf = df.iloc[vals]
ndf.drop_duplicates(subset='Name',keep='first')

returns


Name    1   2   3   4   5
1   Alex    10  40  20  11  50
4   Sam 30  12  50  15  43
5   John    50  18  100 8   32
  • Related