So, I have a data frame of this type:
Name 1 2 3 4 5
Alex 10 40 20 11 50
Alex 10 60 20 11 60
Sam 30 15 50 15 60
Sam 30 12 50 15 43
John 50 18 100 8 32
John 50 15 100 8 21
I am trying to keep only the columns that have repeated values for all unique row values. For example, in this case, I want to keep columns 1,3,4 because they have repeated values for each 'duplicate' row. But I want to keep the column only if the values are repeated for EACH pair of names - so, the whole column should consist of pairs of same values. Any ideas of how to do that?
CodePudding user response:
Using a simple list
inside agg
:
cond = df.groupby('Name').agg(list).applymap(lambda x: len(x) != len(set(x)))
dupe_cols = cond.columns[cond.all()]
CodePudding user response:
this is the easiest way I can think of
from collections import Counter
import pandas as pd
data = [[ 'Name', 1, 2, 3, 4, 5],
[ 'Alex', 10, 40, 20, 11, 50],
[ 'Alex', 10, 60, 20, 11, 60],
[ 'Sam', 30, 15, 50, 15, 60],
[ 'Sam', 30, 12, 50, 15, 43],
[ 'John', 50, 18, 100, 8, 32],
[ 'John', 50, 15, 100, 8, 21]]
df = pd.DataFrame(data)
vals = []
for row in range(0,len(df)):
tmp = Counter(df.iloc[row])
if 2 not in tmp.values():
vals.append(row)
ndf = df.iloc[vals]
ndf.drop_duplicates(subset='Name',keep='first')
returns
Name 1 2 3 4 5
1 Alex 10 40 20 11 50
4 Sam 30 12 50 15 43
5 John 50 18 100 8 32