How to iterate a row and compare with each other?-CodePudding

For example, if the name 'Mark' is repeated and if the 'Age' value or 'Gender' value is different when compared to other rows which contain 'Mark', then such columns must be removed.

Code for creating dataframe:

df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'], 
                  'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
                  'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})

The dataframe is below:

       Name   Age   Gender
0      Mark   22    F
1      Mark   22    F
2      Mark   25    F
3      Mark   25    F
4      Mark   17    F
5      Nick   20    F
6      Nick   20    F
7      John   17    M
8      Sunny  23    M
9      Sunny  23    M

The expected output is:

       Name   Age   Gender
0      Nick   20    F
1      Nick   20    F
2      John   17    M
3      Sunny  23    M
4      Sunny  23    M

For example in the 1st and 2nd rows Name is Mark and Age is 22 and Gender is F. Whereas in 3rd-row Name is Mark, Age is 25 and Gender is F, we can see Age has multiple values, then we have to remove rows which contains Mark

CodePudding user response：

OK, this is tricky, so I'll explain each line.

import pandas as pd

df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'], 
                  'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
                  'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})

print(df)

# First, drop all the rows that are exact duplicates of each other.

df1 = df.drop_duplicates()
print(df1)

# If any rows that remain have a duplicate name, those need to go.

df2 = df1.duplicated('Name')
print(df2)

# Go build a DF with just the names that were duplicated.

df3 = df1[df2]['Name']
print(df3)

# Find all the rows in the original df that have a name in this list,
# invert that set, and the result is what we want.

df4 = df[~df.Name.isin(df3)]
print(df4)

Several of those operations can be combined to make a one-liner, but I think this makes it more clear. I've added blank lines to separate the dataframes in the output.

Output:

    Name Age Gender
0   Mark  22      F
1   Mark  22      F
2   Mark  25      F
3   Mark  25      F
4   Mark  17      F
5   Nick  20      F
6   Nick  20      F
7   John  17      M
8  Sunny  23      M
9  Sunny  23      M

    Name Age Gender
0   Mark  22      F
2   Mark  25      F
4   Mark  17      F
5   Nick  20      F
7   John  17      M
8  Sunny  23      M

0    False
2     True
4     True
5    False
7    False
8    False
dtype: bool

2    Mark
4    Mark
Name: Name, dtype: object

    Name Age Gender
5   Nick  20      F
6   Nick  20      F
7   John  17      M
8  Sunny  23      M
9  Sunny  23      M

CodePudding user response：

A one-liner answer:

df[df.Name.isin(s:=(df.drop_duplicates().groupby(['Name']).size().eq(1)).index[s])]

results in:

    Name Age Gender
5   Nick  20      F
6   Nick  20      F
7   John  17      M
8  Sunny  23      M
9  Sunny  23      M

Explanation:

# from the OG df
df[
   # boolean mask if the name is in
   df.Name.isin(
                # walrus-operator to temporarily hold result
                # drop the duplicate rows (duplicates name   age   gender)
                s:=(df.drop_duplicates()
                       # group on Name
                       .groupby(['Name'])
                       # after dropping duplicates, there's only one record
                       # i.e. no different age/gender records for same name
                       .size().eq(1))
                # mask on names where no-duplicates is true, drop names where false
                ).index[s]
               # pass as series to `df.Name.isin`
               )
  ]