For example, if the name 'Mark' is repeated and if the 'Age' value or 'Gender' value is different when compared to other rows which contain 'Mark', then such columns must be removed.
Code for creating dataframe:
df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'],
'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})
The dataframe is below:
Name Age Gender
0 Mark 22 F
1 Mark 22 F
2 Mark 25 F
3 Mark 25 F
4 Mark 17 F
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
The expected output is:
Name Age Gender
0 Nick 20 F
1 Nick 20 F
2 John 17 M
3 Sunny 23 M
4 Sunny 23 M
For example in the 1st and 2nd rows Name is Mark and Age is 22 and Gender is F. Whereas in 3rd-row Name is Mark, Age is 25 and Gender is F, we can see Age has multiple values, then we have to remove rows which contains Mark
CodePudding user response:
OK, this is tricky, so I'll explain each line.
import pandas as pd
df = pd.DataFrame({'Name' : ['Mark', 'Mark', 'Mark', 'Mark', 'Mark', 'Nick', 'Nick', 'John', 'Sunny', 'Sunny'],
'Age' : ['22', '22', '25', '25', '17', '20', '20', '17', '23', '23'],
'Gender' : ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M']})
print(df)
# First, drop all the rows that are exact duplicates of each other.
df1 = df.drop_duplicates()
print(df1)
# If any rows that remain have a duplicate name, those need to go.
df2 = df1.duplicated('Name')
print(df2)
# Go build a DF with just the names that were duplicated.
df3 = df1[df2]['Name']
print(df3)
# Find all the rows in the original df that have a name in this list,
# invert that set, and the result is what we want.
df4 = df[~df.Name.isin(df3)]
print(df4)
Several of those operations can be combined to make a one-liner, but I think this makes it more clear. I've added blank lines to separate the dataframes in the output.
Output:
Name Age Gender
0 Mark 22 F
1 Mark 22 F
2 Mark 25 F
3 Mark 25 F
4 Mark 17 F
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
Name Age Gender
0 Mark 22 F
2 Mark 25 F
4 Mark 17 F
5 Nick 20 F
7 John 17 M
8 Sunny 23 M
0 False
2 True
4 True
5 False
7 False
8 False
dtype: bool
2 Mark
4 Mark
Name: Name, dtype: object
Name Age Gender
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
CodePudding user response:
A one-liner answer:
df[df.Name.isin(s:=(df.drop_duplicates().groupby(['Name']).size().eq(1)).index[s])]
results in:
Name Age Gender
5 Nick 20 F
6 Nick 20 F
7 John 17 M
8 Sunny 23 M
9 Sunny 23 M
Explanation:
# from the OG df
df[
# boolean mask if the name is in
df.Name.isin(
# walrus-operator to temporarily hold result
# drop the duplicate rows (duplicates name age gender)
s:=(df.drop_duplicates()
# group on Name
.groupby(['Name'])
# after dropping duplicates, there's only one record
# i.e. no different age/gender records for same name
.size().eq(1))
# mask on names where no-duplicates is true, drop names where false
).index[s]
# pass as series to `df.Name.isin`
)
]