I have a pandas DataFrame in which id values may be duplicated. I would like to keep one row per id, preferring the row that has the value yes in the ans column when one exists.
import pandas as pd
import numpy as np
data = {
    'id': [1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 9, 10],
    'ans': ['no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)
df.head(n = 8)
The expected output is:
data2 = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'ans': ['yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no']
}
df2 = pd.DataFrame(data2)
df2.head(n = 10)
Thanks in advance!
CodePudding user response:
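If you only want the rows where ans is 'yes', you could use either of the following. Note that this also drops ids that have no 'yes' row at all (3, 4, 7 and 10 here), so it does not reproduce the expected output.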
df.query("ans=='yes'")
or
df.loc[df.ans == 'yes',:]
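As a quick check, here is a minimal sketch that reuses the data from the question and confirms that both forms select the same rows, and which ids disappear. The variable names by_query, by_loc, same and missing_ids are only illustrative.

```python
import pandas as pd

data = {
    'id': [1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 9, 10],
    'ans': ['no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)

# Two equivalent ways to keep only the rows where ans is 'yes'
by_query = df.query("ans == 'yes'")
by_loc = df.loc[df['ans'] == 'yes', :]

# True if both selections contain exactly the same rows
same = by_query.equals(by_loc)

# ids that have no 'yes' row and are therefore dropped entirely
missing_ids = sorted(set(df['id'].tolist()) - set(by_query['id'].tolist()))

print(same)
print(missing_ids)
```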
CodePudding user response:
If I understand correctly (IIUC), to keep all rows of every id that has at least one 'yes', use:
df = pd.DataFrame(data)
df = df[df['id'].isin(df.loc[df['ans'].eq('yes'), 'id'])]
print(df)
    id  ans
0    1   no
1    1  yes
2    2  yes
5    5  yes
6    5   no
7    6  yes
9    8   no
10   8  yes
11   9   no
12   9  yes
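The same filter can probably also be written with a per-group flag instead of isin. This is only a sketch, assuming the data from the question; has_yes and kept are illustrative names, and the approach (groupby with transform('any')) is an alternative, not necessarily what was intended above.

```python
import pandas as pd

data = {
    'id': [1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 9, 10],
    'ans': ['no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)

# For each row: does its id group contain at least one 'yes'?
has_yes = df['ans'].eq('yes').groupby(df['id']).transform('any')

# Keep every row belonging to such an id (both the 'yes' and the 'no' rows)
kept = df[has_yes]

# The isin-based version, for comparison
kept_isin = df[df['id'].isin(df.loc[df['ans'].eq('yes'), 'id'])]

print(kept.equals(kept_isin))
```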
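Or, to keep a single row per id, taking the 'yes' row where one exists and the first row otherwise (this matches the expected output apart from the index):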
df = pd.DataFrame(data)
df = df.loc[df['ans'].eq('yes').groupby(df['id']).idxmax()]
print(df)
    id  ans
1    1  yes
2    2  yes
3    3   no
4    4   no
5    5  yes
7    6  yes
8    7   no
10   8  yes
12   9  yes
13  10   no
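If you also want the index renumbered as in the expected output from the question, one possible alternative is to sort so that 'yes' rows come first within each id and then drop duplicate ids. This is a sketch under that assumption; the helper column is_yes and the name result are made up for illustration.

```python
import pandas as pd

data = {
    'id': [1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 9, 10],
    'ans': ['no', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no']
}
df = pd.DataFrame(data)

result = (
    df.assign(is_yes=df['ans'].eq('yes'))                      # temporary flag column
      .sort_values(['id', 'is_yes'], ascending=[True, False])  # 'yes' rows first within each id
      .drop_duplicates('id')                                   # keep the first row per id
      .drop(columns='is_yes')                                  # remove the temporary flag
      .reset_index(drop=True)                                  # renumber the index from 0
)

print(result)
```

Compared with the idxmax version, this returns a fresh 0-based index, which makes it easier to compare directly against df2.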