I am trying to remove the duplicates in column duplicates
and keep only the records where the value in column name
is equal to "foo". Is there a better way to do it than my approach?
import pandas as pd
df = pd.DataFrame(
{"name": ["foo", "bar", "foo", "baz"], "duplicates": ["qux", "qux", "fred", "fred"]}
)
df["name"] = df["name"].map({"foo": "a"})
df.sort_values(["name", "duplicates"], inplace=True, ascending=True)
df.drop_duplicates("duplicates")
CodePudding user response:
From your solution need also values if not match foo
if not exist per groups by duplicates
, right?
Then solution is use DataFrameGroupBy.idxmax
for first True
s per groups with msk for compare foo
- if not exist get first False
value:
df = pd.DataFrame(
{"name": ["foo", "bar", "foo", "baz","bez"],
"duplicates": ["qux", "qux", "fred", "fred","John"]}
)
print (df)
name duplicates
0 foo qux
1 bar qux
2 foo fred
3 baz fred
4 bez John
df = df.loc[df["name"].eq('foo').groupby(df['duplicates']).idxmax()]
print (df)
name duplicates
4 bez John
2 foo fred
0 foo qux
CodePudding user response:
IIUC, you original df is
import pandas as pd
df = pd.DataFrame(
{"name": ["foo", "bar", "foo", "baz"], "duplicates": ["qux", "qux", "fred", "fred"]}
)
Output is
name | duplicates | |
---|---|---|
0 | foo | qux |
1 | bar | qux |
2 | foo | fred |
3 | baz | fred |
How about this?
df[
df['duplicates']\
.isin(df.groupby('duplicates')\
.size()\
.reset_index(name='count')\
.query('count>1')['duplicates']
)
].query('name=="foo"')
So you will get
name | duplicates | |
---|---|---|
0 | foo | qux |
2 | foo | fred |