I have the following pandas dataframe:
and would like to remove the duplicate rows. For example, Atlanta Falcons/Jacksonville Jaguars also appears as Jacksonville Jaguars/Atlanta Falcons. What is the best way to do so?
Thanks!
CodePudding user response:
The code that will do the trick for you is this one:
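import numpy as np
# Put each pair of teams in alphabetical order so A/B and B/A produce the same key
df["team_a"] = np.minimum(df["team1"], df["team2"])
df["team_b"] = np.maximum(df["team1"], df["team2"])
# Keep the first row for each season/week/pair, then drop the helper columns
df.drop_duplicates(["season", "week", "team_a", "team_b"], inplace=True)
df.drop(columns=["team_a", "team_b"], inplace=True)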
Before doing this, please check your data: in the rows where team1 and team2 are swapped, the team1_score and team2_score columns are not swapped with them, so the result may be confusing after you remove one of the rows.
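If the scores should follow the teams, one option is to put every row into the same team order and move the scores along with the names before dropping duplicates. This is only a minimal sketch: the sample rows and the names games, swap and deduped are made up for illustration, and it assumes that each row's team1_score and team2_score belong to that row's team1 and team2.

```python
import pandas as pd

# Illustrative sample: the same game listed twice with the teams swapped.
games = pd.DataFrame({
    "season": [2021, 2021, 2021],
    "week": [12, 12, 12],
    "team1": ["Atlanta Falcons", "Jacksonville Jaguars", "Dallas Cowboys"],
    "team2": ["Jacksonville Jaguars", "Atlanta Falcons", "Las Vegas Raiders"],
    "team1_score": [21, 14, 33],
    "team2_score": [14, 21, 36],
})

# Rows where team1 sorts after team2 get both names and scores swapped.
swap = games["team1"] > games["team2"]
t1, t2 = games["team1"].copy(), games["team2"].copy()
s1, s2 = games["team1_score"].copy(), games["team2_score"].copy()
games["team1"] = t1.where(~swap, t2)
games["team2"] = t2.where(~swap, t1)
games["team1_score"] = s1.where(~swap, s2)
games["team2_score"] = s2.where(~swap, s1)

# Every matchup now has a single spelling, so a plain drop_duplicates works.
deduped = games.drop_duplicates(["season", "week", "team1", "team2"]).reset_index(drop=True)
print(deduped)
```

After this step the team names are in alphabetical order within each row, so the original home/away order (if team1 and team2 carried that meaning) is lost.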
CodePudding user response:
Because OP did not provide a reproducible dataset:
import pandas as pd
# dataset where the 1st and 5th observations are team A vs team F:
df = pd.DataFrame({
"season": [2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021],
"week": [12, 12, 12, 12, 12, 13, 13, 13, 13, 13],
"team1": ["A", "B", "C", "D", "F", "A", "B", "C", "D", "F"],
"team2": ["F", "G", "H", "I", "A", "F", "G", "H", "I", "A"]
})
df
   season  week team1 team2
0    2021    12     A     F
1    2021    12     B     G
2    2021    12     C     H
3    2021    12     D     I
4    2021    12     F     A
5    2021    13     A     F
6    2021    13     B     G
7    2021    13     C     H
8    2021    13     D     I
9    2021    13     F     A
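# solution: sort each team pair so A/F and F/A share one key, then keep the first row per season, week and pair:
pair = df[["team1", "team2"]].apply(lambda row: tuple(sorted(row)), axis=1)
df[~df.assign(pair=pair).duplicated(["season", "week", "pair"])].reset_index(drop=True)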
   season  week team1 team2
0    2021    12     A     F
1    2021    12     B     G
2    2021    12     C     H
3    2021    12     D     I
4    2021    13     A     F
5    2021    13     B     G
6    2021    13     C     H
7    2021    13     D     I
CodePudding user response:
This should do the trick:
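# Build an order-independent key so that A/F and F/A produce the same value
df["no_duplicates"] = ["/".join(sorted(pair)) for pair in zip(df["team1"], df["team2"])]
# drop_duplicates returns a new frame, so assign the result back
df = df.drop_duplicates(["season", "week", "no_duplicates"])
df["no_duplicates"]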