I have a DataFrame containing columns that overlap in a sense:
import pandas as pd
df = pd.DataFrame({
'Date': ['2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02'],
'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
'Home': [True, False, True, False, False, True],
'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
'Rslt': ['L', 'W', 'L', 'W', 'W', 'L']
})
df['Date'] = pd.to_datetime(df['Date'])
print(df)
OUTPUT:
Date Team Home Opp Rslt
0 2017-04-02 CHC True STL L
1 2017-04-02 ARI False SFG W
2 2017-04-02 NYY True TBR L
3 2017-04-02 TBR False NYY W
4 2017-04-02 STL False CHC W
5 2017-04-02 SFG True ARI L
For the date of 2017-04-01
, there were 3 games played. The DataFrame contains the game results for each day for each team. This results in 6 results. Take row 2
and 3
, this is a game between NYY and TBR:
- Row
2
gives the NYY result ofL
, meaning they lost - Row
3
gives the TBR result ofW
, meaning they won
What I'm trying to do is group all row pairs that relate to the same game. My initial idea was to create a new column that would act as a label for the pair and then use that to group on or set MultiIndex
. I thought about it and considered concatenating the three columns into a single string for each row and then, using sets, look through all rows for each date in Date
and find the other row that contains the same characters:
df['Match'] = df['Date'].dt.strftime('%Y-%m-%d') ',' df['Team'] ',' df['Opp']
print(df)
OUTPUT:
Date Team Home Opp Rslt Match
0 2017-04-02 CHC True STL L 2017-04-02,CHC,STL
1 2017-04-02 ARI False SFG W 2017-04-02,ARI,SFG
2 2017-04-02 NYY True TBR L 2017-04-02,NYY,TBR
3 2017-04-02 TBR False NYY W 2017-04-02,TBR,NYY
4 2017-04-02 STL False CHC W 2017-04-02,STL,CHC
5 2017-04-02 SFG True ARI L 2017-04-02,SFG,ARI
From here, I'm not sure how to proceed. I have a method in mind using sets that I've used in the past. If we focus on row 2
and 3
again, subtracting the sets of the string, split using the ,
, and taking the bool()
will return False
for two sets containing the same string elements and True
for anything else (different sets):
print(
bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,TBR,NYY'.split(',')))
)
print(
bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,CHC,STL'.split(',')))
)
OUTPUT:
False
True
Is there a better way to take a row value in a column and lookup all other row values in that same column and label the rows where they are related? The kind of label I would like to have is creating a unique numbering of games. Since these three games happen on the same day, labelling the pairs as 1, 2, 3
would be great so that each game pair for each day has a unique ID.