Home > Net >  How to label rows of DataFrame where the set of column values match?
How to label rows of DataFrame where the set of column values match?

Time:06-30

I have a DataFrame containing columns that overlap in a sense:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02', '2017-04-02'],
    'Team': ['CHC', 'ARI', 'NYY', 'TBR', 'STL', 'SFG'],
    'Home': [True, False, True, False, False, True],
    'Opp': ['STL', 'SFG', 'TBR', 'NYY', 'CHC', 'ARI'],
    'Rslt': ['L', 'W', 'L', 'W', 'W', 'L']
})

df['Date'] = pd.to_datetime(df['Date'])

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt
0 2017-04-02  CHC   True  STL    L
1 2017-04-02  ARI  False  SFG    W
2 2017-04-02  NYY   True  TBR    L
3 2017-04-02  TBR  False  NYY    W
4 2017-04-02  STL  False  CHC    W
5 2017-04-02  SFG   True  ARI    L

For the date of 2017-04-01, there were 3 games played. The DataFrame contains the game results for each day for each team. This results in 6 results. Take row 2 and 3, this is a game between NYY and TBR:

  • Row 2 gives the NYY result of L, meaning they lost
  • Row 3 gives the TBR result of W, meaning they won

What I'm trying to do is group all row pairs that relate to the same game. My initial idea was to create a new column that would act as a label for the pair and then use that to group on or set MultiIndex. I thought about it and considered concatenating the three columns into a single string for each row and then, using sets, look through all rows for each date in Date and find the other row that contains the same characters:

df['Match'] = df['Date'].dt.strftime('%Y-%m-%d')   ','   df['Team']   ','   df['Opp']

print(df)

OUTPUT:

        Date Team   Home  Opp Rslt               Match
0 2017-04-02  CHC   True  STL    L  2017-04-02,CHC,STL
1 2017-04-02  ARI  False  SFG    W  2017-04-02,ARI,SFG
2 2017-04-02  NYY   True  TBR    L  2017-04-02,NYY,TBR
3 2017-04-02  TBR  False  NYY    W  2017-04-02,TBR,NYY
4 2017-04-02  STL  False  CHC    W  2017-04-02,STL,CHC
5 2017-04-02  SFG   True  ARI    L  2017-04-02,SFG,ARI

From here, I'm not sure how to proceed. I have a method in mind using sets that I've used in the past. If we focus on row 2 and 3 again, subtracting the sets of the string, split using the ,, and taking the bool() will return False for two sets containing the same string elements and True for anything else (different sets):

print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,TBR,NYY'.split(',')))
)
print(
    bool(set('2017-04-02,NYY,TBR'.split(',')) - set('2017-04-02,CHC,STL'.split(',')))
)

OUTPUT:

False
True

Is there a better way to take a row value in a column and lookup all other row values in that same column and label the rows where they are related? The kind of label I would like to have is creating a unique numbering of games. Since these three games happen on the same day, labelling the pairs as 1, 2, 3 would be great so that each game pair for each day has a unique ID.

P.S. I've also seen enter image description here

  • Related