Home > Enterprise >  FInding same values on a list (or tupple?), group them together, check the rest of elements
FInding same values on a list (or tupple?), group them together, check the rest of elements

Time:12-18

I have an excel file that I'm taking data from. The data is basically a table of a day, and people playing together

day pl1 pl2
Mon 1000 1011
Tue 1100 0101
Tue 1000 0121
Wed 0101 1101
Mon 0210 1212
Wed 1000 0101

I want to check that no id is playing more than once(regardless if it is in column of pl1 or pl2) in one day. For example , in Wed, we have "0101" two times, one as pl1, and one in pl1, and I want to catch this.

And I'm looking of which would be the fastest and more pythonic way.

  1. I have thought of checking all elements of pl1 list, and pl2 and if I find same value anywhere, check if the value on the column of the day, is same, or not. However, not only I think this would be extremely slow to process, I also think is more complicate to check

  2. The other thought is to move them in lists of lists [[Mon,1000,1012],[Tue,1110,0101]...] and group them by day and then, check also the rest of elements? Still seems that too much time.

  3. Should I create a list of touples instead ? (I already use tupples and check that there is no identical pairs (same id and in same position pl1 an pl2)

Is there any fastest and more compact way ?

Thanks

CodePudding user response:

I think this is a prime application for pandas.DataFrame.groupby. Given the following DataFrame df:

   day    pl1    pl2
0  Mon  '1000'  '1011'
1  Tue  '1100'  '0101'
2  Tue  '1000'  '0121'
3  Wed  '0101'  '1101'
4  Mon  '0210'  '1212'
5  Wed  '1000'  '0101'

You can set_index to 'day', concatenate the player columns and then groupby on day and players, the find the size of each group. This will produce how many times a player played every day. Then we can filter for the players who appear more than once.

import pandas as pd
out = df.set_index('day')
out = pd.concat([out['pl1'], out['pl2']]).reset_index().groupby(['day', 0]).size()
out = out[out>1]

Output:

day    0  
Wed  '0101'   2
dtype: int64

Another option is to use a dictionary. We can first create dictionary temp which stores player IDs for every day. Then we can use collections.Counter to count how many times each player appeared in a given day and filter for the players who appeared more than once.

from collections import Counter
temp = {}
for d in df.to_dict('records'):
    temp.setdefault(d['day'], []).extend([d['pl1'],d['pl2']])

out = {}
for day, lst in temp.items():
    out[day] = {pl: v for pl, v in Counter(lst).items() if v > 1}

    

Output:

{'Mon': {}, 'Tue': {}, 'Wed': {'0101': 2}}

CodePudding user response:

I like this question, it's a nice little brain teaser!

There are probably module tools out there in pandas that can do something sql-like with a self join or the like, but I wanted to see if it could easily be done with just core Python.

The first way that occurred to me was to create a list of tuples, where each tuple is a pair of values, the first being the day and the 2nd being one of the players. Each line of values would then generate 2 tuples, one for player 1 and one for player 2. I.e., the first line of data:

Mon,1000,1011

would be put into 2 tuples:

('Mon',1000),('Mon',1011)

Add all of those tuples to a list. Then search the list for duplicates, which is a common enough task. Assuming the player data was in a csv file called "players.csv" like this:

day,pl1,pl2
Mon,1000,1011
Tue,1100,0101
Tue,1000,0121
Wed,0101,1101
Mon,0210,1212
Wed,1000,0101

the code below should return the information you are looking for:

# import data
playerFile = open('players.csv','r')
playerData = playerFile.readlines()[1:]
playerFile.close()

# create list of tuples, 1 tuple for each player in line of data
plays = []
for line in playerData:
    day,p1,p2 = line[:-1].split(',')
    plays.extend([(day,p1),(day,p2)])

# now check the list for duplicate tuples
# since each tuple is a day that a player played, 
# if a player played twice on 1 day, it'll be a duplicate
seen = set()
dupes = set()

for play in plays:
    if play in seen:
        dupes.add(play)
    else:
        seen.add(play)
        
# print duplicates
for dupe in dupes:
    print(dupe)

Output:

('Wed', '0101')

(Note that I had to use tuples instead of lists for each "play", since I wanted to only identify each duplicate once, i.e. if player 0101 played 3 times on one day, I only wanted one item. A set does that nicely, but you can't add lists to sets since they are mutable)

  • Related