I have an excel file that I'm taking data from. The data is basically a table of a day, and people playing together
day | pl1 | pl2 |
---|---|---|
Mon | 1000 | 1011 |
Tue | 1100 | 0101 |
Tue | 1000 | 0121 |
Wed | 0101 | 1101 |
Mon | 0210 | 1212 |
Wed | 1000 | 0101 |
I want to check that no id is playing more than once(regardless if it is in column of pl1 or pl2) in one day. For example , in Wed, we have "0101" two times, one as pl1, and one in pl1, and I want to catch this.
And I'm looking of which would be the fastest and more pythonic way.
I have thought of checking all elements of pl1 list, and pl2 and if I find same value anywhere, check if the value on the column of the day, is same, or not. However, not only I think this would be extremely slow to process, I also think is more complicate to check
The other thought is to move them in lists of lists [[Mon,1000,1012],[Tue,1110,0101]...] and group them by day and then, check also the rest of elements? Still seems that too much time.
Should I create a list of touples instead ? (I already use tupples and check that there is no identical pairs (same id and in same position pl1 an pl2)
Is there any fastest and more compact way ?
Thanks
CodePudding user response:
I think this is a prime application for pandas.DataFrame.groupby
. Given the following DataFrame df
:
day pl1 pl2
0 Mon '1000' '1011'
1 Tue '1100' '0101'
2 Tue '1000' '0121'
3 Wed '0101' '1101'
4 Mon '0210' '1212'
5 Wed '1000' '0101'
You can set_index
to 'day'
, concatenate the player columns and then groupby
on day and players, the find the size of each group. This will produce how many times a player played every day. Then we can filter for the players who appear more than once.
import pandas as pd
out = df.set_index('day')
out = pd.concat([out['pl1'], out['pl2']]).reset_index().groupby(['day', 0]).size()
out = out[out>1]
Output:
day 0
Wed '0101' 2
dtype: int64
Another option is to use a dictionary. We can first create dictionary temp
which stores player IDs for every day. Then we can use collections.Counter
to count how many times each player appeared in a given day and filter for the players who appeared more than once.
from collections import Counter
temp = {}
for d in df.to_dict('records'):
temp.setdefault(d['day'], []).extend([d['pl1'],d['pl2']])
out = {}
for day, lst in temp.items():
out[day] = {pl: v for pl, v in Counter(lst).items() if v > 1}
Output:
{'Mon': {}, 'Tue': {}, 'Wed': {'0101': 2}}
CodePudding user response:
I like this question, it's a nice little brain teaser!
There are probably module tools out there in pandas that can do something sql-like with a self join or the like, but I wanted to see if it could easily be done with just core Python.
The first way that occurred to me was to create a list of tuples, where each tuple is a pair of values, the first being the day and the 2nd being one of the players. Each line of values would then generate 2 tuples, one for player 1 and one for player 2. I.e., the first line of data:
Mon,1000,1011
would be put into 2 tuples:
('Mon',1000),('Mon',1011)
Add all of those tuples to a list. Then search the list for duplicates, which is a common enough task. Assuming the player data was in a csv file called "players.csv" like this:
day,pl1,pl2
Mon,1000,1011
Tue,1100,0101
Tue,1000,0121
Wed,0101,1101
Mon,0210,1212
Wed,1000,0101
the code below should return the information you are looking for:
# import data
playerFile = open('players.csv','r')
playerData = playerFile.readlines()[1:]
playerFile.close()
# create list of tuples, 1 tuple for each player in line of data
plays = []
for line in playerData:
day,p1,p2 = line[:-1].split(',')
plays.extend([(day,p1),(day,p2)])
# now check the list for duplicate tuples
# since each tuple is a day that a player played,
# if a player played twice on 1 day, it'll be a duplicate
seen = set()
dupes = set()
for play in plays:
if play in seen:
dupes.add(play)
else:
seen.add(play)
# print duplicates
for dupe in dupes:
print(dupe)
Output:
('Wed', '0101')
(Note that I had to use tuples instead of lists for each "play", since I wanted to only identify each duplicate once, i.e. if player 0101 played 3 times on one day, I only wanted one item. A set does that nicely, but you can't add lists to sets since they are mutable)