I have the following dataframe:
import pandas as pd

df = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "year": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "overall": [20, 16, 7, 3, 8, 80, 20, 12, 9, 3, 2, 1]})
What is the easiest way to shuffle it randomly while keeping the rows of each player_id together, e.g.
player_id | year | overall |
---|---|---|
4 | 1 | 20 |
4 | 2 | 12 |
1 | 1 | 20 |
1 | 2 | 16 |
... | ... | ... |
And then split it 80-20 into a training set and a test set that don't share any player_id?
CodePudding user response:
As Quang Hoang suggested in the comments, you can split the ids first and then select the data based on those ids.
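# Draw 20% of the unique player ids for the test set
test_ids = df.player_id.drop_duplicates().sample(frac=0.2).values
#-> e.g. array([2]); the draw is random, so pass random_state to sample() to reproduce it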
train_data = df[~df["player_id"].isin(test_ids)]
"""
player_id year overall
0 1 1 20
1 1 2 16
4 3 1 8
5 3 2 80
6 4 1 20
7 4 2 12
8 5 1 9
9 5 2 3
10 6 1 2
11 6 2 1
"""
test_data = df[df["player_id"].isin(test_ids)]
"""
player_id year overall
2 2 1 7
3 2 2 3
"""