How to split a dataframe randomly into a given ratio according to id

Time:12-30

I have a dataframe with two columns. How can I split it randomly in a 70/30 ratio according to the column "id"? Each id should count only once towards the ratio: id 7, for example, occurs in 3 rows but still counts as only 1 of the 10 ids.

The question "How to split data into 3 sets (train, validation and test)?" does not help in this case.

import pandas as pd
d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
df = pd.DataFrame(data=d)

So a possible output df1_30 would be:

>>> df1_30
     id   col2
0    1    3
2    3    5
3    3    7
11   9    4

Another possible output of df1_30 could also be (just for clarification):

>>> df1_30
     id   col2
0    1    3
10   8    11
11   9    4

CodePudding user response:

Hopefully the code below will help you. len_per is 30 percent of the total number of unique ids you have.

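 import random
 import pandas as pd

 d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
 df = pd.DataFrame(data=d)

 # Work on the unique ids so that a repeated id (e.g. 7) counts only once
 unique_ids = list(df["id"].unique())
 # 30 percent of the number of unique ids (3 of 10 here)
 len_per = int(len(unique_ids) * 30 / 100)
 # random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
 ids = random.sample(unique_ids, len_per)

 # Rows whose id was drawn go to the 30% part, all other rows to the 70% part
 df1_30 = df[df["id"].isin(ids)]
 df1_70 = df[~df["id"].isin(ids)]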

Output (one possible df1_30, since the ids are drawn randomly):

   id   col2
1   2   4
5   5   9
11  9   4
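
If the split needs to be repeatable, the same idea can be wrapped in a small function with a seed. This is only a sketch: the function name split_by_id and its parameters (id_col, frac, seed) are made up for illustration and are not part of pandas. It draws the ids with NumPy's random generator instead of the random module, and rounds the number of ids to the nearest integer.

```python
import numpy as np
import pandas as pd


def split_by_id(df, id_col="id", frac=0.3, seed=None):
    """Hypothetical helper: put about `frac` of the unique ids into the first part."""
    rng = np.random.default_rng(seed)           # seeded generator for repeatable splits
    unique_ids = df[id_col].unique()            # each id counts once, however many rows it has
    n_small = round(len(unique_ids) * frac)     # number of ids for the smaller part
    chosen = rng.choice(unique_ids, size=n_small, replace=False)
    mask = df[id_col].isin(chosen)              # True for rows whose id was drawn
    return df[mask], df[~mask]


d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
df = pd.DataFrame(data=d)

df1_30, df1_70 = split_by_id(df, frac=0.3, seed=0)
print(df1_30["id"].nunique(), df1_70["id"].nunique())  # prints: 3 7
```

The ratio holds for the number of ids, not for the number of rows, so the row counts of the two parts depend on which ids are drawn. All rows of one id always end up in the same part.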