I have a dataframe with two columns. How can I randomly split it by the "id" column in a 70/30 ratio, so that each unique id counts only once? For example, id 7 occurs in 3 rows, but it should still count as just 1 of the 10 ids when the ratio is computed.
The question "How to split data into 3 sets (train, validation and test)?" does not help in this case.
import pandas as pd
d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
df = pd.DataFrame(data=d)
So one possible output df1_30 would be:
>>> df1_30
id col2
0 1 3
2 3 5
3 3 7
11 9 4
Another possible output df1_30 (just for clarification):
>>> df1_30
id col2
0 1 3
10 8 11
11 9 4
CodePudding user response:
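Hopefully the code below helps. len_per is 30 percent of the number of unique ids, rounded down. That many ids are sampled at random; every row with a sampled id goes into df1_30, and all remaining rows go into df1_70.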
import pandas as pd
import random
d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
df = pd.DataFrame(data=d)
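# 30 percent of the number of unique ids, rounded down (integer arithmetic avoids float rounding)
len_per = df["id"].nunique() * 30 // 100
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
ids = random.sample(df["id"].unique().tolist(), len_per)
# Rows whose id was sampled form the 30% part; all other rows form the 70% part
df1_30 = df[df["id"].isin(ids)]
df1_70 = df[~df["id"].isin(ids)]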
Output (one possible df1_30):
id col2
1 2 4
5 5 9
11 9 4
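If a reproducible split is wanted, the same idea can probably be written with pandas alone. The following is only a sketch: the helper name split_by_id and its parameters id_col, frac and seed are made up for illustration and are not part of pandas. It samples from the de-duplicated ids, so an id that occurs in several rows is still drawn only once, and it keeps all rows of an id together in the same part.

```python
import pandas as pd


def split_by_id(df, id_col="id", frac=0.3, seed=None):
    """Hypothetical helper: split df so about `frac` of the unique ids land in the first part."""
    # One entry per distinct id, so repeated ids count only once.
    unique_ids = df[id_col].drop_duplicates()
    # Draw a random fraction of the ids; seed makes the draw repeatable.
    picked = unique_ids.sample(frac=frac, random_state=seed)
    # All rows of a picked id go to the first part, the rest to the second.
    mask = df[id_col].isin(picked)
    return df[mask], df[~mask]


d = {'id': [1,2,3,3,4,5,6,7,7,7,8,9,10,10], 'col2': [3,4,5,7,8,9,1,5,9,10,11,4,1,7]}
df = pd.DataFrame(data=d)

df1_30, df1_70 = split_by_id(df, seed=0)
print(df1_30)
print(df1_70)
```

With the example data, df1_30 should contain 3 of the 10 ids and df1_70 the other 7. The number of rows in each part still varies with which ids are drawn.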