Randomly select values from a list column so that all elements across lists are selected

Time:03-04

Say I have a pandas DataFrame with a list column 'event_ids':

code    canceled  event_ids
xxx     [1.0]     [107385, 128281, 133015]
xxS     [0.0]     [108664, 110515, 113556]
ssD     [1.0]     [134798, 133499, 125396, 114298, 133915]
cvS     [0.0]     [107611]
eeS     [5.0]     [113472, 115236, 108586, 128043, 114106, 10796...
544W    [44.0]    [107650, 128014, 127763, 118036, 116247, 12802.

How can I select k rows sufficiently randomly so that every element appearing anywhere in 'event_ids' is represented in the sample? That is, the event vocabulary of the sample should be the same as that of the population. By 'sufficiently' random I mean something like importance sampling: rows are initially drawn at random and then accepted or rejected according to some condition.

CodePudding user response:

It is not clear whether you want to cover each element within the lists in event_ids, or whether each list should be treated as a single unique element. In the latter case, this could work (I have not checked the performance):

Given this dataset:

import numpy as np
import pandas as pd

x = np.random.randint(1, 100, 5000)  # integers from 1 to 99
y = np.random.choice(['A', 'B', 'C', 'D', 'E', 'F'], size=5000)

df = pd.DataFrame({'x': x, 'y': y})
df.head()

Output:
    x   y
0   42  A
1   88  B
2   80  A
3   69  B
4   72  B

Column 'x' has 99 unique values (1 to 99). The goal is a sample in which every unique value of df['x'] appears at least once:

idxs = []

# For each unique value of x, pick one index at random among the rows holding it.
for i in df.x.unique():
    idxs.extend(np.random.choice(df.loc[df['x'] == i].index, size=1))

sample = df.loc[idxs]
len(sample.x.unique())

Output:
99

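Increase the size argument of np.random.choice to draw more rows per unique value. Note that np.random.choice samples with replacement by default, so a size above 1 can return the same row more than once unless you pass replace=False (which in turn requires size to be no larger than the number of matching rows).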
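
The loop above can also be expressed with groupby. This is a minimal sketch, assuming pandas 1.1 or later (where groupby(...).sample is available) and the same toy columns 'x' and 'y'; the fixed seed and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
df = pd.DataFrame({
    'x': rng.integers(1, 100, 5000),        # integers from 1 to 99
    'y': rng.choice(list('ABCDEF'), 5000),
})

# Draw one random row from each group of equal 'x' values,
# so every unique value of 'x' appears in the sample.
sample = df.groupby('x').sample(n=1, random_state=0)

# Check that the sample covers the same set of 'x' values as the population.
covered = set(sample['x']) == set(df['x'])
print(len(sample), covered)
```

Passing a larger n draws more rows per value, without replacement by default, so n must not exceed the size of the smallest group.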

If you want each unique element in each list in event_ids to be covered, explode the list column first and then apply the same code.

df

Out:

   x    y   z
0   84  D   [14805, 9243, 14838, 10204]
1   70  D   [6901, 1117, 3918, 8607, 1912]
2   7   F   [9853, 12519, 13011, 13279]
3   45  A   [6344, 14646, 9633, 4517, 9432, 11187]
4   41  A   [1104, 10318, 12531, 9443, 8347] 

df = df.explode('z').reset_index(drop=True)
df.head()

Out:
    x   y   z
0   13  D   1876
1   13  D   2437
2   13  D   2681
3   13  D   1748
4   37  E   10155
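
To tie this back to the question, here is a hedged sketch of how the exploded column could be used to pick k of the original rows so that together they contain every event ID. The toy frame, the function name sample_rows_covering, and the random top-up step are assumptions for illustration and not part of the answer above. It also assumes pandas 1.1 or later for groupby(...).sample.

```python
import pandas as pd

def sample_rows_covering(df, col, k, seed=None):
    """Return k rows of df whose lists in `col` jointly contain every element."""
    # One line per (original row label, element); keep the label as a column.
    long = df[col].explode().dropna().rename_axis('row').reset_index()
    # Same idea as before: one random line per unique element,
    # then keep the distinct original rows those lines came from.
    picked = long.groupby(col).sample(n=1, random_state=seed)['row'].unique()
    if len(picked) > k:
        raise ValueError(f'k is too small: {len(picked)} rows were picked for coverage')
    # Fill the remaining slots with rows drawn at random from the rest.
    extra = df.drop(index=picked).sample(n=k - len(picked), random_state=seed)
    return pd.concat([df.loc[picked], extra])

# Hypothetical toy data: 6 rows, event vocabulary {1, 2, 3}.
df = pd.DataFrame({
    'code': list('abcdef'),
    'event_ids': [[1, 2], [2, 3], [3], [1], [1, 3], [2]],
})

sample = sample_rows_covering(df, 'event_ids', k=4, seed=0)
vocab = set(df['event_ids'].explode())
covered = set(sample['event_ids'].explode()) == vocab
print(len(sample), covered)
```

This picks at most one row per distinct event ID, so it does not try to find the smallest covering set; k has to be at least the number of rows picked for coverage, otherwise the function raises.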
