Say I have a pandas DataFrame with a list column 'event_ids':
code  canceled  event_ids
xxx   [1.0]     [107385, 128281, 133015]
xxS   [0.0]     [108664, 110515, 113556]
ssD   [1.0]     [134798, 133499, 125396, 114298, 133915]
cvS   [0.0]     [107611]
eeS   [5.0]     [113472, 115236, 108586, 128043, 114106, 10796...
544W  [44.0]    [107650, 128014, 127763, 118036, 116247, 12802...
How do I select k rows sufficiently randomly so that all elements across 'event_ids' are represented in the sample? That is, the event vocabulary of the sample should be the same as that of the population. By 'sufficiently random' I mean something like importance sampling: rows are first drawn at random, then accepted or rejected according to some condition.
CodePudding user response:
It is not clear whether you want to sample each element within the lists in 'event_ids', or whether each list should be treated as a single unique element. In the latter case, this could work (not sure about the performance!).
Given this dataset:
import numpy as np
import pandas as pd

x = np.random.randint(1, 100, 5000)
y = [np.random.choice(['A', 'B', 'C', 'D', 'E', 'F']) for i in range(5000)]
df = pd.DataFrame({'x': x, 'y': y})
df.head()
Output:
x y
0 42 A
1 88 B
2 80 A
3 69 B
4 72 B
There are 99 unique values in column 'x'. You want to sample so that every unique value in df['x'] is in the obtained sample.
# Pick one random row index for each unique value of 'x'
idxs = []
for i in df.x.unique():
    idxs.extend(np.random.choice(df.loc[df['x'] == i].index, size=1))
sample = df.loc[idxs]
len(sample.x.unique())
Output:
99
You can increase the size argument to obtain more rows per unique value in your sample.
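As a side note, the loop above can also be written with pandas' built-in groupby sampling, which picks a random row within each group in one call. A minimal sketch on the same kind of synthetic data (column names 'x' and 'y' match the example above; the fixed seeds are just for reproducibility):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x': rng.integers(1, 100, 5000),
    'y': rng.choice(list('ABCDEF'), 5000),
})

# One randomly chosen row per unique value of 'x'
sample = df.groupby('x', group_keys=False).sample(n=1, random_state=0)

# Every unique value of 'x' is represented exactly once
assert set(sample['x']) == set(df['x'])
assert len(sample) == df['x'].nunique()
```

Raising n in .sample(n=...) gives more rows per unique value, like changing size in np.random.choice.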
If you want each unique element in each list in 'event_ids' represented, you can explode the column and then use the same code.
df
Out:
x y z
0 84 D [14805, 9243, 14838, 10204]
1 70 D [6901, 1117, 3918, 8607, 1912]
2 7 F [9853, 12519, 13011, 13279]
3 45 A [6344, 14646, 9633, 4517, 9432, 11187]
4 41 A [1104, 10318, 12531, 9443, 8347]
df = df.explode('z').reset_index()
df.head()
Out:
x y z
0 13 D 1876
1 13 D 2437
2 13 D 2681
3 13 D 1748
4 37 E 10155
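Putting the two pieces together for the original question: explode 'event_ids' (keeping the original row index), pick one exploded row per unique event id, then map those choices back to rows of the original DataFrame. A sketch on a small hand-made frame (the column names 'code' and 'event_ids' follow the question; the data values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'code': ['xxx', 'xxS', 'ssD', 'cvS'],
    'event_ids': [[107385, 128281, 133015],
                  [108664, 110515, 113556],
                  [134798, 133499, 125396],
                  [107611]],
})

# reset_index() keeps the original row number in an 'index' column
exploded = df.explode('event_ids').reset_index()

# For each unique event id, pick one original row that contains it
idxs = (exploded.groupby('event_ids')['index']
                .apply(lambda s: rng.choice(s))
                .unique())
sample = df.loc[idxs]

# The sample's event vocabulary equals the population's
pop = set(e for lst in df['event_ids'] for e in lst)
got = set(e for lst in sample['event_ids'] for e in lst)
assert pop == got
```

Because every selected row carries its whole list, the sample usually needs far fewer than one row per event id; rows covering many ids tend to be picked repeatedly and then deduplicated by .unique().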