I have the following test DataFrame:
| tag | list | Count |
| -------- | ----------------------------------------------------|-------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] | 5 |
| potato | [['U',0.8],['V',0.7],['W',0.4],['X',0.3]] | 4 |
| cheese | [['I',0.2],['J',0.4]] | 2 |
I want to randomly sample the list column: for each row, pick any 3 items from the first 4 sublists of its list of lists (so ['E',0.1] is never even considered for tag = icecream).
The rule should pick 3 sublists at random from the list of lists. If fewer than 3 are available, it should pick whatever is there and shuffle it.
The result should be random, but I need to be able to seed it so that I get the same output on every run. For example:
| tag | list |
| -------- | -------------------------------|
| icecream | [['B',0.6],['C',0.5],['A',0.9]]|
| potato | [['W',0.4],['X',0.3],['U',0.8]]|
| cheese | [['J',0.4],['I',0.2]] |
This is what I tried:
import pandas as pd

data = [['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
        ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3]]],
        ['cheese', [['I', 0.2], ['J', 0.4]]]]
df = pd.DataFrame(data, columns=['tag', 'list'])
# Number of sublists in each row (sorting here has no effect, because assignment aligns on the index)
df['Count'] = df['list'].str.len()
df
--
import random
item_top_3 = []
find = 4
num = 3
for i in range(df.shape[0]):
    item_id = df["tag"].iloc[i]
    whole_list = df["list"].iloc[i]
    item_top_3.append([item_id, random.sample(whole_list[0:find], num)])
--
I get this error:
ValueError: Sample larger than population or is negative.
Can anyone help me randomize it? The original DataFrame has over 50,000 rows, and the solution should work for any rule: for example, tomorrow someone may want to pick 5 random items from the first 20 elements of each list of lists, and it should still work.
CodePudding user response:
Use a list comprehension combined with random.sample:
import random
find = 4
num = 3
# Cap k at the size of the slice, so short lists (or find < num) do not raise ValueError
df['list'] = [random.sample(l[:find], k=min(num, len(l[:find]))) for l in df['list']]
output:
tag list Count
0 icecream [[C, 0.5], [B, 0.6], [D, 0.3]] 5
1 potato [[V, 0.7], [U, 0.8], [X, 0.3]] 4
2 cheese [[J, 0.4], [I, 0.2]] 2
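The question also asks for reproducible output. Here is a minimal sketch of one way to get that, assuming a dedicated random.Random instance is acceptable; the helper name sample_head and the seed value 42 are made up for illustration. It also shows why the original loop failed: random.sample raises ValueError whenever k exceeds the population size, which is the case for the cheese row.

```python
import random


def sample_head(items, find=4, num=3, rng=random):
    """Draw up to `num` distinct items from the first `find` items."""
    head = items[:find]
    # random.sample raises ValueError if k > len(head), so cap k.
    return rng.sample(head, k=min(num, len(head)))


# Plain-Python stand-in for the 'tag' and 'list' columns of the question.
rows = {
    "icecream": [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]],
    "potato": [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3]],
    "cheese": [['I', 0.2], ['J', 0.4]],
}

rng = random.Random(42)  # the seed value 42 is arbitrary
sampled = {tag: sample_head(items, rng=rng) for tag, items in rows.items()}

# Re-seeding reproduces exactly the same picks.
rng_again = random.Random(42)
sampled_again = {tag: sample_head(items, rng=rng_again) for tag, items in rows.items()}
```

On the DataFrame itself, the same helper would be applied as df['list'] = [sample_head(l, find, num, rng) for l in df['list']]. Using a separate Random instance rather than random.seed keeps the seeding local to this step.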
CodePudding user response:
Alternatively, you can combine np.random.choice with apply after creating a temporary column that contains only the first n elements of your original list column.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"tag": ["icecream", "potato", "cheese"],
"list": [[['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]], [['U',0.8],['V',0.7],['W',0.4],['X',0.3]], [['I',0.2],['J',0.4]]],
"count": [5, 4, 2]
})
first_n = 4
size = 3
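# Keep only the first n items of each list, as a NumPy array
df["ls_tmp"] = df["list"].str[:first_n].apply(np.array)
# Cap the sample size so that short lists return at most len(x) items
df["list"] = df["ls_tmp"].apply(lambda x: list(x[np.random.choice(len(x), size=min(size, len(x)))]))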
You can also write a helper function and use map instead of apply, which is easier to read and should be no slower:
def randomize(x, size=3):
    # Cap the sample size so that short lists return at most len(x) items
    return list(x[np.random.choice(len(x), size=min(size, len(x)))])

df["list"] = df["ls_tmp"].map(randomize)
Output:
tag list count ls_tmp
0 icecream [[A, 0.9], [A, 0.9], [C, 0.5]] 5 [[A, 0.9], [B, 0.6], [C, 0.5], [D, 0.3]]
1 potato [[W, 0.4], [V, 0.7], [V, 0.7]] 4 [[U, 0.8], [V, 0.7], [W, 0.4], [X, 0.3]]
2 cheese [[J, 0.4], [J, 0.4]] 2 [[I, 0.2], [J, 0.4]]
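where the column ls_tmp contains the original first n values. Note that np.random.choice samples with replacement by default, which is why the same item can appear more than once within a row above; pass replace=False to draw distinct items.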
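If distinct picks and a reproducible seed are both needed, a variant along these lines may work. This is only a sketch: the helper name sample_head_np and the seed 0 are assumptions, and it uses NumPy's Generator API (np.random.default_rng) rather than the np.random.choice call above. It samples indices and looks them up in the original Python list, which avoids converting the mixed string/float sublists into a string-typed NumPy array.

```python
import numpy as np


def sample_head_np(items, first_n=4, size=3, rng=None):
    """Pick up to `size` distinct items from items[:first_n] by sampling indices."""
    rng = np.random.default_rng() if rng is None else rng
    head = items[:first_n]
    # replace=False draws distinct indices; cap size for short lists.
    idx = rng.choice(len(head), size=min(size, len(head)), replace=False)
    return [head[i] for i in idx]


# Plain-Python stand-in for the 'list' column.
lists = [
    [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]],
    [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3]],
    [['I', 0.2], ['J', 0.4]],
]

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
result = [sample_head_np(l, rng=rng) for l in lists]
```

On the DataFrame, this could be applied as df["list"] = df["list"].map(lambda l: sample_head_np(l, first_n, size, rng)), with no temporary column needed.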