Extract each item from a column of lists and then pick the top items

I have the following DataFrame:

| tag      | list                                                |
| -------- | ----------------------------------------------------|
| icecream | [['A',0.9],['B',0.6],['C',0.5],['D',0.3],['E',0.1]] |
| potato   | [['U',0.8],['V',0.7],['W',0.4],['X',0.3],['Y',0.2]] |

The list column holds a list of lists; each inner list contains an item and a value between 0 and 1. The inner lists are sorted in descending order of this value.

For each item, I want the top 3 entries from its list, excluding the item itself. The resulting DataFrame should be:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['B',0.6],['C',0.5],['D',0.3]] |
| B    | [['A',0.9],['C',0.5],['D',0.3]] |
| C    | [['A',0.9],['B',0.6],['D',0.3]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['V',0.7],['W',0.4],['X',0.3]] |
| V    | [['U',0.8],['W',0.4],['X',0.3]] |
| W    | [['U',0.8],['V',0.7],['X',0.3]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

I am able to extract the items, but I am stuck on excluding the item itself when building top_3. This is what I have done:

import pandas as pd

data = [['icecream', [['A', 0.9],['B', 0.6],['C',0.5],['D',0.3],['E',0.1]]],
        ['potato', [['U', 0.8],['V', 0.7],['W',0.4],['X',0.3],['Y',0.2]]]]

df = pd.DataFrame(data, columns=['tag', 'list'])
df

--

temp = {}
for idx, row in df.iterrows():
    for item in row["list"]:
        temp[item[0]] = row["tag"]

top_items = {}
for idx, row in df.iterrows():
    top_items[row["tag"]] = row["list"]

similar = []
for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

df = pd.DataFrame(similar)
df.columns = ["item", "top_3"]

My result:

| item | top_3                           |
| ---- | --------------------------------|
| A    | [['A',0.9],['B',0.6],['C',0.5]] |
| B    | [['A',0.9],['B',0.6],['C',0.5]] |
| C    | [['A',0.9],['B',0.6],['C',0.5]] |
| D    | [['A',0.9],['B',0.6],['C',0.5]] |
| E    | [['A',0.9],['B',0.6],['C',0.5]] |
| U    | [['U',0.8],['V',0.7],['W',0.4]] |
| V    | [['U',0.8],['V',0.7],['W',0.4]] |
| W    | [['U',0.8],['V',0.7],['W',0.4]] |
| X    | [['U',0.8],['V',0.7],['W',0.4]] |
| Y    | [['U',0.8],['V',0.7],['W',0.4]] |

As you can see, top_3 is wrong for A, B, C, U, V and W: the code always takes the first 3 entries, regardless of whether the item itself is among them.

I tried adding filters, but could not get them to work.

If there is a better way to extract the data than what I did, please let me know how to optimize it.

CodePudding user response：

This part is missing a condition: it takes the first 3 entries without skipping the item itself when it appears among them.

for item, category in temp.items():
    top_3 = top_items.get(category)
    sample = top_3[:3]
    similar.append([item, sample])

The solution is to remove the item from top_3 first and then take the sample:

for item, category in temp.items():
    top_3 = top_items.get(category)
    top_3_without_item = [x for x in top_3 if x[0] != item]
    sample = top_3_without_item[:3]
    similar.append([item, sample])
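
The same idea can be written without the two intermediate dictionaries. The following is only a minimal sketch using the sample data from the question; the helper name top_n_excluding_self and its parameter n are made up for illustration and are not part of pandas.

```python
import pandas as pd

data = [
    ['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
    ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]],
]
df = pd.DataFrame(data, columns=['tag', 'list'])


def top_n_excluding_self(pairs, n=3):
    """Hypothetical helper: for every [item, score] pair, keep the first n other pairs.

    Assumes `pairs` is already sorted by score in descending order,
    as stated in the question.
    """
    rows = []
    for item, _score in pairs:
        others = [p for p in pairs if p[0] != item]  # drop the item itself
        rows.append([item, others[:n]])              # then take the top n
    return rows


# Flatten the per-tag results into one list of [item, top_3] rows.
similar = [row for pairs in df['list'] for row in top_n_excluding_self(pairs)]
result = pd.DataFrame(similar, columns=['item', 'top_3'])
print(result)
```

Because the filter runs before the slice, items that are themselves in the top 3 (A, B, C, U, V, W) get the next entry instead, while D, E, X and Y keep the plain top 3.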

CodePudding user response：

As a starting point, you can explode your list column and then merge the result with itself. Next, remove the rows where the two list columns are equal, and finally group and keep the top 3 values:

df1 = df.explode('list')

out = (df1.merge(df1, on='tag').query('list_x != list_y')
          .sort_values('list_y', key=lambda x: x.str[1], ascending=False)
          .assign(item=lambda x: x.pop('list_x').str[0])
          .groupby(['tag', 'item'])['list_y'].apply(lambda x: x.head(3).tolist())
          .rename('top_3').reset_index())

Output:

>>> out
        tag item                           top_3
0  icecream    A  [[B, 0.6], [C, 0.5], [D, 0.3]]
1  icecream    B  [[A, 0.9], [C, 0.5], [D, 0.3]]
2  icecream    C  [[A, 0.9], [B, 0.6], [D, 0.3]]
3  icecream    D  [[A, 0.9], [B, 0.6], [C, 0.5]]
4  icecream    E  [[A, 0.9], [B, 0.6], [C, 0.5]]
5    potato    U  [[V, 0.7], [W, 0.4], [X, 0.3]]
6    potato    V  [[U, 0.8], [W, 0.4], [X, 0.3]]
7    potato    W  [[U, 0.8], [V, 0.7], [X, 0.3]]
8    potato    X  [[U, 0.8], [V, 0.7], [W, 0.4]]
9    potato    Y  [[U, 0.8], [V, 0.7], [W, 0.4]]
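
To make the intermediate steps of the explode-and-merge approach easier to follow, here is a hedged sketch that names each stage and compares item names instead of whole lists. The column names item, other and score and the variables merged, pairs and top3 are assumptions introduced for this illustration; the data is the sample from the question.

```python
import pandas as pd

data = [
    ['icecream', [['A', 0.9], ['B', 0.6], ['C', 0.5], ['D', 0.3], ['E', 0.1]]],
    ['potato', [['U', 0.8], ['V', 0.7], ['W', 0.4], ['X', 0.3], ['Y', 0.2]]],
]
df = pd.DataFrame(data, columns=['tag', 'list'])

# One row per [item, score] pair.
df1 = df.explode('list')

# Self-merge on tag: every pair is combined with every pair of the same tag.
merged = df1.merge(df1, on='tag').assign(
    item=lambda d: d['list_x'].str[0],   # name of the item being described
    other=lambda d: d['list_y'].str[0],  # name of the candidate neighbour
    score=lambda d: d['list_y'].str[1],  # score of the candidate neighbour
)

# Drop the rows where an item is paired with itself.
pairs = merged[merged['item'] != merged['other']]

print(len(df1), len(merged), len(pairs))  # 10 50 40

# Sort candidates by score within each item, then keep the best three.
top3 = (pairs.sort_values(['tag', 'item', 'score'], ascending=[True, True, False])
             .groupby(['tag', 'item'])['list_y']
             .apply(lambda s: s.head(3).tolist())
             .rename('top_3')
             .reset_index())
print(top3)
```

The self-merge grows quadratically with the number of entries per tag (5 x 5 rows per tag here), so for long lists a plain Python loop may be the cheaper option.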

CodePudding user response：

You can replicate each list as many times as it has elements using