How to explode pandas dataframe with lists to label the ones in the same row with same id?-CodePudding

For example, I have a pandas dataframe like this :

Ignoring the "Name" column, I want a dataframe that looks like this, labelling the Hashes of the same group with their "ID"

Here, we traverse each row, we encounter "8a43", and assign ID 1 to it, and wherever we find the same hash value, we assign ID as 1. Then we move on to the next row, and encounter 79e2 and b183. We then traverse all the rows and wherever we find these values, we store their IDs as 2. Now the issue will arise when we reach "abc7". It will be assigned ID=5 as it was previously encountered in "abc5". But I also want that in rows after the current one, wherever I find "26ea", assign the ID=5 to those as well.

I hope all this makes sense. If not, feel free to reach out to me via comments or message. I will clear it out quickly.

CodePudding user response：

Solution using dict

import numpy as np
import pandas as pd

hashvalues = list(df['Hash_Value'])

dic, i = {}, 1
id_list = []
for hashlist in hashvalues:
    # convert to list
    if isinstance(hashlist, str):
        hashlist = hashlist.replace('[','').replace(']', '')
        hashlist = hashlist.split(',')

        # check if the hash is unknown
        if hashlist[0] not in dic:
            # Assign a new id
            dic[hashlist[0]] = i
            k = i
            i  = 1
        else:
            # if known use existing id
            k = dic[hashlist[0]]
            
        for h in hashlist[1:]:
            # set id of the rest of the list hashes
            # equal to the first hashes's id
            dic[h] = k
            
        id_list.append(k)
    else:
        id_list.append(np.nan)
    
     print(df)

               Hash   Name  ID
0            [8a43]   abc1   1
1       [79e2,b183]   abc2   2
2            [f82a]   abc3   3
3            [b183]   abc4   2
4  [eaa7,5ea9,1cee]   abc5   4
5            [5ea9]   abc6   4
6       [1cee,26ea]   abc7   4
7            [79e2]   abc8   2
8            [8a43]   abc9   1
9            [26ea]  abc10   4

CodePudding user response：

Use networkx solution for dictionary for common values, select first value in Hash_Value by str and use Series.map:

#if necessary convert to lists
#df['Hash_Value'] = df['Hash_Value'].str.strip('[]').str.split(', ')

import networkx as nx

G=nx.Graph()
for l in df['Hash_Value']:
    nx.add_path(G, l)

new = list(nx.connected_components(G))

print (new)
[{'8a43'}, {'79e2', 'b183'}, {'f82a'}, {'5ea9', '1cee', '26ea', 'eaa7'}]

mapped =  {node: cid for cid, component in enumerate(new) for node in component}

df['ID'] = df['Hash_Value'].str[0].map(mapped)   1

print (df)
           Hash_Value   Name  ID
0              [8a43]   abcl   1
1        [79e2, b183]   abc2   2
2              [f82a]   abc3   3
3              [b183]   abc4   2
4  [eaa7, 5ea9, 1cee]   abc5   4
5              [5ea9]   abc6   4
6        [1cee, 26ea]   abc7   4
7              [79e2]   abc8   2
8              [8a43]   abc9   1
9              [26ea]  abc10   4