Efficient way to replace values in a column starting from a list of pairs

I'm trying to replace duplicates in my data, and I'm looking for an efficient way to do that.

I have a df with 2 columns, idA and idB, like this:

This is a df with similarities. I want to create a dictionary in which the key is the id, and the value is a list with all the devices linked to the key. Example:

d[5] = [22, 6000]
d[22] = [5, 590]

What I'm doing is the following:

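from tqdm import tqdm

# ids and dup_list are defined earlier; dup_list presumably holds the (idA, idB) pairs
dict_dup = dict()

for j in tqdm(ids):
    l1 = []
    for i in range(0, len(dup_list)):
        if j in dup_list[i]:
            # keep the other element of the pair
            l2 = list(dup_list[i])
            l2.remove(j)
            l1.append(l2[0])
            dict_dup[j] = l1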

Is it possible to do this more efficiently?

CodePudding user response：

Assuming this is a pandas DataFrame, we can groupby "idA", collect "idB" values of each group in a list and use to_dict for the dictionary:

out = df.groupby('idA')['idB'].apply(list).to_dict()

Output:

{5: [6000], 22: [5, 590]}
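
Note that this only maps idA to idB, so 5 gets [6000] rather than the [22, 6000] from the question's example. If the symmetric mapping is what you want, one possible sketch is to stack the frame on top of a column-swapped copy before grouping. The sample df below is an assumption built from the example values in the question:

```python
import pandas as pd

# Sample data assumed from the example values in the question.
df = pd.DataFrame({"idA": [22, 22, 5], "idB": [5, 590, 6000]})

# Swap the column names so every pair also appears in the reverse direction.
swapped = df.rename(columns={"idA": "idB", "idB": "idA"})

# concat aligns on column names, so the swapped rows land in the right columns.
both = pd.concat([df, swapped], ignore_index=True)

# Same groupby as above, now covering ids from both columns.
out = both.groupby("idA")["idB"].apply(list).to_dict()
print(out)
```

With this sample data every id from either column becomes a key, and 5 is linked to both 22 and 6000.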

That being said, it's not exactly the best way to replace duplicates imo. Why are you creating a dictionary? Why not work on the DataFrame itself? But given the very limited data you have provided, we can only speculate.

CodePudding user response：

I have to do some guessing here, because your question is not super clear, but the way I understand it, you want a dictionary that maps each id in idA or idB to the list of ids it is paired with on the other side.

If I understood your problem correctly, I would solve it by directly constructing a dictionary mapping ids to sets of ids.

idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = dict()
for a, b in zip(idA, idB):
    if a not in dict_dup:
        dict_dup[a] = set()
    dict_dup[a].add(b)

    if b not in dict_dup:
        dict_dup[b] = set()
    dict_dup[b].add(a)

After this runs, print(dict_dup) outputs

{22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}}

which I think is the data structure you're looking for.

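By using dicts and sets, this code is very efficient. Dict lookups and set insertions take constant time on average, so it runs in time linear in the number of (idA, idB) pairs, instead of scanning the whole list of pairs once per id.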
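
If you need lists rather than sets, as in the d[5] = [22, 6000] example from the question, the sets can be converted once at the end. A minimal sketch, where dict_lists is just an illustrative name and setdefault stands in for the "if ... not in dict_dup" checks:

```python
idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = {}
for a, b in zip(idA, idB):
    # setdefault returns the existing set, or inserts and returns a new empty one
    dict_dup.setdefault(a, set()).add(b)
    dict_dup.setdefault(b, set()).add(a)

# Convert each set to a sorted list so the result is deterministic.
dict_lists = {key: sorted(linked) for key, linked in dict_dup.items()}

print(dict_lists[5])   # [22, 6000]
print(dict_lists[22])  # [5, 590]
```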

Shorter code with defaultdict

You can also make this code a lot shorter by using a defaultdict instead of a regular dict, which will automatically create those empty sets when needed:

from collections import defaultdict

idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = defaultdict(set)
for a, b in zip(idA, idB):
    dict_dup[a].add(b)
    dict_dup[b].add(a)

The print statement produces slightly different output, but it's equivalent:

defaultdict(<class 'set'>, {22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}})

This still contains the info you want, and is just as efficient as the first solution.
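
One thing to watch out for with defaultdict: indexing a key that is not present inserts an empty set for it as a side effect. If that matters to you, a possible approach is to convert to a plain dict once the mapping is built and use .get for lookups. A small sketch, where 999 is just a made-up id that is not in the data:

```python
from collections import defaultdict

idA = [22, 22, 5]
idB = [5, 590, 6000]

dict_dup = defaultdict(set)
for a, b in zip(idA, idB):
    dict_dup[a].add(b)
    dict_dup[b].add(a)

# Freeze the result into a regular dict before doing lookups.
plain = dict(dict_dup)

# .get on a plain dict returns the fallback without inserting anything.
print(plain.get(999, set()))  # set()
print(999 in plain)           # False

# Indexing the defaultdict, by contrast, silently creates the key.
dict_dup[999]
print(999 in dict_dup)        # True
```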

Putting it back in your data frame

Now, if you need to put this information back in your dataframe, you can use dict_dup to efficiently retrieve what you're looking for for each row.
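
For instance, here is a rough sketch that adds one column per side with the ids linked to that row's id. The column names linked_to_A and linked_to_B are made up for illustration, and the df is assumed from the example values above:

```python
from collections import defaultdict

import pandas as pd

# Sample data assumed from the example values above.
df = pd.DataFrame({"idA": [22, 22, 5], "idB": [5, 590, 6000]})

dict_dup = defaultdict(set)
for a, b in zip(df["idA"].tolist(), df["idB"].tolist()):
    dict_dup[a].add(b)
    dict_dup[b].add(a)

# Hypothetical column names: one average-constant-time lookup per row.
df["linked_to_A"] = df["idA"].map(lambda i: sorted(dict_dup[i]))
df["linked_to_B"] = df["idB"].map(lambda i: sorted(dict_dup[i]))

print(df)
```

Each lookup is a single dict access, so this stays linear in the number of rows apart from the sorting of each small list.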