Creating Edgelist from DataFrame with column containing multiple values

Time:10-29

I have a DataFrame containing some champions and their related traits. It looks like this:

Index  Champion      Traits
0      Alphelios     'Nightbringer', 'Ranger'
1      Ashe          'Draconic', 'Ranger'
2      Heimerdinger  'Renewer', 'Draconic', 'Caretaker'
3      Lee Sin       'Nightbringer', 'Skirmisher'

The Traits are stored as lists: ['Nightbringer', 'Ranger'], ['Draconic', 'Ranger'], and so forth.

I wish to create an edge list that links each champion to the champions sharing a trait with it, like this:

Source  Target
0       3
0       1
1       2

The list goes on. I'd also like the final DataFrame to contain a column with the weights: for example, if two champions share two traits, or even three, the edge should have a weight of 2 (or 3). I think the DataFrame should be expanded so that each champion has several rows (one per trait), but I can't seem to find a solution to the problem. Can anybody help me out here? Thanks!

CodePudding user response:

IIUC, you need to count the number of other rows with matching traits.

Use a combination of str.get_dummies and numpy:

NB. This assumes Traits are strings; if they are lists, build the dummies from the lists instead.

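import numpy as np
import pandas as pd

# One 0/1 indicator column per trait. The separator includes the space after
# the comma; with sep=',' later items keep a leading space and fail to match.
a = df.Traits.str.get_dummies(sep=', ').values
# b[i, j] = number of traits shared by rows i and j
b = a.dot(a.T)
# ignore each row's match with itself
np.fill_diagonal(b, 0)

# 'Target' here holds the total number of matches with other rows
pd.DataFrame({'Source': df['Index'],
              'Target': b.sum(1)})

Output:

   Source  Target
0       0       2
1       1       2
2       2       1
3       3       1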
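
The table above gives one total per row. To get the pairwise edge list with weights that the question asks for, the same dot-product idea can be carried one step further. This is only a minimal sketch, assuming Traits holds Python lists and the DataFrame has a default 0..n-1 index; the sample frame is rebuilt from the question's data, and the `edges` name is made up for illustration:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with Traits stored as lists (an assumption).
df = pd.DataFrame({
    "Champion": ["Alphelios", "Ashe", "Heimerdinger", "Lee Sin"],
    "Traits": [
        ["Nightbringer", "Ranger"],
        ["Draconic", "Ranger"],
        ["Renewer", "Draconic", "Caretaker"],
        ["Nightbringer", "Skirmisher"],
    ],
})

# Join each list with "|" (the default separator of str.get_dummies),
# then build one 0/1 indicator column per trait.
a = df["Traits"].str.join("|").str.get_dummies().to_numpy()

# b[i, j] = number of traits shared by champions i and j.
b = a @ a.T

# Keep only the upper triangle (i < j) so each pair is listed once,
# then read off the non-zero entries as weighted edges.
src, tgt = np.nonzero(np.triu(b, k=1))
edges = pd.DataFrame({"Source": src, "Target": tgt, "Weight": b[src, tgt]})
print(edges)
```

With the four sample rows this yields the edges 0-1, 0-3 and 1-2, each with weight 1. Source and Target are row positions, so if the real index is not 0..n-1 they would need mapping back through `df.index`.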

CodePudding user response:

You can create a dict with the indexes of all occurrences of a trait:

my_dict = {}

for i, j in enumerate(df['Traits']):
    for trait in j:
        if trait in my_dict:
            my_dict[trait].append(i)
        else:
            my_dict[trait] = [i]
print(my_dict)

The output:

{'Nightbringer': [0, 3], 'Ranger': [0, 1], 'Draconic': [1, 2], 'Renewer': [2], 'Caretaker': [2], 'Skirmisher': [3]}

This is a better approach because it avoids unnecessary repetition, e.g. recording both that 0 points to 3 and that 3 points to 0.
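
To go from that dict to a weighted edge list, one option is to count how often each pair of indexes appears under the same trait. A rough sketch, assuming the dict has the shape printed above; `pair_counts` and `edges` are names made up for illustration:

```python
from collections import Counter
from itertools import combinations

# The dict printed above: trait -> indexes of the champions that have it.
my_dict = {
    'Nightbringer': [0, 3], 'Ranger': [0, 1], 'Draconic': [1, 2],
    'Renewer': [2], 'Caretaker': [2], 'Skirmisher': [3],
}

# Every pair of champions listed under the same trait shares that trait,
# so counting pairs across all traits gives the edge weight.
pair_counts = Counter()
for indexes in my_dict.values():
    # sorted() keeps each pair as (smaller, larger), so 0-3 and 3-0 never both appear
    for pair in combinations(sorted(indexes), 2):
        pair_counts[pair] += 1

# (source, target, weight) triples, ordered by source then target
edges = [(s, t, w) for (s, t), w in sorted(pair_counts.items())]
print(edges)  # [(0, 1, 1), (0, 3, 1), (1, 2, 1)]
```

Passing `edges` to `pd.DataFrame(edges, columns=['Source', 'Target', 'Weight'])` would give the table the question asks for.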
