Question: How can I improve either my method ("expand_traits", posted below) or the data structure I am trying to use? I estimate the runtime of my solution at a few hours, which suggests I went very wrong somewhere (considering it takes ~10 minutes to collect all of the data, but possibly a few hours to transform it into something I can analyze).
I have collected some data that is essentially a Pandas DataFrame, where some columns in the table are lists of lists (technically stored as strings, so when I evaluate them I use ast.literal_eval(column), if that's relevant).
To explain the context a bit:
The data contains historical stats from League of Legends' TFT (Teamfight Tactics) game mode. I am aiming to analyze it by grouping on each item in the list and seeing how the items perform on average. I can only really think of doing this in terms of tables, something like df.groupby(by='Trait').mean() to get the average win rate for each trait, but I am open to other ideas.
Here is an example of the dataset:
Rank | Summoner | Traits | Units |
---|---|---|---|
1 | name1 | ['7 Innovator', '1 Transformer', '3 Enchanter', '2 Socialite', '2 Clockwork', '2 Scholar', '2 Scrap'] | ['Ezreal', 'Singed', 'Zilean', 'Taric', 'Heimerdinger', 'Janna', 'Orianna', 'Seraphine', 'Jayce'] |
2 | name2 | ['1 Cuddly', '1 Glutton', '5 Mercenary', '4 Bruiser', '6 Chemtech', '2 Scholar', '1 Socialite', '2 Twinshot'] | ['Illaoi', 'Gangplank', 'MissFortune', 'Lissandra', 'Zac', 'Urgot', 'DrMundo', 'TahmKench', 'Yuumi', 'Viktor'] |
The table has approximately 40,000 records (doesn't sound like much), but my original idea was to basically "unpivot" the nested lists into their own records.
My idea looks a little something like:
Summoner | Trait | Record_ID |
---|---|---|
name1 | 7 Innovator | id_1 |
name1 | 1 Transformer | id_1 |
... | ... | ... |
name2 | 1 Cuddly | id_2 |
name2 | 1 Glutton | id_2 |
Due to the number of items in each list, this transformation turns my ~40,000 records into a few hundred thousand.
Another thing to note: because this transformation is unique to each column that contains lists, I would need to perform it separately (as far as I know) on each such column. Here is the current code I use on the "Traits" column, which takes my computer around 35 minutes to complete (a pretty average PC, nothing crazy, roughly an Intel i5 with 16 GB of RAM).
def expand_traits(traits_df):
    traits_df_expanded = pd.DataFrame()
    for i in range(len(traits_df)):
        traits = ast.literal_eval(traits_df.Traits[i])
        for trait in traits:
            record = {
                'Summoner': traits_df.Summoner[i],
                'Trait': trait,
                'match_id': str(traits_df.match_id[i])
            }
            # Appending row-by-row rebuilds the DataFrame on every iteration
            traits_df_expanded = traits_df_expanded.append(record, ignore_index=True)
    return traits_df_expanded
Is this approach logical, or am I missing something here? I can't imagine this is the optimal method; I also might have gone wrong somewhere in my expand_traits method.
CodePudding user response:
Use explode:
cols = ['Summoner', 'Traits', 'Record_ID']
out = df.assign(Record_ID='id_' + df['Rank'].astype(str))[cols] \
.explode('Traits', ignore_index=True) \
.rename(columns={'Traits': 'Trait'})
print(out)
# Output:
Summoner Trait Record_ID
0 name1 7 Innovator id_1
1 name1 1 Transformer id_1
2 name1 3 Enchanter id_1
3 name1 2 Socialite id_1
4 name1 2 Clockwork id_1
5 name1 2 Scholar id_1
6 name1 2 Scrap id_1
7 name2 1 Cuddly id_2
8 name2 1 Glutton id_2
9 name2 5 Mercenary id_2
10 name2 4 Bruiser id_2
11 name2 6 Chemtech id_2
12 name2 2 Scholar id_2
13 name2 1 Socialite id_2
14 name2 2 Twinshot id_2
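Note that explode expects actual list objects; since the question mentions the list columns are stored as strings, they may need an ast.literal_eval pass first. Here is a minimal self-contained sketch (the toy data and the Record_ID scheme follow the question's example; using Rank as the per-trait metric in the final groupby is an assumption, since the question doesn't show a win-rate column):

```python
import ast
import pandas as pd

# Toy data mirroring the question: list columns stored as strings
df = pd.DataFrame({
    'Rank': [1, 2],
    'Summoner': ['name1', 'name2'],
    'Traits': ["['7 Innovator', '1 Transformer']",
               "['1 Cuddly', '1 Glutton']"],
})

# Parse the stringified lists once, vectorized over the column
df['Traits'] = df['Traits'].apply(ast.literal_eval)

# One explode call replaces the whole row-by-row append loop
out = df.assign(Record_ID='id_' + df['Rank'].astype(str)) \
        .explode('Traits', ignore_index=True) \
        .rename(columns={'Traits': 'Trait'})

# With the data in long format, the per-trait analysis from the
# question becomes a plain groupby (here averaging Rank as an example)
avg_by_trait = out.groupby('Trait')['Rank'].mean()
```

The same pattern (literal_eval, then explode) should work for the "Units" column as well; each list column does still need its own explode, but the calls are cheap compared to appending rows in a loop, which rebuilds the DataFrame on every iteration.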