Question: How can I improve either my method ("expand_traits", posted below) or the data structure I am trying to use? I estimate the runtime of my solution at a few hours, which suggests I went very wrong somewhere (considering it takes ~10 minutes to collect all of the data, but possibly a few hours to transform it into something I can analyze).
I have collected some data that is essentially a Pandas DataFrame, where some columns in the table are lists of lists (technically stored as strings, so when I evaluate them I use ast.literal_eval(column), if that's relevant).
To explain the context a bit:
The data contains historical stats from League of Legends' TFT (Teamfight Tactics) game mode. I am aiming to analyze it by grouping on each item in the list and seeing how the items perform on average. I can only really think of doing this in terms of tables, something like df.groupby(by='Trait').mean() to get the average win rate for each trait, but I am open to other ideas.
Here is an example of the dataset:
Rank | Summoner | Traits | Units |
---|---|---|---|
1 | name1 | ['7 Innovator', '1 Transformer', '3 Enchanter', '2 Socialite', '2 Clockwork', '2 Scholar', '2 Scrap'] | ['Ezreal', 'Singed', 'Zilean', 'Taric', 'Heimerdinger', 'Janna', 'Orianna', 'Seraphine', 'Jayce'] |
2 | name2 | ['1 Cuddly', '1 Glutton', '5 Mercenary', '4 Bruiser', '6 Chemtech', '2 Scholar', '1 Socialite', '2 Twinshot'] | ['Illaoi', 'Gangplank', 'MissFortune', 'Lissandra', 'Zac', 'Urgot', 'DrMundo', 'TahmKench', 'Yuumi', 'Viktor'] |
The table has approximately 40,000 records (doesn't sound like much), but my original idea was to basically "unpivot" the nested lists into their own records.
My idea looks a little something like:
Summoner | Trait | Record_ID |
---|---|---|
name1 | 7 Innovator | id_1 |
name1 | 1 Transformer | id_1 |
... | ... | ... |
name2 | 1 Cuddly | id_2 |
name2 | 1 Glutton | id_2 |
Due to the number of items in each list, this transformation turns my ~40,000 records into a few hundred thousand.
Another thing to note: because this transformation is unique to each column that contains lists, I would need to perform it separately (as far as I know) on each such column. Here is the current code I use on the "Traits" column, which takes my computer around 35 minutes to complete (a pretty average PC, nothing crazy, roughly an Intel i5 with 16 GB of RAM).
def expand_traits(traits_df):
    traits_df_expanded = pd.DataFrame()
    for i in range(len(traits_df)):
        traits = ast.literal_eval(traits_df.Traits[i])
        for trait in traits:
            record = {
                'Summoner': traits_df.Summoner[i],
                'Trait': trait,
                'match_id': str(traits_df.match_id[i])
            }
            # Appending row-by-row rebuilds the DataFrame on every iteration
            traits_df_expanded = traits_df_expanded.append(record, ignore_index=True)
    return traits_df_expanded
Is this approach logical, or am I missing something here? I can't imagine this is the optimal method; I also might have gone wrong somewhere in my expand_traits method.
CodePudding user response:
Use explode:
cols = ['Summoner', 'Traits', 'Record_ID']
out = df.assign(Record_ID='id_' + df['Rank'].astype(str))[cols] \
.explode('Traits', ignore_index=True) \
.rename(columns={'Traits': 'Trait'})
print(out)
# Output:
Summoner Trait Record_ID
0 name1 7 Innovator id_1
1 name1 1 Transformer id_1
2 name1 3 Enchanter id_1
3 name1 2 Socialite id_1
4 name1 2 Clockwork id_1
5 name1 2 Scholar id_1
6 name1 2 Scrap id_1
7 name2 1 Cuddly id_2
8 name2 1 Glutton id_2
9 name2 5 Mercenary id_2
10 name2 4 Bruiser id_2
11 name2 6 Chemtech id_2
12 name2 2 Scholar id_2
13 name2 1 Socialite id_2
14 name2 2 Twinshot id_2
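Note that explode expects actual list objects; since the question mentions the list columns are stored as strings, they may need an ast.literal_eval pass first. Here is a minimal self-contained sketch (the toy data and the Record_ID scheme follow the question's example; using Rank as the per-trait metric in the final groupby is an assumption, since the question doesn't show a win-rate column):

```python
import ast
import pandas as pd

# Toy data mirroring the question: list columns stored as strings
df = pd.DataFrame({
    'Rank': [1, 2],
    'Summoner': ['name1', 'name2'],
    'Traits': ["['7 Innovator', '1 Transformer']",
               "['1 Cuddly', '1 Glutton']"],
})

# Parse the stringified lists once, vectorized over the column
df['Traits'] = df['Traits'].apply(ast.literal_eval)

# One explode call replaces the whole row-by-row append loop
out = df.assign(Record_ID='id_' + df['Rank'].astype(str)) \
        .explode('Traits', ignore_index=True) \
        .rename(columns={'Traits': 'Trait'})

# With the data in long format, the per-trait analysis from the
# question becomes a plain groupby (here averaging Rank as an example)
avg_by_trait = out.groupby('Trait')['Rank'].mean()
```

The same pattern (literal_eval, then explode) should work for the "Units" column as well; each list column does still need its own explode, but the calls are cheap compared to appending rows in a loop, which rebuilds the DataFrame on every iteration.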