I have a data frame with about 20k rows and about 10 columns. One of the columns has a list in it between len=1 and len=4.
If a list has more than one entry in it, I want to duplicate that row and append it to the bottom. The second table is what I want to be added.
Index | Col 1 | Col 2 | List Col | List Len |
---|---|---|---|---|
1 | ABC | CDE | ['String1', 'String2'] | 2 |
2 | EFG | HIJ | ['String3'] | 1 |
Index | Col 1 | Col 2 | List Col | List Len |
---|---|---|---|---|
3 | ABC | CDE | ['String2'] | 2 |
and change the first row to
Index | Col 1 | Col 2 | List Col | List Len |
---|---|---|---|---|
1 | ABC | CDE | ['String1'] | 2 |
Apologies for the really terrible formatting. When I created a second table to show everything, Stack Overflow viewed it as unformatted code and I couldn't post it.
The index is the automatically generated index and not relevant to my data.
#Finding the max number of list len
max_list_len= max(original_df['List Col'])
#Set my counter variable
number=2
#Begin loop
while number<=max_list_len:
#copies relevant rows into new data frame
dataframe_being_duplicated = original_df.loc[(original_df['List Col']).eq(number)]
dataframe_being_duplicated.loc[(original_df['List Len']).eq(number), 'List Col'] = dataframe_being_duplicated.loc[(original_df['List Len']).eq(number)]['List Col'].str[number-1]
full_quote_df = pd.concat(full_quote_df,dataframe_being_duplicated,ignore_index=True)
number =1 #increment number
With this, it throws an error that says "cannot reindex on an axis with duplicate labels"
I'm also now realizing that this process for rows where the len is larger than 2, because it only duplicates the row once (if it worked as intended).
Is there a faster way to 'unstack' rows like this?
CodePudding user response:
If you really want to duplicate the rows:
out = df.loc[df.index.repeat(df['List Col'].str.len())]
Output:
Index Col 1 Col 2 List Col List Len
0 1 ABC CDE [String1, String2] 2
0 1 ABC CDE [String1, String2] 2
1 2 EFG HIJ [String3] 1
Else a classical explode
:
out = df.explode('List Col')
Output:
Index Col 1 Col 2 List Col List Len
0 1 ABC CDE String1 2
0 1 ABC CDE String2 2
1 2 EFG HIJ String3 1