I ran into this problem where I have a dataframe that looks like the following (the values in the last 3 columns are usually 4-5 alphanumeric codes).
import pandas as pd

data = {'ID': ['P39', 'S32'],
        'Name': ['Pipe', 'Screw'],
        'Col3': ['Test1, Test2, Test3', 'Test6, Test7'],
        'Col4': ['', 'Test8, Test9'],
        'Col5': ['Test4, Test5', 'Test10, Test11, Test12, Test13']}
df = pd.DataFrame(data)
|   | ID  | Name  | Col3                | Col4         | Col5                           |
|---|-----|-------|---------------------|--------------|--------------------------------|
| 0 | P39 | Pipe  | Test1, Test2, Test3 |              | Test4, Test5                   |
| 1 | S32 | Screw | Test6, Test7        | Test8, Test9 | Test10, Test11, Test12, Test13 |
I want to expand this dataframe (or create a new one) based on the values in the last 3 columns of each row. Each original row should produce as many rows as the maximum number of comma-separated values in any of its last 3 columns. The first 2 columns should stay the same in all of the expanded rows, while each of the last 3 columns should hold only one value per expanded row, taken in order from the original column.
In the above example, the first row would need 3 expanded rows (Col3 has the most values at 3), and the second row would need 4 (Col5 has the most at 4). The desired output would be along the lines of:
|   | ID  | Name  | Col3  | Col4  | Col5   |
|---|-----|-------|-------|-------|--------|
| 0 | P39 | Pipe  | Test1 |       | Test4  |
| 1 | P39 | Pipe  | Test2 |       | Test5  |
| 2 | P39 | Pipe  | Test3 |       |        |
| 3 | S32 | Screw | Test6 | Test8 | Test10 |
| 4 | S32 | Screw | Test7 | Test9 | Test11 |
| 5 | S32 | Screw |       |       | Test12 |
| 6 | S32 | Screw |       |       | Test13 |
I first found a way to figure out the number of rows needed, and I had the idea of appending the values to a new dataframe inside the same loop. However, I'm not sure how to separate the values in the last 3 columns and append them to the new rows one by one. I know str.split() is useful for turning each value into a list. My only idea is that I may need to loop through each column separately and append each value to the correct row, but I'm not sure how to do that.
output1 = pd.DataFrame(columns=['ID', 'Name', 'Col3', 'Col4', 'Col5'])

for index, row in df.iterrows():
    output2 = pd.DataFrame(columns=['ID', 'Name', 'Col3', 'Col4', 'Col5'])

    col3counter = df.iloc[index, 2].count(',')
    col4counter = df.iloc[index, 3].count(',')
    col5counter = df.iloc[index, 4].count(',')
    # maximum number of values in any of the three columns, i.e. rows needed
    numofnewcols = max(col3counter, col4counter, col5counter) + 1

    iter1 = df.iloc[index, 2].split(', ')
    iter2 = df.iloc[index, 3].split(', ')
    iter3 = df.iloc[index, 4].split(', ')

    # for q in iter1
    # output2.iloc[ , 2] =

    output1 = pd.concat([output1, output2], ignore_index=True)
    del output2
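For reference, one way the loop idea above could be finished is the following minimal sketch, which splits all three columns and uses itertools.zip_longest to pad the shorter lists with empty strings so every expanded row gets one value (or a blank) per column. It assumes the df defined at the top of the question and is just one possible completion, not necessarily the most idiomatic pandas approach:
from itertools import zip_longest

# sketch: build the expanded frame row by row from the df defined above
rows = []
for _, row in df.iterrows():
    # one list of codes per column; an empty cell becomes ['']
    parts = [row[c].split(', ') for c in ('Col3', 'Col4', 'Col5')]
    # zip_longest pads the shorter lists with '' so each expanded row
    # receives exactly one value (or a blank) from each column
    for v3, v4, v5 in zip_longest(*parts, fillvalue=''):
        rows.append({'ID': row['ID'], 'Name': row['Name'],
                     'Col3': v3, 'Col4': v4, 'Col5': v5})

output1 = pd.DataFrame(rows)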
CodePudding user response:
A bit tricky, but it should work: use melt to flatten your dataframe, then pivot_table to reshape it:
out = (df.reset_index()
         # long format: one row per (ID, Name, original column), then split and explode
         .melt(['ID', 'Name', 'index'], var_name='col', value_name='val')
         .assign(val=lambda x: x['val'].str.split(', ')).explode('val')
         # number the values within each original cell, then pivot back to wide
         .assign(row=lambda x: x.groupby(['index', 'col']).cumcount())
         .pivot_table('val', ['index', 'row', 'ID', 'Name'], 'col', aggfunc='first')
         .droplevel(['index', 'row']).reset_index().rename_axis(columns=None).fillna(''))
Output:
|   | ID  | Name  | Col3  | Col4  | Col5   |
|---|-----|-------|-------|-------|--------|
| 0 | P39 | Pipe  | Test1 |       | Test4  |
| 1 | P39 | Pipe  | Test2 |       | Test5  |
| 2 | P39 | Pipe  | Test3 |       |        |
| 3 | S32 | Screw | Test6 | Test8 | Test10 |
| 4 | S32 | Screw | Test7 | Test9 | Test11 |
| 5 | S32 | Screw |       |       | Test12 |
| 6 | S32 | Screw |       |       | Test13 |
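If it helps to follow what pivot_table is reshaping, the intermediate long-format frame (everything before the pivot_table call, reusing the df from the question; the long_df name is just for illustration) can be inspected on its own:
# one row per exploded value, with 'col' recording the source column and
# 'row' numbering the values within each original cell
long_df = (df.reset_index()
             .melt(['ID', 'Name', 'index'], var_name='col', value_name='val')
             .assign(val=lambda x: x['val'].str.split(', ')).explode('val')
             .assign(row=lambda x: x.groupby(['index', 'col']).cumcount()))
print(long_df)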
CodePudding user response:
Here is a way:
cols = ['Col3', 'Col4', 'Col5']
s = df[cols].stack().str.split(', ')   # Series of lists, indexed by (row, column)
s2 = s.str.len().groupby(level=0).transform(lambda x: x.max() - x)   # blanks each list needs
# pad every list to the row maximum, reshape back to wide, re-attach ID/Name, then explode
(df.loc[:, ~df.columns.isin(cols)]
   .join((s + s2.map(lambda x: x * [''])).unstack())
   .explode(cols).reset_index(drop=True))
Output:
    ID   Name   Col3   Col4    Col5
0  P39   Pipe  Test1          Test4
1  P39   Pipe  Test2          Test5
2  P39   Pipe  Test3
3  S32  Screw  Test6  Test8  Test10
4  S32  Screw  Test7  Test9  Test11
5  S32  Screw                Test12
6  S32  Screw                Test13
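The key step is s + s2.map(lambda x: x * ['']): adding two object-dtype Series concatenates the underlying lists element-wise, so each list gets exactly enough empty strings to match the longest list in its row. Note also that passing a list of columns to explode (.explode(cols)) needs pandas 1.3 or newer. A tiny illustration of the padding step, with made-up values rather than the question's data:
import pandas as pd

lists = pd.Series([['a', 'b', 'c'], ['d']])        # lists of unequal length
deficit = lists.str.len().max() - lists.str.len()  # blanks each list needs
padded = lists + deficit.map(lambda n: [''] * n)   # element-wise list concatenation
print(padded.tolist())                             # [['a', 'b', 'c'], ['d', '', '']]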
CodePudding user response:
This equalizes the number of values in each list by row, so that you can get your desired output with a multi-column explode.
import pandas as pd
import numpy as np

cols = ['Col3', 'Col4', 'Col5']

# split each of the three columns into lists
for col in cols:
    df[col] = df[col].str.split(', ')

# number of expanded rows needed per original row = length of the longest list
df['rows'] = df[cols].applymap(len).max(axis=1)

# pad each list with NaN so all three lists in a row have the same length
for col in cols:
    df[col] = df[[col, 'rows']].apply(
        lambda x: x[col] + [np.nan] * (x['rows'] - len(x[col])), axis=1)

'''
# Or, simplified with more-itertools and np.vectorize
from more_itertools import padded

vec_pad = np.vectorize(padded, excluded={1})
for col in cols:
    df[col] = vec_pad(df[col], np.nan, df.rows)
df[cols] = df[cols].applymap(list)
'''

df = (df.explode(cols)
        .drop('rows', axis=1)
        .replace('', np.nan))

print(df)
Output:
    ID   Name   Col3   Col4    Col5
0  P39   Pipe  Test1    NaN   Test4
0  P39   Pipe  Test2    NaN   Test5
0  P39   Pipe  Test3    NaN     NaN
1  S32  Screw  Test6  Test8  Test10
1  S32  Screw  Test7  Test9  Test11
1  S32  Screw    NaN    NaN  Test12
1  S32  Screw    NaN    NaN  Test13
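If you want this to line up exactly with the desired output in the question (a fresh 0-6 index and blanks instead of NaN), a small cosmetic step could be appended:
# optional: renumber the rows and show blanks instead of NaN,
# matching the desired output shown in the question
df = df.reset_index(drop=True).fillna('')
print(df)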