Home > other >  Split Pandas column of lists into multiple columns based on value
Split Pandas column of lists into multiple columns based on value

Time:10-10

I have a Pandas Dataframe that has a list column. I'd like to split this list column into multiple column based on the value. Returning yes_value or no_value for each record based on the column name.

Example input:

id | values
---|----------
1  | [A,B,C,D]
2  | [D,E,F]
3  | [A,D]
4  | [K]

Expected output:

id | values   |  A    |   B   |   C   |   D   |   E   |   F   |    K  |
---|----------|-------|-------|-------|-------|-------|-------|-------|
1  | [A,B,C,D]| yes_A | yes_B | yes_C | yes_D |  no_E |  no_F |  no_K |
2  | [D,E,F]  | no_A  | no_B  | no_C  | yes_D | yes_E | yes_F |  no_K |
3  | [A,D]    | yes_A | no_B  | no_C  | yes_D |  no_E |  no_F |  no_K | 
4  | [K]      | no_A  | no_B  | no_C  |  no_D |  no_E |  no_F | yes_K | 

CodePudding user response:

Code:

#Input
df = pd.DataFrame({'id':[1,2,3,4], 'Values':[['A','B','C','D'], ['D','E','F'], ['A','D'],  ['K']]})

#STEP 1 merging the Values column all lists into one and findout the unique values using set
#so the output will be {'A', 'B', 'C', 'D', 'E', 'F', 'K'} and then looping on it as below
for i in sorted(set(sum(df['Values'].tolist(),[]))):  

    #STEP 2 Creating new column and check if column in list or not
    df[i] = df['Values'].apply(lambda x: f'yes_{i}' if i in x else f'no_{i}')
df

Output:

    id  Values          A       B       C       D       E       F    K
0   1   [A, B, C, D]    yes_A   yes_B   yes_C   yes_D   no_E    no_F    no_K
1   2   [D, E, F]       no_A    no_B    no_C    yes_D   yes_E   yes_F   no_K
2   3   [A, D]          yes_A   no_B    no_C    yes_D   no_E    no_F    no_K
3   4   [K]             no_A    no_B    no_C    no_D    no_E    no_F    yes_K

CodePudding user response:

You can use a crosstab to reshape:

df2 = df.explode('values')
df3 = pd.crosstab(df2['id'], df2['values']).replace({0: 'no_', 1: 'yes_'})

out = df.merge(df3.add(df3.columns), left_on='id', right_index=True)

Or str.get_dummies:

df2 = df['values'].agg('|'.join).str.get_dummies().replace({0: 'no_', 1: 'yes_'})
out = df.join(df2.add(df2.columns))

output:

   id        values      A      B      C      D      E      F      K
0   1  [A, B, C, D]  yes_A  yes_B  yes_C  yes_D   no_E   no_F   no_K
1   2     [D, E, F]   no_A   no_B   no_C  yes_D  yes_E  yes_F   no_K
2   3        [A, D]  yes_A   no_B   no_C  yes_D   no_E   no_F   no_K
3   4           [K]   no_A   no_B   no_C   no_D   no_E   no_F  yes_K
  • Related