How to split comma separated text into columns on pandas dataframe?-CodePudding

I have a dataframe where one of the columns has its items separated with commas. It looks like:

Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e

My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e]. Then as rows a flag indicating if the value is at that particular row. The matrix should look like this:

Data	a	b	c	d	e
a,b,c	1	1	1	0	0
a,c,d	1	0	1	1	0
d,e	0	0	0	1	1
a,e	1	0	0	0	1
a,b,c,d,e	1	1	1	1	1

To separate column Data what I did is:

df['data'].str.split(',', expand = True)

Then I don't know how to proceed to allocate the flags to each of the columns.

CodePudding user response：

Maybe you can try this without pivot.

Create the dataframe.

import pandas as pd
import io

s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''

df = pd.read_csv(io.StringIO(s), sep = "\s ")

We can use pandas.Series.str.split with expand argument equals to True. And value_counts each rows with axis = 1.

Finally fillna with zero and change the data into integer with astype(int).

df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)

#
    a   b   c   d   e
0   1   1   1   0   0
1   1   0   1   1   0
2   0   0   0   1   1
3   1   0   0   0   1
4   1   1   1   1   1

And then merge it with the original column.

new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)

#
    Data        a   b   c   d   e
0   a,b,c       1   1   1   0   0
1   a,c,d       1   0   1   1   0
2   d,e         0   0   0   1   1
3   a,e         1   0   0   0   1
4   a,b,c,d,e   1   1   1   1   1

CodePudding user response：

If you split the strings into lists, then explode them, it makes pivot possible.

(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))

Output

data_list  a  b  c  d  e
Data                    
a,b,c      1  1  1  0  0
a,b,c,d,e  1  1  1  1  1
a,c,d      1  0  1  1  0
a,e        1  0  0  0  1
d,e        0  0  0  1  1

CodePudding user response：

You could apply a custom count function for each key:

for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)