How to split comma separated text into columns on pandas dataframe?

Time:04-12

I have a dataframe where one of the columns has its items separated with commas. It looks like:

Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e

My goal is to create a matrix whose header contains all the unique values from column Data, that is [a,b,c,d,e], and whose rows hold a flag indicating whether each value is present in that row. The matrix should look like this:

Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1

To split column Data, what I did is:

df['Data'].str.split(',', expand = True)

Then I don't know how to proceed to allocate the flags to each of the columns.

CodePudding user response:

Maybe you can try this without pivot.

Create the dataframe.

import pandas as pd
import io

s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''

df = pd.read_csv(io.StringIO(s), sep = r"\s+")

We can use pandas.Series.str.split with the expand argument set to True, then call value_counts on each row by applying it with axis = 1.

Finally, fill the missing values with zero using fillna and convert the data to integers with astype(int).

df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)

Output
    a   b   c   d   e
0   1   1   1   0   0
1   1   0   1   1   0
2   0   0   0   1   1
3   1   0   0   0   1
4   1   1   1   1   1

Then concatenate it with the original column.

new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)

Output
    Data        a   b   c   d   e
0   a,b,c       1   1   1   0   0
1   a,c,d       1   0   1   1   0
2   d,e         0   0   0   1   1
3   a,e         1   0   0   0   1
4   a,b,c,d,e   1   1   1   1   1
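
For comparison, pandas also has a built-in for this pattern, Series.str.get_dummies, which splits each string on a separator and returns one indicator column per unique token. A minimal sketch, assuming the same sample data as above (the names flags and result are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One 0/1 indicator column per unique comma-separated token.
flags = df["Data"].str.get_dummies(sep=",")

# Put the indicator columns next to the original column.
result = pd.concat([df, flags], axis=1)
print(result)
```

This skips the intermediate split and value_counts steps, so it may be the simpler option when the separator is a fixed string.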

CodePudding user response:

If you split the strings into lists, then explode them, it makes pivot possible.

(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))

Output

data_list  a  b  c  d  e
Data                    
a,b,c      1  1  1  0  0
a,b,c,d,e  1  1  1  1  1
a,c,d      1  0  1  1  0
a,e        1  0  0  0  1
d,e        0  0  0  1  1
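
Note that the pivoted result is ordered by the sorted Data values rather than by the original row order. A sketch of one way to get the same flags back in the original order, swapping pivot_table for pd.crosstab and then reindexing. It assumes the values in Data are unique, and exploded and flags are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One row per (original string, token) pair.
exploded = df.assign(data_list=df["Data"].str.split(",")).explode("data_list")

# Count each pair; the result is indexed by the sorted Data values.
flags = pd.crosstab(exploded["Data"], exploded["data_list"])

# Restore the original row order (assumes Data values are unique).
flags = flags.reindex(df["Data"])
print(flags)
```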

CodePudding user response:

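You could apply a custom count function for each key. Note that str.count counts substring occurrences, so this relies on each key being a single character that appears at most once per row: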

for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
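
If the keys are not known in advance, or could be longer strings that contain one another, a variation is to split first and test membership instead of counting substrings. A minimal sketch under those assumptions (tokens and keys are hypothetical names, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# Split each string into a list of tokens once, up front.
tokens = df["Data"].str.split(",")

# Derive the keys from the data instead of hard-coding them.
keys = sorted(set(tokens.explode()))

for k in keys:
    # 1 if the token is present in the row, else 0 (exact match, not substring).
    df[k] = tokens.apply(lambda items, k=k: int(k in items))

print(df)
```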