Home > OS >  Split string with missing data/levels over several columns
Split string with missing data/levels over several columns

Time:11-22

Say we have this df:

df = pd.DataFrame({
        'string': ['blue',
                   'blue,red',
                   'red',
                   'purple',
                   'blue,red,green,yellow,magenta,purple',
                   '',
                   'yellow,purple']})

Say we want to split these strings at each comma and put them in new columns. As found here (How can I separate text into multiple values in a CSV file using Python?) and here (Split one column into multiple columns by multiple delimiters in Pandas), we can use str.split so:

df[['blue', 'red', 'green', 'yellow', 'magenta', 'purple']] = df['string'].str.split(',',expand=True)
df

Result:

    string                                  blue    red     green   yellow  magenta purple

0   blue                                    blue    None    None    None    None    None
1   blue,red                                blue    red     None    None    None    None
2   red                                     red     None    None    None    None    None
3   purple                                  purple  None    None    None    None    None
4   blue,red,green,yellow,magenta,purple    blue    red     green   yellow  magenta purple
5                                                   None    None    None    None    None
6   yellow,purple                           yellow  purple  None    None    None    None

Indices 0, 1, and 4 work as I want, with the colors correctly categorised in the new columns.

For the other indices, colors are misclassified. Note the examples linked above do not solve my issue since they have no missing data points/levels. How do I fix this? (And also, add a 'None' for index 5, column 'blue'?)

Many thanks

CodePudding user response:

Use Series.str.get_dummies:

df1 = df.join(df['string'].str.get_dummies(','))
print (df1)
                                 string  blue  green  magenta  purple  red  \
0                                  blue     1      0        0       0    0   
1                              blue,red     1      0        0       0    1   
2                                   red     0      0        0       0    1   
3                                purple     0      0        0       1    0   
4  blue,red,green,yellow,magenta,purple     1      1        1       1    1   
5                                           0      0        0       0    0   
6                         yellow,purple     0      0        0       1    0   

   yellow  
0       0  
1       0  
2       0  
3       0  
4       1  
5       0  
6       1  

If need also values instead 1 and None instead 0 add:

df1 = df['string'].str.get_dummies(',')
df = df.join(pd.DataFrame(np.where(df1, df1.columns.to_series(),None), 
                          index=df1.index,
                          columns=df1.columns))

print (df)
                                 string  blue  green  magenta  purple   red  \
0                                  blue  blue   None     None    None  None   
1                              blue,red  blue   None     None    None   red   
2                                   red  None   None     None    None   red   
3                                purple  None   None     None  purple  None   
4  blue,red,green,yellow,magenta,purple  blue  green  magenta  purple   red   
5                                        None   None     None    None  None   
6                         yellow,purple  None   None     None  purple  None   

   yellow  
0    None  
1    None  
2    None  
3    None  
4  yellow  
5    None  
6  yellow  

CodePudding user response:

I would do it following way

import pandas as pd
df = pd.DataFrame({
        'string': ['blue',
                   'blue,red',
                   'red',
                   'purple',
                   'blue,red,green,yellow,magenta,purple',
                   '',
                   'yellow,purple']})
def convert_to_dict(string):
    return dict((i,i) for i in string.split(",") if i)
df2 = df['string'].apply(convert_to_dict).apply(pd.Series)
finaldf = pd.concat([df,df2],axis=1)
print(finaldf)

output

                                 string  blue  red  purple  green  yellow  magenta
0                                  blue  blue  NaN     NaN    NaN     NaN      NaN
1                              blue,red  blue  red     NaN    NaN     NaN      NaN
2                                   red   NaN  red     NaN    NaN     NaN      NaN
3                                purple   NaN  NaN  purple    NaN     NaN      NaN
4  blue,red,green,yellow,magenta,purple  blue  red  purple  green  yellow  magenta
5                                         NaN  NaN     NaN    NaN     NaN      NaN
6                         yellow,purple   NaN  NaN  purple    NaN  yellow      NaN

Explanation: Important part is converting string to dict so for each key-value pair key is equal value. Then I convert (single pandas.Series holding dicts) into pandas.DataFrame and pandas.concat it with original pandas.DataFrame.

CodePudding user response:

import pandas as pd
from typing import List
from pandas import DataFrame    


# your input
df = pd.DataFrame({
        'string': ['blue',
                   'blue,red',
                   'red',
                   'purple',
                   'blue,red,green,yellow,magenta,purple',
                   '',
                   'yellow,purple']})


# define columns
input_col = "string"    
encoded_col = "string_enriched"

# unique colors
unique_colors: List[str] = sorted([x for x in set(sum([x.split(",") for x in df[input_col].tolist()], [])) if x])

# create column with unique colors encoded
df[encoded_col] = df[input_col].apply(lambda x: (1 if w in x.split(",") else 0 for w in unique_colors))

# enrich encoded column swapping 0,1 to colors
df[encoded_col] = df.apply(lambda row: (w[1] if w[0] else "" for w in zip(row[encoded_col], unique_colors)), axis=1)

# explode column into multiple columns
dx = pd.DataFrame(df[encoded_col].to_list(), columns = unique_colors)

# concat two dataframes horizontaly
df_out: DataFrame = pd.concat([df[input_col], dx], axis=1)
  • Related