How to split phrases into words in a data frame by multiple delimiters, ignoring NaN?

I have a data frame that needs to be split and cleaned. I'm trying to separate phrases into words using regular expressions, but I'm not getting exactly what I want. In addition, I need to lowercase the words and remove extra characters (I wanted to do this with strip() and lower(), but I don't know where to apply them). Another problem is NaN values: they need to be ignored, but they become lists. Right now my function looks like this:

def splitSentence(Sentence):
    words = re.split(';|,|/|&|\||-|:| ', str(Sentence))
    words.sort()
    # words.strip(' , !').lower()
    return words

df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
                     'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})

df

    Name     Color
0   Mark     NaN
1   Ann      NaN
2   John     NaN
3   Elsa     blue teal/ blue gray
4   Emma     blue| green|grey it changes
5   Andrew   BLACK!!!!
6   Max      blue&green
7   Rose     dichromatic: one blue| one green
8   Donald   green;very very orangey brown and blue
9   Hugh     Hazel, Green,Gray
10  Alex     dark-coffee

I apply my function to the dataframe and get this:

df['Color'].apply(lambda x: splitSentence(x))

0                                              [nan]
1                                              [nan]
2                                              [nan]
3                         [, blue, blue, gray, teal]
4                 [, blue, changes, green, grey, it]
5                                        [BLACK!!!!]
6                                      [blue, green]
7           [, , blue, dichromatic, green, one, one]
8     [and, blue, brown, green, orangey, very, very]
9                             [, Gray, Green, Hazel]
10                                    [coffee, dark]

But I need to get this (without the square brackets):

0                                         NaN
1                                         NaN
2                                         NaN
3                            blue, gray, teal
4              blue, changes, green, grey, it
5                                       black 
6                                 blue, green
7               blue, dichromatic, green, one
8      and, blue, brown, green, orangey, very
9                          gray, green, hazel
10                               coffee, dark

Can you please tell me how I can fix my code? Thanks.

CodePudding user response:

import pandas as pd
import numpy as np
import re

# Create DataFrame
df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
                     'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})

# Function
def splitSentence(Sentence):

    # Raw string so the escaped pipe \| reaches the regex engine unchanged;
    # '!' is added to the delimiter list.
    # Note: str() turns NaN into the string 'nan'.
    words = re.split(r';|,|/|&|\||-|:|!| ', str(Sentence))
    words.sort()

    # Create a new list of lowercase words, dropping the empty strings
    new_words_list = [x.lower() for x in words if x != '']

    # Join all elements of the list with ','
    joined_string = ",".join(new_words_list)

    return joined_string
 

df['Color'].apply(lambda x: splitSentence(x))

Out[27]:  0                                        nan
          1                                        nan
          2                                        nan
          3                        blue,blue,gray,teal
          4                 blue,changes,green,grey,it
          5                                      black
          6                                 blue,green
          7             blue,dichromatic,green,one,one
          8     and,blue,brown,green,orangey,very,very
          9                           gray,green,hazel
          10                               coffee,dark
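
The output above still differs from what the question asks for in three ways: NaN becomes the string 'nan', repeated words such as 'blue' or 'very' are kept, and the words are joined without a space after the comma. Below is a minimal sketch of one way to close those gaps. The name split_sentence and the compiled pattern DELIMITERS are hypothetical and not part of the original answer, the character class is assumed to be equivalent to the delimiter list used above, and the small data frame is a shortened version of the one in the question.

```python
import re

import numpy as np
import pandas as pd

# Same delimiters as above, written as a character class (assumption:
# equivalent to the alternation in the answer). The trailing '+' treats
# runs such as '/ ' or '!!!!' as a single separator.
DELIMITERS = re.compile(r"[-;,/&|:! ]+")


def split_sentence(sentence):
    """Return unique lowercase words joined by ', ', or NaN for missing input."""
    if pd.isna(sentence):
        return np.nan  # leave missing values untouched
    # A set comprehension drops duplicates; 'if w' drops empty strings
    words = {w.lower() for w in DELIMITERS.split(sentence) if w}
    return ", ".join(sorted(words))


# Shortened version of the data frame from the question
df = pd.DataFrame({
    "Name": ["Elsa", "Andrew", "Mark"],
    "Color": ["blue teal/ blue gray", "BLACK!!!!", np.nan],
})
df["Color"] = df["Color"].apply(split_sentence)
print(df)
```

Checking pd.isna() before calling str() is what keeps missing values as NaN, and sorting after lowercasing keeps the order consistent when the input mixes upper and lower case.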