I have a DataFrame that needs to be split and cleaned. I'm trying to separate phrases into words using regular expressions, but I'm not getting exactly what I want. I also need to lowercase the words and remove extra characters (I wanted to do this with strip() and lower(), but I don't know where to apply them). Another problem is the NaN values: they need to be ignored, but they turn into lists. Right now my function looks like this:
def splitSentence(Sentence):
    words = re.split(';|,|/|&|\||-|:| ', str(Sentence))
    words.sort()
    # words.strip(' , !').lower()
    return words
df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})
df
      Name                                   Color
0     Mark                                     NaN
1      Ann                                     NaN
2     John                                     NaN
3     Elsa                    blue teal/ blue gray
4     Emma             blue| green|grey it changes
5   Andrew                               BLACK!!!!
6      Max                              blue&green
7     Rose        dichromatic: one blue| one green
8   Donald  green;very very orangey brown and blue
9     Hugh                       Hazel, Green,Gray
10    Alex                             dark-coffee
I apply my function to the dataframe and get this:
df['Color'].apply(lambda x: splitSentence(x))
0 [nan]
1 [nan]
2 [nan]
3 [, blue, blue, gray, teal]
4 [, blue, changes, green, grey, it]
5 [BLACK!!!!]
6 [blue, green]
7 [, , blue, dichromatic, green, one, one]
8 [and, blue, brown, green, orangey, very, very]
9 [, Gray, Green, Hazel]
10 [coffee, dark]
But I need to get this (without the square brackets):
0 NaN
1 NaN
2 NaN
3 blue, gray, teal
4 blue, changes, green, grey, it
5 black
6 blue, green
7 blue, dichromatic, green, one
8 and, blue, brown, green, orangey, very
9 gray, green, hazel
10 coffee, dark
Can you please tell me how I can fix my code? Thanks
CodePudding user response:
import pandas as pd
import numpy as np
import re
# Create DataFrame
df = pd.DataFrame({'Name': ['Mark', 'Ann', 'John', 'Elsa', 'Emma', 'Andrew', 'Max', 'Rose', 'Donald', 'Hugh', 'Alex'],
'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray', 'blue| green|grey it changes', 'BLACK!!!!', 'blue&green', 'dichromatic: one blue| one green', 'green;very very orangey brown and blue', 'Hazel, Green,Gray', 'dark-coffee']})
# Function
def splitSentence(Sentence):
    # Added ! to punctuation list (raw string so that \| is passed to re unchanged)
    words = re.split(r';|,|/|&|\||-|:|!| ', str(Sentence))
    words.sort()
    # Creating a new list of lowercase words without empty strings
    new_words_list = [x.lower() for x in words if x != '']
    # Joining all elements in the list with ','
    joined_string = ",".join(new_words_list)
    return joined_string
df['Color'].apply(lambda x: splitSentence(x))
0 nan
1 nan
2 nan
3 blue,blue,gray,teal
4 blue,changes,green,grey,it
5 black
6 blue,green
7 blue,dichromatic,green,one,one
8 and,blue,brown,green,orangey,very,very
9 gray,green,hazel
10 coffee,dark
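Note that the output above still differs from what the question asked for in three ways: missing values come back as the string 'nan' rather than NaN, repeated words such as blue or very are kept, and the words are joined with ',' instead of ', '. A minimal sketch of one way to close those gaps is below. The names split_sentence and SEPARATORS are made up for this example, and it assumes that sorting should happen after lowercasing.

```python
import re

import numpy as np
import pandas as pd

# One or more separator characters in a row count as a single split point,
# so 'blue| green' does not produce empty strings in the middle.
SEPARATORS = re.compile(r"[;,/&|:!\s-]+")


def split_sentence(sentence):
    """Return unique lowercase words joined by ', ', or NaN for missing input."""
    if pd.isna(sentence):
        return np.nan  # keep missing values as NaN instead of the string 'nan'
    # A set drops duplicates; 'if w' drops the empty string left by trailing separators.
    words = {w.lower() for w in SEPARATORS.split(sentence) if w}
    return ", ".join(sorted(words))


df = pd.DataFrame({'Color': [np.nan, np.nan, np.nan, 'blue teal/ blue gray',
                             'blue| green|grey it changes', 'BLACK!!!!', 'blue&green',
                             'dichromatic: one blue| one green',
                             'green;very very orangey brown and blue',
                             'Hazel, Green,Gray', 'dark-coffee']})

result = df['Color'].apply(split_sentence)
print(result)
```

Lowercasing before sorting matters here: sorting 'Hazel, Green,Gray' first and lowercasing afterwards happens to give the same order, but mixed-case input in general would sort uppercase words ahead of lowercase ones.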