I have this dataframe with the 'Text' Column, I would like to create a column such as 'Cleaned Col' essentially using regex. I looked at different patterns such as this r'\s*,*([^(a-zA-Z)]*)'
but I am not getting the right outcome. Thanks in advance!
Text | Cleaned Col |
-------------------------------------------------------
, , , Apples , , , Hard Work , , | Apples, Hard Work |
, , , , , , , , Apples , , , , , | Apples |
Apples , , Watermelon , , , , , ,| Apples, Watermelon |
, , , , , , , , , , , , , , , , ,| |
Updated
CodePudding user response:
Use Series.str.findall
for get words and join by comma:
df['Cleaned Col'] = df['Text'].str.findall('\w ').str.join(', ')
print (df)
Text Cleaned Col
0 , , , Apples , , , Bananas , , , Apples, Bananas
1 , , , , , , , , Apples , , , , , Apples
2 Apples , , Watermelon , , , , , , Apples, Watermelon
3 , , , , , , , , , , , , , , , , ,
CodePudding user response:
You could try replacing the commas with spaces, then clearing out the left and right spaces and replacing the middle spaces with a comma:
df['Cleaned Col'] = df['Text'].apply(lambda x: x.replace(',', ' ').lstrip().rstrip().replace(' ', ', ')
CodePudding user response:
Since your fields are comma-delimited you can use
# If the fields CANNOT contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,] ').str.join(', ')
# If the fields can contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,](?:[^,]*[^\s,])?').str.join(', ')
The regex extracts all found matches and .str.join(', ')
joins the resulting list items into a single string. The regex (see its demo) means:
[^\s,]
- one or more chars other than whitespace and comma[^\s,]
- a single char other than whitespace and comma(?:[^,]*[^\s,])?
- an optional occurrence of any zero or more chars other than a comma and then a char other than whitespace and comma.
If you have your commas padded with spaces and you really want to use Series.str.replace
, you could use
df['Cleaned Col'] = df['Text'].str.replace(r'^[\s,] |[\s,] $|(\s)*(,)[\s,]*', r'\2\1', regex=True)
See this regex demo.
Details:
^[\s,]
- one or more whitespaces or commas at the start of string[\s,] $
- one or more whitespaces or commas at the end of string(\s)*(,)[\s,]*
- zero or more whitespaces (the last one matched is kept in Group 1,\1
), then a comma (captured into Group 2,\2
) and then zero or more whitespace or comma chars.
The replacement is Group 2 Group 1 values.