Removing repeated commas from Pandas Dataframe Column in other words I just need the text from the c-CodePudding

I have this dataframe with the 'Text' Column, I would like to create a column such as 'Cleaned Col' essentially using regex. I looked at different patterns such as this r'\s*,*([^(a-zA-Z)]*)' but I am not getting the right outcome. Thanks in advance!

             Text                |    Cleaned Col      |
-------------------------------------------------------
, , , Apples , , , Hard Work , , | Apples, Hard Work   |
, , , , , , , , Apples , , , , , | Apples              |
Apples , , Watermelon , , , , , ,| Apples, Watermelon  |
, , , , , , , , , , , , , , , , ,|                     |

Updated

CodePudding user response：

Use Series.str.findall for get words and join by comma:

df['Cleaned Col'] = df['Text'].str.findall('\w ').str.join(', ')
print (df)
                                      Text         Cleaned Col
0      , , , Apples , , , Bananas , , ,        Apples, Bananas
1    , , , , , , , , Apples , , , , ,                   Apples
2        Apples , , Watermelon , , , , , ,  Apples, Watermelon
3  , , , , , , , , , , , , , , , , ,

CodePudding user response：

You could try replacing the commas with spaces, then clearing out the left and right spaces and replacing the middle spaces with a comma:

df['Cleaned Col'] = df['Text'].apply(lambda x: x.replace(',', ' ').lstrip().rstrip().replace(' ', ', ')

CodePudding user response：

Since your fields are comma-delimited you can use

# If the fields CANNOT contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,] ').str.join(', ')

# If the fields can contain whitespace:
df['Cleaned Col'] = df['Text'].str.findall(r'[^\s,](?:[^,]*[^\s,])?').str.join(', ')

The regex extracts all found matches and .str.join(', ') joins the resulting list items into a single string. The regex (see its demo) means:

[^\s,] - one or more chars other than whitespace and comma
[^\s,] - a single char other than whitespace and comma
(?:[^,]*[^\s,])? - an optional occurrence of any zero or more chars other than a comma and then a char other than whitespace and comma.

If you have your commas padded with spaces and you really want to use Series.str.replace, you could use

df['Cleaned Col'] = df['Text'].str.replace(r'^[\s,] |[\s,] $|(\s)*(,)[\s,]*', r'\2\1', regex=True)

See this regex demo.

Details:

^[\s,] - one or more whitespaces or commas at the start of string
[\s,] $ - one or more whitespaces or commas at the end of string
(\s)*(,)[\s,]* - zero or more whitespaces (the last one matched is kept in Group 1, \1), then a comma (captured into Group 2, \2) and then zero or more whitespace or comma chars.

The replacement is Group 2 Group 1 values.