For example, you have "blablabla 2342345, neemememem 5688234 hhojvz 345, yoea".
The output should look like this: "blablabla 2342345, neemememem 5688234, hhojvz 345, yoea"
If a number is already followed by a comma, just skip it.
Note: There is a lot of such text in a dataframe, so ideally the solution would be in pandas. All numbers and text are unique (no duplicates). The number of digits in each number is random.
text |
---|
blablabla 2342345 neemememem 5688234 hhojvz 345 yoea |
asdffgh 645655 neemememem 5688234 hhojvz 345 yoeablablabla 2342345 neemememem 5688234 hhojvz 345 yoeablablabla 2342345 ghhjfg 777777 hhojvz 345 ertert 698666666 neemememem 5688234 hhojvz 345 yoea |
blablabla 2342345 neemememem 5688234 hhojvz 345 yoeablablabla 2342345 neemememem 5688234 hhojvz 345 yoeablablabla 2342345 neemememem 5688234 hhojvz 345 yoeablablabla 2342345 neemememem 5688234 hhojvz 345 yoeablablabla |
5688234 hhojvz 345 yoeablablabla 2342345 neemememem 5688234 hhojvz 345 yoeablablabla 2342345 neemememem 5688234 hhojvz 345 yoeablablabla 2342345 neemememem 5688234 hhojvz 345 yoea |
blablabla 2342345 neemememem 5688234 hhojvz 345 yoea |
sdf 2345 |
CodePudding user response:
You can use a regex:
df['text'] = df['text'].str.replace(r'(\d+)(?!,)\b', r'\1,', regex=True)
How it works:
(\d+) # capture one or more digits
(?!,) # not followed by a comma
\b # ensure a word boundary
Replace with: the captured group (\1) and a comma
output:
text
0 blablabla 2342345, neemememem 5688234, hhojvz 345, yoea
1 asdffgh 645655, neemememem 5688234, hhojvz 345, yoeablablabla 2342345, neemememem 5688234, hhojvz 345, yoeablablabla 2342345, ghhjfg 777777, hhojvz 345, ertert 698666666, neemememem 5688234, hhojvz 345, yoea
2 blablabla 2342345, neemememem 5688234, hhojvz 345, yoeablablabla 2342345, neemememem 5688234, hhojvz 345, yoeablablabla 2342345, neemememem 5688234, hhojvz 345, yoeablablabla 2342345, neemememem 5688234, hhojvz 345, yoeablablabla
3 5688234, hhojvz 345, yoeablablabla 2342345, neemememem 5688234, hhojvz 345, yoeablablabla 2342345, neemememem 5688234, hhojvz 345, yoeablablabla 2342345, neemememem 5688234, hhojvz 345, yoea
4 blablabla 2342345, neemememem 5688234, hhojvz 345, yoea
5 sdf 2345,
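The output above comes from rows that contain no commas at all. To check the "skip if there is already a comma" case, here is a minimal sketch on a small dataframe. The sample rows and the `result` column name are assumptions made for illustration; the first row mirrors the example from the question.

```python
import pandas as pd

# Hypothetical sample data; some numbers already have a trailing comma.
df = pd.DataFrame({
    "text": [
        "blablabla 2342345, neemememem 5688234 hhojvz 345, yoea",
        "sdf 2345",
    ]
})

# Append a comma to every run of digits that ends at a word boundary
# and is not already followed by a comma.
df["result"] = df["text"].str.replace(r"(\d+)(?!,)\b", r"\1,", regex=True)

print(df["result"].tolist())
```

Numbers that already end in a comma are left untouched, because the lookahead rejects the full match and the word boundary rejects every shorter, backtracked match.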
Alternative regex: (\d+)(?!,)(?!\d). This drops the word-boundary condition and still avoids transforming '123,' into '12,3,'.
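The two patterns only differ when digits are glued to letters. The sketch below compares them with the standard `re` module. The sample strings and the helper name `add_commas` are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: digits must end at a word boundary.
WORD_BOUNDARY = re.compile(r"(\d+)(?!,)\b")
# Alternative: digits must simply not be followed by another digit.
NO_DIGIT_AFTER = re.compile(r"(\d+)(?!,)(?!\d)")


def add_commas(pattern, text):
    """Append a comma after each match of the given compiled pattern."""
    return pattern.sub(r"\1,", text)


# Hypothetical inputs: an existing comma, and digits touching letters.
samples = ["id 123, next 45", "abc123def 45", "12kg 7"]

for text in samples:
    print(repr(text))
    print("  word boundary :", repr(add_commas(WORD_BOUNDARY, text)))
    print("  no digit after:", repr(add_commas(NO_DIGIT_AFTER, text)))
```

With the word boundary, digits directly followed by letters (as in 'abc123def' or '12kg') get no comma. The alternative adds one there too, so which pattern fits depends on whether such tokens occur in the data.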