Home > database >  How to remove only the spaces that occur after a digit, and before a letter with RegEx?
How to remove only the spaces that occur after a digit, and before a letter with RegEx?

Time:12-07

I have a pandas column, shown below, having days and hours in different languages. I would like to standardize this column to cumulative hours. Unfortunately there are spaces, and before I can do any further processing, I would like to remove only the spaces that occur after a digit, and before a letter. Hence keeping the space between the day and hour or between the hour and minute indicators.

If this problem is solved, then the next approach I would do is to split the column by the remaining space, and apply a custom mapper to process the rest of the data.

How can I remove only the spaces that occur after a digit, and before a letter?

import pandas as pd
L=['4 d 19 h',
 '6g 23h',
 '6g 23h',
 '6g 23h',
 '6g 23h',
 '5g 18h',
 '5g 11h',
 '4T 20Std',
 '5T 18Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 23Std',
 '6T 14Std',
 '1T 14Std',
 '4T 9Std',
 '1T 22Std',
 '20Std 50 Min',
 '9\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '9\xa0j 22\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 18\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '9\xa0j 22\xa0h',
 '9\xa0j 22\xa0h',
 '2\xa0h 8\xa0min',
 '2\xa0h 6\xa0min',
 '2\xa0h 9\xa0min',
 '5\xa0j 23\xa0h',
 '4\xa0j 20\xa0h',
 '4\xa0j 22\xa0h',
 '6\xa0j 20\xa0h',
 '2\xa0h 5\xa0min',
 '4\xa0j 19\xa0h',
 '5\xa0j 23\xa0h',
 '6\xa0j 11\xa0h',
 '3\xa0j 9\xa0h',
 '5\xa0j 16\xa0h',
 '5\xa0j 18\xa0h',
 '21\xa0h 8\xa0min',
 '1\xa0j 20\xa0h',
 '4\xa0j 21\xa0h',
 '3\xa0j 18\xa0h',
 '2\xa0j 17\xa0h',
 '9\xa0j 15\xa0h',
 '5\xa0j 20\xa0h',
 '2\xa0j 19\xa0h',
 '20\xa0h 31\xa0min',
 '5\xa0j 20\xa0h',
 '5\xa0j 19\xa0h',
 '6\xa0j 15\xa0h',
 '5\xa0j 23\xa0h',
 '4\xa0j 17\xa0h',
 '21\xa0h 8\xa0min',
 '4\xa0j 18\xa0h',
 '5\xa0j 20\xa0h',
 '5\xa0j 19\xa0h',
 '6\xa0j 22\xa0h',
 '4\xa0j 19\xa0h',
 '15\xa0h 44\xa0min',
 '9\xa0h 58\xa0min',
 '6\xa0j 22\xa0h',
 '21\xa0h 8\xa0min',
 '7\xa0j 14\xa0h',
 '6\xa0j',
 '5\xa0j 20\xa0h',
 '6\xa0j 17\xa0h',
 '6\xa0j 11\xa0h',
 '9\xa0j 23\xa0h',
 '2\xa0j 17\xa0h',
 '6\xa0j 11\xa0h',
 '5\xa0j 16\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 20\xa0h',
 '2\xa0j 8\xa0h',
 '6\xa0j 16\xa0h',
 '4\xa0j 12\xa0h',
 '6\xa0j 14\xa0h',
 '2\xa0j 20\xa0h',
 '3\xa0j 17\xa0h',
 '5\xa0j 20\xa0h',
 '7\xa0j 3\xa0h',
 '5\xa0j 18\xa0h',
 '1\xa0j 20\xa0h',
 '6\xa0j 4\xa0h',
 '1\xa0j 1\xa0h',
 '6\xa0j 21\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 19\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 14\xa0h',
 '5\xa0j 2\xa0h',
 '5\xa0j 16\xa0h',
 '9\xa0j 15\xa0h',
 '8\xa0j 23\xa0h',
 '1\xa0j 21\xa0h',
 '6\xa0j 4\xa0h',
 '1\xa0j 16\xa0h',
 '6\xa0j 18\xa0h',
 '18\xa0h 6\xa0min',
 '17\xa0h 38\xa0min',
 '5\xa0j 10\xa0h',
 '9\xa0h 52\xa0min',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 14\xa0h',
 '6\xa0j 1\xa0h',
 '5\xa0j 17\xa0h',
 '3\xa0j 17\xa0h',
 '1\xa0j 19\xa0h',
 '4\xa0j 22\xa0h',
 '3\xa0j',
 '1\xa0j 17\xa0h',
 '1\xa0j 10\xa0h',
 '4\xa0j 19\xa0h',
 '6\xa0j 7\xa0h',
 '8\xa0j 10\xa0h',
 '2\xa0j 19\xa0h',
 '9\xa0j 18\xa0h',
 '18\xa0h 23\xa0min',
 '5\xa0j 20\xa0h',
 '2\xa0j 16\xa0h',
 '18\xa0h 46\xa0min',
 '1\xa0j 19\xa0h',
 '3\xa0j 2\xa0h',
 '9\xa0j 20\xa0h',
 '9\xa0j 20\xa0h',
 '9\xa0j 20\xa0h',
 '6\xa0j 21\xa0h',
 '3\xa0j 17\xa0h',
 '3\xa0j 17\xa0h',
 '3\xa0j 17\xa0h',
 '2\xa0j 11\xa0h',
 '6\xa0j 1\xa0h',
 '6\xa0j 1\xa0h',
 '5\xa0j 19\xa0h',
 '5\xa0j 20\xa0h',
 '5\xa0j 20\xa0h',
 '4\xa0j 18\xa0h',
 '5\xa0j 20\xa0h',
 '5\xa0j 15\xa0h',
 '5\xa0j 20\xa0h',
 '9\xa0j 18\xa0h',
 '7\xa0j 17\xa0h',
 '6\xa0j 20\xa0h',
 '5\xa0j 10\xa0h',
 '14\xa0h 43\xa0min',
 '6\xa0j 5\xa0h',
 '5\xa0j 19\xa0h',
 '9\xa0j 18\xa0h',
 '1\xa0j 18\xa0h',
 '1\xa0j 18\xa0h',
 '6\xa0j 1\xa0h',
 '6\xa0j 22\xa0h',
 '3\xa0j 10\xa0h',
 '7\xa0j 20\xa0h',
 '3\xa0j 10\xa0h',
 '6\xa0j 19\xa0h',
 '1\xa0j 14\xa0h',
 '6\xa0j 18\xa0h',
 '1\xa0j 12\xa0h',
 '5\xa0j 1\xa0h',
 '2\xa0h 7\xa0min',
 '2\xa0h 7\xa0min',
 '20\xa0h 5\xa0min',
 '5\xa0j 20\xa0h',
 '9\xa0j 7\xa0h',
 '5\xa0j 19\xa0h',
 '15\xa0h 14\xa0min',
 '2\xa0h 5\xa0min',
 '2\xa0h 6\xa0min',
 '2\xa0h 6\xa0min',
 '2\xa0h 7\xa0min',
 '2\xa0h 5\xa0min',
 '8\xa0j 18\xa0h',
 '4\xa0j 10\xa0h',
 '5\xa0j 20\xa0h',
 '2\xa0j 19\xa0h',
 '1\xa0j 20\xa0h',
 '1\xa0j 14\xa0h',
 '6\xa0j 17\xa0h',
 '1\xa0j 22\xa0h',
 '1\xa0j 19\xa0h',
 '1\xa0j 12\xa0h',
 '6\xa0j 10\xa0h',
 '21\xa0h 8\xa0min',
 '5\xa0j 21\xa0h',
 '2\xa0h 10\xa0min',
 '1\xa0j 20\xa0h',
 '5\xa0j 20\xa0h',
 '3\xa0j 22\xa0h',
 '15\xa0h 12\xa0min',
 '6\xa0j 14\xa0h',
 '5\xa0j 20\xa0h',
 '1\xa0j 3\xa0h',
 '5\xa0j 21\xa0h',
 '6\xa0j 21\xa0h',
 '5\xa0j 19\xa0h',
 '6\xa0j 20\xa0h',
 '6\xa0j 19\xa0h',
 '6\xa0j 19\xa0h',
 '7\xa0j 3\xa0h',
 '4\xa0j 18\xa0h',
 '2\xa0j 20\xa0h',
 '5\xa0j 17\xa0h',
 '8\xa0j 18\xa0h',
 '5\xa0j 16\xa0h',
 '5\xa0j 16\xa0h',
 '4\xa0j 17\xa0h',
 '2\xa0j 18\xa0h',
 '16\xa0h 39\xa0min']
DATAFRAME=pd.DataFrame({'TIME':L})

I tried many RegEx expressions, but all remove the middle space as well or expect certain number of spaces.

Desired outcome:

'4 d 19 h'       --> '4d 19h'
'6g 23h'         --> '6g 23h'
'20Std 50 Min'   --> '20Std 50Min'
'1\xa0j 18\xa0h' --> '1j 18h',

CodePudding user response:

import re

re.sub(r'(?<=[0-9])[ \u00A0](?=[a-zA-Z])', '', '4 d 19 h')

Edited with an example, may add additional unicode characters from here as you wish like the above example https://jkorpela.fi/chars/spaces.html

for details on the regexp specifics refer to https://www.regular-expressions.info/lookaround.html

CodePudding user response:

If there can only be one space between the digit and the character, I would use the following regular expression: (\d )[\s|\xa0](\w)

So, in your code it would be something like this:

DATAFRAME=pd.DataFrame({'TIME':[re.sub(r"(\d )[\s|\xa0](\w )", r"\1\2", l) for l in L]})

Hope it helps.

  • Related