Reformat a date contained in a string with Python by removing unnecessary white space (e.g.: "1

Time:01-28

I have multiple strings which contain dates formatted with unnecessary white space. Examples:

  • "1 . J U L Y 1 9 5 0"
  • "1 8 . A P R I L 1 9 8 0"
  • "Hello world, today is: 2 4 . J A N U A R Y 2 0 0 0"
  • "D E C I S I O N: 1 3 . D E C E M B E R 2 0 1 8 / P U B L I S H E D: 1 4 . D E C E M B E R 2 0 1 8" (edit)

Using Python 3.10, how can I replace these dates with a correctly formatted version:

  • "1. JULY 1950"
  • "18. APRIL 1980"
  • "Hello world, today is: 24. JANUARY 2000"
  • "D E C I S I O N: 13. DECEMBER 2018 / P U B L I S H E D: 14. DECEMBER 2018" (edit)

I have a regex to find the dates, but I am unsure how to proceed from here:

^\s*\d \s [\S\s]*\s \d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*$

CodePudding user response:

Your regex should match two parts to be replaced:

  • (?<=\d) (?=[\.\d]): a space that follows a digit and precedes another digit or a dot
  • (?<=[A-Z]) (?=[A-Z]): a space that sits between two uppercase letters
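
To see what each alternative contributes, the two lookaround patterns can be applied one after the other. This is only an illustrative sketch; the names `between_digits`, `between_letters` and `sample` are made up for the example and are not part of the answer's code.

```python
import re

# Alternative 1: a space preceded by a digit and followed by a digit or a dot.
between_digits = re.compile(r"(?<=\d) (?=[.\d])")
# Alternative 2: a space with an uppercase ASCII letter on both sides.
between_letters = re.compile(r"(?<=[A-Z]) (?=[A-Z])")

sample = "2 4 . J A N U A R Y 2 0 0 0"

# Only the spaces inside the day and the year are removed here.
digits_only = between_digits.sub("", sample)
print(digits_only)  # 24. J A N U A R Y 2000

# The second pass joins the letters of the month name.
both = between_letters.sub("", digits_only)
print(both)  # 24. JANUARY 2000
```

Because lookbehind and lookahead are zero-width, only the space itself is consumed, so the surrounding characters are kept when the match is replaced with an empty string.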

Here's the full regex:

(?<=\d) (?=[\.\d])|(?<=[A-Z]) (?=[A-Z])

Your Python code should look like this:

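import re
from pprint import pprint

your_strings = [
    "1 . J U L Y 1 9 5 0",
    "1 8 . A P R I L 1 9 8 0",
    "Hello world, today is: 2 4 . J A N U A R Y 2 0 0 0",
    "D E C I S I O N: 1 3 . D E C E M B E R 2 0 1 8 / P U B L I S H E D: 1 4 . D E C E M B E R 2 0 1 8",
]

# Match a space between digits (or a digit and a dot), or between two uppercase letters.
pattern = r"(?<=\d) (?=[\.\d])|(?<=[A-Z]) (?=[A-Z])"

pprint([re.sub(pattern, '', string) for string in your_strings])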

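Output (note that every run of spaced capital letters is joined, so "D E C I S I O N" becomes "DECISION" as well):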

['1. JULY 1950', 
 '18. APRIL 1980', 
 'Hello world, today is: 24. JANUARY 2000', 
 'DECISION: 13. DECEMBER 2018 / PUBLISHED: 14. DECEMBER 2018']


CodePudding user response:

You can use

import re

# Day (one or two digits plus a dot), month name and four-digit year,
# each allowing one optional whitespace character between its characters.
pattern = re.compile(r"(\d(?:\s?\d)?\s?\.)\s?((?:j\s?a\s?n|f\s?e\s?b\s?r)\s?u\s?a\s?r\s?y|m\s?a\s?(?:r\s?c\s?h|y)|a\s?p\s?r\s?i\s?l|j\s?u\s?(?:n\s?e|l\s?y)|a\s?u\s?g\s?u\s?s\s?t|o\s?c\s?t\s?o\s?b\s?e\s?r|(?:s\s?e\s?p\s?t|n\s?o\s?v|d\s?e\s?c)\s?e\s?m\s?b\s?e\s?r)\s?(\d\s?\d\s?\d\s?\d)", re.I)
strs = ["1 . J U L Y 1 9 5 0", "1 8 . A P R I L 1 9 8 0", "Hello world, today is: 2 4 . J A N U A R Y 2 0 0 0"]

for text in strs:
    # Strip the whitespace inside each captured group, then rejoin the groups with single spaces.
    print(pattern.sub(lambda m: " ".join("".join(g.split()) for g in m.groups()), text))


Output:

1. JULY 1950
18. APRIL 1980
Hello world, today is: 24. JANUARY 2000

Details:

  • (\d(?:\s?\d)?\s?\.) - Group 1: day, one or two digits with a dot
  • \s? - an optional whitespace
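  • ((?:j\s?a\s?n|f\s?e\s?b\s?r)\s?u\s?a\s?r\s?y|m\s?a\s?(?:r\s?c\s?h|y)|a\s?p\s?r\s?i\s?l|j\s?u\s?(?:n\s?e|l\s?y)|a\s?u\s?g\s?u\s?s\s?t|o\s?c\s?t\s?o\s?b\s?e\s?r|(?:s\s?e\s?p\s?t|n\s?o\s?v|d\s?e\s?c)\s?e\s?m\s?b\s?e\s?r) - Group 2: a month pattern, an English month name with an optional whitespace between letters (the \s? before e\s?m\s?b\s?e\s?r is needed so that spaced-out SEPTEMBER, NOVEMBER and DECEMBER match too)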
  • \s? - an optional whitespace
  • (\d\s?\d\s?\d\s?\d) - a year pattern, four digits.
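
The long month alternation is easy to mistype, so one possible variation is to generate it from a list of month names. The following is a minimal sketch under that assumption: `MONTHS`, `spaced`, `date_re` and `compact_dates` are hypothetical names introduced here, and the day and year parts reuse the groups described above.

```python
import re

# English month names; assumed to be the only month spellings in the input.
MONTHS = ("JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
          "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER")


def spaced(word):
    """Return a regex matching `word` with an optional whitespace between its characters."""
    return r"\s?".join(re.escape(ch) for ch in word)


month_re = "|".join(spaced(name) for name in MONTHS)
date_re = re.compile(
    rf"(\d(?:\s?\d)?\s?\.)\s?({month_re})\s?(\d\s?\d\s?\d\s?\d)",
    re.IGNORECASE,
)


def compact_dates(text):
    """Remove whitespace inside each matched date; text outside the dates is left untouched."""
    return date_re.sub(
        lambda m: " ".join("".join(g.split()) for g in m.groups()), text
    )


print(compact_dates(
    "D E C I S I O N: 1 3 . D E C E M B E R 2 0 1 8 / "
    "P U B L I S H E D: 1 4 . D E C E M B E R 2 0 1 8"
))
# D E C I S I O N: 13. DECEMBER 2018 / P U B L I S H E D: 14. DECEMBER 2018
```

Since only the matched dates are rewritten, the spaced-out words "D E C I S I O N" and "P U B L I S H E D" stay as they are, which is what the question's fourth example asks for.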