I have multiple strings that contain dates formatted with unnecessary whitespace. Examples:
- "1 . J U L Y 1 9 5 0"
- "1 8 . A P R I L 1 9 8 0"
- "Hello world, today is: 2 4 . J A N U A R Y 2 0 0 0"
- "D E C I S I O N: 1 3 . D E C E M B E R 2 0 1 8 / P U B L I S H E D: 1 4 . D E C E M B E R 2 0 1 8" (edit)
Using Python 3.10, how can I replace these dates with a correctly formatted version:
- "1. JULY 1950"
- "18. APRIL 1980"
- "Hello world, today is: 24. JANUARY 2000"
- "D E C I S I O N: 13. DECEMBER 2018 / P U B L I S H E D: 14. DECEMBER 2018" (edit)
I have a regex to find the dates, but I am unsure how to proceed from here:
^\s*\d+\s+[\S\s]*\s+\d{1}\s*\d{1}\s*\d{1}\s*\d{1}\s*$
CodePudding user response:
Your regex should match the two kinds of spaces to be removed:
(?<=\d) (?=[\.\d])
: a space found between two digits, or between a digit and a dot
(?<=[A-Z]) (?=[A-Z])
: a space found between two capital letters
Here's the full regex:
(?<=\d) (?=[\.\d])|(?<=[A-Z]) (?=[A-Z])
Your Python code should look like this:
import re
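your_strings = [
    "1 . J U L Y 1 9 5 0",
    "1 8 . A P R I L 1 9 8 0",
    "Hello world, today is: 2 4 . J A N U A R Y 2 0 0 0",
    "D E C I S I O N: 1 3 . D E C E M B E R 2 0 1 8 / P U B L I S H E D: 1 4 . D E C E M B E R 2 0 1 8",
]
# Remove a space between two digits (or a digit and a dot), or between two capital letters.
pattern = r"(?<=\d) (?=[\.\d])|(?<=[A-Z]) (?=[A-Z])"
[re.sub(pattern, '', string) for string in your_strings]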
Output:
['1. JULY 1950',
'18. APRIL 1980',
'Hello world, today is: 24. JANUARY 2000',
'DECISION: 13. DECEMBER 2018 / PUBLISHED: 14. DECEMBER 2018']
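Note that the pattern above also collapses spaced capitals outside of dates ("D E C I S I O N" becomes "DECISION"), whereas the question asks for only the dates to be changed. One possible way around this, as a minimal sketch: first locate each date with a loose "shape" pattern, then apply the lookaround substitution only inside that match. The names DATE_SHAPE, INNER_SPACES and fix_dates_only are hypothetical, and the date shape (one or two digits, a dot, a word in capitals, four digits) is an assumption based on the examples in the question.

```python
import re

# Assumed date shape: 1-2 digits, a dot, a capitalised word, 4 digits,
# each character optionally separated by a single space.
DATE_SHAPE = re.compile(r"\d(?: ?\d)? ?\. ?[A-Z](?: ?[A-Z])+ ?\d ?\d ?\d ?\d")

# The lookaround pattern from the answer above.
INNER_SPACES = re.compile(r"(?<=\d) (?=[\.\d])|(?<=[A-Z]) (?=[A-Z])")


def fix_dates_only(text):
    """Collapse the extra spaces inside date-like matches and leave the rest untouched."""
    return DATE_SHAPE.sub(lambda m: INNER_SPACES.sub("", m.group(0)), text)


print(fix_dates_only("D E C I S I O N: 1 3 . D E C E M B E R 2 0 1 8 / P U B L I S H E D: 1 4 . D E C E M B E R 2 0 1 8"))
# D E C I S I O N: 13. DECEMBER 2018 / P U B L I S H E D: 14. DECEMBER 2018
```

Because the shape pattern accepts any capitalised word as the month, it will also rewrite non-dates such as "1 . F O O 1 2 3 4"; matching explicit month names is stricter.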
CodePudding user response:
You can use
import re
# Day (group 1), month name (group 2) and year (group 3), with an optional whitespace between any two characters.
pattern = re.compile(r"(\d(?:\s?\d)?\s?\.)\s?((?:j\s?a\s?n|f\s?e\s?b\s?r)\s?u\s?a\s?r\s?y|m\s?a\s?(?:r\s?c\s?h|y)|a\s?p\s?r\s?i\s?l|j\s?u\s?(?:n\s?e|l\s?y)|a\s?u\s?g\s?u\s?s\s?t|o\s?c\s?t\s?o\s?b\s?e\s?r|(?:s\s?e\s?p\s?t|n\s?o\s?v|d\s?e\s?c)\s?e\s?m\s?b\s?e\s?r)\s?(\d\s?\d\s?\d\s?\d)", re.I)
strs = ["1 . J U L Y 1 9 5 0", "1 8 . A P R I L 1 9 8 0", "Hello world, today is: 2 4 . J A N U A R Y 2 0 0 0"]
for text in strs:
    # Strip all whitespace inside each group, then join the groups with single spaces.
    print(pattern.sub(lambda x: f'{"".join(x.group(1).split())} {"".join(x.group(2).split())} {"".join(x.group(3).split())}', text))
Output:
1. JULY 1950
18. APRIL 1980
Hello world, today is: 24. JANUARY 2000
Details:
(\d(?:\s?\d)?\s?\.) - Group 1: day, one or two digits followed by a dot
\s? - an optional whitespace
((?:j\s?a\s?n|f\s?e\s?b\s?r)\s?u\s?a\s?r\s?y|m\s?a\s?(?:r\s?c\s?h|y)|a\s?p\s?r\s?i\s?l|j\s?u\s?(?:n\s?e|l\s?y)|a\s?u\s?g\s?u\s?s\s?t|o\s?c\s?t\s?o\s?b\s?e\s?r|(?:s\s?e\s?p\s?t|n\s?o\s?v|d\s?e\s?c)\s?e\s?m\s?b\s?e\s?r) - Group 2: a month pattern
\s? - an optional whitespace
(\d\s?\d\s?\d\s?\d) - Group 3: a year pattern, four digits.
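A hand-written month pattern like the one above is easy to get wrong (a single missing \s? makes a month stop matching). As a rough sketch of the same idea, the month alternation could instead be generated from a list of month names. The names MONTHS, spaced, DATE_RE and normalize_dates are hypothetical and not part of the answer above.

```python
import re

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def spaced(word):
    """Return a regex for `word` that allows one optional whitespace between letters."""
    return r"\s?".join(re.escape(ch) for ch in word)


# Group 1: day with dot, group 2: month name, group 3: four-digit year.
DATE_RE = re.compile(
    r"(\d(?:\s?\d)?\s?\.)\s?("
    + "|".join(spaced(month) for month in MONTHS)
    + r")\s?(\d\s?\d\s?\d\s?\d)",
    re.IGNORECASE,
)


def normalize_dates(text):
    """Rewrite every spaced-out date in `text` as 'D. MONTH YYYY'."""
    def repl(match):
        # Remove all whitespace inside each group, then join groups with single spaces.
        return " ".join("".join(group.split()) for group in match.groups())
    return DATE_RE.sub(repl, text)


for text in ["1 . J U L Y 1 9 5 0", "1 4 . D E C E M B E R 2 0 1 8"]:
    print(normalize_dates(text))
```

Since only real month names match, text such as "D E C I S I O N:" in front of a date is left unchanged.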