Convert dates in Python from a text file with text

Time:07-30

I have a UTF-8 text file containing a lot of text and dates in the format dd/mm/yyyy, with years ranging from 1970 to 2022. I want to read this file and convert the dates to yyyy-mm-dd format while keeping all the other text exactly as it is. How can I do this with Python? I don't mind using another tool (such as awk or sed), as long as the rest of the file is not affected.

Optionally, I also want to find dates without leading zeros in the day or month and convert them too. But first I want to display them (I'm not sure whether the file contains any such dates).

It's important not to convert other strings, so if the "year" is not from 1970 to 2022, don't convert the string.

I wrote this program, but it needs debugging: I don't know how to write the _repl function properly.

import re
import io


def _repl(s):
    x = s.split("/")
    if ((len(x) == 3) and (0 < int(x[0]) <= 31) and (0 < int(x[1]) <= 12) and (1970 <= int(x[2]) <= 2022)):
        return "{:04d}-{:02d}-{:02d}".format(int(x[2]), int(x[1]), int(x[0]))
    return x


with io.open("1.txt", mode="r", encoding="utf-8") as f:
    b = f.readlines()

c = list()
for line in b:
    _line = ""
    while (not (_line == line)):
        # _line = re.sub(pattern=r'([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})', repl=_repl, string=line)
        _line = re.sub(pattern=r'([0-9]{2})/([0-9]{2})/([0-9]{4})', repl=_repl, string=line)
    c.append(_line)

with io.open('2.txt', mode='w', encoding="utf-8") as f:
    for line in c:
        f.write("{}".format(line))

CodePudding user response:

The repl function receives a match object as its single argument. Consider the following simple example:

import re

def repl(m):
    day, month, year = m.groups()
    return '-'.join([year, month, day]) if 1970 <= int(year) <= 2022 else m.group()

text1 = "Date 01/01/1901 and 01/01/3001 are outside range"
text2 = "Year 2000 should not be changed"
text3 = "Date 01/12/1970 shall be changed"
print(re.sub(r'(\d{2})/(\d{2})/(\d{4})', repl, text1))
print(re.sub(r'(\d{2})/(\d{2})/(\d{4})', repl, text2))
print(re.sub(r'(\d{2})/(\d{2})/(\d{4})', repl, text3))

Output:

Date 01/01/1901 and 01/01/3001 are outside range
Year 2000 should not be changed
Date 1970-12-01 shall be changed

Explanation: I use tuple unpacking to get day, month, and year from m.groups(), then check whether the year (as an integer) is inside the range [1970, 2022]. If it is, I join year, month, and day with hyphens; otherwise I return the matched text as-is.
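
To make the explanation more concrete, here is a minimal sketch of what re.sub passes to the callback. The names show_match, seen, and sample are made up for illustration and are not part of the answer above. The callback is called once per non-overlapping match, and returning m.group() leaves that piece of text untouched.

```python
import re

# Hypothetical names for illustration: `seen` records what re.sub
# hands to the callback on each call.
seen = []

def show_match(m):
    # m.group() is the full matched text; m.groups() is a tuple of the
    # captured groups, here (day, month, year) as strings.
    seen.append((m.group(), m.groups()))
    # Returning the matched text unchanged leaves the input as-is.
    return m.group()

sample = "From 01/12/1970 to 04/11/2022"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', show_match, sample)

print(seen)
print(result == sample)
```

Because the groups arrive as strings, they must be converted with int() before any numeric comparison, as in the range check above.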

CodePudding user response:

Thanks to @Daweo for the answer. I changed it slightly so that it also accepts dates without leading zeros in the day or month.

This is my program:

import re
import io


def _repl(m):
    day, month, year = m.groups()
    if ((0 < int(day) <= 31) and (0 < int(month) <= 12) and (1970 <= int(year) <= 2022)):
        return "{:04d}-{:02d}-{:02d}".format(int(year), int(month), int(day))
    else:
        return m.group()


with io.open("1.txt", mode="r", encoding="utf-8") as f:
    b = f.readlines()

c = list()
for line in b:
    _line = re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=line)
    c.append(_line)

with io.open('2.txt', mode='w', encoding="utf-8") as f:
    for line in c:
        f.write("{}".format(line))

Here are some texts for testing:

text1 = "Date 01/01/1901 and 01/01/3001 are outside range"
text2 = "Year 2000 should not be changed"
text3 = "Date 01/12/1970 shall be changed"
text4 = "'Date' 99/99/1970 should not be changed"
text5 = "'Date' 13/13/1970 should not be changed"
text6 = "Date 1/2/1970 shall be changed"
text7 = "Dates 01/12/1972,04/11/2022 shall be changed"
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text1))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text2))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text3))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text4))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text5))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text6))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text7))

And the output is:

Date 01/01/1901 and 01/01/3001 are outside range
Year 2000 should not be changed
Date 1970-12-01 shall be changed
'Date' 99/99/1970 should not be changed
'Date' 13/13/1970 should not be changed
Date 1970-02-01 shall be changed
Dates 1972-12-01,2022-11-04 shall be changed
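
The question also asks to display dates without leading zeros before converting them. One possible way to do that is sketched below. It rests on assumptions: the function name find_unpadded_dates and the sample string are hypothetical, and the lookarounds (?<!\d) and (?!\d) are an addition that is not in the program above. They are there so that a string such as 1/2/19701 is not treated as the date 1/2/1970.

```python
import re

# Assumption: the lookarounds are an addition to the pattern used above;
# they stop the pattern from matching inside a longer run of digits.
DATE_RE = re.compile(r'(?<!\d)(\d{1,2})/(\d{1,2})/(\d{4})(?!\d)')

def find_unpadded_dates(text):
    """Return in-range dates whose day or month lacks a leading zero."""
    found = []
    for m in DATE_RE.finditer(text):
        day, month, year = m.groups()
        # Same range check as in _repl above.
        in_range = (0 < int(day) <= 31 and 0 < int(month) <= 12
                    and 1970 <= int(year) <= 2022)
        # A one-character day or month means the leading zero is missing.
        if in_range and (len(day) == 1 or len(month) == 1):
            found.append(m.group())
    return found

sample = "Dates 1/2/1970, 01/12/1972 and 4/11/2022; not 1/2/1901"
print(find_unpadded_dates(sample))
```

Running this over each line of the input file before the conversion would show whether any such dates exist. Note that the range check only bounds the day and month; it does not reject impossible calendar dates such as 31/02/2000.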