I have a UTF-8 text file with lots of text and dates in the format dd/mm/yyyy, with years ranging from 1970 to 2022. I want to read this file and convert the dates to yyyy-mm-dd format, while keeping all the other text as it is. Do you know how to do it with Python? I don't mind using another tool (such as awk or sed), as long as the rest of the file is not affected.
Optionally, I also want to find dates without leading zeros in the day or month and convert them too. But first I want to display them (I'm not sure whether any such dates exist).
It's important not to convert other strings, so if the "year" is not from 1970 to 2022, don't convert the string.
I wrote this program, but it needs debugging: I don't know how to write the _repl function properly.
import re
import io
def _repl(s):
    x = s.split("/")
    if ((len(x) == 3) and (0 < int(x[0]) <= 31) and (0 < int(x[1]) <= 12) and (1970 <= int(x[2]) <= 2022)):
        return "{:04d}-{:02d}-{:02d}".format(int(x[2]), int(x[1]), int(x[0]))
    return x
with io.open("1.txt", mode="r", encoding="utf-8") as f:
    b = f.readlines()
c = list()
for line in b:
    _line = ""
    while (not (_line == line)):
        # _line = re.sub(pattern=r'([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})', repl=_repl, string=line)
        _line = re.sub(pattern=r'([0-9]{2})/([0-9]{2})/([0-9]{4})', repl=_repl, string=line)
    c.append(_line)
with io.open('2.txt', mode='w', encoding="utf-8") as f:
    for line in c:
        f.write("{}".format(line))
CodePudding user response:
The repl function receives a match object as its single argument. Consider the following simple example:
import re
def repl(m):
    day, month, year = m.groups()
    return '-'.join([year, month, day]) if 1970 <= int(year) <= 2022 else m.group()
text1 = "Date 01/01/1901 and 01/01/3001 are outside range"
text2 = "Year 2000 should not be changed"
text3 = "Date 01/12/1970 shall be changed"
print(re.sub(r'(\d{2})/(\d{2})/(\d{4})', repl, text1))
print(re.sub(r'(\d{2})/(\d{2})/(\d{4})', repl, text2))
print(re.sub(r'(\d{2})/(\d{2})/(\d{4})', repl, text3))
output
Date 01/01/1901 and 01/01/3001 are outside range
Year 2000 should not be changed
Date 1970-12-01 shall be changed
Explanation: I use tuple unpacking to get day, month, and year from the match, then check whether the year (as a numeric value) is in the range [1970, 2022]. If it is, I build a hyphen-separated year-month-day string; otherwise I return what was matched as-is.
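Note that a replacement function that checks only the year would still rewrite a string such as 31/02/1999, which is not a real calendar date. If that matters for your file, here is a minimal sketch that lets datetime.date do the validation. The names DATE_RE and repl_checked are hypothetical, chosen only for this illustration, and the pattern assumes two-digit days and months.

```python
import re
from datetime import date

# Hypothetical names for illustration; the pattern assumes two-digit day and month.
DATE_RE = re.compile(r'(\d{2})/(\d{2})/(\d{4})')

def repl_checked(m):
    day, month, year = (int(g) for g in m.groups())
    # Leave anything outside the expected year range untouched.
    if not 1970 <= year <= 2022:
        return m.group()
    try:
        # date() raises ValueError for impossible dates such as 31 February.
        return date(year, month, day).isoformat()
    except ValueError:
        return m.group()

print(DATE_RE.sub(repl_checked, "Valid 28/02/1999, invalid 31/02/1999"))
# prints: Valid 1999-02-28, invalid 31/02/1999
```

Whether to reject impossible dates or convert them anyway depends on what the file contains, so treat this as one possible choice rather than a requirement.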
CodePudding user response:
Thanks to @Daweo for the answer. I changed it slightly so that it also accepts dates without leading zeros in the day or the month.
This is my program:
import re
import io
def _repl(m):
    day, month, year = m.groups()
    if ((0 < int(day) <= 31) and (0 < int(month) <= 12) and (1970 <= int(year) <= 2022)):
        return "{:04d}-{:02d}-{:02d}".format(int(year), int(month), int(day))
    else:
        return m.group()
with io.open("1.txt", mode="r", encoding="utf-8") as f:
    b = f.readlines()
c = list()
for line in b:
    _line = re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=line)
    c.append(_line)
with io.open('2.txt', mode='w', encoding="utf-8") as f:
    for line in c:
        f.write("{}".format(line))
Here are some texts for testing:
text1 = "Date 01/01/1901 and 01/01/3001 are outside range"
text2 = "Year 2000 should not be changed"
text3 = "Date 01/12/1970 shall be changed"
text4 = "'Date' 99/99/1970 should not be changed"
text5 = "'Date' 13/13/1970 should not be changed"
text6 = "Date 1/2/1970 shall be changed"
text7 = "Dates 01/12/1972,04/11/2022 shall be changed"
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text1))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text2))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text3))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text4))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text5))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text6))
print(re.sub(pattern=r'(\d{1,2})/(\d{1,2})/(\d{4})', repl=_repl, string=text7))
And the output is:
Date 01/01/1901 and 01/01/3001 are outside range
Year 2000 should not be changed
Date 1970-12-01 shall be changed
'Date' 99/99/1970 should not be changed
'Date' 13/13/1970 should not be changed
Date 1970-02-01 shall be changed
Dates 1972-12-01,2022-11-04 shall be changed
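The question also asked to display the dates without leading zeros before converting them. A rough sketch of one way to do that follows. The names UNPADDED_RE and find_unpadded_dates are hypothetical, and the lookarounds (?<!\d) and (?!\d) are an added assumption: they keep the pattern from matching inside a longer run of digits.

```python
import re

# Hypothetical names for illustration. The lookarounds are an assumption:
# they stop the pattern from matching inside a longer digit sequence.
UNPADDED_RE = re.compile(r'(?<!\d)(\d{1,2})/(\d{1,2})/(\d{4})(?!\d)')

def find_unpadded_dates(text):
    """Return (line_number, matched_text) pairs for dates missing a leading zero."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in UNPADDED_RE.finditer(line):
            day, month, year = m.groups()
            in_range = 1970 <= int(year) <= 2022
            # A one-character day or month means the leading zero is missing.
            if in_range and (len(day) == 1 or len(month) == 1):
                hits.append((lineno, m.group()))
    return hits

sample = "Date 1/2/1970 here\nDate 01/12/1970 there\nDates 3/11/2022 and 04/5/1999"
for lineno, found in find_unpadded_dates(sample):
    print(lineno, found)
# prints:
# 1 1/2/1970
# 3 3/11/2022
# 3 04/5/1999
```

To run it on the real file, you could pass the result of open("1.txt", encoding="utf-8").read() instead of the sample string. This sketch does not check that day and month are within range; it only reports candidates for review.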