I have a list of titles that combine dates and descriptions, and I need to reduce it to just a list of dates. Some example titles:
1/16 Stories of Time
5/18 Cock'a'doodle'do
However, some people are really bad at typing and have forgotten the spaces between the dates and the rest of the title. I need to remove everything except for numbers and the slashes between them. Using any method, but preferably regex, is there a simple way to do this? For the record, I do understand how to split and recompile the list for any method that would work on a single string.
CodePudding user response:
You can import string to get easy access to a string of all digits, add the slash to it, and then compare your date string against that to drop any character from the date string that's not in there:
import string

date_string = "5/18 Cock'a'doodle'do"
# Characters to keep: the digits 0-9 plus the slash.
allowed = string.digits + "/"
for character in date_string:
    if character not in allowed:
        date_string = date_string.replace(character, "")
This will convert the date_string 5/18 Cock'a'doodle'do to just 5/18 without using regex at all.
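The same character filter can also be written as a single pass with str.join. This is only a sketch: the helper name keep_digits_and_slashes is made up for illustration. It also shows a limitation of filtering by character, namely that digits and slashes inside the title itself survive too.

```python
import string

# Characters to keep: the digits 0-9 plus the slash.
ALLOWED = set(string.digits + "/")

def keep_digits_and_slashes(title):
    """Drop every character that is not a digit or a slash (hypothetical helper)."""
    return "".join(ch for ch in title if ch in ALLOWED)

print(keep_digits_and_slashes("5/18 Cock'a'doodle'do"))  # 5/18
# Digits and slashes in the title are kept as well:
print(keep_digits_and_slashes("4/5 The 39 Steps / The Thirty-Nine Steps"))  # 4/539/
```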
CodePudding user response:
You're thinking about this backwards. If you want to extract the date at the start of a line, do that instead of trying to get rid of everything else.
You can use a regex like this: ^\d{1,2}/\d{1,2}
which means:
^ : start of line
\d : digit
{1,2} : repeated one or two times
For example:
import re

lines = [
    '1/16 Stories of Time',
    "5/18 Cock'a'doodle'do",
    '6/22Bible']
for line in lines:
    match = re.match(r'^\d{1,2}/\d{1,2}', line)
    if match:
        print(match.group(0))
Output:
1/16
5/18
6/22
(Note that re.match always starts matching from the start of the string, so the ^ is redundant here.)
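To illustrate that point, here is a small comparison of re.match and re.search using the same pattern without the leading ^. The sample string 'Bible study 6/22' is made up for the demonstration.

```python
import re

pattern = r'\d{1,2}/\d{1,2}'  # same pattern, without the leading ^

# re.match only tries to match at position 0 of the string.
print(re.match(pattern, '6/22Bible').group(0))  # 6/22
print(re.match(pattern, 'Bible study 6/22'))    # None

# re.search scans the whole string, so here the ^ does make a difference.
print(re.search(pattern, 'Bible study 6/22').group(0))  # 6/22
print(re.search('^' + pattern, 'Bible study 6/22'))     # None
```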
This is more robust against titles containing numbers and slashes, like say, 4/5 The 39 Steps / The Thirty-Nine Steps -> 4/5.
However, you'll have a problem if someone forgot the space before a title that starts with a number, like say, 7/8100 Years of Solitude -> 7/81.
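One way to reduce (not eliminate) that problem is to only accept plausible values. The sketch below assumes the dates are month/day, which the question doesn't actually state, and the names DATE_RE and extract_date are made up for illustration. Since 81 is not a valid day, the regex falls back to 7/8. A title like 1/1100 Years would still come out as 1/11, so the missing-space case stays ambiguous in general.

```python
import re

# Assumption: dates are month/day. Month is 1-12, day is 1-31.
# Two-digit alternatives come first so that 6/22 is not cut short to 6/2.
DATE_RE = re.compile(r'(1[0-2]|0?[1-9])/(3[01]|[12]\d|0?[1-9])')

def extract_date(title):
    """Return the leading month/day of a title, or None if there is none."""
    match = DATE_RE.match(title)
    return match.group(0) if match else None

print(extract_date('6/22Bible'))                 # 6/22
print(extract_date('7/8100 Years of Solitude'))  # 7/8
print(extract_date('Untitled'))                  # None
```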