I would really appreciate some assistance...
I'm trying to retrieve an ISBN number (13 digits) from pages, but the number is set in so many different formats that I can't retrieve all the different instances:
ISBN-13: 978 1 4310 0862 9
ISBN: 9781431008629
ISBN9781431008629
ISBN 9-78-1431-008-629
ISBN: 9781431008629 more text of the number
isbn : 9781431008629
My output should be: ISBN: 9781431008629
myISBN = re.findall("ISBN" r'[\w\W]{1,17}', text)
myISBN = myISBN[0]
print(myISBN)
I appreciate your time
CodePudding user response:
You can use
(?i)ISBN(?:-13)?\D*(\d(?:\W*\d){12})
See the regex demo. Then remove all non-digits from the Group 1 value.
Regex details:
(?i) - case-insensitive modifier, same as re.I
ISBN - an ISBN string
(?:-13)? - an optional -13 string
\D* - zero or more non-digits
(\d(?:\W*\d){12}) - Group 1: a digit, followed by twelve occurrences of zero or more non-word chars and then a digit.
See the Python demo:
import re
texts = ['ISBN-13: 978 1 4310 0862 9',
'ISBN: 9781431008629',
'ISBN9781431008629',
'ISBN 9-78-1431-008-629',
'ISBN: 9781431008629 more text of the number',
'isbn : 9781431008629']
rx = re.compile(r'ISBN(?:-13)?\D*(\d(?:\W*\d){12})', re.I)
for text in texts:
    m = rx.search(text)
    if m:
        print(text, '=> ISBN:', ''.join([d for d in m.group(1) if d.isdigit()]))
Output:
ISBN-13: 978 1 4310 0862 9 => ISBN: 9781431008629
ISBN: 9781431008629 => ISBN: 9781431008629
ISBN9781431008629 => ISBN: 9781431008629
ISBN 9-78-1431-008-629 => ISBN: 9781431008629
ISBN: 9781431008629 more text of the number => ISBN: 9781431008629
isbn : 9781431008629 => ISBN: 9781431008629
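If you need this in more than one place, the two steps (match, then strip non-digits from Group 1) could be wrapped in a small helper. This is only a sketch: the names ISBN_RX and extract_isbn13 are made up for illustration, and it uses re.sub(r'\D', '', ...) instead of the list comprehension above, which should give the same result.

```python
import re

# Same pattern as above; re.I makes "isbn" match as well.
ISBN_RX = re.compile(r'ISBN(?:-13)?\D*(\d(?:\W*\d){12})', re.I)

def extract_isbn13(text):
    """Return 'ISBN: <13 digits>' for the first match in text, or None.

    Hypothetical helper name, not part of any library.
    """
    m = ISBN_RX.search(text)
    if m is None:
        return None
    # Group 1 may still contain spaces or hyphens, so drop every non-digit.
    return 'ISBN: ' + re.sub(r'\D', '', m.group(1))

print(extract_isbn13('ISBN 9-78-1431-008-629'))
```

It returns None when no ISBN is found, so the caller can decide how to handle pages without one.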
CodePudding user response:
import re
text = "ISBN-13: 978 1 4310 0862 9" \
"ISBN: 9781431008629" \
"ISBN9781431008629" \
"ISBN 9-78-1431-008-629" \
"ISBN: 9781431008629" \
"isbn : 9781431008629 "
myISBN = re.findall(r"ISBN:\s\d{13}", text)
print(myISBN)
Output:
['ISBN: 9781431008629', 'ISBN: 9781431008629']
\s: one whitespace character.
\d{13}: exactly 13 digits.
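To see which of the formats from the question this pattern covers, you could run it against each sample on its own instead of one concatenated string. A quick sketch (the samples list and the matched variable are just names chosen here):

```python
import re

# The six sample formats from the question, one string per format.
samples = [
    'ISBN-13: 978 1 4310 0862 9',
    'ISBN: 9781431008629',
    'ISBN9781431008629',
    'ISBN 9-78-1431-008-629',
    'ISBN: 9781431008629 more text of the number',
    'isbn : 9781431008629',
]

pattern = re.compile(r"ISBN:\s\d{13}")

# Keep only the samples in which the pattern finds a match.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```

Only the samples written as upper-case "ISBN:" followed by one whitespace character and 13 unbroken digits end up in matched; the spaced, hyphenated, colon-less and lower-case variants do not.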
CodePudding user response:
I'd split the problem into two steps: first extract the potential ISBN, then check that it is valid (13 digits):
import re
text = """\
ISBN-13: 978 1 4310 0862 9
ISBN: 9781431008629
ISBN9781431008629
ISBN 9-78-1431-008-629
ISBN: 9781431008629 more text of the number
isbn : 9781431008629"""
pat1 = re.compile(r"(?i)ISBN(?:-13)?\s*:?([ \d-]+)")
pat2 = re.compile(r"\d+")
for m in pat1.findall(text):
    numbers = "".join(pat2.findall(m))
    if len(numbers) == 13:
        print("ISBN:", numbers)
Prints:
ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
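If the second step should be stricter than a length check, one option is to also verify the ISBN-13 check digit: the digits are weighted alternately by 1 and 3, and the total must be divisible by 10. This goes beyond what the question asked for, so treat it as an optional sketch; is_valid_isbn13 is a hypothetical helper name.

```python
def is_valid_isbn13(numbers):
    """Return True if numbers is a 13-digit string with a valid check digit."""
    if len(numbers) != 13 or not numbers.isdigit():
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(numbers))
    return total % 10 == 0

print(is_valid_isbn13("9781431008629"))
```

For the number in the question the weighted sum is 100, so it passes; changing the last digit makes the check fail.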