How do you find all instances of an ISBN number using Python regex

I would really appreciate some assistance...

I'm trying to retrieve an ISBN number (13 digits) from pages, but the number is written in so many different formats that I can't retrieve all the different instances:

ISBN-13: 978 1 4310 0862 9
ISBN: 9781431008629
ISBN9781431008629
ISBN 9-78-1431-008-629
ISBN: 9781431008629 more text of the number
isbn : 9781431008629

My output should be: ISBN: 9781431008629

myISBN = re.findall("ISBN" r'[\w\W]{1,17}', text)
myISBN = myISBN[0]
print(myISBN)

I appreciate your time

CodePudding user response：

You can use

(?i)ISBN(?:-13)?\D*(\d(?:\W*\d){12})

Then remove all non-digits from the Group 1 value.

Regex details:

(?i) - case insensitive modifier, same as re.I
ISBN - an ISBN string
(?:-13)? - an optional -13 string
\D* - zero or more non-digits
(\d(?:\W*\d){12}) - Group 1: a digit, then twelve repetitions of zero or more non-word characters followed by a digit (13 digits in total).

See the Python demo:

import re
texts = ['ISBN-13: 978 1 4310 0862 9',
    'ISBN: 9781431008629',
    'ISBN9781431008629',
    'ISBN 9-78-1431-008-629',
    'ISBN: 9781431008629 more text of the number',
    'isbn : 9781431008629']
rx = re.compile(r'ISBN(?:-13)?\D*(\d(?:\W*\d){12})', re.I)
for text in texts:
    m = rx.search(text)
    if m:
        print(text, '=> ISBN:', ''.join([d for d in m.group(1) if d.isdigit()]))

Output:

ISBN-13: 978 1 4310 0862 9 => ISBN: 9781431008629
ISBN: 9781431008629 => ISBN: 9781431008629
ISBN9781431008629 => ISBN: 9781431008629
ISBN 9-78-1431-008-629 => ISBN: 9781431008629
ISBN: 9781431008629 more text of the number => ISBN: 9781431008629
isbn : 9781431008629 => ISBN: 9781431008629
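
Since the question asks for all instances on a page, the same pattern can also be run over a single multi-line string with finditer. This is only a minimal sketch and not part of the answer above: the function name extract_isbns and the sample page text are made up for illustration.

```python
import re

# Same pattern as in the answer above: "ISBN", an optional "-13", any
# non-digits, then 13 digits that may be separated by non-word characters.
ISBN_RX = re.compile(r'ISBN(?:-13)?\D*(\d(?:\W*\d){12})', re.I)

def extract_isbns(page):
    """Return every ISBN-13 candidate in `page` as a bare 13-digit string."""
    # re.sub strips spaces, hyphens and anything else that is not a digit.
    return [re.sub(r'\D', '', m.group(1)) for m in ISBN_RX.finditer(page)]

# Hypothetical page text mixing three of the formats from the question.
page = ("ISBN-13: 978 1 4310 0862 9\n"
        "some text\n"
        "isbn : 9781431008629\n"
        "ISBN 9-78-1431-008-629")

for isbn in extract_isbns(page):
    print("ISBN:", isbn)
```

With this sample text the loop prints ISBN: 9781431008629 three times. Note that \D* and \W* also match line breaks, so on real pages a match could span lines; whether that is acceptable depends on the input.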

CodePudding user response：

import re

text = "ISBN-13: 978 1 4310 0862 9" \
    "ISBN: 9781431008629" \
    "ISBN9781431008629" \
    "ISBN 9-78-1431-008-629" \
    "ISBN: 9781431008629" \
    "isbn : 9781431008629 "

myISBN = re.findall(r"ISBN:\s\d{13}", text)
print(myISBN)

Output:

['ISBN: 9781431008629', 'ISBN: 9781431008629']

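\s - exactly one whitespace character.
\d{13} - exactly 13 digits.

Note that this pattern only matches the "ISBN: " prefix followed by 13 contiguous digits, so the other formats are skipped.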

CodePudding user response：

I'd split the problem into two steps: first extract the potential ISBN, then check that it is correct (13 digits):

import re

text = """\
ISBN-13: 978 1 4310 0862 9
ISBN: 9781431008629
ISBN9781431008629
ISBN 9-78-1431-008-629
ISBN: 9781431008629 more text of the number
isbn : 9781431008629"""

pat1 = re.compile(r"(?i)ISBN(?:-13)?\s*:?([ \d-]+)")
pat2 = re.compile(r"\d+")

for m in pat1.findall(text):
    numbers = "".join(pat2.findall(m))
    if len(numbers) == 13:
        print("ISBN:", numbers)

Prints:

ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
ISBN: 9781431008629
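
If counting 13 digits is not strict enough, the second step could also verify the ISBN-13 check digit. The sketch below goes beyond the answer above and is only an illustration: the helper name is_valid_isbn13 is made up, and it relies on the standard ISBN-13 rule that the digits, weighted alternately by 1 and 3 from the left, must sum to a multiple of 10.

```python
def is_valid_isbn13(digits):
    """Check the ISBN-13 check digit of a 13-character digit string."""
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... from the left; the weighted sum
    # of all 13 digits must be divisible by 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0

print(is_valid_isbn13("9781431008629"))  # True: the weighted sum is 100
print(is_valid_isbn13("9781431008628"))  # False: last digit changed, sum is 99
```

In the loop above, the condition len(numbers) == 13 could then be replaced by is_valid_isbn13(numbers).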