I am looking for names of books and authors in a bunch of texts, like:
my_text = """
My favorites books of all time are:
Harry potter by J. K. Rowling, Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times). That's it by the way.
"""
Right now I am using the following code to split the text on separators like this:
import re
pattern = r" *(.+) by ((?: ?\w+)+)"
matches = re.findall(pattern, my_text)
res = []
for match in matches:
    res.append((match[0], match[1]))
print(res) # [('Harry potter', 'J'), ('K. Rowling, Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery '), ("I read it many times). That's it", 'the way')]
Even if there are false positives (like "That's it by the way"), my main problem is that author names are cut off when they are written with initials, which is pretty common.
I can't figure out how to allow initials like "J. K. Rowling" (or the same without spaces around the dots, like "J.K.Rowling").
CodePudding user response:
Change the pattern to the following:
pattern = r" *(.+) by ((?: ?[A-Z].?)+ ?(?:[A-Z][a-z]+)+)"
To allow for initials in the author's name, the pattern needs a few modifications. The character class "[A-Z]" matches any uppercase letter, and the ".?" after it optionally matches one more character, which covers the dot of an initial (unescaped, "." matches any character; "\.?" would accept only a literal dot). The " ?" at the start of the group allows an optional space before each initial, and the "+" after the group repeats it for multiple initials. The final "(?:[A-Z][a-z]+)+" then requires a capitalized name after the initials.
When I tried your code with my pattern, I got:
('Harry potter', 'J. K. Rowling')
It seems to ignore the rest of the authors, but it works for authors with initials. Let me know if you want me to figure out how to make it work with both initials and non-initials, if that makes any sense.
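For what it's worth, the result above can be reproduced with a short sketch. It assumes the pattern is read with its "+" quantifiers in place and reuses my_text from the question; author_part, first_attempt, and accepted are just names picked for this example. Testing the author half on its own suggests why "Frank Herbert" is skipped: each "[A-Z].?" step consumes at most two characters, so a full first name never reaches the surname.

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by J. K. Rowling, Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times). That's it by the way.
"""

# Assumed reading of the pattern, with "+" quantifiers in place.
pattern = r" *(.+) by ((?: ?[A-Z].?)+ ?(?:[A-Z][a-z]+)+)"
first_attempt = re.findall(pattern, my_text)
print(first_attempt)  # [('Harry potter', 'J. K. Rowling')]

# Author half of the pattern, checked against a few names in isolation.
author_part = r"(?: ?[A-Z].?)+ ?(?:[A-Z][a-z]+)+"
names = ["J. K. Rowling", "J.K.Rowling", "Frank Herbert"]
accepted = {name: bool(re.match(author_part, name)) for name in names}
print(accepted)
```

Names made of initials plus a surname are accepted, with or without spaces around the dots, while a spelled-out first name is rejected.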
Here is my solution to the problem; it took a while:
import re
pattern = r" *(?:and )?(.+?) by ([A-Z](?:\.|\w)+(?: [A-Z](?:\.|\w)+)*)"
matches = re.finditer(pattern, my_text)
result = []
for match in matches:
    book_title = match.group(1)
    author = match.group(2)
    result.append((book_title, author))
print(result)
which will give:
[('Harry potter', 'J. K. Rowling'), (', Dune (first book)', 'Frank Herbert'), ('Le Petit Prince', 'Antoine')]
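The last result still cuts "Antoine de Saint Exupery" down to "Antoine", because every word of the author has to start with a capital letter, and the Dune title keeps a leading ", ". A possible refinement, as a rough sketch only: allow a few lowercase particles between capitalized words and strip stray separators from the title. The PARTICLES list, the NAME fragment, and the find_books helper are assumptions made up for this example, not part of the solution above, and the particle list would need extending for real data.

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by J. K. Rowling, Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times). That's it by the way.
"""

# Hypothetical list of lowercase name particles; extend as needed.
PARTICLES = r"(?:de|du|la|van|von)"
# One capitalized word or dotted initial(s), e.g. "J.", "J.K.Rowling", "Frank".
NAME = r"[A-Z](?:\.|\w)+"
# Same shape as the pattern above, with optional particles between name words.
PATTERN = re.compile(
    rf" *(?:and )?(.+?) by ({NAME}(?: (?:{PARTICLES} )*{NAME})*)"
)


def find_books(text):
    """Return (title, author) pairs, trimming separators left on the title."""
    pairs = []
    for match in PATTERN.finditer(text):
        title = match.group(1).strip(" ,;")
        pairs.append((title, match.group(2)))
    return pairs


print(find_books(my_text))
# [('Harry potter', 'J. K. Rowling'), ('Dune (first book)', 'Frank Herbert'),
#  ('Le Petit Prince', 'Antoine de Saint Exupery')]
```

Initials written without spaces, such as "J.K.Rowling", are matched as a single NAME token. One known limitation: because NAME accepts dots, an author at the very end of a sentence keeps the trailing period.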