Consider the following string:
my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
I want to extract the book titles and authors, so the expected output is:
output = [
['Harry Potter', 'JK Rowling'],
['Dune (first book)', 'Frank Herbert'],
['and Le Petit Prince', 'Antoine de Saint Exupery']
]
The basic 2-step approach would be:
- Use re.split to split on a list of delimiter characters ((),;\n etc.) to extract sentences, or at least pieces of sentences.
- Keep only the strings containing 'by', then split again on 'by' to separate title and author.
While this method would cover 90% of cases, the main issue is the handling of brackets (): I want to keep them in book titles (like Dune), but use them as delimiters after authors (like Saint Exupery).
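For illustration, here is a minimal sketch of that two-step approach. The variable names pieces and naive_output are hypothetical, chosen only for this example. Because the brackets are treated as delimiters everywhere, the Dune title gets cut away:

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

# Step 1: split on every delimiter character, brackets included.
pieces = re.split(r"[(),;\n]", my_text)

# Step 2: keep the pieces containing " by " and split them into title and author.
naive_output = []
for piece in pieces:
    if " by " in piece:
        title, author = piece.split(" by ", 1)
        naive_output.append([title.strip(), author.strip()])

print(naive_output)
# [['Harry potter', 'JK Rowling'], ['', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]
```

The second entry has an empty title: "Dune (first book)" was already split at the brackets before the " by " step ran.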
I suspect a powerful regex would cover both, but I'm not sure exactly how.
CodePudding user response:
I'm not sure if that is "a powerful regex", but it does the job:
import re
text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
pattern = r" *(.+) by ((?: ?\w+)+)"
matches = re.findall(pattern, text)
res = []
for match in matches:
    res.append((match[0], match[1]))
print(res) # [('Harry potter', 'JK Rowling'), ('Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery')]
CodePudding user response:
After splitting the text into lines with:
lines = my_text.splitlines()
You could then use a regex such as ([A-Z0-9].*?) by ([a-zA-Z' -]+) on each line.
This will match a capital letter (or a digit) followed by any characters until " by " is encountered. The capital letter or digit avoids matching the "and " at the beginning of the last line, as I think most book titles start with either a number or a capital letter.
After the " by ", the regex matches a run of letters, apostrophes, spaces and dashes, which should cover most English names. Feel free to add more characters, such as accented letters or other alphabets.
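Put together, this could look like the sketch below. It assumes the author character class ends with a + quantifier; the names pattern and results are only illustrative. Since that class allows spaces, the captured author can end with a stray space, so both groups are stripped:

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

# Title: starts at a capital letter or digit, lazily extended up to " by ".
# Author: a run of letters, apostrophes, spaces and dashes.
pattern = re.compile(r"([A-Z0-9].*?) by ([a-zA-Z' -]+)")

results = []
for line in my_text.splitlines():
    match = pattern.search(line)
    if match:
        title, author = match.groups()
        # The author class includes spaces, so trim any trailing whitespace.
        results.append([title.strip(), author.strip()])

print(results)
# [['Harry potter', 'JK Rowling'], ['Dune (first book)', 'Frank Herbert'], ['Le Petit Prince', 'Antoine de Saint Exupery']]
```

Note that the last title comes out as 'Le Petit Prince', without the leading "and ", which differs from the expected output in the question.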
CodePudding user response:
You can use the 're' module to build a regular expression that produces the desired result. I kept the code easy to follow; you can simplify it further if you like.
import re
my_text = """
My favorites books of all time are:
Harry potter by JK Rowling,
Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""
output = []
for sentence in re.split(r'[;,\n]', my_text):
    match = re.search(r'(.*)\sby\s(.*)', sentence)
    if match:
        # Drop a trailing parenthetical remark, plus the period that may follow it
        match_group_2 = re.sub(r'\s*\(.*\)\.?', '', match.group(2))
        output.append([match.group(1).strip(), match_group_2.strip()])
print(output)
[
['Harry potter', 'JK Rowling'],
['Dune (first book)', 'Frank Herbert'],
['and Le Petit Prince', 'Antoine de Saint Exupery']
]
thank you.