Split string in Python using two conditions (one delimiter and one "contain")

Considering the following string:

my_text = """
    My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

I want to extract the book titles and authors, so the expected output is:

output = [
    ['Harry Potter', 'JK Rowling'],
    ['Dune (first book)', 'Frank Herbert'],
    ['and Le Petit Prince', 'Antoine de Saint Exupery']
]

The basic two-step approach would be:

1. Use re.split to split on a set of non-alphanumeric characters (parentheses, comma, semicolon, newline, etc.) to extract sentences, or at least pieces of sentences.
2. Keep only the strings containing 'by' and split again on 'by' to separate title and author.

While this method would cover 90% of cases, the main issue is the handling of parentheses (): I want to keep them in book titles (like Dune), but use them as delimiters after authors (like Saint Exupery).
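A minimal sketch of that naive two-step approach (the names `pieces` and `pairs` are only illustrative) shows where it breaks: splitting on parentheses cuts the Dune line apart, so its title comes back empty.

```python
import re

my_text = """
    My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

# Step 1: split on punctuation, parentheses included.
pieces = re.split(r"[(),;\n]", my_text)

# Step 2: keep the pieces containing " by " and split them into title and author.
pairs = [
    [part.strip() for part in piece.split(" by ", 1)]
    for piece in pieces
    if " by " in piece
]

print(pairs)
# [['Harry potter', 'JK Rowling'], ['', 'Frank Herbert'], ['and Le Petit Prince', 'Antoine de Saint Exupery']]
```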

I suspect a powerful regex would cover both, but I'm not sure how exactly.

CodePudding user response：

I'm not sure if that is "a powerful regex", but it does the job:

import re

text = """
My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

pattern = r" *(.+) by ((?: ?\w+)+)"

matches = re.findall(pattern, text)

res = []
for match in matches:
    res.append((match[0], match[1]))

print(res) # [('Harry potter', 'JK Rowling'), ('Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery')]
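
For readers who want to see what each part of such a pattern does, here is the same idea written with re.VERBOSE. This is only an annotated sketch, assuming the quantifiers are `+`; `[ ]` stands for a literal space because verbose mode ignores bare whitespace.

```python
import re

text = """
My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

pattern = re.compile(
    r"""
    [ ]*             # optional leading indentation, kept out of the title
    (.+)             # group 1: the title, greedy, may contain parentheses
    [ ]by[ ]         # the literal separator between title and author
    ((?:[ ]?\w+)+)   # group 2: author words, ends at the first punctuation
    """,
    re.VERBOSE,
)

# findall returns one (title, author) tuple per match.
res = pattern.findall(text)
print(res)
```

Because group 2 only accepts word characters separated by single spaces, it stops before " (I read it many times)", while the parentheses in the Dune title are still allowed by `.+`.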

CodePudding user response：

After splitting the text into lines with:

lines = my_text.splitlines()

You could then use a regex such as ([A-Z0-9].*?) by ([a-zA-Z' -]+) on each line.

This matches a capital letter (or a digit) followed by any characters until " by " is encountered. Requiring a capital letter or digit avoids matching the "and " at the beginning of the last line, as I think most book titles start with either a number or a capital letter.

After the " by ", the regex matches a run of letters, apostrophes, spaces and dashes, which I guess should cover most English names. Because spaces are allowed, the captured author can end with a trailing space, so strip the result. Feel free to add more characters, such as accented letters or other alphabets.
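
Putting those pieces together, a minimal sketch could look like this (the names `pattern` and `pairs` are mine, and the author is stripped because the character class can capture a trailing space):

```python
import re

my_text = """
    My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

# Title: starts at the first capital letter or digit, lazily up to " by ".
# Author: letters, apostrophes, spaces and dashes only.
pattern = re.compile(r"([A-Z0-9].*?) by ([a-zA-Z' -]+)")

pairs = []
for line in my_text.splitlines():
    match = pattern.search(line)
    if match:
        pairs.append([match.group(1), match.group(2).strip()])

print(pairs)
# [['Harry potter', 'JK Rowling'], ['Dune (first book)', 'Frank Herbert'], ['Le Petit Prince', 'Antoine de Saint Exupery']]
```

Note that this variant drops the leading "and " from the last title, which differs slightly from the expected output in the question.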

CodePudding user response：

You can use the re module to build a regular expression that produces the desired answer. I wrote it as easy-to-understand code; you can simplify it further if you like.

import re

my_text = """
    My favorites books of all time are:
    Harry potter by JK Rowling,
    Dune (first book) by Frank Herbert;
    and Le Petit Prince by Antoine de Saint Exupery (I read it many times).
"""

output = []
for sentence in re.split(r'[;,\n]', my_text):
    match = re.search(r'(.*)\sby\s(.*)', sentence)
    if match:
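        # Drop a trailing parenthetical such as "(I read it many times)".
        match_group_2 = re.sub(r'\s*\(.*\)', '', match.group(2))
        # Strip whitespace and the sentence-final period left after the author.
        output.append([match.group(1).strip(), match_group_2.strip(' .')])

print(output)

Output:

[
    ['Harry potter', 'JK Rowling'],
    ['Dune (first book)', 'Frank Herbert'],
    ['and Le Petit Prince', 'Antoine de Saint Exupery']
]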

thank you.