Regex: find all sentences with a citation

I've found this code to detect all citations in a text:

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'

It's actually working great, but I need to find the whole sentence that contains the citation (from where it starts after a dot until it ends at the next dot). So in this example:

"Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."

I'd like to get this: "In this line, actually, there is a citation (Author et al., 2022)."

How should I edit the above code to achieve this?

CodePudding user response：

You can use the following regular expression:

r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"


CodePudding user response：

You need to solve the problem in two steps: a) break the text into sentences, b) detect sentences with a citation. Sentence tokenization is non-trivial to do right, so use a library to do it. For example:

>>> import nltk
>>> text = "Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."
>>> sentences = nltk.sent_tokenize(text)
>>> print(sentences)
['Nothing is here.', 'In this line, actually, there is a citation (Author et al., 2022).', 'Once again, In this line there is nothing.']

Then, using your definitions:

>>> import re
>>> citation = fr"{author}{additional}*{year}"
>>> for s in sentences:
...     if re.search(citation, s):
...         print(s)
...
In this line, actually, there is a citation (Author et al., 2022).

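PS. If you've never used nltk before, you'll need to do a one-time download for the sentence tokenizer. You'll see an error message telling you to run this; just do it once and you're done forever. (Newer NLTK releases may name the resource 'punkt_tab' instead; use whichever name the error message shows.)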

nltk.download('punkt')
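
The two steps can also be wrapped into one helper. The following is only a minimal sketch: it rebuilds the citation pattern from the definitions in the question, and `naive_sent_tokenize` and `sentences_with_citation` are hypothetical names introduced here. The naive splitter is a stand-in for `nltk.sent_tokenize` so that the example runs without downloading any NLTK data; it is not a robust tokenizer.

```python
import re

# Citation pattern rebuilt from the definitions in the question.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = fr"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
citation = re.compile(fr"\b(?!(?:Although|Also)\b){author}{additional}*{year}")


def naive_sent_tokenize(text):
    """Hypothetical stand-in for nltk.sent_tokenize.

    Splits on whitespace that directly follows '.', '!' or '?'.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_with_citation(text, tokenize=naive_sent_tokenize):
    """Step a) split into sentences, step b) keep those containing a citation."""
    return [s for s in tokenize(text) if citation.search(s)]


text = (
    "Nothing is here. In this line, actually, there is a citation "
    "(Author et al., 2022). Once again, In this line there is nothing."
)
print(sentences_with_citation(text))
```

For real text, pass the library tokenizer instead, e.g. `sentences_with_citation(text, tokenize=nltk.sent_tokenize)`.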

CodePudding user response：

Try with this one:

(?<=\. )[^(\.]+\(([^)]+)\).*?\.

Explanation:

(?<=\. ) : lookbehind that checks that the match is preceded by a dot and a space
[^(\.]+ : one or more characters other than an open parenthesis or a dot
\( : open parenthesis
([^)]+) : one or more characters other than a closed parenthesis
\) : closed parenthesis
.*? : optional lazy run of characters
\. : a dot (the end of the sentence)
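
As a quick check, the pattern described by this breakdown can be run in Python against the example from the question. This is only a sketch; the name `SENTENCE_WITH_PARENS` is made up for the illustration.

```python
import re

# Pattern as described in the breakdown above.
SENTENCE_WITH_PARENS = re.compile(r"(?<=\. )[^(\.]+\(([^)]+)\).*?\.")

text = (
    "Nothing is here. In this line, actually, there is a citation "
    "(Author et al., 2022). Once again, In this line there is nothing."
)

# finditer is used instead of findall: because the pattern has a capture
# group, findall would return only the text inside the parentheses.
matches = [m.group(0) for m in SENTENCE_WITH_PARENS.finditer(text)]
print(matches)
```

Note that any parenthesized text qualifies, not only citations, and that a sentence at the very start of the text is not matched, because the lookbehind requires a preceding dot and space.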

Corner cases that this solution is not able to address:

<space><dot><word> (like .dotnet) is an inner word before the parenthesis: the dot will always be treated as a sentence boundary.
<word><dot><space> (like e.g.) is an inner word after the parenthesis: <dot><space> will always be treated as the end of the sentence.

One way to address these corner cases is to preprocess the raw text first, transforming or removing any abbreviations it contains.
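
The preprocessing idea can be sketched as follows, under loud assumptions: `ABBREVIATIONS` is a hypothetical and deliberately short list, `PLACEHOLDER` is a character assumed never to occur in the input, and the masking is plain substring replacement, so it is a starting point rather than a complete solution.

```python
import re

# Hypothetical, deliberately short list; extend it for real text.
ABBREVIATIONS = ("e.g.", "i.e.", "et al.")
# Stand-in for a dot inside an abbreviation; assumed absent from the input.
PLACEHOLDER = "\u2024"  # ONE DOT LEADER

SENTENCE_WITH_PARENS = re.compile(r"(?<=\. )[^(\.]+\(([^)]+)\).*?\.")


def mask_abbreviations(text):
    """Replace the dots of known abbreviations so they no longer look like sentence ends."""
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", PLACEHOLDER))
    return text


def sentences_with_parens(text):
    """Mask abbreviations, apply the regex, then restore the original dots."""
    masked = mask_abbreviations(text)
    found = (m.group(0) for m in SENTENCE_WITH_PARENS.finditer(masked))
    return [s.replace(PLACEHOLDER, ".") for s in found]


text = (
    "Nothing is here. Some tools, e.g. parsers, are cited "
    "(Author et al., 2022) in many fields, e.g. biology. Nothing more."
)
print(sentences_with_parens(text))
```

Without the masking step, the same pattern starts the match after the first "e.g. " and stops at the "e." of the second one, which is exactly the pair of corner cases listed above.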
