Regex: find all sentences with a citation

Time:05-16

I've found this code to detect all citations in a text:

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'

It works well, but I need to find the whole sentence that contains each citation (from where it starts after a dot until the dot that ends it). So in this example:

"Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."

I'd like to get this: "In this line, actually, there is a citation (Author et al., 2022)."

How should I edit the above code to achieve this?

CodePudding user response:

You can use the following regular expression:

r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
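
A minimal sketch of how this pattern could be applied to the example from the question. It assumes the quantifiers after the character classes are `+`, and the variable names (`pattern`, `text`, `sentences`) are only illustrative:

```python
import re

# Pattern from the answer, split over two lines for readability.
# Assumption: the character classes are followed by "+" quantifiers.
pattern = re.compile(
    r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))"
    r"(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
)

text = (
    "Nothing is here. In this line, actually, there is a citation "
    "(Author et al., 2022). Once again, In this line there is nothing."
)

# The whole match is: leading whitespace + text before the citation + the
# parenthesized citation, so stripping the match leaves the sentence.
sentences = [m.group(0).strip() for m in pattern.finditer(text)]

for s in sentences:
    print(s)
```

Note that this only covers citations written entirely inside parentheses, and only sentences in which the citation comes last.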

CodePudding user response:

You need to solve the problem in two steps: a) break the text into sentences, b) detect sentences with a citation. Sentence tokenization is non-trivial to do right, so use a library to do it. For example:

>>> import nltk
>>> text = "Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."
>>> sentences = nltk.sent_tokenize(text)
>>> print(sentences)
['Nothing is here.', 'In this line, actually, there is a citation (Author et al., 2022).', 'Once again, In this line there is nothing.']

Then, using your definitions:

>>> import re
>>> citation = fr"{author}{additional}*{year}"
>>> for s in sentences:
...     if re.search(citation, s):
...         print(s)
...
In this line, actually, there is a citation (Author et al., 2022).

PS. If you've never used nltk before, you'll need to do a one-time download for the sentence tokenizer. You'll see an error message telling you to run this. Do it once and you're done forever.

nltk.download('punkt')
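
The two steps can also be wrapped in a small helper. This is only a sketch: `sentences_with_citation` and `naive_split` are hypothetical names, and `naive_split` is a crude stand-in so the example runs with the standard library alone. The citation pattern is the one from the question, assuming `+` quantifiers in `author` and `page_num`.

```python
import re

# Citation pattern from the question (assumed "+" quantifiers written out).
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
CITATION = re.compile(fr"{author}{additional}*{year}")


def naive_split(text):
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace.

    Not a replacement for a real sentence tokenizer; it will split on
    abbreviations such as "e.g. " as well.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


def sentences_with_citation(text, split=naive_split):
    """Step a) split the text into sentences, step b) keep those with a citation."""
    return [s for s in split(text) if CITATION.search(s)]


text = (
    "Nothing is here. In this line, actually, there is a citation "
    "(Author et al., 2022). Once again, In this line there is nothing."
)
print(sentences_with_citation(text))
```

With nltk installed, passing `split=nltk.sent_tokenize` swaps in the real tokenizer without changing the rest of the function.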

CodePudding user response:

Try with this one:

(?<=\. )[^(\.]+\(([^)]+)\).*?\. 

Explanation:

  • (?<=\. ): lookbehind that checks for a preceding dot and space
  • [^(\.]+ : one or more characters other than an opening parenthesis or a dot
  • \( : opening parenthesis
  • ([^)]+) : one or more characters other than a closing parenthesis
  • \) : closing parenthesis
  • .*? : lazy, possibly empty run of characters
  • \. : dot followed by a space

Corner cases that this solution is not able to address:

  • <space><dot><word> (like .dotnet) inside the sentence, before the parenthesis: the <space><dot> is always treated as the beginning of a sentence.
  • <word><dot><space> (like e.g.) inside the sentence, after the parenthesis: the <dot><space> is always treated as the end of the sentence.

One way to address these corner cases is to preprocess the raw text first, transforming or removing any abbreviations it contains.
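
A minimal sketch of such a preprocessing step, under some assumptions: the abbreviation list is hypothetical and would need extending for a real corpus, the placeholder character is assumed not to occur in the input, and the pattern is the one explained above with `+` quantifiers.

```python
import re

# Pattern from this answer (dot + space marks both sentence boundaries).
SENTENCE_WITH_PARENS = re.compile(r"(?<=\. )[^(\.]+\(([^)]+)\).*?\. ")

# Hypothetical abbreviation list; extend it for your own text.
ABBREVIATIONS = ["e.g.", "i.e."]
# ONE DOT LEADER, assumed to be absent from the raw text.
PLACEHOLDER = "\u2024"


def protect(text):
    """Replace the dots inside known abbreviations with the placeholder."""
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", PLACEHOLDER))
    return text


def restore(text):
    """Turn placeholders back into ordinary dots."""
    return text.replace(PLACEHOLDER, ".")


def find_sentences(text):
    """Match on the protected text, then restore the original dots."""
    protected = protect(text)
    return [
        restore(m.group(0)).strip()
        for m in SENTENCE_WITH_PARENS.finditer(protected)
    ]


text = (
    "Nothing is here. This holds (Author et al., 2022), e.g. for small "
    "samples. Nothing again."
)
raw = [m.group(0).strip() for m in SENTENCE_WITH_PARENS.finditer(text)]
print(raw)                   # sentence cut short at "e.g."
print(find_sentences(text))  # full sentence
```

Because the pattern itself is unchanged, the first sentence of a text (no preceding dot and space) and the last one (no trailing space) are still not matched.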
