I've found this code to detect all citations in a text:
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'
It's actually working great, but I need to find the whole sentence that contains each citation (from where it starts after a dot until it ends at the next dot). So in this example:
"Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."
I'd like to get this: "In this line, actually, there is a citation (Author et al., 2022)."
How should I edit the above code to achieve this?
CodePudding user response:
You can use the following regular expression:
r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
Proof here.
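To make that concrete, here is a minimal sketch that runs the pattern on the sample text from the question. It assumes the quantifiers after each character class are `+`; the names `pattern` and `text` are only for illustration. The match starts with the whitespace consumed by the leading `\s*`, so the result is stripped.

```python
import re

# Pattern from the answer above, assuming `+` after each character class.
pattern = re.compile(
    r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))"
    r"(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
)

text = ("Nothing is here. In this line, actually, there is a citation "
        "(Author et al., 2022). Once again, In this line there is nothing.")

match = pattern.search(text)
# Group 0 begins with the whitespace matched by `\s*`, so strip it.
sentence = match.group(0).strip() if match else None
print(sentence)
```

Note that this only handles citations that sit in parentheses at the very end of a sentence, since the match stops right after the closing parenthesis and the optional dot.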
CodePudding user response:
You need to solve the problem in two steps: a) break the text into sentences, b) detect sentences with a citation. Sentence tokenization is non-trivial to do right, so use a library to do it. For example:
>>> import nltk
>>> text = "Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."
>>> sentences = nltk.sent_tokenize(text)
>>> print(sentences)
['Nothing is here.', 'In this line, actually, there is a citation (Author et al., 2022).', 'Once again, In this line there is nothing.']
Then, using your definitions:
>>> import re
>>> citation = fr"{author}{additional}*{year}"
>>> for s in sentences:
...     if re.search(citation, s):
...         print(s)
...
In this line, actually, there is a citation (Author et al., 2022).
PS. If you've never used nltk before, you'll need to do a one-time download for the sentence tokenizer. You'll see an error message telling you what to run; do it once and you're done.
nltk.download('punkt')
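If you want to try the two-step idea before installing nltk, here is a rough sketch with a deliberately naive splitter swapped in for `nltk.sent_tokenize`. The helper names `naive_split` and `sentences_with_citation` are made up for this example, and the citation pattern assumes the quantifiers in the question's definitions are `+`.

```python
import re

# Citation pattern built from the question's definitions (quantifiers assumed to be `+`).
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = "(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
citation = re.compile(fr"{author}{additional}*{year}")


def naive_split(text):
    """Hypothetical stand-in for a real tokenizer: split after ., ! or ? plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def sentences_with_citation(text, split=naive_split):
    """Step a) split into sentences, step b) keep those containing a citation."""
    return [s for s in split(text) if citation.search(s)]


text = ("Nothing is here. In this line, actually, there is a citation "
        "(Author et al., 2022). Once again, In this line there is nothing.")
print(sentences_with_citation(text))
```

The naive splitter would cut a sentence like "Author et al. (2022) showed ..." right after "al.", which is exactly why a proper tokenizer is the better choice. Since `split` is a parameter, you can pass `nltk.sent_tokenize` in its place.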
CodePudding user response:
Try with this one:
(?<=\. )[^(\.]+\(([^)]+)\).*?\.
Explanation:
(?<=\. ) : lookbehind that checks for a preceding dot and space
[^(\.]+ : one or more characters other than open parentheses and dots
\( : open parenthesis
([^)]+) : one or more characters other than a closed parenthesis
\) : closed parenthesis
.*? : optional lazy run of characters
\. : the dot that ends the sentence
Corner cases that this solution is not able to address:
<space><dot><word> (like .dotnet) is an inner word before the parenthesis: it will always treat <space><dot> as the beginning of a sentence.
<word><dot><space> (like e.g. ) is an inner word after the parenthesis: it will always treat <dot><space> as the end of a sentence.
One way to address these corner cases is to preprocess the raw text first, transforming or removing any abbreviations it contains.
Try it here.
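A rough sketch of that preprocessing idea: mask the dots inside known abbreviations, run the regex, then restore them. The abbreviation list, the mask character, and the helper names `mask_abbreviations` and `find_sentences` are all assumptions for illustration; the mask character must not appear in your real text.

```python
import re

# Regex from this answer, assuming `+` after each character class.
SENTENCE_WITH_PARENS = re.compile(r"(?<=\. )[^(.]+\(([^)]+)\).*?\.")

# Hypothetical abbreviation list; extend it for your own corpus.
ABBREVIATIONS = ["e.g.", "i.e.", "et al."]
# Placeholder for dots inside abbreviations (assumed absent from the input).
MASK = "\u2024"


def mask_abbreviations(text):
    """Replace the dots of each known abbreviation with the placeholder."""
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", MASK))
    return text


def find_sentences(text):
    """Match on the masked text, then put the original dots back."""
    masked = mask_abbreviations(text)
    return [m.group(0).replace(MASK, ".")
            for m in SENTENCE_WITH_PARENS.finditer(masked)]


text = ("Nothing is here. Some tools, e.g. parsers, are cited here "
        "(Author et al., 2022), i.e. properly. Nothing again.")
print(find_sentences(text))
```

Without the masking step, the regex would start the match after "e.g. " and stop at the dot inside "i.e.", so only a fragment of the sentence would come back.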