I want to extract the information between two dates in a PDF. I can match the beginning of each date, but I can't get the match to extend all the way to the beginning of the next date. I've been trying with the following regex:
(?=\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2})
Here is a sample of some of the texts from the PDFs
25/03/2021 -11:42 ANTONIO LUCIVA SALDANHAALVES (472959COREN) ANTONIO LUCIVA SALDANHAALVES (472959COREN) ENFERMAGEMPCT JÁ ESTA DE ALTA HOSPITALAR MELHORADA,EM AGUARDO DO PAD PARA LIBERAÇÃO , NO QUAL ENFERMEIRO MANOEL VEIO AVALIAR CLIENTE ONDE O MESMO LIBEROU PARA ACOMPANHAMENTO DOMICILIAR. EVOLUI COM MELHORA SATISFATÓRIA,HUMOR PRESERVADO, CONSCIENTE,ORIENTADA, VERBALIZA, DEAMBULA SE NECESSÁRIO. NEGA DISPNEIA OU MAIORES QUEIXAS. ELIMINAÇÕES FISIOLOGICAS PRESENTES SEM ALTERAÇÕES. DESSA FORMA CLIENTE É LIBERADO E SERÁ ACOMPANHADA PELO (PAD). 25/03/2021 -08:22LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)EM TEMPO SOLICITO EXAMES 25/03/2021 -08:20LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)
That's what I want it to match, along with all the occurrences that follow.
CodePudding user response:
Your expression is a proper lookahead, but you still need to define what you want to match before it.
You already have a proper way of matching a date; now you just need to match everything after it, including newlines, up to the next date.
Using a lazy .*? inside an inline DOTALL group, (?s:...), so that the dot also matches newlines, we get:
"\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2}(?s:.*?)(?=\d{2}\/\d{2}\/\d{4} -\d{2}\:\d{2})"
CodePudding user response:
It is sometimes easier to just split the text at the target pattern. For example, using your date pattern (without the lookahead) with re.split(your_pattern, your_text), we get the following list:
['',
' ANTONIO LUCIVA SALDANHAALVES (472959COREN) ANTONIO LUCIVA SALDANHAALVES (472959COREN) ENFERMAGEMPCT JÁ ESTA DE ALTA HOSPITALAR MELHORADA,EM AGUARDO DO PAD PARA LIBERAÇÃO , NO QUAL ENFERMEIRO MANOEL VEIO AVALIAR CLIENTE ONDE O MESMO LIBEROU PARA ACOMPANHAMENTO DOMICILIAR. EVOLUI COM MELHORA SATISFATÓRIA,HUMOR PRESERVADO, CONSCIENTE,ORIENTADA, VERBALIZA, DEAMBULA SE NECESSÁRIO. NEGA DISPNEIA OU MAIORES QUEIXAS. ELIMINAÇÕES FISIOLOGICAS PRESENTES SEM ALTERAÇÕES. DESSA FORMA CLIENTE É LIBERADO E SERÁ ACOMPANHADA PELO (PAD). ',
'LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)EM TEMPO SOLICITO EXAMES ',
'LIA FERNANDES ALVES DE LIMA (8308CRM)LIA FERNANDES ALVES DE LIMA (8308CRM)']
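As a rough sketch of the split approach (the sample text and names are shortened, hypothetical stand-ins for the real PDF text): a plain re.split drops the dates, while wrapping the pattern in a capturing group makes re.split return the dates too, so each date can be paired with the text that follows it.

```python
import re

# Shortened stand-in for the text extracted from the PDF (assumption).
text = (
    "25/03/2021 -11:42 FIRST ENTRY. "
    "25/03/2021 -08:22SECOND ENTRY "
    "25/03/2021 -08:20LAST ENTRY"
)

DATE = r"\d{2}/\d{2}/\d{4} -\d{2}:\d{2}"

# Plain split: the dates are consumed, only the text between them remains.
parts = re.split(DATE, text)

# With a capturing group, re.split also returns the matched dates.
pieces = re.split("(" + DATE + ")", text)

# pieces[0] is whatever precedes the first date; after that,
# dates and texts alternate, so they can be zipped into pairs.
records = list(zip(pieces[1::2], pieces[2::2]))

for date, body in records:
    print(date, "->", body.strip())
```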
CodePudding user response:
If you want to be able to cross newline boundaries, you can match any character with [\s\S] and use a capture group for the text between the dates:
\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)([\s\S]*?)(?=\s*\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)|$)
Explanation
\b : A word boundary
\d{2}/\d{2}/\d{4} -\d{2}:\d{2} : Match the date-like pattern
(?!\d) : Negative lookahead, assert that there is no digit directly to the right
([\s\S]*?) : Capture group 1, match any character (including newlines) 0 or more times, as few as possible, so an empty string is also valid
(?= : Positive lookahead, assert that what follows is either
\s*\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d) : The same date pattern as at the start, with optional leading whitespace chars
| : Or
$ : End of string
) : Close lookahead
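A small sketch of this pattern in Python, again assuming the PDF text is already available as a string (the sample and the names are shortened, hypothetical stand-ins). Group 1 holds the text between one date and the next, and the \s* in the lookahead keeps the trailing whitespace before the next date out of the group.

```python
import re

# Shortened stand-in for the text extracted from the PDF (assumption).
text = (
    "25/03/2021 -11:42 FIRST ENTRY\nSECOND LINE. "
    "25/03/2021 -08:22SECOND ENTRY "
    "25/03/2021 -08:20LAST ENTRY"
)

pattern = re.compile(
    r"\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)"             # the date itself
    r"([\s\S]*?)"                                         # group 1: text after it
    r"(?=\s*\b\d{2}/\d{2}/\d{4} -\d{2}:\d{2}(?!\d)|$)"    # stop at next date or end
)

# Collect group 1 of every match; the last entry is included
# because the lookahead also accepts the end of the string.
texts = [m.group(1) for m in pattern.finditer(text)]

for t in texts:
    print(repr(t))
```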