The string I ended up with after scraping 1000 Reuters articles looks like this:
<TEXT>
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR> By Yoshiko Mori</AUTHOR>
<DATELINE> TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
</BODY></TEXT>
I want to extract the title, author, dateline and body from this string. I have the regex below, but unfortunately it does not work for the body section.
try:
    body=re.search('<BODY>(.)</BODY>',example_txt).group(1)
except:
    body='NA'
This try-except always returns NA for body, but works for title, author and dateline. Any idea why?
Thanks!
CodePudding user response:
Use re.DOTALL so that . matches newlines as well.
re.DOTALL: Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
You also need * to match multiple characters, since (.) on its own captures exactly one character, and ? to make the match non-greedy.
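To make the effect of the flag and the two quantifiers concrete, here is a small sketch. The sample string is made up for illustration (it is not from the Reuters data) and contains two BODY blocks so the greedy and non-greedy results differ.

```python
import re

# Hypothetical sample with two multi-line BODY blocks.
sample = "<BODY>first\nsecond</BODY> <BODY>third\nfourth</BODY>"

# Without re.DOTALL, '.' cannot cross the newline, so neither block matches.
no_flag = re.search(r"<BODY>(.*?)</BODY>", sample)

# With re.DOTALL and a greedy '*', the match runs to the LAST </BODY>.
greedy = re.search(r"<BODY>(.*)</BODY>", sample, flags=re.DOTALL)

# With re.DOTALL and a non-greedy '*?', the match stops at the FIRST </BODY>.
lazy = re.search(r"<BODY>(.*?)</BODY>", sample, flags=re.DOTALL)

print(no_flag)                # None
print(repr(greedy.group(1)))  # spans both blocks
print(repr(lazy.group(1)))    # only the first block
```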
Finally, I have a hunch that a bare try/except is not quite recommended here, since it also hides unrelated errors. You can instead check whether the match object returned by re.search is None.
import re
example_txt = '''<TEXT>
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR> By Yoshiko Mori</AUTHOR>
<DATELINE> TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
</BODY></TEXT>'''
m = re.search(r'<BODY>(.*?)</BODY>', example_txt, flags=re.DOTALL)
body = m.group(1) if m else 'NA'
print(body)
Output:
If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
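Since the question asks for the title, author, dateline and body, the same pattern can be applied to each tag in a loop. This is only a sketch: the helper name extract_fields and the choice to strip surrounding whitespace are my own assumptions, not something from the question.

```python
import re

def extract_fields(text, tags=("TITLE", "AUTHOR", "DATELINE", "BODY")):
    """Return a dict of tag name (lowercased) -> stripped contents, or 'NA'."""
    fields = {}
    for tag in tags:
        # Same pattern as above: non-greedy, with '.' allowed to match newlines.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        fields[tag.lower()] = m.group(1).strip() if m else "NA"
    return fields

example_txt = '''<TEXT>
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR> By Yoshiko Mori</AUTHOR>
<DATELINE> TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
</BODY></TEXT>'''

fields = extract_fields(example_txt)
for name, value in fields.items():
    print(name, '->', repr(value))
```

A tag that is missing from the text comes back as 'NA' instead of raising, which keeps the behaviour of the original try/except without swallowing other exceptions.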