How to find a substring using regex

Time:11-15

The string I ended up with after scraping 1000 Reuters articles looks like this:

<TEXT>&#2;
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR>    By Yoshiko Mori</AUTHOR>
<DATELINE>    TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;</BODY></TEXT>

I want to extract the title, author, dateline and body from this string. I have the regex below, but it does not work for the body section.

try:
  body=re.search('<BODY>(.)</BODY>',example_txt).group(1)
except:
  body='NA'

This try-except always returns NA for the body, although the same approach works for the title, author and dateline. Any idea why?

Thanks!

Answer:

Use re.DOTALL so that . matches newlines as well. The body spans several lines, so without the flag . stops at the first line break.

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

https://docs.python.org/3/library/re.html
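To see the effect of the flag in isolation, here is a minimal sketch. The two-line string `sample` is made up for illustration and is not part of the Reuters data.

```python
import re

# Hypothetical two-line sample, for illustration only.
sample = "<BODY>line one\nline two</BODY>"

# Without DOTALL, '.' cannot cross the newline, so the search fails.
without_flag = re.search(r"<BODY>(.*)</BODY>", sample)

# With DOTALL, '.' also matches '\n', so the whole body is captured.
with_flag = re.search(r"<BODY>(.*)</BODY>", sample, flags=re.DOTALL)

print(without_flag)        # None
print(with_flag.group(1))  # both lines, newline included
```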

You also need * so that the group can match more than one character, and ? after it to make the match non-greedy, so that it stops at the first </BODY> rather than the last one.
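The difference between the three quantifier choices can be sketched on a small string. The `sample` below, with two BODY blocks on one line, is an assumption made purely for the demonstration.

```python
import re

# Hypothetical sample with two BODY blocks, for illustration only.
sample = "<BODY>first</BODY><BODY>second</BODY>"

# '(.)' requires exactly one character between the tags, so nothing matches.
one_char = re.search(r"<BODY>(.)</BODY>", sample)

# Greedy '.*' runs as far as it can and ends at the LAST </BODY>.
greedy = re.search(r"<BODY>(.*)</BODY>", sample)

# Non-greedy '.*?' stops at the FIRST </BODY>.
lazy = re.search(r"<BODY>(.*?)</BODY>", sample)

print(one_char)         # None
print(greedy.group(1))  # first</BODY><BODY>second
print(lazy.group(1))    # first
```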

Finally, a bare try/except is not ideal here, because it hides every error and not just a failed match. re.search returns None when nothing matches, so you can check the match object directly instead.

import re

example_txt = '''<TEXT>&#2;
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR>    By Yoshiko Mori</AUTHOR>
<DATELINE>    TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;</BODY></TEXT>'''

m = re.search(r'<BODY>(.*?)</BODY>', example_txt, flags=re.DOTALL)
body = m.group(1) if m else 'NA'

print(body)

Output:

If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;
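Since the question asks for the title, author and dateline as well, the same pattern can be applied to each tag. The following is only a sketch: the helper name `extract_field`, the `article` dictionary and the choice to strip surrounding whitespace are assumptions for illustration, not something from the original post.

```python
import re

def extract_field(tag, text, default="NA"):
    """Return the text between <tag> and </tag>, or default if the tag is absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    m = re.search(pattern, text, flags=re.DOTALL)
    # Stripping whitespace is an assumption; drop .strip() to keep the raw text.
    return m.group(1).strip() if m else default

example_txt = '''<TEXT>&#2;
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR>    By Yoshiko Mori</AUTHOR>
<DATELINE>    TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;</BODY></TEXT>'''

# Build one dictionary per article, keyed by lowercase tag name.
article = {
    tag.lower(): extract_field(tag, example_txt)
    for tag in ("TITLE", "AUTHOR", "DATELINE", "BODY")
}

print(article["title"])
print(article["author"])
print(article["dateline"])
```

A tag that does not occur in the text, for example `extract_field("PLACES", example_txt)`, falls back to the default 'NA' without needing try/except.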