How to find a regular expression to display headlines without extra characters?

Time:03-07

I am trying to write a regular expression that extracts the headlines from a stock's news feed.

This is the code I have so far; the regular expression pattern is "<title.*?</":

def yahoo_hl(ticker):
    import re, requests
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
    xml = requests.get(f'https://feeds.finance.yahoo.com/rss/2.0/headline?s={ticker}', headers=headers).text
    news_headlines = re.findall(r'<title.*?</', xml, re.DOTALL) # put your regular expression between the single quotes
    return news_headlines

When I run it, I get the following output, where each headline still has "<title>" at the beginning and "</" at the end:

['<title>Yahoo! Finance: TSLA News</',
 '<title>Tesla Is About to Start Production at Its Berlin Gigafactory</',
 '<title>Tesla CEO Elon Musk Wants the U.S. and the World to Pump More Oil</',
 '<title>Tesla Gets Stronger With Oil Rising, Other EV Stocks Not So Much</',
 '<title>What Is The Boring Company?</']

The goal is to remove the "<title>" and the "</" so that the headlines are output like this:

['Yahoo! Finance: TSLA News',
 'Tesla Is About to Start Production at Its Berlin Gigafactory',
 'Tesla CEO Elon Musk Wants the U.S. and the World to Pump More Oil',
 'Tesla Gets Stronger With Oil Rising, Other EV Stocks Not So Much',
 'What Is The Boring Company?']

Any help would be appreciated. Thank you in advance.

CodePudding user response:

You can make a "capturing group" in the regex:

import re, requests

def yahoo_hl(ticker):
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
    xml = requests.get(f'https://feeds.finance.yahoo.com/rss/2.0/headline?s={ticker}', headers=headers).text
    # The parentheses form a capturing group, so findall returns only the text between the tags
    news_headlines = re.findall(r'<title>(.*?)</title>', xml, re.DOTALL)
    return news_headlines

print(*yahoo_hl('TSLA'), sep='\n') # yahoo_hl('TSLA') is the list you want

Output:

Yahoo! Finance: TSLA News
Tesla Is About to Start Production at Its Berlin Gigafactory
Tesla CEO Elon Musk Wants the U.S. and the World to Pump More Oil
Tesla Gets Stronger With Oil Rising, Other EV Stocks Not So Much
What Is The Boring Company?
...

The documentation for re.findall describes this behaviour:

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group.
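
To see that rule without making a network request, here is a minimal sketch that runs both patterns against a small hard-coded string. The `sample` text is made up for illustration and is not real feed data.

```python
import re

# Hypothetical sample standing in for the feed text (not real feed data).
sample = "<title>First headline</title><title>Second headline</title>"

# No groups in the pattern: each result is the whole match, tags included.
whole = re.findall(r"<title>.*?</title>", sample, re.DOTALL)

# Exactly one group: each result is only the text captured by that group.
captured = re.findall(r"<title>(.*?)</title>", sample, re.DOTALL)

print(whole)
print(captured)
```

The only difference between the two patterns is the pair of parentheses around `.*?`, which is what makes findall return the headline text alone.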
