Creating a list of company news stories and matching dates

I'm attempting to create a list which groups a company stock symbol with a news headline and its corresponding date.

The head of the data essentially looks like the following:

{'ford-motor-co': "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup 
Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian 
shares at a price of $26.88, the company says. That followed an 8-million-share 
sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in 
EV maker Rivian for $188.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - 
Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive 
Inc for about $188.2 million, or $26.88 apiece, the U.S. automaker said in a filing 
on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh

I've managed to successfully extract the dates and the stock symbol, but I cannot figure out how to group each date with its associated news headline.

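import re

parsed_data = []

# stock_news_tables maps each stock name to the scraped text of its news page
for stock, stock_news_table in stock_news_tables.items():
    # Collect every date that looks like "May 14, 2022"
    date_data = re.findall(r'[A-Z][a-z]{2} \d{1,2}, \d{4}', str(stock_news_table))
    # The whole page text is stored here, not the individual headlines
    headline = stock_news_table
    parsed_data.append([stock, date_data, headline])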

The output so far looks like the following. As you can see, the headlines are separated by runs of multiple newlines (\n\n\n\n):

 [['ford-motor-co',
  ['May 14, 2022',
   'May 14, 2022',
   'May 13, 2022',
   'May 13, 2022',
   'May 13, 2022',
   'May 13, 2022',
   'May 12, 2022',
   'May 12, 2022',
   'May 12, 2022'],
  "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup Rivian\nBy The 
   Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 million Rivian shares at a 
   price of $26.88, the company says. That followed an 8-million-share sale earlier 
   in the week at about the same price.\n\n\n\n\n \nFord sells shares in EV maker 
   Rivian for $188.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - Ford 
   Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Aut

CodePudding user response：

I managed to solve your question using dateparser, a natural-language date parser, and two different regexes. Hopefully it'll be enough.

First, install dateparser:

pip install dateparser

Then run the code:

import collections, re, dateparser
Stock = collections.namedtuple("Stock", ["name", "symbol", "headlines"])

# Remember, '.+' does not cross line breaks; it is equivalent to '[^\n]+'
headline_re = re.compile(r"\n\n ?\n(?P<headline>.+)\nBy .+?\xa0-\xa0(?P<date>[\w ,]+)")
symbol_re = re.compile(r"\(([A-Z]{1,4}:[A-Z]{1,4})\)")
input_data = {'ford-motor-co':(
    "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup "
    "Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian "
    "shares at a price of $26.88, the company says. That followed an 8-million-share "
    "sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in "
    "EV maker Rivian for $188.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - "
    "Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive "
    "Inc for about $188.2 million, or $26.88 apiece, the U.S. automaker said in a filing "
    "on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh")}

stocks = []
for name, data in input_data.items():
    headlines = []
    for match in headline_re.finditer(data):
        date_str = match.group("date")
        date = dateparser.parse(date_str)
        headlines.append((match.group("headline"), date))
    symbol = symbol_re.search(data).group(1)
    stocks.append(Stock(name, symbol, headlines))

Output (stocks):

[Stock(name='ford-motor-co', symbol='NYSE:F', headlines=[('Ford Unloads More Shares in Electric-Vehicle Startup Rivian', datetime.datetime(2022, 5, 14, 20, 58, 28, 30552)), ('Ford sells shares in EV maker Rivian for $188.2 million', datetime.datetime(2022, 5, 14, 0, 0))])]

Do check that the symbol regex is correct, as I'm not sure what constraints apply to exchange and ticker symbols.
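
If you would rather not add a dependency, the same idea can be sketched with the standard library only. This is a rough variant, not the code from the answer: the helper names parse_date and to_rows are made up for illustration, absolute dates such as "May 14, 2022" are parsed with datetime.strptime (which assumes English month abbreviations in the current locale), and relative dates such as "20 hours ago" are simply kept as text. It returns flat [stock, headline, date] rows, which is the grouping the question asks for.

```python
import re
from datetime import datetime

# Same shape as the answer's pattern: a headline line followed by a
# "By <source> - <date>" line (the separators are non-breaking spaces).
HEADLINE_RE = re.compile(
    r"\n\n ?\n(?P<headline>.+)\nBy .+?\xa0-\xa0(?P<date>[\w ,]+)"
)


def parse_date(text):
    """Return a date for 'May 14, 2022'-style text, else the stripped text."""
    text = text.strip()
    try:
        return datetime.strptime(text, "%b %d, %Y").date()
    except ValueError:
        return text  # e.g. '20 hours ago' stays a plain string


def to_rows(stock_news_tables):
    """Flatten {stock: page_text} into [stock, headline, date] rows."""
    rows = []
    for stock, text in stock_news_tables.items():
        for match in HEADLINE_RE.finditer(text):
            rows.append(
                [stock, match.group("headline"), parse_date(match.group("date"))]
            )
    return rows


# Shortened sample taken from the data in the question
sample = {
    "ford-motor-co": (
        "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup Rivian"
        "\nBy The Wall Street Journal\xa0-\xa020 hours ago"
        "\nFord sold 7 million Rivian shares."
        "\n\n\n\n\n \nFord sells shares in EV maker Rivian for $188.2 million"
        "\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - Ford Motor (NYSE:F) Co sold shares."
    )
}

rows = to_rows(sample)
for row in rows:
    print(row)
```

Keeping relative dates as strings is a deliberate simplification; converting them to real timestamps is the part dateparser handles for you.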

CodePudding user response：

You can use re.split.

The documentation says that if capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list.

So if you use the pattern r'([A-Z][a-z]{2} \d{1,2}, \d{4})':

import re

stock = 'ford-motor-co'
stock_news_table =  """\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup 
Rivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian 
shares at a price of $26.88, the company says. That followed an 8-million-share 
sale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in 
EV maker Rivian for $188.2 million\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - 
Ford Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive 
Inc for about $188.2 million, or $26.88 apiece, the U.S. automaker said in a filing 
on Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh"""
date_data = re.split(r'([A-Z][a-z]{2} \d{1,2}, \d{4})' , str(stock_news_table))
headline = stock_news_table
date_data

will return

['\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup \nRivian\nBy The Wall Street Journal\xa0-\xa020 hours ago\nFord sold 7 millionRivian \nshares at a price of $26.88, the company says. That followed an 8-million-share \nsale earlier in the week at about the same price.\n\n\n\n\n \nFord sells shares in \nEV maker Rivian for $188.2 million\nBy Reuters\xa0-\xa0',
 'May 14, 2022',
 '  (Reuters) - \nFord Motor (NYSE:F) Co sold 7 million shares of electric carmaker Rivian Automotive \nInc for about $188.2 million, or $26.88 apiece, the U.S. automaker said in a filing \non Friday. Ford now... \n\n\n\n\n\n\n\nFord sells sh']
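
To get from that list to date/headline pairs, one possible follow-up is sketched below. It relies on the fact that the split result alternates text, date, text, date, and so on, and it assumes that each date sits on a "By <source> - <date>" line directly under its headline, as in the sample. The function name pair_dates_with_headlines is made up for illustration. Note that headlines with a relative date such as "20 hours ago" are skipped, because the date pattern does not match them.

```python
import re

# Capturing group, so re.split keeps the matched dates in the result
DATE_RE = re.compile(r"([A-Z][a-z]{2} \d{1,2}, \d{4})")


def pair_dates_with_headlines(text):
    """Pair each matched date with the line above its 'By ...' byline."""
    parts = DATE_RE.split(text)
    pairs = []
    # Even indexes hold the text between dates, odd indexes hold the dates
    for before, date in zip(parts[0::2], parts[1::2]):
        lines = [line.strip() for line in before.splitlines() if line.strip()]
        # Only accept dates that directly follow a byline; a date inside
        # an article body would not have "By ..." as the preceding line
        if len(lines) >= 2 and lines[-1].startswith("By "):
            pairs.append((date, lines[-2]))
    return pairs


# Shortened sample taken from the data in the question, without wrapped lines
text = (
    "\n\n\n\n\n\nFord Unloads More Shares in Electric-Vehicle Startup Rivian"
    "\nBy The Wall Street Journal\xa0-\xa020 hours ago"
    "\nFord sold 7 million Rivian shares."
    "\n\n\n\n\n \nFord sells shares in EV maker Rivian for $188.2 million"
    "\nBy Reuters\xa0-\xa0May 14, 2022  (Reuters) - Ford Motor (NYSE:F) Co sold shares."
)

pairs = pair_dates_with_headlines(text)
print(pairs)
```

If the text is pasted into a triple-quoted string with hard line wraps, as in the snippet above, a wrapped headline ends up on two lines and only its last line would be picked up, so it is safer to run this on the original scraped string.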