Need help extracting date from text in Python

I have data that comes in every day via Python code, like this:

id="ContentPlaceHolder1_cph_main_cph_main_SummaryGrid">\r\n\t\t<tr class="tr-header">\r\n\t\t\t<th scope="col">&nbsp;</th><th class="right-align" scope="col">Share<br>Price</th><th class="right-align" scope="col">NAV</th><th class="right-align" scope="col">Premium/<br>Discount</th>\r\n\t\t</tr><tr>\r\n\t\t\t<td>Current</td><td class="right-align">$19.14</td><td class="right-align">$21.82</td><td class="right-align">-12.28%</td>\r\n\t\t</tr>

I need to extract the two prices and the percentage value (in this example "$19.14", "$21.82", and "-12.28%"), but I am having trouble figuring out how to parse the text and pull them out. Is there a way to do this by looping through and searching for the text before and after each value?

The text before and after is always the same but the date changes. If not possible by this method, is there another way? Thank you very much!

CodePudding user response：

Here is one way to get the desired output:

from bs4 import BeautifulSoup

markup = """
<div class="row-fluid">
 <div class="span6">
  <p class="as-of-date">
   <span id="ContentPlaceHolder1_cph_main_cph_main_AsOfLabel">
    As of 9/24/2021
   </span>
  </p>
  <div class="table-wrapper">
   <div>
    &lt;table class="cefconnect-table-1 table table-striped" cellspacing="0" cellpadding="5" 
Border="0
   </div>
  </div>
 </div>
</div>

"""

soup = BeautifulSoup(markup, 'html.parser')
#print(soup.prettify())

# strip=True removes the whitespace around the label text
label = soup.select_one('#ContentPlaceHolder1_cph_main_cph_main_AsOfLabel').get_text(strip=True)
print(label.replace('As of ', ''))

Output:

9/24/2021

CodePudding user response：

If the date is the only part of the string that changes, you can split the string to get the date:

result = mystring.split(
'</span>\r\n\t\t\t\t\t\t\t</p>\r\n\r\n\t\t\t\t\t\t\t<div class="table-wrapper">')


# Take the last whitespace-separated token, so 9/4/2021 and 10/24/2021 both work
date = result[0].split()[-1]

This gives you the date as a plain string. You can also split it to get an integer for each component of the date, like this:

month, day, year = [int(num) for num in date.split('/')]
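
If a real date object is more convenient than three integers, the standard library's datetime.strptime can parse the same string. This is only a minimal sketch, assuming the date always arrives in month/day/year order as in the sample; the variable name date_text is made up for illustration.

```python
from datetime import datetime

# Assumed input: the date string extracted above, in month/day/year order.
date_text = "9/24/2021"

# When parsing, %m and %d also accept values without a leading zero.
as_of = datetime.strptime(date_text, "%m/%d/%Y").date()
print(as_of.isoformat())
```

A date object can then be compared or sorted directly, which is handy when the data comes in every day.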

CodePudding user response：

Here is the solution:

from bs4 import BeautifulSoup

id = """
id="ContentPlaceHolder1_cph_main_cph_main_SummaryGrid"&gt;
<tr class="tr-header">
 <th scope="col">
 </th>
 <th class="right-align" scope="col">
  Share
  <br/>
  Price
 </th>
 <th class="right-align" scope="col">
  NAV
 </th>
 <th class="right-align" scope="col">
  Premium/
  <br/>
  Discount
 </th>
</tr>
<tr>
 <td>
  Current
 </td>
 <td class="right-align">
  $19.14
 </td>
 <td class="right-align">
  $21.82
 </td>
 <td class="right-align">
  -12.28%
 </td>
</tr>

"""

soup = BeautifulSoup(id, 'html.parser')
#print(soup.prettify())

# strip=True removes the whitespace around each cell's text
tags = soup.select('td.right-align')
for tag in tags:
    print(tag.get_text(strip=True))

Output:

$19.14
$21.82
-12.28%
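
If the values are needed as numbers rather than strings, the cell texts from the loop above could be converted roughly as follows. This is a sketch under assumptions: the cells list is hard-coded here to stand in for the extracted texts, and the label names are hypothetical, chosen to mirror the header row in the question.

```python
# Assumed input: the cell texts collected from the td.right-align elements.
cells = ["$19.14", "$21.82", "-12.28%"]

# Hypothetical labels, named after the header row in the question.
labels = ["share_price", "nav", "premium_discount_pct"]

# Trim whitespace, drop the currency and percent signs, then convert to float.
values = {
    label: float(text.strip().strip("$%"))
    for label, text in zip(labels, cells)
}
print(values)
```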

CodePudding user response：

I would suggest using a regular expression (regex) rather than making it hard for yourself. If you are unsure how regex works, look up the syntax. Python's built-in re module is very useful to know.
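
As a rough illustration of the regex suggestion, the sketch below pulls both the date and the three values out of a string. The raw string is an assumed fragment pieced together from the samples in this thread, not the full page, and the patterns only work while the markup keeps this exact shape.

```python
import re

# Assumed input: a fragment shaped like the HTML shown in the question and answers.
raw = (
    '<span id="ContentPlaceHolder1_cph_main_cph_main_AsOfLabel">As of 9/24/2021</span>'
    '<tr>\r\n\t\t\t<td>Current</td><td class="right-align">$19.14</td>'
    '<td class="right-align">$21.82</td><td class="right-align">-12.28%</td>\r\n\t\t</tr>'
)

# A month/day/year date following the literal text "As of".
date_match = re.search(r"As of\s+(\d{1,2}/\d{1,2}/\d{4})", raw)
as_of = date_match.group(1) if date_match else None

# The text of every right-aligned td cell, with surrounding whitespace trimmed.
cells = re.findall(r'<td class="right-align">\s*([^<]*?)\s*</td>', raw)

print(as_of)
print(cells)
```

Regex is fine for a fixed snippet like this one, but it breaks easily if the attributes or tag order change, so an HTML parser such as BeautifulSoup is usually the safer choice for whole pages.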