I am trying to scrape data from https://www.mygov.in/covid-19, but extracting the figures raises a new problem. Each number shows the current value followed by how much it changed, e.g. 3,81,74,366⬆54,229.
When I scrape it, I get the text as 3,81,74,36654,229. How can I get only the current value?
For example:
3,81,74,36654,229 to 3,81,74,366
10,79,894198 to 10,79,894
22,40,7200 to 22,40,720
How can I do this? Please help.
CodePudding user response:
Here's an extract of an HTML fragment from that page:
<p >8,43,56,092
<span >39,477</span>
</p>
If you get the text for the p element, the return value will be merged with the span content.
Consider doing this:
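# soup is assumed to be a BeautifulSoup object built from the page's HTML
for p in soup.select('p.mid-wrap'):
    span = p.find('span')
    if span:
        # The span holds the change value
        spantext = span.getText()
        print(spantext)
        # Remove the span from the tree so its text no longer merges into p
        span.extract()
    # With the span gone, only the current value is left
    print(p.getText().strip())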
Output:
39,477
8,43,56,092
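If you would rather not modify the tree, a possible alternative is to read only the text node that is a direct child of the p element. This is a minimal sketch, not tested against the live page: the inline HTML is a copy of the fragment above, and the class name mid-wrap is assumed from the selector used in the loop.

```python
from bs4 import BeautifulSoup

# Inline copy of the fragment; the class name is assumed from the selector above.
html = """
<p class="mid-wrap">8,43,56,092
<span>39,477</span>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

for p in soup.select("p.mid-wrap"):
    # First text node that is a direct child of <p>; text inside <span> is skipped.
    current = p.find(string=True, recursive=False)
    change = p.find("span")
    current_text = current.strip() if current else ""
    change_text = change.get_text(strip=True) if change else ""
    print(current_text, change_text)
```

Because nothing is removed, the same soup object can still be used afterwards to read the change values.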
CodePudding user response:
Assuming all numbers are bigger than one thousand and the current value is the first thing in the string, something like this should work:
^.*?,\d{3}
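As a rough sketch, the pattern can be applied with Python's re module like this. It relies on the Indian digit grouping used on the page, where only the last group of the current value has three digits, so the lazy match stops at the first comma followed by three digits. The helper name current_value is made up for illustration.

```python
import re

# Lazily match up to the first comma that is followed by three digits.
CURRENT_VALUE = re.compile(r"^.*?,\d{3}")


def current_value(merged):
    """Return the leading current value from a merged string, or None."""
    match = CURRENT_VALUE.match(merged)
    return match.group(0) if match else None


for merged in ["3,81,74,36654,229", "10,79,894198", "22,40,7200"]:
    print(merged, "->", current_value(merged))
# 3,81,74,36654,229 -> 3,81,74,366
# 10,79,894198 -> 10,79,894
# 22,40,7200 -> 22,40,720
```

Note that a current value below one thousand has no comma of its own, so the match would run on into the change value. That is why the assumption above matters.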