How to scrape only texts from specific HTML elements?

I have a problem with selecting the appropriate items from the list.

For example, I want to omit "1." and then the first "5" (as in the example). Additionally, I would like to add a condition so that the letter "W" is changed to "WIN".

import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep

driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})

res = []
for i in content:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        res.append(line)
print(res)

results

['1.Ajax550016:315?WWWWW']

I need

Ajax;5;5;0;16;3;W;W;W;W;W

CodePudding user response：

You had the right start by using bs4 to find the table div, but then you gave up and tried to extract everything from the text with re. As you can see, that's not going to work. Here is a simple hack that gets what you want: I keep grabbing elements from the table div you found and, once I hit Ajax, collect the text of eight cells starting with Ajax itself. Then I do some dirty string manipulation, because the WWWWW all sits in the same top-level div.

import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep

from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})

res = []
found = 0
for i in content.find('div'):
    line = i.text.split()[0]
    if re.search('Ajax', line):
        found = 8
    if found:
        found -= 1
        res.append(line)
# change field 5 into separate values and skip field 6
res = res[:4] + res[5].split(':') + res[7:]
# break the last field into separate values and drop the first '?'
res = res[:-1] + [i for i in res[-1]][1:]
print(";".join(res))

returns

Ajax;5;5;0;16;3;W;W;W;W;W

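This works, but it is very brittle and will break as soon as the website changes its markup, so you should add a lot of error checking. I also replaced the sleep with an implicit wait and added ChromeDriverManager from webdriver_manager, which is what lets me use Selenium with Chrome. Note that implicitly_wait only sets the timeout for element lookups and does not pause before page_source is read, so if the table has not rendered yet you may still need an explicit wait.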
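As a rough idea of what that error checking could look like, here is a minimal sketch that does the same post-processing on the eight cell texts but validates them first. The function name build_row and the assumed cell layout are mine, inferred from the output above, not anything the site guarantees.

```python
def build_row(cells):
    """Turn the eight scraped cell texts into one semicolon-separated row.

    Assumed layout (inferred from the scraped output, not documented anywhere):
    index 0 is the team name, 5 is the "for:against" score,
    6 is the points total and 7 is the form string such as "?WWWWW".
    """
    if len(cells) != 8:
        raise ValueError(f"expected 8 cells, got {len(cells)}: {cells!r}")

    # The score cell must look like "16:3".
    goals = cells[5].split(":")
    if len(goals) != 2 or not all(g.isdigit() for g in goals):
        raise ValueError(f"unexpected score cell: {cells[5]!r}")

    # Strip any leading '?' markers, then split the form into single letters.
    form = list(cells[7].lstrip("?"))
    if not form:
        raise ValueError(f"empty form cell: {cells[7]!r}")

    # Same selection as the slicing above: keep cells 0-3, skip cell 4
    # and the points cell (6), and expand the score and form cells.
    return ";".join(cells[:4] + goals + form)


print(build_row(["Ajax", "5", "5", "0", "0", "16:3", "15", "?WWWWW"]))
# prints: Ajax;5;5;0;16;3;W;W;W;W;W
```

If the site adds or removes a column, this fails loudly with a ValueError instead of silently producing a shifted row.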

CodePudding user response：

I would recommend selecting your elements more specifically:

for e in soup.select('.ui-table__row'):

Iterate over the ResultSet and decompose() the unwanted tags:

e.select_one('.wld--tbd').decompose()

Extract texts with stripped_strings and join() them to your expected string:

data.append(';'.join(e.stripped_strings))

Example

The example below also makes some replacements based on a dict, just to demonstrate how this would work. I don't know what R or P stand for, so their replacement values are placeholders.

...
soup = BS2(page,'html.parser')
data = []

# Illustrative replacements only; the values for R and P are placeholders
pattern = {'W':'WIN','R':'RRR','P':'PPP'}

for e in soup.select('.ui-table__row'):
    # Drop the cells that should not appear in the output
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    # Keep only the part of the score before the colon
    score = e.select_one('.table__cell--score')
    score.string = score.text.split(':')[0]
    data.append(';'.join(pattern.get(i, i) for i in e.stripped_strings))
data

Output

['Ajax;5;5;0;0;16;WIN;WIN;WIN;WIN;WIN',
 'Feyenoord;5;4;1;0;14;WIN;WIN;WIN;RRR;WIN',
 'Alkmaar;5;4;1;0;10;WIN;RRR;WIN;WIN;WIN',
 'PSV;5;4;0;1;23;PPP;WIN;WIN;WIN;WIN',
 'Twente;5;4;0;1;10;WIN;WIN;PPP;WIN;WIN',
 'Heerenveen;5;2;3;0;6;RRR;WIN;WIN;RRR;RRR',
 'Sparta Rotterdam;5;2;1;2;7;WIN;WIN;PPP;PPP;RRR',
 'Waalwijk;5;1;3;1;10;WIN;RRR;PPP;RRR;RRR',
 'Nijmegen;5;1;3;1;6;RRR;RRR;RRR;WIN;PPP',
 'Excelsior;5;2;0;3;8;PPP;PPP;PPP;WIN;WIN',
 'Utrecht;5;1;2;2;8;WIN;PPP;PPP;RRR;RRR',
 'Groningen;5;1;2;2;5;PPP;RRR;WIN;PPP;RRR',
 'Cambuur;5;1;1;3;4;PPP;PPP;WIN;RRR;PPP',
 'Vitesse;5;1;1;3;6;WIN;RRR;PPP;PPP;PPP',
 'FC Emmen;5;1;1;3;5;PPP;PPP;WIN;RRR;PPP',
 'FC Volendam;5;1;1;3;5;PPP;PPP;WIN;PPP;RRR',
 'G.A. Eagles;5;0;0;5;5;PPP;PPP;PPP;PPP;PPP',
 'Sittard;5;0;0;5;7;PPP;PPP;PPP;PPP;PPP']
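If you only want the replacement the question asks for (W to WIN), the same dict.get(i, i) lookup works with a smaller dict, because any string that is not a key falls back to itself. A minimal sketch on a made-up row follows. FORM_LABELS, relabel and the sample values are hypothetical names and data for illustration only.

```python
# Only the "W" -> "WIN" mapping comes from the question; add further
# entries here once you know what the other letters should become.
FORM_LABELS = {"W": "WIN"}


def relabel(fields, labels=FORM_LABELS):
    """Replace every field that has an entry in `labels`; keep the rest as-is."""
    return [labels.get(field, field) for field in fields]


# Made-up sample row, not scraped data.
row = ["Ajax", "5", "5", "0", "0", "16", "W", "R", "P", "W", "W"]
print(";".join(relabel(row)))
# prints: Ajax;5;5;0;0;16;WIN;R;P;WIN;WIN
```

Keep in mind that the lookup is applied to every field, so a key should never collide with a value that can appear in another column.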