How to scrape only the content of the <ol> that follows a specific <b>?

Time:05-26

I want to extract the journal information from the page https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/.

However, when I use the method below, it also extracts information I don't want, such as the Selected Research and Development Projects, because those entries sit in <ol> elements as well.

for ol in soup.select(".ar-faculty-section-content ol li"):

This returns every <li> of every <ol> inside elements with the class "ar-faculty-section-content".

What I expect is only the <ol> content that follows the <b>Refereed Journal articles</b> tag. How can I do that?


CodePudding user response:

Use a CSS selector to find the <b> element by its text, then use the adjacent sibling combinator (+) to pick the <ol> that directly follows it, along with its <li> elements:

for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)
Example
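import requests
from bs4 import BeautifulSoup

# Some servers reject requests without a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/', headers=headers)

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(r.text, 'html.parser')

# "+" selects the <ol> that is the next sibling element of the matching <b>
for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)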
Output
Wang,  J., & Bai, B., (2022). Whose goal emphases play a more important role in  ESL/EFL learners’ motivation, self-regulated learning and achievement?:  Teachers’ or parents’. Research Papers in  Education, 1-22. doi:10.1080/02671522.2022.2030395. 
Guo, W. J., Lau, K. L., Wei, J., & Bai, B.  (2021). Academic subject and gender differences in high school students’  self-regulated learning of language and mathematics. Current Psychology, 1-16. doi: 10.1007/s12144-021-02120-9. 
Guo, W. J., Bai, B, & Song, H. (2021). Influences of process-based  instruction on students’ use of self-regulated learning strategies in EFL  writing. System, 1-11. doi: 10.1016/j.system.2021.102578. 
...
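
To see why the adjacent sibling combinator narrows the result, here is a small offline sketch. The HTML is a hypothetical, much simplified stand-in for the real page (the actual markup is likely more nested), so treat it only as an illustration of the selector behaviour:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not copied from the real page.
html = """
<div class="ar-faculty-section-content">
  <b>Refereed Journal articles</b>
  <ol>
    <li>Journal article A</li>
    <li>Journal article B</li>
  </ol>
  <b>Selected Research and Development Projects</b>
  <ol>
    <li>Project X</li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The selector from the question: matches the <li> of every <ol> in the section.
all_items = [
    li.get_text(strip=True)
    for li in soup.select(".ar-faculty-section-content ol li")
]

# The selector from the answer: only the <ol> directly after the matching <b>.
journal_items = [
    li.get_text(strip=True)
    for li in soup.select('b:-soup-contains("Refereed Journal articles") + ol li')
]

print(all_items)
print(journal_items)
```

The first list contains all three entries, while the second contains only the two journal entries, because + matches just the <ol> that is the next sibling element of the <b>.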

EDIT

Because not every page has a <b> containing the text Refereed Journal articles, the approach above returns an empty list on those pages.

If the goal is to get all publications instead, you could select the <h5> containing the text Selected Publications and use the general sibling combinator (~) together with :nth-of-type():

for e in soup.select('h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li'):
    print(e.text)
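
If you scrape several faculty pages, one possible way to combine both selectors is to try the more specific one first and fall back to the broader one when it finds nothing. This is only a sketch: the helper name publication_items and the sample HTML are made up for illustration, and the real pages may be structured differently.

```python
from bs4 import BeautifulSoup


def publication_items(soup):
    """Return <li> texts from the first selector that matches anything.

    Hypothetical helper: tries the journal-specific selector first and
    falls back to the broader "Selected Publications" selector.
    """
    selectors = [
        'b:-soup-contains("Refereed Journal articles") + ol li',
        'h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li',
    ]
    for selector in selectors:
        items = [li.get_text(strip=True) for li in soup.select(selector)]
        if items:
            return items
    return []  # neither heading was found on this page


# Hypothetical page that has no <b>Refereed Journal articles</b> heading.
html_without_b = """
<div>
  <h5>Selected Publications</h5>
  <ol><li>Paper 1</li><li>Paper 2</li></ol>
  <h5>Projects</h5>
  <ol><li>Project X</li></ol>
</div>
"""

fallback_items = publication_items(BeautifulSoup(html_without_b, "html.parser"))
print(fallback_items)
```

Here the first selector matches nothing, so the function returns the items of the first <ol> among the siblings that follow the Selected Publications heading.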