I want to extract the journal information from the page https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/.
However, when I tried the method below, it extracted everything that sits in an ol list, including sections such as Selected Research and Development Projects, which I don't want.
for ol in soup.select(".ar-faculty-section-content ol li"):
It returns the li items of every ol inside the elements with the class name ".ar-faculty-section-content".
What I expect to get is only the ol content under the <b>Refereed Journal articles</b> tag. How can I do that?
CodePudding user response:
Use CSS selectors: find the <b> element by its text with :-soup-contains(), then use the adjacent sibling combinator (+) to pick the ol that directly follows it, together with its li items:
for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)
Example
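import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.fed.cuhk.edu.hk/cri/faculty/prof-bai-barry/', headers=headers)
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(r.text, 'html.parser')

# Match the <b> by its text, then the <ol> directly after it, then its <li> items
for e in soup.select('b:-soup-contains("Refereed Journal articles") + ol li'):
    print(e.text)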
Output
Wang, J., & Bai, B., (2022). Whose goal emphases play a more important role in ESL/EFL learners’ motivation, self-regulated learning and achievement?: Teachers’ or parents’. Research Papers in Education, 1-22. doi:10.1080/02671522.2022.2030395.
Guo, W. J., Lau, K. L., Wei, J., & Bai, B. (2021). Academic subject and gender differences in high school students’ self-regulated learning of language and mathematics. Current Psychology, 1-16. doi: 10.1007/s12144-021-02120-9.
Guo, W. J., Bai, B, & Song, H. (2021). Influences of process-based instruction on students’ use of self-regulated learning strategies in EFL writing. System, 1-11. doi: 10.1016/j.system.2021.102578.
...
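To check the selector without hitting the network, here is a minimal sketch against an inline snippet. The markup is hypothetical and only imitates the structure described in the question (the real page may differ), and :-soup-contains assumes a reasonably recent soupsieve version.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the question
SAMPLE_HTML = """
<div class="ar-faculty-section-content">
  <b>Refereed Journal articles</b>
  <ol>
    <li>Journal article A</li>
    <li>Journal article B</li>
  </ol>
  <b>Selected Research and Development Projects</b>
  <ol>
    <li>Project X</li>
  </ol>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The selector from the question: every <li> of every <ol> in the section
all_items = [li.get_text(strip=True)
             for li in soup.select(".ar-faculty-section-content ol li")]

# The adjacent sibling form: only the <ol> directly after the matching <b>
journal_items = [li.get_text(strip=True)
                 for li in soup.select('b:-soup-contains("Refereed Journal articles") + ol li')]

print(all_items)
print(journal_items)
```

The + combinator only looks at element siblings, so the whitespace between </b> and <ol> does not get in the way.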
EDIT
Because not every page has a <b> with the text Refereed Journal articles, the above approach returns an empty list on those pages.
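A small sketch of that behaviour, using the adjacent sibling form of the selector on two made-up snippets (the markup and the helper name journal_items are hypothetical): select() just returns an empty list when no matching <b> exists, so the result can be checked before it is used.

```python
from bs4 import BeautifulSoup

SELECTOR = 'b:-soup-contains("Refereed Journal articles") + ol li'

def journal_items(html):
    """Return the journal entries as strings; empty list if the <b> heading is missing."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select(SELECTOR)]

# Hypothetical pages: one with the <b> heading, one without it
with_heading = "<div><b>Refereed Journal articles</b><ol><li>Article A</li></ol></div>"
without_heading = "<div><h5>Selected Publications</h5><ol><li>Article A</li></ol></div>"

for name, html in [("with heading", with_heading), ("without heading", without_heading)]:
    items = journal_items(html)
    if items:
        print(name, "->", items)
    else:
        print(name, "-> no matching <b> found, nothing selected")
```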
If the goal is to get all publications instead, you could select the <h5> with the text Selected Publications and use the general sibling combinator (~) in combination with :nth-of-type():
for e in soup.select('h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li'):
    print(e.text)
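The same kind of offline check works for this selector. The snippet below is a sketch on hypothetical markup, not the real page, and again assumes a soupsieve version that supports :-soup-contains.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <ol> is a later sibling of the <h5>, not directly adjacent
SAMPLE_HTML = """
<div>
  <h5>Selected Publications</h5>
  <p>Journal articles</p>
  <ol>
    <li>Publication A</li>
    <li>Publication B</li>
  </ol>
  <h5>Projects</h5>
  <ol>
    <li>Project X</li>
  </ol>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "~" matches any following sibling <ol>; :nth-of-type(1) keeps only the first <ol>
selector = 'h5:-soup-contains("Selected Publications") ~ ol:nth-of-type(1) li'
publications = [li.get_text(strip=True) for li in soup.select(selector)]

print(publications)
```

Note that :nth-of-type(1) counts ol elements among all children of the parent, not only those after the <h5>, so it picks the intended list only when no other ol precedes the heading inside the same parent.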