I have a series of URLs I am looping through, and I want to retrieve everything in the "Known As" row at the top of each page (example). However, not every stadium has this row (example).
1 Arrowhead Drive, Kansas City, MO 64129
Years Active: 1972-2021 (406 games)
===> Known As: Arrowhead Stadium (1972-2020), GEHA Field at Arrowhead Stadium (2021-2021) <===
Surfaces: astroturf (1972-1993), grass (1994-2021)
Here is code to return everything in the div it's in, but I'm unsure how to proceed and return the text in the specific p tag I want.
import requests
from bs4 import BeautifulSoup as bs

url_base = "https://www.pro-football-reference.com/stadiums/KAN00.htm"
response = requests.get(url_base)
soup = bs(response.text, "html.parser")
div = soup.find("div", {"id":"meta"})
<div id="meta">
<div>
<h1 itemprop="name">GEHA Field at Arrowhead Stadium History</h1>
<p>1 Arrowhead Drive, Kansas City, MO 64129</p><p><b>Years Active:</b> 1972-2021 (<a href="https://stathead.com/football/tgl_finder.cgi?request=1&match=career&year_min=1950&year_max=2022&game_type=E&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&temperature_gtlt=lt&stadium_id=KAN00&c5val=1.0&order_by=pass_td">406 games</a>)<p><b>Known As:</b> Arrowhead Stadium (1972-2020), GEHA Field at Arrowhead Stadium (2021-2021)<p><b>Surfaces:</b> astroturf (1972-1993), grass (1994-2021)<p><b>Teams:</b><ul><li><a href="/teams/kan/">Kansas City Chiefs</a> (1972-2022)</li><li> Regular Season: <a href="https://stathead.com/football/tgl_finder.cgi?request=1&match=game&team_id=kan&year_min=1950&year_max=2022&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&temperature_gtlt=lt&stadium_id=KAN00&c5val=1.0&order_by=pass_td">233-155-1</a></li><li> Playoffs: <a href="https://stathead.com/football/tgl_finder.cgi?request=1&match=game&team_id=kan&year_min=1950&year_max=2022&game_type=P&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&temperature_gtlt=lt&stadium_id=KAN00&c5val=1.0&order_by=pass_td">9-8</a></li></ul></p>
<button data- data-id="info" data-type="hide_after" id="meta_more_button">More venue info</button>
<script>
// see sr.menus.js:sr_menus_checkInfoCookie to explain
function sr_menus_checkInfoCookie_inline(browserType) {
var el_info = document.getElementById('info');
var el_button = document.getElementById('meta_more_button');
var bling_len = 0;
if (!el_button || !el_info || !el_info.classList) { console.log('no meta_button'); return; }
var el = el_button;
var siblingsHidden = 0;
while (el = el.previousSibling) { if ((el.nodeType === 1) && (el.offsetWidth <= 0 || el.offsetHeight <= 0)) { siblingsHidden++; } }
var button_cookie = false;
if (browserType === 'desktop') { button_cookie = vjs_readCookie('meta_more_button'); }
// We allow up to four of bling lines or additional player bio data entries in mobile.
if (el_info && el_button && (button_cookie || (siblingsHidden + bling_len <= 4))) {el_button.parentNode.removeChild(el_button); el_info.classList.add('open'); }
else { el_button.classList.add('show'); }
}
if (Modernizr.desktop || Modernizr.laptop) { sr_menus_checkInfoCookie_inline('desktop'); } else { sr_menus_checkInfoCookie_inline('mobile'); }
var sr_menus_checkInfoCookie_run_inline = true;
</script>
</p></p></p></div>
</div>
Answer:
You can select the <b> tag containing the text "Known As" and then get its next text sibling:
txt = (
soup.select_one('b:-soup-contains("Known As")')
.find_next_sibling(string=True)  # `string` replaces the deprecated `text` argument
.strip()
)
print(txt)
Prints:
Arrowhead Stadium (1972-2020), GEHA Field at Arrowhead Stadium (2021-2021)
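Since the question notes that not every stadium page has a "Known As" row, the snippet above would raise an AttributeError on those pages, because select_one returns None when nothing matches. Below is a minimal sketch of one way to guard against that. The helper name get_known_as and the restriction of the selector to the #meta div are my own assumptions, not something taken from the site or from the code above.

```python
from bs4 import BeautifulSoup


def get_known_as(html):
    """Return the "Known As" text from a stadium page, or None if the row is absent.

    Hypothetical helper: the name and the "#meta" scoping are assumptions.
    """
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns None when no <b> inside #meta contains "Known As"
    label = soup.select_one('#meta b:-soup-contains("Known As")')
    if label is None:
        return None
    # The value is the text node that directly follows the <b> label
    value = label.find_next_sibling(string=True)
    return value.strip() if value else None
```

Inside the loop over URLs, this could be called as get_known_as(requests.get(url).text), and pages that return None can be skipped or recorded as missing.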