How to scrape "markup"-text from html code-CodePudding

I would like to scrape text from a website using Beautifulsoup.

From the code which can be seen in the screenshot above, I would like to retrieve the text stored under markup="content".

I know how to retrieve all markup-parts using soup.find("div", class_="break-word news_main"). But I don't know how to only access the markup="content"-part.

Help would be very much appreciated :)

(Also, is markup a special term in html like "class" or "id"? What does "markup" mean?)

CodePudding user response：

soup.find_all(markup="content"), as you can see you can filter by any attribute. See docs for more details. markup is some kind of custom attribute; it's not a valid attribute for div elements but you can technically have any attribute on an element and it will still work, both in HTML and in BeautifulSoup.

CodePudding user response：

You can use soup.find_all(markup='content') or a different syntax is soup.find_all('div', {'markup': 'content'} which will give you all divs that contain the attribute mark up with a value of content.

Markup can mean quite a few things. It's basically a system for stylizing text. You can see more here. In this context it's probably something used to style html on a webpage.