Extracting specific values from strings?

I'm trying to extract information from the following block of HTML code:

<div ><span title="This is a marketplace ad topic." ></span></div>
<a data-nologvisit  href="/pinball/forum/forum/games-for-sale" rel="7"  title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span >&#36; 25,000 </span><span >Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div><div rel="319235" data-vu="" >

The fields I want to extract are the name (which in this example is "Pirates of the Caribbean (LE)"), the price ($25,000), location (Whiteland, IN), and last post (Last post 3 days ago). So far, I've used this line of code

soup.findAll(True, {'class': ['t', 'by']})

to get the following output:

FS: Pirates of the Caribbean (LE)$ 25,000 Whiteland, IN
By ARW55 (1 year ago) - Last post 3 days ago

However, I am lost on how to extract the information I want from these strings. There are hundreds of other similar entries, e.g.

FS: Teenage Mutant Ninja Turtles (Pro)$ 8,000 (OBO) Downers Grove, IL
By Thorn-in-pinball (3 days ago) - Last post 3 days ago

and I am not sure where to get started. I would appreciate any advice or guidance.

Thank you!

CodePudding user response：

The following code will get you the data you're looking for:

from bs4 import BeautifulSoup

html = '''
<div ><span title="This is a marketplace ad topic." ></span></div>
<a data-nologvisit  href="/pinball/forum/forum/games-for-sale" rel="7"  title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span >&#36; 25,000 </span><span >Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div><div rel="319235" data-vu="" >
'''

soup = BeautifulSoup(html, 'html.parser')
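# The listing link holds the title text followed by the price and location spans
link = soup.select_one('a.t')
name = link.contents[0].strip()
price = link.contents[1].text.strip()
location = link.contents[2].text.strip()
# The by-line span holds the author text followed by the nested "last post" span
last_post = soup.select_one('span.last').text.strip()
author = soup.select_one('span.by').contents[0].strip()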
print(name)
print(price)
print(location)
print(last_post)
print(author)

Result:

FS: Pirates of the Caribbean (LE)
$ 25,000
Whiteland, IN
- Last post 3 days ago
By ARW55 (1 year ago)

Documentation for bs4: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
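
The question asks for the name without the "FS: " prefix and for the price as an amount. A minimal sketch of that clean-up step follows, assuming prices are whole dollars with comma separators as in the two examples from the question; the helper names clean_name and parse_price are hypothetical.

```python
import re

def clean_name(raw):
    """Drop a leading 'FS:' tag from a listing title, if present."""
    return re.sub(r'^FS:\s*', '', raw.strip())

def parse_price(raw):
    """Return the dollar amount as an int, or None if no amount is found."""
    match = re.search(r'\$\s*(\d[\d,]*)', raw)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))

print(clean_name('FS: Pirates of the Caribbean (LE)'))
print(parse_price('$ 25,000'))
print(parse_price('$ 8,000 (OBO)'))
```

Because the regular expression only looks at the digits after the dollar sign, trailing notes such as "(OBO)" are ignored rather than breaking the conversion.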

CodePudding user response：

Beautiful Soup makes it easy to pull text out of nested elements: find the parent element, index into its contents to reach each child, and read that child's text attribute.

# Parent elements
movie_post_element = soup.find("a", class_="t")

# Child element
movie_element = movie_post_element.contents[0]

# Child text
movie = movie_element.text

A full example would then be:

import bs4

html = """<div ><span title="This is a marketplace ad topic." ></span></div>
<a data-nologvisit  href="/pinball/forum/forum/games-for-sale" rel="7"  title="Pinball machines for sale">MFS</a>
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span >&#36; 25,000 </span><span >Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
</div><div rel="319235" data-vu="" >"""

# Name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(html, "html.parser")

# Parent elements 
movie_element = soup.find("a", class_="t")
author_element = soup.find("span", class_="by")

movie = movie_element.contents[0].text
price = movie_element.contents[1].text
location = movie_element.contents[2].text

author = author_element.contents[0].text
post_date = author_element.contents[1].text
by_text = author_element.text
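
Indexing into .contents only works when there are no whitespace-only text nodes between the child tags, which stops being true if the HTML is pretty-printed. A sketch of a more tolerant variant uses .stripped_strings, which skips whitespace-only fragments. The class names "t", "by" and "last" are assumed from the selectors used in this thread, and the snippet is a trimmed copy of the listing.

```python
from bs4 import BeautifulSoup

# Pretty-printed variant of the listing; class names are assumptions
# taken from the selectors used elsewhere in this thread.
html = """
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">
  FS: Pirates of the Caribbean (LE)
  <span>&#36; 25,000 </span>
  <span>Whiteland, IN</span>
</a>
<span class="by">By ARW55 (1 year ago)
  <span class="last"> - Last post 3 days ago</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text fragment with surrounding whitespace
# removed and skips fragments that are only whitespace.
name, price, location = soup.find("a", class_="t").stripped_strings
author, last_post = soup.find("span", class_="by").stripped_strings

print(name, price, location, author, last_post, sep="\n")
```

The tuple unpacking raises a ValueError if a listing has more or fewer text fragments than expected, which is a useful signal when the markup differs from this example.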

CodePudding user response：

Extract each type of information separately. For example, if the <a class="t"> elements hold the names, find those and save the results in a variable, then move on to the 'by' elements, and so on.

For more structured data, however, it may be better to search per container div:

document.querySelectorAll('.containerClass').forEach((div) => {
  const name = div.querySelector('a.t');
  // etc
});

I don't know what this .findAll(True, {'class': ['t', 'by']}) is or how it works, but you get the idea.
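
In Beautiful Soup, the same per-listing idea can be sketched without knowing the container class: loop over each a.t link and take the span.by that follows it. This assumes every listing link is followed by its own by-line. The two-listing HTML below is assembled from the examples in the question; the second topic URL is hypothetical, and placing "(OBO)" inside the price span is a guess.

```python
from bs4 import BeautifulSoup

# Two listings built from the question's examples. The second href and the
# position of "(OBO)" inside the price span are assumptions.
html = """
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">FS: Pirates of the Caribbean (LE)<span>&#36; 25,000 </span><span>Whiteland, IN</span></a>
<span class="by">By ARW55 (1 year ago)<span class="last"> - Last post 3 days ago</span></span>
<a class="t" href="/hypothetical-topic-url">FS: Teenage Mutant Ninja Turtles (Pro)<span>&#36; 8,000 (OBO) </span><span>Downers Grove, IL</span></a>
<span class="by">By Thorn-in-pinball (3 days ago)<span class="last"> - Last post 3 days ago</span></span>
"""

soup = BeautifulSoup(html, "html.parser")

listings = []
for link in soup.select("a.t"):
    # Title, price and location are the three text fragments inside the link
    name, price, location = link.stripped_strings
    # The by-line is the first span.by that appears after this link
    byline = link.find_next("span", class_="by")
    last = byline.select_one("span.last")
    listings.append({
        "name": name,
        "price": price,
        "location": location,
        # Drop the leading "- " separator
        "last_post": last.get_text(strip=True).lstrip("- "),
    })

for row in listings:
    print(row)
```

Collecting one dictionary per listing keeps the four fields of each entry together, which is harder to guarantee when each field is gathered in a separate pass over the page.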

CodePudding user response：

FS: Pirates of the Caribbean (LE)$ 25,000 Whiteland, IN
By ARW55 (1 year ago) - Last post 3 days ago

If that is your expected output, you can try the following example:

from bs4 import BeautifulSoup

html='''
<html>
 <body>
  <div >
   <span  title="This is a marketplace ad topic.">
   </span>
  </div>
  <a  data-nologvisit="" href="/pinball/forum/forum/games-for-sale" rel="7" title="Pinball machines for sale">
   MFS
  </a>
  <a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">
   FS: Pirates of the Caribbean (LE)
   <span >
    $ 25,000
   </span>
   <span >
    Whiteland, IN
   </span>
  </a>
  <span class="by">
   By ARW55 (1 year ago)
   <span class="last">
    - Last post 3 days ago
   </span>
  </span>
  <div  data-vu="" rel="319235">   
  </div>
 </body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')

# One line per a.t link, followed by one line per span.by
info = '\n'.join(
    [x.text.strip().replace(' ', '').replace('\n', ' ') for x in soup.select('a.t')]
    + [x.get_text(strip=True) for x in soup.select('span.by')]
)
print(info)

Output:

FS:PiratesoftheCaribbean(LE)  $25,000   Whiteland,IN
By ARW55 (1 year ago)- Last post 3 days ago
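
The .replace(' ', '') call above also removes the spaces inside the title and location, which is why the output reads "PiratesoftheCaribbean". A sketch that keeps those spaces passes a separator to get_text instead. The class names "t", "by" and "last" are assumed from the selectors used in this thread, and the HTML is a trimmed copy of the block above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the pretty-printed listing; class names are assumptions
# taken from the selectors used in this thread.
html = """
<a class="t" href="/pinball/forum/topic/for-sale-pirates-of-the-caribbean-le-58">
 FS: Pirates of the Caribbean (LE)
 <span>
  $ 25,000
 </span>
 <span>
  Whiteland, IN
 </span>
</a>
<span class="by">
 By ARW55 (1 year ago)
 <span class="last">
  - Last post 3 days ago
 </span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(" ", strip=True) strips each text fragment, drops the empty ones,
# and joins the rest with a single space.
lines = [tag.get_text(" ", strip=True) for tag in soup.select("a.t, span.by")]
info = "\n".join(lines)
print(info)
```

The comma in the selector matches either kind of element, and the results come back in document order, so each listing line is followed by its by-line.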