How to clean up pulled data from BeautifulSoup, Pandas, Python

Time:03-16

Hello everyone. I have pulled the information I want using BeautifulSoup, but I can't seem to get it printed out correctly to send to pandas and Excel.

html_f ='''
<li class="list-group-item">
<div>
<div>
<p>
07/01/2022 Date
<span> </span>
</p>
</div>
<div class="tyler-toggle-container" style="display: block; overflow: hidden;">
<p>
<span>Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''

My code used to pull the data I want:

soup = BeautifulSoup(html_f,'html.parser')
for child in soup.findAll('li',class_='list-group-item')[0]:
    print (child.text)

Here is the info it pulls, but it prints with a lot of extra whitespace:

        07/01/2022   Date





  Comment
       [1] Comments

Ideally, I only need the top portion (date and File Date) printed out, but at the very least I need help getting it into a list format like:

07/01/2022 Date
Comment
[1] Comments

CodePudding user response:

Here is my attempt so far:

doc='''
<li class="list-group-item">
 <div>
  <div>
   <p>
    07/01/2022 Date
    <span>
    </span>
   </p>
  </div>
  <div class="tyler-toggle-container" style="display: block; overflow: hidden;">
   <p>
    <span>
     Comment
    </span>
    <br/>
    [1] Comments
   </p>
  </div>
 </div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')

# Collapse each <li> to one stripped string, then swap the known filler text for a comma
text=[child.get_text(strip=True).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)

Output:

['07/01/2022, Comments']   

Either of these variants should also work:

# Join the cleaned text of every <li> into a single string
text=' '.join([child.get_text(strip=True).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]).strip()
# Or build the list first and combine entries by index
text= [child.get_text(strip=True).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
# Indexes 1 and 2 assume at least three matching <li> elements; the sample above has only one (text[0])
final_text= text[1] + ',' + text[2]
final_text= (text[1] + ' ' + text[2]).split() # if you want a list
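
The replace(' DateComment[1]', ',') call above only works for the exact text of this sample. A less brittle sketch is to let get_text() put a separator between the text nodes and split on it. This assumes the <li> carries the list-group-item class that the selector expects; the markup below is a trimmed copy of the sample, and the variable names sample and rows are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup; the class name is assumed from the selector used above.
sample = '''
<li class="list-group-item">
 <div>
  <div><p>
    07/01/2022 Date
    <span> </span>
  </p></div>
  <div style="display: block; overflow: hidden;"><p>
    <span>Comment</span><br>
    [1] Comments
  </p></div>
 </div>
</li>'''

soup = BeautifulSoup(sample, 'html.parser')

rows = []
for li in soup.find_all('li', class_='list-group-item'):
    # separator='\n' places a newline between text nodes;
    # strip=True trims each node and drops the empty ones
    rows.append(li.get_text(separator='\n', strip=True).split('\n'))

print(rows)
```

Each <li> becomes one list of clean strings, so nothing depends on the wording of the labels.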



  

CodePudding user response:

To get your information printed as expected in your question, you could use stripped_strings and iterate over its elements:

for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)

Note: In new code use find_all() instead of old syntax findAll().

Example

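html='''
<li class="list-group-item">
 <div>
  <div>
   <p>
    07/01/2022 Date
    <span>
    </span>
   </p>
  </div>
  <div class="tyler-toggle-container" style="display: block; overflow: hidden;">
   <p>
    <span>
     Comment
    </span>
    <br/>
    [1] Comments
   </p>
  </div>
 </div>
</li>
'''
from bs4 import BeautifulSoup
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')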

for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)

Output

07/01/2022 Date
Comment
[1] Comments
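
If only the top portion (the date) is needed, one possible sketch is to take just the first stripped string of each <li> and keep its first whitespace-separated token. The markup is a shortened, hypothetical copy of the sample, and the assumption that the date always comes first in the item is mine, not something the page guarantees.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample; the class name is assumed from the selector above.
sample = '''
<li class="list-group-item">
 <div><p>
   07/01/2022 Date
   <span> </span>
 </p></div>
</li>'''

soup = BeautifulSoup(sample, 'html.parser')

dates = []
for li in soup.find_all('li', class_='list-group-item'):
    # stripped_strings is a generator, so next() fetches only the first entry
    first = next(li.stripped_strings, None)
    if first is not None:
        # "07/01/2022 Date" -> "07/01/2022" (assumes the date is the first token)
        dates.append(first.split()[0])

print(dates)
```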

Since you mention pandas, you could also pick out each piece of information, clean it up, and append it to a list of dicts:

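import pandas as pd

data = []
for e in soup.find_all('li',class_='list-group-item'):
    data.append({
        # Text of the first <p>, without the trailing " Date" label
        'date': e.p.text.strip().replace(' Date',''),
        # Text node that follows the <br> inside the toggle container
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
    })
df = pd.DataFrame(data)
print(df)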

Output

         date       comment
0  07/01/2022  [1] Comments
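
Since the question also mentions Excel, here is a minimal sketch of the last step, assuming records shaped like the list of dicts above. The month/day/year format is an assumption (the sample alone cannot rule out day/month/year), and save_to_excel and the file name comments.xlsx are hypothetical names used for illustration.

```python
import pandas as pd

# Records in the same shape as the list of dicts above (values taken from the sample output)
data = [{'date': '07/01/2022', 'comment': '[1] Comments'}]

df = pd.DataFrame(data)

# Assumption: dates are month/day/year; adjust the format if the site uses day/month/year
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')


def save_to_excel(frame, path='comments.xlsx'):
    """Write the frame to an Excel file (pandas needs an Excel engine such as openpyxl)."""
    frame.to_excel(path, index=False)
```

Calling save_to_excel(df) would write one row per <li>; index=False keeps the pandas row index out of the sheet.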