Beautiful Soup: parse table columns and strip newlines

I am using the following code to cycle through each row and column of an HTML table:

data = []
table = page.find('table', attrs={'class':'table table-no-border table-hover table-striped keyword_result_table'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

This table column is giving me a few problems:

    <td class="keyword">
     <span class="is_in_saved_list" id="is_in_saved_list_81864060">
     </span>
     <a href="javascript:void(0);">
      <b>
       what
      </b>
      <b>
       is
      </b>
      <b>
       in
      </b>
      <b>
       house
      </b>
      <b>
       paint
      </b>
     </a>
    </td>

The output is coming out as:

['what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint', '5756', '979', '2', 'Great', '89', '.com\n \n\n .net\n \n\n .org']

On the console, and in the editor here, there appear to be tab characters between the words, but they do not display in the post. I tried calling .rstrip() after .strip(), but nothing changed. Is there a way to grab only the text content of the link?

CodePudding user response：

Did you try removing '\n' from the strings?

s = 'what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint'
s.replace('\n', '')
'what  is  in  house  paint'
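Note that removing only the newlines leaves the double spaces visible in the result above. A small sketch of one way to collapse every run of whitespace in a single step, using str.split() with no arguments and re-joining the pieces with single spaces (the variable name `cleaned` is just for illustration):

```python
s = 'what\n \n\n is\n \n\n in\n \n\n house\n \n\n paint'

# split() with no argument splits on any run of whitespace (spaces, tabs,
# newlines) and discards empty pieces, so joining with ' ' leaves exactly
# one space between words.
cleaned = ' '.join(s.split())
print(cleaned)
```

This should also cover tab characters, since split() treats them as whitespace too.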

CodePudding user response：

You can grab the link text directly with .find('a').text. A replace followed by a split and a join will then do the job.

from bs4 import BeautifulSoup

html_doc = """
    <td class="keyword">
     <span class="is_in_saved_list" id="is_in_saved_list_81864060">
     </span>
     <a href="javascript:void(0);">
      <b>
       what
      </b>
      <b>
       is
      </b>
      <b>
       in
      </b>
      <b>
       house
      </b>
      <b>
       paint
      </b>
     </a>
    </td>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
link_text = soup.find('a').text

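# Drop the newlines, then split on the remaining whitespace and
# re-join the words with single spaces
cleaned_link_text = " ".join(link_text.replace('\n', '').split())
print(cleaned_link_text)
# Output: what is in house paint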
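
The same clean-up can probably be applied to every cell in the original loop rather than to the link alone. A minimal sketch, assuming a cut-down table (the `html_doc` markup below is made up for illustration and is not the asker's real page) and using Beautiful Soup's get_text() with a separator and strip=True:

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down table for illustration; the real page has more columns.
html_doc = """
<table>
 <tbody>
  <tr>
   <td class="keyword">
    <span class="is_in_saved_list"></span>
    <a href="javascript:void(0);">
     <b>
      what
     </b>
     <b>
      is
     </b>
    </a>
   </td>
   <td>
    5756
   </td>
   <td>
   </td>
  </tr>
 </tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

data = []
for row in soup.find("tbody").find_all("tr"):
    # get_text(" ", strip=True) strips each text fragment inside the cell
    # and joins the non-empty fragments with a single space.
    cols = [td.get_text(" ", strip=True) for td in row.find_all("td")]
    data.append([text for text in cols if text])  # drop empty cells

print(data)
```

With this approach the keyword cell comes back as a single string with one space between words, and the numeric cells are unaffected.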