removing `\n` using bs4 get_text()

Time:05-06

from bs4 import BeautifulSoup


# current output as below
"""
'DOMINGUEZ, JONATHAN D. VS. RAMOS,\n
                                           SILVIA M'
"""

# desired one is

#  DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M

x = """<td width="350px" valign="top"
   style="padding:.5rem;">
   DOMINGUEZ, JONATHAN D. VS. RAMOS,
   SILVIA M
</td>"""

soup = BeautifulSoup(x, 'lxml')
print(soup.select_one('td').get_text(strip=True, separator='\n'))

I checked the docs and I believe get_text() can do this, but I am not sure how.

CodePudding user response:

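Change separator='\n' to separator=' '. Note that separator is only inserted between separate text nodes. Here the newline sits inside a single text node, so the whitespace also has to be collapsed, for example with ' '.join(text.split()).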
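As a rough sketch of the whitespace-collapsing step: the helper name collapse_whitespace below is hypothetical (it is not part of bs4), and the sample string is copied from the "current output" shown in the question rather than produced by a parser here.

```python
def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so re-joining normalizes the text.
    return " ".join(text.split())


# Literal copy of the string the question reports as its current output.
raw = "DOMINGUEZ, JONATHAN D. VS. RAMOS,\n                                           SILVIA M"
print(collapse_whitespace(raw))
```

The same helper could presumably be applied to the parsed cell, e.g. collapse_whitespace(soup.select_one('td').get_text()), since it works on any string regardless of how many text nodes it came from.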

CodePudding user response:

You can iterate over the stripped_strings generator (a property, not a method) and clean up each string:

from bs4 import BeautifulSoup

x = """<td width="350px" valign="top"
   style="padding:.5rem;">
   DOMINGUEZ, JONATHAN D. VS. RAMOS,
   SILVIA M
</td>"""

soup = BeautifulSoup(x, 'lxml')
# Collapse every run of whitespace (including the embedded newline) to one space
strings = soup.select_one('td').stripped_strings
txt = ' '.join(' '.join(s.split()) for s in strings)
print(txt)

Output:

DOMINGUEZ, JONATHAN D. VS. RAMOS, SILVIA M