I am trying to get the text inside tags as well as the text that sits between sets of tags. I have tried, but I haven't gotten what I want. Can anyone help? I would really appreciate it.
text = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
The expected output:
Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
The code I have tried only gives me the text inside the tags, not the text outside them:
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, "html.parser")
print(soup.find_all('b'))
I also tried the following, but it gave me all the text on the page; I only want the text inside the tags and the text right after them:
soup = BeautifulSoup(text, "html.parser")
lines = ''.join(soup.text)
print(lines)
The current output is:
Doc Type:
Doc No:
System No:
VCode:
G Code:
CodePudding user response:
You could use .next_sibling on each of those elements.
Code:
html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
bs = soup.find_all('b')
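for each in bs:
    # The text node right after each <b> tag holds the value
    eachFollowingText = each.next_sibling.strip()
    # Strip the label too, so only one space separates label and value
    print(f'{each.text.strip()} {eachFollowingText}')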
Output:
Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
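Note that .next_sibling can also be another tag (or None) when no text follows the b element, in which case calling .strip() on it would fail. A minimal defensive sketch, assuming you would rather get an empty value than an exception; the helper name label_value_pairs and the extra "Empty:" row are made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>Empty: </b><br />
'''

def label_value_pairs(markup):
    """Return (label, value) tuples for every <b> tag in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for b in soup.find_all("b"):
        sibling = b.next_sibling
        # Use the sibling only if it is a text node; a tag or None gives "".
        value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        pairs.append((b.get_text(strip=True), value))
    return pairs

for label, value in label_value_pairs(html):
    print(label, value)
```

Returning tuples instead of printing directly also makes it easy to build a dict from the result if that suits your use case better.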
CodePudding user response:
Try this:
from bs4 import BeautifulSoup
text = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
result = [
    i.getText(strip=True) for i in
    BeautifulSoup(text, "html.parser").find_all(text=True)
    if i.getText(strip=True)
]
# Join each label with the value that follows it
print("\n".join([" ".join(result[i:i + 2]) for i in range(0, len(result), 2)]))
Output:
Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
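The same pairing idea can probably be written a bit more briefly with the .stripped_strings generator, which yields every non-empty text node with surrounding whitespace removed. This is only a sketch, and like the code above it assumes the strings strictly alternate between label and value; the sample data is shortened here:

```python
from bs4 import BeautifulSoup

text = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
'''

soup = BeautifulSoup(text, "html.parser")
# All non-empty text nodes, already stripped: label, value, label, value, ...
strings = list(soup.stripped_strings)
# Pair them up two at a time (assumes every label is followed by a value)
lines = [" ".join(strings[i:i + 2]) for i in range(0, len(strings), 2)]
print("\n".join(lines))
```

If a value is missing somewhere, the pairing shifts by one, so the .next_sibling approach is the safer choice for irregular markup.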
CodePudding user response:
You can get the full text by finding the parent tag (not given in the question) and then reading its string content with .text, followed by some formatting such as removing empty lines.
With the lxml parser, BeautifulSoup adds an html tag if it is missing, so my example uses soup.html. Replacing it with soup.find(your parent tag) should work, assuming you know what the parent tag is.
html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
parent_tag = soup.html
s = '\n'.join(line for line in parent_tag.text.split('\n') if line != '')
print(s)
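If the real page wraps these lines in a known container, the same idea might look like the sketch below. The div with id "doc-info" is purely hypothetical, since the question does not show the parent tag, and html.parser is used here only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

html = '''
<div id="doc-info">
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
</div>
<p>Unrelated text elsewhere on the page</p>
'''

soup = BeautifulSoup(html, "html.parser")
# Hypothetical parent: replace with whatever tag wraps the data on your page
parent_tag = soup.find("div", id="doc-info")
# Keep only non-blank lines from the parent's text
lines = [line.strip() for line in parent_tag.get_text().splitlines() if line.strip()]
print("\n".join(lines))
```

Restricting the search to the parent keeps unrelated page text, such as the p element above, out of the result.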