How to get text between two SETS of tags in Python

I am trying to get the text inside each tag and also the text that sits between one set of tags and the next. I have tried a few things but haven't got what I want. Can anyone help? I would really appreciate it.

text = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />  
'''

The expected output:

Doc Type: AABB
Doc No:   BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045

The code I have tried only gives me the text inside the tags, not the text outside them:

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")
print(soup.find_all('b'))

I also tried the following, but it gave me all the text on the page. I only want the text inside the tags and the text that follows each tag:

soup = BeautifulSoup(text, "html.parser")
lines = soup.text
print(lines)

The current output is:

Doc Type: 
Doc No:   
System No: 
VCode: 
G Code:

CodePudding user response：

You could use .next_sibling on each of those <b> elements. It returns the node that immediately follows the tag, which here is the text you are missing.

Code:

html = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
bs = soup.find_all('b')


for each in bs:
    eachFollowingText = each.next_sibling.strip()
    print(f'{each.text} {eachFollowingText}')

Output:

Doc Type:  AABB
Doc No:  BBBBF
System No:  aaa bbb
VCode:  040000033
G Code:  000045
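
A slightly more defensive variant of the same idea, in case a <b> tag is followed directly by another tag (or by nothing at all), where calling .strip() on next_sibling would fail. This is only a sketch: the helper name label_value_pairs and the extra "Empty:" row are made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample; the "Empty:" row is a hypothetical case with no text after </b>.
html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>Empty: </b><br />
'''


def label_value_pairs(markup):
    """Return (label, value) tuples for each <b> tag and the text right after it."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for b in soup.find_all("b"):
        sibling = b.next_sibling
        # Use the sibling only when it is a text node; it can also be a tag or None.
        value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        pairs.append((b.get_text(strip=True), value))
    return pairs


pairs = label_value_pairs(html)
for label, value in pairs:
    print(f"{label} {value}")
```

Stripping the label as well avoids the double space you see in the output above.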

CodePudding user response：

Try this:

from bs4 import BeautifulSoup

text = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />  
'''

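# Collect every non-empty text node, stripped of surrounding whitespace.
# string=True is the current spelling of the older text=True argument.
result = [
    i.get_text(strip=True) for i in
    BeautifulSoup(text, "html.parser").find_all(string=True)
    if i.get_text(strip=True)
]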
print("\n".join([" ".join(result[i:i + 2]) for i in range(0, len(result), 2)]))

Output:

Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
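
The same pairing can probably be written a bit shorter with the soup's stripped_strings generator, which yields each non-empty text node already stripped. Like the code above, this sketch assumes the strings strictly alternate label, value, label, value; if a value is missing the pairs will shift.

```python
from bs4 import BeautifulSoup

text = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''

soup = BeautifulSoup(text, "html.parser")

# stripped_strings skips whitespace-only nodes and strips the rest.
strings = list(soup.stripped_strings)

# Join every two consecutive strings: the label, then the text after it.
lines = [" ".join(strings[i:i + 2]) for i in range(0, len(strings), 2)]
print("\n".join(lines))
```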

CodePudding user response：

You can get the full text by finding the parent tag (not given in the question) and then reading its string content with .text, followed by some formatting such as removing empty lines.

The lxml parser adds an html tag if one is missing, which is why my example uses soup.html. If you know the real parent tag, replacing soup.html with soup.find(my parent tag) should work.

html = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

parent_tag = soup.html
s = '\n'.join(line for line in parent_tag.text.split('\n') if line != '')
print(s)
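
If the parent tag is known, the same approach might look like the sketch below. The div with id="doc-info" and the unrelated paragraph are assumptions made up for the example, since the question does not show the surrounding page.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the fields live inside a div with id="doc-info".
html = '''
<div id="doc-info">
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
</div>
<p>Unrelated text elsewhere on the page</p>
'''

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the parent so the rest of the page is ignored.
parent = soup.find("div", id="doc-info")

# Keep only non-blank lines, trimmed of surrounding whitespace.
lines = [line.strip() for line in parent.get_text().splitlines() if line.strip()]
print("\n".join(lines))
```

Because the text is taken only from the parent, the unrelated paragraph does not end up in the output.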