python/beautifulsoup/get_text().split() - how can I get connected text including blank value

Time:12-15

In short, I need "KBSTAR Fn K-뉴딜디지털플러스" as a single value, but I get three separate values: 'KBSTAR', 'Fn', 'K-뉴딜디지털플러스'.

I need:

['1501', 'KBSTAR Fn K-뉴딜디지털플러스', '11,830', '90', '-0.76%', '0', '95', '800', '0.00', '180', 'N/A', 'N/A']

but the result is like:

['1501', 'KBSTAR', 'Fn', 'K-뉴딜디지털플러스', '11,830', '90', '-0.76%', '0', '95', '800', '0.00', '180', 'N/A', 'N/A']

This happens because one of the values contains spaces. I don't know how to keep such a value together without using the split() method. Is there a way to build the list so that values containing spaces stay intact? Most of the values contain no spaces at all. Here is my code:

stock_list = soup.find("table", attrs={"class": "type_2"}).find("tbody").find_all("tr")
for stock in stock_list: 
    stock.get_text().split()

Below is my raw HTML

<tr onmouseover="mouseOver(this)" onmouseout="mouseOut(this)" style="background-color: rgb(255, 255, 255);">
                    <td >1501</td>
                    <td><a href="/item/main.naver?code=368200" >KBSTAR Fn K-뉴딜디지털플러스</a></td>
                    <td >11,830</td>
                    <td >
                <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" width="7" height="6" style="margin-right:4px;" alt="하락"><span >
                90
                </span>
            </td>
                    <td >
                <span >
                -0.76%
                </span>
            </td>
                    <td >0</td>
                                    <td >180</td>
                                    <td >800</td>
                                    <td >95</td>
                    <td >N/A</td>
                    <td >N/A</td>
                    <td >N/A</td>
                    <td ><a href="/item/board.naver?code=368200"><img src="https://ssl.pstatic.net/imgstock/images5/ico_debatebl2.gif" width="15" height="13" alt="토론실"></a></td>
                </tr>

CodePudding user response:

Why are you using split? Select the td child elements directly and read the text of each cell, using strip to trim the surrounding whitespace. The HTML looks a little off as well.

from bs4 import BeautifulSoup as bs

tr_html = '''<tr onmouseover="mouseOver(this)" onmouseout="mouseOut(this)" style="background-color: rgb(255, 255, 255);">
                    <td >1501</td>
                    <td><a href="/item/main.naver?code=368200" >KBSTAR Fn K-뉴딜디지털플러스</a></td>
                    <td >11,830</td>
                    <td >
                <img src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" width="7" height="6" style="margin-right:4px;" alt="하락"><span >
                90
                </span>
            </td>
                    <td >
                <span >
                -0.76%
                </span>
            </td>
                    <td >0</td>
                                    <td >180</td>
                                    <td >800</td>
                                    <td >95</td>
                    <td >N/A</td>
                    <td >N/A</td>
                    <td >N/A</td>
                    <td ><a href="/item/board.naver?code=368200"><img src="https://ssl.pstatic.net/imgstock/images5/ico_debatebl2.gif" width="15" height="13" alt="토론실"></a></td>
                </tr>'''

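soup = bs(tr_html, 'lxml')  # the 'lxml' parser requires the lxml package
# One entry per <td>, trimmed; cells with no text (e.g. image-only) are skipped
cells = [i.text.strip() for i in soup.select('td') if i.text.strip()]
print(cells)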
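The same per-cell idea can also be sketched with get_text(strip=True), which trims the text inside each cell. This is only an illustration: the row below is a shortened, hypothetical version of the HTML from the question, and the built-in html.parser is used instead of lxml so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical row modelled on the HTML in the question.
row_html = """
<tr>
  <td>1501</td>
  <td><a href="/item/main.naver?code=368200">KBSTAR Fn K-뉴딜디지털플러스</a></td>
  <td>11,830</td>
  <td>
    <span>
    90
    </span>
  </td>
  <td><a href="/item/board.naver?code=368200"><img alt="토론실"></a></td>
</tr>
"""

soup = BeautifulSoup(row_html, "html.parser")

# One string per cell: strip=True trims each text fragment, and the
# " " separator joins fragments if a cell holds more than one.
cells = [td.get_text(" ", strip=True) for td in soup.find_all("td")]

# Drop cells that contain no text (here, the image-only last cell).
cells = [c for c in cells if c]
print(cells)
```

Because the text is read one td at a time, a space inside a cell never separates list items, so the name stays a single value.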

CodePudding user response:

str.split() without a sep argument splits the string on every run of consecutive whitespace, so a value that contains spaces is broken into several pieces. Use str.strip() instead to remove only the leading and trailing whitespace.
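A minimal sketch of the difference, using a hypothetical cell text padded the way the HTML in the question pads its cells:

```python
# Hypothetical raw text of one cell, including the surrounding padding.
cell_text = "\n    KBSTAR Fn K-뉴딜디지털플러스\n    "

parts = cell_text.split()    # breaks on every run of whitespace
trimmed = cell_text.strip()  # only trims the two ends

print(parts)    # ['KBSTAR', 'Fn', 'K-뉴딜디지털플러스']
print(trimmed)  # KBSTAR Fn K-뉴딜디지털플러스
```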
