Read image alt text with pandas.read_html

Time:01-03

Is there a way to get the img alt text from an image using pandas.read_html? The page I am scraping just replaced some text with pictures, and the old text is now the alt text for those images. Here is an example:

<td>
<div>...
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>

This is how it looked before, which was perfect for pandas.read_html:

<td>
WF
</td>

CodePudding user response:

As per the thread linked in the other answer, read_html only extracts the text content of table cells. It does not return attribute values, so the alt text of an <img> nested inside <div> and <a> tags is lost. You'll need to preprocess the HTML you get from scraping before passing it to read_html.

If you're in the market for a quick and dirty solution, you could do this using regex:

import re
from io import StringIO

import pandas as pd

table_html = """
<table>
<tr>
<td>
<div>
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>
</tr>
</table>"""

# Delete all <div>, </div>, <a> and </a> tags
# (\b keeps the pattern from also matching tags such as <abbr>)
table_html = re.sub(r'</?div\b[^>]*>', '', table_html)
table_html = re.sub(r'</?a\b[^>]*>', '', table_html)
# Replace <img> tags with the content of their alt attribute
table_html = re.sub(r'<img\b[^>]*alt="([^"]*)"[^>]*>', r'\1', table_html)

# Newer pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
print(pd.read_html(StringIO(table_html)))

Outputs:

[       0
0  WFrag]

However, this is not a very robust solution: it may need to be adapted to irregular markup (for example, a tag written in capital letters). A better way is to parse the HTML string and perform the same operations with a library specifically designed for parsing HTML, such as beautifulsoup4.
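As a rough sketch of that approach, the snippet below swaps each <img> for its alt text with beautifulsoup4 and then hands the result to read_html. The helper name replace_images_with_alt is made up for illustration, the sample <img> tag is assumed to be properly closed, and it assumes beautifulsoup4 plus a parser that read_html can use (lxml, for instance) are installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

table_html = """
<table>
<tr>
<td>
<div>
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>
</tr>
</table>"""


def replace_images_with_alt(html: str) -> str:
    """Return html with every <img> replaced by its alt text (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        # An image without an alt attribute is replaced by an empty string
        img.replace_with(img.get("alt", ""))
    return str(soup)


cleaned_html = replace_images_with_alt(table_html)
# The surrounding <div> and <a> tags can stay: read_html only keeps cell text
tables = pd.read_html(StringIO(cleaned_html))
print(tables[0])
```

Because the parser handles tag case and attribute order itself, this should cope with markup variations that would break the regular expressions, and for the sample table it should give the same single "WFrag" cell.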

CodePudding user response:

It seems this is not possible with pandas alone. See the answer in the thread "Pandas read_html to return raw HTML contents [for certain rows/cells/etc.]".
