Read image alt text with pandas.read_html

Is there a way to get an image's alt text using pandas.read_html? The page I am scraping recently replaced some text with images, and the old text is now the alt text of those images. Here is an example:

<td>
<div>...
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>

This is how it looked before, which was perfect for pandas.read_html:

<td>
WF
</td>

CodePudding user response：

As per the thread linked to in the other answer, read_html only extracts the text content of table cells: tags inside a cell such as <a>, <div>, and <img> are discarded, together with attributes such as alt. You'll need to preprocess the HTML you get from scraping before passing it to read_html.

If you're in the market for a quick and dirty solution, you could do this using regex:

import re
from io import StringIO

import pandas as pd

table_html = """
<table>
<tr>
<td>
<div>
<a href="/index.php/WF..." title="WF"><img alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>
</tr>
</table>"""

# Delete all <div>, </div>, <a> and </a> tags
table_html = re.sub(r'</?div[^>]*>', '', table_html)
# \b keeps the pattern from also matching tags such as <abbr> or <article>
table_html = re.sub(r'</?a\b[^>]*>', '', table_html)
# Replace <img> tags with the content of their alt attribute
table_html = re.sub(r'<img[^>]*alt="([^"]*)"[^>]*>', r'\1', table_html)

# Recent pandas versions deprecate passing literal HTML, so wrap it in StringIO
print(pd.read_html(StringIO(table_html)))

Outputs:

[       0
0  WFrag]

However, this is not a very robust solution: the patterns have to be adapted to any irregularity in the markup (for example, a tag written in capital letters, or attributes quoted with single quotes). A better way is to parse the HTML string and perform the same operations with a library specifically designed for parsing HTML, such as beautifulsoup4.
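
A minimal sketch of that approach, assuming beautifulsoup4 is installed alongside pandas. The helper name imgs_to_alt_text is made up for this example, and the sample table deliberately writes the tag as <IMG> to show that a real parser does not care about case:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def imgs_to_alt_text(html: str) -> str:
    """Return html with every <img> replaced by its alt text (hypothetical helper)."""
    # "html.parser" is the parser from the standard library; it lowercases tag names
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        # An image without an alt attribute is replaced by an empty string
        img.replace_with(img.get("alt", ""))
    return str(soup)


table_html = """
<table>
<tr>
<td>
<div>
<a href="/index.php/WF..." title="WF"><IMG alt="WFrag" src="/images/thumb/WF.png"></a>
</div>
</td>
</tr>
</table>"""

cleaned_html = imgs_to_alt_text(table_html)
# read_html keeps only the text of each cell, so the <div> and <a> wrappers can stay
tables = pd.read_html(StringIO(cleaned_html))
print(tables[0])
```

Only the <img> tags need to be rewritten here: once the alt text sits inside the cell as plain text, read_html picks it up regardless of the surrounding <div> and <a> tags.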

CodePudding user response：

It seems this is not possible with pandas alone. You can check out the answer in this thread: "Pandas read_html to return raw HTML contents [for certain rows/cells/etc.]".