I have a large HTML file that contains about 400 customer reviews. Is there a tool I can use to scrape the file, pull specific data out of it, and put that data in a CSV file? The goal is to transfer these reviews from the company's old website to its new one.
The HTML that contains each review looks like this (the file has 400 of these blocks):
<section >
<div >
<div >
<div >
<div ><span>Joe K.</span></div>
<div >
<div style="width: 100%">
<i ></i><i ></i
><i ></i><i ></i
><i ></i>
</div>
<div >
<i ></i><i ></i
><i ></i><i ></i
><i ></i>
</div>
</div>
<div >
<meta content="1" />
<meta content="5.0" />
<meta content="5" />
</div>
<div >
<meta content="2022-01-05" />Submitted 01/05/22
</div>
</div>
<div >
<div >
<span
>Review goes here Review Goes Here Review Goes Here</span
>
</div>
</div>
</div>
</div>
</section>
The data I need is the reviewer's name, rating, date, and review text.
I'd prefer a tool in JavaScript, Node.js, PHP, or Python.
CodePudding user response:
Using Python with lxml and XPath (although this can also be done in JavaScript or PHP), you can try the following:
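import lxml.html as lh

reviews = """your html above"""
doc = lh.fromstring(reviews)

# Each review lives in its own <section>.
sections = doc.xpath('//section')
for section in sections:
    # Each [@] predicate needs the attribute test (for example the class)
    # of the matching div in your file; it is not shown here.
    reviewer = section.xpath('.//div[@]/span/text()')[0]
    date = section.xpath('.//div[@]/meta/@content')[0]
    review = section.xpath('.//div[@]/span/text()')[0]
    # The rating is the second <meta> in its group ("5.0" in the sample).
    rating = section.xpath('.//div[@]//meta/@content')[1]
    print(f"{date}, {reviewer}, {review}, {rating}")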
For the sample block, the output should be:
2022-01-05, Joe K., Review goes here Review Goes Here Review Goes Here, 5.0
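Since the question asks for a CSV file rather than printed lines, here is a minimal sketch of that last step. It is not the lxml approach above: it uses only the standard library (`html.parser` and `csv`) and relies purely on element order inside each `<section>`, assuming every block has the same layout as the sample (first `<span>` is the name, last `<span>` is the review, second `<meta>` is the rating, fourth `<meta>` is the date). `ReviewParser`, `extract_reviews`, `write_csv`, and the shortened `SAMPLE` block are hypothetical names made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collect span texts and meta values per <section>, by position only."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._spans = None      # span texts of the section being read
        self._metas = None      # meta "content" values of that section
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self._spans, self._metas = [], []
        elif self._spans is None:
            return              # ignore anything outside a <section>
        elif tag == "span":
            self._in_span = True
            self._spans.append("")
        elif tag == "meta":
            self._metas.append(dict(attrs).get("content") or "")

    def handle_data(self, data):
        if self._in_span:
            self._spans[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "section" and self._spans is not None:
            # Assumed order: spans = [name, review]; metas = [?, rating, ?, date].
            if len(self._spans) >= 2 and len(self._metas) >= 4:
                self.rows.append((self._spans[0].strip(), self._metas[1],
                                  self._metas[3], self._spans[-1].strip()))
            self._spans = self._metas = None
            self._in_span = False


def extract_reviews(html_text):
    """Return a list of (reviewer, rating, date, review) tuples."""
    parser = ReviewParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def write_csv(rows, fh):
    """Write a header line plus one line per review to an open text file."""
    writer = csv.writer(fh)
    writer.writerow(["reviewer", "rating", "date", "review"])
    writer.writerows(rows)


# Shortened copy of one review block, with the same element order as the sample.
SAMPLE = """
<section >
<div ><span>Joe K.</span></div>
<div >
<meta content="1" />
<meta content="5.0" />
<meta content="5" />
</div>
<div >
<meta content="2022-01-05" />Submitted 01/05/22
</div>
<div ><span
>Review goes here Review Goes Here Review Goes Here</span
></div>
</section>
"""

rows = extract_reviews(SAMPLE)
buf = io.StringIO()
write_csv(rows, buf)
print(buf.getvalue())
```

For the real file, you would read the HTML with something like `open("reviews.html", encoding="utf-8").read()` and pass `open("reviews.csv", "w", newline="", encoding="utf-8")` to `write_csv` (both file names are placeholders). Because the sketch depends on element order, it is worth spot-checking a few rows against the original page; if the real markup has usable class names, matching on those would be more robust.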