Extracting data from HTML and generating a CSV

I have a large HTML file that contains about 400 customer reviews. Is there a tool I can use to scrape the file, pull specific data out of it, and write that data to a CSV file? The goal is to transfer these reviews from the company's old website to its new one.

The HTML that contains each review looks like this (the file has 400 of these blocks):

  <section >
    <div >
      <div >
        <div >
          <div ><span>Joe K.</span></div>
          <div >
            <div  style="width: 100%">
              <i ></i><i ></i
              ><i ></i><i ></i
              ><i ></i>
            </div>
            <div >
              <i ></i><i ></i
              ><i ></i><i ></i
              ><i ></i>
            </div>
          </div>
          <div >
            <meta content="1" />
            <meta content="5.0" />
            <meta content="5" />
          </div>
          <div >
            <meta content="2022-01-05" />Submitted 01/05/22
          </div>
        </div>
        <div >
          <div >
            <span
              >Review goes here Review Goes Here Review Goes Here</span
            >
          </div>
        </div>
      </div>
    </div>
  </section>

The data I need to get is the reviewer's name, rating, date, and review.

I'd prefer a tool in JavaScript (Node.js), PHP, or Python.

CodePudding user response:

Using Python with lxml and XPath (this can also be done in JS or PHP), you can try the following:

import lxml.html as lh
reviews = """your html above"""

doc = lh.fromstring(reviews)
sections = doc.xpath('//section')
for section in sections:
    reviewer = section.xpath('.//div[@]/span/text()')[0]
    date = section.xpath('.//div[@]/meta/@content')[0]
    review = section.xpath('.//div[@]/span/text()')[0]
    rating = section.xpath('.//div[@]//meta/@content')[1]

    print(f"{date}, {reviewer}, {review}, {rating}")

The output should be:

2022-01-05, Joe K., Review goes here Review Goes Here Review Goes Here, 5.0
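
The snippet above prints each review, while the question asks for a CSV file. Below is a minimal sketch of that last step, not a definitive implementation. It swaps lxml for the standard library's html.parser and csv modules so it runs without extra installs. Because the markup shown in the question carries no class attributes, it selects by element order instead: the first span in a section is taken as the name, the last span as the review text, the second meta as the rating, and the last meta as the date. The names ReviewParser and reviews_to_csv, the column headers, and the shortened sample block are all assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collects one dict per <section>, relying on element order, not class names."""

    def __init__(self):
        super().__init__()
        self.reviews = []      # finished records, one per section
        self._spans = None     # span texts of the section being read
        self._metas = None     # meta "content" values of that section
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self._spans, self._metas = [], []
        elif self._spans is None:
            return  # ignore anything outside a <section>
        elif tag == "span":
            self._in_span = True
            self._spans.append("")
        elif tag == "meta":
            self._metas.append(dict(attrs).get("content", ""))

    def handle_data(self, data):
        if self._in_span:
            self._spans[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "section" and self._spans is not None:
            # Assumed layout (from the sample block): first span = name,
            # last span = review, second meta = rating, last meta = date.
            if self._spans and len(self._metas) >= 2:
                self.reviews.append({
                    "name": self._spans[0].strip(),
                    "rating": self._metas[1],
                    "date": self._metas[-1],
                    "review": " ".join(self._spans[-1].split()),
                })
            self._spans = self._metas = None


def reviews_to_csv(html_text, out):
    """Parse html_text and write one CSV row per review to the open file object out."""
    parser = ReviewParser()
    parser.feed(html_text)
    parser.close()
    writer = csv.DictWriter(out, fieldnames=["name", "rating", "date", "review"])
    writer.writeheader()
    writer.writerows(parser.reviews)
    return len(parser.reviews)


# Shortened stand-in for one review block from the question.
sample = """
<section><div>
  <div><span>Joe K.</span></div>
  <div><meta content="1" /><meta content="5.0" /><meta content="5" /></div>
  <div><meta content="2022-01-05" />Submitted 01/05/22</div>
  <div><span>Review goes here</span></div>
</div></section>
"""

buffer = io.StringIO()
count = reviews_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with newline="" in place of the StringIO buffer, for example open("reviews.csv", "w", newline="", encoding="utf-8"), where reviews.csv is a placeholder name. If the real page has other span or meta elements inside a section, the order-based picks would need adjusting.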