Extracting data from HTML and generating a CSV

Time:01-20

I have a large HTML file that contains about 400 customer reviews. Is there a tool I can use to scrape the file, extract specific data from it, and write that data to a CSV file? The goal is to transfer these reviews from an old website to a new website for the same company.

The HTML that contains each review looks like this (the file has 400 of these blocks):

  <section >
    <div >
      <div >
        <div >
          <div ><span>Joe K.</span></div>
          <div >
            <div  style="width: 100%">
              <i ></i><i ></i
              ><i ></i><i ></i
              ><i ></i>
            </div>
            <div >
              <i ></i><i ></i
              ><i ></i><i ></i
              ><i ></i>
            </div>
          </div>
          <div >
            <meta content="1" />
            <meta content="5.0" />
            <meta content="5" />
          </div>
          <div >
            <meta content="2022-01-05" />Submitted 01/05/22
          </div>
        </div>
        <div >
          <div >
            <span
              >Review goes here Review Goes Here Review Goes Here</span
            >
          </div>
        </div>
      </div>
    </div>
  </section>

The data I need to get is the reviewer's name, rating, date, and review.

I would prefer a tool written in JavaScript (Node.js), PHP, or Python.

CodePudding user response:

Using Python (it can also be done in JavaScript or PHP) with lxml and XPath, you can try the following:

import lxml.html as lh
reviews = """your html above"""

doc = lh.fromstring(reviews)
sections = doc.xpath('//section')
for section in sections:
    reviewer = section.xpath('.//div[@]/span/text()')[0]
    date = section.xpath('.//div[@]/meta/@content')[0]
    review = section.xpath('.//div[@]/span/text()')[0]
    rating = section.xpath('.//div[@]//meta/@content')[1]

    print(f"{date}, {reviewer}, {review}, {rating}")

The output should be:

2022-01-05, Joe K., Review goes here Review Goes Here Review Goes Here, 5.0
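
Since the goal is a CSV file, and joining fields with ", " breaks as soon as a review contains a comma, it may help to write the rows with Python's csv module, which quotes such fields. The XPath predicates above appear without their attribute tests (`[@]`) and the sample HTML shows no class names, so the sketch below is a variation rather than the same method: it uses only the standard library and relies on element order inside each <section>. It assumes the first <span> is the reviewer's name, the last <span> is the review, the second <meta> is the rating, and the fourth <meta> is the date. The names ReviewParser and reviews_to_csv and the sample review text are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collect <span> texts and <meta content="..."> values per <section>."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one (name, rating, date, review) tuple per section
        self._spans = None     # span texts of the section currently being parsed
        self._metas = None     # meta content values of the current section
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self._spans, self._metas = [], []
        elif self._spans is None:
            return  # ignore everything outside a <section>
        elif tag == "span":
            self._in_span = True
            self._spans.append("")
        elif tag == "meta":
            self._metas.append(dict(attrs).get("content", ""))

    def handle_data(self, data):
        if self._in_span and self._spans:
            self._spans[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "section" and self._spans is not None:
            # Assumed layout: spans = [name, ..., review],
            # metas = [worst, rating, best, date]. Other sections are skipped.
            if len(self._spans) >= 2 and len(self._metas) >= 4:
                self.rows.append((
                    self._spans[0].strip(),   # reviewer name
                    self._metas[1],           # rating, e.g. "5.0"
                    self._metas[3],           # ISO date, e.g. "2022-01-05"
                    self._spans[-1].strip(),  # review text
                ))
            self._spans = self._metas = None


def reviews_to_csv(html_text, out):
    """Parse html_text and write a header row plus one CSV row per review to out."""
    parser = ReviewParser()
    parser.feed(html_text)
    parser.close()
    writer = csv.writer(out)
    writer.writerow(["name", "rating", "date", "review"])
    writer.writerows(parser.rows)
    return parser.rows


# Made-up sample with the same nesting as the block in the question.
sample = """
<section>
  <div><div>
    <div>
      <div><span>Joe K.</span></div>
      <div><meta content="1" /><meta content="5.0" /><meta content="5" /></div>
      <div><meta content="2022-01-05" />Submitted 01/05/22</div>
    </div>
    <div><div><span>Great service, would recommend</span></div></div>
  </div></div>
</section>
"""

buffer = io.StringIO()
rows = reviews_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a file opened with open("reviews.csv", "w", newline="", encoding="utf-8") as the out argument (the filename is only an example). If your HTML still has its class attributes, selecting on those is more robust than relying on element order.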