Converting an 'improper' HTML table into something that's machine-readable


I need to copy thousands of 'mini tables' into a CSV; essentially, every 'mini table' should become a single row in the CSV. The issue is that the HTML from the website looks like this:

<li> <div> <strong> <a href="www.link.com/">Junior Sales Rep</a> </strong> </div>
<div> <div> <div> Date of notification: 2022-09-23 <br> End date of waiting period: 2022-09-28 <br> Company Name <br> Toronto (Ontario) </div>
<div> PB-78 <br> Selection process: <span>22-563-ZB-B7S</span> </div>
</div> </div> <div> <br><strong> Name of person being considered: </strong> Samuel Adams </div> <hr> </li>

Just from your expertise: is this something that requires extensive custom code to scrape and convert to CSV, or is there a pre-made way of doing this? I was considering using Beautiful Soup, but before I proceed I'd like some guidance from someone smarter than me on which direction to take, or whether this is a lost cause.

CodePudding user response:

How about:

  1. View the webpage in a browser
  2. Copy the text into a code editor like Sublime or VS Code
  3. Use multi-line select (or "tall cursor") to put a cursor at the end (or beginning) of every line
  4. Put a comma at the end of each line, then delete the newline
  5. If there's already an extra newline between the tables, then you have your record separator (you may need to delete a few trailing commas). Otherwise, find some recurring part of the data that you can find/replace on to add the newlines.
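
If you'd rather script those editor steps than repeat them by hand, the same idea is only a few lines of Python. This is just a sketch under the assumption that a blank line separates the mini tables in the copied text; the file names (pasted.txt, out.csv) are placeholders.

# Sketch of the editor workflow above: blank lines separate records,
# and each record's lines become the fields of one CSV row.
# Assumes the pasted text really does have a blank line between tables.
import csv

with open("pasted.txt", encoding="utf-8") as f:   # placeholder name for the copied text
    blocks = f.read().split("\n\n")               # blank line = record separator

rows = []
for block in blocks:
    fields = [line.strip() for line in block.splitlines() if line.strip()]
    if fields:
        rows.append(fields)

with open("out.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

Either way, spot-check a few rows afterwards; csv.writer handles quoting for you if any field happens to contain a comma.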

CodePudding user response:

I ended up successfully using BS4.
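
For anyone who finds this later, here is a rough sketch of the kind of bs4 code that handles markup shaped like the sample in the question. The file names are placeholders and the exact selection logic will depend on the real page; it simply treats every <li> as one record and every text fragment inside it as one field.

# Rough bs4 sketch: one CSV row per <li>, one field per text fragment.
# "page.html" and "out.csv" are placeholder file names.
import csv
from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

rows = []
for li in soup.find_all("li"):
    # stripped_strings yields each text node with whitespace trimmed,
    # so the <br>-separated values come out as separate fields.
    fields = list(li.stripped_strings)
    if fields:
        rows.append(fields)

with open("out.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

Fields like "Date of notification: 2022-09-23" come out as a single string; if you want the label split from the value, post-process each field with str.partition(":").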
