Web Scraping Using Python for an NLP project

I have to scrape text data from this website. I have read some blogs on web scraping, but the major challenge I have found is parsing the HTML code. I am entirely new to this field. Can I get some help with how to scrape the text data (whatever is possible) and save it as a CSV? Is this possible at all without any knowledge of HTML? Could I get a good demonstration of Python code that solves my problem, so that I can then try it on my own for other websites?

TIA

CodePudding user response：

The tools you can use in Python to scrape and parse HTML data are the requests module and the Beautiful Soup library.

Parsing HTML files into, for example, CSV files is entirely possible; it just requires some effort to learn the tools. In my view there is no better way to learn this than by trying it out yourself.

As for "do you need to know HTML to parse HTML files?": well, yes you do, but the good thing is that HTML is actually quite simple. I suggest you take a look at some tutorials like this one, then inspect the webpage you're interested in and see if you can relate the two.
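To make the link between HTML and parsing code a bit more concrete, here is a minimal sketch. The HTML snippet, the tag names and the class name `quote` are made up for illustration; a real page will have its own structure, which you can look up with your browser's "Inspect" tool.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (an assumption for illustration only).
html = """
<html>
  <body>
    <h1>Sample page</h1>
    <p class="quote">First paragraph of text.</p>
    <p class="quote">Second paragraph of text.</p>
    <p>Footer note.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The <h1>...</h1> element in the HTML is what soup.find("h1") returns.
title = soup.find("h1").get_text(strip=True)

# Every <p class="quote"> is matched by tag name plus class;
# the last <p> has no class, so it is left out.
quotes = [p.get_text(strip=True) for p in soup.find_all("p", class_="quote")]

print(title)
print(quotes)
```

Once you can see which tags and classes wrap the text you care about, the same `find` / `find_all` calls should carry over to the real page, with the names swapped for the ones you find there.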

I appreciate my answer is not really what you were looking for; however, as I said, I think there is no better way to learn than to try things out yourself. If you then get stuck on anything in particular, you can ask on SO for specific help :)

CodePudding user response：

I didn't check the HTML of the website, but you can use BeautifulSoup for parsing the HTML and pandas for converting the data into a CSV.

Sample code:

import requests
from bs4 import BeautifulSoup

# Replace with the page you want to scrape; the URL must include the scheme.
res = requests.get('https://yourwebsite.com')

soup = BeautifulSoup(res.content, 'html.parser')

# Suppose I want all 'li' tags and the links inside them.
lis = soup.find_all("li")
links = []

for li in lis:
    a_tag = li.find("a")
    # Skip 'li' tags that contain no link.
    if a_tag is not None:
        links.append(a_tag.get("href"))

You can find lots of tutorials on pandas online.
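As a rough sketch of the pandas step, assuming the scraped values have already been collected into Python lists: the `texts` and `links` values and the file name `scraped.csv` below are placeholders, not data from any real site.

```python
import pandas as pd

# Placeholder data standing in for whatever the scraping loop collected.
texts = ["Home", "About us"]
links = ["/home", "/about"]

# One column per list; each row is one scraped item.
df = pd.DataFrame({"text": texts, "link": links})

# index=False leaves the row-number column out of the file.
df.to_csv("scraped.csv", index=False)
```

The lists need to have the same length, so it is probably easiest to append to both inside the same loop iteration.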