How to extract specific table data (div/tr/td) from multiple URLs on a website in an iterative way


I am learning Python and practicing by extracting data from a public site, but I have run into a problem. I'd be grateful for your help, and thanks in advance! I will check this thread daily for your comments :)

Purpose:
Extract the rows, columns, and cell contents of all 65 pages into one CSV with a single script.

The 65 page URLs follow this pattern:
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1
..........
http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=65

Question 1:
When I run the one-page script below to extract a single page's data into a CSV, I have to run it twice with different filenames before any data appears. For example, if I run it with test.csv, the file stays at 0 KB; after I change the filename to test2.csv and run the script again, the data finally appears in test.csv, while test2.csv stays empty at 0 KB. Any idea why?

Here is the one-page extraction code:

import requests
import csv
from bs4 import BeautifulSoup as bs
url = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(url.content, 'html.parser')
filename = "test.csv"
csv_writer = csv.writer(open(filename, 'w', newline=''))
divs = soup.find_all("div", class_="iiright")
for div in divs:
    for tr in div.find_all("tr")[1:]:
        data = []
        for td in tr.find_all("td"):
            data.append(td.text.strip())
        if data:
            print("Inserting data: {}".format(','.join(data)))
            csv_writer.writerow(data)
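
For what it's worth, the 0 KB symptom is consistent with the file handle never being closed: csv.writer(open(filename, 'w', newline='')) leaves the file open, so rows sit in the write buffer until the interpreter happens to release the handle. A minimal sketch of the same script with the file managed by a with block, which flushes and closes it on exit:

import requests
import csv
from bs4 import BeautifulSoup as bs

response = requests.get("http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo=1")
soup = bs(response.content, 'html.parser')

# The with block guarantees the file is flushed and closed when the script ends
with open("test.csv", 'w', newline='') as f:
    csv_writer = csv.writer(f)
    for div in soup.find_all("div", class_="iiright"):
        for tr in div.find_all("tr")[1:]:  # skip the header row
            data = [td.text.strip() for td in tr.find_all("td")]
            if data:
                print("Inserting data: {}".format(','.join(data)))
                csv_writer.writerow(data)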

Question 2: I have a problem iterating over the 65 page URLs to extract the data into one CSV. It doesn't work... any idea how to fix it?

Here is the 65-page extraction code:

import requests
import csv
from bs4 import BeautifulSoup as bs
url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"
def get_data(url):
    for url in [url.format(pageNo) for pageNo in range(1, 65)]:
        soup = bs(url.content, 'html.parser')
        for div in soup.find_all("div", class_="iiright"):
            for tr in div.find_all("tr"):
                data = []
                for td in tr.find_all("td"):
                    data.append(td.text.strip())
                    if data:
                        print("Inserting data: {}".format(','.join(data)))
                        writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv", "w", newline="") as infile:
        writer = csv.writer(infile)
        get_data(url)
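
For reference, a few things jump out in this version: the loop variable is a plain string, so url.content raises an AttributeError (requests.get() is never called); range(1, 65) stops at page 64; the row is written once per td instead of once per tr; and writer is only defined at the top level. A sketch with those points addressed, keeping the rest of the approach:

import requests
import csv
from bs4 import BeautifulSoup as bs

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={}"

def get_data(url, writer):
    # range(1, 66) covers pages 1 through 65 inclusive
    for page_url in [url.format(pageNo) for pageNo in range(1, 66)]:
        response = requests.get(page_url)  # fetch the page before parsing
        soup = bs(response.content, 'html.parser')
        for div in soup.find_all("div", class_="iiright"):
            for tr in div.find_all("tr")[1:]:  # skip the header row, as in the one-page script
                data = [td.text.strip() for td in tr.find_all("td")]
                if data:  # write once per row, not once per cell
                    writer.writerow(data)

if __name__ == '__main__':
    with open("test.csv", "w", newline="") as outfile:
        get_data(url, csv.writer(outfile))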

CodePudding user response:

Just an alternative approach.

Try to keep it simple and maybe use pandas, because it will do all of these things for you under the hood:

  1. define a list (data) to hold your results
  2. iterate over the URLs with pd.read_html
  3. concat the data frames in data and write them with to_csv or to_excel

read_html:
  1. find the table that matches a string -> match='预售信息查询:' and select it with [0], because read_html() always gives you a list of tables
  2. take a specific row as the header -> header=2
  3. get rid of the last row (the navigation) and the last column (caused by the wrong colspan) with .iloc[:-1, :-1]

Example

import pandas as pd

data = []

for pageNo in range(1, 5):
    data.append(pd.read_html(f'http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey=&pageNo={pageNo}',
                             header=2, match='预售信息查询:')[0].iloc[:-1, :-1])

pd.concat(data).to_csv('test.csv', index=False)
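
Note that range(1, 5) fetches only pages 1 through 4 as a demo; to cover all 65 pages, use range(1, 66).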

Example (based on your code, using a function)

import pandas as pd

url = "http://fcjyw.dlhitech.gov.cn/ysxkzList.xhtml?method=doQuery&selYsxk=xmmc&txtkey="

def get_data(url):

    data = []

    for pageNo in range(1, 2):
        data.append(pd.read_html(f'{url}&pageNo={pageNo}', header=2, match='预售信息查询:')[0].iloc[:-1, :-1])

    pd.concat(data).to_csv('test.csv', index=False)
                
if __name__ == '__main__':
    get_data(url)
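
If Excel output is preferred, the last line can be swapped for pd.concat(data).to_excel('test.xlsx', index=False); writing .xlsx files requires an engine such as openpyxl to be installed. As above, range(1, 2) is just a one-page test; widen it to range(1, 66) for all pages.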