Using Beautiful Soup to log in web page and download multiple zip files


I am a first-time user of web scraping and Beautiful Soup.

I have two questions: first, how to pass the login information on to the files I want to download, and second, how to download multiple zip files. I am pasting my code below without the curl/login information.

Firstly, I have a web page which requires a login to download files. I am able to log in (with requests and Beautiful Soup), but after that I cannot go further, as I am not able to pass the login information in Python on to the particular file I want to download. So basically, how can I tell Python to use the login credentials when requesting the file at baseurl + href_link?

And secondly, the linked file is a zip file but the href does not end in .zip. For example, my baseurl = 'https://consumerpyramidsdx.cmie.com' and the href_link is /kommon/bin/sr.php?kall=wsubsdl&fn=consumption_pyramids_20140131_MS_rev&fmt=csv&rrurl=consumptionpyramidsdx. How can I download all the zip files and unzip them? Most forum answers on this use '.zip' explicitly because their hrefs end in .zip, but in my case they don't.

The sample zip file that gets downloaded after clicking the href_link is named "consumption_pyramids_20140131_MS_rev_csv".

My code is as follows:

import requests
from bs4 import BeautifulSoup

# headers, params, cookies and data hold my login information (omitted here)
response = requests.post('https://consumerpyramidsdx.cmie.com/kommon/bin/sr.php',
                         headers=headers, params=params, cookies=cookies, data=data)
soup = BeautifulSoup(response.content, "lxml")
baseurl = 'https://consumerpyramidsdx.cmie.com'
print(soup)

for x in soup.find_all("a"):
    if x.text == 'CSV':
        file_link = x.get('href')  # contains the href_link of the file I want to download
        print(file_link)
        # After this I want to download all the baseurl + file_link files
        

CodePudding user response:

You can do it in one go (assuming you have checked that your current code works and is correct):

import io
import urllib.parse
import zipfile

for x in soup.find_all("a"):
    if x.text == 'CSV':
        file_link = x.get('href')
        # Reuse the login headers/cookies so the download request is authenticated
        response = requests.get(url=urllib.parse.urljoin(baseurl, file_link),
                                headers=headers, cookies=cookies)
        content = response.content
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            zf.extractall('target/directory')
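
If the downloads only work while you are logged in, the usual trick is to reuse the login cookies on every request; a requests.Session does this for you automatically. A minimal sketch under that assumption (the headers/params/data placeholders stand in for whatever your login form actually needs):

import io
import urllib.parse
import zipfile

import requests
from bs4 import BeautifulSoup

baseurl = 'https://consumerpyramidsdx.cmie.com'
headers = {}  # placeholder: your request headers
params = {}   # placeholder: your query parameters
data = {}     # placeholder: your login form fields

with requests.Session() as session:
    # The session stores the cookies set at login and sends them
    # automatically on every later request.
    response = session.post(baseurl + '/kommon/bin/sr.php',
                            headers=headers, params=params, data=data)
    soup = BeautifulSoup(response.content, 'lxml')

    for x in soup.find_all('a'):
        if x.text == 'CSV':
            file_link = x.get('href')
            # Same session, so this download request is authenticated too
            r = session.get(urllib.parse.urljoin(baseurl, file_link))
            with zipfile.ZipFile(io.BytesIO(r.content)) as zf:
                zf.extractall('target/directory')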

CodePudding user response:

My suggestions:

Method one:

import requests

response = requests.get(url=url, headers=headers)
content = response.content
with open("xxxx.zip", "wb") as f:
    f.write(content)

Here url is the download URL of the file you want, and xxxx.zip is the name you want to save the zip file under.
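
If the archives are large, a streamed variant of method one avoids holding the whole response in memory; a sketch, assuming the same url/headers as above:

import requests

url = ""      # placeholder: the file's download URL
headers = {}  # placeholder: your request headers

response = requests.get(url, headers=headers, stream=True)
with open("xxxx.zip", "wb") as f:
    # Write the body in chunks instead of loading it all at once
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)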

Method two:

import wget

url = ""   # the file's download URL
path = ""  # where to save the file
wget.download(url, path)

Here path is where the zip file will be saved on your Mac/Linux/Windows machine. Note that wget is a third-party package (pip install wget).

Last but not least, for decompressing the files you can use zipfile, for example:

import zipfile

folder_abs = 'target/directory'  # directory to extract into

zip_file = zipfile.ZipFile(path)
zip_list = zip_file.namelist()  # list all members of the archive

# Extract each member to the specified directory.
for f in zip_list:
    zip_file.extract(f, folder_abs)

zip_file.close()
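
One more sketch that may help with naming: since the hrefs in the question carry the file name in the fn query parameter, you could derive the output name from the link itself instead of relying on a .zip extension. Assuming the href format shown in the question:

import urllib.parse

# href format taken from the question
href = ('/kommon/bin/sr.php?kall=wsubsdl'
        '&fn=consumption_pyramids_20140131_MS_rev&fmt=csv'
        '&rrurl=consumptionpyramidsdx')

query = urllib.parse.urlparse(href).query
fn = urllib.parse.parse_qs(query)['fn'][0]
print(fn)  # consumption_pyramids_20140131_MS_rev
# e.g. extract into a folder named after the file: zf.extractall(fn)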

Thanks for reading. If you have any questions about these suggestions, please let me know.
