Please help me figure out why my re.findall() call isn't working... Please help!

Time:09-24

# coding=utf-8
import requests
import re

# 1. The page to download
url = 'https://www.fpzw.com/xiaoshuo/88/88413/'
# 2. Simulate a browser sending an HTTP request
response = requests.get(url)  # type: object
# 3. Set the encoding
response.encoding = 'GBK'
# 4. Get the page source
html = response.text
# 5. The novel's name (findall returns a list of matches)
title = re.findall(r'var articlename=\'(.*?)\';', html)
print(title)
# 6. Create a new file to save the novel (use the first match as the file name)
fb = open('%s.txt' % title[0], 'w', encoding='GBK')
# 7. Information for each chapter
# NOTE: the rest of the next two patterns is missing from the post; "…" marks the gap
dl = re.findall(r'</strong>…', html, re.S)[0]
chapter_info_list = re.findall(r'<dd>(.*?)…', dl, re.S)
print(chapter_info_list)
# 8. Loop over the chapters and download each one
for chapter_info in chapter_info_list:
    chapter_title = chapter_info[1]
    chapter_url = chapter_info[0]
    chapter_url = "https://www.fpzw.com%s" % chapter_url
    # 8.2 Download the content
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    # 8.3 Extract the chapter text
    # NOTE: the end of this pattern is missing from the post; "…" marks the gap
    chapter_content = re.findall(r'<script language="javascript">tongzhi\(\);</script>(.*?)…', chapter_html, re.S)[0]
    # 8.4 Clean up the data
    chapter_content = chapter_content.replace(' ', '')
    chapter_content = chapter_content.replace('&nbsp;', ' ')
    chapter_content = chapter_content.replace('<br/>', '\n')
    # 8.5 Save
    fb.write(chapter_title)
    fb.write(chapter_content)
    fb.write('\n')

    print(chapter_url)

fb.close()

CodePudding user response:

It keeps returning an empty result. I only started learning recently and I'm just following tutorials. Why is it wrong? (crying)

CodePudding user response:

Bumping this so the thread doesn't sink!

CodePudding user response:

Try changing the brackets to English (half-width) ones.
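
A quick way to check for this is to scan the script for full-width characters. The sketch below is a hypothetical helper (`find_fullwidth` is not from the thread) that reports characters whose NFKC-normalized form is a plain ASCII character, which is how full-width brackets such as `（` show up. It does not catch curly quotes, which NFKC leaves unchanged.

```python
import unicodedata


def find_fullwidth(source):
    """Return (line_no, column, char, ascii_equivalent) for each full-width character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) < 128:
                continue
            folded = unicodedata.normalize("NFKC", ch)
            # Only report characters that fold to a single ASCII character
            if len(folded) == 1 and ord(folded) < 128:
                hits.append((line_no, col, ch, folded))
    return hits


# Invented sample line containing full-width brackets
sample = "title = re.findall（r'x', html）\nprint(title)"
for hit in find_fullwidth(sample):
    print(hit)
```

Running it over the text of a script (for example, `find_fullwidth(open(path, encoding="utf-8").read())`) lists every position that needs to be retyped with an English keyboard layout.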

CodePudding user response:

Quoting the 3/F reply (a smile program monkey):
Try changing the brackets to English (half-width) ones.

Also: chapter_url = "https://www.fpzw.com/%s" % chapter_url

In addition, that site has an anti-crawler mechanism, so a plain request cannot fetch the content directly.
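
Both points can be sketched roughly as follows. `urljoin` from the standard library builds the chapter URL without doubling or dropping slashes, and `requests.get` accepts a `headers` argument for sending a browser-like `User-Agent`. The header value, the `chapter_url` and `fetch_html` helper names, and the sample chapter paths are illustrative assumptions; whether this site accepts such a request was not verified.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.fpzw.com/xiaoshuo/88/88413/"

# Assumed browser-like header; the exact value is illustrative only.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def chapter_url(href):
    """Resolve a chapter link (relative or root-relative) against the index page."""
    return urljoin(BASE_URL, href)


def fetch_html(url, encoding="GBK"):
    """Download a page with the assumed headers and decode it."""
    import requests  # third-party; imported here so chapter_url works without it

    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    response.encoding = encoding
    return response.text


# Hypothetical chapter paths, used only to show how the URLs are joined
print(chapter_url("/xiaoshuo/88/88413/1.html"))
print(chapter_url("1.html"))
```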

CodePudding user response:

This site doesn't have one, hehe.

CodePudding user response:

Quoting the 3/F reply (a smile program monkey):
Try changing the brackets to English (half-width) ones.
Thank you, that got me one step further and I get output now. I hadn't noticed that.

CodePudding user response:

Quoting the 4/F reply (Loeb, Keith):
Quote: quoting the 3/F reply (a smile program monkey):

Try changing the brackets to English (half-width) ones.

Also: chapter_url = "https://www.fpzw.com/%s" % chapter_url

In addition, that site has an anti-crawler mechanism, so a plain request cannot fetch the content directly.
I don't know how to use headers yet, so I'll first experiment on a site without an anti-crawler mechanism.

CodePudding user response:

Quoting the 7/F reply (the distance of the scenery):
Quote: quoting the 4/F reply (Loeb, Keith):
Quote: quoting the 3/F reply (a smile program monkey):

Try changing the brackets to English (half-width) ones.

Also: chapter_url = "https://www.fpzw.com/%s" % chapter_url

In addition, that site has an anti-crawler mechanism, so a plain request cannot fetch the content directly.
I don't know how to use headers yet, so I'll first experiment on a site without an anti-crawler mechanism.

Just testing the forum's reply function.

CodePudding user response:

Same for me: chapter_info_list = re.findall(r'href="https://bbs.csdn.net/topics/(.*?)">(.*?)<', dl) comes back empty when I run it. I've tried all sorts of things and nothing works. Can you teach me?
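
For reference, here is a small offline check of how `re.findall` behaves with two capture groups. The HTML snippet is invented sample data, not the real page. With two groups, each match is a `(href, title)` tuple; when the pattern does not occur in the string, the result is simply an empty list, which is the symptom described above.

```python
import re

# Invented sample standing in for the chapter list; the real markup may differ.
dl = """
<dd><a href="/xiaoshuo/88/88413/1.html">Chapter 1</a></dd>
<dd><a href="/xiaoshuo/88/88413/2.html">Chapter 2</a></dd>
"""

pattern = r'href="(.*?)">(.*?)<'
chapter_info_list = re.findall(pattern, dl)

for chapter_href, chapter_title in chapter_info_list:
    print(chapter_href, chapter_title)

# A pattern with a hard-coded prefix that is absent from the text matches nothing.
no_match = re.findall(r'href="https://example.invalid/(.*?)">(.*?)<', dl)
print(no_match)
```

Printing `dl` (or `len(dl)`) before applying the pattern shows whether the earlier step actually captured the chapter list.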

CodePudding user response:

I see `var` in there, so this part is written by JavaScript, and it can't be scraped this way. I'd suggest starting with sites that have no anti-crawler mechanism and serve plain HTML.
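
One way to tell the two cases apart is to look at the raw source that was downloaded. If a value is assigned inside an inline `<script>` block, it is still plain text in the response and a regular expression can find it; if it is filled in by JavaScript after the page loads, it will not be in the response at all. The sketch below uses invented HTML samples and a hypothetical `extract_articlename` helper.

```python
import re


def extract_articlename(html):
    """Return the value of `var articlename='...'` if it appears in the source, else None."""
    match = re.search(r"var\s+articlename\s*=\s*'(.*?)'\s*;", html)
    return match.group(1) if match else None


# Invented samples: one with the inline assignment, one without.
static_page = "<script>var articlename='Sample Novel';</script>"
dynamic_page = "<div id='title'></div><script src='load.js'></script>"

print(extract_articlename(static_page))
print(extract_articlename(dynamic_page))
```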

CodePudding user response:

Hello, I don't recommend using re to parse web pages, especially for a large-scale crawler: it is slow and the algorithmic complexity is high. You could try getting the data from the server's request responses instead.

CodePudding user response:

I want to learn this too; it takes a bit more background knowledge.

CodePudding user response:

The second parameter of findall is the string to search; what you passed is a text name (pseudocode).

CodePudding user response:

I don't recommend using re to parse web pages, especially for a large-scale crawler: it is slow and the algorithmic complexity is high. You could try getting the data from the server's request responses, or use an XML parser or BeautifulSoup.
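
As a rough illustration of parser-based extraction, the sketch below uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup so that it runs without extra installs; it is a stand-in for the suggestion above, not the same library. The `ChapterLinkParser` class and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class ChapterLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags nested inside <dd> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_dd = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self._in_dd = True
        elif tag == "a" and self._in_dd:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside a link that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "dd":
            self._in_dd = False


# Invented sample markup; the real chapter list may be structured differently.
sample = '<dl><dd><a href="/1.html">Chapter 1</a></dd><dd><a href="/2.html">Chapter 2</a></dd></dl>'
parser = ChapterLinkParser()
parser.feed(sample)
print(parser.links)
```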

CodePudding user response:

Quoting the 15/F reply (ZhuCheng Xie):
I don't recommend using re to parse web pages, especially for a large-scale crawler: it is slow and the algorithmic complexity is high. You could try getting the data from the server's request responses, or use an XML parser or BeautifulSoup.
Thank you. This is homework for an elective course I'm taking.
