Please take a look at my re.findall() call: why doesn't it work? Help!

# coding=utf-8
import requests
import re
# 1. Download a web page
url = 'https://www.fpzw.com/xiaoshuo/88/88413/'
# 2. Simulate a browser sending an HTTP request
response = requests.get(url)  # type: object
# 3. Set the encoding
response.encoding = 'GBK'
# 4. Get the page source
html = response.text
# 5. Extract the novel's name
title = re.findall(r'var articlename=\'(.*?)\';', html)
print(title)
# 6. Create a new file to save the novel
fb = open('%s.txt' % title, 'w', encoding='GBK')
# 7. Get the information for each chapter
dl = re.findall(r'</strong>

', html, re.S)[0]
chapter_info_list = re.findall(r'<dd>(.*?)', dl, re.S)
print(chapter_info_list)
# 8. Loop over the chapters and download each one
for chapter_info in chapter_info_list:
    chapter_title = chapter_info[1]
    chapter_url = chapter_info[0]
    chapter_url = "https://www.fpzw.com%s" % chapter_url
    # 8.2 Download the content
    chapter_response = requests.get(chapter_url)
    chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    # 8.3 Extract the chapter text
    chapter_content = re.findall(r'<script language="javascript">tongzhi\(\);</script>(.*?)
', chapter_html, re.S)[0]
    # 8.4 Clean up the data
    chapter_content = chapter_content.replace(' ', '')
    chapter_content = chapter_content.replace('&nbsp;', '')
    chapter_content = chapter_content.replace('<br/>', '\n')
    # 8.5 Save
    fb.write(chapter_title)
    fb.write(chapter_content)
    fb.write('\n')

    print(chapter_url)
CodePudding user response:
It keeps returning an empty result. I only started learning recently and I am just following a tutorial, so why is it wrong? (crying)
CodePudding user response:
Don't let this thread sink. Bump, bump!
CodePudding user response:
Try changing the brackets to English (ASCII) ones.
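To illustrate why the bracket style matters, here is a minimal sketch. The sample string is made up for this example and is not taken from the actual page. Full-width (Chinese) parentheses are different characters from the ASCII ones, so a pattern typed with them is still valid but silently matches nothing.

```python
import re

# Hypothetical one-line sample; the real page source may differ.
html = "var articlename='Some Novel';"

# Pattern typed with ASCII parentheses: (.*?) is a capture group.
ascii_pattern = r"var articlename='(.*?)';"

# Same pattern typed with full-width parentheses (U+FF08, U+FF09):
# the regex engine treats them as literal characters, not as a group.
fullwidth_pattern = "var articlename='\uff08.*?\uff09';"

print(re.findall(ascii_pattern, html))      # ['Some Novel']
print(re.findall(fullwidth_pattern, html))  # []
```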
CodePudding user response:
Quoting the reply from 3/F (a smile program monkey):
Try changing the brackets to English (ASCII) ones.

Also, chapter_url = "https://www.fpzw.com/%s" % chapter_url

Besides, the site has an anti-crawler mechanism, so crawling it directly will not get the content.
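A small sketch of the URL fix mentioned in this reply, assuming the chapter links in the page are relative paths (the sample paths below are invented for illustration). urllib.parse.urljoin from the standard library copes with both a missing and a present leading slash, which plain string formatting does not.

```python
from urllib.parse import urljoin

BASE = "https://www.fpzw.com/"

# Hypothetical href values; the real ones come from the chapter list.
relative_with_slash = "/xiaoshuo/88/88413/1.html"
relative_without_slash = "xiaoshuo/88/88413/1.html"

# Plain formatting without a slash glues the host and the path together.
broken = "https://www.fpzw.com%s" % relative_without_slash

print(broken)
print(urljoin(BASE, relative_with_slash))
print(urljoin(BASE, relative_without_slash))
```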
CodePudding user response:
This site doesn't have one, hee hee.
CodePudding user response:
Quoting the reply from 3/F (a smile program monkey):
Try changing the brackets to English (ASCII) ones.

Thank you, that got me one step further and the result came out. I hadn't noticed that.
CodePudding user response:
Quoting the reply from 4/F (Loeb, Keith):

Quote: the reply from 3/F (a smile program monkey):

Try changing the brackets to English (ASCII) ones.

Also, chapter_url = "https://www.fpzw.com/%s" % chapter_url

Besides, the site has an anti-crawler mechanism, so crawling it directly will not get the content.

I don't know how to use headers yet, so I'll first find a site without an anti-crawler mechanism to experiment on.
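For reference, a minimal sketch of what sending headers looks like. It uses urllib.request from the standard library instead of requests so that it runs without extra packages, and it only builds the request without sending it. The User-Agent string is an arbitrary example, and whether it is enough for any particular site is an assumption that has to be tested.

```python
from urllib.request import Request

url = "https://www.fpzw.com/xiaoshuo/88/88413/"

# Example browser-like User-Agent; any real browser string would do.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Attach the headers to the request object (nothing is sent yet).
request = Request(url, headers=headers)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

With requests, the same dictionary can be passed as requests.get(url, headers=headers).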
CodePudding user response:
Quoting the reply from 7/F (the distance of the scenery):

Quote: the reply from 4/F (Loeb, Keith):

Quote: the reply from 3/F (a smile program monkey):

Try changing the brackets to English (ASCII) ones.

Also, chapter_url = "https://www.fpzw.com/%s" % chapter_url

Besides, the site has an anti-crawler mechanism, so crawling it directly will not get the content.

I don't know how to use headers yet, so I'll first find a site without an anti-crawler mechanism to experiment on.

Let me test the forum's features.
CodePudding user response:
I am stuck at the same place: chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl) returns an empty list here, and all my attempts have failed. Can you teach me?
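A minimal sketch of how a two-group pattern like this behaves, using an invented snippet of chapter-list HTML (the real page markup may differ, which is one possible reason for an empty result). With two capture groups, re.findall returns a list of (url, title) tuples; if dl is empty or the markup quotes attributes differently, the list comes back empty.

```python
import re

# Hypothetical markup standing in for the chapter list block.
dl = '''
<dd><a href="/xiaoshuo/88/88413/1.html">Chapter 1</a></dd>
<dd><a href="/xiaoshuo/88/88413/2.html">Chapter 2</a></dd>
'''

# Two capture groups: the link target and the link text.
chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl)

for chapter_url, chapter_title in chapter_info_list:
    print(chapter_url, chapter_title)

# The same pattern finds nothing if the attribute uses single quotes.
single_quoted = re.findall(r'href="(.*?)">(.*?)<', "<a href='/1.html'>Chapter 1</a>")
print(single_quoted)
```

Printing dl itself before running the second findall is a quick way to tell whether the problem is in the first pattern or the second.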
CodePudding user response:
I notice there is a var in there, which means this part is written by JS, so it cannot be crawled like this. I'd suggest starting with sites that have no anti-crawler mechanism and are plain HTML.
CodePudding user response:
Hello, I do not recommend using re to parse web pages, especially for a large-scale crawler: it is slow and the algorithmic complexity is high. You can try getting the requested data from the server instead.
CodePudding user response:
I want to learn this too; it takes a bit more background knowledge.
CodePudding user response:
The second parameter of findall is a string; what you are passing is a text name (pseudocode).
CodePudding user response:
I do not recommend using re to parse web pages, especially for a large-scale crawler: it is slow and the algorithmic complexity is high. You can try getting the requested data from the server, and you can use XML or BeautifulSoup.
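As a rough illustration of parser-based extraction, here is a sketch that uses html.parser from the Python standard library rather than BeautifulSoup, so it runs without installing anything. The ChapterLinkParser class and the sample markup are invented for this example.

```python
from html.parser import HTMLParser


class ChapterLinkParser(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup; the real chapter list may be structured differently.
sample = (
    '<dl><dd><a href="/1.html">Chapter 1</a></dd>'
    "<dd><a href='/2.html'>Chapter 2</a></dd></dl>"
)

parser = ChapterLinkParser()
parser.feed(sample)
print(parser.links)
```

Unlike the regex version, this does not care whether attributes use single or double quotes. With BeautifulSoup installed, the same idea is usually written with soup.find_all('a') and a.get('href').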
CodePudding user response:
Quoting the reply from 15/F (ZhuCheng Xie):
I do not recommend using re to parse web pages, especially for a large-scale crawler: it is slow and the algorithmic complexity is high. You can try getting the requested data from the server, and you can use XML or BeautifulSoup.

Thank you. This was homework for my elective course.