I'm trying to scrape the links to the 400 models listed on this website: https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1, which I refer to as webpage in my code below. However, when I run my code, I get no links.
import requests
from bs4 import BeautifulSoup

webpage = "https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1"
User_agent = {'User-agent': 'Mozilla/5.0 (X11; CrOS i686 4319.74.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36'}
r = requests.get(webpage, headers=User_agent).text
soup = BeautifulSoup(r, 'html5lib')
for link in soup.find_all('a'):
    print(link['href'])
So I checked whether the links are present at all with print(soup.prettify()), and none of the desired links appear in the HTML either. This led me to assume that the website doesn't allow scraping, but the response's status_code is 200, so the request itself succeeds.
Is there a different approach I could take? Where else would these links be stored? Thank you.
CodePudding user response:
The data is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it. To get info about all items, you can use the following example:
import json
import requests
url = "https://www.printables.com/graphql/"
payload = {
    "operationName": "PrintList",
    "query": "query PrintList($limit: Int!, $cursor: String, $categoryId: ID, $materialIds: [Int], $userId: ID, $printerIds: [Int], $licenses: [ID], $ordering: String, $hasModel: Boolean, $filesType: [FilterPrintFilesTypeEnum], $includeUserGcodes: Boolean, $nozzleDiameters: [Float], $weight: IntervalObject, $printDuration: IntervalObject, $publishedDateLimitDays: Int, $featured: Boolean, $featuredNow: Boolean, $usedMaterial: IntervalObject, $hasMake: Boolean, $competitionAwarded: Boolean, $onlyFollowing: Boolean, $collectedByMe: Boolean, $madeByMe: Boolean, $likedByMe: Boolean) {\n morePrints(\n limit: $limit\n cursor: $cursor\n categoryId: $categoryId\n materialIds: $materialIds\n printerIds: $printerIds\n licenses: $licenses\n userId: $userId\n ordering: $ordering\n hasModel: $hasModel\n filesType: $filesType\n nozzleDiameters: $nozzleDiameters\n includeUserGcodes: $includeUserGcodes\n weight: $weight\n printDuration: $printDuration\n publishedDateLimitDays: $publishedDateLimitDays\n featured: $featured\n featuredNow: $featuredNow\n usedMaterial: $usedMaterial\n hasMake: $hasMake\n onlyFollowing: $onlyFollowing\n competitionAwarded: $competitionAwarded\n collectedByMe: $collectedByMe\n madeByMe: $madeByMe\n liked: $likedByMe\n ) {\n cursor\n items {\n ...PrintListFragment\n printer {\n id\n __typename\n }\n user {\n rating\n __typename\n }\n __typename\n }\n __typename\n }\n}\n\nfragment PrintListFragment on PrintType {\n id\n name\n slug\n ratingAvg\n ratingCount\n likesCount\n liked\n datePublished\n dateFeatured\n firstPublish\n downloadCount\n displayCount\n inMyCollections\n foundInUserGcodes\n userGcodeCount\n userGcodesCount\n materials {\n id\n __typename\n }\n category {\n id\n path {\n id\n name\n __typename\n }\n __typename\n }\n modified\n images {\n ...ImageSimpleFragment\n __typename\n }\n filesType\n hasModel\n user {\n ...AvatarUserFragment\n __typename\n }\n ...LatestCompetitionResult\n __typename\n}\n\nfragment AvatarUserFragment on UserType {\n id\n publicUsername\n avatarFilePath\n slug\n badgesProfileLevel {\n profileLevel\n __typename\n }\n __typename\n}\n\nfragment LatestCompetitionResult on PrintType {\n latestCompetitionResult {\n placement\n competitionId\n __typename\n }\n __typename\n}\n\nfragment ImageSimpleFragment on PrintImageType {\n id\n filePath\n rotation\n __typename\n}\n",
    "variables": {
        "categoryId": "14",
        "collectedByMe": False,
        "competitionAwarded": False,
        "cursor": "",
        "featured": False,
        "filesType": ["GCODE"],
        "hasMake": False,
        "includeUserGcodes": True,
        "likedByMe": False,
        "limit": 36,
        "madeByMe": False,
        "materialIds": None,
        "nozzleDiameters": None,
        "ordering": "-first_publish",
        "printDuration": None,
        "printerIds": None,
        "publishedDateLimitDays": None,
        "weight": None,
    },
}
cnt = 0
while True:
    # each POST returns one page of items plus a cursor for the next page
    data = requests.post(url, json=payload).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    for i in data["data"]["morePrints"]["items"]:
        cnt += 1
        print(
            cnt,
            i["name"],
            "https://www.printables.com/model/{}-{}".format(i["id"], i["slug"]),
        )
    # an empty cursor means there are no more pages
    if not data["data"]["morePrints"]["cursor"]:
        break
    payload["variables"]["cursor"] = data["data"]["morePrints"]["cursor"]
Prints:
1 White Spiral Vase https://www.printables.com/model/189114-white-spiral-vase
2 Calibrating Before Battle - 3DPN Mr. Print-It - Superhero Remix https://www.printables.com/model/188733-calibrating-before-battle-3dpn-mr-print-it-superhe
3 twitter 3d bird https://www.printables.com/model/187083-twitter-3d-bird
4 Welcome To Rapture plaque https://www.printables.com/model/186669-welcome-to-rapture-plaque
...
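If you want the links in a list rather than printed, the same cursor loop can be wrapped in a small helper. This is only a sketch: collect_links, model_url and the fetch_page callable are hypothetical names that are not part of the answer above, and the fake pages exist purely so the example runs without touching the network. fetch_page is assumed to take a cursor string and return the items of that page together with the next cursor.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signature: cursor in, (items, next_cursor) out.
FetchPage = Callable[[str], Tuple[List[Dict], Optional[str]]]


def model_url(item: Dict) -> str:
    # Same URL pattern as in the loop above: /model/<id>-<slug>
    return "https://www.printables.com/model/{}-{}".format(item["id"], item["slug"])


def collect_links(fetch_page: FetchPage, max_items: int = 400) -> List[str]:
    """Follow cursors page by page until max_items links are collected
    or the source reports no further page."""
    links: List[str] = []
    cursor: Optional[str] = ""
    while len(links) < max_items:
        items, cursor = fetch_page(cursor or "")
        for item in items:
            links.append(model_url(item))
            if len(links) >= max_items:
                break
        # Stop on an empty cursor; also stop on an empty page so a
        # misbehaving source cannot keep the loop running forever.
        if not cursor or not items:
            break
    return links


# Offline stand-in for the real endpoint (made-up ids and slugs).
fake_pages = {
    "": ([{"id": "1", "slug": "a"}, {"id": "2", "slug": "b"}], "abc"),
    "abc": ([{"id": "3", "slug": "c"}, {"id": "4", "slug": "d"}], None),
}


def fake_fetch(cursor: str) -> Tuple[List[Dict], Optional[str]]:
    return fake_pages[cursor]


links = collect_links(fake_fetch, max_items=3)
for link in links:
    print(link)
```

To use it against the real endpoint, fetch_page would set payload["variables"]["cursor"] to the cursor it receives, post the payload to url as in the loop above, and return the items and cursor found under data["data"]["morePrints"]. Passing the fetch step in as a callable keeps the paging logic separate from the HTTP call, which also makes it easy to add a delay between requests.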