I'm trying to scrape some information from a Kaggle page. All the elements I'm looking for are inside <ul role="list">, and each item sits in its own <li role="listitem">. I want to scrape all of these items from the page. Here is an example of the HTML for one page element:
<ul role="list" ><li role="listitem" ><div ><div ><div ><a href="/pmarcelino" target="_blank" aria-label="Pedro Marcelino, PhD"><div data-testid="avatar-image" title="Pedro Marcelino, PhD" style="background-image: url("https://storage.googleapis.com/kaggle-avatars/thumbnails/175415-gr.jpg");"></div><svg width="64" height="64" viewBox="0 0 64 64"><circle r="30.5" cx="32" cy="32" fill="none" stroke-width="3" style="stroke: rgb(241, 243, 244);"></circle><path d="M 49.92745019492043 56.6750183284359 A 30.5 30.5 0 0 0 32 1.5" fill="none" stroke-width="3" style="stroke: rgb(32, 190, 255);"></path></svg></a></div></div><a href="/code/pmarcelino/comprehensive-data-exploration-with-python"><div ><div >Comprehensive data exploration with Python</div><span ><span><span>Updated <span title="Sat Apr 30 2022 21:20:37 GMT 0200 (heure d’été d’Europe centrale)" aria-label="7 months ago">7mo ago</span></span></span> </span><span ><span ><a href="/code/pmarcelino/comprehensive-data-exploration-with-python/comments" >1819 comments</a> · <span ><span >House Prices - Advanced Regression Techniques</span></span></span></span></div></a><div ><div ><button mode="default" data-testid="upvotebutton__upvote" aria-label="Upvote" ><i sizevalue="18px">arrow_drop_up</i></button><span mode="default" >12770</span></div><span ><span ><img role="presentation" alt="" src="/static/images/medals/competitions/[email protected]" style="height: 9px; width: 9px;"> Gold</span><div ><button aria-label="more_horiz" >more_horiz</button></div></span></div></div><div ></div></li>
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'}
url = "https://www.kaggle.com/code?sortBy=voteCount&page=1"
req = requests.get(url, headers = headers)
soup = BeautifulSoup(req.text, 'html.parser')
html_content = soup.find_all('li', attrs = {'class': 'sc-jfmDQi hfJycS'})
data = []
for elements in html_content:
    data.append({
        'title': elements.find("div", {"class": "sc-iBkjds sc-fLlhyt sc-fbPSWO uVZhN izULIq A-dENW"}).text,
        'stars': elements.find("span", {"class": "sc-gXmSlM sc-cCsOjp sc-hAGLhy cKhlzA piYDj mWvOY"}).text,
        'resume': elements.find("span", {"class": "sc-cKajLJ jNrpDQ"}).text,
        'comments': elements.find("span", {"class": "sc-dPyBCJ sc-bBXxYQ sc-bOJcbE cSRCiy cFEurs gTFrUa"}).text,
        'link': elements.get('href')})
print(data)
My output:
[]
CodePudding user response:
The page loads its data dynamically through an API, so the HTML returned by requests doesn't contain those elements, which is why you get an empty list. You need to find out which request returns the information and what data it needs. I made a small example that fetches the necessary tokens and then the information via the API. To get further results, change the page variable to get new ids.
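A quick way to confirm the diagnosis, separate from the API example that follows: count the <li role="listitem"> tags in the raw HTML. This is only a minimal sketch using the standard library; the two sample strings (including the site-content id) are hypothetical stand-ins, not real Kaggle markup.

```python
from html.parser import HTMLParser


class ListItemCounter(HTMLParser):
    """Counts <li role="listitem"> start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "li" and dict(attrs).get("role") == "listitem":
            self.count += 1


def count_list_items(html: str) -> int:
    parser = ListItemCounter()
    parser.feed(html)
    return parser.count


# Hypothetical samples: a JavaScript-only shell vs. markup as the browser shows it after rendering
static_shell = '<html><body><div id="site-content"></div><script src="app.js"></script></body></html>'
rendered = '<ul role="list"><li role="listitem">Notebook A</li><li role="listitem">Notebook B</li></ul>'

print(count_list_items(static_shell))  # 0
print(count_list_items(rendered))      # 2
```

Passing req.text from the question to count_list_items should tell you whether the list items are present in what requests actually downloaded.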
import requests
import pandas as pd
import json
def get_token():
    # Load the listing page once and read both tokens from the response cookies
    url = 'https://www.kaggle.com/code?sortBy=voteCount&page=1'
    response = requests.get(url)
    cookies = response.cookies.get_dict()
    return cookies['CSRF-TOKEN'], cookies['XSRF-TOKEN']
def get_kernel_ids(csrf_token: str, xsrf_token: str, page: int):
    url = "https://www.kaggle.com/api/i/kernels.KernelsService/ListKernelIds"
    payload = json.dumps({
        "sortBy": "VOTE_COUNT",
        "pageSize": 100,
        "group": "EVERYONE",
        "page": page,
        "tagIds": "",
        "excludeResultsFilesOutputs": False,
        "wantOutputFiles": False,
        "excludeKernelIds": []
    })
    headers = {
        'accept': 'application/json',
        'content-type': 'application/json',
        'cookie': f'CSRF-TOKEN={csrf_token}',
        'x-xsrf-token': xsrf_token
    }
    response = requests.post(url, headers=headers, data=payload)
    return response.json()['kernelIds']
def get_info(csrf_token: str, xsrf_token: str):
    url = "https://www.kaggle.com/api/i/kernels.KernelsService/GetKernelListDetails"
    payload = json.dumps({
        "deletedAccessBehavior": "RETURN_NOTHING",
        "unauthorizedAccessBehavior": "RETURN_NOTHING",
        "excludeResultsFilesOutputs": False,
        "wantOutputFiles": False,
        # Page 1 is requested here; change the last argument to fetch other pages
        "kernelIds": get_kernel_ids(csrf_token, xsrf_token, 1),
        "outputFileTypes": [],
        "includeInvalidDataSources": False
    })
    headers = {
        'accept': 'application/json',
        'content-type': 'application/json',
        'cookie': f'CSRF-TOKEN={csrf_token}',
        'x-xsrf-token': xsrf_token
    }
    response = requests.post(url, headers=headers, data=payload)
    data = []
    for kernel in response.json()['kernels']:
        data.append({
            'title': kernel['title'],
            'stars': kernel['totalVotes'],
            # Some kernels have no 'dataSources' key at all
            'resume': kernel['dataSources'][0]['name'] if 'dataSources' in kernel else 'No attached data sources',
            'comments': kernel['totalComments'],
            'url': f'https://www.kaggle.com{kernel["scriptUrl"]}'
        })
    return data
csrf, xsrf = get_token()
df = pd.DataFrame(get_info(csrf, xsrf))
print(df.to_string())
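If you want more than one page, the ids from several calls to get_kernel_ids can be merged before asking for the details. The following is only a sketch: the helper name collect_kernel_ids is hypothetical, and stopping at the first empty page is my assumption, since I haven't checked how the endpoint signals the last page. The fetch function is passed in as an argument so the merging logic can be tried without touching the network.

```python
from typing import Callable, Iterable, List


def collect_kernel_ids(fetch_ids: Callable[[int], List[int]],
                       pages: Iterable[int]) -> List[int]:
    """Call fetch_ids(page) for each page and merge the ids, keeping first-seen order."""
    seen = set()
    merged = []
    for page in pages:
        ids = fetch_ids(page)
        if not ids:
            # Assumption: an empty page means there is nothing further to fetch
            break
        for kernel_id in ids:
            if kernel_id not in seen:
                seen.add(kernel_id)
                merged.append(kernel_id)
    return merged


# Offline demonstration with made-up ids in place of real API responses
fake_pages = {1: [101, 102], 2: [102, 103], 3: []}
print(collect_kernel_ids(lambda page: fake_pages.get(page, []), range(1, 5)))
# [101, 102, 103]
```

With the functions above it could be called as collect_kernel_ids(lambda page: get_kernel_ids(csrf, xsrf, page), range(1, 4)), and the resulting list used as the "kernelIds" value in the GetKernelListDetails payload.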
Output of print(df.to_string()):
title stars resume comments url
0 Comprehensive data exploration with Python 12772 House Prices - Advanced Regression Techniques 1819 https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python
1 Titanic Data Science Solutions 9374 Titanic - Machine Learning from Disaster 2316 https://www.kaggle.com/code/startupsci/titanic-data-science-solutions
2 Titanic Tutorial 8161 Titanic - Machine Learning from Disaster 26348 https://www.kaggle.com/code/alexisbcook/titanic-tutorial
3 Stacked Regressions : Top 4% on LeaderBoard 6749 House Prices - Advanced Regression Techniques 1075 https://www.kaggle.com/code/serigne/stacked-regressions-top-4-on-leaderboard
4 Introduction to CNN Keras - 0.997 (top 6%) 6438 Digit Recognizer 1002 https://www.kaggle.com/code/yassineghouzam/introduction-to-cnn-keras-0-997-top-6
5 Data ScienceTutorial for Beginners 6213 Pokemon- Weedle's Cave 1160 https://www.kaggle.com/code/kanncaa1/data-sciencetutorial-for-beginners
6 Hello, Python 5952 No attached data sources 329 https://www.kaggle.com/code/colinmorris/hello-python
7 Introduction to Ensembling/Stacking in Python 5657 Titanic - Machine Learning from Disaster 1031 https://www.kaggle.com/code/arthurtok/introduction-to-ensembling-stacking-in-python
8 How Models Work 5308 Mobile Price Classification 2 https://www.kaggle.com/code/dansbecker/how-models-work
9 A Data Science Framework: To Achieve 99% Accuracy 5266 Titanic - Machine Learning from Disaster 657 https://www.kaggle.com/code/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
10 Credit Fraud || Dealing with Imbalanced Datasets 4274 Credit Card Fraud Detection 629 https://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets
11 Exploring Survival on the Titanic 3735 Titanic - Machine Learning from Disaster 1041 https://www.kaggle.com/code/mrisdal/exploring-survival-on-the-titanic
12 Start Here: A Gentle Introduction 3372 Home Credit Default Risk 543 https://www.kaggle.com/code/willkoehrsen/start-here-a-gentle-introduction
13 Functions and Getting Help 2974 No attached data sources 145 https://www.kaggle.com/code/colinmorris/functions-and-getting-help
14 Your First Machine Learning Model 2913 Mobile Price Classification 386 https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model
15 Exercise: Your First Machine Learning Model 2840 Melbourne Housing Snapshot 381 https://www.kaggle.com/code/yogeshtak/exercise-your-first-machine-learning-model
16 Titanic Top 4% with ensemble modeling 2724 Titanic - Machine Learning from Disaster 408 https://www.kaggle.com/code/yassineghouzam/titanic-top-4-with-ensemble-modeling
17 Model Validation 2655 Mobile Price Classification 5 https://www.kaggle.com/code/dansbecker/model-validation
18 EDA To Prediction(DieTanic) 2544 Titanic - Machine Learning from Disaster 342 https://www.kaggle.com/code/ash316/eda-to-prediction-dietanic
19 Winning solutions of kaggle competitions 2527 [Private Datasource] 207 https://www.kaggle.com/code/sudalairajkumar/winning-solutions-of-kaggle-competitions
20 Booleans and Conditionals 2501 No attached data sources 75 https://www.kaggle.com/code/colinmorris/booleans-and-conditionals
21 Underfitting and Overfitting 2500 Mobile Price Classification 7 https://www.kaggle.com/code/dansbecker/underfitting-and-overfitting
22 Getting Started with Kaggle 2420 No attached data sources 4265 https://www.kaggle.com/code/alexisbcook/getting-started-with-kaggle
23 Full Preprocessing Tutorial 2366 Data Science Bowl 2017 490 https://www.kaggle.com/code/gzuidhof/full-preprocessing-tutorial
24 Machine Learning Tutorial for Beginners 2333 Biomechanical features of orthopedic patients 292 https://www.kaggle.com/code/kanncaa1/machine-learning-tutorial-for-beginners
25 Everything you can do with a time series 2330 DJIA 30 Stock Time Series 171 https://www.kaggle.com/code/thebrownviking20/everything-you-can-do-with-a-time-series
26 Deep Learning Tutorial for Beginners 2296 Sign Language Digits Dataset 248 https://www.kaggle.com/code/kanncaa1/deep-learning-tutorial-for-beginners
27 Getting staRted in R: First Steps 2216 Chocolate Bar Ratings 102 https://www.kaggle.com/code/rtatman/getting-started-in-r-first-steps
28 Approaching (Almost) Any NLP Problem on Kaggle 2169 glove.840B.300d.txt 231 https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
29 Data Science Glossary on Kaggle 2125 [Private Datasource] 229 https://www.kaggle.com/code/shivamb/data-science-glossary-on-kaggle
30 Coronavirus (COVID-19) Visualization & Prediction 2069 No attached data sources 691 https://www.kaggle.com/code/therealcyberlord/coronavirus-covid-19-visualization-prediction
31 Creating, Reading and Writing 2063 18,393 Pitchfork Reviews 95 https://www.kaggle.com/code/residentmario/creating-reading-and-writing
32 Dive into dplyr (tutorial #1) 1828 Palmer Archipelago (Antarctica) penguin data 134 https://www.kaggle.com/code/jessemostipak/dive-into-dplyr-tutorial-1
33 Back to (predict) the future - Interactive M5 EDA 1820 US Natural Disaster Declarations 322 https://www.kaggle.com/code/headsortails/back-to-predict-the-future-interactive-m5-eda
34 COVID-19 - Analysis, Visualization & Comparisons 1773 World Happiness Report 359 https://www.kaggle.com/code/imdevskp/covid-19-analysis-visualization-comparisons
35 Time series Basics : Exploring traditional TS 1767 Predict Future Sales 174 https://www.kaggle.com/code/jagangupta/time-series-basics-exploring-traditional-ts
36 Titanic - Advanced Feature Engineering Tutorial 1753 Titanic - Machine Learning from Disaster 322 https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial
37 Regularized Linear Models 1721 House Prices - Advanced Regression Techniques 336 https://www.kaggle.com/code/apapiu/regularized-linear-models
38 Random Forests 1711 Mobile Price Classification 2 https://www.kaggle.com/code/dansbecker/random-forests
39 Deep Learning For NLP: Zero To Transformers & BERT 1682 glove.840B.300d.txt 89 https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert
40 Feature Selection and Data Visualization 1681 Breast Cancer Wisconsin (Diagnostic) Data Set 304 https://www.kaggle.com/code/kanncaa1/feature-selection-and-data-visualization
41 Is it a bird? Creating a model from your own data 1670 No attached data sources 32 https://www.kaggle.com/code/jhoward/is-it-a-bird-creating-a-model-from-your-own-data
42 Lists 1651 No attached data sources 60 https://www.kaggle.com/code/colinmorris/lists
43 Submitting From A Kernel 1650 House Prices - Advanced Regression Techniques 491 https://www.kaggle.com/code/dansbecker/submitting-from-a-kernel
44 Basic Data Exploration 1552 Mobile Price Classification 8 https://www.kaggle.com/code/dansbecker/basic-data-exploration
45 Indexing, Selecting & Assigning 1494 18,393 Pitchfork Reviews 121 https://www.kaggle.com/code/residentmario/indexing-selecting-assigning
46 Getting Started with a Movie Recommendation System 1482 TMDB 5000 Movie Dataset 181 https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system
47 House prices: Lasso, XGBoost, and a detailed EDA 1470 House Prices - Advanced Regression Techniques 255 https://www.kaggle.com/code/erikbruin/house-prices-lasso-xgboost-and-a-detailed-eda
48 Twitter sentiment Extaction-Analysis,EDA and Model 1440 [Private Datasource] 122 https://www.kaggle.com/code/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model
49 Loops and List Comprehensions 1436 No attached data sources 75 https://www.kaggle.com/code/colinmorris/loops-and-list-comprehensions
50 Exploratory Analysis - Zillow 1428 Zillow Prize: Zillow’s Home Value Prediction (Zestimate) 169 https://www.kaggle.com/code/philippsp/exploratory-analysis-zillow
51 Data Analysis & XGBoost Starter (0.35460 LB) 1414 Quora Question Pairs 168 https://www.kaggle.com/code/anokas/data-analysis-xgboost-starter-0-35460-lb
52 A Statistical Analysis & ML workflow of Titanic 1411 Titanic - Machine Learning from Disaster 321 https://www.kaggle.com/code/masumrumi/a-statistical-analysis-ml-workflow-of-titanic
53 A Journey through Titanic 1399 Titanic - Machine Learning from Disaster 417 https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic
54 Plotly Tutorial for Beginners 1370 World University Rankings 143 https://www.kaggle.com/code/kanncaa1/plotly-tutorial-for-beginners
55 Tutorial on reading large datasets 1365 Riiid train data (multiple formats) 111 https://www.kaggle.com/code/rohanrao/tutorial-on-reading-large-datasets
56 Python Data Visualizations 1365 Iris Species 162 https://www.kaggle.com/code/benhamner/python-data-visualizations
57 Working with External Libraries 1353 No attached data sources 53 https://www.kaggle.com/code/colinmorris/working-with-external-libraries
58 Strings and Dictionaries 1346 No attached data sources 52 https://www.kaggle.com/code/colinmorris/strings-and-dictionaries
59 Explore Your Data 1329 home data for ml course 237 https://www.kaggle.com/code/dansbecker/explore-your-data
60 Handling Missing Values 1324 Melbourne Housing Market 440 https://www.kaggle.com/code/dansbecker/handling-missing-values
61 Basic EDA,Cleaning and GloVe 1311 GloVe: Global Vectors for Word Representation 162 https://www.kaggle.com/code/shahules/basic-eda-cleaning-and-glove
62 Pytorch Tutorial for Deep Learning Lovers 1310 Digit Recognizer 122 https://www.kaggle.com/code/kanncaa1/pytorch-tutorial-for-deep-learning-lovers
63 EDA is Fun! 1298 PUBG Finish Placement Prediction (Kernels Only) 186 https://www.kaggle.com/code/deffro/eda-is-fun
64 NLP with Disaster Tweets - EDA, Cleaning and BERT 1278 Pickled glove.840B.300d 209 https://www.kaggle.com/code/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert
65 Data Cleaning Challenge: Handling missing values 1260 Detailed NFL Play-by-Play Data 2009-2018 377 https://www.kaggle.com/code/rtatman/data-cleaning-challenge-handling-missing-values
66 Titanic Survival Predictions (Beginner) 1252 Titanic - Machine Learning from Disaster 272 https://www.kaggle.com/code/nadintamer/titanic-survival-predictions-beginner
67 Santander EDA and Prediction 1215 Santander Customer Transaction Prediction 189 https://www.kaggle.com/code/gpreda/santander-eda-and-prediction
68 Titanic best working Classifier 1181 Titanic - Machine Learning from Disaster 195 https://www.kaggle.com/code/sinakhorami/titanic-best-working-classifier
69 Data Science for tabular data: Advanced Techniques 1143 No Data Sources 101 https://www.kaggle.com/code/vbmokin/data-science-for-tabular-data-advanced-techniques
70 Keras U-Net starter - LB 0.277 1120 2018 Data Science Bowl 163 https://www.kaggle.com/code/keegil/keras-u-net-starter-lb-0-277
71 Simple Exploration Notebook - Zillow Prize 1104 Zillow Prize: Zillow’s Home Value Prediction (Zestimate) 131 https://www.kaggle.com/code/sudalairajkumar/simple-exploration-notebook-zillow-prize
72 EDA and models 1096 IEEE-CIS Fraud Detection 218 https://www.kaggle.com/code/artgor/eda-and-models
73 How to: Preprocessing when using embeddings 1094 Quora Insincere Questions Classification 102 https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-when-using-embeddings
74 COVID-19 Literature Clustering 1087 COVID-19 Open Research Dataset Challenge (CORD-19) 232 https://www.kaggle.com/code/maksimeren/covid-19-literature-clustering
75 Head Start for Data Scientist 1086 Titanic - Machine Learning from Disaster 233 https://www.kaggle.com/code/hiteshp/head-start-for-data-scientist
76 Be my guest - Recruit Restaurant EDA 1079 Weather Data for Recruit Restaurant Competition 237 https://www.kaggle.com/code/headsortails/be-my-guest-recruit-restaurant-eda
77 NB-SVM strong linear baseline 1070 Toxic Comment Classification Challenge 152 https://www.kaggle.com/code/jhoward/nb-svm-strong-linear-baseline
78 Seaborn Tutorial for Beginners 1070 Fatal Police Shootings in the US 182 https://www.kaggle.com/code/kanncaa1/seaborn-tutorial-for-beginners
79 A look at different embeddings.! 1057 Quora Insincere Questions Classification 111 https://www.kaggle.com/code/sudalairajkumar/a-look-at-different-embeddings
80 Deep Neural Network Keras way 1053 Digit Recognizer 201 https://www.kaggle.com/code/poonaml/deep-neural-network-keras-way
81 Simple Exploration Baseline - GA Customer Revenue 1044 Google Analytics Customer Revenue Prediction 162 https://www.kaggle.com/code/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue
82 Simple Matplotlib & Visualization Tips