Scrape all elements inside a li tag

Time:11-30

I'm trying to scrape some information from a Kaggle page. All the elements I'm looking for are inside a <ul role="list" >, and each item sits in its own <li role="listitem" >. I want to scrape every one of these items from the page. Here is the HTML of one item from the page, followed by my code:

<ul role="list" ><li role="listitem" ><div ><div ><div ><a href="/pmarcelino" target="_blank"  aria-label="Pedro Marcelino, PhD"><div data-testid="avatar-image" title="Pedro Marcelino, PhD"  style="background-image: url(&quot;https://storage.googleapis.com/kaggle-avatars/thumbnails/175415-gr.jpg&quot;);"></div><svg width="64" height="64" viewBox="0 0 64 64"><circle r="30.5" cx="32" cy="32" fill="none" stroke-width="3" style="stroke: rgb(241, 243, 244);"></circle><path d="M 49.92745019492043 56.6750183284359 A 30.5 30.5 0 0 0 32 1.5" fill="none" stroke-width="3" style="stroke: rgb(32, 190, 255);"></path></svg></a></div></div><a  href="/code/pmarcelino/comprehensive-data-exploration-with-python"><div ><div >Comprehensive data exploration with Python</div><span ><span><span>Updated <span title="Sat Apr 30 2022 21:20:37 GMT 0200 (heure d’été d’Europe centrale)" aria-label="7 months ago">7mo ago</span></span></span> </span><span ><span ><a href="/code/pmarcelino/comprehensive-data-exploration-with-python/comments" >1819 comments</a> · <span ><span >House Prices - Advanced Regression Techniques</span></span></span></span></div></a><div ><div ><button mode="default" data-testid="upvotebutton__upvote" aria-label="Upvote" ><i  sizevalue="18px">arrow_drop_up</i></button><span mode="default" >12770</span></div><span ><span ><img role="presentation" alt="" src="/static/images/medals/competitions/[email protected]" style="height: 9px; width: 9px;"> Gold</span><div ><button aria-label="more_horiz" >more_horiz</button></div></span></div></div><div ></div></li>

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'} 


url = "https://www.kaggle.com/code?sortBy=voteCount&page=1"

req = requests.get(url, headers = headers)
soup = BeautifulSoup(req.text, 'html.parser')

html_content = soup.find_all('li', attrs = {'class': 'sc-jfmDQi hfJycS'})  

data = []

for elements in html_content:
    data.append({
        'title': elements.find("div", {"class": "sc-iBkjds sc-fLlhyt sc-fbPSWO uVZhN izULIq A-dENW"}).text,
        'stars': elements.find("span", {"class": "sc-gXmSlM sc-cCsOjp sc-hAGLhy cKhlzA piYDj mWvOY"}).text,
        'resume': elements.find("span", {"class": "sc-cKajLJ jNrpDQ"}).text,
        'comments': elements.find("span", {"class": "sc-dPyBCJ sc-bBXxYQ sc-bOJcbE cSRCiy cFEurs gTFrUa"}).text,
        'link': elements.get('href')})

print(data)

My output:

[]

Answer:

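The page loads its data dynamically through an API, so the list items are not in the HTML that requests receives, and find_all returns an empty list. You need to work out which request supplies the data and what that request needs in order to succeed. Below is a small example that first fetches the required tokens and then retrieves the information through the API. To get further results, change the page argument passed to get_kernel_ids (it is hardcoded to 1 inside get_info).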
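
One way to confirm this diagnosis before reaching for the API is to count the <li role="listitem" > elements in the HTML you actually received. The sketch below uses only the standard library; the two HTML strings are hypothetical stand-ins (one for what the browser renders, one for an empty page shell), not real Kaggle responses.

```python
from html.parser import HTMLParser


class ListItemCounter(HTMLParser):
    """Counts <li role="listitem"> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "li" and dict(attrs).get("role") == "listitem":
            self.count += 1


def count_list_items(html: str) -> int:
    parser = ListItemCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins, not real Kaggle markup.
rendered = '<ul role="list"><li role="listitem"><div>Notebook</div></li></ul>'
static_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(count_list_items(rendered))      # 1
print(count_list_items(static_shell))  # 0
```

Passing req.text from the question's code to count_list_items should show whether the items are present in the static response; a count of 0 would be consistent with the empty output above. The API-based example follows.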

import requests
import pandas as pd
import json


def get_token():
    # Load the listing page once; both tokens are read from the cookies set on the response.
    url = 'https://www.kaggle.com/code?sortBy=voteCount&page=1'
    response = requests.get(url)
    cookies = response.cookies.get_dict()
    return cookies['CSRF-TOKEN'], cookies['XSRF-TOKEN']


def get_kernel_ids(csrf_token: str, xsrf_token: str, page: int):
    url = "https://www.kaggle.com/api/i/kernels.KernelsService/ListKernelIds"
    payload = json.dumps({
        "sortBy": "VOTE_COUNT",
        "pageSize": 100,
        "group": "EVERYONE",
        "page": page,
        "tagIds": "",
        "excludeResultsFilesOutputs": False,
        "wantOutputFiles": False,
        "excludeKernelIds": []
    })
    headers = {
        'accept': 'application/json',
        'content-type': 'application/json',
        'cookie': f'CSRF-TOKEN={csrf_token}',
        'x-xsrf-token': xsrf_token
    }
    response = requests.post(url, headers=headers, data=payload)
    return response.json()['kernelIds']


def get_info(csrf_token: str, xsrf_token: str):
    url = "https://www.kaggle.com/api/i/kernels.KernelsService/GetKernelListDetails"
    payload = json.dumps({
      "deletedAccessBehavior": "RETURN_NOTHING",
      "unauthorizedAccessBehavior": "RETURN_NOTHING",
      "excludeResultsFilesOutputs": False,
      "wantOutputFiles": False,
      "kernelIds": get_kernel_ids(csrf_token, xsrf_token, 1),
      "outputFileTypes": [],
      "includeInvalidDataSources": False
    })
    headers = {
      'accept': 'application/json',
      'content-type': 'application/json',
      'cookie': f'CSRF-TOKEN={csrf_token}',
      'x-xsrf-token': xsrf_token
    }
    response = requests.post(url, headers=headers, data=payload)
    data = []
    for kernel in response.json()['kernels']:
        data.append({
            'title': kernel['title'],
            'stars': kernel['totalVotes'],
            'resume': kernel['dataSources'][0]['name'] if 'dataSources' in kernel else 'No attached data sources',
            'comments': kernel['totalComments'],
            'url': f'https://www.kaggle.com{kernel["scriptUrl"]}'
        })
    return data


csrf, xsrf = get_token()
df = pd.DataFrame(get_info(csrf, xsrf))
print(df.to_string())
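
Since get_info always asks for page 1, collecting several pages means calling get_kernel_ids in a loop. A minimal sketch of such a loop is shown here; collect_ids is a hypothetical helper, and stopping on the first empty page is an assumption about how the endpoint signals the end of the results. The fetch function is passed in so the loop can be tried offline with made-up IDs.

```python
from typing import Callable, List


def collect_ids(fetch_page: Callable[[int], List[int]], max_pages: int) -> List[int]:
    """Call fetch_page(1), fetch_page(2), ... and concatenate the results.

    Stops early when a page comes back empty (assumed to mean "no more results").
    """
    all_ids: List[int] = []
    for page in range(1, max_pages + 1):
        ids = fetch_page(page)
        if not ids:
            break
        all_ids.extend(ids)
    return all_ids


# Offline stand-in for get_kernel_ids: three fake pages of made-up IDs.
fake_pages = {1: [101, 102], 2: [103], 3: []}
print(collect_ids(lambda page: fake_pages.get(page, []), max_pages=5))  # [101, 102, 103]
```

With the functions above, the call would look like collect_ids(lambda p: get_kernel_ids(csrf, xsrf, p), max_pages=3), and the resulting list could be sent as "kernelIds" in the get_info payload in place of the hardcoded page 1 call.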

OUTPUT:

                                                 title  stars                                                    resume  comments                                                                                           url
0           Comprehensive data exploration with Python  12772             House Prices - Advanced Regression Techniques      1819             https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python
1                       Titanic Data Science Solutions   9374                  Titanic - Machine Learning from Disaster      2316                         https://www.kaggle.com/code/startupsci/titanic-data-science-solutions
2                                     Titanic Tutorial   8161                  Titanic - Machine Learning from Disaster     26348                                      https://www.kaggle.com/code/alexisbcook/titanic-tutorial
3          Stacked Regressions : Top 4% on LeaderBoard   6749             House Prices - Advanced Regression Techniques      1075                  https://www.kaggle.com/code/serigne/stacked-regressions-top-4-on-leaderboard
4           Introduction to CNN Keras - 0.997 (top 6%)   6438                                          Digit Recognizer      1002              https://www.kaggle.com/code/yassineghouzam/introduction-to-cnn-keras-0-997-top-6
5                   Data ScienceTutorial for Beginners   6213                                    Pokemon- Weedle's Cave      1160                       https://www.kaggle.com/code/kanncaa1/data-sciencetutorial-for-beginners
6                                        Hello, Python   5952                                  No attached data sources       329                                          https://www.kaggle.com/code/colinmorris/hello-python
7        Introduction to Ensembling/Stacking in Python   5657                  Titanic - Machine Learning from Disaster      1031           https://www.kaggle.com/code/arthurtok/introduction-to-ensembling-stacking-in-python
8                                      How Models Work   5308                               Mobile Price Classification         2                                        https://www.kaggle.com/code/dansbecker/how-models-work
9    A Data Science Framework: To Achieve 99% Accuracy   5266                  Titanic - Machine Learning from Disaster       657        https://www.kaggle.com/code/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
10   Credit Fraud  || Dealing with Imbalanced Datasets   4274                               Credit Card Fraud Detection       629       https://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets
11                   Exploring Survival on the Titanic   3735                  Titanic - Machine Learning from Disaster      1041                         https://www.kaggle.com/code/mrisdal/exploring-survival-on-the-titanic
12                   Start Here: A Gentle Introduction   3372                                  Home Credit Default Risk       543                     https://www.kaggle.com/code/willkoehrsen/start-here-a-gentle-introduction
13                          Functions and Getting Help   2974                                  No attached data sources       145                            https://www.kaggle.com/code/colinmorris/functions-and-getting-help
14                   Your First Machine Learning Model   2913                               Mobile Price Classification       386                      https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model
15         Exercise: Your First Machine Learning Model   2840                                Melbourne Housing Snapshot       381              https://www.kaggle.com/code/yogeshtak/exercise-your-first-machine-learning-model
16               Titanic Top 4% with ensemble modeling   2724                  Titanic - Machine Learning from Disaster       408               https://www.kaggle.com/code/yassineghouzam/titanic-top-4-with-ensemble-modeling
17                                    Model Validation   2655                               Mobile Price Classification         5                                       https://www.kaggle.com/code/dansbecker/model-validation
18                         EDA To Prediction(DieTanic)   2544                  Titanic - Machine Learning from Disaster       342                                 https://www.kaggle.com/code/ash316/eda-to-prediction-dietanic
19            Winning solutions of kaggle competitions   2527                                      [Private Datasource]       207          https://www.kaggle.com/code/sudalairajkumar/winning-solutions-of-kaggle-competitions
20                           Booleans and Conditionals   2501                                  No attached data sources        75                             https://www.kaggle.com/code/colinmorris/booleans-and-conditionals
21                        Underfitting and Overfitting   2500                               Mobile Price Classification         7                           https://www.kaggle.com/code/dansbecker/underfitting-and-overfitting
22                         Getting Started with Kaggle   2420                                  No attached data sources      4265                           https://www.kaggle.com/code/alexisbcook/getting-started-with-kaggle
23                         Full Preprocessing Tutorial   2366                                    Data Science Bowl 2017       490                              https://www.kaggle.com/code/gzuidhof/full-preprocessing-tutorial
24             Machine Learning Tutorial for Beginners   2333             Biomechanical features of orthopedic patients       292                  https://www.kaggle.com/code/kanncaa1/machine-learning-tutorial-for-beginners
25            Everything you can do with a time series   2330                                 DJIA 30 Stock Time Series       171         https://www.kaggle.com/code/thebrownviking20/everything-you-can-do-with-a-time-series
26                Deep Learning Tutorial for Beginners   2296                              Sign Language Digits Dataset       248                     https://www.kaggle.com/code/kanncaa1/deep-learning-tutorial-for-beginners
27                   Getting staRted in R: First Steps   2216                                     Chocolate Bar Ratings       102                          https://www.kaggle.com/code/rtatman/getting-started-in-r-first-steps
28      Approaching (Almost) Any NLP Problem on Kaggle   2169                                       glove.840B.300d.txt       231             https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
29                     Data Science Glossary on Kaggle   2125                                      [Private Datasource]       229                           https://www.kaggle.com/code/shivamb/data-science-glossary-on-kaggle
30   Coronavirus (COVID-19) Visualization & Prediction   2069                                  No attached data sources       691    https://www.kaggle.com/code/therealcyberlord/coronavirus-covid-19-visualization-prediction
31                       Creating, Reading and Writing   2063                                  18,393 Pitchfork Reviews        95                        https://www.kaggle.com/code/residentmario/creating-reading-and-writing
32                       Dive into dplyr (tutorial #1)   1828             Palmer Archipelago (Antarctica) penguin data        134                          https://www.kaggle.com/code/jessemostipak/dive-into-dplyr-tutorial-1
33   Back to (predict) the future - Interactive M5 EDA   1820                          US Natural Disaster Declarations       322        https://www.kaggle.com/code/headsortails/back-to-predict-the-future-interactive-m5-eda
34    COVID-19 - Analysis, Visualization & Comparisons   1773                                    World Happiness Report       359              https://www.kaggle.com/code/imdevskp/covid-19-analysis-visualization-comparisons
35      Time series Basics : Exploring traditional TS    1767                                      Predict Future Sales       174            https://www.kaggle.com/code/jagangupta/time-series-basics-exploring-traditional-ts
36     Titanic - Advanced Feature Engineering Tutorial   1753                  Titanic - Machine Learning from Disaster       322         https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial
37                           Regularized Linear Models   1721             House Prices - Advanced Regression Techniques       336                                  https://www.kaggle.com/code/apapiu/regularized-linear-models
38                                      Random Forests   1711                               Mobile Price Classification         2                                         https://www.kaggle.com/code/dansbecker/random-forests
39  Deep Learning For NLP: Zero To Transformers & BERT   1682                                       glove.840B.300d.txt        89     https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert
40            Feature Selection and Data Visualization   1681             Breast Cancer Wisconsin (Diagnostic) Data Set       304                 https://www.kaggle.com/code/kanncaa1/feature-selection-and-data-visualization
41   Is it a bird? Creating a model from your own data   1670                                  No attached data sources        32          https://www.kaggle.com/code/jhoward/is-it-a-bird-creating-a-model-from-your-own-data
42                                               Lists   1651                                  No attached data sources        60                                                 https://www.kaggle.com/code/colinmorris/lists
43                            Submitting From A Kernel   1650             House Prices - Advanced Regression Techniques       491                               https://www.kaggle.com/code/dansbecker/submitting-from-a-kernel
44                              Basic Data Exploration   1552                               Mobile Price Classification         8                                 https://www.kaggle.com/code/dansbecker/basic-data-exploration
45                     Indexing, Selecting & Assigning   1494                                  18,393 Pitchfork Reviews       121                        https://www.kaggle.com/code/residentmario/indexing-selecting-assigning
46  Getting Started with a Movie Recommendation System   1482                                   TMDB 5000 Movie Dataset       181       https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system
47    House prices: Lasso, XGBoost, and a detailed EDA   1470             House Prices - Advanced Regression Techniques       255           https://www.kaggle.com/code/erikbruin/house-prices-lasso-xgboost-and-a-detailed-eda
48  Twitter sentiment Extaction-Analysis,EDA and Model   1440                                      [Private Datasource]       122  https://www.kaggle.com/code/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model
49                       Loops and List Comprehensions   1436                                  No attached data sources        75                         https://www.kaggle.com/code/colinmorris/loops-and-list-comprehensions
50                       Exploratory Analysis - Zillow   1428  Zillow Prize: Zillow’s Home Value Prediction (Zestimate)       169                             https://www.kaggle.com/code/philippsp/exploratory-analysis-zillow
51        Data Analysis & XGBoost Starter (0.35460 LB)   1414                                      Quora Question Pairs       168                   https://www.kaggle.com/code/anokas/data-analysis-xgboost-starter-0-35460-lb
52     A Statistical Analysis & ML workflow of Titanic   1411                  Titanic - Machine Learning from Disaster       321           https://www.kaggle.com/code/masumrumi/a-statistical-analysis-ml-workflow-of-titanic
53                           A Journey through Titanic   1399                  Titanic - Machine Learning from Disaster       417                             https://www.kaggle.com/code/omarelgabry/a-journey-through-titanic
54                       Plotly Tutorial for Beginners   1370                                 World University Rankings       143                            https://www.kaggle.com/code/kanncaa1/plotly-tutorial-for-beginners
55                  Tutorial on reading large datasets   1365                       Riiid train data (multiple formats)       111                       https://www.kaggle.com/code/rohanrao/tutorial-on-reading-large-datasets
56                          Python Data Visualizations   1365                                              Iris Species       162                              https://www.kaggle.com/code/benhamner/python-data-visualizations
57                     Working with External Libraries   1353                                  No attached data sources        53                       https://www.kaggle.com/code/colinmorris/working-with-external-libraries
58                            Strings and Dictionaries   1346                                  No attached data sources        52                              https://www.kaggle.com/code/colinmorris/strings-and-dictionaries
59                                   Explore Your Data   1329                                   home data for ml course       237                                      https://www.kaggle.com/code/dansbecker/explore-your-data
60                             Handling Missing Values   1324                                  Melbourne Housing Market       440                                https://www.kaggle.com/code/dansbecker/handling-missing-values
61                        Basic EDA,Cleaning and GloVe   1311             GloVe: Global Vectors for Word Representation       162                             https://www.kaggle.com/code/shahules/basic-eda-cleaning-and-glove
62           Pytorch Tutorial for Deep Learning Lovers   1310                                          Digit Recognizer       122                https://www.kaggle.com/code/kanncaa1/pytorch-tutorial-for-deep-learning-lovers
63                                         EDA is Fun!   1298           PUBG Finish Placement Prediction (Kernels Only)       186                                                 https://www.kaggle.com/code/deffro/eda-is-fun
64   NLP with Disaster Tweets - EDA, Cleaning and BERT   1278                                   Pickled glove.840B.300d       209        https://www.kaggle.com/code/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert
65    Data Cleaning Challenge: Handling missing values   1260                  Detailed NFL Play-by-Play Data 2009-2018       377           https://www.kaggle.com/code/rtatman/data-cleaning-challenge-handling-missing-values
66             Titanic Survival Predictions (Beginner)   1252                  Titanic - Machine Learning from Disaster       272                  https://www.kaggle.com/code/nadintamer/titanic-survival-predictions-beginner
67                        Santander EDA and Prediction   1215                 Santander Customer Transaction Prediction       189                               https://www.kaggle.com/code/gpreda/santander-eda-and-prediction
68                     Titanic best working Classifier   1181                  Titanic - Machine Learning from Disaster       195                       https://www.kaggle.com/code/sinakhorami/titanic-best-working-classifier
69  Data Science for tabular data: Advanced Techniques   1143                                           No Data Sources       101         https://www.kaggle.com/code/vbmokin/data-science-for-tabular-data-advanced-techniques
70                      Keras U-Net starter - LB 0.277   1120                                   2018 Data Science Bowl        163                               https://www.kaggle.com/code/keegil/keras-u-net-starter-lb-0-277
71          Simple Exploration Notebook - Zillow Prize   1104  Zillow Prize: Zillow’s Home Value Prediction (Zestimate)       131          https://www.kaggle.com/code/sudalairajkumar/simple-exploration-notebook-zillow-prize
72                                      EDA and models   1096                                  IEEE-CIS Fraud Detection       218                                             https://www.kaggle.com/code/artgor/eda-and-models
73         How to: Preprocessing when using embeddings   1094                  Quora Insincere Questions Classification       102         https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-when-using-embeddings
74                      COVID-19 Literature Clustering   1087        COVID-19 Open Research Dataset Challenge (CORD-19)       232                         https://www.kaggle.com/code/maksimeren/covid-19-literature-clustering
75                       Head Start for Data Scientist   1086                  Titanic - Machine Learning from Disaster       233                             https://www.kaggle.com/code/hiteshp/head-start-for-data-scientist
76                Be my guest - Recruit Restaurant EDA   1079           Weather Data for Recruit Restaurant Competition       237                   https://www.kaggle.com/code/headsortails/be-my-guest-recruit-restaurant-eda
77                       NB-SVM strong linear baseline   1070                    Toxic Comment Classification Challenge       152                             https://www.kaggle.com/code/jhoward/nb-svm-strong-linear-baseline
78                      Seaborn Tutorial for Beginners   1070                          Fatal Police Shootings in the US       182                           https://www.kaggle.com/code/kanncaa1/seaborn-tutorial-for-beginners
79                    A look at different embeddings.!   1057                  Quora Insincere Questions Classification       111                    https://www.kaggle.com/code/sudalairajkumar/a-look-at-different-embeddings
80                       Deep Neural Network Keras way   1053                                          Digit Recognizer       201                             https://www.kaggle.com/code/poonaml/deep-neural-network-keras-way
81   Simple Exploration Baseline - GA Customer Revenue   1044              Google Analytics Customer Revenue Prediction       162   https://www.kaggle.com/code/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue
82            Simple Matplotlib & Visualization Tips            
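
The "No attached data sources" rows come from the fallback in get_info, which only checks that the 'dataSources' key exists. If the API ever returned that key with an empty list (an assumption; nothing in this output shows it happening), indexing [0] would raise an IndexError. A slightly more defensive version might look like the following; first_source_name is a hypothetical helper, and the sample dictionaries are made up for illustration.

```python
def first_source_name(kernel: dict, default: str = "No attached data sources") -> str:
    """Return the name of the first data source, or default if there is none."""
    # Treat a missing key, None, and an empty list the same way.
    sources = kernel.get("dataSources") or []
    if sources:
        return sources[0].get("name", default)
    return default


print(first_source_name({"dataSources": [{"name": "Digit Recognizer"}]}))  # Digit Recognizer
print(first_source_name({"dataSources": []}))       # No attached data sources
print(first_source_name({"title": "Hello, Python"}))  # No attached data sources
```

In get_info, the 'resume' entry would then become 'resume': first_source_name(kernel).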