Is there any convenient way to parse HTML elements by the CSS styles applied to them in Python?

Time:10-04

I know that I can extract elements using CSS selectors with libraries such as bs4. My problem is that I don't know the names of the CSS classes used to style the elements I need to extract, but I do know that all of these elements have a common rule applied to them ("position:fixed;" in my case). Is there a convenient way (some library) to do this?

CodePudding user response:

Unless the styling is inline in the HTML source, you cannot easily tell which style rules apply to each element, which makes the filtering more involved. The solution below works in two parts:

  1. The CSS stylesheets linked from the target page are requested and parsed using tinycss.
  2. The site is then opened in selenium, and the style rules are serialized to JSON and passed to a JavaScript snippet that is run via driver.execute_script. This way, the built-in JavaScript method element.matches does the selector matching, which substantially speeds up detecting which selectors apply to each element:

import requests, urllib.parse, tinycss
from bs4 import BeautifulSoup as soup

base_url = 'https://stackoverflow.com/questions/tagged/python' #replace with your own target link

def css(link):
    #download one stylesheet and return [selector, [[name, value], ...]] for each rule
    parser = tinycss.make_parser('page3')
    stylesheet = parser.parse_stylesheet(requests.get(link).text)
    #keep plain rule sets only; at-rules such as @media or @import have no selector to serialize
    return [[i.selector.as_css(), [[j.name, j.value.as_css()] for j in i.declarations]]
      for i in stylesheet.rules if isinstance(i, tinycss.css21.RuleSet)]

#one parsed rule list per <link rel="stylesheet"> found on the page
css_links = [css(urllib.parse.urljoin(base_url, i['href'])) for i in
     soup(requests.get(base_url).text, 'html.parser').select('link[rel="stylesheet"]')]
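
To make the shape of css_links concrete, here is a minimal sketch that works on hand-written sample data instead of a downloaded stylesheet. The names sample_sheets and selectors_declaring are hypothetical and not part of the answer's code; the sketch only shows how the nested [selector, declarations] lists can be searched for one declaration.

```python
# Hand-written sample in the same shape as css_links:
# one list per stylesheet, each entry is [selector, [[name, value], ...]].
sample_sheets = [
    [
        [".top-bar", [["position", "fixed"], ["top", "0"]]],
        [".container", [["text-align", "left"]]],
    ],
    [
        [".s-modal", [["position", "fixed"]]],
        ["p", [["margin", "0"]]],
    ],
]


def selectors_declaring(sheets, name, value):
    """Return every selector whose rule contains the declaration name: value."""
    return [
        selector
        for sheet in sheets
        for selector, declarations in sheet
        if [name, value] in declarations
    ]


print(selectors_declaring(sample_sheets, "position", "fixed"))
# ['.top-bar', '.s-modal']
```

Selectors found this way only say where a declaration is written. They do not account for the cascade, so a later rule may still override the value on a given element.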

Now, loading the target page with selenium:


from selenium import webdriver
import re, json
d = webdriver.Chrome('/path/to/chromedriver') #recent Selenium 4 releases no longer take the path here; call webdriver.Chrome() instead
d.get(base_url)
rules = ['position:fixed', 'text-align: left'] #your rule list
new_rules = [re.split(r':\s*', i) for i in rules] #split each rule into a [name, value] pair
classes = d.execute_script('''
  var target_rules = JSON.parse(arguments[0]);
  var selectors = JSON.parse(arguments[1]);
  function* matching_rules(elem){
     for (var i of selectors){
         for (var [s, rules] of i){
             try{
                if (elem.matches(s)){
                   yield* rules
                }
             }
             catch(e){
             }
         }
     }
  }
  function* get_classes(elem){
      if (elem.getAttribute('class')){
         for (var rule of matching_rules(elem)){
            if (target_rules.map(JSON.stringify).includes(JSON.stringify(rule))){
               yield [elem.getAttribute('class'), rule]
            }
         }
      }
      for (var i of elem.children){
         yield* get_classes(i);
      }
  }
  return [...get_classes(document.querySelector('body'))]
''', json.dumps(new_rules), json.dumps(css_links))

print(classes)

Output:

[['top-bar js-top-bar top-bar__network', ['position', 'fixed']], ['topbar-dialog leftnav-dialog js-leftnav-dialog dno', ['text-align', 'left']], ['s-spinner s-spinner__sm fc-orange-400', ['text-align', 'left']], ['topbar-dialog siteSwitcher-dialog dno', ['text-align', 'left']], ['container', ['text-align', 'left']], ['left-sidebar--sticky-container js-sticky-leftnav', ['position', 'fixed']], ['mln12 mrn12 px12 py6 fl1 s-block-link', ['text-align', 'left']], ['mln12 mrn12 px12 py6 fl1 s-block-link', ['text-align', 'left']], ['mln12 mrn12 px12 py6 fl1 s-block-link', ['text-align', 'left']], ['mln12 mrn12 px12 py6 fl1 s-block-link', ['text-align', 'left']], ['s-block-link c-default fc-black-350 mln12 mrn12 px12 py6', ['text-align', 'left']], ['s-modal js-feed-link-modal', ['position', 'fixed']], ['ff-sans ps-fixed z-nav-fixed ws4 sm:w-auto p32 sm:p16 bg-black-750 fc-white bar-lg b16 l16 r16 js-consent-banner', ['position', 'fixed']]]

Each entry in the output pairs the class attribute of an element with a rule from your target rule set that one of its matching selectors declares. Elements without a class attribute are skipped.
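
The script compares [name, value] pairs by exact string equality, so a stray space or trailing semicolon in a target rule is enough to miss a match. A minimal sketch of a more forgiving normalizer follows; normalize_rule is a hypothetical helper, not part of the code above.

```python
def normalize_rule(rule):
    """Split 'name: value;' into a [name, value] pair without stray spaces."""
    name, _, value = rule.partition(":")
    # Property names are case-insensitive; the value is only trimmed.
    return [name.strip().lower(), value.strip().rstrip(";").strip()]


rules = ["position:fixed", "Text-Align : left;", "  position : fixed ; "]
print([normalize_rule(r) for r in rules])
# [['position', 'fixed'], ['text-align', 'left'], ['position', 'fixed']]
```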

If you want to get the unique class names associated with each style rule, you can use collections.defaultdict:

from collections import defaultdict
grouped = defaultdict(set) #separate name so the selenium driver d is not overwritten
for c_name, rule in classes:
   grouped[tuple(rule)].add(c_name)

print({a:list(b) for a, b in grouped.items()})

Output:

{('position', 'fixed'): ['left-sidebar--sticky-container js-sticky-leftnav', 's-modal js-feed-link-modal', 'ff-sans ps-fixed z-nav-fixed ws4 sm:w-auto p32 sm:p16 bg-black-750 fc-white bar-lg b16 l16 r16 js-consent-banner', 'top-bar js-top-bar top-bar__network'], ('text-align', 'left'): ['s-spinner s-spinner__sm fc-orange-400', 'mln12 mrn12 px12 py6 fl1 s-block-link', 'topbar-dialog siteSwitcher-dialog dno', 's-block-link c-default fc-black-350 mln12 mrn12 px12 py6', 'topbar-dialog leftnav-dialog js-leftnav-dialog dno', 'container']}
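
For the inline-styling case mentioned at the top of this answer, no stylesheet parsing is needed, because the rule sits in the element's style attribute. Below is a minimal sketch that uses only the standard library so it runs without extra packages. The names parse_style and InlineStyleFinder and the sample HTML are made up for illustration. The same check could presumably be applied to tag.get('style') when using bs4.

```python
from html.parser import HTMLParser


def parse_style(style):
    """Turn 'position: fixed; top:0' into {'position': 'fixed', 'top': '0'}."""
    pairs = (decl.split(":", 1) for decl in style.split(";") if ":" in decl)
    return {name.strip().lower(): value.strip().lower() for name, value in pairs}


class InlineStyleFinder(HTMLParser):
    """Collect (tag, attributes) for start tags whose inline style has prop == value."""

    def __init__(self, prop, value):
        super().__init__()
        self.prop, self.value = prop, value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        style = attr_map.get("style") or ""  # a bare style attribute has value None
        if parse_style(style).get(self.prop) == self.value:
            self.matches.append((tag, attr_map))


# Made-up sample markup, only to exercise the parser.
html = '<div id="bar" style="position: fixed; top: 0">x</div><p style="color:red">y</p>'
finder = InlineStyleFinder("position", "fixed")
finder.feed(html)
print(finder.matches)
# [('div', {'id': 'bar', 'style': 'position: fixed; top: 0'})]
```

This is a deliberately simple parser: values carrying extras such as !important, or semicolons inside url(...), would need more careful handling.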

CodePudding user response:

Selenium is a very good Python library for browser automation, and it also allows web scraping and selecting multiple elements by their CSS selectors and more. This should allow you to collect all the elements with the rule you mention:

("position:fixed;" in my case)

Using Selenium you would be able to extract the data, and then you could manipulate it however you want. There are many ways to parse the data itself, so please be more specific on that part. If in doubt, send the website link as a comment and I'll check the CSS selectors for you. Hope this solves the issue.
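
One possible way to do this with Selenium, sketched here as an assumption rather than as this answer's exact method, is to ask the browser for each element's computed style, so that no class names or stylesheets need to be known. The helper find_by_computed_style and the constant name are hypothetical. execute_script and the browser's getComputedStyle are existing APIs.

```python
# JavaScript run inside the page: keep every element whose *computed* value
# for the given CSS property equals the wanted value.
FIND_BY_COMPUTED_STYLE = """
const [prop, value] = arguments;
return [...document.querySelectorAll('*')]
    .filter(el => getComputedStyle(el).getPropertyValue(prop) === value);
"""


def find_by_computed_style(driver, prop, value):
    """Return the elements whose computed style has prop == value.

    `driver` is assumed to be a Selenium WebDriver that has already
    loaded the target page with driver.get(...).
    """
    return driver.execute_script(FIND_BY_COMPUTED_STYLE, prop, value)


# Usage sketch (needs a real browser, so it is left commented out):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://example.com")  # placeholder URL
# for el in find_by_computed_style(driver, "position", "fixed"):
#     print(el.tag_name, el.get_attribute("class"))
```

Because the browser resolves the cascade, this reports the value an element actually ends up with, at the cost of calling getComputedStyle on every element of the page.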
