BeautifulSoup - scraping data through tags with conditions

Time:09-21

I am new to web scraping and am having difficulty scraping the data I need. I want to scrape data based on tags, with a condition. First, check whether an element is an 'h3' tag (i.e. it is a question, so scrape it). Then I want to add a condition: scrape it only if a 'p' tag or any other tag occurs after the 'h3' tag, otherwise skip it. I am having difficulty implementing this condition.

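# This is what I am doing right now
from bs4 import BeautifulSoup

# req is assumed to be the response of an earlier requests.get(...) call for the page
soup = BeautifulSoup(req.content, "html.parser")

# Collect every <h3> and <p> tag in document order
title = soup.find_all(['h3', 'p'])
print('List:', *title, sep='\n\n')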

CodePudding user response:

Your question is a bit generic and confusing with that 'any other tag': there will surely be other tags after an 'h3'. Nonetheless, here is a solution that separates the questions whose next sibling is a <p> tag from the ones followed by some other tag. You can use this as an example and modify the function to suit your needs:

import requests
from bs4 import BeautifulSoup as bs

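def has_p_sibling(el):
    # find_next_sibling() skips text nodes and returns None when no tag follows
    sibling = el.find_next_sibling()
    return sibling is not None and sibling.name == 'p'

r = requests.get('https://www.geeksforgeeks.org/cpp-interview-questions/')
soup = bs(r.text, 'html.parser')
questions = soup.select('h3')
for q in questions:
    # Look up the next sibling tag once so the printed name matches the check
    sibling = q.find_next_sibling()
    sibling_name = sibling.name if sibling is not None else None
    if has_p_sibling(q):
        print('GOOD Q', sibling_name, q.text)
    else:
        print('BAD Q', sibling_name, q.text)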

Result printed in terminal:

GOOD Q p Q-1. What is C++? What are the advantages of C++?
BAD Q div Q- 2. What are the different data types present in C++?
GOOD Q p Q-3. Define ‘std’?
GOOD Q p Q-4. What are references in C++?
GOOD Q p Q-5. What do you mean by Call by Value and Call by Reference?
GOOD Q p Q-6. Define token in C++
BAD Q figure Q-7. What is the difference between C and C++?
BAD Q figure Q-8. What is the difference between struct and class?
BAD Q figure Q-9. What is the difference between reference and pointer?
BAD Q figure Q-10. What is the difference between function overloading and operator overloading?
BAD Q figure Q-11. What is the difference between an array and a list?
BAD Q figure Q-12: What is the difference between a while loop and a do-while loop?
BAD Q figure Q-13. Discuss the difference between prefix and postfix?
BAD Q figure Q-14. What is the difference between new and malloc()?
BAD Q figure Q-15. What is the difference between virtual functions and pure virtual functions?
GOOD Q p Q-16. What are classes and objects in C++?
GOOD Q p Q-17. What is Function Overriding?
BAD Q ul Q-18. What are the various OOPs concepts in C++?
GOOD Q p Q-19. Explain inheritance
GOOD Q p Q-20. When should we use multiple inheritance?
GOOD Q p Q-21. What is virtual inheritance?
GOOD Q p Q-22. What is polymorphism in C++?
GOOD Q p Q-23. What are the different types of polymorphism in C++?
BAD Q figure Q-24. Compare compile-time polymorphism and Runtime polymorphism
GOOD Q p Q-25. Explain the constructor in C++.
GOOD Q p Q-26. What are destructors in C++?
GOOD Q p Q-27. What is a virtual destructor?
GOOD Q p Q-28. Is destructor overloading possible? If yes then explain and if no then why?
GOOD Q p Q-29. Which operations are permitted on pointers?
GOOD Q p Q-30. What is the purpose of the “delete” operator?
BAD Q figure Q-31. How delete [] is different from delete?
GOOD Q p Q-32. What do you know about friend class and friend function?
GOOD Q p Q-33. What is an Overflow Error?
GOOD Q p Q-34. What does the Scope Resolution operator do?
GOOD Q p Q-35. What are the C++ access modifiers?
GOOD Q p Q-36. Can you compile a program without the main function?
GOOD Q p Q-37. What is STL?
GOOD Q p Q-38. Define inline function. Can we have a recursive inline function in C++?
GOOD Q p Q-39. What is an abstract class and when do you use it?
GOOD Q p Q-40. What are the static data members and static member functions?
GOOD Q p Q-41. What is the main use of the keyword “Volatile”?
GOOD Q p Q-42. Define storage class in C++ and name some
GOOD Q p Q-43. What is a mutable storage class specifier? How can they be used?
GOOD Q p Q-44. Define the Block scope variable.
GOOD Q p Q-45. What is the function of the keyword “Auto”?
GOOD Q p Q-46.  Define namespace in C++.
GOOD Q p Q-47. When is void() return type used?
BAD Q figure Q-48. What is the difference between shallow copy and deep copy?
GOOD Q p Q-49. Can we call a virtual function from a constructor?
GOOD Q p Q-50. What are void pointers?
GOOD Q p Q-1. What is ‘this‘ pointer in C++?
BAD Q button Improve your Coding Skills with Practice
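
For experimenting without hitting the live site, here is a minimal offline sketch of the same check. The HTML string and the helper name next_tag_name are made up for illustration and are not taken from the page above. It also shows why find_next_sibling() is the safer call: .next_sibling can return the whitespace text node between two tags rather than the next tag.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup for illustration; not taken from the scraped page.
HTML = """
<div>
<h3>Q-1. First question?</h3>
<p>Answer one.</p>
<h3>Q-2. Second question?</h3>
<figure>diagram</figure>
<h3>Q-3. Last question?</h3>
</div>
"""


def next_tag_name(el):
    """Return the name of the next sibling tag, or None if no tag follows."""
    sibling = el.find_next_sibling()
    return sibling.name if sibling is not None else None


soup = BeautifulSoup(HTML, "html.parser")
headings = soup.find_all("h3")

# Keep only the headings that are directly followed by a <p> tag.
followed_by_p = [h.get_text(strip=True) for h in headings if next_tag_name(h) == "p"]
print(followed_by_p)

# .next_sibling yields the newline text node here, not the <p> tag.
print(isinstance(headings[0].next_sibling, NavigableString))
```

The last heading has no tag after it, so next_tag_name returns None for it instead of raising an AttributeError.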

UPDATE: If you want to get all elements from one question to the next, you can do:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.geeksforgeeks.org/cpp-interview-questions/')
soup = bs(r.text, 'html.parser')
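questions = soup.select('h3')
for q in questions:
    # Collect every element after this <h3> once, instead of re-querying on each iteration
    following = q.find_all_next()
    for i, s in enumerate(following):
        if i == 0:
            # On this page the first element is a tag nested inside the <h3>, holding the question text
            print('   QUESTION', s.get_text(strip=True))
        elif s.name == 'h3':
            # The next question starts here
            break
        elif len(s.text) > 5 and s.text != following[i - 1].text:
            # Skip very short fragments and text that repeats the previous element
            print('----ANSWER', s.get_text(strip=True))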

And the result in terminal would be:

   QUESTION Q-1. What is C++? What are the advantages of C++?
----ANSWER C++ is an object-oriented programming language that was introduced to overcome the jurisdictions where C was lacking. By object-oriented we mean that it works with the concept ofpolymorphism,inheritance,abstraction,encapsulation,object, and class.
----ANSWER polymorphism
----ANSWER inheritance
----ANSWER abstraction
----ANSWER encapsulation
----ANSWER object, and class
----ANSWER Advantages of C++:
----ANSWER Advantages of C++
----ANSWER C++ is an OOPs language that means the data is considered as objects.C++ is a multi-paradigm language; In simple terms, it means that we can program the logic, structure, and procedure of the program.Memory management is a key feature in C++ as it enables dynamic memory allocationIt is a Mid-Level programming language which means it can develop games, desktop applications, drivers, and kernels
----ANSWER C++ is an OOPs language that means the data is considered as objects.
----ANSWER C++ is a multi-paradigm language; In simple terms, it means that we can program the logic, structure, and procedure of the program.
----ANSWER Memory management is a key feature in C++ as it enables dynamic memory allocation
----ANSWER It is a Mid-Level programming language which means it can develop games, desktop applications, drivers, and kernels
----ANSWER To read more, refer to the article –What are the advantages of C++?
----ANSWER What are the advantages of C++?
   QUESTION Q- 2. What are the different data types present in C++?
----ANSWER Different types of data types in C++
----ANSWER Different types of data types in C++
----ANSWER For more information, refer toC++ data types
----ANSWER C++ data types
   QUESTION Q-3. Define ‘std’?
----ANSWER ‘std’is also known as Standard or it can be interpreted [...]
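
The output above repeats nested text (a paragraph, then each link inside it) because find_all_next() also returns descendants. If the answer elements are direct siblings of the 'h3', which is an assumption that may not hold on the real page, iterating over next_siblings avoids the repeats. A rough sketch on made-up HTML, with collect_answers as a hypothetical helper name:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup for illustration; the real page may nest content differently.
HTML = """
<h3>Q-1. First question?</h3>
<p>Answer one, part A.</p>
<ul><li>point 1</li><li>point 2</li></ul>
<h3>Q-2. Second question?</h3>
<p>Answer two.</p>
"""


def collect_answers(soup):
    """Map each <h3> text to the text of the sibling tags that follow it."""
    grouped = {}
    for heading in soup.find_all("h3"):
        parts = []
        for sibling in heading.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other text nodes
            if sibling.name == "h3":
                break  # the next question starts here
            parts.append(sibling.get_text(" ", strip=True))
        grouped[heading.get_text(strip=True)] = parts
    return grouped


soup = BeautifulSoup(HTML, "html.parser")
answers = collect_answers(soup)
for question, parts in answers.items():
    print(question, parts)
```

Each top-level sibling is visited once, so a list or paragraph contributes a single entry instead of one entry per nested tag.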

Please review the BeautifulSoup documentation and try to understand its foundational logic. The docs can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
