for project in projects:
    soup.findAll('a')
    name = project.text
    print(name)
I am trying to extract just the package names from https://libraries.io/search?order=desc&page=1&platforms=Maven&sort=rank. However, when I run the code above I get a lot of unnecessary information that looks something like this:
junit:junit
JUnit is a unit testing framework for Java, created by Erich Gamma and Kent Beck.
Latest release 4.13.2 -
Updated
Feb 13, 2021
- 8.34K stars
The only output I want is "junit:junit". Any tips on how to achieve this? I have to do this for over 490,000 packages.
CodePudding user response:
Try this for the first page, then repeat it for the remaining pages by changing the page parameter in the URL:
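import requests
from bs4 import BeautifulSoup

url = "https://libraries.io/search?order=desc&page=1&platforms=Maven&sort=rank"
page = BeautifulSoup(requests.get(url).text, "lxml")

# Each result is a div.project whose h5 heading links to the package,
# so select only that link instead of all the text inside the block.
names = [a.getText() for a in page.select("div.project > h5 > a")]
print("\n".join(names))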
Output:
junit:junit
org.springframework:spring-context
org.springframework:spring-test
org.scala-lang:scala-library
org.springframework:spring-core
com.google.guava:guava
org.jetbrains.kotlin:kotlin-stdlib-jdk8
org.jetbrains.kotlin:kotlin-stdlib
com.h2database:h2
org.projectlombok:lombok
com.google.code.gson:gson
org.mockito:mockito-core
org.scala-lang:scala-reflect
org.springframework.boot:spring-boot-starter-test
org.springframework.boot:spring-boot-starter-web
org.springframework:spring-orm
org.springframework:spring-beans
org.springframework:spring-jdbc
org.springframework.boot:spring-boot-starter-actuator
org.springframework.boot:spring-boot-devtools
org.springframework:spring-web
org.junit.jupiter:junit-jupiter-engine
org.springframework:spring-tx
org.springframework:spring-aop
org.springframework:spring-context-support
com.fasterxml.jackson.core:jackson-databind
org.junit.jupiter:junit-jupiter-api
org.springframework:spring-webmvc
org.slf4j:slf4j-api
org.testng:testn
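
To repeat this over many pages, one possible approach is to build each page's URL from the same query and loop until a page returns no names. The following is only a sketch under some assumptions: the helper names (page_url, collect_names, fetch_names), the one-second delay, and the "stop at the first empty page" rule are all hypothetical choices, not something the site documents. At 30 names per page, as in the output above, 490,000 packages would be roughly 16,000 requests, so a pause between requests seems prudent.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://libraries.io/search"


def page_url(page):
    """Build the search URL for one results page (same query as the answer's URL)."""
    query = {"order": "desc", "page": page, "platforms": "Maven", "sort": "rank"}
    return f"{BASE_URL}?{urlencode(query)}"


def fetch_names(url):
    """Download one page and return its package names (selector from the answer)."""
    # Imported here so the helpers around it can be used without these packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    page = BeautifulSoup(response.text, "lxml")
    return [a.getText() for a in page.select("div.project > h5 > a")]


def collect_names(fetch, max_pages, delay=1.0):
    """Call fetch(url) for pages 1..max_pages, stopping at the first empty page."""
    names = []
    for page in range(1, max_pages + 1):
        batch = fetch(page_url(page))
        if not batch:
            # Assumption: an empty page means there are no more results.
            break
        names.extend(batch)
        time.sleep(delay)  # Assumed politeness delay; adjust to the site's limits.
    return names
```

Calling collect_names(fetch_names, max_pages=3) would collect the names from the first three pages. Passing the fetch function as an argument also makes it easy to swap in a stub while testing the loop without hitting the site.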