I want to filter website content that I have stored in a String using StringUtils, but I am having problems with the library.
Java code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import org.apache.commons.lang3.StringUtils;

public class URLConnectionReader {

    public static void main(String[] args) {
        String siteContent = getUrlContents("https://www.tradegate.de/indizes.php?buchstabe=A");
        System.out.println(siteContent);
        inputHandler(siteContent);
    }

    // Prints the text between the first "<a id=" and the next "</a>"
    public static void inputHandler(String input) {
        String str = StringUtils.substringBetween(input, "<a id=", "</a>");
        System.out.println(str);
    }

    // Downloads the page line by line and returns it as a single String
    private static String getUrlContents(String theUrl) {
        StringBuilder content = new StringBuilder();
        try {
            URL url = new URL(theUrl);
            URLConnection urlConnection = url.openConnection();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(urlConnection.getInputStream()))) {
                String line;
                while ((line = bufferedReader.readLine()) != null) {
                    content.append(line).append("\n");
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content.toString();
    }
}
The following steps were performed:
- Downloaded commons-lang3-3.12.0-bin.zip
- Unpacked it and saved the JAR files to the Eclipse directory
- Added the external libraries to the Java build path and applied the changes
- Removed and re-added the libraries, then restarted Eclipse
- Automatic build is switched on
Although the JAR is referenced as an external library, the following exception is thrown at runtime:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/lang3/StringUtils
at URLConnectionReader.inputHandler(URLConnectionReader.java:21)
at URLConnectionReader.main(URLConnectionReader.java:16)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang3.StringUtils
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 2 more
I searched several threads for troubleshooting hints, but I can't figure out where I made the mistake.
Answer:
When parsing HTML, you should use an HTML parser instead of manually manipulating the markup with string methods or regular expressions. Among the many available, jsoup is one of the best known and, in my opinion, the most intuitive and easiest parser to use when working with HTML in Java. Look at some examples to get familiar with the selector syntax, and/or read the documentation of the Selector API.
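To illustrate why plain string methods are brittle here, the following sketch mimics a substring-between lookup with a hand-written helper. The class name SubstringBetweenDemo, the method between, and the HTML fragment are made up for illustration; the helper is a stand-in built on String.indexOf, not the Commons Lang implementation.

```java
final class SubstringBetweenDemo {

    // Hand-written stand-in (hypothetical helper): returns the text between the
    // first occurrence of `open` and the next `close`, or null if either is missing.
    static String between(String text, String open, String close) {
        int start = text.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = text.indexOf(close, start);
        return end < 0 ? null : text.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up HTML fragment, not taken from the real page
        String html = "<td><a id=\"row1\" href=\"/orderbuch.php?isin=X1\">First Corp.</a></td>"
                + "<td><a id=\"row2\" href=\"/orderbuch.php?isin=X2\">Second Corp.</a></td>";
        // prints: "row1" href="/orderbuch.php?isin=X1">First Corp.
        System.out.println(between(html, "<a id=", "</a>"));
    }
}
```

The result still carries the attribute text and covers only the first link, so every further match would need more manual index bookkeeping. A parser with selectors avoids that.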
Get the JAR or the dependency from Maven Central: jsoup 1.15.3.
Using jsoup, and assuming you are interested in the content of the table body of the page from your question, something like the following should give you a starting point:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public final class Example2 {

    public static void main(String[] args) throws IOException {
        // Fetch and parse the page
        Document doc = Jsoup.connect("https://www.tradegate.de/indizes.php?buchstabe=A").get();
        // Select all rows inside the element with id "kursliste_abc"
        Elements tableRows = doc.getElementById("kursliste_abc").select("tr");
        tableRows.forEach(tr -> {
            // Each cell of the row is one column of the listing
            String gattung = tr.child(0).text();
            String bid = tr.child(1).text();
            String ask = tr.child(2).text();
            String stueck = tr.child(3).text();
            String ausgOrders = tr.child(4).text();
            String change = tr.child(5).text();
            // Resolve the relative link in the first cell to an absolute URL
            String link = tr.child(0).selectFirst("a").absUrl("href");
            System.out.printf("%-45s %-10s %-10s %-10s %-10s %-10s %-70s%n",
                    gattung, bid, ask, stueck, ausgOrders, change, link);
        });
    }
}
Output:
A-Cap Energy Ltd.                             0,07       0,09       0          0          0,00%      https://www.tradegate.de/orderbuch.php?isin=AU000000ACB7
A-Mark Precious Metals Inc.                   28,80      29,08      0          0          0,00%      https://www.tradegate.de/orderbuch.php?isin=US00181T1079
A.P.Moeller-Mærsk A/S                         2 116,00   2 138,00   128        55         0,09%      https://www.tradegate.de/orderbuch.php?isin=DK0010244425
A.P.Moeller-Mærsk A/S B                       2 200,00   2 214,00   911        165        1,37%      https://www.tradegate.de/orderbuch.php?isin=DK0010244508
A.S. Création Tapeten AG                      12,70      13,30      0          0          0,00%      https://www.tradegate.de/orderbuch.php?isin=DE000A1TNNN5
A.S. Roma S.p.A.                              0,4465     0,455      1          1          1,68%      https://www.tradegate.de/orderbuch.php?isin=IT0001008876
A10 Networks Inc.                             13,40      13,745     160        2          1,03%      https://www.tradegate.de/orderbuch.php?isin=US0021211018
a2 Milk Co. Ltd., The                         3,7035     3,7665     822        2          0,09%      https://www.tradegate.de/orderbuch.php?isin=NZATME0002S8
A2A S.p.A.                                    1,1205     1,1315     1 000      1          2,21%      https://www.tradegate.de/orderbuch.php?isin=IT0001233417
A2B Australia Ltd.                            0,79       0,825      0          0          0,00%      https://www.tradegate.de/orderbuch.php?isin=AU0000032187
AAC Technologies Holdings Inc.                1,799      1,8785     0          0          0,00%      https://www.tradegate.de/orderbuch.php?isin=KYG2953R1149
Aadi Biosciences Inc.                         12,245     12,78      33         1          -1,98%     https://www.tradegate.de/orderbuch.php?isin=US00032Q1040
AAK AB                                        14,88      15,02      0          0          0,00%      https://www.tradegate.de/orderbuch.php?isin=SE0011337708
Aalberts N.V.                                 36,65      37,03      0          0          0,00%      https://www.tradegate.de/orderbuch.php?isin=NL0000852564
....
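As for the NoClassDefFoundError from the question: it generally means the class was found when compiling but is missing from the classpath when the program runs. A minimal sketch for checking this from inside the same run configuration could look like the following. ClasspathCheck and isClassAvailable are hypothetical names chosen for illustration, and the sketch assumes it is started the same way as the failing program.

```java
import java.io.File;

final class ClasspathCheck {

    // Hypothetical helper: true if the class loader that loaded this class
    // can locate the named class, false otherwise.
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Print every entry of the runtime classpath, one per line
        String classpath = System.getProperty("java.class.path", "");
        for (String entry : classpath.split(File.pathSeparator)) {
            System.out.println("classpath entry: " + entry);
        }
        String name = "org.apache.commons.lang3.StringUtils";
        System.out.println(name + " available: " + isClassAvailable(name));
    }
}
```

If the commons-lang3 JAR does not show up among the printed entries, the library is probably not part of the runtime classpath of that launch, even though the editor resolves the import. In that case the run configuration and the Classpath section of the build path settings are the places to look.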