
Time:10-19

How can I remove URLs that contain special characters like "#" and short substrings like ".pdf" from a string array of URLs?

I am making a web spider. The goal is to generate a complete list of a website's URLs, minus the ones I don't want. The spider goes to the home page, grabs the URLs there, visits each of them, and then visits every URL found on those pages that it hasn't already visited.

I am trying to clean up the results by removing URLs that contain ".zip", ".pdf", a space (" "), or a "#" from my string array.

I tried to do this with a version of the following if statement. This is a simplified version that assumes there is a string array full of URLs (ValueOfGiantStringArray), an int called TotalNumberOfUrls holding the total number of URLs, and an int counter x used as the index.

String j = "";

while (x != TotalNumberOfUrls)
{
    j = ValueOfGiantStringArray[x];
    if (!(j.contains("#")) || !(j.contains(" ")) || !(j.contains(".pdf")) || !(j.contains(".zip")))
    {
        // Runs a scraping module on the url contained in the string j.
    }

    x++;
}

This didn't work for me. For some reason, the scraper still runs inside this if statement even when j has a value such as "https://procomps.com/cherry-services/cadalog/#content". It doesn't seem to detect the #.

What is the best way to weed out URLs with these unwanted characters and text chunks from my string array of URLs?

CodePudding user response:

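Change the ||s to &&s in the condition of your if statement. With ||, the condition is true as soon as any one of the unwanted strings is missing, which is the case for almost every URL. With &&, the code inside the if block only runs if none of the unwanted strings are in the URL.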
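To make the difference concrete, here is a minimal sketch that runs both conditions on the URL from the question. The class and method names (ConditionDemo, originalCondition, correctedCondition) are made up for this illustration.

```java
class ConditionDemo {
    // The condition from the question: true as soon as ANY one marker is missing.
    static boolean originalCondition(String j) {
        return !(j.contains("#")) || !(j.contains(" ")) || !(j.contains(".pdf")) || !(j.contains(".zip"));
    }

    // The corrected condition: true only if ALL markers are missing.
    static boolean correctedCondition(String j) {
        return !j.contains("#") && !j.contains(" ") && !j.contains(".pdf") && !j.contains(".zip");
    }

    public static void main(String[] args) {
        String j = "https://procomps.com/cherry-services/cadalog/#content";
        // true: j has no space, so the || chain passes despite the "#"
        System.out.println(originalCondition(j));
        // false: the "#" alone is now enough to reject j
        System.out.println(correctedCondition(j));
    }
}
```

The # is detected in both cases. The original condition simply ignores that result because another term in the || chain is already true.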

CodePudding user response:

Your if condition is wrong. In general, "something doesn't contain a or b or c" is written as !( contains(a) || contains(b) || contains(c) ) or, by De Morgan's laws, as the equivalent expression !contains(a) && !contains(b) && !contains(c).
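If you want to convince yourself that the two forms really are interchangeable, a small brute-force check over every combination of three booleans is enough. This is only a sketch; the names DeMorganCheck, negatedOr, andOfNegations and formsAgree are invented for the example, and a, b, c stand in for the three contains(...) results.

```java
class DeMorganCheck {
    // Form 1: negate the whole OR chain.
    static boolean negatedOr(boolean a, boolean b, boolean c) {
        return !(a || b || c);
    }

    // Form 2: AND the individual negations.
    static boolean andOfNegations(boolean a, boolean b, boolean c) {
        return !a && !b && !c;
    }

    // Returns true if both forms give the same result for all 8 combinations of a, b, c.
    static boolean formsAgree() {
        boolean[] values = { false, true };
        for (boolean a : values) {
            for (boolean b : values) {
                for (boolean c : values) {
                    if (negatedOr(a, b, c) != andOfNegations(a, b, c)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Forms agree on all inputs: " + formsAgree());
    }
}
```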

Assuming you have an array that looks something like the following and you are using a while loop, something like this should work:

String[] myURLs = { "someURL.com/somePath/file.zip",
                    "someURL.com/somePath/blabla#blup",
                    "someURL.com/somePath/another.pdf",
                    "someURL.com/somePath/some-content",
                    "http://myapp/mypage/mycontent",
                    "http://myapp/my test/jjj"};
String j = "";

int x = 0;
while (x != myURLs.length) {
    j = myURLs[x];
    if (!(j.contains("#") || j.contains(" ") || j.contains(".pdf") || j.contains(".zip"))) {
        // Runs a scraping module on the url contained in the string j.
        System.out.println("Scraping content from URL: " + j);
    }
    x++;
}

IMO a for loop would make your code more readable than a while loop. Using the same array as above:

for (int i = 0; i < myURLs.length; i++) {
    String temp = myURLs[i];
    if (!(temp.contains("#") || temp.contains(" ") || temp.contains(".pdf") || temp.contains(".zip"))) {
        // Runs a scraping module on the url contained in the string temp.
        System.out.println("Scraping content from URL: " + temp);
    }
}

If you want to play around with Java 8 streams (note that Set.of requires Java 9 or later):

import java.util.Arrays;
import java.util.Set;

Set<String> exclude = Set.of("#", " ", ".pdf", ".zip");
Arrays.stream(myURLs)
        // Keep only URLs that contain none of the excluded strings.
        .filter(url -> exclude.stream().noneMatch(url::contains))
        .forEach(url -> {
            // Runs a scraping module on the url contained in the string url.
            System.out.println("Scraping content from URL: " + url);
        });

Or with a regular expression:

import java.util.regex.Pattern;

// Matches a "#", a space, ".pdf" or ".zip" anywhere in the string.
Pattern pattern = Pattern.compile("#|\\ |\\.pdf|\\.zip");
for (String url : myURLs) {
    if (!pattern.matcher(url).find()) {
        // Runs a scraping module on the url contained in the string url.
        System.out.println("Scraping content from URL: " + url);
    }
}
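Since the question asks about weeding the URLs out of the array rather than just skipping them, any of the loops above could also be wrapped in a small helper that returns the cleaned list. The following is a rough sketch under that assumption; UrlFilter, EXCLUDED, isWanted and keepWanted are hypothetical names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class UrlFilter {
    // Unwanted markers taken from the question; extend the list as needed.
    private static final String[] EXCLUDED = { "#", " ", ".pdf", ".zip" };

    // True only if the URL contains none of the excluded markers.
    static boolean isWanted(String url) {
        for (String marker : EXCLUDED) {
            if (url.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    // Returns a new list with the wanted URLs, in their original order.
    static List<String> keepWanted(String[] urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (isWanted(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] myURLs = { "someURL.com/somePath/file.zip",
                            "someURL.com/somePath/blabla#blup",
                            "someURL.com/somePath/another.pdf",
                            "someURL.com/somePath/some-content",
                            "http://myapp/mypage/mycontent",
                            "http://myapp/my test/jjj" };
        for (String url : keepWanted(myURLs)) {
            System.out.println("Scraping content from URL: " + url);
        }
    }
}
```

Keep in mind that String.contains is case-sensitive, so a URL ending in ".PDF" would pass this filter. If that matters, one option is to compare against url.toLowerCase() instead.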