Any way to optimize reading a large (127K) English words txt file?

Time:12-20

This is my function:

public void addToList() throws IOException {
    String urlString = "http://web.stanford.edu/class/archive/cs/cs106l/cs106l.1102/assignments/dictionary.txt";
    URL url = new URL(urlString);
    Scanner scannerWords = new Scanner(url.openStream());
    while (scannerWords.hasNextLine()) {
        words.add(scannerWords.nextLine());
    }
}

This takes 32.8 seconds to run.

Is there any way I can optimize it (maybe by reading 10 lines at a time)?

CodePudding user response:

Here's my attempt. Instead of using the Scanner, I read character by character from a BufferedInputStream. This reduces the overhead of the extra layers that Scanner adds.

        String urlString = "http://web.stanford.edu/class/archive/cs/cs106l/cs106l.1102/assignments/dictionary.txt";
        InputStream stream = new URL(urlString).openStream();
        
        
        BufferedInputStream bufferedStream = new BufferedInputStream(stream);
        ArrayList<String> words = new ArrayList<>();
        char[] chars = new char[100];
        int index = 0;
        
        
        long currentTimeMillis = System.currentTimeMillis();
        while(true){
            int c = bufferedStream.read();
            if (c == '\n'){
                words.add(new String(chars, 0, index));
                index=0;
            } else if (c < 0){
                words.add(new String(chars, 0, index));
                break;
            } else {
                chars[index++] = (char) c;
            }
        }
        long currentTimeMillis1 = System.currentTimeMillis();
        
        stream.close();
        
        System.out.println("Time       = " + (currentTimeMillis1 - currentTimeMillis) + " ms");
        System.out.println("Word count = " + words.size());
        System.out.println("First word = " + words.get(0));
        System.out.println("Last word  = " + words.get(words.size() - 1));

    }

Output

run:
Time       = 707 ms
Word count = 127142
First word = aa
Last word  = zyzzyvas
BUILD SUCCESSFUL (total time: 0 seconds)

CodePudding user response:

Well, the obvious thing is to just download the word list once and use the local copy instead of fetching it over the net every time you run your program.
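As a rough sketch of the "download once, then use the local copy" idea: the class name WordListCache, the method loadWords, and the cacheFile parameter are made up for illustration, and the UTF-8 encoding is an assumption about the word list.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the original question's code.
final class WordListCache {

    // Downloads the word list only when no local copy exists yet,
    // then always reads the words from the local file.
    static List<String> loadWords(String urlString, Path cacheFile) throws IOException {
        if (!Files.exists(cacheFile)) {
            // try-with-resources closes the network stream even if the copy fails
            try (InputStream in = new URL(urlString).openStream()) {
                Files.copy(in, cacheFile);
            }
        }
        // Assumes the file is UTF-8 text with one word per line
        return Files.readAllLines(cacheFile, StandardCharsets.UTF_8);
    }
}
```

Only the first run pays the network cost. Note that if the download is interrupted, a partial file may be left behind, so a more careful version would download to a temporary file and move it into place afterwards.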

You also have a resource leak because you never close the stream returned by URL.openStream() (the same issue would exist if you changed your current code to read from a file). That's easy to fix by adding scannerWords.close(); after the loop, but a better, exception-safe way is to use try-with-resources.
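For reference, a minimal sketch of the Scanner loop from the question wrapped in try-with-resources. The class name ScannerWordReader and both method names are invented for this example, and the "UTF-8" charset is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, shown only to illustrate the try-with-resources fix.
final class ScannerWordReader {

    // Reads one word per line; closing the Scanner also closes the wrapped stream.
    static List<String> readWords(InputStream in) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            while (scanner.hasNextLine()) {
                words.add(scanner.nextLine());
            }
        }
        return words;
    }

    // Same loop as in the question, but the stream is closed even if an exception is thrown.
    static List<String> readWordsFromUrl(String urlString) throws IOException {
        try (InputStream in = new URL(urlString).openStream()) {
            return readWords(in);
        }
    }
}
```

This only fixes the leak; it does not by itself make the Scanner any faster.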

I'd dispense with the Scanner entirely and just use a BufferedReader, though. Something like:

import java.net.URL;
import java.util.*;
import java.util.stream.*;
import java.io.*;

// ...


private List<String> readLinesFromURL(String url) throws IOException {
    try (BufferedReader br
         = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {         
        return br
            .lines()
            .collect(Collectors.toCollection(ArrayList<String>::new));
    }
}

CodePudding user response:

  1. Fetch all the data in one go
  2. Then split or filter it to get the words you expect

  public static void main(String[] args) throws IOException {
       printWords(new ArrayList<>(150000));
    }

  private static void printWords(List<String> list) throws IOException {
        final long l = System.currentTimeMillis();
        String urlString = "http://web.stanford.edu/class/archive/cs/cs106l/cs106l.1102/assignments/dictionary.txt";
        URL url = new URL(urlString);
        final long l2;
        final long l3;
        Charset encoding=Charset.defaultCharset();
        try (Scanner scanner = new Scanner(url.openStream(), String.valueOf(encoding))) {
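            l2 = System.currentTimeMillis(); // the stream is open, but its content has not been read yet
            // "\\A" matches the start of input, so next() returns the whole stream as a single token
            String content = scanner.useDelimiter("\\A").next();
            list = Arrays.asList(content.split("\\n"));
            l3 = System.currentTimeMillis(); // includes reading the stream content and splitting it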
            //System.out.println(list);
        }
        final long l4 = System.currentTimeMillis();
        System.out.println(String.format("Total Time: %d",l4-l));
        System.out.println(String.format("Data fetching Time: %d",l2-l));
        System.out.println(String.format("Data collection Time: %d",l3-l2));
    }

Output:

Total Time: 2482
Data fetching Time: 465
Data collection Time: 2017