How to scrape just four numeric values from an HTML web page's table in Java for Android?

Here's my current code:

 private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
String userAgent1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
try {
    Document doc1 = Jsoup.connect(url).userAgent(userAgent1).get();
    Elements divTags = doc1.getElementsByTag("div");
    String re = "^<div class=\\\"Ta\\(c\\) Py\\(6px\\) Bxz\\(bb\\) BdB Bdc\\(\\$seperatorColor\\) Miw\\(120px\\) Miw\\(100px\\)\\-\\-pnclg D\\(tbc\\)\\\" data-test=\\\"fin-col\\\"><span>.*</span></div>$";
    
    for (Element div : divTags) {
        Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
        Matcher matcher = pattern.matcher(div.html());

        if (matcher.find()) {
            String data = matcher.group(1);
            Log.d("Matched: ", data);
        }
        else {
            Log.d("Nothing Matched: ", "");
        }
    }
} catch (Exception e) {
    Log.e("err-new", "err", e);
}
return "";

}

This function takes a URL as input (in our case https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2) and extracts all the div tags using Jsoup.

I then need to extract these values using pattern matching. However, with the code above, all I get is "Nothing Matched: ".

Here's the web page from which I am interested in getting the four numeric values corresponding to the first four yearly columns, corresponding to the row named EBIT. (Stands for Earnings Before Interest and Taxes)

Link: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2

Input: Looking to get values 122,034,000, 111,852,000, 69,964,000, 69,313,000 on the EBIT row for columns 9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019.

On Inspect, these values are under the following <div> tags.

EBIT 1: <div data-test="fin-col"><span>122,034,000</span></div>

EBIT 2: <div data-test="fin-col"><span>111,852,000</span></div>

EBIT 3: <div data-test="fin-col"><span>69,964,000</span></div>

EBIT 4: <div data-test="fin-col"><span>69,313,000</span></div>

And the same thing for the 4 columns under the Quarterly tab on the same web page. Looking to get values 25,484,000, 23,785,000, 30,830,000, 41,935,000 on the EBIT row for columns 9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021.

EBIT 1: <div data-test="fin-col"><span>25,484,000</span></div>

EBIT 2: <div data-test="fin-col"><span>23,785,000</span></div>

EBIT 3: <div data-test="fin-col"><span>30,830,000</span></div>

EBIT 4: <div data-test="fin-col"><span>41,935,000</span></div>

Output: dates = {9/30/2022, 9/30/2021, 9/30/2020, 9/30/2019}

datesQ = {9/30/2022, 6/30/2022, 3/31/2022, 12/31/2021}

EBIT = {122,034,000, 111,852,000, 69,964,000, 69,313,000}

EBITQ = {25,484,000, 23,785,000, 30,830,000, 41,935,000}

Where Q stands for Quarterly.

Or it could be two hashmaps: yearlyHash = {date1: value1, date2: value2, date3: value3, date4: value4} and quarterlyHash = {date1: value1, date2: value2, date3: value3, date4: value4}

My existing code is broken. Basically, I used Jsoup to get all the JavaScript-related tags and a pattern matcher to get the String values I wanted. However, on the page I'm parsing now, some values in that tag appear to be encrypted strings that can no longer be parsed.

My use case is not that complex, as you can imagine. I just need the dates and the 4 values corresponding to that one row. Even a non-standard, non-optimized solution is fine with me.

Thank you.

CodePudding user response：

Annoyingly, the annual data is on the page as loaded, while the quarterly data is loaded with an AJAX call triggered by clicking on the "Quarterly" button. Anyway, the following code will do the job:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.text.NumberFormat;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.gson.Gson;

public class App {
    private static final String PAGE_URL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2";
    private static final String DATA_URL = "https://query1.finance.yahoo.com/ws/fundamentals-timeseries/v1/finance/timeseries/AAPL?lang=en-US&region=US&symbol=AAPL&padTimeSeries=true&type=quarterlyEBIT&merge=false&period1=493590046&period2=1674660504&corsDomain=finance.yahoo.com";

    private static final String REGEX_YAHOO_PAGE_EBIT = "^.*ttm</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?EBIT</span></div><div.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*?<span>(.*?)</span>.*$";
    private static final Pattern PATTERN_YAHOO_PAGE_REGEX = Pattern.compile(REGEX_YAHOO_PAGE_EBIT, Pattern.DOTALL);

    private static final Gson GSON = new Gson();

    private static final NumberFormat NUMBER_FORMAT = NumberFormat.getInstance(new Locale("en", "US"));

    public static void main(String[] args) throws IOException {
        String pageContent = fetch(PAGE_URL);
        Matcher m = PATTERN_YAHOO_PAGE_REGEX.matcher(pageContent);
        if (m.matches()) {
            System.out.println("Annual values");

            System.out.println(m.group(1) + ": " + m.group(6));
            System.out.println(m.group(2) + ": " + m.group(7));
            System.out.println(m.group(3) + ": " + m.group(8));
            System.out.println(m.group(4) + ": " + m.group(9));
        }

        // the quarterly data is not on the page. it is rendered dynamically from this
        // AJAX call
        String quarterlyData = fetch(DATA_URL);
        System.out.println("Quarterly values");
        Map map = GSON.fromJson(quarterlyData, Map.class);
        List<Map> result = (List<Map>) ((Map) map.get("timeseries")).get("result");
        for (Map entry : result) {
            Map meta = (Map) entry.get("meta");
            if (((List<String>) meta.get("type")).get(0).equals("quarterlyEBIT")) {
                List<Map<String, Object>> quarterlyEBIT = (List) entry.get("quarterlyEBIT");
                for (Map<String, Object> cell : quarterlyEBIT) {
                    System.out.print(cell.get("asOfDate") + ": ");
                    String fullNumberString = NUMBER_FORMAT
                            .format(((Map<String, Double>) cell.get("reportedValue")).get("raw"));
                    System.out.println(fullNumberString.substring(0, fullNumberString.length() - 4));

                }

            }
        }

    }

    private static String fetch(String url) throws MalformedURLException, IOException, UnsupportedEncodingException {
        URL pageUrl = new URL(url);
        HttpURLConnection pageConnection = (HttpURLConnection) pageUrl.openConnection();
        try {
            InputStream inputStream = new BufferedInputStream(pageConnection.getInputStream());
            int bufferSize = 1024;
            char[] buffer = new char[bufferSize];
            StringBuilder out = new StringBuilder();
            Reader in = new InputStreamReader(inputStream, "UTF-8");
            for (int numRead; (numRead = in.read(buffer, 0, buffer.length)) > 0;) {
                out.append(buffer, 0, numRead);
            }
            return out.toString();
        } finally {
            pageConnection.disconnect();
        }
    }
}

Output:

Annual values
9/30/2022: 122,034,000
9/30/2021: 111,852,000
9/30/2020: 69,964,000
9/30/2019: 69,313,000
Quarterly values
2021-12-31: 41,935,000
2022-03-31: 30,830,000
2022-06-30: 23,785,000
2022-09-30: 25,484,000
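
The question asked for the results as date-to-value maps rather than printed lines. Here is a minimal sketch of that last step, assuming the date and value strings have already been extracted as in the output above. The class name EbitMaps and its helper methods are hypothetical and not part of the code above; the sample strings are copied from the output rather than fetched.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class EbitMaps {
    private static final DateTimeFormatter US_DATE = DateTimeFormatter.ofPattern("M/d/yyyy");

    // Parses a grouped number such as "122,034,000" into a long.
    static long parseGrouped(String text) {
        try {
            return NumberFormat.getInstance(Locale.US).parse(text).longValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    // Converts an ISO date such as "2022-09-30" into "9/30/2022".
    static String toUsDate(String isoDate) {
        return LocalDate.parse(isoDate).format(US_DATE);
    }

    // Pairs each date with its value; LinkedHashMap keeps the column order.
    static Map<String, Long> toMap(String[] dates, String[] values) {
        Map<String, Long> map = new LinkedHashMap<>();
        for (int i = 0; i < dates.length; i++) {
            map.put(dates[i], parseGrouped(values[i]));
        }
        return map;
    }

    public static void main(String[] args) {
        // Annual figures exactly as printed above.
        String[] dates = {"9/30/2022", "9/30/2021", "9/30/2020", "9/30/2019"};
        String[] ebit = {"122,034,000", "111,852,000", "69,964,000", "69,313,000"};
        System.out.println(toMap(dates, ebit));

        // Quarterly figures arrive oldest first with ISO dates, so reorder and reformat them.
        String[] isoDatesQ = {"2021-12-31", "2022-03-31", "2022-06-30", "2022-09-30"};
        String[] ebitQ = {"41,935,000", "30,830,000", "23,785,000", "25,484,000"};
        Map<String, Long> quarterly = new LinkedHashMap<>();
        for (int i = isoDatesQ.length - 1; i >= 0; i--) {
            quarterly.put(toUsDate(isoDatesQ[i]), parseGrouped(ebitQ[i]));
        }
        System.out.println(quarterly);
    }
}
```

A LinkedHashMap is used so that iterating the map returns the columns in the same order as on the page; a plain HashMap would not guarantee that.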

If you prefer Apache HttpClient (v4 here) then fetch() can be coded as follows:

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

    private static String fetch(String url) throws IOException {
        // try-with-resources closes both the client and the response, even on failure
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
                CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            HttpEntity entity = response.getEntity();
            // fall back to UTF-8 when the response does not declare a charset
            return EntityUtils.toString(entity, "UTF-8");
        }
    }

CodePudding user response：

Instead of matching the div tags with a regular expression, you can select the span elements directly and extract the text inside them.

For example:

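// Select by the data-test attribute shown in the question; class names such as
// Ta(c) contain parentheses, which would need escaping in a selector.
Elements spans = doc1.select("div[data-test=fin-col] > span");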
for (Element span : spans) {
    String data = span.text();
    Log.d("Matched: ", data);
}

Alternatively, you can select only the divs that contain a span, using Jsoup's :has() pseudo-selector, and then read their text.

Elements divs = doc1.select("div[data-test=fin-col]:has(span)");
for (Element div : divs) {
    // text() returns the combined text of the div and its child span
    String data = div.text();
    Log.d("Matched: ", data);
}

Using CSS selectors is also possible.
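
If you still prefer a regular expression, note that matcher.group(1) only works when the pattern contains a capture group, and that Element.html() returns the inner HTML (just the span), not the surrounding div tag. Here is a minimal sketch run against the div markup quoted in the question. The class name SpanValues is hypothetical, and the sample string is copied from the question rather than fetched from the live page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SpanValues {
    // The parentheses around [^<]* form capture group 1: the text inside the span.
    private static final Pattern FIN_COL = Pattern.compile(
            "<div[^>]*data-test=\"fin-col\"[^>]*>\\s*<span>([^<]*)</span>\\s*</div>");

    // Returns the span text of every fin-col div found in the given HTML.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = FIN_COL.matcher(html);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String sample = "<div data-test=\"fin-col\"><span>122,034,000</span></div>"
                + "<div data-test=\"fin-col\"><span>111,852,000</span></div>";
        System.out.println(extract(sample));
    }
}
```

With Jsoup, the equivalent input for this pattern would be the element's outerHtml() rather than html(). The pattern is tied to the exact markup shown in the question, so it will stop matching if the page layout changes.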