I am not sure if the title makes sense; I am not very experienced with this kind of thing.
This is the situation:
I am running a Linux server with Ubuntu 20.04.
I run a program through .sh scripts that scrapes web pages from different URLs.
One URL returns a file that starts like this:
{"javaClass":"java.util.ArrayList","list":[
I am not sure whether this is a Java class file or a JSON file, because the URL has no file extension.
I can open this URL in my browser and it is displayed as text (that's how I can see the snippet above).
If I call it with curl in the Ubuntu terminal, nothing happens.
How can I display and read this as plain text in the Ubuntu terminal, like I do in my Chrome browser, so that I can process the data?
EDIT: The URL in question is this: https://www.yes.co.il/o/yes/servletlinearsched/getscheduale?startdate=20211025&p_auth=w3wmBNc5
EDIT2: The token at the end is different every time. I am reading the token correctly, so that is not the issue.
I found out that the site's requests include an x-dtpc cookie. This is what I found about x-dtpc cookies:
this header is set by the JavaScript agent on XHRs and is used for correlating XHR requests to user actions
When I open the page and copy the URL from the developer console, I can open that URL in a new tab. If I get the URL through my script, I cannot open it in the browser.
I load the standard cookies, but apparently that is not enough.
CodePudding user response:
That is the output of a tool that 'serializes' objects, i.e. turns objects that live in memory into a byte-based representation that can be transported over a network or stored on disk; specifically, a tool that serialized an ArrayList into JSON.
The best way to read this with Java code is to figure out which tool was used, and use the same tool. It's not baked into Java itself; it's some third-party library such as Jackson.
All such tools I know of have a hardcoded special exception for anything that extends java.util.List (such as ArrayList) and just treat it as a plain JSON list. So I'm a bit mystified as to what tool has been used here.
But, either [A] find the tool, or [B] reverse-engineer the output (see the sketch below).
NB: Ordinarily such serializer tools represent in the JSON both the class name and each field. However, the relevant field in ArrayList is called elementData, not list. That raises a further eyebrow: this is just bizarre.
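If the goal is only to process the payload from the terminal (option [B]), the wrapper can simply be ignored and the body treated as ordinary JSON. A minimal sketch, assuming the response has already been saved to a file and that jq is installed (jq is not mentioned in the question; it is just one convenient choice, and 'response.json' is a placeholder file name):
# Treat the payload as plain JSON and extract the serialized ArrayList;
# '.list' matches the field name visible in the snippet at the top of the question.
jq '.list' response.json
# Quick sanity check: how many entries does the list contain?
jq '.list | length' response.json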
CodePudding user response:
In the Chrome dev tools you can right-click on the specific request and copy the corresponding cURL command. When I opened the page, the following command was created:
curl 'https://www.yes.co.il/o/yes/servletlinearsched/getscheduale' \
-H 'authority: www.yes.co.il' \
-H 'sec-ch-ua: "Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"' \
-H 'dnt: 1' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36' \
-H 'x-dtpc: 1$418482539_291h9vJHPPKPVFOURUFMBUBCNHWPOEEHUEJATA-0e5' \
-H 'accept: text/plain, */*; q=0.01' \
-H 'x-requested-with: XMLHttpRequest' \
-H 'content-type: application/x-www-form-urlencoded' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'origin: https://www.yes.co.il' \
-H 'sec-fetch-site: same-origin' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-dest: empty' \
-H 'referer: https://www.yes.co.il/content/tvguide' \
-H 'accept-language: en-US,en;q=0.9,de;q=0.8' \
-H 'cookie: TS01be6705=01ef05715da91cbd2bd3d3708b62fc37483302654019cffd2d42c61fe786389bea0697bdddc6cfaa989d8e83d7ecaf362d163c3ba4d4cc2cb2171d826bc2189e24b33b372f; COOKIE_SUPPORT=true; GUEST_LANGUAGE_ID=iw_IL; rxVisitor=1635415629464DQEQ6RQ2S8SQQ1QCFS5J66I2US8LK1NG; _gcl_au=1.1.2116753794.1635415632; _gid=GA1.3.1516801157.1635415633; dtSa=-; _ga=GA1.3.1003301072.1635415633; LFR_SESSION_STATE_33706=1635417685179; JSESSIONID=B399B241579DB87AF9FEC02AD72D62CF.worker_ip-10-0-3-108.eu-west-1.compute.internal; dtCookie==3=srv=1=sn=BF10E6A2EFEE80BE2CAD31F302CAF608=perc=100000=ol=0=mul=1=app:e6d1c681b48e20c9=0; _ga_H6Z9EGVSQX=GS1.1.1635418482.2.0.1635418482.0; dtLatC=14; AWSALB=1Ky1qIweYH/VBDu2pUv/DACVcWq5dmx3PPhlghfLR0g4oNTdMV78d7G08LreVX0l2Lvm0wdW5oRh 3j THyZKDQmVldChB6XScu8 BVkqbSymgNrvMm4dOdT6TNL; AWSALBCORS=1Ky1qIweYH/VBDu2pUv/DACVcWq5dmx3PPhlghfLR0g4oNTdMV78d7G08LreVX0l2Lvm0wdW5oRh 3j THyZKDQmVldChB6XScu8 BVkqbSymgNrvMm4dOdT6TNL; TS01542e32=01ef05715d1819fe7529d33b6e77731cfbb0015d8964ad5bb6c2e651cbb0a0a4b1882c8fa451246095770e14517339cd0a68861c971608fb5c8f0731d6c968c519acbcd35ea68d53e6fc4ceb067d99238c1dfce91a494ac4abf599a7d66a54b92c363e5c1b004d2d59e86f6198819dd2f1b43c0191; dtPC=1$418482539_291h9vJHPPKPVFOURUFMBUBCNHWPOEEHUEJATA-0e5; rxvt=1635420284821|1635415629471' \
--data-raw 'startdate=20211028&p_auth=c43Cdm7P' \
--compressed
Most of the headers don't seem to be relevant, so I was able to boil it down to:
curl --location --request POST 'https://www.yes.co.il/o/yes/servletlinearsched/getscheduale' \
--header 'content-type: application/x-www-form-urlencoded' \
--header 'cookie: JSESSIONID=B399B241579DB87AF9FEC02AD72D62CF.worker_ip-10-0-3-108.eu-west-1.compute.internal;' \
--data-urlencode 'startdate=20211028' \
--data-urlencode 'p_auth=c43Cdm7P'
There is, however, the JSESSIONID cookie. It is set by the initial request to the page and potentially updated by further requests. That's why a request only works for a limited time without a new session id. Your scraper will have to extract a session id from a previous request and use that one.
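A minimal sketch of that flow in shell, assuming the TV guide page (the referer seen above) is what hands out the JSESSIONID, and that the p_auth token is already being read correctly, as described in the question; curl's cookie jar keeps the session between the two calls:
# 1. Hit the TV guide page once and store whatever cookies the server sets
#    (JSESSIONID and the rest) in a local cookie jar.
curl -s -c cookies.txt 'https://www.yes.co.il/content/tvguide' > /dev/null
# 2. Reuse the same cookie jar for the schedule request. P_AUTH is a
#    placeholder for the token the existing script already extracts.
P_AUTH='c43Cdm7P'
curl -s -b cookies.txt \
  --request POST 'https://www.yes.co.il/o/yes/servletlinearsched/getscheduale' \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data-urlencode 'startdate=20211028' \
  --data-urlencode "p_auth=${P_AUTH}"
Using the cookie jar instead of copying the JSESSIONID by hand also picks up the other cookies seen in the full request (TS01..., AWSALB, etc.), in case the server ever starts insisting on those as well.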