I am not sure if the title makes sense; I am not very experienced with this kind of thing.
This is the situation:
I am running a Linux server with Ubuntu 20.04.
I run a program through .sh scripts that scrapes web pages from different URLs.
One URL returns a file that starts like this:
{"javaClass":"java.util.ArrayList","list":[
I am not sure whether this is a Java class file or a JSON file, because the URL has no file extension.
I can open this URL in my browser and it is displayed as text (that's how I can see the snippet above).
If I call it with curl in the Ubuntu terminal, nothing happens.
How can I display and read this as plain text in the Ubuntu terminal, like I do in my Chrome browser, so that I can process the data?
EDIT: The URL in question is this: https://www.yes.co.il/o/yes/servletlinearsched/getscheduale?startdate=20211025&p_auth=w3wmBNc5
EDIT2: The token at the end is different every time. I am reading the token correctly, so that is not the issue.
I found out that the site's requests include an x-dtpc cookie. This is what I found about x-dtpc cookies:
this header is set by the JavaScript agent on XHRs and is used for correlating XHR requests to user actions
When I open the page and copy the URL from the developer console, I can open that URL in a new tab. If I get the URL through my script, I cannot open it in the browser.
I load the standard cookies, but apparently that is not enough.
CodePudding user response:
That is the output of a tool that 'serializes' objects, i.e. turns objects that live in memory into a byte-based representation that can be transported over a network or stored on disk; specifically, a tool that serialized an ArrayList into JSON.
The best way to read this with Java code is to figure out which tool was used, and use the same tool. It's not baked into Java itself; it's some third-party library such as Jackson.
All such tools I know of have a hardcoded special exception for anything that extends java.util.List (such as ArrayList) and just treat it as a plain JSON list. So I'm a bit mystified as to what tool has been used here.
But, either [A] find the tool, or [B] reverse-engineer the output (see the sketch below).
NB: Ordinarily such serializer tools represent in the JSON both the class name and each field. However, the relevant field in ArrayList is called elementData, not list. That raises a further eyebrow: this is just bizarre.
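If the goal is only to process the payload from the terminal (option [B]), the wrapper can simply be ignored and the body treated as ordinary JSON. A minimal sketch, assuming the response has already been saved to a file and that jq is installed (jq is not mentioned in the question; it is just one convenient choice, and 'response.json' is a placeholder file name):
# Treat the payload as plain JSON and extract the serialized ArrayList;
# '.list' matches the field name visible in the snippet at the top of the question.
jq '.list' response.json
# Quick sanity check: how many entries does the list contain?
jq '.list | length' response.json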
CodePudding user response:
In the Chrome dev tools you can right-click on the specific request and copy the corresponding cURL command. When I opened the page, the following command was created:
curl 'https://www.yes.co.il/o/yes/servletlinearsched/getscheduale' \
-H 'authority: www.yes.co.il' \
-H 'sec-ch-ua: "Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"' \
-H 'dnt: 1' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36' \
-H 'x-dtpc: 1$418482539_291h9vJHPPKPVFOURUFMBUBCNHWPOEEHUEJATA-0e5' \
-H 'accept: text/plain, */*; q=0.01' \
-H 'x-requested-with: XMLHttpRequest' \
-H 'content-type: application/x-www-form-urlencoded' \
-H 'sec-ch-ua-platform: "macOS"' \
-H 'origin: https://www.yes.co.il' \
-H 'sec-fetch-site: same-origin' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-dest: empty' \
-H 'referer: https://www.yes.co.il/content/tvguide' \
-H 'accept-language: en-US,en;q=0.9,de;q=0.8' \
-H 'cookie: TS01be6705=01ef05715da91cbd2bd3d3708b62fc37483302654019cffd2d42c61fe786389bea0697bdddc6cfaa989d8e83d7ecaf362d163c3ba4d4cc2cb2171d826bc2189e24b33b372f; COOKIE_SUPPORT=true; GUEST_LANGUAGE_ID=iw_IL; rxVisitor=1635415629464DQEQ6RQ2S8SQQ1QCFS5J66I2US8LK1NG; _gcl_au=1.1.2116753794.1635415632; _gid=GA1.3.1516801157.1635415633; dtSa=-; _ga=GA1.3.1003301072.1635415633; LFR_SESSION_STATE_33706=1635417685179; JSESSIONID=B399B241579DB87AF9FEC02AD72D62CF.worker_ip-10-0-3-108.eu-west-1.compute.internal; dtCookie==3=srv=1=sn=BF10E6A2EFEE80BE2CAD31F302CAF608=perc=100000=ol=0=mul=1=app:e6d1c681b48e20c9=0; _ga_H6Z9EGVSQX=GS1.1.1635418482.2.0.1635418482.0; dtLatC=14; AWSALB=1Ky1qIweYH/VBDu2pUv/DACVcWq5dmx3PPhlghfLR0g4oNTdMV78d7G08LreVX0l2Lvm0wdW5oRh 3j THyZKDQmVldChB6XScu8 BVkqbSymgNrvMm4dOdT6TNL; AWSALBCORS=1Ky1qIweYH/VBDu2pUv/DACVcWq5dmx3PPhlghfLR0g4oNTdMV78d7G08LreVX0l2Lvm0wdW5oRh 3j THyZKDQmVldChB6XScu8 BVkqbSymgNrvMm4dOdT6TNL; TS01542e32=01ef05715d1819fe7529d33b6e77731cfbb0015d8964ad5bb6c2e651cbb0a0a4b1882c8fa451246095770e14517339cd0a68861c971608fb5c8f0731d6c968c519acbcd35ea68d53e6fc4ceb067d99238c1dfce91a494ac4abf599a7d66a54b92c363e5c1b004d2d59e86f6198819dd2f1b43c0191; dtPC=1$418482539_291h9vJHPPKPVFOURUFMBUBCNHWPOEEHUEJATA-0e5; rxvt=1635420284821|1635415629471' \
--data-raw 'startdate=20211028&p_auth=c43Cdm7P' \
--compressed
Most of the headers don't seem to be relevant, so I was able to boil it down to:
curl --location --request POST 'https://www.yes.co.il/o/yes/servletlinearsched/getscheduale' \
--header 'content-type: application/x-www-form-urlencoded' \
--header 'cookie: JSESSIONID=B399B241579DB87AF9FEC02AD72D62CF.worker_ip-10-0-3-108.eu-west-1.compute.internal;' \
--data-urlencode 'startdate=20211028' \
--data-urlencode 'p_auth=c43Cdm7P'
There is, however, the JSESSIONID cookie. It is set by the initial request to the page and potentially updated by further requests. That's why a request only works for a limited time without a new session id. Your scraper will have to extract a session id from a previous request and use that one.
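A minimal sketch of that flow in shell, assuming the TV guide page (the referer seen above) is what hands out the JSESSIONID, and that the p_auth token is already being read correctly, as described in the question; curl's cookie jar keeps the session between the two calls:
# 1. Hit the TV guide page once and store whatever cookies the server sets
#    (JSESSIONID and the rest) in a local cookie jar.
curl -s -c cookies.txt 'https://www.yes.co.il/content/tvguide' > /dev/null
# 2. Reuse the same cookie jar for the schedule request. P_AUTH is a
#    placeholder for the token the existing script already extracts.
P_AUTH='c43Cdm7P'
curl -s -b cookies.txt \
  --request POST 'https://www.yes.co.il/o/yes/servletlinearsched/getscheduale' \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data-urlencode 'startdate=20211028' \
  --data-urlencode "p_auth=${P_AUTH}"
Using the cookie jar instead of copying the JSESSIONID by hand also picks up the other cookies seen in the full request (TS01..., AWSALB, etc.), in case the server ever starts insisting on those as well.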