Why does wget only retrieve 13KB HTML file instead of .dat file I want to download?

Time:07-07

I am trying to download JRA-55 reanalysis data in the form of .dat files from the Data Integration and Analysis System using wget in Linux.

I used the following:

wget --user=$USER --password=$PASSWD https://data.diasjp.net/dl/storages/filelist/dataset:204/jra55/Hist/Daily/fcst_phy2m/201712/fcst_phy2m.2017122900.dat -P $outdir

However, instead of getting the actual .dat file, I receive a 13 KB HTML file. When I open the HTML file it is the login webpage to the Data Integration & Analysis download site. I am unsure how to fix this issue. Any help is appreciated, thank you!

CodePudding user response:

If you pass --user and --password to wget, it attempts HTTP Basic Authentication.

But the website you are trying to download from doesn't accept Basic Authentication. Instead, when you log in, your credentials are exchanged for some form of token: a session cookie or a JWT sent in a header.
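For reference, Basic Authentication amounts to little more than attaching one header to each request. The sketch below builds that header by hand; the credentials alice and s3cret are made up for illustration. A site that expects a session cookie or token will not treat this header as a login.

```shell
# Hypothetical credentials, used only for illustration.
user='alice'
pass='s3cret'

# Basic authentication sends "user:password", base64-encoded, in a request header.
basic_token=$(printf '%s:%s' "$user" "$pass" | base64)
auth_header="Authorization: Basic $basic_token"

# Show the header that would be sent along with each request.
echo "$auth_header"
```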

Try logging in and downloading with your web browser while the web console is open. You will see the traffic details, headers, cookies, etc. You can even right-click an entry and choose "Copy as cURL" to get the exact curl command with all cookies and headers.

From there, you might need to find a way to log in and download repeatedly with curl or wget. Maybe a scraper would be more helpful here.

CodePudding user response:

You get redirected to an auth page: https://auth.diasjp.net/cas/login

That page either sets a cookie or returns an auth token that is then used for the download call.

In short: your username/password won't work when passed directly to wget.
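A script can catch this situation by checking whether the saved file is actually HTML. This is a minimal sketch: looks_like_html is a hypothetical helper name, and it assumes the login page begins with a doctype or an html tag within its first 512 bytes.

```shell
# Return 0 if the file starts like an HTML document, 1 otherwise.
# Only the first 512 bytes are inspected, so large .dat files stay cheap to check.
looks_like_html() {
    head -c 512 "$1" | grep -qiE '<!doctype html|<html'
}

# Example call (not run here), with a placeholder path:
# if looks_like_html "$outdir/fcst_phy2m.2017122900.dat"; then
#     echo "Got a login page instead of data" >&2
# fi
```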

You can press F12 in Chrome to open the Dev Tools; go to the Network tab and take a look at the calls made when you download the file in the browser. Be prepared for some digging. Try to reproduce the download in Postman.

A cookie can be passed to wget or curl, and an auth token usually shows up as a Bearer token in the Authorization header of the HTTP request. However, both may only stay valid for a limited time (hours to weeks).
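Once the browser's network tab shows which of the two the site uses, the value can be replayed with wget's --header option. The helper names below are hypothetical, and SESSIONID and abc123 are placeholders; the real cookie name and value have to be copied from the browser.

```shell
# Build the request header for each kind of credential.
cookie_header() { printf 'Cookie: %s=%s' "$1" "$2"; }
bearer_header() { printf 'Authorization: Bearer %s' "$1"; }

# Download url ($1) into directory ($2), sending one extra header ($3).
fetch_with_header() {
    wget --header="$3" -P "$2" "$1"
}

# Example call (not run here); replace the placeholders with real values:
# fetch_with_header "$url" "$outdir" "$(cookie_header SESSIONID abc123)"
```

Because the cookie or token expires, the value has to be refreshed from a new browser login when downloads start returning the login page again.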
