I have a URL to a webpage that displays some images. I want to download the main image from that page.
If I right-click the image in the browser and choose "Open image in new tab", I get the image (JPG). However, if I try to download the image using the code below, I get an HTML file.
I tried setting the "Referer" header to the page's URL, and also to the URL of the image. I still get HTML instead of an image.
function DownloadFile(CONST Url, Referer: String; OUT Data: TBytes; PostData: String= ''; SSL: Boolean = FALSE): Boolean; { TESTED OK }
VAR
Buffer : array[0..High(Word)*4] of Byte; { Buffer of about 256 KB (262141 bytes) }
TempBytes : TBytes;
sMethod : string;
BytesRead : Cardinal;
pSession : HINTERNET;
pConnection: HINTERNET;
pRequest : HINTERNET;
Resource : string;
Root : string;
port : Integer;
flags : DWord;
Header : string;
begin
Result := FALSE;
SetLength(Data, 0);
pSession := InternetOpen(nil {USER_AGENT}, INTERNET_OPEN_TYPE_PRECONFIG, nil, nil, 0);
if Assigned(pSession) then
TRY
{ Autodetect port }
port:= UrlExtractPort(URL);
if port = 0 then
if SSL
then Port := INTERNET_DEFAULT_HTTPS_PORT
else Port := INTERNET_DEFAULT_HTTP_PORT;
{ Root }
Root:= UrlExtractDomainRelaxed(Url);
pConnection := InternetConnect(pSession, PWideChar(Root), port, nil, nil, INTERNET_SERVICE_HTTP, 0, 0); { The second parameter of InternetConnect should contain only the name of the server, not the entire URL of the server-side script. }
if Assigned(pConnection) then
TRY
if (PostData = '')
then sMethod := 'GET'
else sMethod := 'POST';
if SSL
then flags := INTERNET_FLAG_SECURE OR INTERNET_FLAG_KEEP_CONNECTION
else flags := INTERNET_FLAG_RELOAD; // INTERNET_FLAG_RELOAD = forces a download of the requested file, object, or directory listing from the origin server, not from the cache.
Resource := UrlExtractResourceParams(Url);
pRequest := HTTPOpenRequest(pConnection, PWideChar(sMethod), PWideChar(Resource), nil, nil, nil, flags, 0); { The third parameter of HttpOpenRequest is the file name (URL) of the script }
if Assigned(pRequest) then
TRY
Header:= '';
if Referer > ''
then Header:= Header + 'Referer: ' + Referer + sLineBreak;
Header:= Header + 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0' + sLineBreak;
//Header:= Header + 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59' + sLineBreak; // Microsoft Edge UA string
Header:= Header + 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' + sLineBreak;
Header:= Header + 'Accept-Language: en-us,en;q=0.5' + sLineBreak;
Header:= Header + 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' + sLineBreak;
Header:= Header + 'Keep-Alive: 70' + sLineBreak; { In windows, default is 60 sec }
Header:= Header + 'Connection: keep-alive' + sLineBreak + sLineBreak;
HttpAddRequestHeaders(pRequest, PWideChar(Header), Length(Header), HTTP_ADDREQ_FLAG_ADD);
Result:= HTTPSendRequest(pRequest, NIL, 0, Pointer(PostData), Length(PostData)); { The actual POST data is the fourth parameter }
if Result then
REPEAT
ZeroMemory(@Buffer, SizeOf(Buffer));
{ Download bytes }
InternetReadFile(pRequest, @Buffer, SizeOf(Buffer), BytesRead);
{ We stop? }
if BytesRead= 0 then break;
{ Convert static array to dynamic array }
SetLength(TempBytes, BytesRead);
Move(Buffer[0], TempBytes[0], BytesRead);
{ Merge arrays }
Data:= Data + TempBytes;
UNTIL BytesRead= 0;
FINALLY
InternetCloseHandle(pRequest);
END
else
RaiseLastOSError;
finally
InternetCloseHandle(pConnection);
end;
finally
InternetCloseHandle(pSession);
end;
end;
The URL is 'https://art42.tumblr.com/random'.
I will try Indy tomorrow. Hopefully the "Referer" is working there....
CodePudding user response:
It works as intended:
You request GET https://art42.tumblr.com/random, which is answered with HTTP 302 Found. That response implies there must be a Location header in the answer, pointing to a new URL we should query. In my case the whole header is:
Location: https://art42.tumblr.com/post/158825454869#_=_
That URL is a new GET request, which is finally answered with HTTP 200 OK. This response mostly means we have some payload, and the Content-Type header should help us treat it the correct way. In my case the whole header is:
Content-Type: text/html; charset=UTF-8
Which means it is a text document with HTML content, encoded in UTF-8. That's nice - needing to parse a PDF or EXE would be less trivial.
That's it: everything worked as expected. It is still a website - you can even see that from all the text around the picture. Just because one picture is embedded doesn't make the whole payload a picture, too.
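A program can make the same distinction by looking at the Content-Type value before treating the payload as a picture. The following is only a minimal sketch: it assumes the header value has already been read (with WinInet this could be done through HttpQueryInfo with HTTP_QUERY_CONTENT_TYPE), and IsImageContentType is a hypothetical helper name, not part of any library.

```pascal
uses
  System.SysUtils;

// Hypothetical helper: returns True when a Content-Type header value
// (for example 'image/jpeg' or 'text/html; charset=UTF-8') names an image type.
function IsImageContentType(const ContentType: string): Boolean;
var
  MediaType: string;
  P: Integer;
begin
  MediaType := ContentType;
  // Drop parameters such as '; charset=UTF-8'
  P := Pos(';', MediaType);
  if P > 0 then
    MediaType := Copy(MediaType, 1, P - 1);
  // Media types are case-insensitive, so normalize before comparing
  MediaType := LowerCase(Trim(MediaType));
  Result := Copy(MediaType, 1, 6) = 'image/';
end;
```

For the header quoted above, IsImageContentType('text/html; charset=UTF-8') returns False, which is the signal that the downloaded bytes are a page and not the JPG.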
If you cannot tell a website apart from a picture, you have a long way of learning ahead of you. A web browser can display various media: parsed HTML as a rendered website, videos and pictures of varying formats, text files, nowadays even PDFs... The picture's URL would be https://64.media.tumblr.com/0b37315236ee5da6cb4d191ea6a14ccb/tumblr_on2uab8Anw1vb29w2o1_500.jpg, which can be found inside the HTML, of course. If you can right-click on your picture to save it, you can also display it in a new tab - and that displays differently from the website you're currently looking at, where it is just embedded.
Yes: you should parse the whole HTML, watching out for all occurrences of <img src=" and then check whether this is the picture you want. Luckily it's even easier: just search for <meta name="twitter:image" content=" and then copy everything up to " /> to get your actual picture URL.
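The search described above could be sketched roughly like this. It assumes the page really contains the tag with exactly this attribute order and quoting, and that the HTML has already been decoded from the downloaded bytes into a string; ExtractTwitterImageUrl is a hypothetical helper name.

```pascal
uses
  System.StrUtils;

// Hypothetical helper: returns the content of the first
// <meta name="twitter:image" content="..."> tag, or '' when it is absent.
function ExtractTwitterImageUrl(const Html: string): string;
const
  Marker = '<meta name="twitter:image" content="';
var
  StartPos, EndPos: Integer;
begin
  Result := '';
  StartPos := Pos(Marker, Html);
  if StartPos = 0 then
    Exit;
  // Move past the marker to the first character of the URL
  Inc(StartPos, Length(Marker));
  // The URL ends at the next double quote
  EndPos := PosEx('"', Html, StartPos);
  if EndPos > StartPos then
    Result := Copy(Html, StartPos, EndPos - StartPos);
end;
```

The returned URL could then be passed to the DownloadFile function from the question in a second request. A plain string search like this is fragile: if the site changes the tag layout, or the attribute value contains HTML entities, a real HTML parser would be the safer choice.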
CodePudding user response:
In your Accept header you are not specifying any graphic formats.
A properly configured web server should send you a representation of the resource in one of the formats you accept.
Try looking at the request headers sent from your browser (use the Developer Tools - see your browser's help for how to access them). It will be accepting graphic formats as well as text.
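As a rough illustration, the header block could be built with image formats listed first. The exact list of types and q-values below is an assumption, not what any particular browser sends; copy the real value from the Developer Tools. BuildImageRequestHeaders is a hypothetical helper name.

```pascal
// Hypothetical helper: builds request headers that advertise image formats.
function BuildImageRequestHeaders(const Referer: string): string;
const
  CRLF = #13#10; // HTTP header lines end with CR LF
begin
  Result := '';
  if Referer <> '' then
    Result := Result + 'Referer: ' + Referer + CRLF;
  // Assumed Accept list: images preferred, anything else as a fallback
  Result := Result + 'Accept: image/webp,image/png,image/jpeg,image/*;q=0.8,*/*;q=0.5' + CRLF;
end;
```

The result can be passed to HttpAddRequestHeaders in the same way as the Header string in the question. This only helps when the server picks a representation based on the Accept header; a URL that always serves an HTML page will still return HTML.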