Try to download image from image URL, but I get HTML instead

Time:02-21

I have the URL https://art42.tumblr.com/random to a webpage that displays some images. I want to download the main image from that page.

If I right-click the image in Firefox (or any other browser) and I choose "Open image in new tab" I get the image (jpg). However, if I try to download the image using the code below, I get an HTML file.

I guess the problem has to do with the Referer. I tried to set the "Referer" parameter to the page's URL, and also to the image's URL, but I still get HTML instead of JPG.

Why can I download the image in Firefox, but not with my code?

function DownloadFile(CONST Url, Referer: String; OUT Data: TBytes; PostData: String= ''; SSL: Boolean = FALSE): Boolean;   { TESTED OK }
VAR
  Buffer     : array[0..High(Word)*4] of Byte; { Buffer of 260KB }
  TempBytes  : TBytes;
  sMethod    : string;
  BytesRead  : Cardinal;
  pSession   : HINTERNET;
  pConnection: HINTERNET;
  pRequest   : HINTERNET;
  Resource   : string;
  Root       : string;
  port       : Integer;
  flags      : DWord;
  Header     : string;
begin
  Result := FALSE;
  SetLength(Data, 0);
  pSession := InternetOpen(nil {USER_AGENT}, INTERNET_OPEN_TYPE_PRECONFIG, nil, nil, 0);

  if Assigned(pSession) then
  TRY
    { Autodetect port }
    port:= UrlExtractPort(URL);
    if port = 0 then
      if SSL
      then Port := INTERNET_DEFAULT_HTTPS_PORT
      else Port := INTERNET_DEFAULT_HTTP_PORT;

    { Root }
    Root:= UrlExtractDomainRelaxed(Url);
    pConnection := InternetConnect(pSession, PWideChar(Root), port, nil, nil, INTERNET_SERVICE_HTTP, 0, 0); { The second parameter of InternetConnect should contain only the name of the server, not the entire URL of the server-side script. }

    if Assigned(pConnection) then
    TRY
      if (PostData = '')
      then sMethod := 'GET'
      else sMethod := 'POST';

      if SSL
      then flags := INTERNET_FLAG_SECURE  OR INTERNET_FLAG_KEEP_CONNECTION
      else flags := INTERNET_FLAG_RELOAD; // INTERNET_FLAG_RELOAD = forces a download of the requested file, object, or directory listing from the origin server, not from the cache. INTERNET_SERVICE_HTTP is a service type for InternetConnect, not a request flag.

      Resource := UrlExtractResourceParams(Url);  
      pRequest := HTTPOpenRequest(pConnection, PWideChar(sMethod), PWideChar(Resource), nil, nil, nil, flags, 0);  { The third parameter of HttpOpenRequest is the file name (URL) of the script }

      if Assigned(pRequest) then
        TRY
           Header:= '';
           if Referer > ''
           then Header:= Header + 'Referer: ' + Referer + sLineBreak;
           Header:= Header + 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0' + sLineBreak;
         //Header:= Header + 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59' + sLineBreak;  //  Microsoft Edge UA string
           Header:= Header + 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' + sLineBreak;
           Header:= Header + 'Accept-Language: en-us,en;q=0.5' + sLineBreak;
           Header:= Header + 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' + sLineBreak;
           Header:= Header + 'Keep-Alive: 70' + sLineBreak; { In windows, default is 60 sec }
           Header:= Header + 'Connection: keep-alive' + sLineBreak + sLineBreak;

           HttpAddRequestHeaders(pRequest, PWideChar(Header), Length(Header), HTTP_ADDREQ_FLAG_ADD);

           Result:= HTTPSendRequest(pRequest, NIL, 0, Pointer(PostData), Length(PostData));     { The actual POST data is the fourth parameter }
           if Result then
             REPEAT
              ZeroMemory(@Buffer, SizeOf(Buffer));   

              { Download bytes }
              InternetReadFile(pRequest, @Buffer, SizeOf(Buffer), BytesRead);

              { We stop? }
              if BytesRead= 0 then break;

              { Convert static array to dynamic array }
              SetLength(TempBytes, BytesRead);
              Move(Buffer[0], TempBytes[0], BytesRead);

              { Merge arrays }
              Data:= Data + TempBytes;
             UNTIL BytesRead= 0;
        FINALLY
          InternetCloseHandle(pRequest);
        END
      else
        RaiseLastOSError;

    finally
      InternetCloseHandle(pConnection);
    end;
  finally
    InternetCloseHandle(pSession);
  end;
end;

I will try Indy tomorrow. Hopefully the "Referer" is working there...

CodePudding user response:

In your "Accept" header you are not specifying any graphic formats.

A properly configured web server should send you a representation of the resource in one of the formats you accept.

Try looking at the request headers sent from your browser (use the Developer Tools - see your browser's help for how to access them). It will be accepting graphic formats as well as text.
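
As a rough illustration of that suggestion, the header block could be built with an image-oriented Accept value. This is only a sketch: the function name BuildImageRequestHeaders is hypothetical, and the Accept value is an example modelled on what older Firefox versions send for image requests, so compare it with what your own browser's Developer Tools show. Whether the server takes the header into account depends on the server.

```pascal
program ImageAcceptHeader;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

const
  CRLF = #13#10;  // HTTP header lines end with CR LF

{ Hypothetical helper: builds the extra request headers for an image request.
  The Accept value is an illustrative example, not a required value. }
function BuildImageRequestHeaders(const Referer: string): string;
begin
  Result := '';
  if Referer <> '' then
    Result := Result + 'Referer: ' + Referer + CRLF;
  // List image types first, then fall back to anything else
  Result := Result + 'Accept: image/png,image/*;q=0.8,*/*;q=0.5' + CRLF;
end;

begin
  Write(BuildImageRequestHeaders('https://art42.tumblr.com/random'));
end.
```

The resulting string could be passed to HttpAddRequestHeaders in place of the hand-built Header variable from the question.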

CodePudding user response:

It works as intended:

  1. You request GET https://art42.tumblr.com/random which is answered with HTTP 302 Found.

  2. That response implies there must be a header Location in the answer which points to a new URL we should query. In my case the whole header is:

    Location: https://art42.tumblr.com/post/158825454869#_=_

  3. That URL is a new GET request which finally is answered with HTTP 200 OK.

  4. This response usually means we have a payload, and the Content-Type header tells us how to treat it. In my case the whole header is:

    Content-Type: text/html; charset=UTF-8

    Which means it is a text document with HTML content, encoded in UTF-8. That's nice - needing to parse a PDF or EXE would be less trivial.

  5. That's it: everything worked as expected. It is still a website - you even see that by all the text around. Just because one picture is embedded it doesn't make the whole payload a picture, too.
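
The check in step 4 can be sketched as a small helper that looks at a Content-Type value before the payload is treated as a picture. The function name IsImageContentType is hypothetical. With WinINet, the header value itself would typically be read from the request handle via HttpQueryInfo with HTTP_QUERY_CONTENT_TYPE; that call is left out here so the sketch stays runnable without a network connection.

```pascal
program CheckContentType;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils;

{ Hypothetical helper: returns True if a Content-Type header value names an
  image media type, ignoring parameters such as "; charset=UTF-8". }
function IsImageContentType(const ContentType: string): Boolean;
var
  MediaType: string;
  SemicolonPos: Integer;
begin
  MediaType := ContentType;
  // Cut off any parameters that follow the media type
  SemicolonPos := Pos(';', MediaType);
  if SemicolonPos > 0 then
    MediaType := Copy(MediaType, 1, SemicolonPos - 1);
  // Media types are case-insensitive
  MediaType := LowerCase(Trim(MediaType));
  Result := Copy(MediaType, 1, 6) = 'image/';
end;

begin
  WriteLn(IsImageContentType('text/html; charset=UTF-8'));
  WriteLn(IsImageContentType('image/jpeg'));
end.
```

If the helper returns False for the response, the downloaded bytes are most likely a page that still has to be searched for the real picture URL.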

If you cannot tell a web page from a picture, you have a lot of learning ahead of you. A web browser can display various media: parsed HTML as a rendered website, videos and pictures in various formats, text files, nowadays even PDFs... The picture's URL would be https://64.media.tumblr.com/0b37315236ee5da6cb4d191ea6a14ccb/tumblr_on2uab8Anw1vb29w2o1_500.jpg, which can be found inside the HTML, of course. If you can right-click on your picture to save it, you can also display it in a new tab, and that tab shows the picture on its own, unlike the web page you are currently looking at, where it is only embedded.

Yes: you should parse the whole HTML, looking for every occurrence of <img src=" and then checking whether it is the picture you want. Luckily it's even easier: just search for <meta name="twitter:image" content=" and then copy everything up to " /> to get your actual picture URL.
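
A minimal sketch of that search, assuming the page really contains the meta tag in exactly this form (same attribute order and double quotes). The function name ExtractTwitterImageUrl is hypothetical, and this is a plain substring search, not a real HTML parser.

```pascal
program ExtractImageUrl;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, StrUtils;

{ Hypothetical helper: returns the content value of the first
  <meta name="twitter:image" content="..."> tag, or '' if none is found. }
function ExtractTwitterImageUrl(const Html: string): string;
const
  Marker = '<meta name="twitter:image" content="';
var
  StartPos, EndPos: Integer;
begin
  Result := '';
  StartPos := Pos(Marker, Html);
  if StartPos = 0 then
    Exit;
  // Move past the marker to the first character of the URL
  Inc(StartPos, Length(Marker));
  // The URL ends at the closing double quote
  EndPos := PosEx('"', Html, StartPos);
  if EndPos = 0 then
    Exit;
  Result := Copy(Html, StartPos, EndPos - StartPos);
end;

const
  SampleHtml =
    '<head><meta name="twitter:image" content="' +
    'https://64.media.tumblr.com/0b37315236ee5da6cb4d191ea6a14ccb/' +
    'tumblr_on2uab8Anw1vb29w2o1_500.jpg" /></head>';
begin
  WriteLn(ExtractTwitterImageUrl(SampleHtml));
end.
```

The returned URL is the one to request in a second download; its host and path are what end up as the server name and resource in the WinINet calls.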


Edit: To (not) reproduce OP's problem, this code is trimmed down so that it runs as is, and it also loads the picture automatically. Note that a lot of questionable customizations have been removed, especially the headers and the referer, and the formatting is more consistent:

uses
  wininet, jpeg;

function DownloadFile(Data: TMemoryStream): Boolean;
var
  Buffer: Array[0.. High(Word)* 4] of Byte;
  Resource, Root, sMethod: AnsiString;
  BytesRead, flags: Cardinal;
  pSession, pConnection, pRequest: HINTERNET;
  port: Word;
begin
  Result:= FALSE;
  Data.Clear;
  pSession:= InternetOpenA(nil, INTERNET_OPEN_TYPE_PRECONFIG, nil, nil, 0);

  if Assigned(pSession) then
  try
    port:= 443;

    Root:= '64.media.tumblr.com';
    pConnection:= InternetConnectA(pSession, PAnsiChar(Root), port, nil, nil, INTERNET_SERVICE_HTTP, 0, 0);

    if Assigned(pConnection) then
    try
      sMethod:= 'GET';
      flags:= INTERNET_FLAG_SECURE or INTERNET_FLAG_KEEP_CONNECTION;

      Resource:= '/0b37315236ee5da6cb4d191ea6a14ccb/tumblr_on2uab8Anw1vb29w2o1_500.jpg';
      pRequest:= HTTPOpenRequestA(pConnection, PAnsiChar(sMethod), PAnsiChar(Resource), nil, nil, nil, flags, 0);

      if Assigned(pRequest) then
      try
        Result:= HTTPSendRequestA(pRequest, nil, 0, nil, 0);
        if Result then
        repeat
          InternetReadFile(pRequest, @Buffer, SizeOf(Buffer), BytesRead);
          if BytesRead= 0 then break;
          Data.Write(Buffer[0], BytesRead);
        until FALSE;
      finally
        InternetCloseHandle(pRequest);
      end
      else RaiseLastOSError;
    finally
      InternetCloseHandle(pConnection);
    end;
  finally
    InternetCloseHandle(pSession);
  end;
end;

// Actually executing it: just add one TImage to your form
procedure TForm1.Button1Click(Sender: TObject);
var
  Data: TMemoryStream;
  j: TJpegImage;
  Head: AnsiString;
begin
  Data:= TMemoryStream.Create;
  try
    DownloadFile(Data);
    if Data.Size> 3 then begin  // Reasonable size for picture
      Data.Position:= 0;
      SetLength(Head, 3);
      Data.Read(Head[1], 3);
      if Head= #$ff#$d8#$ff then  // Is it JFIF (aka JPG)?
      begin
        Data.Position:= 0;
        j:= TJpegImage.Create;
        try
          j.LoadFromStream(Data);
          Image1.AutoSize:= TRUE;
          Image1.Picture.Assign(j);
        except
          // Might be corrupt or its (sub) format is not supported
        end;
        j.Free;
      end;
    end;
  finally
    Data.Free;  // Release the stream even if DownloadFile raises
  end;
end;