Airflow gets error 403 when consuming API

Time:07-13

I am trying to use Apache Airflow to send a GET request to this API: https://www.mercadobitcoin.com.br/api/BTC/day-summary/2022/7/10.

I created the following function to do this:

import requests

def response_get(self, yr: int, month: int, day: int) -> str:
    # Fetch the BTC day summary for the given date and return the raw response body.
    response_API = requests.get(f'https://www.mercadobitcoin.net/api/BTC/day-summary/{yr}/{month}/{day}')
    data = response_API.text
    return data

This works as expected. However, when I try to use SimpleHttpOperator from Apache Airflow, I get a 403 error. I created an HTTP connection with "https://www.mercadobitcoin.net/api/BTC/day-summary".

Here are my task and log:

Task

get_api = SimpleHttpOperator(
    task_id='get_api',
    http_conn_id='btc_get',
    endpoint=f'{yesterday.year}/{yesterday.month}/{yesterday.day}',
    method='GET',
    response_filter=lambda response: json.loads(response.text),
    log_response=True
)

Log

*** Reading local file: /opt/airflow/logs/dag_id=bitcoin/run_id=manual__2022-07-12T00:22:45.173666+00:00/task_id=get_api/attempt=1.log
[2022-07-11, 21:22:47 ] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: bitcoin.get_api manual__2022-07-12T00:22:45.173666+00:00 [queued]>
[2022-07-11, 21:22:47 ] {taskinstance.py:1159} INFO - Dependencies all met for <TaskInstance: bitcoin.get_api manual__2022-07-12T00:22:45.173666+00:00 [queued]>
[2022-07-11, 21:22:47 ] {taskinstance.py:1356} INFO - 
--------------------------------------------------------------------------------
[2022-07-11, 21:22:47 ] {taskinstance.py:1357} INFO - Starting attempt 1 of 1
[2022-07-11, 21:22:47 ] {taskinstance.py:1358} INFO - 
--------------------------------------------------------------------------------
[2022-07-11, 21:22:47 ] {taskinstance.py:1377} INFO - Executing <Task(SimpleHttpOperator): get_api> on 2022-07-12 00:22:45.173666+00:00
[2022-07-11, 21:22:47 ] {standard_task_runner.py:52} INFO - Started process 3018 to run task
[2022-07-11, 21:22:47 ] {standard_task_runner.py:79} INFO - Running: ['***', 'tasks', 'run', 'bitcoin', 'get_api', 'manual__2022-07-12T00:22:45.173666+00:00', '--job-id', '21', '--raw', '--subdir', 'DAGS_FOLDER/projeto_api.py', '--cfg-path', '/tmp/tmpgquhvpmu', '--error-file', '/tmp/tmpifb8j8ri']
[2022-07-11, 21:22:47 ] {standard_task_runner.py:80} INFO - Job 21: Subtask get_api
[2022-07-11, 21:22:47 ] {task_command.py:370} INFO - Running <TaskInstance: bitcoin.get_api manual__2022-07-12T00:22:45.173666+00:00 [running]> on host 54ef6f590a3a
[2022-07-11, 21:22:48 ] {taskinstance.py:1571} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=bitcoin
AIRFLOW_CTX_TASK_ID=get_api
AIRFLOW_CTX_EXECUTION_DATE=2022-07-12T00:22:45.173666+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2022-07-12T00:22:45.173666+00:00
[2022-07-11, 21:22:48 ] {http.py:102} INFO - Calling HTTP method
[2022-07-11, 21:22:48 ] {base.py:68} INFO - Using connection ID 'btc_get' for task execution.
[2022-07-11, 21:22:48 ] {http.py:129} INFO - Sending 'GET' to url: https://www.mercadobitcoin.com.br/api/BTC/day-summary/2022/7/11
[2022-07-11, 21:22:50 ] {http.py:142} ERROR - HTTP error: Forbidden
[2022-07-11, 21:22:50 ] {http.py:143} ERROR - <!DOCTYPE html>
<html lang="en-US">
<head>
<title>Access denied</title>
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<link rel="stylesheet" href="/cdn-cgi/styles/errors.new.min.css" media="screen" />
<script>
(function(){if(document.addEventListener&&window.XMLHttpRequest&&JSON&&JSON.stringify){var e=function(a){var c=document.getElementById("error-feedback-survey"),d=document.getElementById("error-feedback-success"),b=new XMLHttpRequest;a={event:"feedback clicked",properties:{errorCode:1020,helpful:a,version:3}};b.open("POST","https://sparrow.cloudflare.com/api/v1/event");b.setRequestHeader("Content-Type","application/json");b.setRequestHeader("Sparrow-Source-Key","c771f0e4b54944bebf4261d44bd79a1e");
b.send(JSON.stringify(a));c.classList.add("feedback-hidden");d.classList.remove("feedback-hidden")};document.addEventListener("DOMContentLoaded",function(){var a=document.getElementById("error-feedback"),c=document.getElementById("feedback-button-yes"),d=document.getElementById("feedback-button-no");"classList"in a&&(a.classList.remove("feedback-hidden"),c.addEventListener("click",function(){e(!0)}),d.addEventListener("click",function(){e(!1)}))})}})();
</script>
<script>
         (function(){if(document.addEventListener){var c=function(){var b=document.getElementById("copy-label");document.getElementById("plain-ray-id");if(navigator.clipboard)navigator.clipboard.writeText("7295a074fab5a6cd");else{var a=document.createElement("textarea");a.value="7295a074fab5a6cd";a.style.top="0";a.style.left="0";a.style.position="fixed";document.body.appendChild(a);a.focus();a.select();document.execCommand("copy");document.body.removeChild(a)}b.innerText="Copied"};document.addEventListener("DOMContentLoaded",
function(){var b=document.getElementById("plain-ray-id"),a=document.getElementById("click-to-copy-btn");"classList"in b&&(b.classList.add("hidden"),a.classList.remove("hidden"),a.addEventListener("click",c))})}})();
      </script>
<script defer src="https://performance.radar.cloudflare.com/beacon.js"></script>
</head>
<body>
<div  role="main">
<div >
<h1>
<span >Access denied</span>
<span >Error code <span>1020</span></span>
</h1>
<div >
<p>You do not have access to www.mercadobitcoin.com.br.</p><p>The site owner may have set restrictions that prevent you from accessing the site. Contact the site owner for access or try loading the page again.</p>
</div>
</div>
</div>
<div>
<div >
<div >
<h2 >Additional information</h2>
<p>The access policies of a site define which visits are allowed. Your current visit is not allowed according to those policies.</p><p>Only the site owner can change site access policies.</p>
</div>
<div >
<h2 >I am the site owner</h2>
<p >
Ray ID:
<span  id="plain-ray-id">
7295a074fab5a6cd
</span>
<button  id="click-to-copy-btn" title="Click to copy Ray ID" type="button">
<span >7295a074fab5a6cd</span><span  id="copy-label">Copy</span>
</button>
</p>
<ol>
<li>
Search the
<a rel="noopener noreferrer" href="https://dash.cloudflare.com/?to=/:account/:zone/firewall" target="_blank">Firewall Events Log</a>
<img  title="Opens in new tab" src="/cdn-cgi/images/external.png" alt="External link">
for the above Ray ID.
</li>
<li>Examine and assess the details of the access policy.</li>
</ol>
<br>
<a rel="noopener noreferrer" href="https://support.cloudflare.com/hc/articles/360029779472-Troubleshooting-Cloudflare-1XXX-errors#error1020" target="_blank">Troubleshooting guide</a>
<img  title="Opens in new tab" src="/cdn-cgi/images/external.png" alt="External link">
</div>
</div>
<div  role="contentinfo">
<div >
<div  id="error-feedback">
<div id="error-feedback-survey" >
Was this page helpful?
<button  id="feedback-button-yes" type="button">Yes</button>
<button  id="feedback-button-no" type="button">No</button>
</div>
<div  id="error-feedback-success">
Thank you for your feedback!
</div>
</div>
</div>
<div >
Performance &amp; security by <a rel="noopener noreferrer" href="https://www.cloudflare.com" target="_blank">Cloudflare</a>
<img  title="Opens in new tab" src="/cdn-cgi/images/external.png" alt="External link">
</div>
</div>
</div>
<script defer src="https://static.cloudflareinsights.com/beacon.min.js/v652eace1692a40cfa3763df669d7439c1639079717194" integrity="sha512-Gi7xpJR8tSkrpF7aordPZQlW2DLtzUlZcumS8dMQjwDHEnw9I7ZLyiOj/6tZStRBGtGgN6ceN6cMH8z7etPGlw==" data-cf-beacon='{"rayId":"7295a074fab5a6cd","token":"bdd41a57ca1a467086229b46045794af","version":"2022.6.0","si":100}' crossorigin="anonymous"></script>
</body>
</html>

[2022-07-11, 21:22:50 ] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/http/hooks/http.py", line 140, in check_response
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.7/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.mercadobitcoin.com.br/api/BTC/day-summary/2022/7/11

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/http/operators/http.py", line 104, in execute
    response = http.run(self.endpoint, self.data, self.headers, self.extra_options)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/http/hooks/http.py", line 130, in run
    return self.run_and_check(session, prepped_request, extra_options)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/http/hooks/http.py", line 183, in run_and_check
    self.check_response(response)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/http/hooks/http.py", line 144, in check_response
    raise AirflowException(str(response.status_code) + ":" + response.reason)
airflow.exceptions.AirflowException: 403:Forbidden
[2022-07-11, 21:22:50 ] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=bitcoin, task_id=get_api, execution_date=20220712T002245, start_date=20220712T002247, end_date=20220712T002250
[2022-07-11, 21:22:51 ] {standard_task_runner.py:97} ERROR - Failed to execute job 21 for task get_api (403:Forbidden; 3018)
[2022-07-11, 21:22:51 ] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-07-11, 21:22:51 ] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check

I really can't see what I am missing. Is Airflow sending headers that cause the request to be recognized as a crawler? If so, how can I "disable" that?

I am using Ubuntu 20.04 and Airflow with Docker.

Answer:

In the response_get function you used "https://www.mercadobitcoin.net".

In the SimpleHttpOperator, the log shows the request going to "https://www.mercadobitcoin.com.br", so your connection appears to be defined with that host.

Change com.br to net in the connection.
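
One quick way to confirm the mismatch is to compare the host in the URL that works with the host in the URL that Airflow logs. The following is only a minimal sketch using the Python standard library; host_of is a hypothetical helper name, and the two URLs are copied from the question (the requests.get call and the "Sending 'GET' to url" log line).

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the hostname of a URL, or an empty string if it has none."""
    # host_of is a hypothetical helper for illustration, not part of Airflow.
    return urlsplit(url).hostname or ""


# URL used by the working requests.get call in the question.
working_url = "https://www.mercadobitcoin.net/api/BTC/day-summary/2022/7/10"
# URL reported by the "Sending 'GET' to url" line in the Airflow log.
logged_url = "https://www.mercadobitcoin.com.br/api/BTC/day-summary/2022/7/11"

if host_of(working_url) != host_of(logged_url):
    print(f"Host mismatch: {host_of(working_url)} vs {host_of(logged_url)}")
```

If the two hosts differ, the request is not going where the working code sends it, so the host stored in the Airflow connection is the first thing to check before looking at headers.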
