I work on a project that maintains a data lake that centralizes public information from the Brazilian government. Our pipelines run on a Kubernetes cluster.
I'm currently building a pipeline for labor market data. This is the bash script I use to download the data:
#!/bin/bash
# Usage: bash download.sh <group>, where <group> is cagedmov | cagedfor | cagedexc.
#
# The microdata from the new consolidation are published according to the
# month of disclosure, starting in January 2020, with three files per
# reference month (competência). The naming pattern is consistent:
# CAGEDMOVYYYYMM files contain the movements declared within the deadline
# for reference month YYYYMM; CAGEDFORYYYYMM files contain the movements
# declared after the deadline for reference month YYYYMM; CAGEDEXCYYYYMM
# files contain the excluded movements whose exclusion reference month is
# YYYYMM.
lower_group=$1
upper_group=${lower_group^^}   # file names on the FTP server use the upper-case prefix
mkdir -p "/tmp/novo_caged/$lower_group/input"
ufs=('RO' 'AC' 'AM' 'RR' 'PA' 'AP' 'TO' 'MA' 'PI' 'CE' 'RN' 'PB' 'PE' 'AL' 'SE' 'BA' 'MG' 'ES' 'RJ' 'SP' 'PR' 'SC' 'RS' 'MS' 'MT' 'GO' 'DF')
anos=(2020 2021 2022)
meses=($(seq 1 1 12))
# Create the partition directories (ano/mes/sigla_uf) for every combination.
for uf in "${ufs[@]}"; do
    for ano in "${anos[@]}"; do
        for mes in "${meses[@]}"; do
            mkdir -p "/tmp/novo_caged/$lower_group/ano=$ano/mes=$mes/sigla_uf=$uf/"
        done
    done
done
cd "/tmp/novo_caged/$lower_group/input" || exit 1
ftp_path="ftp://anonymous:anonymous@ftp.mtps.gov.br/pdet/microdados/NOVO CAGED/"
pad_meses=({01..12})   # zero-padded months, matching the FTP file names
# Download each monthly archive, extract it and remove the archive.
for ano in "${anos[@]}"; do
    for mes in "${pad_meses[@]}"; do
        wget "$ftp_path$ano/$ano$mes/$upper_group$ano$mes.7z"
        7z x -y "$upper_group$ano$mes.7z"
        rm -f "$upper_group$ano$mes.7z"
    done
done
The script runs perfectly on my computer, but when I deploy it to the Kubernetes cluster it fails with the error Failed to connect to ftp.mtps.gov.br port 21: Connection timed out. Apparently, ftp.mtps.gov.br only accepts requests from IP addresses located in Brazil. Is there a way to work around this restriction? Being able to automate this ETL and keep the data up to date would be very important for our project.
CodePudding user response:
You can use Tor as a SOCKS5 proxy and configure it so that traffic exits from a specific country.
In the torrc configuration file, add these lines (or adjust them if they already exist):
ExitNodes {br}
StrictNodes 1
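After changing torrc, Tor has to be restarted for the new exit policy to take effect. A minimal sketch, assuming Tor runs as a systemd service and the configuration sits at the common default path /etc/tor/torrc:
sudo systemctl restart tor
# or, to run Tor in the foreground with an explicit config file:
tor -f /etc/tor/torrc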
Finally, you need to tell your bash script to route its traffic through Tor.
That can be done in different ways; the easiest is the torify command (a wrapper around torsocks).
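For example, a sketch of how the download script could be launched through Tor, assuming it is saved as download.sh as in the usage comment above (torify intercepts the network calls of wget, which has no native SOCKS support):
torify bash download.sh cagedmov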
I suggest testing everything first by adding this line right after the shebang at the top of the script:
#!/bin/bash
curl https://api.myip.com; exit
When the script is run through torify, this shows which country the Tor exit node is in. If it reports Brazil, remove the test line.
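An equivalent check that does not need torify is to point curl at Tor's SOCKS interface directly (127.0.0.1:9050 is Tor's default SOCKS port; adjust it if your instance is configured differently):
curl --socks5-hostname 127.0.0.1:9050 https://api.myip.com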
https://www.torproject.org/
https://linux.die.net/man/1/torify