I work on a project that maintains a data lake that centralizes public information from the Brazilian government. Our pipelines run on a Kubernetes cluster.

I'm currently building a pipeline for labor market data. This is the bash script I use to download the data:


# To run this script, call 'bash download.sh group', where group is cagedmov | cagedfor | cagedexc.
# See the explanation in the next comment:

# The microdata resulting from the new consolidation are made available according to the
# month of disclosure, as of January 2020, with three files for each reference
# month (competência). Following a consistent naming pattern, the CAGEDMOVYYYYMM files
# contain the movements declared within the deadline with a reference month
# equal to YYYYMM. The CAGEDFORYYYYMM files contain the movements declared
# after the deadline with a reference month equal to YYYYMM. The
# CAGEDEXCYYYYMM files contain the excluded movements with an exclusion
# reference month equal to YYYYMM.


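# Derive the lower- and upper-case group names from the first argument
# (assumed: these variables are used below but never defined in the script as posted)
lower_group=$(echo "$1" | tr '[:upper:]' '[:lower:]')
upper_group=$(echo "$1" | tr '[:lower:]' '[:upper:]')

mkdir -p "/tmp/novo_caged/$lower_group/input"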
ufs=('RO' 'AC' 'AM' 'RR' 'PA' 'AP' 'TO' 'MA' 'PI' 'CE' 'RN' 'PB' 'PE' 'AL' 'SE' 'BA' 'MG' 'ES' 'RJ' 'SP' 'PR' 'SC' 'RS' 'MS' 'MT' 'GO' 'DF')
anos=(2020 2021 2022)
meses=($(seq 1 1 12))

# Create one partition directory per state, year and month
for uf in "${ufs[@]}"; do
    for ano in "${anos[@]}"; do
        for mes in "${meses[@]}"; do
            mkdir -p "/tmp/novo_caged/$lower_group/ano=$ano/mes=$mes/sigla_uf=$uf/"
        done
    done
done

cd /tmp/novo_caged/$lower_group/input
ftp_path="ftp://anonymous:[email protected]/pdet/microdados/NOVO CAGED/"

pad_meses=($(echo {01..12}))
folders=($(seq 202001 1 202012))

# Download, extract and delete the archive for each year and zero-padded month
for ano in "${anos[@]}"; do
    for mes in "${pad_meses[@]}"; do
        wget "$ftp_path$ano/$ano$mes/$upper_group$ano$mes.7z"
        7z x -y "$upper_group$ano$mes.7z"
        rm *7z
    done
done
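
The naming pattern described in the script's comments can be sketched as a small helper that validates the group and builds the archive name. This is only an illustration: the function name caged_archive_name is hypothetical, and the accepted groups are assumed from the file names mentioned in the comments (CAGEDMOV, CAGEDFOR, CAGEDEXC).

```shell
# Hypothetical helper: print the archive name for a group, year and month,
# following the <GROUP><YYYY><MM>.7z pattern used by the download loop.
caged_archive_name() {
    local group year month
    # Upper-case the group so that 'cagedmov' and 'CAGEDMOV' are both accepted
    group=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    year=$2
    # Strip one leading zero (if any) and pad back to two digits
    month=$(printf '%02d' "${3#0}")
    case "$group" in
        CAGEDMOV|CAGEDFOR|CAGEDEXC)
            printf '%s%s%s.7z\n' "$group" "$year" "$month"
            ;;
        *)
            echo "unknown group: $1" >&2
            return 1
            ;;
    esac
}
```

Rejecting an unknown group before the loop starts avoids sending a long series of requests for files that cannot exist.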

The script runs perfectly on my computer, but when I deploy it to the Kubernetes cluster it fails with the error "Failed to connect to ftp.mtps.gov.br port 21: Connection timed out". Apparently, ftp.mtps.gov.br only accepts requests from Brazilian IP addresses. Is there a way to get around this restriction? Automating this ETL and keeping the data up to date is very important for our project.

Answer:

You can use Tor as a SOCKS5 proxy and configure it so that traffic exits from a specific country.
In the torrc configuration file, add these lines, or modify them if they already exist:

ExitNodes {br}
StrictNodes 1
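
In a container image it may be more convenient to write these two lines from a script than to edit the file by hand. A minimal sketch, assuming a hypothetical helper named set_tor_exit_country; the torrc location is passed in as an argument because it varies between installations.

```shell
# Hypothetical helper: append exit-node settings to a torrc file.
# $1 = path to the torrc file, $2 = two-letter country code (e.g. br)
set_tor_exit_country() {
    local torrc=$1 country=$2
    # Appends; if the file already sets ExitNodes, edit that line instead
    printf 'ExitNodes {%s}\nStrictNodes 1\n' "$country" >> "$torrc"
}
```

Tor has to be restarted or reloaded after the file changes for the new settings to take effect.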

Finally, you need to tell your bash script to use Tor.
This can be done in several ways; the easiest is the torify command.
To test the setup, I suggest adding this line at the top of the script:


curl https://api.myip.com;exit

This shows which country is being used as the Tor exit node. If it is correct, remove the test line.
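
For an unattended pipeline, the manual check can be turned into a guard that aborts when the exit node is not in Brazil. This is a sketch under an assumption: it expects the response body to be JSON containing a country-code field of the form "cc":"BR", which should be confirmed against the actual output of the curl command above. The function name is_brazil_exit is hypothetical.

```shell
# Hypothetical helper: succeed only if the JSON body reports country code BR.
# Assumes the body contains a field like "cc":"BR".
is_brazil_exit() {
    printf '%s' "$1" | grep -Eq '"cc"[[:space:]]*:[[:space:]]*"BR"'
}

# Intended use at the top of the download script (not executed here):
#   response=$(curl -s https://api.myip.com)
#   is_brazil_exit "$response" || { echo "Tor exit node is not in Brazil" >&2; exit 1; }
```

The script would then be started through Tor, for example with 'torify bash download.sh cagedmov', so that both the check and the downloads use the same exit node configuration.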


