Hi everyone!
I'm trying to make parallel curl requests from an array of URLs to speed up a bash script. After some research I found that there are several approaches: GNU parallel, xargs, curl's built-in --parallel option (starting from 7.68.0), and plain ampersand backgrounding. The best option would be to avoid GNU parallel, since it would require installing it and I'm not allowed to do that.
Here is my initial script:
#!/bin/bash
external_links=(https://www.designernews.co/error https://www.awwwards.com/error1/ https://dribbble.com/error1 https://www.designernews.co https://www.awwwards.com https://dribbble.com)
invalid_links=()
for external_link in "${external_links[@]}"
do
    # capture curl's stderr in curlResult, discard the response body
    curlResult=$(curl -sSfL --max-time 60 --connect-timeout 30 --retry 3 -4 "$external_link" 2>&1 > /dev/null) && status=$? || status=$?
    if [ $status -ne 0 ] ; then
        if [[ $curlResult =~ (error: )([0-9]{3}) ]]; then
            error_code=${BASH_REMATCH[0]}
            invalid_links+=("${error_code} ${external_link}")
            echo "${external_link}"
        fi
    fi
    i=$((i+1))
done
echo "Found ${#invalid_links[@]} invalid links: "
printf '%s\n' "${invalid_links[@]}"
I tried changing the curl options and adding xargs and ampersand backgrounding, but didn't succeed. All examples I found mostly used GNU parallel or read the data from a file; none of them worked with a variable containing an array of URLs (Running programs in parallel using xargs, cURL with variables and multiprocessing in shell). Could you please help me with this issue?
CodePudding user response:
The main problem is that you can't modify your array directly from a sub-process. A possible work-around is to use a FIFO file for transmitting the results of the sub-processes to the main program.
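A quick, hypothetical illustration of why the direct approach fails:
invalid_links=()
{ invalid_links+=("oops"); } &   # the += runs in a background sub-process
wait
echo "${#invalid_links[@]}"      # prints 0: the parent's array is unchanged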
remark: As long as the messages are shorter than PIPE_BUF bytes (see getconf PIPE_BUF /), the writes to the FIFO file are guaranteed to be atomic, so lines coming from different sub-processes won't interleave.
#!/bin/bash

# open the FIFO read-write on file descriptor 3
tempdir=$(mktemp -d) &&
mkfifo "$tempdir/fifo" &&
exec 3<> "$tempdir/fifo" || exit 1

external_links=(
    https://www.designernews.co/error
    https://www.awwwards.com/error1/
    https://dribbble.com/error1
    https://www.designernews.co
    https://www.awwwards.com
    https://dribbble.com
)

# each background job writes one "<http_code>/<url>" line to fd 3
for url in "${external_links[@]}"
do
    {
        curl -4sLI -o /dev/null -w '%{http_code}' --max-time 60 --connect-timeout 30 --retry 2 "$url"
        echo "/$url"
    } >&3 &
done
# read exactly one result line per URL back from the FIFO
invalid_links=()
for (( i = ${#external_links[@]}; i > 0; i-- ))
do
    IFS='/' read -u 3 -r http_code url
    (( 200 <= http_code && http_code <= 299 )) || invalid_links+=( "$http_code $url" )
done
echo "Found ${#invalid_links[@]} invalid links:"
(( ${#invalid_links[@]} > 0 )) && printf '%s\n' "${invalid_links[@]}"
remarks:
- Here I consider a link valid when curl's output is in the 200-299 range.
- When the web server doesn't exist or doesn't reply, curl's output is 000.
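One small addition of my own, not part of the answer above: the script never removes the temporary directory or closes the FIFO. If that matters, a trap placed right after the exec line takes care of it:
# my addition: close fd 3 and remove the temp dir when the script exits
trap 'exec 3>&-; rm -rf "$tempdir"' EXIT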
UPDATE

Limiting the number of parallel curl requests to 10.

It's possible to add that logic in plain shell (see the sketch after the xargs command below), but if you have BSD or GNU xargs then you can replace the whole for url in "${external_links[@]}"; do ...; done loop with:
printf '%s\0' "${external_links[@]}" |
xargs -0 -P 10 -n 1 sh -c '
    # xargs passes each URL as $0 of this sh script
    curl -4sLI \
        -o /dev/null \
        -w "%{http_code}" \
        --max-time 60 \
        --connect-timeout 30 \
        --retry 2 "$0"
    printf /%s\\n "$0"
' 1>&3
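If you don't have xargs either, here is a rough, untested shell-only sketch of the same 10-job cap; it keeps the loop body from the answer above and relies on wait -n (bash 4.3 or newer). This is my addition, not part of the original answer:
max_jobs=10
running=0
for url in "${external_links[@]}"
do
    {
        curl -4sLI -o /dev/null -w '%{http_code}' --max-time 60 --connect-timeout 30 --retry 2 "$url"
        echo "/$url"
    } >&3 &
    # once max_jobs jobs are in flight, wait for one of them to finish
    (( ++running < max_jobs )) || { wait -n; (( running-- )); }
done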
Furthermore, if your curl is at least 7.75.0, you should be able to replace the sh -c '...' in the xargs command above with a single curl command (untested):
printf '%s\0' "${external_links[@]}" |
xargs -0 -P 10 -n 1 \
    curl -4sLI \
        -o /dev/null \
        -w "%{http_code}/%{url}\n" \
        --max-time 60 \
        --connect-timeout 30 \
        --retry 2 \
1>&3
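As a side note, the question also mentions curl's built-in --parallel mode. With a curl that supports both --parallel and the %{url} write-out variable (7.75.0 or newer for the latter), a single curl invocation fed by a generated config file could replace xargs entirely. This is a rough, untested sketch of my own; the config-file approach and the exact flag combination are assumptions, not part of the answer above, and the read loop stays unchanged:
# untested sketch: one curl process runs up to 10 transfers in parallel;
# each "url =" entry sends its headers to /dev/null and the -w line
# writes "<http_code>/<url>" to fd 3 as every transfer finishes
config=$(mktemp) || exit 1
printf 'url = "%s"\noutput = "/dev/null"\n' "${external_links[@]}" > "$config"

curl -4sLI \
    --parallel --parallel-max 10 \
    --max-time 60 --connect-timeout 30 --retry 2 \
    -w '%{http_code}/%{url}\n' \
    -K "$config" 1>&3

rm -f "$config"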