I have a file in which the word Sweden appears in different variations.
I am trying to check whether the 34th column contains Sweden:
awk -F\" '$34 ~ /Sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/ {print $0}' $ipp >> sweden.csv &
As far as I know, this is going to be very slow, because I have 650 million rows and each command reads the whole file.
Is there any way to match all the variations in one awk command?
CodePudding user response:
You can use this awk command:
awk -F\" 'tolower($34) ~ /sweden|^se$/' "$ipp" >> sweden.csv
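To see how the single tolower() pass behaves, here is a small sketch on made-up data. The make_row helper, the filler values f1..f16 and the sample country names are assumptions for illustration only; the awk condition is the one from the answer above, printing just field 34 so the result is easy to read.

```shell
# Hypothetical helper: builds a row whose field 34 (when splitting on ") is "$1".
# Fields 2, 4, ..., 32 hold the fillers f1..f16; field 34 holds the country.
make_row() {
  row=''
  for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
    row="$row\"f$i\","
  done
  printf '%s"%s","tail"\n' "$row" "$1"
}

sample=$(mktemp)
for country in Sweden SWEDEN sweden se Se SE Norway Seychelles; do
  make_row "$country"
done > "$sample"

# One pass, case-insensitive: "sweden" anywhere in field 34, or exactly "se".
matches=$(awk -F\" 'tolower($34) ~ /sweden|^se$/ { print $34 }' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Norway and Seychelles are filtered out: neither contains sweden, and the anchors in ^se$ require the whole field to be se.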
CodePudding user response:
With your shown samples and attempts, please try the following awk
code. It sets the field separator to "
and, in the main block, checks whether the 34th field either contains sweden
(in any combination of upper and lower case) or is exactly se
(again in either case). If either condition is true, the line is printed.
awk -F\" '$34 ~ /[Ss][Ww][Ee][Dd][Ee][Nn]|^[Ss][Ee]$/' "$ipp" >> sweden.csv
CodePudding user response:
If you're using GNU awk, you can set the IGNORECASE
variable (a gawk extension):
awk -F\" 'BEGIN{IGNORECASE=1} $34 ~ /sweden|^se$/' "$ipp" >> sweden.csv
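Since IGNORECASE only has an effect in GNU awk (other awks treat it as an ordinary variable), a script could check which awk it has before relying on it. The following is a rough sketch under that assumption; the sample rows and the filler values are made up, and the detection relies on gawk printing "GNU Awk" in its --version output.

```shell
# Made-up sample: 16 filler values, then the country in field 34.
row=''
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
  row="$row\"f$i\","
done
sample=$(mktemp)
for country in Sweden SWEDEN se SE Norway; do
  printf '%s"%s","tail"\n' "$row" "$country"
done > "$sample"

# gawk identifies itself as "GNU Awk" in its --version output.
if awk --version 2>/dev/null </dev/null | grep -q 'GNU Awk'; then
  count=$(awk -F\" 'BEGIN{IGNORECASE=1} $34 ~ /sweden|^se$/' "$sample" | wc -l | tr -d ' ')
else
  # Portable fallback: lowercase the field instead of relying on IGNORECASE.
  count=$(awk -F\" 'tolower($34) ~ /sweden|^se$/' "$sample" | wc -l | tr -d ' ')
fi
echo "matching rows: $count"

rm -f "$sample"
```

Both branches should select the same rows, so the fallback can also be used unconditionally if portability matters more than brevity.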
CodePudding user response:
Your code can be improved as already explained. More generally, you can put the 6 pattern-action pairs into a single awk
call rather than 6 separate ones, that is,
awk -F\" '$34 ~ /Sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /sweden/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /SWEDEN/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^Se$/ {print $0}' $ipp >> sweden.csv &
awk -F\" '$34 ~ /^SE$/ {print $0}' $ipp >> sweden.csv &
might be written more concisely as
awk -F\" '$34 ~ /Sweden/ {print $0}$34 ~ /sweden/ {print $0}$34 ~ /SWEDEN/ {print $0}$34 ~ /^se$/ {print $0}$34 ~ /^Se$/ {print $0}$34 ~ /^SE$/ {print $0}' $ipp >> sweden.csv &
Note that if a line contains both Sweden
and SWEDEN,
it will appear twice (in both the 6 x awk
and the 1 x awk
solution), and the order of lines in the output might differ between the two approaches.
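The duplicate output can be reproduced on a single made-up row, and one possible way to avoid it (not part of the answer above, just a sketch) is to end each action with next, so that awk stops evaluating further rules for a row once it has been printed. The filler values and the field content "Sweden SWEDEN" are assumptions for illustration.

```shell
# A single made-up row whose field 34 contains both spellings.
row=''
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
  row="$row\"f$i\","
done
line=$(printf '%s"%s","tail"' "$row" "Sweden SWEDEN")

# Two pattern-action pairs both match, so the row is printed twice.
twice=$(printf '%s\n' "$line" |
  awk -F\" '$34 ~ /Sweden/ {print $0} $34 ~ /SWEDEN/ {print $0}' |
  wc -l | tr -d ' ')

# With next, the remaining rules are skipped after the first match.
once=$(printf '%s\n' "$line" |
  awk -F\" '$34 ~ /Sweden/ {print; next} $34 ~ /SWEDEN/ {print; next}' |
  wc -l | tr -d ' ')

echo "without next: $twice line(s), with next: $once line(s)"
```

Combining the alternatives into one regular expression, as in the other answers, avoids the duplicates for the same reason: each row is tested by only one rule.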