Using bash script to remove from sentence words longer than [x] characters

Time:02-10

I have a sentence (array) and I would like to remove from it all words longer than 8 characters.

Example sentence:

var="one two three four giberish-giberish five giberish-giberish six"

I would like to get:

var="one two three four five six"

So far I'm using this:

echo $var | tr ' ' '\n' | awk 'length($1) <= 6 { print $1 }' | tr '\n ' ' '

The solution above works, but as you can see I'm replacing spaces with newlines, filtering the words, and then replacing the newlines with spaces again. I'm pretty sure there must be a better, more elegant solution that doesn't swap spaces and newlines.

CodePudding user response:

Using sed

$ sed 's/\<[a-z-]\{8,\}\> //g' file
var="one two three four five six"

CodePudding user response:

Here is one way to do it:

arr=(one two three four giberish-giberish five giberish-giberish six)
for var in "${arr[@]}"; do (( ${#var} > 8 )) || echo -n "$var "; done
echo # for that newline in the end

And another:

awk '{
    for (i = 1; i <= NF; i++) {
        if (length($i) <= 8) printf "%s ", $i
    }
    print ""  # for that newline in the end
}'

And a third!

awk -v RS='[[:space:]]+' 'length <= 8 { v = v " " $0 } END { print substr(v, 2) }'

The last one prints a "perfect" single-space delimited string with no extra leading or trailing whitespace.
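The awk loop over fields can also be wrapped in a small function so that the length limit becomes a parameter. This is only a sketch: the function name filter_words and its argument order are made up for illustration, and the limit of 8 is taken from the question's wording.

```shell
#!/bin/bash
# Hypothetical helper: print the words of $2 that are at most $1 characters long,
# joined by single spaces with no leading or trailing whitespace.
filter_words() {
    local max=$1 text=$2
    printf '%s\n' "$text" | awk -v max="$max" '{
        out = ""
        for (i = 1; i <= NF; i++)
            if (length($i) <= max)
                out = (out == "" ? $i : out " " $i)   # add a space only between words
        print out
    }'
}

var="one two three four giberish-giberish five giberish-giberish six"
result=$(filter_words 8 "$var")
echo "$result"
# -> one two three four five six
```

Building the output string inside awk avoids the trailing space that printf "%s " leaves behind.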

CodePudding user response:

You can use

#!/bin/bash
var="one two three four giberish-giberish five giberish-giberish six"
awk 'BEGIN{RS=ORS=" "} length($0) <= 6' <<< "$var"
# -> one two three four five six

BEGIN{RS=ORS=" "} sets both the input and output record separators to a space, and length($0) <= 6 keeps only the records (here, words) that are 6 characters or shorter.
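One caveat with RS=" " is worth checking: a here-string appends a newline, and since the newline is no longer a record separator it becomes part of the last record and counts toward its length. The sketch below, which assumes a sample sentence ending in a six-letter word (abcdef, invented for the test), compares the here-string with printf '%s', which sends no trailing newline.

```shell
#!/bin/bash
var="one two three four giberish-giberish five abcdef"

# Here-string: the final record is "abcdef" plus a newline (7 characters), so it is dropped.
with_herestring=$(awk 'BEGIN{RS=ORS=" "} length($0) <= 6' <<< "$var")

# printf '%s' adds no trailing newline, so the final record is "abcdef" (6 characters) and is kept.
with_printf=$(printf '%s' "$var" | awk 'BEGIN{RS=ORS=" "} length($0) <= 6')

# Brackets make the trailing space written by ORS visible.
printf '[%s]\n' "$with_herestring" "$with_printf"
```

Either way the output ends with a space because ORS is written after every record, so trim it if that matters.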

You can also consider alternatives using GNU sed or perl:

sed -E 's/\s*\S{7,}//g' <<< "$var"
perl -pe 's/\s*\S{7,}//g' <<< "$var"

A non-GNU sed workaround could look like

sed 's/[[:space:]]*[^[:space:]]\{7,\}//g' <<< "$var"

Here, all occurrences of zero or more whitespace characters (\s*, [[:space:]]*) followed by seven or more non-whitespace characters (\S{7,}, [^[:space:]]\{7,\}) are removed.
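Because the pattern removes the whitespace before a long word, a long word at the very start of the string leaves a leading space behind. A minimal sketch of one way to handle that, with the limit held in a shell variable (max and the sample sentence are assumptions, not part of the answer), could look like this:

```shell
#!/bin/bash
max=8   # assumed limit: words longer than this are removed
var="giberish-giberish one two giberish-giberish three"

# The quantifier {max+1,} matches words longer than $max characters;
# the second expression trims whitespace left at the start of the line.
result=$(printf '%s\n' "$var" |
    sed -E "s/[[:space:]]*[^[:space:]]{$((max + 1)),}//g; s/^[[:space:]]+//")

echo "$result"
# -> one two three
```

The sed script is in double quotes so that the shell expands $((max + 1)) before sed sees it.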

CodePudding user response:

In pure Bash, you can filter the words shorter than some chosen length into a new array:

#!/bin/bash

var="one two three four giberish-giberish five giberish-giberish six" 

new_arr=()
for w in $var; do  # no quotes on purpose to split string
    [[ ${#w} -lt 6 ]] && new_arr+=( "$w" )
done    

declare -p new_arr
# declare -a new_arr=([0]="one" [1]="two" [2]="three" [3]="four" [4]="five" [5]="six")
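If the goal is a string rather than an array, as in the question, the filtered array can be joined back together. The sketch below assumes a limit of 8 (from the question's wording) stored in a variable named max, which is not part of the answer above.

```shell
#!/bin/bash
var="one two three four giberish-giberish five giberish-giberish six"
max=8   # assumed threshold: keep words of at most this many characters

new_arr=()
for w in $var; do                  # unquoted on purpose: relies on word splitting
    (( ${#w} <= max )) && new_arr+=( "$w" )
done

# "${new_arr[*]}" joins the elements with the first character of IFS (a space by default).
result="${new_arr[*]}"
echo "$result"
# -> one two three four five six
```

Note that the unquoted $var is also subject to pathname expansion, so a word such as * would be replaced by file names unless globbing is disabled with set -f.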

Or if the source is already an array:

old_arr=(one two three four giberish-giberish five giberish-giberish six)
new_arr=()
for w in ${old_arr[@]}; do 
    [[ ${#w} -lt 6 ]] && new_arr+=( "$w" )
done 

You may want to delete the words in old_arr as you loop over it. If you know that each $w is unique, you can do:

old_arr=(one two three four giberish-giberish five giberish-giberish six)
for w in ${old_arr[@]}; do 
    [[ ${#w} -ge 6 ]] && old_arr=("${old_arr[@]/$w}")
done 

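But this has two issues: 1) the substitution removes $w from every element that contains it, so other words sharing that prefix are mangled as well, and 2) the deleted words leave empty strings at their existing indices: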

$ declare -p old_arr
declare -a old_arr=([0]="one" [1]="two" [2]="three" [3]="four" [4]="" [5]="five" [6]="" [7]="six")
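The first issue is easy to reproduce. In this small sketch the array contents are invented for the demonstration: removing giberish also strips it from the start of giberish-two, leaving a fragment behind.

```shell
#!/bin/bash
old_arr=(giberish giberish-two one)

for w in "${old_arr[@]}"; do
    # ${old_arr[@]/$w} deletes the first match of $w inside EVERY element.
    [[ ${#w} -ge 6 ]] && old_arr=("${old_arr[@]/$w}")
done

declare -p old_arr
# declare -a old_arr=([0]="" [1]="-two" [2]="one")
```

The word list of the for loop is expanded once, before the first iteration, so the second pass still looks for the original giberish-two, finds nothing to remove, and the fragment -two survives.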

You could also unset the offending item by keeping a separate index:

old_arr=(one two three four giberish-giberish five giberish-giberish six)
idx=0
for w in ${old_arr[@]}; do 
    [[ ${#w} -ge 6 ]] && unset 'old_arr[idx]'
    (( idx++ ))
done 

This leaves the array with non-contiguous indexes, although the remaining words keep their original positions:

$ declare -p old_arr
declare -a old_arr=([0]="one" [1]="two" [2]="three" [3]="four" [5]="five" [7]="six")

It is usually better to filter into a new array unless you want to keep the existing indexes.
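If you do unset elements in place and later want contiguous indexes again, re-assigning the array to its own expansion renumbers it from 0. The sketch below is a variation on the loop above, not the answer's exact code: it iterates over the indices with "${!old_arr[@]}" instead of keeping a separate counter.

```shell
#!/bin/bash
old_arr=(one two three four giberish-giberish five giberish-giberish six)

for idx in "${!old_arr[@]}"; do        # iterate over the indices directly
    (( ${#old_arr[idx]} >= 6 )) && unset 'old_arr[idx]'
done

old_arr=( "${old_arr[@]}" )            # re-assign to renumber the elements from 0
declare -p old_arr
# declare -a old_arr=([0]="one" [1]="two" [2]="three" [3]="four" [4]="five" [5]="six")
```

The quotes around "${old_arr[@]}" keep each element intact even if it contains whitespace.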

CodePudding user response:

This might work for you (GNU sed):

<<<"$var" sed -E 'y/ /\n/;s/..{8}.*\n//mg;y/\n/ /'

Translate spaces to newlines.

Remove all lines that are more than 8 characters long.

Translate newlines to spaces.
