I have a tab-delimited file. In the 3rd column I want to remove the first word if both of the following are true: 1. the first word is all lowercase, 2. the second word is capitalized. After that, I want to trim the column to at most 2 words, unless the whole string is lowercase. Ideally I want to do this with awk, but something like sed would also work.
I managed to get it working with sed and awk together, but only by extracting that column on its own, and I would like to keep the whole line. I wasn't sure how to get sed to apply the pattern only to that column and, from what I read, awk doesn't support backreferences, so I wasn't sure how to do it without them.
cat text.txt |
awk -F"\t" '{print $3}' |
sed -E 's/^[a-z]* ([A-Z])/\1/' |
awk '{if($1 ~ /^[[:upper:]]/) {print $1, $2} else {print $0}}'
What I have:
col1 col2 col3
123 a string James jones MD MSc
154 string mister George smith
163 String mrs anne jones
193 String john
157 big string 1 dude George
What I want to get:
col1 col2 col3
123 a string James jones
154 string George smith
163 String mrs anne jones
193 String john
157 big string 1 George
CodePudding user response:
You may use this awk solution:
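awk '
BEGIN {FS=OFS="\t"}
# skip the header and all-lowercase values; require at least two words
NR > 1 && tolower($3) != $3 && split($3, a, / /) >= 2 {
   # start at word 2 if word 1 is all lowercase, otherwise at word 1
   p = (a[1] == tolower(a[1]) ? 2 : 1)
   # keep the start word plus the next one, if there is one
   $3 = a[p] (p < length(a) ? " " a[p+1] : "")
}
1' file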
col1 col2 col3
123 a string James jones
154 string George smith
163 String mrs anne jones
193 String john
157 big string 1 George
CodePudding user response:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
match($3,/^[[:lower:]]+ [[:upper:]]/) {
    $3 = substr($3,RLENGTH)
}
($3 ~ /[[:upper:]]/) && match($3,/[^ ]+ [^ ]+/) {
    $3 = substr($3,RSTART,RLENGTH)
}
{ print }
$ awk -f tst.awk file
col1 col2 col3
123 a string James jones
154 string George smith
163 String mrs anne jones
193 String john
157 big string 1 George
The columns above are tab-separated; to view them aligned:
$ awk -f tst.awk file | column -s$'\t' -t
col1 col2 col3
123 a string James jones
154 string George smith
163 String mrs anne jones
193 String john
157 big string 1 George
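Because the tabs are not visible in the listings above, here is a minimal self-contained sketch for reproducing the run. It inlines the same two rules as tst.awk rather than reading the script from a file; the helper names `make_sample` and `trim_col3` are made up for illustration, and the sample rows are the ones from the question.

```shell
#!/bin/sh
# Hypothetical helper: emit the question's sample rows with real tab separators.
make_sample() {
    printf 'col1\tcol2\tcol3\n'
    printf '123\ta string\tJames jones MD MSc\n'
    printf '154\tstring\tmister George smith\n'
    printf '163\tString\tmrs anne jones\n'
    printf '193\tString\tjohn\n'
    printf '157\tbig string 1\tdude George\n'
}

# Hypothetical helper: the two rules from tst.awk, inlined.
trim_col3() {
    awk '
    BEGIN { FS = OFS = "\t" }
    # Rule 1: a lowercase first word followed by a capitalized word.
    # RLENGTH spans "word X", so substr() restarts at the capital letter.
    match($3, /^[[:lower:]]+ [[:upper:]]/) { $3 = substr($3, RLENGTH) }
    # Rule 2: if any uppercase letter remains, keep only the first two words.
    ($3 ~ /[[:upper:]]/) && match($3, /[^ ]+ [^ ]+/) { $3 = substr($3, RSTART, RLENGTH) }
    { print }
    '
}

make_sample | trim_col3
```

Piping the result through `column -s$'\t' -t`, as shown above, aligns the fields for reading.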
CodePudding user response:
Using GNU sed (the `\t`, `\+` and `\?` escapes are GNU extensions):
$ sed 's/\(\([^\t]*\t\+\)\{2\}\)\([[:lower:]]* \)\?\([[:upper:]][^ ]* \?\)\([^ ]*\).*/\1\4\5/' input_file
col1 col2 col3
123 a string James jones
154 string George smith
163 String mrs anne jones
193 String john
157 big string 1 George
CodePudding user response:
This might work for you (GNU sed):
sed -E 's/^(.*\t)(([ [:lower:]]+)|([[:lower:]]+ )?([[:upper:]]\S+( \S+)?).*)$/\1\3\5/' file
Keep the first two tab-separated fields.
If the third field is all lowercase, keep it unchanged.
Otherwise, if the first word is lowercase and the second begins with an uppercase character, drop the first word of the third field; then keep at most two words and drop anything after them.
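The rules above can be tried on the question's sample rows with the sketch below. It is a variant, not the exact command from this answer: it assumes `[^[:space:]]` in place of GNU `\S` and a literal tab produced by `printf` in place of `\t`, so that it may also run on a non-GNU sed that supports `-E`. The function name `keep_names` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: apply the three rules to tab-separated lines on stdin.
keep_names() {
    tab=$(printf '\t')   # literal tab, instead of relying on \t in the regex
    # \1 = everything up to the last tab
    # \3 = an all-lowercase third field (kept whole)
    # \5 = a capitalized word plus at most one following word
    sed -E 's/^(.*'"$tab"')(([ [:lower:]]+)|([[:lower:]]+ )?([[:upper:]][^[:space:]]+( [^[:space:]]+)?).*)$/\1\3\5/'
}

printf '123\ta string\tJames jones MD MSc\n' | keep_names
printf '154\tstring\tmister George smith\n'  | keep_names
printf '163\tString\tmrs anne jones\n'       | keep_names
printf '157\tbig string 1\tdude George\n'    | keep_names
```

Only one of `\3` and `\5` is non-empty for any matching line, which is why both can appear in the replacement. Lines that match neither alternative, such as the header, pass through unchanged.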