Conditionally remove first word of nth column using awk/sed

I have a tab-delimited file, and in the 3rd column I want to remove the first word if both of the following are true: (1) the first word is all lowercase, and (2) the second word is capitalized. After that, I want to trim off anything beyond 2 words, unless the whole string is lowercase. Ideally I want to do this with awk, but something like sed would also work.

I managed to get it working with sed and awk together, but only on that column in isolation, and I would like to get the whole line. I wasn't sure how to get sed to match the pattern only in that column, and from what I read, awk doesn't support backreferences, so I wasn't sure how to do it without them.

cat text.txt |
awk -F"\t" '{print $3}' |
sed -E 's/^[a-z]* ([A-Z])/\1/' |
awk '{if($1 ~ /^[[:upper:]]/) {print $1, $2} else {print $0}}'

What I have:

col1    col2            col3
123     a string        James jones MD MSc
154     string          mister George smith
163     String          mrs anne jones
193     String          john
157     big string 1    dude George

What I want to get:

col1    col2            col3
123     a string        James jones
154     string          George smith
163     String          mrs anne jones
193     String          john
157     big string 1    George

CodePudding user response：

You may use this awk solution:

awk '
BEGIN {FS=OFS="\t"}
NR > 1 && tolower($3) != $3 && split($3, a, / +/) >= 2 {
   p = (a[1] == tolower(a[1]) ? 2 : 1)
   $3 = a[p] (p < length(a) ? " " a[p+1] : "")
}
1' file

col1    col2            col3
123     a string        James jones
154     string          George smith
163     String          mrs anne jones
193     String          john
157     big string 1    George


CodePudding user response：

$ cat tst.awk
BEGIN { FS=OFS="\t" }
match($3,/^[[:lower:]]+ +[[:upper:]]/) {
    $3 = substr($3,RLENGTH)
}
($3 ~ /[[:upper:]]/) && match($3,/[^ ]+ +[^ ]+/) {
    $3 = substr($3,RSTART,RLENGTH)
}
{ print }

$ awk -f tst.awk file
col1    col2    col3
123     a string        James jones
154     string  George smith
163     String  mrs anne jones
193     String  john
157     big string 1    George

The above columns are tab-separated, to visualize those columns aligned:

$ awk -f tst.awk file | column -s$'\t' -t
col1  col2          col3
123   a string      James jones
154   string        George smith
163   String        mrs anne jones
193   String        john
157   big string 1  George

CodePudding user response：

Using sed

$ sed s'/\(\([^\t]*\t\+\)\{2\}\)\([[:lower:]]* \)\?\([[:upper:]][^ ]* \?\)\([^ ]*\).*/\1\4\5/' input_file
col1    col2            col3
123     a string        James jones
154     string          George smith
163     String          mrs anne jones
193     String          john
157     big string 1    George

CodePudding user response：

This might work for you (GNU sed):

sed -E 's/^(.*\t)(([ [:lower:]]+)|([[:lower:]]+ )?([[:upper:]]\S+( \S+)?).*)$/\1\3\5/' file

Keep the first two tabbed fields.

If the third field consists only of lowercase letters and spaces, keep it unchanged.

However, if the first word is lowercase and the second begins with an uppercase character, drop the first word of the third field, along with every word from the fourth onward.
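
As a rough check of the rules above, the following sketch rebuilds the question's sample rows with real tab separators (via printf) and feeds them through the same substitution. The function name `trim_col3` is only a hypothetical wrapper for illustration, and the `\t` and `\S` escapes in the pattern assume GNU sed.

```shell
# Hypothetical helper: wraps the GNU sed substitution so it can sit in a pipeline.
# Assumes GNU sed (-E, \t and \S are not portable to every sed).
trim_col3() {
  sed -E 's/^(.*\t)(([ [:lower:]]+)|([[:lower:]]+ )?([[:upper:]]\S+( \S+)?).*)$/\1\3\5/'
}

# Sample rows from the question; printf reuses the format for each group of three.
printf '%s\t%s\t%s\n' \
  'col1' 'col2' 'col3' \
  '123' 'a string' 'James jones MD MSc' \
  '154' 'string' 'mister George smith' \
  '163' 'String' 'mrs anne jones' \
  '193' 'String' 'john' \
  '157' 'big string 1' 'dude George' |
  trim_col3
```

If the sed in use behaves like GNU sed, the third column should come out as in the "What I want to get" table from the question; piping the result through `column -s$'\t' -t` makes that easier to compare by eye.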