Bash script: filter columns based on a character

My text file should have two columns separated by a tab (shown as \t below). However, a few rows are corrupted: column 1 contains two values separated by a space (shown as \s).

A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5

My objective is to create a table as follows:


A\t1
B\t2
C\t3
D\t4
E\t5

That is, I want to discard the second value that follows the space in column 1. For example, in C\sx\t3 I want to drop the x after the space and store the row as C\t3.

I have tried a couple of things but with no luck.

I tried to split the file on \t into separate columns, cut the first column on \s, and then join the columns again. However, it did not work. Here is the snippet:

col1=($(cut -d$'\t' -f1 "$file" | cut -d' ' -f1))
col2=($(cut -d$'\t' -f2 "$file"))
myArr=()
for((idx=0;idx<${#col1[@]};idx++));do
  echo "${col1[$idx]} ${col2[$idx]}"
  # I will append to myArr here
done

The output appends the list of col2 to col1, i.e. A B C D E 1 2 3 4 5. On top of this, my file is huge (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.

Any advice is very much appreciated.

Thank you. :)

CodePudding user response：

Assuming that by "a space" you mean a literal blank character, then with any awk:

awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
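
To see what each part of that one-liner does, here is a small sketch that runs it against the sample rows from the question. The function name strip_col1_extra and the printf-generated input are only for illustration; they are not part of the original answer.

```shell
# Hypothetical wrapper around the awk one-liner above.
strip_col1_extra() {
  # FS=OFS="\t"        : split and rejoin on tabs only, so the space stays inside $1
  # sub(/ .*/,"",$1)   : delete everything from the first space onward in column 1
  # trailing 1         : always-true pattern, prints every record
  awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' "$@"
}

# Sample rows from the question, generated inline instead of read from a file.
printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5\n' | strip_col1_extra
```

Because awk streams the input line by line, this should avoid the explicit shell loop the question is worried about.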

CodePudding user response：

Solution using Perl regular expressions (for me they are easier than sed's, and more portable, since sed comes in several differing versions)

$ cat ls
A   1
B   2
C x 3
D   4
E y 5

$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5

This captures the leading run of non-whitespace characters and the run of non-whitespace characters after the \t, then joins the two with a space.

CodePudding user response：

Try

sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file

The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
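
If the command has to run under a shell without $'...' (plain POSIX sh, for instance), a similar effect can probably be had by putting a literal tab into a variable and using double quotes. A minimal sketch, where tab and strip_col1_extra_sed are names chosen here for illustration:

```shell
# Portable variant: hold a literal tab in a variable instead of using $'...'.
tab=$(printf '\t')

strip_col1_extra_sed() {
  # Double quotes let the shell expand $tab; \( \) and \1 reach sed unchanged.
  # The pattern keeps the leading non-space, non-tab run and drops the
  # space plus everything after it up to (not including) the tab.
  sed "s/^\([^ $tab]*\) [^$tab]*/\1/" "$@"
}

# Sample rows from the question, generated inline.
printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5\n' | strip_col1_extra_sed
```

Rows without a space in column 1 do not match the pattern and pass through unchanged.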