I want to process annotations in an object detection dataset (YOLO format). The first 5 words are class and xywh coordinates, and everything after is the segmentation data. I want to remove everything after the first 5 words, preferably using bash. What I want to achieve is similar to:
for file in *.txt; do sed -i 's/(PATTERN GOES HERE)//g' "$file"; done
I tried using a lookbehind assertion, (?<=\W{5}).*, but it does not work.
CodePudding user response:
Using sed, you could keep the first 5 "words" by matching, 5 times, 1 or more characters other than whitespace (with whitespace between them) inside a capture group.
Then match the rest of the line after it, and replace the whole match with capture group 1:
sed 's/^\([[:space:]]*\([^[:space:]]\+[[:space:]]\+\)\{4\}[^[:space:]]\+\).*/\1/' file
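To tie this back to the loop in the question, here is a minimal sketch of the same idea applied in place to every .txt file. It assumes GNU sed (for -i without a backup suffix) and uses the -E (extended regex) spelling of the expression above, so the group and quantifier backslashes go away. The scratch directory and the sample label lines are made up for illustration.

```shell
# Sketch only: the sample data and scratch directory are hypothetical.
workdir=$(mktemp -d)          # work on a copy, not on real label files
printf '%s\n' \
  '0 0.51 0.42 0.20 0.30 0.11 0.12 0.13 0.14' \
  '3 0.10 0.20 0.05 0.06 0.70 0.71' > "$workdir/sample.txt"

for file in "$workdir"/*.txt; do
  # Capture 4 "word + whitespace" pairs plus a 5th word, then drop the rest of the line.
  sed -E -i 's/^([[:space:]]*([^[:space:]]+[[:space:]]+){4}[^[:space:]]+).*/\1/' "$file"
done

result=$(cat "$workdir/sample.txt")
printf '%s\n' "$result"
```

Lines with fewer than 5 words do not match the expression and are left unchanged.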
CodePudding user response:
Perl works here:
First, create a file with columns:
$ seq 100 | paste - - - - - - - - - - > file
$ cat file
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
Now keep only the first 5 fields:
$ perl -i -lane 'print "@F[0..4]"' file
and we're left with:
$ cat file
1 2 3 4 5
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
61 62 63 64 65
71 72 73 74 75
81 82 83 84 85
91 92 93 94 95
CodePudding user response:
What I have found to work is:
^(?:\S+\s+){5}\K.*
Explanation:
^ Start of a line
(?: Non-capturing group start
\S+\s+ Match at least one non-whitespace char, followed by at least one whitespace char
) Non-capturing group end
{5} Repeat five times
\K Pretend the match has started at this point
.* Match everything
CodePudding user response:
Using grep (with awk to strip the line-number prefix that grep -n adds)
Note: This keeps the original field separators intact and skips lines with fewer than n words.
% n=5
% grep -Eno "([[:alnum:],\.]+[[:blank:]]+){$n}" file |
awk '/:/{gsub(/.*:/, "", $0); print}'
Lorem ipsum dolor sit amet,
incididunt ut labore et dolore
nostrud exercitation ullamco laboris nisi
Duis aute irure dolor in
fugiat nulla pariatur. Excepteur sint
labore labore labore culpa qui
Data
% tab=$(printf "\t")
% cat << EOF > file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure${tab}dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat${tab} nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
labore labore labore culpa qui officia deserunt mollit anim id est laborum.
EOF