Regex to match everything after the first N words

I want to process annotations in an object detection dataset (YOLO format). The first 5 words are class and xywh coordinates, and everything after is the segmentation data. I want to remove everything after the first 5 words, preferably using bash. What I want to achieve is similar to:

for file in *.txt; do sed -i 's/(PATTERN GOES HERE)//g' "$file"; done

I tried using a lookbehind assertion, (?<=\W{5}).*, but it does not work.

CodePudding user response：

Using sed, you can keep the first 5 "words" by capturing one or more non-whitespace characters five times, with whitespace between them.

Then match the rest of the line after them, and replace the whole match with capture group 1:

sed 's/^\([[:space:]]*\([^[:space:]]\+[[:space:]]\+\)\{4\}[^[:space:]]\+\).*/\1/' file
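
As a quick check, here is a minimal sketch of the same substitution written with extended regular expressions (`sed -E`), which avoids most of the backslashes. The sample annotation line and the helper names `keep_first_five` and `trim_all_labels` are made up for illustration. The loop goes through a temporary file rather than `sed -i`, on the assumption that the script may have to run on both GNU and BSD sed, whose `-i` syntax differs.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each input line cut down to its first five
# whitespace-separated words. Lines with fewer than five words do not
# match the pattern and are printed unchanged.
keep_first_five() {
  sed -E 's/^([[:space:]]*([^[:space:]]+[[:space:]]+){4}[^[:space:]]+).*/\1/' "$@"
}

# Made-up YOLO-style line: class, x, y, w, h, then segmentation points.
printf '%s\n' '0 0.51 0.42 0.10 0.20 0.11 0.12 0.13 0.14' | keep_first_five
# prints: 0 0.51 0.42 0.10 0.20

# Hypothetical helper: rewrite every .txt file in the current directory.
trim_all_labels() {
  local file tmp
  for file in *.txt; do
    [ -e "$file" ] || continue   # no .txt files: the glob stays literal
    tmp=$(mktemp) || return 1
    # Write the result back only if sed succeeded, then clean up.
    keep_first_five "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
  done
}
```

Calling `trim_all_labels` inside the label directory would then replace the one-line loop from the question; try it on a copy of the data first.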

CodePudding user response：

Perl works here:

First, create a file with columns:

$ seq 100 | paste - - - - - - - - - - > file
$ cat file
1   2   3   4   5   6   7   8   9   10
11  12  13  14  15  16  17  18  19  20
21  22  23  24  25  26  27  28  29  30
31  32  33  34  35  36  37  38  39  40
41  42  43  44  45  46  47  48  49  50
51  52  53  54  55  56  57  58  59  60
61  62  63  64  65  66  67  68  69  70
71  72  73  74  75  76  77  78  79  80
81  82  83  84  85  86  87  88  89  90
91  92  93  94  95  96  97  98  99  100

Now keep only the first 5 fields:

$ perl -i -lane 'print "@F[0..4]"' file

and we're left with:

$ cat file
1 2 3 4 5
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
61 62 63 64 65
71 72 73 74 75
81 82 83 84 85
91 92 93 94 95

CodePudding user response：

What I have found to work is:

^(?:\S+\s+){5}\K.*

Explanation:

^ Start of a line
 (?: Non-capturing group start
    \S+\s+ Match at least one non-whitespace char, followed by at least one whitespace char
          ) Non-capturing group end
           {5} Repeat five times
              \K Pretend the match has started at this point
                .* Match everything

CodePudding user response：

Using grep (with awk to strip the line-number prefix that grep -n adds)

Note: This keeps the original spacing between fields. Lines with fewer than n words are skipped.

% n=5

% grep -Eno "([[:alnum:],\.]+[[:blank:]]+){$n}" file |
    awk '/:/{gsub(/.*:/, "", $0); print}'
Lorem   ipsum dolor sit amet,
incididunt ut labore et    dolore
nostrud exercitation ullamco laboris nisi
Duis      aute irure    dolor in
fugiat      nulla pariatur. Excepteur sint
labore labore labore culpa qui

Data

% tab=$(printf "\t")

% cat << EOF > file
Lorem   ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et    dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis      aute irure${tab}dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat${tab}    nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
labore labore labore culpa qui officia deserunt mollit anim id est laborum.
EOF