Regex Must Match a Word (not to replace) AND a Pattern (to replace) in a Line

Time:12-29

With regex (specifically PCRE/server, though Python is also fine), I want to remove every single-letter item from a comma-separated list (roughly /,.,/g), but only on lines that contain the word "Labels:".

For example, these lines:

Labels: K,ltemittel,System,j,Vakuum,s
Another tags: a,b,xxx,c,yyy,z

should become:

Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z

What I've tried:

  • a non-capturing group ("Labels:" still gets replaced as well)
  • lookahead and lookbehind (cannot use greedy quantifiers)
  • grouping with /(Labels:)*(,.,)/ (also matches on lines without "Labels:")

CodePudding user response:

You could potentially use:

(?i)(^(?!Labels:).*)|\b[a-z],|,[a-z]\b

See an online demo


  • (?i) - Set case-insensitive matching 'on';
  • ( - Open 1st capture group;
    • ^ - Start of string anchor (start of each line when multiline mode is on);
    • (?!Labels:) - Assert position is not followed by 'Labels:';
    • .* - Match (greedy) 0+ characters other than newline;
    • ) - Close 1st capture group;
  • | - Or;
  • \b[a-z], - Match a word-boundary followed by a single letter and a comma;
  • | - Or;
  • ,[a-z]\b - Match a comma followed by a single letter and a word-boundary.

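Now replace each match with the 1st capture group. That group is empty whenever one of the single-letter alternatives matched, so those matches are simply removed, while lines not starting with "Labels:" are put back unchanged.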
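For readers who want to try this from Python, here is a minimal sketch using the standard re module. The function name and the sample text are only illustrative, and it assumes multiline mode so that ^ matches at the start of every line:

```python
import re

# Pattern from the answer above. re.MULTILINE makes ^ match at each line start.
PATTERN = re.compile(r"(?i)(^(?!Labels:).*)|\b[a-z],|,[a-z]\b", re.MULTILINE)

def remove_single_letter_labels(text):
    # Lines not starting with "Labels:" are captured whole in group 1 and
    # written back unchanged. For the single-letter alternatives group 1 is
    # unset, which Python (3.5+) substitutes as an empty string.
    return PATTERN.sub(r"\1", text)

sample = "Labels: K,ltemittel,System,j,Vakuum,s\nAnother tags: a,b,xxx,c,yyy,z"
print(remove_single_letter_labels(sample))
```

Note that on a "Labels:" line the word-boundary alternatives can match anywhere in the line, not only inside the list itself.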

CodePudding user response:

Using sed

$ sed '/Labels:/{s/,[A-Za-z]\>//g;s/\<[A-Za-z],//}' input_file
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z

Explanation (Added By Tripleee)

On lines matching Labels:, it looks for a comma, followed by an alphabetic character, followed by an end-of-word boundary, i.e. the label after the comma is a single letter, and removes every such occurrence. Then it removes a remaining single-letter label immediately before a comma by similar logic.
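The same two-step idea can be sketched in Python, processing one line at a time. This is not a translation of the sed internals, just an approximation: the helper name is made up, and \b stands in for sed's \< and \>, which behaves the same here because a letter sits next to it.

```python
import re

def clean_labels_line(line):
    # Only touch lines that contain "Labels:", like the sed address does.
    if "Labels:" not in line:
        return line
    # Step 1: drop every ",X" where X is a single letter ending a word.
    line = re.sub(r",[A-Za-z]\b", "", line)
    # Step 2: drop a remaining "X," where X is a single letter starting a word.
    line = re.sub(r"\b[A-Za-z],", "", line)
    return line

for raw in ["Labels: K,ltemittel,System,j,Vakuum,s", "Another tags: a,b,xxx,c,yyy,z"]:
    print(clean_labels_line(raw))
```

Guarding both substitutions with the same condition matters: without it the second step would also strip "a," from the "Another tags:" line.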

CodePudding user response:

Another variation using gnu-awk.

For a line that starts with Labels:, replace either a comma followed by a single char a-z or A-Z and a word boundary, or a word boundary followed by such a char and a comma, with an empty string.

awk '/^Labels:/{gsub(/,[a-zA-Z]\y|\y[a-zA-Z],/, "")};1' file

Output

Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z

As you have tagged Python and pcre, another option is to use the \G anchor, match Labels: at the start of the string, and capture in group 1 what you want to keep.

(?:^Labels:\h*|\G(?!^))\K(?:([^\s,]{2,}(?:,(?![a-z]$))?)|,?[a-z],?)

See a regex demo and a Python demo using the PyPI regex module (the built-in re module supports neither \G nor \K).

CodePudding user response:

This might work for you (GNU sed):

sed -E '/Labels/{s/( )\S,|(,)\S,|,\S$/\1\2/g;s//\1\2/g}' file

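If a line contains Labels, match any of three alternatives: a single character between a space and a comma, a single character between two commas, or a comma followed by a single character at the end of the line. For the first two, the leading space or comma is kept via its back reference; the third is removed outright. The second substitution reuses the same regex (empty pattern) to catch matches that overlapped with one from the first pass.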
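To see why the second pass is there, here is a rough Python sketch of the same idea. The function name and the second sample line are invented for illustration, and two passes are simply assumed to be enough, as in the sed command:

```python
import re

# Same three alternatives as the sed command above.
PATTERN = re.compile(r"( )\S,|(,)\S,|,\S$")

def strip_single_char_labels(line):
    if "Labels" not in line:
        return line
    # Run the substitution twice: a match consumes its trailing comma, so a
    # single-character label right after a removed one is only seen on the
    # second pass. Unset groups are substituted as empty strings (Python 3.5+).
    for _ in range(2):
        line = PATTERN.sub(r"\1\2", line)
    return line

print(strip_single_char_labels("Labels: K,ltemittel,System,j,Vakuum,s"))
print(strip_single_char_labels("Labels: a,b,c,xx,d"))
```

In the second sample, "b" survives the first pass because its leading separator was consumed when "a," was matched; the second pass removes it.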
