With regex (specifically PCRE/server, but can also python), I want to remove all occurrences of the lines that contain a single letter comma (/,.,/g)
and with the word "Labels:"
So for example in these lines:
Labels: K,ltemittel,System,j,Vakuum,s
Another tags: a,b,xxx,c,yyy,z
to
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z
What I've tried:
- non-capturing group ("Labels:" still also getting replaced)
- lookahead and lookbehind (cannot use greedy)
- grouping
/(Labels:)*(,.,)
(also capturing the non "Labels:")
CodePudding user response:
You could potentially use:
(?i)(^(?!Labels:).*)|\b[a-z],|,[a-z]\b
See an online demo
(?i)
- Set case-insensitive matching 'on';(
- Open 1st capture group;^
- Start string anchor;(?!labels:)
- Assert position is not followed by 'Labels:';.*
- Match (Greedy) 0 characters other than newline;)
- Close 1st capture group;
|
- Or;\b[a-z],
- Match a word-boundary followed by a single letter and a comma;|
- Or;,[a-z]\b
- Match a comma followed by a single letter and a word-boundary.
Now replace it with your 1st capture group.
CodePudding user response:
Using sed
$ sed '/Labels:/s/,[A-Za-z]\>//g;s/\<[A-Za-z],//' input_file
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z
Explanation (Added By Tripleee)
It looks for a comma, followed by an alphabetic, followed by a word boundary, i.e. the label after the comma is a single letter. Then, it removes any remaining single-letter label immediately before a comma by similar logic
CodePudding user response:
Another variation using gnu-awk
.
For a line that starts with Labels:
replace a comma followed by a single char a-z or A-Z and a word boundary with an empty string.
awk '/^Labels:/{gsub(/,[a-zA-Z]\y|\y[a-zA-Z],/, "")};1' file
Output
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z
As you have tagged Python and pcre, another option is to use the \G
anchor and match Label:
at the start of the string, and capture in group 1 what you want to keep.
(?:^Labels:\h*|\G(?!^))\K(?:([^\s,]{2,}(?:,(?![a-z]$))?)|,?[a-z],?)
See a regex demo and a Python demo using the Python PyPi regex module.
CodePudding user response:
This might work for you (GNU sed):
sed -E '/Labels/{s/( )\S,|(,)\S,|,\S$/\1\2/g;s//\1\2/g}' file
If a line contains Labels
, pattern match for 3 alternate matches and if either the first and second match replace by the matching back reference. Repeat for any overlapping.