find pattern and characters up to space, and move captured pattern to end of line sed

Time:10-14

I would like to find a particular pattern ("k__") plus any characters after it, up to a space, and then move that captured text to the end of the line.

With this example file:

cat test.file
37099   k__Eukaryota species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Bacteria species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__

So, I'd like to match "k__Eukaryota" and "k__Bacteria" (and any other token that starts with k__) and then move the captured match to the end of the line. Desired output:

37099    species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Bacteria

I thought it would be easy, but I can't get it to work. Here is what I've tried:

cat test.file | gsed -E 's#(.*k__)(k__\w\ )(.*)#\1\3\2#'

Capture text until the pattern, then match (capture the pattern and any word characters up to whitespace), then capture to the end of the line, and then change the order of the capturing groups.

I think I can back-reference these groups to change the order, but I'm probably not matching them correctly. How do I capture up to my pattern, then the pattern itself ("k__xyz"), then the rest of the line, and reorder those groups? Is this the right approach?

Any help is much appreciated!

LP

CodePudding user response:

Use this Perl one-liner:

perl -lpe 's{^(.*?\s)(k__\S+)\s+(.*)}{$1$3 $2}' test.file > out.file

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

^ : Beginning of the line.
(.*?\s) : 0 or more of any characters (non-greedy), ending with whitespace, capture and store in variable $1.
(k__\S+) : Literal k__ followed by 1 or more non-whitespace characters, capture and store in variable $2.
\s+(.*) : 1 or more whitespace characters. Then 0 or more of any characters, capture and store in variable $3.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)

CodePudding user response:

To edit the original file in place, add the '-i' option:
sed -i -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file
To save the result to another file instead, leave out '-i' and redirect the output:
sed -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file > new.file

My test result:

szvp000006656:/home # cat test.file
37099   k__Eukaryota species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista

szvp000006656:/home # sed -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file > new.file
szvp000006656:/home # cat new.file
37099    species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota

szvp000006656:/home # sed -i -r 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/g' test.file
szvp000006656:/home # cat test.file
37099    species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
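
If you want to try the substitution without touching any file, a minimal sketch like the one below may help. The function name `move_kingdom` and the two sample lines are made up for illustration; I use -E here because the question already uses `gsed -E`, and GNU sed accepts it as well as -r. The /g flag is dropped since the pattern can match only once per line anyway.

```shell
# Hypothetical helper: move the k__ token to the end of each input line.
move_kingdom() {
  sed -E 's/(.*)(k__[^ ]*)( .*)/\1\3 \2/'
}

# First line has a k__ token, second line has none.
out=$(printf '%s\n' \
  '73015   k__Eukaryota species:s__Monodus sp. CCMP505;phylum:p__' \
  'no kingdom token on this line' | move_kingdom)

printf '%s\n' "$out"
```

Lines without a k__ token do not match the pattern, so sed prints them unchanged.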

Note:

  1. https://regexr.com/ is handy for debugging regular expressions.
  2. Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier (.*?); that needs a newer regex flavor such as Perl's. In sed, use a negated bracket expression such as [^/]* (here, [^ ]*) instead of .*? (chaos, Stack Overflow).
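
As a rough illustration of point 2, the sketch below marks what gets matched with angle brackets, once with a greedy `.*` and once with a negated bracket expression. The sample line and the variable names `greedy` and `bounded` are invented for the demo.

```shell
line='1 k__Eukaryota species:s__Isochrysis galbana'

# Greedy: ".* " runs on to the LAST space on the line.
greedy=$(printf '%s\n' "$line" | sed -E 's/k__.* /<&>/')

# Negated bracket expression: "[^ ]*" stops at the FIRST space.
bounded=$(printf '%s\n' "$line" | sed -E 's/k__[^ ]*/<&>/')

printf '%s\n' "$greedy" "$bounded"
```

The first result wraps everything from k__ through the last space, while the second wraps only k__Eukaryota, which is the behaviour that `.*?` would give in Perl.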