I would like to find a particular pattern ("k__") and any characters after it, up to a space, and then move that captured text to the end of the line.
With this example file:
cat test.file
37099 k__Eukaryota species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista
73015 k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015 k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015 k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015 k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015 k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
43925 k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925 k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925 k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925 k__Bacteria species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
So, I'd like to match "k__Eukaryota" and "k__Bacteria" (and any other token that starts with k__) and then move the captured match to the end of the line, e.g. desired output:
37099 species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
73015 species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015 species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015 species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015 species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015 species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
43925 species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925 species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925 species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925 species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Bacteria
I thought it would be easy, but I can't get it to work. Here is what I've tried:
cat test.file | gsed -E 's#(.*k__)(k__\w+\ )(.*)#\1\3\2#'
The idea: capture the text up to the pattern, then capture the pattern plus any word characters up to whitespace, then capture to the end of the line, and finally change the order of the capturing groups.
I think I can back-reference these groups to change their order, but I'm probably not matching them correctly. How do I capture up to my pattern, then the pattern itself ("k__xyz"), then the rest of the line, and rearrange those groups? Is this the right approach?
Any help is much appreciated!
LP
CodePudding user response:
Use this Perl one-liner:
perl -lpe 's{^(.*?\s)(k__\S+)\s+(.*)}{$1$3 $2}' test.file > out.file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
The regex uses these elements:
^ : Beginning of the line.
(.*?\s) : 0 or more of any characters (non-greedy), ending with whitespace; capture and store in variable $1.
(k__\S+) : Literal k__ followed by 1 or more non-whitespace characters; capture and store in variable $2.
\s+(.*) : 1 or more whitespace characters, then 0 or more of any characters; capture and store in variable $3.
SEE ALSO:
perldoc perlrun : how to execute the Perl interpreter: command line switches
perldoc perlre : Perl regular expressions (regexes)
CodePudding user response:
If you want to edit the original file in place, add the -i option:
sed -i -r 's/(.*)(k__[^ ]*) (.*)/\1\3 \2/g' test.file
If you want to save the result to another file, drop the -i option and redirect the output:
sed -r 's/(.*)(k__[^ ]*) (.*)/\1\3 \2/g' test.file > new.file
The space after the k__ token sits outside the third group so that the output does not end up with two spaces after the ID.
My test result:
szvp000006656:/home # cat test.file
37099 k__Eukaryota species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista
szvp000006656:/home # sed -r 's/(.*)(k__[^ ]*) (.*)/\1\3 \2/g' test.file > new.file
szvp000006656:/home # cat new.file
37099 species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
szvp000006656:/home # sed -i -r 's/(.*)(k__[^ ]*) (.*)/\1\3 \2/g' test.file
szvp000006656:/home # cat test.file
37099 species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
Note:
- https://regexr.com/ is useful for debugging regular expressions.
- Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier, so .*? is not available in sed; you need a more capable regex engine such as Perl's. In sed, use a negated character class such as [^ ]* in place of .*? (adapted from a Stack Overflow answer by chaos).
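As a rough illustration of the negated character class idea, the sketch below anchors the match at the start of the line and uses [^ ]* for both the ID and the k__ token. It assumes the k__ token is always the second space-separated field, as in the sample file; the sample line is shortened and the variable names are hypothetical.

```shell
# Hypothetical sample line, shortened from test.file.
line='43925 k__Bacteria species:s__Nannochloropsis oculata;phylum:p__'

# [^ ]* stops at the first space, so no non-greedy quantifier is needed.
#   \1 = ID, \2 = k__ token, \3 = remainder of the line
# Assumption: the k__ token is the second space-separated field.
# The separating spaces sit outside the groups, so the output keeps single spaces.
moved=$(printf '%s\n' "$line" | sed -E 's/^([^ ]*) (k__[^ ]*) (.*)$/\1 \3 \2/')

printf '%s\n' "$moved"
```

The -E flag is used here instead of -r because both GNU sed and BSD sed accept it.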