Ignore line breaks when searching for patterns with bash

I have files containing a continuous stream of letters, wrapped at 10 letters per line, like so:

ABCDEFGHIJ
XXXXXXXXXX
XXXXXXXXXX
XXXXXXXXXX
XXXXABCDEF
ABCDEFGHIJ

I want to remove the Xs in groups of three, so the result should be:

ABCDEFGHIJ
XABCDEF
ABCDEFGHIJ

My current approach is

sed 's/XXX//g' inputFile > outputFile

but that only considers the pattern within a single line, and results in:

ABCDEFGHIJ
X
X
X
XABCDEF
ABCDEFGHIJ

How should I formulate the search pattern so that it ignores line breaks, essentially accepting XXX, X\nXX, and XX\nX? Is this possible with sed, or with another command?

CodePudding user response:

With GNU sed, modify your regex to allow an optional newline after each X:

sed -zE 's/X\n{0,1}X\n{0,1}X\n{0,1}//g' inputFile > outputFile

Or shorter:

sed -zE 's/(X\n?){3}//g' inputFile > outputFile

Output to outputFile:

ABCDEFGHIJ
XABCDEF
ABCDEFGHIJ

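-z: separate lines by NUL characters instead of newlines. Since the file contains no NUL characters, sed sees it as one long line, so \n in the pattern can match the line breaks.
-E: use extended regular expressions, which allows the (...){3} and ? syntax without backslashes.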
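
As a minimal sketch of the command above, assuming GNU sed is installed (the function name strip_triples is made up for illustration), the sample input from the question can be piped through it directly:

```shell
#!/usr/bin/env bash
# Sketch only: requires GNU sed for -z; strip_triples is a hypothetical name.
strip_triples() {
  # -z: records end at NUL, so a NUL-free input is read as a single record
  #     and \n in the regex can match the embedded line breaks.
  # -E: extended regex, so (X\n?){3} means
  #     "three Xs, each optionally followed by a newline".
  sed -zE 's/(X\n?){3}//g'
}

# Sample input from the question, one argument per line.
printf '%s\n' ABCDEFGHIJ XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXABCDEF ABCDEFGHIJ |
  strip_triples
```

Note that a newline directly following a removed group is removed along with it, and the remaining text is not re-wrapped to 10 characters per line.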

CodePudding user response:

This will do it:

paste -sd '' your_file | sed 's/XXX/   /g' | fold -w 10 | sed 's/ //g; /^$/d'

paste -sd '' your_file merges all the lines onto a single line.
sed 's/XXX/   /g' replaces each run of three Xs with three spaces, which keeps every remaining character in its original column for the fold step (note this will be problematic if the original file contains spaces, since the last step removes them all; in that case choose some other character that does not occur in the file).
fold -w 10 folds the long line back into lines 10 characters long.
sed 's/ //g; /^$/d' removes the spaces and then removes any blank lines (if you used some other unique replacement instead of spaces in the second step, remove that instead of spaces in this step).
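
The four steps can also be sketched as a single function. This variant assumes the character _ never occurs in the data and uses it as the placeholder instead of a space, along the lines suggested above. The name strip_triples_folded and the width argument are illustrative only, and paste -sd '\0' - is the POSIX spelling of "empty delimiter, read standard input", swapped in here for paste -sd '' your_file.

```shell
#!/usr/bin/env bash
# Sketch only: strip_triples_folded is a hypothetical helper name.
# Assumption: the placeholder character "_" never appears in the input.
strip_triples_folded() {
  local width=${1:-10}        # line length used by the input file
  paste -sd '\0' - |          # 1. join all lines into one long line
    sed 's/XXX/___/g' |       # 2. mark each XXX with three placeholders
    fold -w "$width" |        # 3. re-wrap at the original width
    sed 's/_//g; /^$/d'       # 4. drop placeholders, delete empty lines
}

# Sample input from the question, one argument per line.
printf '%s\n' ABCDEFGHIJ XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXABCDEF ABCDEFGHIJ |
  strip_triples_folded 10
```

Because the placeholders occupy the same columns as the removed Xs, the surviving letters stay on the lines they started on.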

Outputs

ABCDEFGHIJ
XABCDEF
ABCDEFGHIJ