Grep not recognizing white space

Time:11-01

I have a file (the first chapter of Harry Potter) with large amounts of white space. For example:

 CHAPTER ONE
  The Boy Who Lived
   M r and Mrs Dursley, of number four, Privet Drive, were
   proud to say that they were perfectly normal, thank
   you very much. They were the last people you’d expect to be
   involved in anything strange or mysterious, because they just
   didn’t hold with such nonsense.
    Mr Dursley was the director of a fi rm called Grunnings,
    which made drills. He was a big, beefy man with hardly
    any neck, although he did have a very large moustache.
    Mrs Dursley was thin and blonde and had nearly twice the
    usual amount of neck, which came in very useful as she spent
    so much of her time craning over garden fences, spying on the
    neighbours. The Dursleys had a small son called Dudders and

My objective, while learning command-line tools, is to first identify with grep, and then remove, the leading white space, as follows:

 CHAPTER ONE
The Boy Who Lived
M r and Mrs Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank
you very much. They were the last people you’d expect to be
involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.
Mr Dursley was the director of a fi rm called Grunnings,
which made drills. He was a big, beefy man with hardly
any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent
so much of her time craning over garden fences, spying on the
neighbours. The Dursleys had a small son called Dudders and

I'm trying to identify the lines with multiple consecutive white-space characters using grep. I've attempted the following (amongst others):

$ grep "(\s){2,}" file
$ grep "(\ ){2,}" file
$ grep "([[:space:]]){2,}" file
$ grep "[[:space:]]{2,}" file

None of these has produced any matches. I've confirmed with Vim that the white space is there, and I've tested each of those patterns on regex101.com. I've also run grep " " file (and variants), which correctly prints every line containing white space.

What is the correct syntax for this query?

CodePudding user response:

Given:

cat file
 CHAPTER ONE
  The Boy Who Lived
   M r and Mrs Dursley, of number four, Privet Drive, were
   proud to say that they were perfectly normal, thank
   you very much. They were the last people you’d expect to be
   involved in anything strange or mysterious, because they just
   didn’t hold with such nonsense.
    Mr Dursley was the director of a fi rm called Grunnings,
    which made drills. He was a big, beefy man with hardly
    any neck, although he did have a very large moustache.
    Mrs Dursley was thin and blonde and had nearly twice the
    usual amount of neck, which came in very useful as she spent
    so much of her time craning over garden fences, spying on the
    neighbours. The Dursleys had a small son called Dudders and

Your best bet is to use sed to delete the leading spaces:

sed -E 's/^[[:blank:]]{2,}//' file
 CHAPTER ONE
The Boy Who Lived
M r and Mrs Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank
you very much. They were the last people you’d expect to be
involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.
Mr Dursley was the director of a fi rm called Grunnings,
which made drills. He was a big, beefy man with hardly
any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent
so much of her time craning over garden fences, spying on the
neighbours. The Dursleys had a small son called Dudders and

Or with awk:

awk '{sub(/^[[:blank:]]{2,}/,"")} 1' file
# same output

If you only want to identify those lines that have 2 or more spaces at the beginning with grep:

grep -E '^[[:blank:]]{2,}' file
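To tie the two steps together, here is a minimal sketch of the identify-then-remove workflow from the question. The file names chapter.txt and chapter.clean.txt are placeholders rather than anything from the original post, and the sample text is a shortened stand-in for the chapter.

```shell
#!/bin/sh
# Hypothetical working directory and file names, used only for this sketch.
workdir=$(mktemp -d)
src="$workdir/chapter.txt"
dst="$workdir/chapter.clean.txt"

# A shortened stand-in for the chapter, with 1, 2, 3 and 4 leading spaces.
printf '%s\n' ' CHAPTER ONE' '  The Boy Who Lived' \
    '   proud to say that they were' '    Mr Dursley was the director' > "$src"

# Step 1: identify. -n prefixes each matching line with its line number.
grep -n -E '^[[:blank:]]{2,}' "$src"

# Step 2: remove. Write to a new file rather than editing in place.
sed -E 's/^[[:blank:]]{2,}//' "$src" > "$dst"

# Step 3: verify. Count lines that still start with two or more blanks.
remaining=$(grep -c -E '^[[:blank:]]{2,}' "$dst" || true)
echo "lines still indented by 2+ blanks: $remaining"
```

Writing to a separate file keeps the original intact, so the result can be compared with diff before anything is overwritten.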

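The issue you were having is that grep and sed use Basic Regular Expressions (BRE) by default. In BRE, unescaped ( ) and { } are ordinary characters, so a pattern like [[:space:]]{2,} looks for a white-space character followed by the literal text {2,}. Either escape them, as \( \) and \{ \}, or use the -E option to switch to Extended Regular Expressions (ERE).

In short, the difference between BRE and ERE is which characters act as operators without a backslash: ERE treats ( ) { } + ? | as special when unescaped, while BRE treats them as literals.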

awk uses ERE as a default.
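The BRE/ERE difference can be checked directly. The following is a minimal sketch that counts matching lines for the same interval pattern written three ways. The temporary sample file and the variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Build a small throwaway sample: 1, 2 and 3 leading spaces, then none.
sample=$(mktemp)
printf ' CHAPTER ONE\n  The Boy Who Lived\n   proud to say\nno indent\n' > "$sample"

# BRE (default), unescaped braces: "{2,}" is literal text, so nothing matches.
# "|| true" only hides grep's non-zero exit status when the count is 0.
bre_plain=$(grep -c '[[:space:]]{2,}' "$sample" || true)

# BRE with escaped braces: \{2,\} is the interval operator.
bre_escaped=$(grep -c '[[:space:]]\{2,\}' "$sample" || true)

# ERE via -E: braces form the interval operator without a backslash.
ere=$(grep -c -E '[[:space:]]{2,}' "$sample" || true)

echo "BRE unescaped: $bre_plain, BRE escaped: $bre_escaped, ERE: $ere"
# prints: BRE unescaped: 0, BRE escaped: 2, ERE: 2

rm -f "$sample"
```

The first count explains the original symptom: the pattern is valid, so grep reports no error, but it searches for text that is not in the file.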