I have a file (the first chapter of Harry Potter) with large amounts of white space. For example:
CHAPTER ONE
The Boy Who Lived
M r and Mrs Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank
you very much. They were the last people you’d expect to be
involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.
Mr Dursley was the director of a fi rm called Grunnings,
which made drills. He was a big, beefy man with hardly
any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent
so much of her time craning over garden fences, spying on the
neighbours. The Dursleys had a small son called Dudders and
My objective, while learning command-line tools, is to first identify it with grep and then remove all the white space, as follows:
CHAPTER ONE
The Boy Who Lived
M r and Mrs Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank
you very much. They were the last people you’d expect to be
involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.
Mr Dursley was the director of a fi rm called Grunnings,
which made drills. He was a big, beefy man with hardly
any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent
so much of her time craning over garden fences, spying on the
neighbours. The Dursleys had a small son called Dudders and
I'm trying to identify the lines with multiple white spaces using grep. To that end, I've attempted the following (among others):
$ grep "(\s){2,}" file
$ grep "(\ ){2,}" file
$ grep "([[:space:]]){2,}" file
$ grep "[[:space:]]{2,}" file
None of these has produced any matches. I've confirmed with Vim that the file does contain white space, and I've checked each of those patterns on regex101.com. I've also run grep " " file (and variations) against the file, and it correctly outputs every line containing any white space.
What is the correct syntax for this query?
CodePudding user response:
Given:
cat file
CHAPTER ONE
The Boy Who Lived
M r and Mrs Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank
you very much. They were the last people you’d expect to be
involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.
Mr Dursley was the director of a fi rm called Grunnings,
which made drills. He was a big, beefy man with hardly
any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent
so much of her time craning over garden fences, spying on the
neighbours. The Dursleys had a small son called Dudders and
Your best bet is sed to delete leading spaces:
sed -E 's/^[[:blank:]]{2,}//' file
CHAPTER ONE
The Boy Who Lived
M r and Mrs Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank
you very much. They were the last people you’d expect to be
involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.
Mr Dursley was the director of a fi rm called Grunnings,
which made drills. He was a big, beefy man with hardly
any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent
so much of her time craning over garden fences, spying on the
neighbours. The Dursleys had a small son called Dudders and
Or with awk:
awk '{sub(/^[[:blank:]]{2,}/,"")} 1' file
# same output
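As a rough sketch of what the {2,} interval does and does not strip, the snippet below runs the sed command on a small made-up sample (not the real chapter file) whose lines have 4, 2, 1 and 0 leading spaces. The temporary file and the variable names are only for illustration, and the second command, using + instead of {2,}, is an assumed variant for the case where a single leading blank should also go.

```shell
# Hypothetical sample: lines with 4, 2, 1 and 0 leading spaces.
tmp=$(mktemp)
printf '    CHAPTER ONE\n  The Boy Who Lived\n M r and Mrs Dursley\nproud to say\n' > "$tmp"

# {2,} removes a run of two or more leading blanks; a single leading blank stays.
two_or_more=$(sed -E 's/^[[:blank:]]{2,}//' "$tmp")

# + removes any run of leading blanks, including a single one.
one_or_more=$(sed -E 's/^[[:blank:]]+//' "$tmp")

printf '%s\n---\n%s\n' "$two_or_more" "$one_or_more"
rm -f "$tmp"
```

With {2,}, the third sample line keeps its single leading space; with +, all four lines end up flush left.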
If you only want to identify those lines that have 2 or more spaces at the beginning with grep:
grep -E '^[[:blank:]]{2,}' file
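The issue you were having is that grep and sed use Basic Regular Expressions (BRE) by default. In BRE, unescaped ( ) and { } are ordinary literal characters, so a pattern such as [[:space:]]{2,} looks for a white space character followed by the literal text {2,}. You need to use the -E option to switch to Extended Regular Expressions (ERE), or escape the interval as \{2,\} in BRE.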
HERE is the difference between BRE and ERE.
awk uses ERE by default.
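To see the BRE/ERE difference concretely, here is a minimal sketch that counts matching lines three ways on a tiny made-up sample (two indented lines and one flush-left line). The temporary file and variable names are assumptions for the demo; the || true is only there so the script keeps going when grep finds nothing and exits non-zero.

```shell
# Hypothetical sample: two indented lines and one line with no leading blanks.
tmp=$(mktemp)
printf '    CHAPTER ONE\n  The Boy Who Lived\nMr Dursley\n' > "$tmp"

# BRE (grep's default): unescaped { } are literal, so nothing matches here.
bre_plain=$(grep -c '[[:blank:]]{2,}' "$tmp" || true)

# BRE with the interval escaped as \{2,\}.
bre_escaped=$(grep -c '[[:blank:]]\{2,\}' "$tmp" || true)

# ERE via -E: the unescaped interval from the question works as intended.
ere=$(grep -Ec '[[:blank:]]{2,}' "$tmp" || true)

echo "BRE plain: $bre_plain, BRE escaped: $bre_escaped, ERE: $ere"
rm -f "$tmp"
```

On this sample the plain BRE pattern should report 0 matching lines, while the escaped BRE form and the -E form should both report 2.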