sed program to replace strings starting with a specific character

I know similar questions have been asked before (for example, here), but I haven't been able to reproduce the desired results with those examples, and I don't understand why.

I want to replace all words in a file starting with the character '@' with <@MENTION>. For example, this:

I have 6 @emailaddresses and 10% of the people don't eat sandwiches! I have six @emailaddresses and 10%... @123_Username @BAPP, you shouldn't say that! I recently called@User but he didn't answer. @Username is not a nice person! This @username guy is really cool!

Should become this:

I have 6 <@MENTION> and 10% of the people don't eat sandwiches! I have six <@MENTION> and 10%... <@MENTION> <@MENTION>, you shouldn't say that! I recently called@User but he didn't answer. <@MENTION> is not a nice person! This <@MENTION> guy is really cool!

I have tried this:

sed 's/@[a-zA-Z0-9_]*/<@MENTION>/'

But the string '@BAPP' is not replaced, although I'd like it to be, and 'called@User' is replaced, which I would prefer to avoid.

I also tried this:

sed -E -e 's/\b@[a-zA-Z0-9_]*\b/<@MENTION>/'

But for some reason the word boundaries are not taken into account...

Any help understanding this would be much appreciated, as I'm (obviously) still learning and have limited experience with Bash. Thanks a lot in advance.

CodePudding user response：

One sed idea:

$ sed -E 's/(^|[^[:alnum:]])@[a-zA-Z0-9_]*(\>)/\1<@MENTION>\2/g' file

NOTE: the initial ^ was added to address the case where the desired string is at the beginning of the line.

This generates:

# assuming embedded linefeeds

I have 6 <@MENTION> and 10% of the people don't eat sandwiches! I have six
<@MENTION> and 10%... <@MENTION> <@MENTION>, you shouldn't say that! I recently
called@User but he didn't answer. <@MENTION> is not a nice person! This <@MENTION> guy
is really cool!

# assuming no embedded linefeeds

I have 6 <@MENTION> and 10% of the people don't eat sandwiches! I have six <@MENTION> and 10%... <@MENTION> <@MENTION>, you shouldn't say that! I recently called@User but he didn't answer. <@MENTION> is not a nice person! This <@MENTION> guy is really cool!

CodePudding user response：

\b is non-standard (it is not part of POSIX regular expressions). It is a zero-width assertion that a "word" character is on one side and a non-"word" character is on the other. Because @ is not a "word" character, \b doesn't help you here: there is no boundary between a space and @, but there is one between the d and the @ in called@User.
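
A quick way to see this for yourself is the following minimal sketch. It assumes GNU sed, since \b is an extension there, and the helper name try_boundary is made up for illustration:

```shell
# Hypothetical helper: the \b-based substitution from the question, with g added.
# Assumes GNU sed, where \b is available as an extension.
try_boundary() {
  sed -E 's/\b@[a-zA-Z0-9_]*\b/<@MENTION>/g'
}

# " @bob": the space and @ are both non-word characters, so \b cannot match before @.
# "called@User": d is a word character and @ is not, so \b matches right before @.
printf '%s\n' 'hi @bob, I called@User' | try_boundary
```

With GNU sed this should print hi @bob, I called<@MENTION>, which is the opposite of what the question wants.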

Unless you give the g flag to s///, it only changes one match per line.

You probably don't want to match @ not followed by "word" characters, so using * is incorrect.
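
Both points are easy to check on a tiny input. This is only a sketch; the variable names and the replacement text X are placeholders chosen for the demonstration:

```shell
# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' '@a and @b' | sed -E 's/@[a-z]+/X/')
# With g, every match on the line is replaced.
every=$(printf '%s\n' '@a and @b' | sed -E 's/@[a-z]+/X/g')
# With *, a bare @ followed by zero letters also matches.
star=$(printf '%s\n' 'me @ home' | sed -E 's/@[a-z]*/X/g')
# With +, at least one letter must follow, so the bare @ is left alone.
plus=$(printf '%s\n' 'me @ home' | sed -E 's/@[a-z]+/X/g')

printf '%s\n' "$first_only" "$every" "$star" "$plus"
```

The four lines printed are X and @b, X and X, me X home and me @ home.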

Putting that together:

sed -E 's/(^|[^a-zA-Z_<])@[a-zA-Z0-9_]+/\1<@MENTION>/g'

(^|[^a-zA-Z_<]) matches either the start of the line or one character not listed inside []. Edit the list to be whatever you want to exclude. Including < means you don't change existing <@MENTION>s.
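
As a rough check of this pattern, written with + as the quantifier, here is a small sketch. The wrapper name replace_mentions and the sample text are made up for illustration, loosely based on the question's example:

```shell
# Hypothetical wrapper around the substitution; + requires at least one
# character after the @.
replace_mentions() {
  sed -E 's/(^|[^a-zA-Z_<])@[a-zA-Z0-9_]+/\1<@MENTION>/g'
}

# Line 1 covers a mention at line start, after a space, before a comma,
# and an @ glued to a preceding word. Line 2 has an existing <@MENTION>.
replace_mentions <<'EOF'
@Username has 6 @emailaddresses. @123_Username @BAPP, I called@User today.
<@MENTION> is left alone.
EOF
```

The first line should come out as <@MENTION> has 6 <@MENTION>. <@MENTION> <@MENTION>, I called@User today. and the second line should be unchanged. Note that a digit is not in the excluded list, so something like 123@user would still be replaced; add 0-9 to the brackets if you want to exclude that too.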

CodePudding user response：

Using sed

$ sed -E 's/(\s+)@[^ ]*/\1<@MENTION>/g' input_file
I have 6 <@MENTION> and 10% of the people don't eat sandwiches! 
I have six <@MENTION> and 10%... <@MENTION> <@MENTION>
you shouldn't say that! I recently called@User but he didn't answer.
@Username is not a nice person! This <@MENTION> guy is really cool!

CodePudding user response：

This is based on your shown samples only. With GNU sed and the -E option you can try the following. -E enables ERE (extended regular expressions); the command substitutes @ followed by one or more non-space characters with <@MENTION>, and the g flag applies the substitution to every match on the line.

sed -E 's/@\S+/<@MENTION>/g' Input_file

Or, to be more specific, try the following, which is a small tweak to the command above:

sed -E 's/(\s+)@\S+/\1<@MENTION>/g' Input_file
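
To see how the two patterns differ, here is a small sketch with each pattern written using the + quantifier. It assumes GNU sed, since \s and \S are extensions there, and the variable names and sample line are made up for illustration:

```shell
sample='hi @bob, I called@User'

# Any @ followed by non-space characters is replaced, including the one
# inside called@User.
loose=$(printf '%s\n' "$sample" | sed -E 's/@\S+/<@MENTION>/g')

# Only an @ that directly follows whitespace is replaced.
strict=$(printf '%s\n' "$sample" | sed -E 's/(\s+)@\S+/\1<@MENTION>/g')

printf '%s\n' "$loose" "$strict"
```

This should print hi <@MENTION> I called<@MENTION> and then hi <@MENTION> I called@User. In both cases \S+ also swallows the comma after @bob, and the second pattern cannot match a mention at the very start of a line because it requires whitespace before the @.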

CodePudding user response：

This sed worked for me:

sed 's/@\w*/<@MENTION>/g' file