How to extract only the English words and leave out the Devanagari words in a bash script?

Time:04-23

The text file looks like this:

#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर@_

The desired output should look like this:

#
1
8IU
underscore
$that
%redyellow
$
@_

This is what I have tried so far, using awk:

awk -F"[अ-ह]*" '{print $1}' filename.txt

The output that I am getting is:

#
1


$that
%red
$

Using this instead:

awk -F"[अ-ह]*" '{print $1,$2}' filename.txt

I get an output like this:

# 
1 े
 ं
 ो
$that 
%red yellow
$ ि
 ं

Is there any way to solve this in a bash script?

CodePudding user response:

Using perl:

$ perl -CSD -lpe 's/\p{Devanagari}+//g' input.txt
#
1
8IU
underscore
$that
%redyellow
$
@_

-CSD tells perl that standard streams and any opened files are encoded in UTF-8. -p loops over input files printing each line to standard output after executing the script given by -e. If you want to modify the file in place, add the -i option.

The regular expression matches any codepoints assigned to the Devanagari script in the Unicode standard and removes them. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.

CodePudding user response:

Using awk you can do:

awk '{gsub(/[^\x00-\x7F]+/, "")} 1' file
#
1
8IU
underscore
$that
%redyellow

The bracket expression [\x00-\x7F] matches all values numerically between zero and 127, which is the defined range of the ASCII character set. The complemented character list [^\x00-\x7F] therefore matches any character outside the ASCII range, which covers every Devanagari letter and sign.
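A portable variation on the same idea, offered only as a sketch: hex escapes such as \x7F are not specified by POSIX awk, so this version swaps in the range from space to tilde (the printable ASCII characters) plus tab, and forces the C locale so that awk works byte by byte. The function name strip_non_ascii is hypothetical.

```shell
# Hypothetical helper; the name is illustrative only.
# Assumption: under LC_ALL=C the range " -~" covers exactly the printable
# ASCII bytes, so the complement matches every byte of a UTF-8 sequence.
strip_non_ascii() {
  # gsub (not sub) so that every non-ASCII run on a line is removed.
  LC_ALL=C awk '{ gsub(/[^\t -~]+/, "") } 1' "$@"
}

printf '%s\n' '%redएकyellow' 'कa खb' | strip_non_ascii
```

Unlike the hex-range form, this complement also removes ASCII control characters other than tab, which may or may not be wanted.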

CodePudding user response:

Does this sed work?

sed 's/\([0-9a-zA-Z[:punct:]]*\)[^0-9a-zA-Z[:punct:]]*/\1/g' input_file
#
1
8IU
underscore
$that
%redyellow
$
@_

CodePudding user response:

tr is a very good fit for this task:

LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt

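LC_ALL=C selects the POSIX C locale, in which only the ASCII (US English) characters belong to the named character classes.

The -c and -d options then tell tr to delete the complement of [:cntrl:][:graph:], that is, every character that is neither a control character nor a visible (graphic) one. Because all locale settings are C, every non-ASCII byte falls into that complement and is discarded. Note that [:graph:] does not include the space character, so spaces are deleted as well; use [:print:] in place of [:graph:] to keep them.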
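A small sketch of the same tr approach wrapped in a function, using [:print:] rather than [:graph:] so that spaces survive. The function name keep_ascii is hypothetical and the input lines are only illustrative.

```shell
# Hypothetical helper; the name is illustrative only.
# -c complements the set, -d deletes: everything that is neither a control
# character (newline, tab, ...) nor a printable ASCII character is dropped.
keep_ascii() {
  LC_ALL=C tr -c -d '[:cntrl:][:print:]'
}

printf '%s\n' 'अंकगणित8IU' 'red एक yellow' | keep_ascii
```

Because newline is a control character, line structure is preserved; a line that contained only Devanagari text comes out empty rather than disappearing.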
