Replace only alphanumeric chars from strings in one file in another

Time:03-28

I have file1 with records that I want to find in file2 and mask there, writing the result to file3. Only the alphanumeric characters of each matched record should be replaced with #; everything else must stay as it is. With the code below I don't get the expected output. What am I doing wrong?

file_read=`cat file2`
while read line; do
  var=`echo $line | tr '[a-zA-Z0-9]' '#'`
  rep=`echo $file_read | awk "{gsub(/$line/,\"$var\"); print}"`
done < file1
echo file2 > file3

cat file1

2001009
@vanti Finserv Co.
2001009
Fund #1
11:11 - Capital
MS&CO(NY)
American Friends Org, Inc. 12X32
Domain-Name (LLC)
MS&CO(NY)
MS&CO(NY)
Ivy/Estate Rd
E*Trade wholesale

cat file2

<html>
<body>
<hr><br><>span >Records</span><table>
<tr >
 <td>Rec1</td>
 <td>Rec2</td>
 <td>Rec3</td>
 <td>Rec4</td>
 <td>Rec5</td>
 <td>Rec6</td>
 <td>Rec7</td>
 <td>Rec8</td>
</tr>
<tr >
<td>@vanti Finserv Co.</td>
<td>11:11 - Capital</td>
<td>MS&CO(NY)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>E*Trade wholesale</td>
<td>Domain-Name (LLC)</td>
<td>Ivy/Estate Rd</td>
<td></td>
</tr>
<tr >
<td>@vanti Finserv Co.</td>
<td></td>
<td>MS&CO(NY)</td>
<td>2</td>
<td>2</td>
<td>MS&CO(NY)</td>
<td>MS&CO(NY)</td>
<td>Ivy/Estate Rd</td>
</table>
</body>
</html>

expected output cat file3

<html>
<body>
<hr><br><>span >Records</span><table>
<tr >
 <td>Rec1</td>
 <td>Rec2</td>
 <td>Rec3</td>
 <td>Rec4</td>
 <td>Rec5</td>
 <td>Rec6</td>
 <td>Rec7</td>
 <td>Rec8</td>
</tr>
<tr >
<td>@##### ####### ##.</td>
<td>##:## - #######</td>
<td>##&##(##)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>#*##### ########</td>
<td>######-#### (###)</td>
<td>###/###### ##</td>
<td></td>
</tr>
<tr >
<td>@##### ####### ##.</td>
<td></td>
<td>##&##(##)</td>
<td>2</td>
<td>2</td>
<td>##&##(##)</td>
<td>##&##(##)</td>
<td>###/###### ##</td>
</table>
</body>
</html>

CodePudding user response:

You seem to be looking for something like

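awk 'NR==FNR {
  # First file: build a regex that matches this record literally
  regex = $0;
  gsub(/[][(){}|\\*+?.^$]/, "\\\\&", regex);
  a[++n] = regex;

  # Build the replacement: mask alphanumerics, keep & literal
  gsub(/[A-Za-z0-9]/, "#");
  gsub(/&/, "\\\\&");
  b[n] = $0;

  next
}
# Second file: apply every replacement to the current line
{ for(i=1;i<=n;++i)
    gsub(a[i], b[i])
} 1' file1 file2 >file3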

In brief, we populate the array a with the phrases from file1 (escaped so that they match literally) and b with the corresponding replacement strings. The condition NR==FNR is true only while the first input file is being read, and next then skips the rest of the script for those lines. For file2 we fall through to the rest of the script, which replaces any match for an entry in a with the corresponding string from b and prints every line.
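The two-file idiom can be tried in isolation. This is only a minimal sketch, assuming a shell where mktemp -d is available; the helper name two_file_demo and the temporary file names are made up for illustration.

```shell
# two_file_demo is a hypothetical helper, not part of the answer above.
two_file_demo() {
  dir=$(mktemp -d) || return 1
  printf '%s\n' 'a' 'b' > "$dir/first"
  printf '%s\n' 'x' 'y' > "$dir/second"
  # NR counts lines across all files, FNR restarts at 1 for each file,
  # so NR==FNR holds only while the first file is being read.
  awk 'NR==FNR { seen[++n] = $0; next }
       { print $0 " (first file had " n " lines)" }' "$dir/first" "$dir/second"
  rm -rf "$dir"
}
```

One caveat: if the first file is empty, NR==FNR is also true for the second file, so the idiom assumes file1 has at least one line.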

The code is complicated somewhat by the escaping of regex metacharacters in a and further by the fact that & in the replacement string needs to be escaped, too (& alone recalls the matched text).
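The effect of & in a replacement string is easy to see on a single record. A small sketch, with hypothetical function names that are not part of the answer:

```shell
# Hypothetical helpers for illustration only.

# Unescaped: "&" in the replacement recalls the matched text.
unescaped() {
  printf '%s\n' 'MS&CO' | awk '{ gsub(/MS&CO/, "##&##"); print }'
}

# Escaped: "\\&" in the awk source reaches gsub as \& and yields a literal "&".
escaped() {
  printf '%s\n' 'MS&CO' | awk '{ gsub(/MS&CO/, "##\\&##"); print }'
}
```

Calling unescaped prints ##MS&CO##, while escaped prints ##&##. The answer needs four backslashes because it uses gsub to insert the backslash into a string that is itself used as a replacement later.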

Demo: https://ideone.com/YkAkAZ

You generally want to avoid while read loops in the shell; Awk is much faster and more idiomatic when you want to perform some transformation on all lines in a file.

As a further aside, please try http://shellcheck.net/ before asking for human assistance. Even after you fixed syntax errors pointed out in comments, your attempt contains common beginner errors such as broken quoting.
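To illustrate the quoting problem: an unquoted expansion such as echo $line is word-split and glob-expanded by the shell before echo sees it, so runs of spaces collapse and a record like E*Trade wholesale could even be expanded against file names. A small sketch, assuming a POSIX sh or bash with the default IFS; the function names are hypothetical.

```shell
# Hypothetical helpers for illustration only.
unquoted() {
  line='11:11   -   Capital'
  echo $line             # word splitting collapses the runs of spaces
}

quoted() {
  line='11:11   -   Capital'
  printf '%s\n' "$line"  # the quoted expansion is passed through unchanged
}
```

Here unquoted prints the record with single spaces, whereas quoted preserves it exactly.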

CodePudding user response:

Would you please try the following:

awk '
    NR==FNR {s = $0; gsub("[[:alnum:]]", "#"); a[s] = $0; next}
    {
        if (match($0, ">[^<]+")) {
            str = substr($0, RSTART + 1, RLENGTH - 1)
            if (str in a) {
                $0 = substr($0, 1, RSTART) a[str] substr($0, RSTART + RLENGTH)
            }
        }
    }
1 ' file1 file2 > file3

It assumes the strings to be replaced are enclosed in tags, which holds for the shown example.
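The extraction step relies on match() setting RSTART and RLENGTH. A minimal sketch of just that step, where cell_text is a hypothetical helper name and not part of the answer:

```shell
# cell_text is a hypothetical helper: print the text between the first ">"
# that is followed by non-"<" characters and the next "<".
cell_text() {
  printf '%s\n' "$1" | awk '
    match($0, />[^<]+/) {
      # RSTART points at ">", so skip it and shorten the length by one.
      print substr($0, RSTART + 1, RLENGTH - 1)
    }'
}
```

For example, cell_text '<td>Ivy/Estate Rd</td>' prints Ivy/Estate Rd, and an empty cell such as '<td></td>' prints nothing because the pattern requires at least one character after the ">".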
