Fast way to remove substring in a column using another column-CodePudding

This question is similar to this one although I am trying to find a faster tool than pandas. Here is my starting table:

    col1        col2
0  ABCDE   [AB]B[CD]
1  BCDEA   [BC][CD]E
2  CDEAB  [CD][CD]EA
3  DEABC       [DE]A
4  EABCD       [DE]A

In case you want to reproduce it in python pandas, you can use this:

df = pd.DataFrame([['ABCDE', '[AB]B[CD]'],
                   ['BCDEA', '[BC][CD]E'],
                   ['CDEAB', '[CD][CD]EA'],
                   ['DEABC', '[DE]A'],
                   ['EABCD', '[DE]A']],
                  columns=['col1', 'col2'])

My goal is to subtract col2 from col1 considering:

a. col2 uses regex

b. In order to make any change, col1 should start with col2

Here is the expected output:

    col1        col2   col3
0  ABCDE   [AB]B[CD]     DE
1  BCDEA   [BC][CD]E  BCDEA
2  CDEAB  [CD][CD]EA      B
3  DEABC       [DE]A  DEABC
4  EABCD       [DE]A    BCD

I have found a solution with pandas, but a 2,000,000 line file makes the script quite slow. Here is the command:

df['col3'] = df.col1.replace(df.col2[df.col2.notnull()], '', regex = True).str.lstrip()

One solution is to split the large file into many and use different cpu to do the job, but I was wondering whether awk, sed or perl could do this more efficiently. Any idea?

CodePudding user response：

Using a perl one-liner (And assuming whitespace-separated columns):

$ perl -lane 'print join("\t", @F, $F[0] =~ s/^$F[1]//r)' input.txt
ABCDE   [AB]B[CD]       DE
BCDEA   [BC][CD]E       BCDEA
CDEAB   [CD][CD]EA      B
DEABC   [DE]A   DEABC
EABCD   [DE]A   BCD

Or awk:

$ awk '{ $3 = $1; sub("^" $2, "", $3); print }' OFS="\t" input.txt
ABCDE   [AB]B[CD]       DE
BCDEA   [BC][CD]E       BCDEA
CDEAB   [CD][CD]EA      B
DEABC   [DE]A   DEABC
EABCD   [DE]A   BCD

CodePudding user response：

IIUC, you need to loop here.

Your anchoring condition is unclear ("In order to make any change, col1 should start with col2", but this doesn't seem to match the output) so I'll provide both options:

anchoring to the start:

import re

df['col3'] = [re.sub(f'^{b}', '', a) for a,b in zip(df['col1'], df['col2'])]

output:

    col1        col2   col3
0  ABCDE   [AB]B[CD]     DE
1  BCDEA   [BC][CD]E  BCDEA
2  CDEAB  [CD][CD]EA      B
3  DEABC       [DE]A  DEABC
4  EABCD       [DE]A    BCD

no anchoring:

df['col3'] = [re.sub(f'{b}', '', a) for a,b in zip(df['col1'], df['col2'])]

output:

    col1        col2 col3
0  ABCDE   [AB]B[CD]   DE
1  BCDEA   [BC][CD]E   BA
2  CDEAB  [CD][CD]EA    B
3  DEABC       [DE]A  DBC
4  EABCD       [DE]A  BCD