This question is similar to this one although I am trying to find a faster tool than pandas. Here is my starting table:
col1 col2
0 ABCDE [AB]B[CD]
1 BCDEA [BC][CD]E
2 CDEAB [CD][CD]EA
3 DEABC [DE]A
4 EABCD [DE]A
In case you want to reproduce it in python pandas, you can use this:
df = pd.DataFrame([['ABCDE', '[AB]B[CD]'],
['BCDEA', '[BC][CD]E'],
['CDEAB', '[CD][CD]EA'],
['DEABC', '[DE]A'],
['EABCD', '[DE]A']],
columns=['col1', 'col2'])
My goal is to subtract col2 from col1 considering:
a. col2 uses regex
b. In order to make any change, col1 should start with col2
Here is the expected output:
col1 col2 col3
0 ABCDE [AB]B[CD] DE
1 BCDEA [BC][CD]E BCDEA
2 CDEAB [CD][CD]EA B
3 DEABC [DE]A DEABC
4 EABCD [DE]A BCD
I have found a solution with pandas, but a 2,000,000 line file makes the script quite slow. Here is the command:
df['col3'] = df.col1.replace(df.col2[df.col2.notnull()], '', regex = True).str.lstrip()
One solution is to split the large file into many and use different cpu to do the job, but I was wondering whether awk, sed or perl could do this more efficiently. Any idea?
CodePudding user response:
Using a perl
one-liner (And assuming whitespace-separated columns):
$ perl -lane 'print join("\t", @F, $F[0] =~ s/^$F[1]//r)' input.txt
ABCDE [AB]B[CD] DE
BCDEA [BC][CD]E BCDEA
CDEAB [CD][CD]EA B
DEABC [DE]A DEABC
EABCD [DE]A BCD
Or awk
:
$ awk '{ $3 = $1; sub("^" $2, "", $3); print }' OFS="\t" input.txt
ABCDE [AB]B[CD] DE
BCDEA [BC][CD]E BCDEA
CDEAB [CD][CD]EA B
DEABC [DE]A DEABC
EABCD [DE]A BCD
CodePudding user response:
IIUC, you need to loop here.
Your anchoring condition is unclear ("In order to make any change, col1 should start with col2", but this doesn't seem to match the output) so I'll provide both options:
anchoring to the start:
import re
df['col3'] = [re.sub(f'^{b}', '', a) for a,b in zip(df['col1'], df['col2'])]
output:
col1 col2 col3
0 ABCDE [AB]B[CD] DE
1 BCDEA [BC][CD]E BCDEA
2 CDEAB [CD][CD]EA B
3 DEABC [DE]A DEABC
4 EABCD [DE]A BCD
no anchoring:
df['col3'] = [re.sub(f'{b}', '', a) for a,b in zip(df['col1'], df['col2'])]
output:
col1 col2 col3
0 ABCDE [AB]B[CD] DE
1 BCDEA [BC][CD]E BA
2 CDEAB [CD][CD]EA B
3 DEABC [DE]A DBC
4 EABCD [DE]A BCD