Check if a string value of a column in a Pandas DataFrame starts with the value of another column-CodePudding

I am trying to concatenate two string columns col1 and col2 in a Pandas DataFrame. But I do not want to concatenate them if col2’s value already begins with the value of col1. In this case I want to use col2 without concatenation. This would be the expected behavior:

col1	col2	result
ABC	ABC	ABC
ABC	XYC	ABCXYZ
ABC	abc123	abc123

I tried this code:

import pandas as pd

df = pd.DataFrame({
    'col1': ['ABC', 'ABC', 'AbC'],
    'col2': ['ABC', 'XYZ', 'abc123'],
})


df['result'] = df['col2'].where(df['col2'].str.lower().str.startswith(df['col1'].str.lower()), df['col1']   df['col2'])

df

But that results in:

col1	col2	result
ABC	ABC	ABCABC
ABC	XYC	ABCXYZ
ABC	abc123	AbCabc123

For testing purposes I used a string literal as parameter for startswith and received the expected results:

df['result'] = df['col2'].where(df['col2'].str.lower().str.startswith('abc'), df['col1']   df['col2'])

I found out that the result of the startswith function always returns NaN:

df['result'] = df['col2'].str.lower().str.startswith(df['col1'].str.lower())

col1	col2	result
ABC	ABC	NaN
ABC	XYC	NaN
ABC	abc123	NaN

If I replace the startswith parameter with a string literal I receive booleans as expected:

df['result'] = df['col2'].str.lower().str.startswith('abc')

col1	col2	result
ABC	ABC	True
ABC	XYC	False
ABC	abc123	True

I understand that it seems to be a problem to use a series as a parameter in the startswith function. But I do not get it to work.

I am very new to Python and Pandas, I intensively used search engines and stackoverflow's search function before I created my first post. What do I have to change in my code to accomplish the desired behavior? Any help is highly appreciated. Thanks!

CodePudding user response：

Inspired by this answer: https://stackoverflow.com/a/64332351/18090994

Write your own startswith function and vectorize it with numpy.vectorize. In this way, you can compare the strings in col1 and col2 row by row.

from numpy import vectorize

def startswith(str1, str2):
  """Check if str1 starts with str2 (case insensitive)"""
  return str1.lower().startswith(str2.lower())

startswith = vectorize(startswith)
df['result'] = df['col2'].where(startswith(df['col2'], df['col1']), df['col1']   df['col2'])