I am trying to concatenate two string columns col1
and col2
in a Pandas DataFrame. But I do not want to concatenate them if col2
’s value already begins with the value of col1
. In this case I want to use col2
without concatenation. This would be the expected behavior:
col1 | col2 | result |
---|---|---|
ABC | ABC | ABC |
ABC | XYC | ABCXYZ |
ABC | abc123 | abc123 |
I tried this code:
import pandas as pd
df = pd.DataFrame({
'col1': ['ABC', 'ABC', 'AbC'],
'col2': ['ABC', 'XYZ', 'abc123'],
})
df['result'] = df['col2'].where(df['col2'].str.lower().str.startswith(df['col1'].str.lower()), df['col1'] df['col2'])
df
But that results in:
col1 | col2 | result |
---|---|---|
ABC | ABC | ABCABC |
ABC | XYC | ABCXYZ |
ABC | abc123 | AbCabc123 |
For testing purposes I used a string literal as parameter for startswith and received the expected results:
df['result'] = df['col2'].where(df['col2'].str.lower().str.startswith('abc'), df['col1'] df['col2'])
I found out that the result of the startswith
function always returns NaN
:
df['result'] = df['col2'].str.lower().str.startswith(df['col1'].str.lower())
col1 | col2 | result |
---|---|---|
ABC | ABC | NaN |
ABC | XYC | NaN |
ABC | abc123 | NaN |
If I replace the startswith
parameter with a string literal I receive booleans as expected:
df['result'] = df['col2'].str.lower().str.startswith('abc')
col1 | col2 | result |
---|---|---|
ABC | ABC | True |
ABC | XYC | False |
ABC | abc123 | True |
I understand that it seems to be a problem to use a series as a parameter in the startswith
function. But I do not get it to work.
I am very new to Python and Pandas, I intensively used search engines and stackoverflow's search function before I created my first post. What do I have to change in my code to accomplish the desired behavior? Any help is highly appreciated. Thanks!
CodePudding user response:
Inspired by this answer: https://stackoverflow.com/a/64332351/18090994
Write your own startswith
function and vectorize it with numpy.vectorize. In this way, you can compare the strings in col1
and col2
row by row.
from numpy import vectorize
def startswith(str1, str2):
"""Check if str1 starts with str2 (case insensitive)"""
return str1.lower().startswith(str2.lower())
startswith = vectorize(startswith)
df['result'] = df['col2'].where(startswith(df['col2'], df['col1']), df['col1'] df['col2'])