Home > front end >  Check if a string value of a column in a Pandas DataFrame starts with the value of another column
Check if a string value of a column in a Pandas DataFrame starts with the value of another column

Time:03-03

I am trying to concatenate two string columns col1 and col2 in a Pandas DataFrame. But I do not want to concatenate them if col2’s value already begins with the value of col1. In this case I want to use col2 without concatenation. This would be the expected behavior:

col1 col2 result
ABC ABC ABC
ABC XYC ABCXYZ
ABC abc123 abc123

I tried this code:

import pandas as pd

df = pd.DataFrame({
    'col1': ['ABC', 'ABC', 'AbC'],
    'col2': ['ABC', 'XYZ', 'abc123'],
})


df['result'] = df['col2'].where(df['col2'].str.lower().str.startswith(df['col1'].str.lower()), df['col1']   df['col2'])

df

But that results in:

col1 col2 result
ABC ABC ABCABC
ABC XYC ABCXYZ
ABC abc123 AbCabc123

For testing purposes I used a string literal as parameter for startswith and received the expected results:

df['result'] = df['col2'].where(df['col2'].str.lower().str.startswith('abc'), df['col1']   df['col2'])

I found out that the result of the startswith function always returns NaN:

df['result'] = df['col2'].str.lower().str.startswith(df['col1'].str.lower())
col1 col2 result
ABC ABC NaN
ABC XYC NaN
ABC abc123 NaN

If I replace the startswith parameter with a string literal I receive booleans as expected:

df['result'] = df['col2'].str.lower().str.startswith('abc')
col1 col2 result
ABC ABC True
ABC XYC False
ABC abc123 True

I understand that it seems to be a problem to use a series as a parameter in the startswith function. But I do not get it to work.

I am very new to Python and Pandas, I intensively used search engines and stackoverflow's search function before I created my first post. What do I have to change in my code to accomplish the desired behavior? Any help is highly appreciated. Thanks!

CodePudding user response:

Inspired by this answer: https://stackoverflow.com/a/64332351/18090994

Write your own startswith function and vectorize it with numpy.vectorize. In this way, you can compare the strings in col1 and col2 row by row.

from numpy import vectorize

def startswith(str1, str2):
  """Check if str1 starts with str2 (case insensitive)"""
  return str1.lower().startswith(str2.lower())

startswith = vectorize(startswith)
df['result'] = df['col2'].where(startswith(df['col2'], df['col1']), df['col1']   df['col2'])
  • Related