How to extract strings from a list in a column in a python pandas dataframe?-CodePudding

Let's say I have a list

lst = ["fi", "ap", "ko", "co", "ex"]

and we have this series

       Explanation 

a      "fi doesn't work correctly" 
b      "apples are cool" 
c      "this works but translation is ko"

and I'm looking to get something like this:

        Explanation                         Explanation Extracted

a      "fi doesn't work correctly"          "fi"
b      "apples are cool"                    "N/A"
c      "this works but translation is ko"   "ko"

CodePudding user response：

I think this solves your problem.

import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])

extracted =[] 
for index, row in df.iterrows():
    tempList =[] 
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList)>0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')

df['Explanation Extracted'] = extracted

CodePudding user response：

apply function of Pandas might be helpful

def extract_explanation(dataframe):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    substrings = dataframe['explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation

df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)

The catch here is assumption of only one explanation, but it can be converted into a list, if multiple explanations are expected.

CodePudding user response：

With the series

explanation = pd.Series(
    ["fi doesn't work correctly",
     "apples are cool",
     "this works but translation is ko"],
    index=["a", "b", "c"]
)

you can use .str.extract() to do

lst = ["fi", "ap", "ko", "co", "ex"]
expl_extracted = explanation.str.extract(
    r"(?:^|\s )("   "|".join(lst)   r")(?:\s |$)", expand=False
)

to get the series

a     fi
b    NaN
c     ko
dtype: object

The regex pattern r"(?:^|\s )(" "|".join(lst) r")(?:\s |$)" looks for an occurrence of one of the lst items either at the beginning with withespace afterwards, in the middle with whitespace before and after, or at the end with withespace before. str.extract() extracts the capture group (the part in the middle in ()). Without a match the return is NaN.

To combine the both:

df = pd.DataFrame(
    {"Explanation": explanation, "Explanation Extracted": expl_extracted}
)

                        Explanation Explanation Extracted
a         fi doesn't work correctly                    fi
b                   apples are cool                   NaN
c  this works but translation is ko                    ko