Let's say I have a list
lst = ["fi", "ap", "ko", "co", "ex"]
and we have this series
Explanation
a "fi doesn't work correctly"
b "apples are cool"
c "this works but translation is ko"
and I'm looking to get something like this:
Explanation Explanation Extracted
a "fi doesn't work correctly" "fi"
b "apples are cool" "N/A"
c "this works but translation is ko" "ko"
CodePudding user response:
I think this solves your problem.
import pandas as pd
lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])
extracted =[]
for index, row in df.iterrows():
tempList =[]
rowSplit = row['Explanation'].split(" ")
for val in rowSplit:
if val in lst:
tempList.append(val)
if len(tempList)>0:
extracted.append(','.join(tempList))
else:
extracted.append('N/A')
df['Explanation Extracted'] = extracted
CodePudding user response:
apply
function of Pandas might be helpful
def extract_explanation(dataframe):
custom_substring = ["fi", "ap", "ko", "co", "ex"]
substrings = dataframe['explanation'].split(" ")
explanation = "N/A"
for string in substrings:
if string in custom_substring:
explanation = string
return explanation
df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch here is assumption of only one explanation, but it can be converted into a list, if multiple explanations are expected.
CodePudding user response:
With the series
explanation = pd.Series(
["fi doesn't work correctly",
"apples are cool",
"this works but translation is ko"],
index=["a", "b", "c"]
)
you can use .str.extract()
to do
lst = ["fi", "ap", "ko", "co", "ex"]
expl_extracted = explanation.str.extract(
r"(?:^|\s )(" "|".join(lst) r")(?:\s |$)", expand=False
)
to get the series
a fi
b NaN
c ko
dtype: object
The regex pattern r"(?:^|\s )(" "|".join(lst) r")(?:\s |$)"
looks for an occurrence of one of the lst
items either at the beginning with withespace afterwards, in the middle with whitespace before and after, or at the end with withespace before. str.extract()
extracts the capture group (the part in the middle in ()
). Without a match the return is NaN
.
To combine the both:
df = pd.DataFrame(
{"Explanation": explanation, "Explanation Extracted": expl_extracted}
)
Explanation Explanation Extracted
a fi doesn't work correctly fi
b apples are cool NaN
c this works but translation is ko ko