I am trying to create a function that is flexible enough to output a string based on search words found in an existing DataFrame column. I am getting output, but every result after the first is chained to the previous ones (earlier outputs repeat, followed by the new output). How can I correct this? I plan to expand the function with more for loops. Maybe there is a more efficient way to do this.
# Declarations
import pandas as pd

search_words = ['one', 'two', 'three']
l1 = []

# Function
def concat(text):
    for i in search_words[0:1]:
        if i in text:
            a = 'four'
            l1.append(a)
    for i in search_words[1:3]:
        if i in text:
            b = 'five'
            l1.append(b)
    listToStr = ' '.join(map(str, l1))
    return listToStr

# Test DataFrame
dftest = pd.DataFrame(data=['one filler two', 'two', 'filler', 'three one'],
                      columns=['col1'])

# Test output
dftest['col2'] = dftest['col1'].apply(lambda x: concat(x))
dftest
Wrong output given:
             col1                      col2
0  one filler two                 four five
1             two            four five five
2          filler            four five five
3       three one  four five five four five
Desired output:
             col1       col2
0  one filler two  four five
1             two       five
2          filler
3       three one  five four
CodePudding user response:
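You have to define a new l1 each time you call concat. In your version l1 is created once, outside the function, so every call appends to the same list and the earlier results carry over: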
def concat(text):
    l1 = []
    for i in search_words[0:1]:
        if i in text:
            l1.append('four')
    for i in search_words[1:3]:
        if i in text:
            l1.append('five')
    listToStr = ' '.join(l1)
    return listToStr
Also, when you apply concat, you don't need the lambda:
dftest['col2'] = dftest['col1'].apply(concat)
Output:
             col1       col2
0  one filler two  four five
1             two       five
2          filler
3       three one  four five
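Since the question mentions adding more for loops later, the fix above could also be written with the word groups stored as data, so that a new group is one more entry rather than one more loop. This is only a sketch: RULES and concat_rules are hypothetical names, not part of the original code, and the matching is the same substring test the question uses.

```python
# Hypothetical rule table: each entry pairs a group of search words with
# the label to emit for every word of that group found in the text.
RULES = [
    (['one'], 'four'),
    (['two', 'three'], 'five'),
]

def concat_rules(text, rules=RULES):
    labels = []  # fresh list on every call, so nothing carries over between rows
    for words, label in rules:
        for word in words:
            if word in text:  # substring test, as in the original function
                labels.append(label)
    return ' '.join(labels)

results = [concat_rules(t) for t in ['one filler two', 'two', 'filler', 'three one']]
print(results)  # ['four five', 'five', '', 'four five']
```

It can be applied the same way as before, with dftest['col1'].apply(concat_rules).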
CodePudding user response:
One simpler way might be:
dict_assignment = {
    "one": "four",
    "two": "five",
    "three": "five",
}
dftest["col2"] = dftest.col1.apply(
    lambda p: ' '.join(dict_assignment[w] for w in p.split() if w in search_words)
)
print(dftest)
#              col1       col2
# 0  one filler two  four five
# 1             two       five
# 2          filler
# 3       three one  five four
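Note that this version matches whole words (it splits the text on whitespace), while the function in the question tests for substrings. The two can differ in both the matches they find and the order of the output. A small sketch of the difference, where by_word and by_substring are hypothetical helper names used only for illustration:

```python
dict_assignment = {"one": "four", "two": "five", "three": "five"}

def by_word(text):
    # Whole-word lookup: split on whitespace, keep the order of the words in the text.
    return ' '.join(dict_assignment[w] for w in text.split() if w in dict_assignment)

def by_substring(text):
    # Substring lookup, as in the question: output follows the order of the dictionary keys.
    return ' '.join(v for k, v in dict_assignment.items() if k in text)

print(by_word('three one'))       # five four
print(by_substring('three one'))  # four five
print(repr(by_word('someone')))   # ''
print(by_substring('someone'))    # four
```

Which behaviour is wanted depends on the data; the desired output in the question (five four for 'three one') corresponds to the whole-word, text-order variant.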
CodePudding user response:
Using apply to call a function that contains a loop can be very inefficient on large DataFrames. Use vectorized code instead, with a dictionary that maps each search word to its output, then join the result back to the original DataFrame.
d = {'one': 'four', 'two': 'five', 'three': 'five'}
df2 = dftest.join(
    dftest['col1']
    .str.extractall(f"({'|'.join(d)})")[0]
    .map(d)
    .groupby(level=0).agg(' '.join)
    .rename('col2')
)
NB. I used a simple regex here. If your search words contain special characters, you might need to escape them using re.escape. Please update the example if this is the case.
Output:
             col1       col2
0  one filler two  four five
1             two       five
2          filler        NaN
3       three one  five four
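The escaping mentioned in the note can be sketched as follows, together with one possible way to turn the NaN for rows without a match into an empty string, as in the desired output. The sample data is the one from the question; the fillna step is an addition of this sketch and not part of the answer above.

```python
import re
import pandas as pd

dftest = pd.DataFrame({'col1': ['one filler two', 'two', 'filler', 'three one']})
d = {'one': 'four', 'two': 'five', 'three': 'five'}

# Escape each search word so that characters such as '.' or '+' are matched literally.
pattern = '(' + '|'.join(re.escape(word) for word in d) + ')'

col2 = (
    dftest['col1']
    .str.extractall(pattern)[0]      # one row per match, indexed by (row, match number)
    .map(d)                          # search word -> output word
    .groupby(level=0).agg(' '.join)  # back to one string per original row
    .rename('col2')
)

# Rows without any match are absent from col2, so the join leaves NaN there;
# fillna replaces it with an empty string.
df2 = dftest.join(col2).fillna({'col2': ''})
print(df2)
```

With the plain words used here, re.escape leaves the pattern unchanged, so the result only differs from the output above in the empty string for the 'filler' row.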