I have a list of names and a DataFrame with a column of free-form text. I want to scan that column and, whenever a row's text contains a string from the list, store that string in an additional column of the DataFrame.
So far I have only found ways to get a binary or True/False flag in the additional column.
import pandas as pd
sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A']
data = {'text': ['need help with AAAA system requesting help', 'AD-12 crashed, need support', 'fuel system down', '/BBBB needs refresh']}
df = pd.DataFrame(data)
with the end result being
                                         text System
0  need help with AAAA system requesting help   AAAA
1                 AD-12 crashed, need support  AD-12
2                            fuel system down      0
3                         /BBBB needs refresh   BBBB
I have tried
# which gives True or False values
pattern = '|'.join(sys_list)
df['System'] = df['text'].str.contains(pattern)
# which gives 0 or 1
df['System'] = [int(any(w in sys_list for w in x.split())) for x in df['text']]
CodePudding user response:
import pandas as pd
sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A']
data = {'text': ['need help with AAAA system requesting help', 'AD-12 crashed, need support', 'fuel system down', '/BBBB needs refresh']}
df = pd.DataFrame(data)
def f(s):
    # Return the first name from sys_list found in s, or 0 if none matches.
    for symbol in sys_list:
        if symbol in s:
            return symbol
    return 0
df['System'] = df.text.apply(f)
print(df)
prints
| index | text | System |
|---|---|---|
| 0 | need help with AAAA system requesting help | AAAA |
| 1 | AD-12 crashed, need support | AD-12 |
| 2 | fuel system down | 0 |
| 3 | /BBBB needs refresh | BBBB |
Remark: this returns only the first symbol in sys_list that occurs in a string, i.e. it assumes that at most one symbol occurs in each text.
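If several symbols can occur in the same text, one possible variation is to collect all of them instead of stopping at the first. The following is only a sketch under that assumption: the helper name `all_matches`, the column name `Systems`, and the sample rows are made up for illustration and are not part of the question.

```python
import pandas as pd

sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A']

# Hypothetical sample data, chosen so that one row contains two symbols.
df = pd.DataFrame({'text': ['AAAA and BBBB both down', 'fuel system down']})

def all_matches(s):
    # Collect every symbol that occurs as a substring of s, in sys_list order.
    return [symbol for symbol in sys_list if symbol in s]

# Each cell holds a list; an empty list means no symbol was found.
df['Systems'] = df['text'].apply(all_matches)
print(df)
```

Rows without any match get an empty list here rather than 0, which keeps the column type consistent.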
CodePudding user response:
Slightly modifying your second example using := (the assignment expression, available since Python 3.8):
df["System"] = [
    word
    if any((word := ww) in w for w in x.split() for ww in sys_list)
    else "N/A"
    for x in df["text"]
]
print(df)
Prints:
                                         text System
0  need help with AAAA system requesting help   AAAA
1                 AD-12 crashed, need support  AD-12
2                            fuel system down    N/A
3                         /BBBB needs refresh   BBBB
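A vectorized alternative, offered only as a sketch and not taken from the answers above, is to build a regular expression from sys_list and use `Series.str.extract`. It assumes the names should be matched literally, so each one is passed through `re.escape`. Note that it returns the leftmost match in the text rather than the first entry of sys_list, which only matters when a row contains more than one name.

```python
import re

import pandas as pd

sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A']
data = {'text': ['need help with AAAA system requesting help',
                 'AD-12 crashed, need support',
                 'fuel system down',
                 '/BBBB needs refresh']}
df = pd.DataFrame(data)

# Escape each name so characters such as '-' are matched literally,
# then wrap the alternation in one capture group for str.extract.
pattern = '(' + '|'.join(re.escape(name) for name in sys_list) + ')'

# expand=False returns a Series; rows without a match are NaN,
# which fillna replaces with the placeholder 'N/A'.
df['System'] = df['text'].str.extract(pattern, expand=False).fillna('N/A')
print(df)
```

The placeholder 'N/A' mirrors the answer above; use 0 instead if the column should match the output requested in the question.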