I have this dataset:
import pandas as pd
emails1 = ['[email protected]', "[email protected]", "[email protected]"]
emails2 = ['[email protected]', "[email protected]", '[email protected]', "[email protected]"]
emails3 = ["[email protected]", '[email protected]']
terms = ['@gmail.com', 'data', 'ddd@']
df = pd.DataFrame([emails1, emails2, emails3])
df["emails"] = df.apply(lambda x: [x[0], x[1], x[2], x[3]], axis=1)
df = df.iloc[: , 4:]
df
emails
0 [[email protected], [email protected], [email protected], None]
1 [[email protected], [email protected], [email protected], [email protected]]
2 [[email protected], [email protected], None, None]
For each list, I need to find the first item, counting from the end, that contains one of the strings in the terms array, so my output would be another column:
emails email wanted
0 [[email protected], [email protected], [email protected], None] [[email protected]]
1 [[email protected], [email protected], [email protected], [email protected]] [[email protected]]
2 [[email protected], [email protected], None, None] [[email protected]]
I tried this for each of the terms, intending to combine the results, but it does not work:
df["emails"].apply(lambda x:[i for i in x if '@gmail.com' in i])
Is there a good way of doing this?
CodePudding user response:
The exact logic is unclear, but assuming you want the last item in each list that contains any of the terms, you need a list comprehension:
import re
regex = re.compile('|'.join(map(re.escape, terms)))
# r'@gmail\.com|data|ddd@'
df['wanted'] = [next((x for x in l[::-1] if x and regex.search(x)), None)
                for l in df['emails']]
output:
emails wanted
0 [[email protected], [email protected], [email protected]... [email protected]
1 [[email protected], [email protected], [email protected]... [email protected]
2 [[email protected], [email protected], None, None] [email protected]
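The attempt in the question raises a TypeError because the lists are padded with None, and '@gmail.com' in None is not a valid membership test. If you would rather avoid a regex, here is a minimal sketch of the same "last matching item" logic using plain substring checks. The helper name last_match and the sample addresses are hypothetical (the addresses in the post are obscured), and it assumes "from the terms array" means "contains any of the terms as a substring":

```python
def last_match(items, terms):
    """Return the last non-None item containing any of the terms, else None."""
    for item in reversed(items):
        # Skip the None padding before doing substring checks.
        if item is not None and any(term in item for term in terms):
            return item
    return None


terms = ["@gmail.com", "data", "ddd@"]

# Hypothetical rows shaped like the "emails" column above.
rows = [
    ["alice@gmail.com", "bob@yahoo.com", "carol@work.org", None],
    ["dave@work.org", "data.team@corp.io", "erin@gmail.com", "frank@yahoo.com"],
    ["ddd@corp.io", "gina@yahoo.com", None, None],
]

wanted = [last_match(row, terms) for row in rows]
print(wanted)
```

With the DataFrame from the question, the same helper should drop into the list comprehension: df['wanted'] = [last_match(l, terms) for l in df['emails']]. Unlike the regex version, this does not need re.escape, since the terms are never interpreted as patterns.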