Create a new column based on keywords in YAKE

I'm trying to use YAKE to extract the keywords from a list of books' summaries.

df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)

Then I tried to use a loop and extract 1 keyword from each summary:

for i in df['Summary']:
  language = "en"
  max_ngram_size = 1
  deduplication_threshold = 0.9
  numOfKeywords = 1
  custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
  keywords = custom_kw_extractor.extract_keywords(i)
  for kw, w in keywords:
    print(kw)

The output is:

printed
Lorem

However, I'd like to add them as a new column in the same dataframe. The final output should be:

Book	Summary	Keywords
1	text definition includes the original words of something written, printed, or spoken	printed
2	example of the Lorem ipsum placeholder text on a green and white webpage	Lorem

I tried to assign the keyword to a new column:

df['keywords'] = kw

but it didn't work! It's been a while since I used Python and pandas, and I don't seem to remember how to do that!

Any help would be appreciated!

CodePudding user response：

Try df.Summary.apply:

import pandas as pd
import yake

language = "en"
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)

df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)

df['Keywords'] = df.Summary.apply(lambda x : custom_kw_extractor.extract_keywords(x)[0][0])

|    |   Book | Summary                                                                              | Keywords   |
|---:|-------:|:-------------------------------------------------------------------------------------|:-----------|
|  0 |      1 | text definition includes the original words of something written, printed, or spoken | printed    |
|  1 |      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | Lorem      |

Or with to_numpy():

df['Keywords'] = [custom_kw_extractor.extract_keywords(d)[0][0] for d in df.Summary.to_numpy()]

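To store several keywords per row as a list (the output below shows five per row, so numOfKeywords must be set to 5 rather than 1), again using to_numpy(), which is usually faster than df.apply:

df['Keywords'] = [[s[0] for s in custom_kw_extractor.extract_keywords(d)] for d in df.Summary.to_numpy()]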

|    |   Book | Summary                                                                              | Keywords                                               |
|---:|-------:|:-------------------------------------------------------------------------------------|:-------------------------------------------------------|
|  0 |      1 | text definition includes the original words of something written, printed, or spoken | ['printed', 'text', 'written', 'spoken', 'definition'] |
|  1 |      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | ['Lorem', 'webpage', 'ipsum', 'placeholder', 'text']   |

Or, if you want comma-separated strings:

df['Keywords'] = [','.join([s[0] for s in custom_kw_extractor.extract_keywords(d)]) for d in df.Summary.to_numpy()]

|    |   Book | Summary                                                                              | Keywords                               |
|---:|-------:|:-------------------------------------------------------------------------------------|:---------------------------------------|
|  0 |      1 | text definition includes the original words of something written, printed, or spoken | printed,text,written,spoken,definition |
|  1 |      2 | example of the Lorem ipsum placeholder text on a green and white webpage             | Lorem,webpage,ipsum,placeholder,text   |
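
One caveat with the `[0][0]` indexing used above: it raises an IndexError whenever the extractor returns an empty list, which can presumably happen for an empty summary. Here is a minimal sketch of a guard. Note that `stub_extract_keywords` and `first_keyword` are hypothetical names, and the stub only mimics the (keyword, score) tuples seen in this thread so the snippet runs without YAKE installed. In practice you would pass `custom_kw_extractor.extract_keywords` as `extract`.

```python
import pandas as pd


def stub_extract_keywords(text):
    """Hypothetical stand-in for custom_kw_extractor.extract_keywords.

    Returns (keyword, score) tuples for words longer than five characters.
    The scores are made up (the word position), not real YAKE scores.
    """
    words = [w.strip(",.") for w in text.split()]
    return [(w, float(i)) for i, w in enumerate(words) if len(w) > 5]


def first_keyword(text, extract=stub_extract_keywords):
    """Return the top keyword, or None when nothing was extracted."""
    keywords = extract(text)
    return keywords[0][0] if keywords else None


df = pd.DataFrame({
    "Book": [1, 2, 3],
    "Summary": ["written, printed, or spoken",
                "Lorem ipsum placeholder text",
                ""],  # an empty summary yields no keywords
})

# The empty summary gets a missing value instead of raising IndexError
df["Keywords"] = df["Summary"].apply(first_keyword)
```

With the real extractor the call would look like `df["Summary"].apply(first_keyword, extract=custom_kw_extractor.extract_keywords)`.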

CodePudding user response：

Update

1 keyword was for simplicity but I'd love it if I can generalise it to multiple keywords

For multiple keywords, change numOfKeywords and the lambda function:

language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 3  # <- Multiple keywords
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)

extract_keywords = lambda x: [k[0] for k in custom_kw_extractor.extract_keywords(x)]
df['TopKeyword'] = df['Summary'].apply(extract_keywords)

Output:

Book	Summary	TopKeyword
1	text definition includes the original words of something written, printed, or spoken	['printed', 'text', 'written']
2	example of the Lorem ipsum placeholder text on a green and white webpage	['Lorem', 'webpage', 'ipsum']

To get a string instead of a list, update the lambda function:

extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))

Output:

Book	Summary	TopKeyword
1	text definition includes the original words of something written, printed, or spoken	printed,text,written
2	example of the Lorem ipsum placeholder text on a green and white webpage	Lorem,webpage,ipsum
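
If you would rather have one column per keyword than a list or a joined string, the list column can be expanded. This is only a sketch under assumptions: the keyword lists are copied from the output above instead of being computed by YAKE, and the `Keyword_1`, `Keyword_2`, ... column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Book": [1, 2],
    # Keyword lists copied from the output shown above
    "TopKeyword": [["printed", "text", "written"],
                   ["Lorem", "webpage", "ipsum"]],
})

# One column per list position; shorter lists would be padded with missing values
expanded = pd.DataFrame(df["TopKeyword"].tolist(), index=df.index)
expanded.columns = [f"Keyword_{i + 1}" for i in expanded.columns]

# Attach the new columns to the original frame (aligned on the index)
result = df.join(expanded)
```

Passing `index=df.index` keeps the new columns aligned with the original rows even if the dataframe has a non-default index.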

Old answer

extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))
df['Keywords'] = df['Summary'].apply(extract_keywords)

You don't need a loop here; apply can do it for you.

language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)

extract_keyword = lambda x: custom_kw_extractor.extract_keywords(x)[0][0]
df['TopKeyword'] = df['Summary'].apply(extract_keyword)

Output:

Book	Summary	TopKeyword
1	text definition includes the original words of something written, printed, or spoken	printed
2	example of the Lorem ipsum placeholder text on a green and white webpage	Lorem
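
As for why the attempt in the question presumably failed: after the loop finishes, `kw` holds only the last keyword, and assigning a single string to a column repeats it on every row. If you prefer to keep the loop, collect one result per row in a list and assign the list. A rough sketch follows, where `stub_extract_keywords` is a hypothetical stand-in (it just picks the longest word) so the snippet runs without YAKE. Its keywords therefore differ from the real YAKE output.

```python
import pandas as pd


def stub_extract_keywords(text):
    """Hypothetical stand-in for custom_kw_extractor.extract_keywords.

    Returns a single (keyword, score) tuple: the longest word with a
    dummy score of 0.0. This is not how YAKE ranks keywords.
    """
    words = [w.strip(",.") for w in text.split()]
    return [(max(words, key=len), 0.0)]


df = pd.DataFrame({
    "Book": [1, 2],
    "Summary": [
        "text definition includes the original words of something written, printed, or spoken",
        "example of the Lorem ipsum placeholder text on a green and white webpage",
    ],
})

top_keywords = []
for summary in df["Summary"]:
    keywords = stub_extract_keywords(summary)
    top_keywords.append(keywords[0][0])  # keep one keyword per row

# A list with one entry per row fills the column row by row
df["Keywords"] = top_keywords

# A single string, by contrast, is repeated on every row
df["Broadcast"] = top_keywords[-1]
```

The list must have exactly as many entries as the dataframe has rows, otherwise pandas raises a ValueError.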