I'm trying to use YAKE to extract the keywords from a list of books' summaries.
import pandas as pd
import yake
df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)
Then I tried to use a loop and extract 1 keyword from each summary:
for i in df['Summary']:
    language = "en"
    max_ngram_size = 1
    deduplication_threshold = 0.9
    numOfKeywords = 1
    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
    keywords = custom_kw_extractor.extract_keywords(i)
    for kw, w in keywords:
        print(kw)
The output is:
printed
Lorem
However, I'd like to add them as a new column in the same dataframe. The final output should be:
Book | Summary | Keywords |
---|---|---|
1 | text definition includes the original words of something written, printed, or spoken | printed |
2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem |
I tried to assign the keyword to a new column
df['keywords'] = kw
but it didn't work! It's been a while since I used Python and pandas, and I can't seem to remember how to do that!
Any help would be appreciated!
CodePudding user response:
Try df.Summary.apply:
import pandas as pd
import yake
language = "en"
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
df = {'Book': [1, 2], 'Summary': ['text definition includes the original words of something written, printed, or spoken', 'example of the Lorem ipsum placeholder text on a green and white webpage']}
df = pd.DataFrame(df)
df['Keywords'] = df.Summary.apply(lambda x : custom_kw_extractor.extract_keywords(x)[0][0])
| | Book | Summary | Keywords |
|---:|-------:|:-------------------------------------------------------------------------------------|:-----------|
| 0 | 1 | text definition includes the original words of something written, printed, or spoken | printed |
| 1 | 2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem |
Or with to_numpy():
df['keywords'] = [custom_kw_extractor.extract_keywords(d)[0][0] for d in df.Summary.to_numpy()]
To get the keywords as lists, again using to_numpy() since it is usually faster than df.apply (the output below assumes numOfKeywords is set to 5):
df['Keywords'] = [[s[0] for s in custom_kw_extractor.extract_keywords(d)] for d in df.Summary.to_numpy()]
| | Book | Summary | Keywords |
|---:|-------:|:-------------------------------------------------------------------------------------|:-------------------------------------------------------|
| 0 | 1 | text definition includes the original words of something written, printed, or spoken | ['printed', 'text', 'written', 'spoken', 'definition'] |
| 1 | 2 | example of the Lorem ipsum placeholder text on a green and white webpage | ['Lorem', 'webpage', 'ipsum', 'placeholder', 'text'] |
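Whether to_numpy() actually beats apply depends on the data and on how expensive the extractor is, so it is worth measuring. The sketch below shows one way to compare the two. Note that toy_extract is a hypothetical stand-in that only mimics the (keyword, score) shape returned by extract_keywords, so the snippet runs without YAKE installed; swap in custom_kw_extractor.extract_keywords to time the real thing.

```python
import timeit

import pandas as pd


def toy_extract(text, top=5):
    """Hypothetical stand-in for extract_keywords: returns (keyword, score) pairs."""
    words = [w.strip(",.") for w in text.split() if w.strip(",.")]
    # Longest words first; this is NOT how YAKE scores keywords.
    ranked = sorted(set(words), key=lambda w: (-len(w), w))
    return [(w, 1.0 / len(w)) for w in ranked[:top]]


df = pd.DataFrame({"Summary": ["written, printed, or spoken"] * 1000})


def with_apply():
    return df["Summary"].apply(lambda x: [k for k, _ in toy_extract(x)]).tolist()


def with_to_numpy():
    return [[k for k, _ in toy_extract(d)] for d in df["Summary"].to_numpy()]


# Time five runs of each variant; the numbers vary by machine and data.
t_apply = timeit.timeit(with_apply, number=5)
t_numpy = timeit.timeit(with_to_numpy, number=5)
print(f"apply: {t_apply:.4f}s  to_numpy: {t_numpy:.4f}s")
```

Both variants produce the same lists, so the choice only affects speed.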
Or if you want comma separated strings:
df['Keywords'] = [','.join([s[0] for s in custom_kw_extractor.extract_keywords(d)]) for d in df.Summary.to_numpy()]
| | Book | Summary | Keywords |
|---:|-------:|:-------------------------------------------------------------------------------------|:---------------------------------------|
| 0 | 1 | text definition includes the original words of something written, printed, or spoken | printed,text,written,spoken,definition |
| 1 | 2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem,webpage,ipsum,placeholder,text |
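One thing to watch with the [0][0] indexing: it raises an IndexError if the extractor returns an empty list, which I'd assume can happen for an empty summary. A minimal sketch of a guard follows. Here toy_extract and first_keyword are hypothetical names, and toy_extract only imitates the (keyword, score) shape of extract_keywords so the example runs without YAKE.

```python
import pandas as pd


def toy_extract(text, top=1):
    """Hypothetical stand-in for extract_keywords: returns (keyword, score) pairs."""
    words = [w.strip(",.") for w in text.split() if w.strip(",.")]
    # Longest words first; this is NOT how YAKE scores keywords.
    ranked = sorted(set(words), key=lambda w: (-len(w), w))
    return [(w, 1.0 / len(w)) for w in ranked[:top]]


def first_keyword(text, extract=toy_extract):
    """Return the top keyword, or None when the extractor finds nothing."""
    pairs = extract(text)
    return pairs[0][0] if pairs else None


df = pd.DataFrame({"Book": [1, 2], "Summary": ["written, printed, or spoken", ""]})
df["Keywords"] = df["Summary"].apply(first_keyword)
print(df)
```

With the real extractor you would pass extract=custom_kw_extractor.extract_keywords.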
CodePudding user response:
Update
"1 keyword was for simplicity, but I'd love it if I could generalise it to multiple keywords"
For multiple keywords, change numOfKeywords and the lambda function:
language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 3 # <- Multiple keywords
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)
extract_keywords = lambda x: [k[0] for k in custom_kw_extractor.extract_keywords(x)]
df['TopKeyword'] = df['Summary'].apply(extract_keywords)
Output:
Book | Summary | TopKeyword |
---|---|---|
1 | text definition includes the original words of something written, printed, or spoken | ['printed', 'text', 'written'] |
2 | example of the Lorem ipsum placeholder text on a green and white webpage | ['Lorem', 'webpage', 'ipsum'] |
To get a string instead of a list, update the lambda function:
extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))
Output:
Book | Summary | TopKeyword |
---|---|---|
1 | text definition includes the original words of something written, printed, or spoken | printed,text,written |
2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem,webpage,ipsum |
Old answer
extract_keywords = lambda x: ','.join(k[0] for k in custom_kw_extractor.extract_keywords(x))
df['Keywords'] = df['Summary'].apply(extract_keywords)
You don't need a loop here; apply can do it for you.
language = 'en'
max_ngram_size = 1
deduplication_threshold = 0.9
numOfKeywords = 1
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size,
                                            dedupLim=deduplication_threshold,
                                            top=numOfKeywords, features=None)
extract_keyword = lambda x: custom_kw_extractor.extract_keywords(x)[0][0]
df['TopKeyword'] = df['Summary'].apply(extract_keyword)
Output:
Book | Summary | TopKeyword |
---|---|---|
1 | text definition includes the original words of something written, printed, or spoken | printed |
2 | example of the Lorem ipsum placeholder text on a green and white webpage | Lorem |
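As for why df['keywords'] = kw from the question didn't do what was expected: most likely, once the loop finishes, kw holds only the last keyword, and assigning a single string to a column copies it into every row. If you do want to keep the loop, collect one value per row in a list and assign the list. A minimal sketch, where toy_extract is a hypothetical stand-in that imitates the (keyword, score) shape of extract_keywords so it runs without YAKE:

```python
import pandas as pd


def toy_extract(text, top=1):
    """Hypothetical stand-in for extract_keywords: returns (keyword, score) pairs."""
    words = [w.strip(",.") for w in text.split() if w.strip(",.")]
    # Longest words first; this is NOT how YAKE scores keywords.
    ranked = sorted(set(words), key=lambda w: (-len(w), w))
    return [(w, 1.0 / len(w)) for w in ranked[:top]]


df = pd.DataFrame({"Book": [1, 2],
                   "Summary": ["written, printed, or spoken",
                               "Lorem ipsum placeholder text"]})

# What the question did: kw survives the loop with its last value only,
# and a scalar assignment broadcasts that value to every row.
for summary in df["Summary"]:
    for kw, score in toy_extract(summary):
        pass
df["broadcast"] = kw

# Loop-based fix: build a list with one entry per row, then assign it.
first_keywords = []
for summary in df["Summary"]:
    first_keywords.append(toy_extract(summary)[0][0])
df["Keywords"] = first_keywords
print(df)
```

The broadcast column ends up with the same keyword in both rows, while Keywords has one keyword per summary.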