Looping over a list of bigrams to search for, I need to create a boolean field for each bigram indicating whether it is present in a tokenized pandas series.
List of bigrams:
bigrams = ['data science', 'computer science', 'bachelors degree']
Dataframe:
df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                            ['computer', 'science', 'degree', 'masters'],
                                            ['bachelors', 'degree', 'computer', 'vision'],
                                            ['data', 'processing', 'science']]})
Desired Output:
job_description data science computer science bachelors degree
0 [data, science, degree, expert] True False False
1 [computer, science, degree, masters] False True False
2 [bachelors, degree, computer, vision] False False True
3 [data, processing, science] False False False
Criteria:
- Only exact matches should be flagged (for example, flagging for 'data science' should return True for 'data science' but False for 'science data' or 'data bachelors science')
- Each search term should get its own field, concatenated to the original df
What I've tried:
- Failed: df = [x for x in df['job_description'] if x in bigrams]
- Failed: df[bigrams] = [[any(w==term for w in lst) for term in bigrams] for lst in df['job_description']]
- Failed: Could not adapt the approach here -> "Match trigrams, bigrams, and unigrams to a text; if unigram or bigram a substring of already matched trigram, pass; python"
- Failed: Could not get this one to adapt either -> "Compare two bigrams lists and return the matching bigram"
- Failed: This method is very close, but I couldn't adapt it to bigrams -> "Create new boolean fields based on specific terms appearing in a tokenized pandas dataframe"
Thanks for any help you can provide!
CodePudding user response:
You could also try using numpy and nltk, which should be quite fast:
import pandas as pd
import numpy as np
import nltk
bigrams = ['data science', 'computer science', 'bachelors degree']
df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                            ['computer', 'science', 'degree', 'masters'],
                                            ['bachelors', 'degree', 'computer', 'vision'],
                                            ['data', 'processing', 'science']]})
def find_bigrams(data):
    # one boolean column per search bigram, one row per job description
    output = np.zeros((data.shape[0], len(bigrams)), dtype=bool)
    for i, d in enumerate(data):
        # adjacent token pairs from the token list and from its reverse
        possible_bigrams = [' '.join(x) for x in list(nltk.bigrams(d)) + list(nltk.bigrams(d[::-1]))]
        # positions of the search bigrams that occur in this row
        indices = np.where(np.isin(bigrams, list(set(bigrams).intersection(set(possible_bigrams)))))
        output[i, indices] = True
    return list(output.T)
output = find_bigrams(df['job_description'].to_numpy())
df = df.assign(**dict(zip(bigrams, output)))
| | job_description | data science | computer science | bachelors degree |
|---:|:----------------------------------------------|:---------------|:-------------------|:-------------------|
| 0 | ['data', 'science', 'degree', 'expert'] | True | False | False |
| 1 | ['computer', 'science', 'degree', 'masters'] | False | True | False |
| 2 | ['bachelors', 'degree', 'computer', 'vision'] | False | False | True |
| 3 | ['data', 'processing', 'science'] | False | False | False |
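For reference, the same idea can be written without nltk (this sketch is my own illustration, not part of the original answer, and only considers forward left-to-right pairs):

def bigram_flags(tokens, terms):
    # adjacent token pairs, e.g. ['data', 'science', 'degree'] -> {'data science', 'science degree'}
    pairs = {' '.join(p) for p in zip(tokens, tokens[1:])}
    return [term in pairs for term in terms]

df[bigrams] = [bigram_flags(lst, bigrams) for lst in df['job_description']]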
CodePudding user response:
You could use a regex and extractall:
regex = '|'.join('(%s)' % b.replace(' ', r'\s+') for b in bigrams)
matches = (df['job_description'].apply(' '.join)       # tokens back to one string per row
           .str.extractall(regex).droplevel(1).notna()
           .groupby(level=0).max()                     # one row per original index
           )
matches.columns = bigrams
out = df.join(matches).fillna(False)                   # rows with no match get False
output:
job_description data science computer science bachelors degree
0 [data, science, degree, expert] True False False
1 [computer, science, degree, masters] False True False
2 [bachelors, degree, computer, vision] False False True
3 [data, processing, science] False False False
generated regex:
'(data\\s+science)|(computer\\s+science)|(bachelors\\s+degree)'
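One caveat (my own note, not from the original answer): because the tokens are joined with spaces before matching, the pattern as written can also match inside a longer token, e.g. 'bachelors degree' would be found in the joined string 'bachelors degrees'. If that matters, word boundaries keep the match exact:

regex = '|'.join(r'(\b%s\b)' % b.replace(' ', r'\s+') for b in bigrams)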