Combine tokens sequentially into bigrams and split them into columns in a pandas dataframe

I have a dataframe:

import pandas as pd
data = [[2022, 4, ['apple', 'edible', 'fruit'], 'edible friut',
         'apple edible fruit'], [2022, 4, ['apple', 'edible', 'fruit'], ' apple edible',
         'apple edible fruit'], [2022, 4, ['apple', 'edible', 'fruit'], ' orange sweet',
         'apple edible fruit'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], ' green kiwi',
         'green kiwi fruit popular'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], 'fruit popular',
         'green kiwi fruit popular'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], 'yellow lemon',
         'green kiwi fruit popular'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], 'kiwi fruit',
         'green kiwi fruit popular']]
df = pd.DataFrame(data, columns = ['year', 'id', 'list_token_joined_bigram', 'old bigram', 'joined_bigram'])

The dataframe looks like this:

 ---- --- ----------------------------- ------------- ------------------------ 
|year|id |list_token_joined_bigram     |old bigram   |joined_bigram           |
 ---- --- ----------------------------- ------------- ------------------------ 
|2022|4  |[apple, edible, fruit]       |edible friut |apple edible fruit      |
|2022|4  |[apple, edible, fruit]       | apple edible|apple edible fruit      |
|2022|4  |[apple, edible, fruit]       | orange sweet|apple edible fruit      |
|2022|1  |[green, kiwi, fruit, popular]| green kiwi  |green kiwi fruit popular|
|2022|1  |[green, kiwi, fruit, popular]|fruit popular|green kiwi fruit popular|
|2022|1  |[green, kiwi, fruit, popular]|yellow lemon |green kiwi fruit popular|
|2022|1  |[green, kiwi, fruit, popular]|kiwi fruit   |green kiwi fruit popular|
 ---- --- ----------------------------- ------------- ------------------------

I want to build new bigrams from the list_token_joined_bigram column by joining consecutive tokens, and put them into a new column (new_bigram_after_join). If a new bigram matches the old one, the value in the joined_bigram column should be kept; if it does not match, it should be None. In the end, I want this:

 ---- --- ----------------------------- ------------- ------------------------ --------------------- 
|year|id |list_token_joined_bigram     |old bigram   |joined_bigram           |new_bigram_after_join|
 ---- --- ----------------------------- ------------- ------------------------ --------------------- 
|2022|4  |[apple, edible, fruit]       |edible friut |apple edible fruit      |apple edible         |
|2022|4  |[apple, edible, fruit]       | apple edible|apple edible fruit      |edible fruit         |
|2022|4  |[apple, edible, fruit]       | orange sweet|null                    |null                 |
|2022|1  |[green, kiwi, fruit, popular]| green kiwi  |green kiwi fruit popular|green kiwi           |
|2022|1  |[green, kiwi, fruit, popular]|fruit popular|green kiwi fruit popular|kiwi fruit           |
|2022|1  |[green, kiwi, fruit, popular]|yellow lemon |null                    |null                 |
|2022|1  |[green, kiwi, fruit, popular]|kiwi fruit   |green kiwi fruit popular|fruit popular        |
 ---- --- ----------------------------- ------------- ------------------------ ---------------------

CodePudding user response：

Here is a way to do this (note: in your sample data I fixed the typo 'friut' to 'fruit' and I removed the extraneous space in some of the old bigrams):

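def consecutive_pairs(lst):
    # Join each token with the one that follows it: ['a', 'b', 'c'] -> ['a b', 'b c']
    return [f'{a} {b}' for a, b in zip(lst, lst[1:])]

out = (
    df
    # a name without spaces can be used directly in .query()
    .rename(columns={'old bigram': 'old_bigram'})
    .assign(
        # one list of bigrams per row
        new_bigram=df['list_token_joined_bigram']
        .apply(consecutive_pairs)
    )
    # one row per bigram
    .explode('new_bigram')
    # keep only the bigrams that differ from old_bigram
    .query('new_bigram != old_bigram')
)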

>>> out.drop(columns='joined_bigram')
   year  id       list_token_joined_bigram     old_bigram     new_bigram
0  2022   4         [apple, edible, fruit]   edible fruit   apple edible
1  2022   4         [apple, edible, fruit]   apple edible   edible fruit
2  2022   4         [apple, edible, fruit]   orange sweet   apple edible
2  2022   4         [apple, edible, fruit]   orange sweet   edible fruit
3  2022   1  [green, kiwi, fruit, popular]     green kiwi     kiwi fruit
3  2022   1  [green, kiwi, fruit, popular]     green kiwi  fruit popular
4  2022   1  [green, kiwi, fruit, popular]  fruit popular     green kiwi
4  2022   1  [green, kiwi, fruit, popular]  fruit popular     kiwi fruit
5  2022   1  [green, kiwi, fruit, popular]   yellow lemon     green kiwi
5  2022   1  [green, kiwi, fruit, popular]   yellow lemon     kiwi fruit
5  2022   1  [green, kiwi, fruit, popular]   yellow lemon  fruit popular
6  2022   1  [green, kiwi, fruit, popular]     kiwi fruit     green kiwi
6  2022   1  [green, kiwi, fruit, popular]     kiwi fruit  fruit popular
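
The result above depends on the cleaned-up sample data (typo fixed, stray spaces removed). As a rough sketch of that cleanup, assuming the only problems are leading/trailing whitespace and the single known typo (the typo mapping is hard-coded here purely for illustration), something like this would do it:

```python
import pandas as pd

# Three of the question's 'old bigram' values, with the original typo and stray spaces.
df = pd.DataFrame({
    'old bigram': ['edible friut', ' apple edible', ' orange sweet'],
})

cleaned = (
    df['old bigram']
    .str.strip()                                 # drop leading/trailing spaces
    .replace({'edible friut': 'edible fruit'})   # fix the one known typo (whole-value match)
)
print(cleaned.tolist())
```

On real data the typo mapping would have to come from somewhere else; this only reproduces the manual fix for the sample.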

Explanation

.rename() turns the column name 'old bigram' into a valid identifier, so that it can be referenced directly in the .query() further down in the expression.
consecutive_pairs() builds a list of bigrams from a list of tokens, e.g. consecutive_pairs(['a', 'b', 'c', 'd']) == ['a b', 'b c', 'c d'].
.assign(...) adds a new column ("new_bigram") that initially holds one such list of pairs per row.
.explode() expands each of these lists so that every element gets its own row; the original index value is repeated for each of them, which is why index values appear more than once in the output.
.query() drops the rows where new_bigram is the same as old_bigram, keeping only the bigrams that differ from the old one.
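
To see what each step contributes, here is a small walk-through on a toy frame. The frame `toy`, its column name `tokens`, and the names `step1` to `step3` are made up for this illustration, and `.apply` is used for the element-wise call:

```python
import pandas as pd

def consecutive_pairs(lst):
    # Pair each token with its successor: ['a', 'b', 'c'] -> ['a b', 'b c']
    return [f'{a} {b}' for a, b in zip(lst, lst[1:])]

# Hypothetical one-row frame, only for illustration.
toy = pd.DataFrame({
    'tokens': [['a', 'b', 'c']],
    'old_bigram': ['b c'],
})

step1 = toy.assign(new_bigram=toy['tokens'].apply(consecutive_pairs))  # one list per row
step2 = step1.explode('new_bigram')                                    # one row per bigram
step3 = step2.query('new_bigram != old_bigram')                        # drop bigrams equal to old_bigram

print(step1['new_bigram'].tolist())  # [['a b', 'b c']]
print(step2['new_bigram'].tolist())  # ['a b', 'b c']
print(step3['new_bigram'].tolist())  # ['a b']
```

Note that a token list with a single element yields an empty list of pairs, so consecutive_pairs(['a']) == [].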

Note: in the display of the result, I dropped joined_bigram to make the result easier to read.
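
If the goal is instead to keep one row per original row and blank out joined_bigram where the old bigram is not one of the consecutive pairs (as the question describes), a possible variant is sketched below. It assumes the cleaned-up data with the column already renamed to old_bigram, and the names `is_match` and `flagged` are made up for this example. It does not fill the new_bigram_after_join column from the question.

```python
import pandas as pd

def consecutive_pairs(lst):
    # Pair each token with its successor: ['a', 'b', 'c'] -> ['a b', 'b c']
    return [f'{a} {b}' for a, b in zip(lst, lst[1:])]

# The first three (cleaned) rows of the sample data.
df = pd.DataFrame({
    'list_token_joined_bigram': [['apple', 'edible', 'fruit']] * 3,
    'old_bigram': ['edible fruit', 'apple edible', 'orange sweet'],
    'joined_bigram': ['apple edible fruit'] * 3,
})

# True where the row's old bigram is one of the consecutive pairs of its token list.
is_match = pd.Series(
    [old in consecutive_pairs(tokens)
     for old, tokens in zip(df['old_bigram'], df['list_token_joined_bigram'])],
    index=df.index,
)

# Keep joined_bigram on matching rows; the other rows become missing values.
flagged = df.assign(joined_bigram=df['joined_bigram'].where(is_match))
print(flagged[['old_bigram', 'joined_bigram']])
```

Here the first two rows keep 'apple edible fruit', while the 'orange sweet' row ends up with a missing value in joined_bigram.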