I have a dataframe:
import pandas as pd
data = [[2022, 4, ['apple', 'edible', 'fruit'], 'edible friut',
'apple edible fruit'], [2022, 4, ['apple', 'edible', 'fruit'], ' apple edible',
'apple edible fruit'], [2022, 4, ['apple', 'edible', 'fruit'], ' orange sweet',
'apple edible fruit'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], ' green kiwi',
'green kiwi fruit popular'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], 'fruit popular',
'green kiwi fruit popular'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], 'yellow lemon',
'green kiwi fruit popular'], [2022, 1, ['green', 'kiwi', 'fruit', 'popular'], 'kiwi fruit',
'green kiwi fruit popular']]
df = pd.DataFrame(data, columns = ['year', 'id', 'list_token_joined_bigram', 'old bigram', 'joined_bigram'])
The dataframe looks like this:
---- --- ----------------------------- ------------- ------------------------
|year|id |list_token_joined_bigram |old bigram |joined_bigram |
---- --- ----------------------------- ------------- ------------------------
|2022|4 |[apple, edible, fruit] |edible friut |apple edible fruit |
|2022|4 |[apple, edible, fruit] | apple edible|apple edible fruit |
|2022|4 |[apple, edible, fruit] | orange sweet|apple edible fruit |
|2022|1 |[green, kiwi, fruit, popular]| green kiwi |green kiwi fruit popular|
|2022|1 |[green, kiwi, fruit, popular]|fruit popular|green kiwi fruit popular|
|2022|1 |[green, kiwi, fruit, popular]|yellow lemon |green kiwi fruit popular|
|2022|1 |[green, kiwi, fruit, popular]|kiwi fruit |green kiwi fruit popular|
---- --- ----------------------------- ------------- ------------------------
I want to build new bigrams from the list_token_joined_bigram column by joining consecutive tokens, and put them into a new column (new_bigram_after_join). If the new bigram matches the old one, the value in the joined_bigram column should be kept; if it does not match, it should be None. In the end, I want it to be:
---- --- ----------------------------- ------------- ------------------------ ---------------------
|year|id |list_token_joined_bigram |old bigram |joined_bigram |new_bigram_after_join|
---- --- ----------------------------- ------------- ------------------------ ---------------------
|2022|4 |[apple, edible, fruit] |edible friut |apple edible fruit |apple edible |
|2022|4 |[apple, edible, fruit] | apple edible|apple edible fruit |edible fruit |
|2022|4 |[apple, edible, fruit] | orange sweet|null |null |
|2022|1 |[green, kiwi, fruit, popular]| green kiwi |green kiwi fruit popular|green kiwi |
|2022|1 |[green, kiwi, fruit, popular]|fruit popular|green kiwi fruit popular|kiwi fruit |
|2022|1 |[green, kiwi, fruit, popular]|yellow lemon |null |null |
|2022|1 |[green, kiwi, fruit, popular]|kiwi fruit |green kiwi fruit popular|fruit popular |
---- --- ----------------------------- ------------- ------------------------ ---------------------
CodePudding user response:
Here is a way to do this (note: in your sample data I fixed the typo 'friut' to 'fruit' and I removed the extraneous space in some of the old bigrams):
def consecutive_pairs(lst):
    return [f'{a} {b}' for a, b in zip(lst, lst[1:])]

out = (
    df
    .rename(columns={'old bigram': 'old_bigram'})
    .assign(
        new_bigram=df['list_token_joined_bigram']
        .transform(consecutive_pairs)
    )
    .explode('new_bigram')
    .query('new_bigram != old_bigram')
)
>>> out.drop(columns='joined_bigram')
year id list_token_joined_bigram old_bigram new_bigram
0 2022 4 [apple, edible, fruit] edible fruit apple edible
1 2022 4 [apple, edible, fruit] apple edible edible fruit
2 2022 4 [apple, edible, fruit] orange sweet apple edible
2 2022 4 [apple, edible, fruit] orange sweet edible fruit
3 2022 1 [green, kiwi, fruit, popular] green kiwi kiwi fruit
3 2022 1 [green, kiwi, fruit, popular] green kiwi fruit popular
4 2022 1 [green, kiwi, fruit, popular] fruit popular green kiwi
4 2022 1 [green, kiwi, fruit, popular] fruit popular kiwi fruit
5 2022 1 [green, kiwi, fruit, popular] yellow lemon green kiwi
5 2022 1 [green, kiwi, fruit, popular] yellow lemon kiwi fruit
5 2022 1 [green, kiwi, fruit, popular] yellow lemon fruit popular
6 2022 1 [green, kiwi, fruit, popular] kiwi fruit green kiwi
6 2022 1 [green, kiwi, fruit, popular] kiwi fruit fruit popular
Explanation
- The .rename() is to make the column 'old bigram' possible to handle as a proper identifier in the .query() further down in the expression.
- The function consecutive_pairs() simply makes a list of bigrams from the ngrams list. E.g. consecutive_pairs(['a', 'b', 'c', 'd']) == ['a b', 'b c', 'c d'].
- The result of the .assign(...) is a new column ("new_bigram") that initially contains a list of such pairs.
- .explode() expands each element of these lists into a new row.
- .query() filters out the cases where new_bigram is the same as old_bigram.
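To see the steps above in isolation, here is a minimal sketch on a toy frame. The frame toy and its tokens column are made up for illustration and are not part of the question's data; I also use .apply() here instead of .transform(), which should behave the same for this element-wise function.

```python
import pandas as pd

def consecutive_pairs(lst):
    # Pair each token with its successor: ['a', 'b', 'c'] -> ['a b', 'b c']
    return [f'{a} {b}' for a, b in zip(lst, lst[1:])]

# Toy data (hypothetical, only for illustration)
toy = pd.DataFrame({
    'tokens': [['a', 'b', 'c'], ['x', 'y']],
    'old_bigram': ['a b', 'q r'],
})

# Step 1: one list of bigrams per row
toy['new_bigram'] = toy['tokens'].apply(consecutive_pairs)

# Step 2: one row per bigram; the original index label is repeated
exploded = toy.explode('new_bigram')

# Step 3: keep only the rows whose new bigram differs from the old one
result = exploded.query('new_bigram != old_bigram')
print(result[['old_bigram', 'new_bigram']])
```

Note that after .explode() the index contains duplicates (0, 0, 1 here), which is why the same index label shows up several times in the output further up.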
Note: in the display of the result, I dropped joined_bigram to make the result easier to read.
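If you would rather keep one row per original row and blank out joined_bigram wherever the old bigram is not one of the consecutive pairs, something along these lines might work. This is only a sketch under my reading of the question: it uses a cleaned subset of the sample data, the name is_match is mine, and it does not try to fill new_bigram_after_join, because I am not sure how that column is meant to be derived.

```python
import pandas as pd

def consecutive_pairs(lst):
    # Pair each token with its successor: ['a', 'b', 'c'] -> ['a b', 'b c']
    return [f'{a} {b}' for a, b in zip(lst, lst[1:])]

# Cleaned subset of the sample data (typo and stray spaces fixed)
df = pd.DataFrame({
    'list_token_joined_bigram': [
        ['apple', 'edible', 'fruit'],
        ['apple', 'edible', 'fruit'],
        ['green', 'kiwi', 'fruit', 'popular'],
        ['green', 'kiwi', 'fruit', 'popular'],
    ],
    'old bigram': ['edible fruit', 'orange sweet', 'kiwi fruit', 'yellow lemon'],
    'joined_bigram': ['apple edible fruit', 'apple edible fruit',
                      'green kiwi fruit popular', 'green kiwi fruit popular'],
})

# True where the old bigram is one of the row's consecutive pairs
# (is_match is a hypothetical helper name)
is_match = pd.Series(
    [old.strip() in consecutive_pairs(tokens)
     for old, tokens in zip(df['old bigram'], df['list_token_joined_bigram'])],
    index=df.index,
)

# Keep joined_bigram on matching rows; the other rows become missing (NaN)
df['joined_bigram'] = df['joined_bigram'].where(is_match)
print(df[['old bigram', 'joined_bigram']])
```

The .strip() is there so that a leading space in the old bigram, as in the original sample, does not prevent a match.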