How to determine if a column of a dataframe has an element of several different lists?-CodePudding

Suppose I have 3 lists and I want to search in one of the columns of my dataframe among them.

Dataframe:

id      desxription
1         'this is bad'
2         'city tehran country iran'
3         'uA is a country'
5         'this is summer'
6         'this is winter'
7         'this is canada'
8         'this is toronto'

Lists:

L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']

Now I want to make a new column with respect of each list. If the row has an element of the list, extract it in the column, otherwise NA:

id      desxription                       L1          L2         L3
1         'this is bad'                      NA         NA          NA
2         'city tehran country iran'       NA         iran        tehran
3         'uA is a country'                  NA         uA          NA
5         'this is summer'                  summer      NA          NA
6         'this is winter'                  winter      NA          NA
7         'this is canada'                 NA         NA          canada
8         'this is toronto'               NA         NA          toronto

CodePudding user response：

Your df:

df = spark.createDataFrame(
    [
    ('1','this is bad'),
    ('2','city tehran country iran'),
    ('3','uA is a country'),
    ('5','this is summer'),
    ('6','this is winter'),
    ('7','this is canada'),
    ('8','this is toronto')
    ],
    ["id", "desxription"]

 --- ------------------------ 
|id |desxription             |
 --- ------------------------ 
|1  |this is bad             |
|2  |city tehran country iran|
|3  |uA is a country         |
|5  |this is summer          |
|6  |this is winter          |
|7  |this is canada          |
|8  |this is toronto         |
 --- ------------------------

Then you can use regexp_extract_all:

L1 = ['summer','winter','fall']
exp_L1 = '|'.join(L1)

L2 = ['iran' , 'uA']
exp_L2 = '|'.join(L1)

L3 = ['tehran','canada','toronto']
exp_L3 = '|'.join(L1)

df\
    .withColumn('L1', F.lit('|'.join(L1)))\
    .withColumn('L2', F.lit('|'.join(L2)))\
    .withColumn('L3', F.lit('|'.join(L3)))\
    .withColumn('L1', F.expr(r"regexp_extract_all(desxription, L1, 0)").getItem(0))\
    .withColumn('L2', F.expr(r"regexp_extract_all(desxription, L2, 0)").getItem(0))\
    .withColumn('L3', F.expr(r"regexp_extract_all(desxription, L3, 0)").getItem(0))\
    .na.fill('NA')\
    .show(truncate=False)

 --- ------------------------ ------ ---- ------- 
|id |desxription             |L1    |L2  |L3     |
 --- ------------------------ ------ ---- ------- 
|1  |this is bad             |NA    |NA  |NA     |
|2  |city tehran country iran|NA    |iran|tehran |
|3  |uA is a country         |NA    |uA  |NA     |
|5  |this is summer          |summer|NA  |NA     |
|6  |this is winter          |winter|NA  |NA     |
|7  |this is canada          |NA    |NA  |canada |
|8  |this is toronto         |NA    |NA  |toronto|
 --- ------------------------ ------ ---- -------

CodePudding user response：

You could use regexp_extract. While forming the pattern, use \b which will match word boundaries.

df = (df
    .withColumn('L1', F.regexp_extract('desxription', r'(\b'   r'\b)|(\b'.join(L1)   r'\b)', 0))
    .withColumn('L2', F.regexp_extract('desxription', r'(\b'   r'\b)|(\b'.join(L2)   r'\b)', 0))
    .withColumn('L3', F.regexp_extract('desxription', r'(\b'   r'\b)|(\b'.join(L3)   r'\b)', 0))
    .replace({'': 'NA'})
)