Suppose I have 3 lists and I want to search in one of the columns of my dataframe among them.
Dataframe:
id desxription
1 'this is bad'
2 'city tehran country iran'
3 'uA is a country'
5 'this is summer'
6 'this is winter'
7 'this is canada'
8 'this is toronto'
Lists:
L1 = ['summer', 'winter', 'fall']
L2 = ['iran', 'uA']
L3 = ['tehran', 'canada', 'toronto']
Now I want to make a new column with respect of each list. If the row has an element of the list, extract it in the column, otherwise NA:
id desxription L1 L2 L3
1 'this is bad' NA NA NA
2 'city tehran country iran' NA iran tehran
3 'uA is a country' NA uA NA
5 'this is summer' summer NA NA
6 'this is winter' winter NA NA
7 'this is canada' NA NA canada
8 'this is toronto' NA NA toronto
CodePudding user response:
Your df:
df = spark.createDataFrame(
[
('1','this is bad'),
('2','city tehran country iran'),
('3','uA is a country'),
('5','this is summer'),
('6','this is winter'),
('7','this is canada'),
('8','this is toronto')
],
["id", "desxription"]
--- ------------------------
|id |desxription |
--- ------------------------
|1 |this is bad |
|2 |city tehran country iran|
|3 |uA is a country |
|5 |this is summer |
|6 |this is winter |
|7 |this is canada |
|8 |this is toronto |
--- ------------------------
Then you can use regexp_extract_all
:
L1 = ['summer','winter','fall']
exp_L1 = '|'.join(L1)
L2 = ['iran' , 'uA']
exp_L2 = '|'.join(L1)
L3 = ['tehran','canada','toronto']
exp_L3 = '|'.join(L1)
df\
.withColumn('L1', F.lit('|'.join(L1)))\
.withColumn('L2', F.lit('|'.join(L2)))\
.withColumn('L3', F.lit('|'.join(L3)))\
.withColumn('L1', F.expr(r"regexp_extract_all(desxription, L1, 0)").getItem(0))\
.withColumn('L2', F.expr(r"regexp_extract_all(desxription, L2, 0)").getItem(0))\
.withColumn('L3', F.expr(r"regexp_extract_all(desxription, L3, 0)").getItem(0))\
.na.fill('NA')\
.show(truncate=False)
--- ------------------------ ------ ---- -------
|id |desxription |L1 |L2 |L3 |
--- ------------------------ ------ ---- -------
|1 |this is bad |NA |NA |NA |
|2 |city tehran country iran|NA |iran|tehran |
|3 |uA is a country |NA |uA |NA |
|5 |this is summer |summer|NA |NA |
|6 |this is winter |winter|NA |NA |
|7 |this is canada |NA |NA |canada |
|8 |this is toronto |NA |NA |toronto|
--- ------------------------ ------ ---- -------
CodePudding user response:
You could use regexp_extract
. While forming the pattern, use \b
which will match word boundaries.
df = (df
.withColumn('L1', F.regexp_extract('desxription', r'(\b' r'\b)|(\b'.join(L1) r'\b)', 0))
.withColumn('L2', F.regexp_extract('desxription', r'(\b' r'\b)|(\b'.join(L2) r'\b)', 0))
.withColumn('L3', F.regexp_extract('desxription', r'(\b' r'\b)|(\b'.join(L3) r'\b)', 0))
.replace({'': 'NA'})
)