Pyspark: Create additional column based on Regex

Time:10-06

I recently started using PySpark and I'm trying to figure out regex matching.

I've created a list of regexes. If any item in the list is found in the name column, the added column must be True. The matching must be case-insensitive, as shown in the example below.

I have a table in the following format:

seqno name
1 john jones
2 John Jones
3 John Stones
4 Mary Wild
5 William Wurt
6 steven wurt

I need to change the table above into the format of the table below. This is just a small part of the actual table, so hard-coding is unfortunately not an option.

seqno name regex
1 john jones True
2 John Jones True
3 John Stones True
4 Mary Wild False
5 William Wurt True
6 steven wurt True

Here is the code to create part of the Table:

regex_list = ['john', 'wurt']
columns = ['seqno', 'name']
data = [('1', 'john jones'),
        ('2', 'John Jones'),
        ('3', 'John Stones'),
        ('4', 'Mary Wild'),
        ('5', 'William Wurt'),
        ('6', 'steven wurt')]

df = spark.createDataFrame(data=data, schema=columns)

I've tried numerous approaches with .isin and .rlike but can't get it to work. Any help would be greatly appreciated.

Thanks in advance!

CodePudding user response:

Use rlike to check whether the name matches any of the regexes in the list. To make the match case-insensitive, convert both the list items and the column to the same case before testing. Code below:

from pyspark.sql.functions import col, upper
df.withColumn('regex', upper(col('name')).rlike('|'.join([x.upper() for x in regex_list]))).show()


+-----+------------+-----+
|seqno|        name|regex|
+-----+------------+-----+
|    1|  john jones| true|
|    2|  John Jones| true|
|    3| John Stones| true|
|    4|   Mary Wild|false|
|    5|William Wurt| true|
|    6| steven wurt| true|
+-----+------------+-----+
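
One caveat with uppercasing the patterns: it also rewrites regex escapes, so an entry such as \d would become \D and change meaning. A possible alternative, sketched below, is to leave the patterns untouched and prefix the joined pattern with the inline flag (?i), which both Python's re module and the Java regex engine behind rlike treat as case-insensitive matching. The helper name build_pattern is made up for this illustration, and the check runs locally with re rather than on a Spark DataFrame.

```python
import re

# Patterns from the question; treated as regexes, not literal strings.
regex_list = ["john", "wurt"]


def build_pattern(patterns):
    """Join patterns into one alternation prefixed with the (?i) flag.

    Hypothetical helper for this sketch, not part of PySpark.
    """
    return "(?i)" + "|".join(patterns)


pattern = build_pattern(regex_list)

# Local sanity check with Python's re module on the sample names.
names = ["john jones", "John Jones", "John Stones",
         "Mary Wild", "William Wurt", "steven wurt"]
flags = [re.search(pattern, name) is not None for name in names]

# Assuming an existing DataFrame `df` and `col` imported from
# pyspark.sql.functions, the same string could be passed to rlike:
#   df.withColumn("regex", col("name").rlike(pattern))
```

With this approach the name column does not need to be wrapped in upper(), since the flag applies to the whole joined pattern.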