pyspark regex extract all-CodePudding

Hi I have a dataframe like below.


id  | js                                    |
0   | bla var test bla ..                   |
1   | bla function RAM blob                 |
2   | function CPU blob blob                |
3   | thanks                                |
4   | bla var AWS and function twitter blaa |

I am trying to extract the next word after function or var

my code is here..

pattern3 = "(func)\s (\w )|(var)\s (\w )"

df = df.withColumn("js_extracted2", f.regexp_extract(f.col("js"),pattern3,4))

as it is capture only one word, the final row returns only AWS and not twitter.

so I would like to capture all matching.

My spark version is less than 3, 

so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\s (\w )|(var)\s (\w )', 4)")).show()


but it returns only empty for all rows. 

my expected output is


id  | js                                    | output
0   | bla var test bla ..                   | [test]
1   | bla function RAM blob                 | [RAM]
2   | function CPU blob blob                | [CPU]
3   | thanks                                | 
4   | bla var AWS and function twitter blaa | [AWS, twitter]

Thanks in advance.

CodePudding user response：

You need to use four \ to form a regular expression.

df = df.withColumn("js_extracted2", F.expr(f"regexp_extract_all(js, '(function|var)\\\\s (\\\\w )', 2)"))
df.show(truncate=False)

CodePudding user response：

use a positive look behind ascertion (?<=i)j. That is extract j only if i is immediately to its left

df.withColumn('name', regexp_extract('js', '((?<=function|var)(\s\w ))',1)).show()

--- -------------------- ----- 
| id|                  js| name|
 --- -------------------- ----- 
|  0| bla var test bla ..| test|
|  1| bla function RAM...|  RAM|
|  2|function CPU blob...|  CPU|
|  3|             thanks |     |
|  4|bla var AWS and f...|  AWS|
 --- -------------------- -----

CodePudding user response：

The question looks more like how to do this without using regexp_extract_all to me.

Here is another way without regex, just incase one has a problem of using regexp_extract_all:

noofwordstoextract = 1
df.withColumn("ArrayOfWords",F.split("js"," "))\
.withColumn("test",F.expr(f"""filter(transform(ArrayOfWords,(x,e)-> 
CASE WHEN x in ('var','function') 
THEN array_join(slice(ArrayOfWords,e 2,{noofwordstoextract}),' ') ELSE NULL END)
,y-> y is not NULL)""")).drop("ArrayOfWords").show()


 --- ------------------------------------- -------------- 
|id |js                                   |test          |
 --- ------------------------------------- -------------- 
|0  |bla var test bla ..                  |[test]        |
|1  |bla function RAM blob                |[RAM]         |
|2  |function CPU blob blob               |[CPU]         |
|3  |thanks                               |[]            |
|4  |bla var AWS and function twitter blaa|[AWS, twitter]|
 --- ------------------------------------- --------------

This solution splits the strings into an array then checks the current element is in 'var' or 'function' and then extracts the next n words (here 1) and then joins them to form the original string (try with noofwordstoextract=2) for more details.