Hi I have a dataframe like below.
id | js |
0 | bla var test bla .. |
1 | bla function RAM blob |
2 | function CPU blob blob |
3 | thanks |
4 | bla var AWS and function twitter blaa |
I am trying to extract the next word after function or var
my code is here..
pattern3 = "(func)\s (\w )|(var)\s (\w )"
df = df.withColumn("js_extracted2", f.regexp_extract(f.col("js"),pattern3,4))
as it is capture only one word, the final row returns only AWS and not twitter.
so I would like to capture all matching.
My spark version is less than 3,
so I tried df.withColumn('output', f.expr("regexp_extract_all(js, '(func)\s (\w )|(var)\s (\w )', 4)")).show()
but it returns only empty for all rows.
my expected output is
id | js | output
0 | bla var test bla .. | [test]
1 | bla function RAM blob | [RAM]
2 | function CPU blob blob | [CPU]
3 | thanks |
4 | bla var AWS and function twitter blaa | [AWS, twitter]
Thanks in advance.
CodePudding user response:
You need to use four \
to form a regular expression.
df = df.withColumn("js_extracted2", F.expr(f"regexp_extract_all(js, '(function|var)\\\\s (\\\\w )', 2)"))
df.show(truncate=False)
CodePudding user response:
use a positive look behind ascertion (?<=i)j
. That is extract j
only if i
is immediately to its left
df.withColumn('name', regexp_extract('js', '((?<=function|var)(\s\w ))',1)).show()
--- -------------------- -----
| id| js| name|
--- -------------------- -----
| 0| bla var test bla ..| test|
| 1| bla function RAM...| RAM|
| 2|function CPU blob...| CPU|
| 3| thanks | |
| 4|bla var AWS and f...| AWS|
--- -------------------- -----
CodePudding user response:
The question looks more like how to do this without using regexp_extract_all to me.
Here is another way without regex, just incase one has a problem of using regexp_extract_all
:
noofwordstoextract = 1
df.withColumn("ArrayOfWords",F.split("js"," "))\
.withColumn("test",F.expr(f"""filter(transform(ArrayOfWords,(x,e)->
CASE WHEN x in ('var','function')
THEN array_join(slice(ArrayOfWords,e 2,{noofwordstoextract}),' ') ELSE NULL END)
,y-> y is not NULL)""")).drop("ArrayOfWords").show()
--- ------------------------------------- --------------
|id |js |test |
--- ------------------------------------- --------------
|0 |bla var test bla .. |[test] |
|1 |bla function RAM blob |[RAM] |
|2 |function CPU blob blob |[CPU] |
|3 |thanks |[] |
|4 |bla var AWS and function twitter blaa|[AWS, twitter]|
--- ------------------------------------- --------------
This solution splits the strings into an array then checks the current element is in 'var' or 'function' and then extracts the next n words (here 1) and then joins them to form the original string (try with noofwordstoextract=2) for more details.