PySpark error with return _compile(pattern, flags).findall(string) - how to troubleshoot?

Time:03-08

I am trying to do sentiment analysis on a PySpark DataFrame column by counting the positive and negative words in it, using a list of words for each. The count works for the positive list, which has roughly 2k words, but the same method fails with the error below for the negative list, which is about twice as long (~4k words). What could be causing this, and how can I fix it?

I don't think the code itself is the problem, since it worked for the positive words, but I can't tell whether the negative list is simply too long to search for, or whether I'm missing something else. Here is a simplified example (not the exact list):

stories.show()

+--------------------+
|               words|
+--------------------+
|tom and jerry went t|
|she was angry when g|
|arnold became sad at|
+--------------------+


neg = ['angry','sad','sorrowful','angry']


#doing some counting manipulation here
df3.show()

Error:

spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1308         answer = self.gateway_client.send_command(command)
   1309         return_value = get_return_value(
-> 1310             answer, self.gateway_client, self.target_id, self.name)
   1311 
   1312         for temp_arg in temp_args:

/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "<ipython-input-6-97710da0cedd>", line 17, in countNegatives
  File "/usr/lib/python3.7/re.py", line 225, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/lib/python3.7/re.py", line 288, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 932, in parse
    p = _parse_sub(source, pattern, True, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 648, in _parse
    source.tell() - here + len(this))
re.error: multiple repeat at position 5

Expected output:

+--------------------+--------+
|               words|Negative|
+--------------------+--------+
|tom and jerry went t|      45|
|she was angry when g|      12|
|arnold became sad at|      54|
+--------------------+--------+

CodePudding user response:

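Your neg list contains characters that have special meaning in regular expressions (such as *, + or ?), so the pattern built from it is not a valid regex. The "multiple repeat" message means the regex parser found two repetition operators in a row. The length of the list is not the problem.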
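To see the failure in isolation, here is a minimal sketch outside Spark. The word list and the way the pattern is built (joining the words with "|") are assumptions for illustration, since the question does not show the actual counting code; "bull**" is a made-up entry standing in for any word that contains regex metacharacters.

```python
import re

# Hypothetical word list: the last entry contains '*', a regex quantifier.
neg = ["angry", "sad", "bull**"]

# Assumption: the pattern is built by joining the raw words with '|'.
pattern = "|".join(neg)  # 'angry|sad|bull**'

try:
    re.findall(pattern, "she was angry")
    error_message = None
except re.error as exc:
    # Compiling the pattern fails before any matching happens.
    error_message = str(exc)

print(error_message)
```

The pattern never compiles, so the error is raised for every row, whatever the text in the column is.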

You can escape the special characters with the re.escape() function. Apply it to each word before combining the words into a pattern, so that those characters are matched literally.
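A possible shape for the fix, as a sketch under the same assumptions as the question's simplified example: count_negatives is a hypothetical name (the traceback only shows that the original function is called countNegatives), and "bull**" is a made-up entry standing in for whichever word breaks the real list.

```python
import re

# Hypothetical negative-word list; 'bull**' contains regex metacharacters.
neg = ["angry", "sad", "sorrowful", "angry", "bull**"]

# Drop duplicates, try longer words first, and escape every word so that
# characters such as '*' are matched literally instead of as quantifiers.
unique_words = sorted(set(neg), key=len, reverse=True)
neg_pattern = re.compile("|".join(re.escape(word) for word in unique_words))


def count_negatives(text):
    """Return how many negative-word occurrences appear in text."""
    if text is None:  # a DataFrame column may contain nulls
        return 0
    return len(neg_pattern.findall(text))


print(count_negatives("she was angry and sad, what bull**"))
```

A function like this could then be wrapped in a UDF and applied to the words column. Note that it matches substrings, so "sad" is also counted inside "saddle"; whether that is acceptable depends on the word list.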
