I am trying to do sentiment analysis by counting positive and negative words in a PySpark DataFrame column, using one word list for each. The method works for the positive list (roughly 2k words), but the same method fails with the error below on the negative list, which is about twice as long (~4k words). What could be causing this, and how can I fix it?
I don't think the code is at fault, since it worked for the positive words, but I can't tell whether the negative list is simply too long or whether I'm missing something else. Here is an example (not the exact list):
stories.show()
+--------------------+
|               words|
+--------------------+
|tom and jerry went t|
|she was angry when g|
|arnold became sad at|
+--------------------+
neg = ['angry','sad','sorrowful','angry']
#doing some counting manipulation here
df3.show()
Error:
spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
1308 answer = self.gateway_client.send_command(command)
1309 return_value = get_return_value(
-> 1310 answer, self.gateway_client, self.target_id, self.name)
1311
1312 for temp_arg in temp_args:
/content/spark-3.2.0-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "<ipython-input-6-97710da0cedd>", line 17, in countNegatives
  File "/usr/lib/python3.7/re.py", line 225, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/lib/python3.7/re.py", line 288, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 932, in parse
    p = _parse_sub(source, pattern, True, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 648, in _parse
    source.tell() - here + len(this))
re.error: multiple repeat at position 5
Expected output:
+--------------------+--------+
|               words|Negative|
+--------------------+--------+
|tom and jerry went t|      45|
|she was angry when g|      12|
|arnold became sad at|      54|
+--------------------+--------+
Answer:
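Your neg list contains characters that have special meaning in regular expressions (such as *, + or ?), so the pattern built from it cannot be parsed. The message "multiple repeat" means that one repetition operator directly follows another, as in a**. The length of the list is not the problem.
You can escape the special characters in each word with the re.escape() function before building the pattern, so that they are matched literally.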
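Since the counting code was not posted, here is only a minimal sketch of the escaping step in plain Python. The helper names build_pattern and count_matches are hypothetical, and the entry 'grr**' is a made-up stand-in for whatever word in the real list contains regex metacharacters.

```python
import re

def build_pattern(words):
    """Compile one regex that matches any of the given words literally."""
    # Drop duplicates and try longer entries first.
    unique = sorted(set(words), key=len, reverse=True)
    # re.escape turns characters such as * and + into literal characters.
    alternation = "|".join(re.escape(w) for w in unique)
    # Lookarounds (instead of \b) keep whole-word matching working even for
    # entries that end in punctuation.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def count_matches(text, pattern):
    """Count occurrences of the listed words in text (None counts as 0)."""
    return len(pattern.findall(text or ""))

# Hypothetical list: 'grr**' stands in for an entry with special characters.
neg = ["angry", "sad", "sorrowful", "angry", "grr**"]
neg_pattern = build_pattern(neg)

print(count_matches("she was angry and sad, grr** again", neg_pattern))
```

Compiling the pattern once, outside the per-row function, also avoids rebuilding a 4k-word alternation for every row. In PySpark, a function like count_matches could then be wrapped with pyspark.sql.functions.udf and an IntegerType return type, assuming the original countNegatives was a UDF of that kind.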