While writing some PySpark code, I needed to install a Python module called fuzzywuzzy (which I use to compute Levenshtein distance).
This is a plain Python library, and it seems PySpark doesn't have the module installed... so how can I install this module inside PySpark?
CodePudding user response:
You'd use pip
as normal, with the caveat that Spark can run on multiple machines, so every machine in the Spark cluster (depending on your cluster manager) needs the same package installed, at the same version.
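For example, installing on each node might look like the following (the version pin is illustrative; `python-Levenshtein` is an optional speedup that fuzzywuzzy can use when it is available):

```shell
# Run on the driver and on every worker node
# (pin the version so all nodes stay in sync)
python -m pip install fuzzywuzzy==0.18.0

# Optional: C-accelerated Levenshtein backend used by fuzzywuzzy if present
python -m pip install python-Levenshtein
```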
Alternatively, you can pass zip, whl, or egg files via the --py-files
argument to spark-submit; these get unbundled and added to the Python path during code execution.
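A sketch of that workflow (the folder, archive, and application names here are hypothetical):

```shell
# Install the dependency into a local folder, then bundle it as a zip
pip install fuzzywuzzy -t deps
cd deps && zip -r ../deps.zip . && cd ..

# Ship the bundle to every executor alongside the job
spark-submit --py-files deps.zip my_app.py
```

You can achieve the same thing programmatically on a running job with `sc.addPyFile("deps.zip")` on the SparkContext.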