TypeError: Invalid argument, not a string or column: <function <lambda> at 0x7f1f357c6160>


I'm using the following snippet which creates a list of all .csv files in a directory in Databricks.

import os

csv_dir = '/my_dir/'
csv_paths = list(filter(lambda x: '.csv' in x, os.listdir(csv_dir)))

However, it yields the following error:

TypeError: Invalid argument, not a string or column: <function <lambda> at 0x7f1f357c6160> of type <class 'function'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I'm guessing my pure Python code has been mistaken for PySpark code. I tried adding %python at the top of the cell, but it still yielded the same result.

Yes, I have used PySpark and Python interchangeably in this notebook, but I've never faced this issue with lambda functions before.

Is there a workaround to avoid this behavior?

Please advise.

CodePudding user response:

It's most likely, as you guessed, that your code is calling PySpark's filter function instead of Python's built-in filter. The best practice when importing PySpark functions is to use an alias, e.g. import pyspark.sql.functions as F, so that those functions don't conflict with built-ins of the same name.
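A minimal sketch of the aliased-import approach (the directory path is just the placeholder from the question):

import os
import pyspark.sql.functions as F  # PySpark's filter, col, lit, ... stay behind the F. prefix

# Python's built-in filter is no longer shadowed, so this works as intended
csv_dir = '/my_dir/'
csv_paths = list(filter(lambda x: '.csv' in x, os.listdir(csv_dir)))

# PySpark functions remain available via the alias, e.g. F.col('path'), F.lit('.csv')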

But if you have already run from pyspark.sql.functions import *, you can call the built-in filter explicitly via __builtin__.filter:

csv_paths = list(__builtin__.filter(lambda x: '.csv' in x, os.listdir(csv_dir)))
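Note that __builtin__ is the Python 2 name for that module. On a standard Python 3 runtime the module is called builtins, so if __builtin__ isn't defined in your environment, this equivalent should work (same placeholder directory as above):

import builtins

# Explicitly use Python's built-in filter even though the PySpark one shadows the name
csv_paths = list(builtins.filter(lambda x: '.csv' in x, os.listdir(csv_dir)))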