Upload file to Databricks DBFS with Python API-CodePudding

I'm following the Databricks example for uploading a file to DBFS (in my case .csv):

import json
import requests
import base64

DOMAIN = '<databricks-instance>'
TOKEN = '<your-token>'
BASE_URL = 'https://%s/api/2.0/dbfs/' % (DOMAIN)

def dbfs_rpc(action, body):
  """ A helper function to make the DBFS API request, request/response is encoded/decoded as JSON """
  response = requests.post(
    BASE_URL   action,
    headers={'Authorization': 'Bearer %s' % TOKEN },
    json=body
  )
  return response.json()

# Create a handle that will be used to add blocks
handle = dbfs_rpc("create", {"path": "/temp/upload_large_file", "overwrite": "true"})['handle']
with open('/a/local/file') as f:
  while True:
    # A block can be at most 1MB
    block = f.read(1 << 20)
    if not block:
        break
    data = base64.standard_b64encode(block)
    dbfs_rpc("add-block", {"handle": handle, "data": data})
# close the handle to finish uploading
dbfs_rpc("close", {"handle": handle})

When using the tutorial as is, I get an error:

Traceback (most recent call last):
  File "db_api.py", line 65, in <module>
    data = base64.standard_b64encode(block)
  File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 95, in standard_b64encode
    return b64encode(s)
  File "C:\Miniconda3\envs\dash_p36\lib\base64.py", line 58, in b64encode
    encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'

I tried doing with open('./sample.csv', 'rb') as f: before passing the blocks to base64.standard_b64encode but then getting another error:

TypeError: Object of type 'bytes' is not JSON serializable

This happens when the encoded block data is being sent into the API call.
I tried skipping encoding entirely and just passing the blocks into the post call. In this case the file gets created in the DBFS but has 0 bytes size.

At this point I'm trying to make sense of it all. It doesn't want a string but it doesn't want bytes either. What am I doing wrong? Appreciate any help.

CodePudding user response：

In Python we have strings and bytes, which are two different entities note that there is no implicit conversion between them, so you need to know when to use which and how to convert when necessary. This answer provides nice explanation.

With the code snippet I see two issues:

This you already got - open by default reads the file as text. So your block is a string, while standard_b64encode expects bytes and returns bytes. To read bytes from file it needs to be opened in binary mode:

with open('/a/local/file', 'rb') as f:

Only strings can be encoded as JSON. There's no source code available for dbfs_rpc (or I can't find it), but apparently it expects a string, which it internally encodes. Since your data is bytes, you need to convert it to string explicitly and that's done using decode:

dbfs_rpc("add-block", {"handle": handle, "data": data.decode('utf8')})