NameError: name 'storage' is not defined python 3.8


I am trying to set up a Dataflow job that converts a JSON file to CSV and writes it to a bucket, using the Python script below for the write. (I tried this in a pyenv virtualenv with Python 3.8.13, since I am using apache-beam.) I have tried many versions of Python and google-cloud-storage. Is there an alternative that does not use the storage library?

import apache_beam as beam
import pandas as pd
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage
from smart_open import open


class WriteCSVFile(beam.DoFn):
    def __init__(self, bucket_name):
        self.bucket_name = bucket_name

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, mylist):
        df = pd.DataFrame(mylist, columns={'account_id': str, 'isActive': str, 'balance': str, 'age': str, 'eyeColor': str, 'name': str, 'gender': str, 'company': str, 'email': str, 'phone': str, 'address': str})
        bucket = self.client.get_bucket(self.bucket_name)
        bucket.blob(f"output_poc4.csv").upload_from_string(df.to_csv(index=False), 'text/csv')

Below is the error log:

File "/home/myprject/dataflow_poc.py", line 86, in <module>
run()
 File "/home/myprject/dataflow_poc.py", line 79, in run
(pipeline | 'Start' >> beam.Create([None])
File "/home/myprject/.pyenv/versions/dataflow/lib/python3.8/site- 
packages/apache_beam/pipeline.py", line 598, in __exit__
self.result.wait_until_finish()
File "/home/myprject/.pyenv/versions/dataflow/lib/python3.8/site- 
packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1673, in wait_until_finish
raise DataflowRuntimeException(
  apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline 
failed. State: FAILED, Error:
Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1458, in 
apache_beam.runners.common.DoFnRunner._invoke_bundle_method
File "apache_beam/runners/common.py", line 553, in 

  apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
  File 
 "apache_beam/runners/common.py", line 559, in 

 apache_beam.runners.common.DoFnInvoker 
  .invoke_start_bundle
  File 
 "/home/myprject/dataflow_poc.py", 
  line 53, in start_bundle
 NameError: name 'storage' is not 
 defined

Below are a few relevant packages from my pip freeze:

apache-beam==2.40.0
bcrypt==3.2.2
cachetools==4.2.4
certifi==2022.6.15
cffi==1.15.1
charset-normalizer==2.1.0
cloudpickle==2.1.0
crcmod==1.7
cryptography==37.0.2
dill==0.3.1.1
google-api-core==1.31.6
google-apitools==0.5.31
google-auth==1.35.0
google-auth-httplib2==0.1.0
google-cloud==0.34.0
google-cloud-bigquery==2.34.4
google-cloud-bigquery-storage==2.13.2
google-cloud-bigtable==1.7.2
google-cloud-core==1.7.2
google-cloud-datastore==1.15.5
google-cloud-dlp==3.7.1
google-cloud-language==1.3.2
google-cloud-pubsub==2.13.0
google-cloud-pubsublite==1.4.2
google-cloud-recommendations-ai==0.2.0
google-cloud-spanner==1.19.3
google-cloud-storage==2.4.0
google-cloud-videointelligence==1.16.3
google-cloud-vision==1.0.2
google-crc32c==1.3.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.3

CodePudding user response:

Add from google.cloud import storage inside the definition of start_bundle.
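A minimal sketch of the adjusted DoFn, assuming the rest of the pipeline stays as posted:

class WriteCSVFile(beam.DoFn):
    def __init__(self, bucket_name):
        self.bucket_name = bucket_name

    def start_bundle(self):
        # Importing here means the name is resolved at runtime on the
        # Dataflow worker, after the DoFn has been deserialized there.
        from google.cloud import storage
        self.client = storage.Client()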

CodePudding user response:

Posting my comment as an answer, which the OP confirmed working via comments.

You will need to move the from google.cloud import storage statement inside the function where it is used. In your case, that is the start_bundle function.

You may refer to the documentation on handling NameError, which also mentions:

By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
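
If you would rather keep the module-level import, Beam also exposes a save_main_session pipeline option that pickles the main session (including its imports) and restores it on the workers; a minimal sketch, assuming your existing PipelineOptions setup:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# Pickle the main session so module-level imports such as
# "from google.cloud import storage" are available on the Dataflow workers.
options.view_as(SetupOptions).save_main_session = True

Moving the import into start_bundle is still the more robust fix, since save_main_session can fail if anything in the main session is not picklable.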
