I'm trying to use the apache_beam.dataframe.io.read_csv
function to read an online source, with no success. Everything works if the file is hosted on Google Storage ('gs://bucket/source.csv'),
but it fails when getting the file from an 'https://github.com/../source.csv'-like source.
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

url = 'https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv'
with beam.Pipeline() as pipeline:
    original_collection = pipeline | read_csv(path=url)
    original_collection = original_collection[:5]
    original_collection | beam.Map(print)
This gives me:
ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv
Could anybody give me a hint?
CodePudding user response:
Beam can only read files from filesystems (like GCS, HDFS, etc.), not arbitrary URLs (which are difficult to parallelize reads from). Local files work as well on the direct runner.
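For instance, here is a minimal sketch of the pattern that does work, assuming the file has already been copied to a bucket (the gs:// path below is a placeholder), or to local disk when running on the direct runner:

import apache_beam as beam
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

# 'gs://my-bucket/people-100.csv' is a placeholder; a local path also works on the direct runner.
with beam.Pipeline() as pipeline:
    df = pipeline | read_csv('gs://my-bucket/people-100.csv')
    rows = to_pcollection(df)  # turn the deferred dataframe into a regular PCollection
    rows | beam.Map(print)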
Alternatively, you could do something like:

import csv, io, urllib.request
import apache_beam as beam

def parse_csv(contents):
    # Use the csv module (or pandas) to parse the contents string into rows.
    return list(csv.reader(io.StringIO(contents)))

with beam.Pipeline() as pipeline:
    urls = pipeline | beam.Create(['https://github.com/datablist/sample-csv-files/...'])
    contents = urls | beam.Map(lambda url: urllib.request.urlopen(url).read().decode('utf-8'))
    rows = contents | beam.FlatMap(parse_csv)
Probably easier to just save the file to a proper filesystem and read that...
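For example, a quick sketch of that route, assuming downloading to the local filesystem is acceptable (the /tmp path is just illustrative):

import urllib.request
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

url = 'https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv'
local_path = '/tmp/people-100.csv'  # illustrative; any Beam-supported filesystem path works
urllib.request.urlretrieve(url, local_path)  # one-time download, outside the pipeline

with beam.Pipeline() as pipeline:
    original_collection = pipeline | read_csv(local_path)  # now a supported path
    # ...continue with the same transforms as in the question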
CodePudding user response:
I don't think it's possible to load an external HTTP file directly in Beam.
You can use another process or service outside of Beam to copy your external files to a Cloud Storage bucket (for example with gsutil cp).
Then, in your Dataflow job, you can read the files from GCS without issues.
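A rough sketch under those assumptions (the bucket, project, and region values below are placeholders for your own setup):

# One-time copy outside Beam, e.g.: gsutil cp people-100.csv gs://my-bucket/people-100.csv
import apache_beam as beam
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner='DataflowRunner', project='my-project',
                          region='us-central1', temp_location='gs://my-bucket/tmp')
with beam.Pipeline(options=options) as pipeline:
    df = pipeline | read_csv('gs://my-bucket/people-100.csv')
    to_pcollection(df) | beam.Map(print)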