read public http csv data into Apache Beam


I'm trying to use the apache_beam.dataframe.io.read_csv function to read an online source, with no success. Everything works if the file is hosted on Google Storage ('gs://bucket/source.csv'), but it fails when getting the file from a source like 'https://github.com/../source.csv'.

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

url = 'https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv'

with beam.Pipeline() as pipeline:
    original_collection = pipeline | read_csv(path=url)
    original_collection = original_collection[:5]
    original_collection | beam.Map(print)

This gives me:

ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv

Could anybody give me a hint?

CodePudding user response:

Beam can only read files from filesystems (like GCS, HDFS, etc.), not arbitrary URLs, which are difficult to parallelize reads from. Local files work as well on the direct runner.
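
For example, the same read_csv call works against a local copy of the file on the direct runner (a minimal sketch; it assumes people-100.csv has already been downloaded to the working directory):

import apache_beam as beam
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as pipeline:
    # A local path is a valid filesystem path on the direct runner.
    df = pipeline | read_csv('people-100.csv')
    # Convert the deferred dataframe to a PCollection of rows and print them.
    to_pcollection(df) | beam.Map(print)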

Alternatively, you could do something like

import csv
import io
import urllib.request

import apache_beam as beam

def parse_csv(contents):
  # Use the csv module to parse the downloaded bytes into rows.
  return csv.reader(io.StringIO(contents.decode('utf-8')))

with beam.Pipeline() as pipeline:
    urls = pipeline | beam.Create(['https://github.com/datablist/sample-csv-files/...'])
    contents = urls | beam.Map(lambda url: urllib.request.urlopen(url).read())
    rows = contents | beam.FlatMap(parse_csv)

Probably easier to just save the file to a proper filesystem and read that...
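
If you go that route, the one-time download can happen outside the pipeline, e.g. with urllib.request.urlretrieve (a sketch; the local target name people-100.csv is an assumption), after which read_csv works as in the first snippet above:

import urllib.request

url = 'https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv'
# Save the file once to the local filesystem; read_csv can then consume it.
urllib.request.urlretrieve(url, 'people-100.csv')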

CodePudding user response:

I don't think it's possible to load a file from an arbitrary HTTP URL directly in Beam.

You could have another process or service, outside of Beam, copy your external files to a Cloud Storage bucket (for example with gsutil cp).

Then in your Dataflow job, you could read files from GCS without issues.
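
For example (a sketch; your-bucket is a placeholder bucket name):

# One-time copy to GCS, e.g. from a shell:
#   gsutil cp people-100.csv gs://your-bucket/people-100.csv
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as pipeline:
    # gs:// paths resolve once apache-beam[gcp] is installed.
    df = pipeline | read_csv('gs://your-bucket/people-100.csv')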
