I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
- The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
- The edge file must have the following columns: id, label, from and to.
- The vertex file must have id and label columns.
- I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
- It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
- And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
- Which columns need to be renamed to id, label, from and to? Or should I add new columns?
- Do I only need one vertex file or multiple?
CodePudding user response:
You can have one or more of each CSV file (nodes, edges), but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split each file up and load it in parallel.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends) might look like this:
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
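Because the ~from and ~to values are just strings that have to match a vertex ~id exactly, a typo in an ID would leave an edge pointing at a vertex that is not in your vertex file. Here is a minimal sketch of a pre-load check in plain Python. The function name find_dangling_edges and the sample rows (including the deliberately wrong dog-3) are my own illustration, not part of any Neptune tooling.

```python
import csv
import io

# Sample data in the same shape as the dog example (illustrative values only).
VERTEX_CSV = """~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
"""

EDGE_CSV = """~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
e-2,FRIENDS_WITH,dog-2,dog-3
"""


def find_dangling_edges(vertex_csv, edge_csv):
    """Return (edge_id, column, vertex_id) for each endpoint missing from the vertex file."""
    vertex_ids = {row["~id"] for row in csv.DictReader(io.StringIO(vertex_csv))}
    problems = []
    for row in csv.DictReader(io.StringIO(edge_csv)):
        for column in ("~from", "~to"):
            if row[column] not in vertex_ids:
                problems.append((row["~id"], column, row[column]))
    return problems


problems = find_dangling_edges(VERTEX_CSV, EDGE_CSV)
print(problems)
```

With the sample data above, the only endpoint reported is the ~to of edge e-2, since dog-3 does not appear in the vertex file.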
In Amazon Neptune, any user-provided string can be used as a node or edge ID, so long as it is unique. So in your example, if customer_no is guaranteed to be unique, rather than store it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a primary key in a relational database.
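To make this concrete for your CSV, here is a rough Python sketch that splits the flat rows into vertex rows and edge rows. Everything about the graph model is an assumption on my part: the Customer, Order and Item labels, the PLACED and CONTAINS edge labels, the ID prefixes, and the choice to hang zip_code on the customer and date_order_created on the order. The sample rows are invented. Adjust all of that to fit your data.

```python
import csv
import io

# Hypothetical sample rows using the column names from the question.
ORDERS_CSV = """date_order_created,customer_no,order_no,zip_code,item_id,item_short_description
2023-01-05,C100,O1,98101,I7,Blue mug
2023-01-05,C100,O1,98101,I9,Tea towel
2023-02-11,C200,O2,10001,I7,Blue mug
"""


def build_graph_rows(rows):
    """Split flat order rows into vertex rows (grouped by label) and edge rows."""
    vertices = {"Customer": {}, "Order": {}, "Item": {}}  # label -> {~id: row}
    edges = {}  # ~id -> row
    for r in rows:
        # Prefixes keep IDs unique across vertex types (an assumed convention).
        cust = "customer-" + r["customer_no"]
        order = "order-" + r["order_no"]
        item = "item-" + r["item_id"]
        # Keying by ~id means a repeated customer, order or item is written once.
        vertices["Customer"][cust] = {"~id": cust, "~label": "Customer", "zip_code": r["zip_code"]}
        vertices["Order"][order] = {"~id": order, "~label": "Order", "date_order_created": r["date_order_created"]}
        vertices["Item"][item] = {"~id": item, "~label": "Item", "item_short_description": r["item_short_description"]}
        for label, src, dst in (("PLACED", cust, order), ("CONTAINS", order, item)):
            edge_id = f"{label}-{src}-{dst}"
            edges[edge_id] = {"~id": edge_id, "~label": label, "~from": src, "~to": dst}
    return vertices, edges


def to_csv(rows):
    """Serialize same-shaped dict rows as CSV text with a header line."""
    rows = list(rows)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


vertices, edges = build_graph_rows(csv.DictReader(io.StringIO(ORDERS_CSV)))
customer_file = to_csv(vertices["Customer"].values())
edge_file = to_csv(edges.values())
print(customer_file)
print(edge_file)
```

So the answer to "rename or add columns" is really "derive new ones": the ~id, ~label, ~from and ~to values are built from your existing columns, and the remaining columns ride along as properties. Writing one vertex file per label, as sketched here, is just one option; the same rows could go into a single file.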
So in summary, you always need to provide the required fields like ~id and ~label (plus ~from and ~to in edge files). They are accessed differently, using Gremlin steps such as hasLabel and hasId, once the data is loaded. Columns with names from your domain like order_no will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
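If it helps to picture the split between the ~ columns and your own columns, here is a small Python sketch that reads the dog vertex file and separates the two. This is only a mental model, not how the bulk loader is implemented; the describe_rows name is made up, and it only understands the Int type hint used in the example header.

```python
import csv
import io

VERTEX_CSV = """~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
"""


def describe_rows(csv_text):
    """Separate the ~ system columns from the domain columns of each row."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"id": row["~id"], "label": row["~label"], "properties": {}}
        for column, value in row.items():
            if column.startswith("~"):
                continue  # ~id and ~label are not ordinary properties
            # A header such as age:Int carries a type hint after the colon.
            name, _, type_hint = column.partition(":")
            record["properties"][name] = int(value) if type_hint == "Int" else value
        records.append(record)
    return records


records = describe_rows(VERTEX_CSV)
print(records[0])
```

The id and label entries correspond to what hasId and hasLabel match on, while everything under properties is what a has(key, value) step would match on.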