A Beginner's Walkthrough for Building and Querying AWS Neptune with Gremlin

July 30, 2019

Amazon Neptune is a fast, reliable, fully managed graph database well suited to storing and querying highly connected data. For online applications that depend on navigating and leveraging the connections in their data, Amazon Neptune is a strong fit.

In this blog post, I’ll walk you through creating a graph database cluster hosted on the AWS Neptune service. I have consolidated the steps you need to create the cluster and allow it to import bulk data. I then take you through loading that data and querying the graph database in various ways, to get you started with using Gremlin to query that database.


Step 1:

Create the environment. The steps for how to build the cluster are described in this “Getting Started” guide.

VERY IMPORTANT - PLEASE READ THIS: Skipping this part can cause your AWS bill to be higher than expected. A single-instance Neptune DB cluster costs more than $250 per month if left running 24x7. If you delete the cluster after following the steps in this blog, so that it only exists for an hour or so, the cost will be minimal. At the time of writing, the DB instance cost for the smallest cluster is "db.r4.large: $0.348 per hour," and an EC2 instance is also created. Please be aware of the cost of the infrastructure you create, especially the Neptune and EC2 instances; links to pricing are below. You will be billed for what you create, so delete any infrastructure you no longer want to pay for. Everything is created with CloudFormation, so when you are finished you should be able to “Delete” the stack from the CloudFormation console, which deletes all the infrastructure it created. You will still have to delete the files from your S3 bucket to stop being charged for them.
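To make the cost warning concrete, here is a small Python sketch of the arithmetic. It uses only the hourly rate quoted above and assumes a 30-day month. It covers the DB instance alone, not the EC2 instance, storage, or any other charges, so treat it as a rough lower bound rather than a bill estimate.

```python
# Hourly rate quoted in this post for db.r4.large at the time of writing.
HOURLY_RATE_USD = 0.348


def db_instance_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Return the DB instance cost in USD for the given number of hours."""
    return hours * hourly_rate


# Assumption: a 30-day month, running 24x7.
one_hour = db_instance_cost(1)
thirty_days = db_instance_cost(24 * 30)

print(f"1 hour:            ${one_hour:.2f}")
print(f"30 days, 24x7:     ${thirty_days:.2f}")
```

Running the cluster for the hour or so this walkthrough takes costs well under a dollar for the DB instance, while forgetting about it for a month costs a little over $250.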





Once your environment is built, SSH to the EC2 instance that was created by the Getting Started guide.

Download the two sample data files that we will be walking through.



You will upload these two files to an S3 bucket that you will create shortly, and load them into your Neptune cluster in a later step.

Follow the instructions described here.

Following the instructions above, you will create an IAM role, attach that IAM role to your cluster, and create a VPC endpoint. Then, from your EC2 instance, you will run the curl commands from those instructions with the appropriate values for your environment: your-neptune-endpoint, bucket-name/object-key-name, region, and arn:aws:iam::account-id:role/role-name. The format used for the sample data is csv.
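As a rough illustration of what that curl command sends, the Python sketch below assembles a bulk-load request from the same placeholders. The field names (source, format, iamRoleArn, region, failOnError), the /loader path, and port 8182 reflect my understanding of the Neptune loader API, and the https scheme is an assumption, so check them against the instructions you followed. The function name build_loader_request is made up for this example, and the script only prints the command; it does not contact your cluster.

```python
import json


def build_loader_request(neptune_endpoint, s3_path, region, iam_role_arn,
                         scheme="https", port=8182):
    """Build the URL and JSON body for a Neptune bulk-load request.

    Assumption: the scheme and port must match how your cluster is configured.
    """
    url = f"{scheme}://{neptune_endpoint}:{port}/loader"
    payload = {
        "source": f"s3://{s3_path}",   # bucket and object key (or prefix)
        "format": "csv",               # the sample data is in CSV format
        "iamRoleArn": iam_role_arn,    # the role attached to the cluster
        "region": region,
        "failOnError": "FALSE",
    }
    return url, payload


# Placeholders from this post; substitute the values for your environment.
url, payload = build_loader_request(
    neptune_endpoint="your-neptune-endpoint",
    s3_path="bucket-name/object-key-name",
    region="region",
    iam_role_arn="arn:aws:iam::account-id:role/role-name",
)

print(f"curl -X POST -H 'Content-Type: application/json' {url} "
      f"-d '{json.dumps(payload)}'")
```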

With the steps above completed, you should be ready to connect to your Neptune cluster and run some queries.

While logged in to the EC2 instance created above, you should see a directory in your home directory named something like apache-tinkerpop-gremlin-console-3.3.2. Depending on when you try this, the version may be different. "cd" into that directory, then "cd bin".

In the bin directory, run the ./gremlin.sh command; this starts the Gremlin console process on your instance.

You should now be at a "gremlin>" prompt. Enter the command ":remote connect tinkerpop.server conf/neptune-remote.yaml". This configuration file was created as part of the CloudFormation build, so it should already contain the correct information for your cluster.

Next, enter the command ":remote console" to change your session to console mode, so you are able to send gremlin commands to your cluster.

To explain some of the fundamental terms related to graph databases: a “graph” is a data structure that is defined by two components, nodes and edges. A “node” is also called a “vertex,” and an “edge” is a connection between two nodes/vertices.
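If it helps to see those two components as data, here is a minimal Python sketch of a graph with two vertices and one edge. This is plain Python, not how Neptune stores anything. The airports and vertex IDs are taken from the examples later in this post, while the edge ID "e1" and the dictionary layout are made up for illustration.

```python
# Vertices (nodes), keyed by ID. Each one carries a label and some properties.
vertices = {
    "18": {"label": "airport", "code": "ORD", "city": "Chicago"},
    "4": {"label": "airport", "code": "BNA", "city": "Nashville"},
}

# Edges connect two vertices and have a direction: from "out" to "in".
# The edge ID "e1" is hypothetical.
edges = [
    {"id": "e1", "label": "route", "out": "18", "in": "4"},
]


def out_neighbors(vertex_id):
    """Return the IDs of vertices reachable over one outgoing edge."""
    return [e["in"] for e in edges if e["out"] == vertex_id]


destinations = [vertices[v]["code"] for v in out_neighbors("18")]
print(destinations)  # ['BNA']
```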

The sample data loaded is related to air-routes, which are flight routes between airports. The loaded graph contains several vertex types that are specified using labels, the most common ones being airport and country. There are also vertices for each of the seven continents (continent) and a single version vertex that is provided as a way to test which version of the graph you are using.

To count the vertices under each label, run the command below:

gremlin> g.V().label().groupCount()

==>{continent=7, country=237, version=1, airport=3437}
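For readers new to Gremlin, the label().groupCount() pattern behaves much like a counter keyed by label. The Python sketch below mimics it on a tiny, made-up vertex list. The IDs "c1" and "v0" are hypothetical, and the counts are nothing like the real graph's.

```python
from collections import Counter

# Toy vertex list (hypothetical, far smaller than the air-routes graph).
toy_vertices = [
    {"id": "18", "label": "airport", "code": "ORD"},
    {"id": "4", "label": "airport", "code": "BNA"},
    {"id": "c1", "label": "country", "code": "US"},
    {"id": "v0", "label": "version"},
]


def group_count_by_label(vertices):
    """Count how many vertices carry each label, like label().groupCount()."""
    return dict(Counter(v["label"] for v in vertices))


label_counts = group_count_by_label(toy_vertices)
print(label_counts)  # {'airport': 2, 'country': 1, 'version': 1}
```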

Each airport vertex has many properties associated with it giving various details about that airport. As an example, below is the vertex for O'Hare Airport:

gremlin> g.V().has('code', 'ORD').valueMap(true)

==>{country=[US], code=[ORD], longest=[13000], id=18, city=[Chicago], lon=[-87.90480042], type=[airport], elev=[672], icao=[KORD], region=[US-IL], runways=[7], lat=[41.97859955], desc=[Chicago O'Hare International Airport], label=airport}

Or you can add unfold() to the end to print each item on its own line:

gremlin> g.V().has('code', 'ORD').valueMap(true).unfold()













==>desc=[Chicago O'Hare International Airport]



To obtain a full count of the vertices (nodes) in the graph, enter "g.V().count()":

gremlin> g.V().count()



To list them you can enter:

gremlin> g.V()







Each node/vertex has a unique ID. As an example, in v[4], 4 is the unique ID. To identify the properties of that node you can enter:

gremlin> g.V("4").valueMap(true).unfold()













==>desc=[Nashville International Airport]


To group airports by country, and count them, you can enter:

gremlin> g.V().hasLabel('airport').groupCount().by('country')

==>{PR=6, PT=19, PW=1, PY=2, QA=1, AE=10, AF=4, AG=1, AI=1, AL=1, AM=2, AO=14, AR=38, AS=1, RE=2, AT=6, AU=128, AW=1, AZ=5, RO=14, BA=4, BB=1, RS=2, BD=7, RU=124, BE=5, BF=2, BG=4, RW=2, BH=1, BI=1, BJ=1, BL=1, BM=1, BN=1, BO=15, SA=26, BQ=3, SB=17, BR=115, SC=2, BS=18, SD=5, SE=39, BT=1, SG=2, BW=4, SH=2, SI=2, BY=3, BZ=13, SK=2, SL=1, SN=4, SO=5, CA=205, SR=1, CC=1, SS=1, CD=11, ST=1, CF=1, SV=1, CG=3, CH=5, SX=1, SY=2, CI=2, SZ=1, CK=6, CL=17, CM=5, CN=210, CO=50, TC=4, CR=13, TD=5, CU=12, CV=7, TG=1, TH=33, CW=1, CX=1, TJ=4, CY=3, CZ=5, TL=1, TM=1, TN=8, TO=1, TR=51, TT=2, DE=33, TV=1, TW=9, TZ=9, DJ=1, DK=8, DM=1, DO=7, UA=15, UG=4, DZ=30, UK=58, US=582, EC=15, EE=3, EG=10, EH=2, UY=2, UZ=11, VC=1, ER=1, ES=43, VE=24, ET=16, VG=2, VI=2, VN=22, VU=26, FI=20, FJ=10, FK=2, FM=4, FO=1, FR=59, WF=2, GA=2, WS=1, GD=1, GE=3, GF=1, GG=2, GH=5, GI=1, GL=14, GM=1, GN=1, GP=1, GQ=2, GR=39, GT=2, GU=1, GW=1, GY=2, HK=1, HN=6, HR=8, HT=2, YE=9, HU=2, ID=68, YT=1, IE=7, IL=5, IM=1, IN=76, ZA=20, IQ=6, IR=45, IS=5, IT=38, ZM=9, JE=1, ZW=3, JM=2, JO=2, JP=65, KE=14, KG=2, KH=3, KI=2, KM=1, KN=2, KP=1, KR=15, KS=1, KW=1, KY=3, KZ=21, LA=8, LB=1, LC=2, LK=6, LR=2, LS=1, LT=3, LU=1, LV=1, LY=11, MA=16, MD=1, ME=2, MF=1, MG=13, MH=2, MK=2, ML=1, MM=14, MN=10, MO=1, MP=2, MQ=1, MR=3, MS=1, MT=1, MU=2, MV=8, MW=2, MX=59, MY=35, MZ=10, NA=5, NC=1, NE=1, NF=1, NG=18, NI=1, NL=5, NO=49, NP=11, NR=1, NZ=25, OM=4, PA=5, PE=22, PF=30, PG=26, PH=40, PK=21, PL=13, PM=1}


Let’s say we want to identify which airports we can fly to directly from Chicago O'Hare. Enter:

gremlin> g.V().has('code','ORD').out().path().by('code')

==>[ORD, BNA]

==>[ORD, BWI]

==>[ORD, LGA]

==>[ORD, MSP]

==>[ORD, PBI]


gremlin> g.V().has('code','ORD').out().path().by('code').count()


To how many airports in each country can I fly from Chicago O'Hare without a layover?

gremlin> g.V().has('code','ORD').out().groupCount().by('country')

==>{DE=4, PR=1, JM=1, BE=1, HK=1, TW=1, JO=1, JP=2, DK=1, DO=1, FR=1, NZ=1, HU=1, BR=1, QA=1, BS=1, SE=1, UK=3, IE=1, US=178, CA=13, SV=1, AE=2, CH=1, SX=1, IN=1, KR=1, IS=1, CN=3, IT=2, MX=8, CO=1, GT=1, ES=2, KY=1, ET=1, CR=1, TC=1, PA=1, VI=1, AT=1, AW=1, LC=1, PL=2, NL=1, TR=1}

To just receive values from a query:

gremlin> g.V().has('airport','code','DFW').values()



==>Dallas/Fort Worth International Airport










To receive just a particular value from a query:

gremlin> g.V().has('airport','code','ORD').values('city')


To receive multiple specific values from a query, such as the number of runways and the elevation:

gremlin> g.V().has('airport','code','ORD').values('runways','elev')



To find items that have or do not have a particular property:

gremlin> g.E().has('dist')







This returns all the edges that have a dist (distance) property, each shown with its edge ID and its id-route->id pair. You can see that the route with ID 3682 is between airports 1 and 3. To find out what that distance is, we can enter:

gremlin> g.E("3682").valueMap()


And for 3683:

gremlin> g.E("3683").valueMap()


Here is an example comparing has and hasNot:

gremlin> g.V().has('region').count()


gremlin> g.V().hasNot('region').count()



To count the number of routes:

gremlin> g.V().outE('route').count()


Note that the outE step looks at outgoing edges.


How many of each type of edge are there?

gremlin> g.E().groupCount().by(label)

==>{contains=6874, route=49238}

Or, rewritten a different way:

gremlin> g.E().label().groupCount()

==>{contains=6874, route=49238}


It is also possible to use a traversal inside of a modulator. Such traversals are known as "anonymous traversals" as they do not include a beginning V or E step.

This capability allows us to combine multiple values together as part of a result. The example below finds five routes that start in Chicago O'Hare and creates a path result containing the airport code and city name for both the source and destination airports. In this case, the anonymous traversal contained within the by modulator is applied to each element in the path.

gremlin> g.V("18").out().limit(5).path().by(values('code','city').fold())

==>[[ORD, Chicago], [BNA, Nashville]]

==>[[ORD, Chicago], [BWI, Baltimore]]

==>[[ORD, Chicago], [LGA, New York]]

==>[[ORD, Chicago], [MSP, Minneapolis]]

==>[[ORD, Chicago], [PBI, West Palm Beach]]
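One way to picture the by modulator is as a function applied to every element of each path. The Python sketch below imitates that on a toy adjacency list holding three of the airports from the output above. The helper names paths_by and code_and_city are invented for this example, and this is only an analogy for how the step behaves, not how Gremlin is implemented.

```python
# Toy data: three airports and two routes taken from the output above.
toy_airports = {
    "ORD": {"code": "ORD", "city": "Chicago"},
    "BNA": {"code": "BNA", "city": "Nashville"},
    "BWI": {"code": "BWI", "city": "Baltimore"},
}
toy_routes = {"ORD": ["BNA", "BWI"]}


def paths_by(start, projection):
    """Build one-hop paths from start, applying projection to each element."""
    return [
        [projection(toy_airports[start]), projection(toy_airports[dest])]
        for dest in toy_routes.get(start, [])
    ]


def code_and_city(vertex):
    """Plays the role of the anonymous traversal values('code','city').fold()."""
    return [vertex["code"], vertex["city"]]


result_paths = paths_by("ORD", code_and_city)
for path in result_paths:
    print(path)
```

Swapping code_and_city for a function that returns only the code reproduces the simpler path().by('code') form used earlier.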

Here is an example that labels the results to help identify what is what within each result. It finds every airport in the US-CA region that can be reached directly from O'Hare, and adds the from and to labels to those results.

gremlin> g.V().has('code','ORD').as('from').out().has('region','US-CA').as('to').select('from','to').by('code')

==>{from=ORD, to=OAK}

==>{from=ORD, to=LAX}

==>{from=ORD, to=SFO}

==>{from=ORD, to=SJC}

==>{from=ORD, to=SAN}

==>{from=ORD, to=SNA}

==>{from=ORD, to=SMF}

==>{from=ORD, to=ONT}

==>{from=ORD, to=PSP}

==>{from=ORD, to=FAT}
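The as()/select() pair can be thought of as naming positions along the traversal and then reading them back by name. Here is a rough Python analogy on a toy graph: the start vertex is remembered as "from", each neighbor that passes the region filter is remembered as "to", and both are returned together. The four airports and the function name select_from_to are chosen purely for illustration.

```python
# Toy data: a handful of airports with their regions, and routes out of ORD.
toy_airports = {
    "ORD": {"region": "US-IL"},
    "LAX": {"region": "US-CA"},
    "SFO": {"region": "US-CA"},
    "BNA": {"region": "US-TN"},
}
toy_routes = {"ORD": ["BNA", "LAX", "SFO"]}


def select_from_to(start, region):
    """Pair the start airport with each direct destination in the region."""
    return [
        {"from": start, "to": dest}
        for dest in toy_routes.get(start, [])
        if toy_airports[dest]["region"] == region
    ]


labelled = select_from_to("ORD", "US-CA")
for row in labelled:
    print(row)
```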


What if you wanted to identify a specific edge, for example the edge that leaves Miami (MIA) and goes to Dallas (DFW)?

gremlin> g.V().has('code','MIA').outE().as('e').inV().has('code','DFW').select('e')


How about removing duplicates?

gremlin> g.V().has('region','GB-ENG').values('runways').fold()

==>[4, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 3, 1, 3, 3]

The above shows the number of runways at each airport in the GB-ENG region. Now let’s say we don't want any duplicates in that result:

gremlin> g.V().has('region','GB-ENG').values('runways').dedup().fold()

==>[4, 1, 2, 3]
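For comparison, the same de-duplication can be sketched in plain Python using the runway counts from the output above. This version keeps the first occurrence of each value, which happens to match the order the console printed; I would not rely on any particular ordering from dedup() in general.

```python
# Runway counts copied from the GB-ENG query output above.
runways = [4, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 1, 1, 1, 2, 3, 1, 3, 3]


def dedup(values):
    """Drop repeated values, keeping the first occurrence of each."""
    return list(dict.fromkeys(values))


unique_runways = dedup(runways)
print(unique_runways)  # [4, 1, 2, 3]
```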

What I have described here only scratches the surface of querying a graph database with Gremlin. If you would like to go deeper, check out the definitive guide book to Gremlin.
