How to migrate CSV, JSON and XML data to a Grakn knowledge graph

Goal

In this tutorial, our aim is to migrate some actual data to the phone_calls knowledge graph. We defined this schema previously, in the Defining the Schema section.

A Quick Look at the Schema

Before we get started with migration, let’s have a quick reminder of what the schema for the phone_calls knowledge graph looks like.

The Visualised Schema

Python or Node.js?

Pick a language of your choice to continue.

An Overview

Let’s go through a summary of how the migration takes place.

  1. We need a way to talk to our Grakn keyspace. To do this, we will use the Python Client.

  2. We will go through each data file, extracting each data item and parsing it into a Python dictionary.

  3. We will pass each data item (in the form of a Python dictionary) to its corresponding template function, which in turn gives us the constructed Graql query for inserting that item into Grakn.

  4. We will execute each of those queries to load the data into our target keyspace, phone_calls.

Before moving on, make sure you have Python 3 and pip3 installed, and that the Grakn server is running on your machine.

Getting Started

  1. Create a directory named phone_calls on your desktop.

  2. cd to the phone_calls directory via terminal.

  3. Run pip3 install grakn to install the Grakn Python Client.

  4. Open the phone_calls directory in your favourite text editor.

  5. Create a migrate.py file in the root directory. This is where we’re going to write all our code.

Including the Data Files

Pick one of the data formats below and download the files. After you download them, place the four files under the phone_calls/data directory. We will be using these to load their data into our phone_calls knowledge graph.

CSV: companies | people | contracts | calls

JSON: companies | people | contracts | calls

XML: companies | people | contracts | calls

Setting up the migration mechanism

All code that follows is to be written in phone_calls/migrate.py.

First things first, we import the grakn module. We will use it to connect to our phone_calls keyspace.

Next, we declare the inputs. More on this later. For now, what we need to understand about inputs is that it’s a list of dictionaries, each containing:

  • The path to the data file

  • The template function that receives a dictionary and produces the Graql insert query. We will define these template functions in a bit.
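Concretely, the inputs declaration might be sketched as below. The key names (data_path, template) and the file paths are assumptions based on this description, and the template functions are only placeholders here; the real ones are written later in this tutorial.

```python
# Placeholder template functions; the real implementations come later.
def company_template(company): pass
def person_template(person): pass
def contract_template(contract): pass
def call_template(call): pass

# One dictionary per data file: where to find it, and how to turn
# each of its items into a Graql insert query.
inputs = [
    {"data_path": "data/companies", "template": company_template},
    {"data_path": "data/people", "template": person_template},
    {"data_path": "data/contracts", "template": contract_template},
    {"data_path": "data/calls", "template": call_template},
]
```

Note that data_path carries no extension in this sketch; the assumption is that the format-specific parser appends .csv, .json or .xml before opening the file.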

Let’s move on.

build_phone_call_graph(inputs)

This is the main and only function we need to call to start loading data into Grakn.

What happens in this function is as follows:

  1. A Grakn client is created, connected to the server we have running locally.

  2. A session is created, connected to the keyspace phone_calls. Note that by using with, we indicate that the session will close after it’s been used.

  3. For each input dictionary in inputs, we call load_data_into_grakn(input, session). This takes care of loading the data specified in the input dictionary into our keyspace.
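The three steps above might be sketched like this, assuming the 1.x Grakn Python client (the grakn.Grakn constructor and the default port 48555 are assumptions, not the definitive implementation):

```python
def build_phone_call_graph(inputs):
    import grakn  # third-party client, imported here so the sketch stays self-contained

    # 1. a client connected to the locally running server
    client = grakn.Grakn(uri="localhost:48555")
    # 2. `with` guarantees the session closes after it's been used
    with client.session(keyspace="phone_calls") as session:
        # 3. load each data file as described by its input dictionary
        for input in inputs:
            load_data_into_grakn(input, session)
```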

load_data_into_grakn(input, session)

In order to load data from each file into Grakn, we need to:

  1. retrieve a list containing dictionaries, each of which represents a data item. We do this by calling parse_data_to_dictionaries(input)

  2. for each dictionary in items: a) create a transaction tx, which closes once used, b) construct the graql_insert_query using the corresponding template function, c) execute the query, and d) commit the transaction.
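Those steps might look as follows, again assuming the 1.x Python client (the TxType constant and the transaction API are assumptions from that client):

```python
def load_data_into_grakn(input, session):
    import grakn  # third-party client; TxType is assumed to live on the module

    # 1. one dictionary per data item
    items = parse_data_to_dictionaries(input)
    for item in items:
        # a) a write transaction that closes once used
        with session.transaction(grakn.TxType.WRITE) as tx:
            # b) the template function turns the dictionary into a Graql query
            graql_insert_query = input["template"](item)
            # c) execute, d) commit
            tx.query(graql_insert_query)
            tx.commit()
```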

Before we move on to parsing the data into dictionaries, let’s start with the template functions.

The Template Functions

Templates are simple functions that accept a dictionary, representing a single data item. The values within this dictionary fill in the blanks of the query template. The result will be a Graql insert query.

We need 4 of them. Let’s go through them one by one.

company_template

Example:

  • Goes in: { name: "Telecom" }

  • Comes out: insert $company isa company has name "Telecom";
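A minimal implementation matching this example (the key name comes from the "goes in" dictionary above):

```python
def company_template(company):
    # fills the single blank in the query with the company's name
    return 'insert $company isa company has name "' + company["name"] + '";'
```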

person_template

Example:

  • Goes in: { phone_number: "+44 091 xxx" }

  • Comes out: insert $person has phone-number "+44 091 xxx";

or:

  • Goes in: { first_name: "Jackie", last_name: "Joe", city: "Jimo", age: 77, phone_number: "+00 091 xxx" }

  • Comes out: insert $person has phone-number "+00 091 xxx" has first-name "Jackie" has last-name "Joe" has city "Jimo" has age 77;
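Both cases can be covered by one function: start with the attribute every person has, then append the customer-only attributes when they are present. A sketch:

```python
def person_template(person):
    # every person has a phone number
    graql_insert_query = 'insert $person has phone-number "' + person["phone_number"] + '"'
    # the remaining attributes exist only for people who are customers
    if "first_name" in person:
        graql_insert_query += ' has first-name "' + person["first_name"] + '"'
        graql_insert_query += ' has last-name "' + person["last_name"] + '"'
        graql_insert_query += ' has city "' + person["city"] + '"'
        graql_insert_query += " has age " + str(person["age"])
    return graql_insert_query + ";"
```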

contract_template

Example:

  • Goes in: { company_name: "Telecom", person_id: "+00 091 xxx" }

  • Comes out: match $company isa company has name "Telecom"; $customer isa person has phone-number "+00 091 xxx"; insert (provider: $company, customer: $customer) isa contract;
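A sketch matching this example:

```python
def contract_template(contract):
    # match the company by name and the customer by phone number,
    # then insert a contract relation between them
    graql_insert_query = 'match $company isa company has name "' + contract["company_name"] + '";'
    graql_insert_query += ' $customer isa person has phone-number "' + contract["person_id"] + '";'
    graql_insert_query += " insert (provider: $company, customer: $customer) isa contract;"
    return graql_insert_query
```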

call_template

Example:

  • Goes in: { caller_id: "+44 091 xxx", callee_id: "+00 091 xxx", started_at: 2018-08-10T07:57:51, duration: 148 }

  • Comes out: match $caller isa person has phone-number "+44 091 xxx"; $callee isa person has phone-number "+00 091 xxx"; insert $call(caller: $caller, callee: $callee) isa call; $call has started-at 2018-08-10T07:57:51; $call has duration 148;
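A sketch matching this example:

```python
def call_template(call):
    # match both ends of the call by phone number, then insert the call
    # relation along with its attributes
    graql_insert_query = 'match $caller isa person has phone-number "' + call["caller_id"] + '";'
    graql_insert_query += ' $callee isa person has phone-number "' + call["callee_id"] + '";'
    graql_insert_query += " insert $call(caller: $caller, callee: $callee) isa call;"
    graql_insert_query += " $call has started-at " + call["started_at"] + ";"
    graql_insert_query += " $call has duration " + str(call["duration"]) + ";"
    return graql_insert_query
```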

We’ve now created a template for each of the four concepts that were previously defined in the schema.

It’s time for the implementation of parse_data_to_dictionaries(input).

DataFormat-specific implementation

The implementation for parse_data_to_dictionaries(input) differs based on the format of our data files: .csv, .json or .xml.

We will use Python’s built-in csv library. Let’s import the module for it.

Moving on, we will write the implementation of parse_data_to_dictionaries(input) for parsing .csv files. Note that we use DictReader to map the information in each row to a dictionary.
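A sketch of that implementation, assuming the data_path convention from the inputs declaration (appending the .csv extension here is an assumption):

```python
import csv

def parse_data_to_dictionaries(input):
    items = []
    # DictReader maps each row to a dictionary keyed by the header columns
    with open(input["data_path"] + ".csv") as data_file:
        for row in csv.DictReader(data_file, skipinitialspace=True):
            items.append(dict(row))
    return items
```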

Besides this function, we need to make one more change.

Given the nature of CSV files, the dictionary produced will have all the columns of the .csv file as its keys, even when a value is missing; in that case, the value will simply be a blank string.

For this reason, we need to change one line in our person_template function.

if "first_name" in person becomes if person["first_name"] != "".

We will use ijson, an iterative JSON parser with a standard Python iterator interface.

Via the terminal, while in the phone_calls directory, run pip3 install ijson and import the module for it.

Moving on, we will write the implementation of parse_data_to_dictionaries(input) for processing .json files.

We will use Python’s built-in xml.etree.cElementTree library. Let’s import the module for it.

For parsing XML data, we need to know the target tag name. This needs to be specified for each data file in our inputs declaration.

And now for the implementation of parse_data_to_dictionaries(input) for parsing .xml files.

The implementation below, although not the most generic, performs well with very large .xml files. Note that many libraries that parse XML into dictionaries pull the entire .xml file into memory first. There is nothing wrong with that approach when dealing with small files, but for large files it’s a no-go.
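One way to sketch such a streaming implementation is with iterparse from the standard library (xml.etree.cElementTree is an alias for xml.etree.ElementTree in Python 3). The selector key holding the target tag name is an assumption from the inputs declaration:

```python
import xml.etree.ElementTree as etree

def parse_data_to_dictionaries(input):
    items = []
    # iterparse streams the file rather than loading it into memory at once
    for _, element in etree.iterparse(input["data_path"] + ".xml", events=("end",)):
        if element.tag == input["selector"]:
            # one dictionary per target element, one key per child tag
            items.append({child.tag: child.text for child in element})
            element.clear()  # free the parsed subtree to keep memory usage flat
    return items
```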

Putting it all together

Here is how our migrate.py looks for each data format.

Time to Load

Run python3 migrate.py

Sit back, relax and watch the logs while the data starts pouring into Grakn.

... so far with the migration

We started off by setting up our project and positioning the data files.

Next we went on to set up the migration mechanism, one that was independent of the data format.

Then, we went ahead and wrote the template functions whose only job was to construct a Graql insert query based on the data passed to them.

After that, we learned how files with different data formats can be parsed into Python dictionaries.

Lastly, we ran python3 migrate.py which fired the build_phone_call_graph function with the given inputs. This loaded the data into our Grakn knowledge graph.

An Overview

Let’s go through a summary of how the migration takes place.

  1. We need a way to talk to our Grakn keyspace. To do this, we will use the Node.js Client.

  2. We will go through each data file, extracting each data item and parsing it into a JavaScript object.

  3. We will pass each data item (in the form of a JavaScript object) to its corresponding template function, which in turn gives us the constructed Graql query for inserting that item into Grakn.

  4. We will execute each of those queries to load the data into our target keyspace, phone_calls.

Before moving on, make sure you have npm installed, and that the Grakn server is running on your machine.

Getting Started

  1. Create a directory named phone_calls on your desktop.

  2. cd to the phone_calls directory via terminal.

  3. Run npm install grakn to install the Grakn Node.js Client.

  4. Open the phone_calls directory in your favourite text editor.

  5. Create a migrate.js file in the root directory. This is where we’re going to write all our code.

Including the Data Files

Pick one of the data formats below and download the files. After you download them, place the four files under the phone_calls/data directory. We will be using these to load their data into our phone_calls knowledge graph.

CSV: companies | people | contracts | calls

JSON: companies | people | contracts | calls

XML: companies | people | contracts | calls

Setting up the migration mechanism

All code that follows is to be written in phone_calls/migrate.js.

First things first, we require the grakn module. We will use it to connect to our phone_calls keyspace.

Next, we declare the inputs. More on this later. For now, what we need to understand about inputs is that it’s an array of objects, each containing:

  • The path to the data file

  • The template function that receives an object and produces the Graql insert query. We will define these template functions in a bit.

Let’s move on.

buildPhoneCallGraph(inputs)

This is the main and only function we need to call to start loading data into Grakn.

What happens in this function is as follows:

  1. A grakn instance is created, connected to the server we have running locally.

  2. A session is created, connected to the keyspace phone_calls.

  3. For each input object in inputs, we call loadDataIntoGrakn(input, session). This takes care of loading the data specified in the input object into our keyspace.

  4. The session is closed.
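The four steps above might be sketched as below, assuming the 1.x Grakn Node.js client (the constructor, the default port 48555 and the session call are assumptions, not the definitive implementation):

```javascript
async function buildPhoneCallGraph(inputs) {
  // third-party client, required here so the sketch stays self-contained
  const Grakn = require("grakn");
  // 1. a grakn instance talking to the locally running server
  const grakn = new Grakn("localhost:48555");
  // 2. a session on the phone_calls keyspace
  const session = grakn.session("phone_calls");
  // 3. load each data file as described by its input object
  for (const input of inputs) {
    await loadDataIntoGrakn(input, session);
  }
  // 4. close the session
  await session.close();
}
```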

loadDataIntoGrakn(input, session)

In order to load data from each file into Grakn, we need to:

  1. retrieve a list containing objects, each of which represents a data item. We do this by calling parseDataToObjects(input)

  2. for each object in items: a) create a transaction tx, b) construct the graqlInsertQuery using the corresponding template function, c) run the query, and d) commit the transaction.
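Those steps might look as follows, again assuming the 1.x Node.js client (the txType constant is an assumption from that client's API):

```javascript
async function loadDataIntoGrakn(input, session) {
  // third-party client; the txType constants are assumed from the 1.x API
  const Grakn = require("grakn");
  // 1. one object per data item
  const items = await parseDataToObjects(input);
  for (const item of items) {
    // a) a write transaction
    const tx = await session.transaction(Grakn.txType.WRITE);
    // b) the template function turns the object into a Graql query
    const graqlInsertQuery = input.template(item);
    // c) run it, d) commit
    await tx.query(graqlInsertQuery);
    await tx.commit();
  }
}
```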

Before we move on to parsing the data into objects, let’s start with the template functions.

The Template Functions

Templates are simple functions that accept an object, representing a single data item. The values within this object fill in the blanks of the query template. The result will be a Graql insert query.

We need 4 of them. Let’s go through them one by one.

companyTemplate

Example:

  • Goes in: { name: "Telecom" }

  • Comes out: insert $company isa company has name "Telecom";

personTemplate

Example:

  • Goes in: { phone_number: "+44 091 xxx" }

  • Comes out: insert $person has phone-number "+44 091 xxx";

or:

  • Goes in: { first_name: "Jackie", last_name: "Joe", city: "Jimo", age: 77, phone_number: "+00 091 xxx" }

  • Comes out: insert $person has phone-number "+00 091 xxx" has first-name "Jackie" has last-name "Joe" has city "Jimo" has age 77;
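Both cases can be covered by one function: start with the attribute every person has, then append the customer-only attributes when they are present. A sketch:

```javascript
function personTemplate(person) {
  // every person has a phone number
  let graqlInsertQuery = `insert $person has phone-number "${person.phone_number}"`;
  // the remaining attributes exist only for people who are customers
  if (typeof person.first_name !== "undefined") {
    graqlInsertQuery += ` has first-name "${person.first_name}"`;
    graqlInsertQuery += ` has last-name "${person.last_name}"`;
    graqlInsertQuery += ` has city "${person.city}"`;
    graqlInsertQuery += ` has age ${person.age}`;
  }
  return graqlInsertQuery + ";";
}
```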

contractTemplate

Example:

  • Goes in: { company_name: "Telecom", person_id: "+00 091 xxx" }

  • Comes out: match $company isa company has name "Telecom"; $customer isa person has phone-number "+00 091 xxx"; insert (provider: $company, customer: $customer) isa contract;

callTemplate

Example:

  • Goes in: { caller_id: "+44 091 xxx", callee_id: "+00 091 xxx", started_at: 2018-08-10T07:57:51, duration: 148 }

  • Comes out: match $caller isa person has phone-number "+44 091 xxx"; $callee isa person has phone-number "+00 091 xxx"; insert $call(caller: $caller, callee: $callee) isa call; $call has started-at 2018-08-10T07:57:51; $call has duration 148;
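A sketch matching this example:

```javascript
function callTemplate(call) {
  // match both ends of the call by phone number, then insert the call
  // relation along with its attributes
  let graqlInsertQuery = `match $caller isa person has phone-number "${call.caller_id}";`;
  graqlInsertQuery += ` $callee isa person has phone-number "${call.callee_id}";`;
  graqlInsertQuery += ` insert $call(caller: $caller, callee: $callee) isa call;`;
  graqlInsertQuery += ` $call has started-at ${call.started_at};`;
  graqlInsertQuery += ` $call has duration ${call.duration};`;
  return graqlInsertQuery;
}
```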

We’ve now created a template for each of the four concepts that were previously defined in the schema.

It’s time for the implementation of parseDataToObjects(input).

DataFormat-specific implementation

The implementation for parseDataToObjects(input) differs based on the format of our data files: .csv, .json or .xml.

We will use Papaparse, a CSV (or delimited text) parser.

Via the terminal, while in the phone_calls directory, run npm install papaparse and require the module for it.

Moving on, we will write the implementation of parseDataToObjects(input) for parsing .csv files.

Besides this function, we need to make one more change.

Given the nature of CSV files, the object produced will have all the columns of the .csv file as its keys, even when a value is missing; in that case, the value will simply be a blank string.

For this reason, we need to change one line in our personTemplate function.

const isNotCustomer = typeof first_name === "undefined";

becomes

const isNotCustomer = first_name === "";

We will use stream-json for custom JSON processing pipelines with a minimal memory footprint.

Via the terminal, while in the phone_calls directory, run npm install stream-json and require the modules for it.

Moving on, we will write the implementation of parseDataToObjects(input) for processing .json files.

We will use xml-stream, an xml stream parser.

Via the terminal, while in the phone_calls directory, run npm install xml-stream and require the module for it.

For parsing XML data, we need to know the target tag name. This needs to be specified for each data file in our inputs declaration.

And now for the implementation of parseDataToObjects(input) for parsing .xml files.

Putting it all together

Here is how our migrate.js looks for each data format.

Time to Load

Run node migrate.js

Sit back, relax and watch the logs while the data starts pouring into Grakn.

… so far with the migration

We started off by setting up our project and positioning the data files.

Next we went on to set up the migration mechanism, one that was independent of the data format.

Then, we went ahead and wrote a template function for each concept. A template’s sole purpose was to construct a Graql insert query for each data item.

After that, we learned how files with different data formats can be parsed into JavaScript objects.

Lastly, we ran node migrate.js, which fired the buildPhoneCallGraph function with the given inputs. This loaded the data into our Grakn knowledge graph.


Next

Now that we have some actual data in our knowledge graph, we can go ahead and query for insights.

Tags: migration