
February 16, 2023 | Blog, Data Analysis, Neo4j

Ingesting Big Data into Neo4j – Part 2

Check out Part 2 of Ebru Cucen and Fahran Wallace’s blog series, in which they discuss their experience ingesting 400 million nodes and a billion relationships into Neo4j and what they discovered along the way.

WRITTEN BY

Fahran Wallace

Senior Consultant


Transformation – Journey from JSONL to CSV 

Following our previous blog post on modelling and extracting data ready for loading into a graph database, this post covers transforming 350 GB of compressed JSONL into CSV. We'll ingest that CSV using the neo4j-admin import tool in part 3.

Pre-Processing

Before ingesting the data, it is important to understand its size and shape. This is the most basic analysis you can perform on any dataset (a small sketch follows this list):

  • Distribution of file sizes in a partitioned dataset: we had some files larger than 20 GB – for these, the transformation process may need extra logging, since failures are more likely, as well as further partitioning of the files
  • Schema of the data: a couple of lines tell us a lot about the data, and should be sufficient for the first iteration, where we sketch a schema. The analysis of which columns are mostly null, and should be dropped from the schema, can be pushed to a later stage.
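
As a rough illustration, here is a minimal Python sketch of that first pass. The directory, file naming and sample size are assumptions for the example; it prints each file's size and the keys seen in its first few records.

import glob
import gzip
import json
import os

# Assumed location and naming of the partitioned dataset
for path in glob.glob('data/*.jsonl.gz'):
    size_gb = os.path.getsize(path) / (1024 ** 3)
    print('{}: {:.2f} GB'.format(path, size_gb))

    # Peek at the first few records to sketch a schema
    with gzip.open(path, 'rt') as jsonl_file:
        for i, line in enumerate(jsonl_file):
            if i >= 3:
                break
            print(sorted(json.loads(line).keys()))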

Read process/Working with jsonl: 

JSON is one of the most common human-readable, language-independent data interchange formats. However, not every JSON-like file is technically JSON. JSON Lines (jsonl), newline-delimited JSON (ndjson) and line-delimited JSON (ldjson) are popular, closely related formats in which each line of a file is a valid JSON string representing an object. For this project, our data was in JSONL format, which is very convenient for streaming data and processing log files, and many tools and platforms support it out of the box. Since each line is a self-contained JSON document, parsing and appending new data is quite straightforward.

{"id": 100, "name": "Ebru", "surname": "Cucen"}
{"id": 101, "name": "Fahran", "surname": "Wallace"}

Using Python, we can read a gzipped JSONL file with the gzip module's open method:

import gzip
import json

with gzip.open(jsonl_file_name, 'rt') as jsonl_file:
    for json_line in jsonl_file:
        # Skip blank lines, then parse each line as its own JSON document
        if not json_line.strip():
            continue
        customer = json.loads(json_line)

Write process/Working with csv:

Deciding File Types, Formats and Schema

The graph database ingestion process using the neo4j-admin import tool requires nodes and relationships to be described in separate files. For convenience, you can also have separate files describing the column headers.

It is a good idea to have a dictionary of the names of the files and headers – here’s some pseudocode:

{
    'customers': {
        'customer': {
            'header_filename': os.path.join(output_dir, 'customers_header.csv'),
            'data_filename': os.path.join(output_dir, 'customers_###.csv.gz'),
            'json_columns': ['id', 'name', 'surname'],
            'graph_columns': ['id:ID(Customer-ID)', 'name', 'surname']
        }
    }
}

The header information we defined as graph_columns would be written into header_filename as a csv file, and the data (about our Customer nodes in this scenario) would be written to data_filename as a compressed csv file. 

Neo4j supports using an ID column to validate the uniqueness of the data within the files, which is great for validating data quality, especially as we’re doing this ingestion without the aid of transactions that could check database constraints.
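
To make the write side concrete, here is a minimal sketch of how those two files might be produced with Python's csv and gzip modules. The files_dict and customers names are illustrative assumptions based on the pseudocode above, not the project's actual code.

import csv
import gzip

# 'files_dict' is the pseudocode dictionary above; 'customers' is a list of parsed JSON records
info = files_dict['customers']['customer']

# Header file: a single row holding the Neo4j column names, including the :ID column
with open(info['header_filename'], 'w', newline='') as header_file:
    csv.writer(header_file).writerow(info['graph_columns'])

# Data file: one compressed CSV row per customer, columns in json_columns order
# (the '###' placeholder in data_filename stands in for the output partition number)
with gzip.open(info['data_filename'], 'wt', newline='') as data_file:
    customers_writer = csv.writer(data_file)
    for customer in customers:
        customers_writer.writerow([customer.get(column) for column in info['json_columns']])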

Checking the output files

The neo4j-admin import tool can operate on compressed files, so we'll take advantage of this to save some disk space and processing time.

For each file, it is a good idea to check its integrity and record count with a couple of simple commands. You can run them against multiple files at once using a shell glob:

# Check gzip file integrity, with verbose output
gzip -t -v *.gzip
# Count the lines in a compressed file, using zgrep
zgrep -c ^ *.gzip

Parallelising the transformation process

To reduce the time taken by the transformation process, you can parallelise at the language level, and at the task level using a workflow orchestrator. Python's multiprocessing library lets us run a function across a pool of worker processes:

from multiprocessing import Pool
import gzip

for jsonl_file_name in files:
    print("Processing file: {}".format(jsonl_file_name))
    with gzip.open(jsonl_file_name, 'rt') as jsonl_file:
        customers_jsonl = jsonl_file.readlines()
    # Parse each line into a customer record across 12 worker processes
    with Pool(12) as p:
        customers_jsons = p.map(get_customers, customers_jsonl)
    # Write the parsed records out as CSV rows
    for customer in customers_jsons:
        customers_writer.writerow(customer)

A good tip here – as for any development practice, but in particular for long-running batch processes like this – is to log enough to follow what is going right and wrong with the process. We found these useful (a sketch follows the list):

  • Progress update – possibly every 10,000 records
  • A dedicated dead-letter file containing only errant records
  • The error message, plus the line number of the offending record in the input file
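
Here is one way that could look; this is a minimal sketch, and the dead-letter file name and the jsonl_file_name variable carried over from the earlier snippet are illustrative rather than the project's actual code.

import csv
import gzip
import json

with gzip.open(jsonl_file_name, 'rt') as jsonl_file, \
        open('customers_dead_letter.csv', 'w', newline='') as dead_letter_file:
    dead_letter_writer = csv.writer(dead_letter_file)
    for line_number, json_line in enumerate(jsonl_file, start=1):
        # Progress update roughly every 10,000 records
        if line_number % 10000 == 0:
            print("Processed {} lines of {}".format(line_number, jsonl_file_name))
        try:
            customer = json.loads(json_line)
        except ValueError as error:
            # Dead-letter file: the line number, the error, and the errant record itself
            dead_letter_writer.writerow([line_number, str(error), json_line.strip()])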

To further embrace the possibilities of parallelism, we could divide the load between multiple machines with tools like Apache Spark, Apache Beam or Apache Airflow. Increasing the number and the spec of the VMs reduces the time taken to finish the transformations. Maximising the number of vCPUs, combined with fast disks, would probably give the best performance here, as this operation is very I/O- and CPU-heavy, with very little need for memory. In virtualised environments like the cloud, it is simple to scale the machine up before running the task – just remember to scale it down afterwards to minimise costs.
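
As an illustration of the distributed option, here is a minimal PySpark sketch that reads gzipped JSONL and writes compressed CSV in one pass. The bucket paths and column list are assumptions for the example, and the separate Neo4j header files described above would still be written alongside the output.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonl-to-csv").getOrCreate()

# Spark reads gzipped JSONL natively and spreads the work across the cluster
customers = spark.read.json("gs://example-bucket/customers/*.jsonl.gz")

# Keep only the columns the graph needs; the Neo4j column names live in the
# separate header file, so the data files are written without a header row
(customers
    .select("id", "name", "surname")
    .write
    .option("compression", "gzip")
    .csv("gs://example-bucket/csv/customers/"))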

This is the end of our second post in the series on working with graph data. Check out the next post in the series, which covers recommendations for loading this data into Neo4j.

 

