Processing a list of files in Node.js


Do you frequently find yourself processing a list of files: reading their content, transforming it, and producing results? In this post I will show you how you can do that in Node.js. Keep reading!

Introduction

Suppose you have a directory with files inside and you want to produce a list with each file name and the MD5 hash of its content.

Of course, you can just do:

$ md5sum ebooks/*
cf7fbf82c674a931cc1b247c4f56ccbf ebooks/Alice's Adventures in Wonderland.epub
cb4ebc8d1144742cedda66a6e4450180 ebooks/A Tale of Two Cities.epub
e2cc75ea0083ea30efe4e59a370e8646 ebooks/Calculus Made Easy.pdf
ab91a082bb798b6fc3988e9b66ae5342 ebooks/Don Quixote.epub
...

But if you generalize the problem, so that instead of generating an MD5 you want to process the content of each file in some arbitrary way, you can do it in Node.js.

Simple and synchronous approach

With a procedural strategy, you follow an algorithm like the following:

  • List all the files in the directory.
  • Read the content of a file into memory.
  • Generate the MD5 of the content.
  • Print the file name and the MD5 value.
  • Repeat for every file.

The first approach is to use the synchronous version of the fs module.

const fs = require("fs");
const crypto = require("crypto");

// Read directory path from command line
let dir = process.argv[2];
// Read file names from directory
let files = fs
  .readdirSync(dir, { withFileTypes: true })
  .filter((dirent) => dirent.isFile())
  .map((dirent) => dirent.name);

// Iterate over files
files.forEach((file) => {
  // Read file contents
  let contents = fs.readFileSync(`${dir}/${file}`);
  // Calculate MD5 hash
  let hash = crypto.createHash("md5").update(contents).digest("hex");
  // Print file name and hash
  console.log(`${hash} ${file}`);
});
$ time node sync_md5.js ebooks
cf7fbf82c674a931cc1b247c4f56ccbf Alice's Adventures in Wonderland.epub
cb4ebc8d1144742cedda66a6e4450180 A Tale of Two Cities.epub
2dd688cfe25fb9e4b8e62b23e244fc9e Siddhartha.epub
860ba3c323a782025de1a383dfd417dc The Adventures of Sherlock Holmes.epub
867342deb4434c18e69fbf75a9197e67 The Prophet.epub
3e5b7d2c410fa0c1f51525255a2f6ad8 The Confessions of St. Augustine.epub
...
real    2m7.756s
user    0m15.496s
sys    0m8.914s

You may be thinking: can we do better than that? Well, yes. One way is to use concurrency.

Concurrent reading and processing

When you tackle the same problem using concurrency, you can read the content of several files in parallel and process each one as soon as it arrives.

const fs = require("fs");
const crypto = require("crypto");

// Read directory path from command line
let dir = process.argv[2];
// Read file names from directory
let files = fs
  .readdirSync(dir, { withFileTypes: true })
  .filter((dirent) => dirent.isFile())
  .map((dirent) => dirent.name);

// Iterate over files
files.forEach((file) => {
  // Read file contents asynchronously
  fs.readFile(`${dir}/${file}`, (err, data) => {
    if (err) throw err;
    // Calculate MD5 hash
    let hash = crypto.createHash("md5").update(data).digest("hex");
    // Print file name and hash
    console.log(`${hash} ${file}`);
  });
});
$ time node concurrent_md5.js ebooks
cf7fbf82c674a931cc1b247c4f56ccbf Alice's Adventures in Wonderland.epub
cb4ebc8d1144742cedda66a6e4450180 A Tale of Two Cities.epub
6a107b181afdf4630beaf7a897ac115b Great Expectations.epub
2dd688cfe25fb9e4b8e62b23e244fc9e Siddhartha.epub
...

If you attempt to open all the files at the same time you will get an error, since you will exceed the maximum number of open files allowed by the system. It is probably not even faster, because it depends on how much data you can read from disk at once.
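To see how low that ceiling can be, you can check the per-process open-file limit on a Unix-like system (the exact value varies by OS and configuration):

```shell
# Print the maximum number of file descriptors this shell
# (and its child processes) may hold open at once.
ulimit -n
```

On many Linux distributions this prints a value like 1024, which a directory with a few thousand files would easily exceed.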

So, you can limit the number of files you open at the same time.

Limiting the number of concurrent processes

How can we limit the number of concurrent operations? One way is to use a queue, which controls how many tasks run at the same time.

const fs = require("fs");
const crypto = require("crypto");

// Read directory path from command line
let dir = process.argv[2];
let limit = parseInt(process.argv[3], 10) || 10;

// Read file names from directory
let files = fs
  .readdirSync(dir, { withFileTypes: true })
  .filter((dirent) => dirent.isFile())
  .map((dirent) => dirent.name);

// class for limiting the number of concurrent operations
class TaskQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;
    this.queue = [];
  }

  // add a task to the queue
  pushTask(task) {
    this.queue.push(task);
    this.next();
  }

  // run the next task if there is one and we are under the concurrency limit
  next() {
    while (this.running < this.concurrency && this.queue.length) {
      let task = this.queue.shift();
      task(() => {
        this.running--;
        this.next();
      });
      this.running++;
    }
  }
}

let queue = new TaskQueue(limit);

// Iterate over files
files.forEach((file) => {
  queue.pushTask((done) => {
    // Read file contents asynchronously
    fs.readFile(`${dir}/${file}`, (err, data) => {
      if (err) throw err;
      // Calculate MD5 hash
      let hash = crypto.createHash("md5").update(data).digest("hex");
      // Print file name and hash
      console.log(`${hash} ${file}`);
      done();
    });
  });
});
$ time node concurrent_limited_md5.js ebooks 10
cf7fbf82c674a931cc1b247c4f56ccbf Alice's Adventures in Wonderland.epub
cb4ebc8d1144742cedda66a6e4450180 A Tale of Two Cities.epub
6a107b181afdf4630beaf7a897ac115b Great Expectations.epub
2dd688cfe25fb9e4b8e62b23e244fc9e Siddhartha.epub
901804c8396ed18e5cee6080c4ffb8cf Moby Dick.epub
...
real    1m50.926s
user    0m15.341s
sys    0m7.244s

In this case, with 10 concurrent read-and-process operations, the run is approximately 13% faster than the synchronous approach (1m51s versus 2m8s).
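If you prefer promises over callbacks, the same limiting idea can be sketched with async/await and a small pool of workers. This is an alternative sketch, not part of the original scripts; `runLimited` is a helper name introduced here for illustration:

```javascript
// Hypothetical helper: run async task factories with at most
// `limit` of them in flight at any time. `tasks` is an array of
// functions that each return a promise.
async function runLimited(tasks, limit) {
  const results = [];
  let next = 0;
  // Each worker pulls the next unclaimed task until none remain.
  // JavaScript is single-threaded, so `next++` needs no locking.
  async function worker() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  // Start `limit` workers; Promise.all resolves once the queue drains.
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```

With this helper, the MD5 example could queue one task per file, each calling `fs.promises.readFile` and hashing the result, instead of managing the callback-based TaskQueue by hand.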

 

Conclusions

We saw different ways to implement this task more efficiently in terms of time, while being careful not to consume all the system resources at once.

Depending on the problem you are trying to solve, concurrency can be a great tool, but you need to be aware of the issues that can arise when using it and how to mitigate them.
