Rethinking Dataiku for Windows Users: An Engineering Deep Dive

Use Cases & Projects, Dataiku Product, Tech Blog Alex Bitar

As a former Microsoft software engineer, I thought my Windows days were over when I started to work at Dataiku. Yet, fate caught up to me when I was asked to leverage my Windows experience to help port Dataiku to Windows. In this post, I will run you through the process that enabled Dataiku’s engineering team to port Dataiku’s free edition and its Launcher to the Windows world. An exciting and challenging journey!

Project Genesis

The project initially started as a Hackaiku project (Hackaiku events are hackathons where Dataikers get to work on projects of their own that they deem valuable for the company), based around the following question: Can we run Dataiku on Windows? After some time, an intern took over the project and demonstrated with a proof of concept that Dataiku on Windows was possible. Unfortunately, after that intern left the company, other priorities arose and the project was paused indefinitely.

When I joined Dataiku, coming from Microsoft, I was asked if I would be interested in reviving the project and carrying on the work to bring Dataiku to Windows, which I eagerly accepted. I saw a great opportunity to understand Dataiku’s design and internals while working on a unique project, one that also allowed me to share my existing knowledge of Windows with the rest of the team.

Why Windows?

The free version of Dataiku was already available for experimenting on macOS; you could install an instance of the application locally through the Dataiku Launcher, but there was no equivalent experience available for Windows machines. The only option users had to run a Dataiku instance locally on Windows was via a virtual machine, which is not an easy setup, since it would most likely mean installing Dataiku on a Linux VM with no out-of-the-box Launcher installation.

We saw value in bringing the free edition of Dataiku to the Windows ecosystem, as we felt that the virtual machine experience was a big turn-off for most Windows users.

Unfortunately, Dataiku was not originally designed or built to run natively on a non-Unix operating system, meaning that some work was going to be required to create a version for Windows. Since we were aiming for a smooth user experience during setup, we made the important decision not to use Windows Subsystem for Linux (WSL), even though it would likely have reduced the amount of work required for the migration. At the time of writing, WSL integration into Windows is more advanced than it was when this project originally started. At the time of development, however, it would have caused additional complications that directly affected the user experience during installation.

Overview of Dataiku and the Launcher

To understand what is required to run Dataiku on Windows, one must first understand its architecture and internal components. The following diagram gives a rough overview of how Dataiku components interact with each other. In reality, it is much more complex than shown but the following is enough to demonstrate the problems we encountered and had to solve.

[Figure: Overview of Dataiku and the Launcher]

In summary:

  • The backend can spawn processes that run code in either Java, Python, or R, whether it is for a notebook, a data recipe, or an ML operation.
  • Most of these processes are spawned through scripts.
  • Jupyter is used for notebooks.
  • Supervisor is used to control Dataiku’s main processes.
  • Depending on the scenario, Dataiku inter-process communication happens either over network sockets or via manipulating files saved on disk.
  • The frontend is built with AngularJS and Angular.
  • The Python virtual environment contains the Python dependencies used to run the built-in code.
  • The Dataiku Launcher manages the lifecycle of a Dataiku instance, its updates, and the installation of its dependencies. The dependencies required are as follows:
    • The Java runtime
    • The Python runtime
    • R (optional)

System Differences

The Dataiku backend is written in both Java and Python, two multi-platform languages. However, the Dataiku code itself was designed specifically to run on UNIX-like OSs; its implementation was not cross-platform, because it relied on Unix file system specifics and on bash scripts.

The bash scripts had to be rewritten in PowerShell, which required fully understanding the purpose of each script. PowerShell was chosen over batch, as it provides multiple built-in cmdlets and useful tools.

The Python and Java code, however, had to be modified to accommodate the following Windows specifics:

File Systems

*Some old Windows programs do not understand forward slashes.

**Most Linux distributions define a PATH_MAX of 4096 but most BSDs will have a 1024 limit. This can also depend on the underlying file system.

A key implementation detail of Dataiku is that it uses the file system as a database and stores its objects within its working directory. As a result, the following problems arose on the Windows file system:

The directory and file names generated were too long - By default, Windows limits the length of paths to 260 characters (MAX_PATH). This limit is easily reached when Dataiku creates new projects and jobs, because the folder structure we create incorporates the names of projects and datasets.

A first attempt to work around this limitation was to create junctions so that the code could manipulate shorter paths.*** Unfortunately, this approach did not scale across our code base, as it would have required changes in many different places.

Luckily, Windows has a global flag to enable support of long paths. By default, this flag is not enabled and requires setting a registry key to activate it. We solved this at the Launcher level; if the user agrees to grant the application administrator rights, we enable long path support by updating the registry.
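As a rough illustration of the two ideas above, here is a minimal Python sketch. It is not the Launcher's actual code. It assumes the flag in question is the `LongPathsEnabled` registry value that Microsoft documents under `HKLM\SYSTEM\CurrentControlSet\Control\FileSystem`, and the function names are hypothetical. The length check is deliberately simplified: it only compares the full path against the 260-character limit.

```python
import sys

# Legacy Windows limit; it includes the terminating NUL character.
MAX_PATH = 260


def exceeds_legacy_limit(path: str) -> bool:
    """Return True if `path` is too long for the default 260-character limit.

    Simplified check: one character is reserved for the terminating NUL.
    """
    return len(path) + 1 > MAX_PATH


def long_paths_enabled():
    """Read the system-wide long path flag.

    Returns None on non-Windows platforms, where the flag does not exist.
    Assumption: the flag is the documented LongPathsEnabled registry value.
    """
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows

    try:
        with winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE,
            r"SYSTEM\CurrentControlSet\Control\FileSystem",
        ) as key:
            value, _value_type = winreg.QueryValueEx(key, "LongPathsEnabled")
    except OSError:
        # Key or value missing: long paths are not enabled.
        return False
    return value == 1
```

Reading the value does not require elevation, whereas writing it under `HKEY_LOCAL_MACHINE` does, which is consistent with the Launcher asking for administrator rights before updating the registry.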

Windows forbids certain names or characters in its filenames - By default, Dataiku itself does not use these restricted characters or filenames, but a user could try to create projects or datasets whose names contain something problematic. To work around this, we added checks that prevent the creation of projects and datasets whose names would be forbidden on Windows.
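A check of this kind could look like the following Python sketch. It is an illustration rather than Dataiku's actual validation code: the function name is hypothetical, and the rules encoded are the ones Microsoft documents for file names (reserved device names, reserved characters, control characters, and names ending with a space or a period).

```python
import re

# Device names reserved by Windows, with or without an extension.
RESERVED_NAMES = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}

# Reserved characters plus ASCII control characters (0 to 31).
FORBIDDEN_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def is_valid_windows_name(name: str) -> bool:
    """Return True if `name` can be used as a file or folder name on Windows."""
    if not name or name.endswith((" ", ".")):
        return False
    if FORBIDDEN_CHARS.search(name):
        return False
    # "NUL.txt" is as problematic as "NUL": compare the part before the first dot.
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    return stem not in RESERVED_NAMES
```

For example, `is_valid_windows_name("sales_2023")` is accepted, while `"CON"`, `"nul.txt"`, and `"a:b"` are rejected.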

***On Windows, a junction is a pointer to another directory on the same volume.

File Access

In some scenarios, a job running inside its own process will output its result to a file. Let us call this file computed_data.json. In this example, the job can be any ML operation or Python recipe. Whilst this job is running, another process will poll and read the file to track progress. To ensure that the tracking process will read a complete file, the writer will write to a temporary file and then atomically replace computed_data.json with this new file when the write is finished.

On Linux, the os.replace and os.rename Python functions are atomic and will let you replace a file even if it is currently being read. On Windows, os.rename is not atomic and will fail if the destination already exists. os.replace does provide atomicity on Windows, but it will fail if the destination file is currently in use. After investigating this issue, we found a quick fix that worked for us: simply retry the os.replace call until it succeeds.
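The write-then-replace pattern with retries can be sketched as follows. This is a minimal illustration, not Dataiku's implementation: the function name, the bounded number of retries, and the delay are assumptions added here so that the loop cannot spin forever, whereas the fix described above simply retries until success.

```python
import json
import os
import tempfile
import time


def write_json_atomically(path, payload, retries=50, delay=0.1):
    """Write `payload` to a temporary file, then swap it into place.

    On Windows, os.replace raises PermissionError while another process
    holds the destination open, so the call is retried a few times.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the replace stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        for attempt in range(retries):
            try:
                os.replace(tmp_path, path)
                return
            except PermissionError:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)
    finally:
        # Clean up the temporary file if the replace never happened.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

A job would call `write_json_atomically("computed_data.json", {"progress": 0.5})` each time it has progress to report, and the polling process would always read either the previous complete file or the new one.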

External Dependencies

We were aiming for a smooth out-of-the-box experience that would not mess with the users’ machines and would be isolated from any pre-existing environments or runtimes. We investigated how Python and R could be installed without relying on their .exe installer files and without requiring administrator privileges:

Python

As shown above, some of Dataiku’s internal logic is written in Python. Dataiku kits embed a Python virtual environment that contains the required dependencies. Some of these dependencies include platform-specific binaries that are compiled against a specific version of Python and are only compatible with that version, which is why Dataiku needs to be installed with a specific version of Python. Python for Windows is already published on nuget.org, but not every version is available there, and the one we were aiming to use was missing. To solve this, we built the version we needed from source and packaged it ourselves.

R

The official documentation states that an R installation is relocatable, hence we could easily repackage it. The catch, however, is that on Windows R requires a set of tools called RTools. These tools are necessary, for example, when you install packages in your R environment that require compilation. When Dataiku runs the R binary, the BINPREF environment variable must be set for R to be able to use the tools. Thankfully, RTools itself is also relocatable. We decided to package R and RTools together, and we created an intermediate script that sets the environment variables before running R.
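The idea of such an intermediate script can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not say which language the real script uses, the function names and directory layout are hypothetical, and the BINPREF format (forward slashes with a trailing slash) is assumed from how R uses the variable as a prefix for the compiler binaries.

```python
import os
import subprocess


def build_r_env(toolchain_bin, base_env=None):
    """Return a copy of the environment with RTools made visible to R.

    `toolchain_bin` is the (hypothetical) directory of the packaged RTools
    compiler binaries.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: BINPREF is used as a prefix, so it needs forward slashes
    # and a trailing slash.
    env["BINPREF"] = toolchain_bin.replace("\\", "/").rstrip("/") + "/"
    # Also expose the tools on PATH for anything that looks them up by name.
    env["PATH"] = toolchain_bin + os.pathsep + env.get("PATH", "")
    return env


def run_r(r_executable, args, toolchain_bin):
    """Run the packaged R binary with the prepared environment."""
    return subprocess.run(
        [r_executable, *args],
        env=build_r_env(toolchain_bin),
        check=True,
    )
```

Building the environment in a separate function keeps the variables scoped to the R child process, so nothing is changed on the user's machine.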

Supervisor

A Windows-specific fork of Supervisor was incorporated to replace the standard Linux-only build.

Build & Release Process

Now that we build and ship our application for an additional platform, the question is: how does this impact our build and release process?

Dataiku artifacts

As shown in the table, Dataiku distribution kits are packaged with the following artifacts on our Linux build agents:

  • Compiled Java code (JAR)
  • Python scripts
  • PowerShell or Bash scripts
  • Binaries of third-party dependencies (nginx, Graphviz)
  • Python virtual environment

All platform-specific artifacts are pre-built once and can be reused for multiple kit distributions. Thus, a Windows-specific kit can be created on the same machine as the Linux and macOS kits, as only Java requires compilation and JAR files are cross-platform.

Dataiku Launcher

For an application to be easily installed, it often must be signed to prove the identity of its author. The Dataiku Launcher is an Electron application that we already ship for Mac, and the Mac release can only be signed on a Mac, a constraint imposed by Apple. The Windows .exe target, however, does not need to be built and signed on a Windows machine, so we decided to build both the Mac and Windows releases on the same Mac machine.

It is worth mentioning that, for the Electron application installer to pass Windows Defender SmartScreen, you have two options:

  • Sign your application with an EV certificate.
  • Sign your application with a certificate that has built a reputation.

If your certificate does not have enough reputation yet, you can still prevent Windows Defender SmartScreen from blocking your application by submitting it for malware analysis with Microsoft.

Takeaways for Developers

A project like this can create a lot of space for professional growth as a developer; someone more familiar with Unix is forced to embrace the paradigms of Windows, and vice versa. Even something as commonplace as scripting, in Linux with bash and in Windows with PowerShell or CMD, follows different patterns. It also lets you develop a deeper understanding of how the application you are porting truly functions internally, which proves very helpful when extending or troubleshooting it in the future.

There is not a lot we would change about this project in retrospect, but if we were to redo it today, following the release of WSL2, we would likely give WSL more serious consideration. At the time of writing, WSL2 offers full Linux system call compatibility, meaning we would not need to maintain any Windows-specific system code. Overall, we consider that the project went well.

Besides the technical aspects, this project has been exciting for us as a company as it has potential for high impact, given the large new user base we can now tap into. From a personal standpoint, it was cool to work on, as I had the opportunity to harness my previous experience to hit the ground running soon after joining Dataiku, while also getting familiar with Dataiku internals and rekindling my experience within the world of Unix.

See for yourself, try Dataiku on Windows here
