Image by Felix Mittermeier
Spreadsheet applications such as Microsoft Excel and LibreOffice Calc are powerful tools for working with data. They offer pivot tables for exploration, solvers for optimization problems, and connectors to pull data from almost any database. Needless to say, a large share of today's data analytics is done in a spreadsheet application.
Our client, a renowned research institute, collects data from many third-party businesses and analyzes their environmental impact. Until now, the institute has sent a spreadsheet questionnaire to each business by mail; each business fills in the values and sends it back. The analysis is then done in a series of spreadsheets as well.
When scaling up this approach, the researchers ran into problems with data collection (a lot of back-and-forth emailing because of missing or invalid values) and with in-house collaboration: slightly different formulas in each project make results hard to compare, and sharing or reusing data is cumbersome because each collection spreadsheet has its own schema. Consolidating the analysis logic from many spreadsheets into a single one is also a laborious, error-prone task.
Facing these problems, our client decided to implement a more capable system, and this is where we came into play.
The main challenges were:
1. Storing heterogeneous data from many projects in a single, flexible data model that allows sharing and reuse.
2. Making the analysis logic reusable and reproducible instead of rebuilding it in every project's spreadsheets.
3. Collecting the data from the businesses without endless back and forth over missing or invalid values.
Each of the three challenges is worth a separate blog entry. Today I focus on the second one, but I will also say a few words about the others.
We addressed the first challenge with the entity-attribute-value model, a so-called SQL antipattern. To enable data reuse between projects, we introduced a global catalog of variables. A variable definition includes the data type (string, number, date, …), the physical unit and, for numbers, minimum/maximum values. If necessary, new variables can be defined quickly.
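For illustration, a minimal sketch of the idea with invented names (not our actual schema): every reported value becomes a single entity-attribute-value row that references a variable from the global catalog.

```python
# Invented example of the variable catalog and the entity-attribute-value rows.
catalog = {
    "electricity_consumption": {"type": "number", "unit": "kWh", "min": 0, "max": None},
    "reporting_year":          {"type": "number", "unit": None,  "min": 1990, "max": 2100},
}

values = [
    # (entity,        attribute,                 value)
    ("business_0042", "electricity_consumption", 18500),
    ("business_0042", "reporting_year",          2018),
]
```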
The analysis (the second challenge) no longer lives in spreadsheets but is written in Python code stored in Git repositories. These so-called analysis modules enable code reuse between projects. To ensure reproducibility (we want to run the exact same code every time), analysis modules are referenced by their Git commit. A project can utilize multiple analysis modules, which lets the researchers build a versatile toolbox of modules (e.g. data preparation, calculation, allocation to key performance indicators). To get meaningful results, we need to know in which order the modules have to be executed. For this, we create a schedule from a dependency graph based on the input and output variables of each module (see next section). The dependency graph is also crucial for creating the questionnaire in the third challenge.
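As a rough sketch of the commit pinning (the structure and URLs below are invented, not our actual project configuration), a project could reference its modules like this:

```python
# Each analysis module is referenced by repository URL plus an exact Git commit,
# so re-running the project later executes identical code. All values are invented.
project_modules = [
    {"name": "prepare",   "repo": "https://git.example.org/modules/data-preparation.git",
     "commit": "3f9c2ab"},
    {"name": "calculate", "repo": "https://git.example.org/modules/calculation.git",
     "commit": "b71e0d4"},
]
```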
For the third challenge, we built a web app that automatically creates online questionnaires based on the variables required by the project.
The app consists of a frontend, written in Angular, and a REST-style API backend, written in C# (ASP.NET Core 2.1). In addition, we operate an Elasticsearch instance to provide fuzzy search over the variables, and we implemented an endpoint to fetch data for ad-hoc analysis and module development in Jupyter notebooks. What you cannot see in the following diagram is that the backend is further split into the API and an analytics engine, which communicate only via the database. This not only keeps the spaghetti factor low, it also allows us to scale the analytics engine horizontally if needed.
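To make the decoupling concrete, here is a minimal sketch of the idea in Python (the real engine is written in C#, and the jobs table and its columns are invented for illustration): the API only writes a job row, and the analytics engine only polls the database.

```python
import sqlite3

# The shared database is the only communication channel between API and engine.
db = sqlite3.connect("jobs.db")
db.execute("CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY, status TEXT, result TEXT)")

def api_enqueue_run():
    """Called by the API when a user triggers an analysis run."""
    db.execute("INSERT INTO jobs (status) VALUES ('pending')")
    db.commit()

def engine_poll_once():
    """One polling step of the analytics engine; a real implementation would
    also claim jobs atomically so several engine instances can run in parallel."""
    row = db.execute("SELECT id FROM jobs WHERE status = 'pending' LIMIT 1").fetchone()
    if row:
        db.execute("UPDATE jobs SET status = 'done', result = ? WHERE id = ?", ("ok", row[0]))
        db.commit()
```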
In the rest of this blog post, I'll focus on the dependency analysis and execution of the modules. Let me know if you are interested in another post about other parts of the system.
Consider a project that utilizes the four modules A, B, C and D as seen in the diagram (way) below. Our goal is to know two things:
1. In which order do the modules have to be executed?
2. Which variables are not produced by any module and therefore have to be collected with the questionnaire?
The whole process is summarized in this diagram and explained in the following sections:
Let's start by detecting the input and output variables in the Python code. We require that an analysis module follows one rule: inputs are attributes on an object called inp, and outputs are attributes on an object called out. Example:
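(A sketch with invented variable names; the inp and out objects are provided by the execution environment.)

```python
# Hypothetical analysis module: read inputs from "inp", assign results to "out".
out.total_emissions = inp.electricity_consumption * inp.emission_factor
out.emissions_per_employee = out.total_emissions / inp.number_of_employees
```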
Because the modules are stored in Git repositories, we need to git clone each module and walk Python's abstract syntax tree of the module code. For example using this:
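(A simplified sketch using Python's built-in ast module; our production code does a bit more, but the idea is the same.)

```python
import ast

def detect_variables(source: str):
    """Return (inputs, outputs): attribute names read from 'inp' and assigned on 'out'."""
    tree = ast.parse(source)
    inputs, outputs = set(), set()
    for node in ast.walk(tree):
        # Attribute accesses such as inp.something (read) or out.something = ... (write).
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if node.value.id == "inp":
                inputs.add(node.attr)
            elif node.value.id == "out" and isinstance(node.ctx, ast.Store):
                outputs.add(node.attr)
    return sorted(inputs), sorted(outputs)
```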
Given the module code (here a hypothetical example with invented variable names)
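```python
# Hypothetical module (invented variable names), passed as source to detect_variables above.
energy = inp.electricity_consumption + inp.heating_consumption
out.total_energy = energy
out.energy_per_employee = energy / inp.number_of_employees
```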
it outputs
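```python
(['electricity_consumption', 'heating_consumption', 'number_of_employees'],
 ['energy_per_employee', 'total_energy'])
```

The first list tells us which variables the module needs, the second which ones it provides.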
Then, knowing all inputs and outputs of every module, we can build a dependency graph and find the answers to the questions above:
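A simplified sketch of this last step, assuming invented inputs and outputs for the four modules (Python 3.9's graphlib handles the topological sorting here; the real implementation differs in detail):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Invented inputs/outputs of the four modules in the example project.
modules = {
    "A": {"inputs": {"x", "y"}, "outputs": {"a"}},
    "B": {"inputs": {"a"},      "outputs": {"b"}},
    "C": {"inputs": {"a", "z"}, "outputs": {"c"}},
    "D": {"inputs": {"b", "c"}, "outputs": {"d"}},
}

# A module depends on every module that produces one of its inputs.
producers = {var: name for name, m in modules.items() for var in m["outputs"]}
dependencies = {
    name: {producers[v] for v in m["inputs"] if v in producers}
    for name, m in modules.items()
}

# 1) The order in which the modules have to be executed.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)     # e.g. ['A', 'B', 'C', 'D']

# 2) Inputs that no module produces: these must be collected via the questionnaire.
questionnaire_vars = {
    v for m in modules.values() for v in m["inputs"] if v not in producers
}
print(questionnaire_vars)  # {'x', 'y', 'z'} (in some order)
```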
👉 This post is also featured on digital.ebp.ch.