24. August 2021

RISC Software: Data Engineering – the Solid Basis of Effective Data Use

Data engineering integrates data from a wide variety of sources and enables its effective use. This makes it essential for effective data science, machine learning and artificial intelligence, especially in the field of big data.

In recent years, extracting information from large amounts of data has become significantly more important for a growing number of companies across many business sectors. Examples include historical sales data, which can be used to optimize the range of products offered by online shops, and sensor data from a production line, which can help to increase product quality or to replace machine parts in good time as part of preventive maintenance. Beyond using an integrated database as an integral part of day-to-day operations, the much-discussed topics of artificial intelligence (AI) and machine learning (ML), with their promise of continually optimizing production processes, for instance, are a powerful incentive.

However, looking at the process of information gathering as a whole it quickly becomes clear that AI and ML are only the tip of the proverbial iceberg. For the steps of model training and model validation in particular these methods require large amounts of consistent and complete data. Such data volumes can be generated by sensor networks or sensors in production, for example. Receiving, storing and processing this data so that it can be used effectively is the central task of data engineering.

Whether the objective is standardized and effective company reporting, data science to improve the production process or AI is immaterial. A solid database is necessary in every case. Integration of data into a shared database can also serve as a reliable ground truth for many different applications within a company: for effective day-to-day business, strategic planning based on trustworthy data and facts, or for training models for AI systems.

Scope

The overall objective is thus to increase the quality and usability of the available data. Data engineering therefore corresponds closely to the data science hierarchy of needs [1], which describes the levels leading from raw data to AI. As in Maslow's hierarchy of needs [2], the lower levels of the pyramid are a prerequisite for the steps above them.

Data Science Hierarchy of Needs [1]

At the top of the pyramid sit the data science activities, resting on integrated and cleaned data that can be used, for example, to train ML models.

The blue levels represent the data engineering activities, which deal principally with moving/storing and transforming/exploring the data. The levels above them, covering AI, deep learning and ML, are the province of data scientists, while activities such as data labelling and data aggregation are crossover areas whose tasks can be performed by either data scientists or data engineers, depending on the precise nature of the task and the available personnel.

The data collection activities at the base of the pyramid fall only partially under the remit of data engineering, since data engineering generally receives the data at a defined interface – from files, external databases or a network protocol. One reason for this is that data engineering is part of computer science and software engineering and is therefore not usually involved in building and operating data-logging hardware such as sensors.

Data cleaning and integration

Once the raw data has been received, it is prepared in several steps until it can be stored, consistent and complete, in the data store:

  1. Data cleaning
  2. Data integration
  3. Data transformation

These steps are carried out sequentially, each building on the previous one. They can be performed either as stream processing, i.e. handling many small data packets one after the other, or as batch processing, i.e. handling the whole dataset at once. A data store of appropriate capacity – the data lake – allows the data to be kept in various formats while it is being processed.
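
As a minimal sketch of the difference between the two modes, the following Python fragment contrasts batch processing with stream processing; the file layout, field name and cleaning rule are illustrative assumptions.

    import csv

    def clean_record(record):
        # Illustrative cleaning rule: drop records without a sensor value
        return record if record.get("sensor_value") not in (None, "") else None

    # Batch processing: read and process the whole dataset at once
    def process_batch(path):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        return [r for r in (clean_record(row) for row in rows) if r is not None]

    # Stream processing: handle small data packets one after the other,
    # e.g. messages arriving from a queue or network protocol
    def process_stream(record_source):
        for record in record_source:
            cleaned = clean_record(record)
            if cleaned is not None:
                yield cleaned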

Data cleaning includes inspecting the imported data records to ensure they are complete and syntactically correct. In this step, data errors such as implausible sensor values can also be identified by applying predefined rules. Depending on the application at hand, there are several options when these criteria are violated:

  • Improve raw data quality: If better-quality raw data can be supplied it can replace the flawed data.
  • Discard data: Flawed data can be discarded if the dataset is intended for training purposes in ML and enough correct data is available.
  • Automatically correct errors during import: Errors can be rectified during data integration, for instance by importing the affected data from an additional source.

In practice the simplest solution is to discard any flawed data. However, if every single data point could be relevant for the planned evaluations in some way, flawed data must be corrected as far as possible. This could occur during quality assessment in production, for example, if the production data for a faulty workpiece is not correct owing to a sensor fault. The data can either be corrected manually by a domain expert or the correct data can be supplied at a later date.
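
A minimal sketch of such rule-based cleaning is shown below; the field names and plausibility ranges are illustrative assumptions and would in practice come from the domain experts described next.

    # Hypothetical plausibility rules for a sensor record; field names and
    # value ranges are illustrative assumptions, not taken from a real system.
    RULES = {
        "temperature_c": lambda v: -40.0 <= v <= 120.0,
        "pressure_bar": lambda v: 0.0 <= v <= 10.0,
    }
    REQUIRED_FIELDS = ("machine_id", "timestamp", "temperature_c", "pressure_bar")

    def validate(record):
        """Return a list of rule violations; an empty list means the record is clean."""
        errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
        for field, rule in RULES.items():
            if field in record and not rule(record[field]):
                errors.append(f"implausible value for {field}: {record[field]}")
        return errors

    # Depending on the application, flawed records are then discarded,
    # corrected from an additional source, or passed to a domain expert.
    record = {"machine_id": "M17", "timestamp": "2021-08-24T10:00:00",
              "temperature_c": 250.0, "pressure_bar": 1.2}
    print(validate(record))  # -> ['implausible value for temperature_c: 250.0']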

The involvement of domain experts is key here because they know both the criteria for judging the correctness of data, such as sensor values, and what needs to be done with flawed or incomplete data.

Data integration deals with the automated linking of data from various data sources. Depending on the application domain and the type of data, records can be linked using a number of different methods (a simple example follows the list), such as:

  • Unique identifiers, similar to foreign keys in the relational model
  • Geographic or temporal proximity
  • Domain-specific contexts such as sequences in production processes or production lines
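
The following sketch links two data sources by a shared identifier and temporal proximity using pandas; the table contents, column names and tolerance are illustrative assumptions.

    import pandas as pd

    measurements = pd.DataFrame({
        "machine_id": ["M17", "M17"],
        "timestamp": pd.to_datetime(["2021-08-24 10:00:01", "2021-08-24 10:00:07"]),
        "temperature_c": [71.3, 72.0],
    })
    quality_checks = pd.DataFrame({
        "machine_id": ["M17"],
        "timestamp": pd.to_datetime(["2021-08-24 10:00:05"]),
        "result": ["ok"],
    })

    # Join on the unique identifier and match the nearest timestamps
    # within a tolerance of a few seconds (temporal proximity).
    linked = pd.merge_asof(
        measurements.sort_values("timestamp"),
        quality_checks.sort_values("timestamp"),
        on="timestamp", by="machine_id",
        tolerance=pd.Timedelta("5s"), direction="nearest",
    )
    print(linked)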

Once the data cleaning and data integration steps have been completed, the data engineer can provide a dataset suitable for use by data scientists. The data transformation step mentioned above refers to the continuous adjustment of the data model to improve the performance of the data scientists' queries.
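
One simple form this can take is pre-aggregating frequently queried values into a derived view, as in the following sketch; the column names and the aggregation are illustrative assumptions.

    import pandas as pd

    def derive_daily_means(measurements: pd.DataFrame) -> pd.DataFrame:
        """Pre-aggregate raw measurements into a per-machine, per-day view.

        Assumes columns 'machine_id', 'timestamp' (datetime) and 'temperature_c'.
        The derived table answers a frequently run query ("average temperature
        per machine and day") without rescanning the raw data each time.
        """
        return (
            measurements
            .assign(day=measurements["timestamp"].dt.date)
            .groupby(["machine_id", "day"], as_index=False)["temperature_c"]
            .mean()
        )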

Data storage and data modelling for big data

The cleaned and integrated data can be stored in a suitable data repository. In applications for Industry 4.0, data is continuously generated by sensors, for instance, which often leads to accumulation of data volumes in the terabyte range in a matter of months. Data volumes of this size are often too large to handle with a classic relational database. Although there are commercially available scalable databases that use the relational model, their licences are so expensive that they are unsuitable for many implementation projects, especially for SMEs.

Horizontally scalable NoSQL systems are a viable alternative here. NoSQL stands for "Not only SQL" and describes data stores that use non-relational data models. Horizontal scalability means that these systems can be expanded by adding further hardware, in principle for unlimited amounts of data. Typical NoSQL systems use liberal licence models such as the Apache licence, meaning that they can be used for commercial purposes without licence fees. In addition, these systems place no special demands on the hardware used, which further reduces initial costs. NoSQL systems such as Apache Hadoop and related technologies therefore represent a low-cost way of storing and retrieving data volumes in the terabyte range.
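
As a sketch of how such a system can be fed, the fragment below writes cleaned data to a Hadoop-backed data lake as Parquet files partitioned by day, using Apache Spark; the paths, application name and column names are illustrative assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ingest-cleaned-data").getOrCreate()

    # Assumed landing zone containing cleaned, integrated records in JSON format
    cleaned = spark.read.json("hdfs:///landing/cleaned/")

    # Partitioning by day lets later queries skip irrelevant parts of the data
    (cleaned
     .withColumn("day", F.to_date("timestamp"))
     .write
     .mode("append")
     .partitionBy("day")
     .parquet("hdfs:///datalake/measurements/"))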

Especially in the field of big data, choosing a suitable NoSQL database and an appropriate data model is of central importance because both play a vital role in the performance of the system as a whole. For NoSQL systems, this applies to both data input and data retrieval.

The choice of technology to be used and of the data model design is determined by the demands made on the system:

  • What data volumes and data rates must be imported?
  • What queries and evaluations will the data be used for?
  • What are the requirements with regard to query performance? Is the system a real-time system?

One key question is whether the system is intended to support predefined queries only or – using SQL, for example – should allow flexible queries.

A key consideration when choosing the technology is whether the data will be accessed only via a known key or also queried by the values of other attributes. In the first case a system with the semantics of a distributed hash map, such as Apache HBase, is suitable; in the second, an in-memory analytics solution such as Apache Spark. If the primary purpose is analysing links between data, a graph database should be considered.
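
For the second case, a flexible query over attribute values could look like the following Spark SQL sketch; the data lake path and column names are illustrative assumptions, and in the first case the same lookup would simply be a get on a known row key.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ad-hoc-analytics").getOrCreate()

    measurements = spark.read.parquet("hdfs:///datalake/measurements/")
    measurements.createOrReplaceTempView("measurements")

    # Query by the values of arbitrary attributes, not only by a known key
    result = spark.sql("""
        SELECT machine_id, AVG(temperature_c) AS avg_temp
        FROM measurements
        WHERE day >= '2021-08-01'
        GROUP BY machine_id
    """)
    result.show()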

In a big data system the data is stored in denormalized form for performance reasons: all data relevant to a query result should be stored together, because join operations consume a great deal of resources and time. The type of queries planned is therefore crucial for the data model design; for example, the attributes that chiefly appear as query parameters should be used as key attributes. This is also why the data model often has to be extended when new queries are added so that they can be executed efficiently, and why data engineering tasks continue to arise even after the data has been imported.
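
A minimal sketch of such a key design for a wide-column store in the style of Apache HBase: if the planned queries always filter by machine and time range, a composite row key of machine ID and timestamp turns those queries into cheap range scans. The key layout is an illustrative assumption.

    def row_key(machine_id: str, timestamp_ms: int) -> bytes:
        # Fixed-width, padded fields keep the lexicographic order of the keys
        # identical to the chronological order per machine, so a query for one
        # machine and a time window becomes a contiguous range scan.
        return f"{machine_id:>8}#{timestamp_ms:013d}".encode("utf-8")

    start = row_key("M17", 1629792000000)  # scan start: machine M17, window begin
    stop = row_key("M17", 1629795600000)   # scan stop: machine M17, window end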

RISC Software GmbH has amassed over ten years of expertise in the field of open-source NoSQL databases, making the company a reliable partner for consultancy and implementation when it comes to introducing or expanding a solid database, regardless of the area of application.

References

[1] Blasch, Erik; Sung, James; Nguyen, Tao; Daniel, Chandra; Mason, Alisa (2019): Artificial Intelligence Strategies for National Security and Safety Standards.

[2] Maslow, Abraham (1943): A Theory of Human Motivation. Psychological Review, Vol. 50, No. 4, pp. 370–396.

Use cases for data engineering

  • Use case 1: In-company data integration

Data from various sources can be collated and used effectively in an integrated data model.

  • Use case 2: Data preparation for AI / ML

Data engineering methods can serve to make a large amount of consistent and complete data available for AI and ML training.

  • Use case 3: Transformation of the data model to improve understanding of the data

Data engineering can significantly increase understanding of the data by adapting the data model to make it better suited to the use case. One example of this might be the introduction of a graph database.

  • Use case 4: Improved (faster) data use

By adapting data storage and the data model, data engineering can help to considerably speed up interactive queries.

Image credits:

  • Photo: © istock/metamorworks
  • Graphics: © RISC Software GmbH, reproduction free of charge