== Compute ==

High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is
dataflow programming, in which the computation is represented as a
directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data. Popular implementations include
Apache Spark and the deep-learning-specific
TensorFlow. More recent implementations, such as
Differential/
Timely Dataflow, have used
incremental computing for much more efficient data processing.
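The idea behind dataflow programming can be illustrated with a minimal sketch in plain Python: the node names, the `run` helper, and the toy pipeline below are invented for illustration and do not correspond to any particular framework's API.

```python
# Minimal sketch of dataflow programming: the computation is a directed
# graph whose nodes are operations and whose edges carry data between them.
from graphlib import TopologicalSorter

# Each node maps a name to (operation, names of input nodes).
graph = {
    "source":  (lambda: [1, 2, 3, 4], []),
    "doubled": (lambda xs: [x * 2 for x in xs], ["source"]),
    "total":   (lambda xs: sum(xs), ["doubled"]),
}

def run(graph):
    # Execute nodes in dependency order, passing each result along its edges.
    order = TopologicalSorter({k: set(deps) for k, (_, deps) in graph.items()})
    results = {}
    for name in order.static_order():
        op, deps = graph[name]
        results[name] = op(*(results[d] for d in deps))
    return results

print(run(graph)["total"])  # 20
```

Real dataflow engines add distribution, fault tolerance, and (in the incremental systems mentioned above) the ability to update only the affected parts of the graph when inputs change, rather than recomputing from scratch.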
== Storage ==

Data is stored in a variety of ways; a key deciding factor is how the data will be used. Data engineers optimize data storage and processing systems to reduce costs, using techniques such as data compression, partitioning, and archiving.
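As a rough sketch of why compression reduces storage costs, the snippet below gzips a batch of repetitive JSON records before archiving; the record layout and field names are invented for the example, and production systems would typically use columnar formats instead.

```python
# Illustrative sketch: compressing a partition of records before archiving,
# trading CPU time for storage cost. The schema here is made up.
import gzip
import json

records = [{"user_id": i, "event": "click"} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

print(len(compressed), "<", len(raw))      # repetitive data compresses well
assert gzip.decompress(compressed) == raw  # archiving is lossless
```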
== Databases ==

If the data is structured and some form of
online transaction processing is required, then
databases are generally used. Relational databases were originally the dominant choice, offering strong
ACID transaction correctness guarantees; most relational databases use
SQL for their queries. However, with the growth of data in the 2010s,
NoSQL databases have also become popular since they
scale horizontally more easily than relational databases by giving up the ACID transaction guarantees, as well as reducing the
object-relational impedance mismatch. More recently,
NewSQL databases — which attempt to allow horizontal scaling while retaining ACID guarantees — have become popular.
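The ACID guarantees mentioned above can be demonstrated with Python's built-in SQLite driver; the `accounts` table and the simulated failure below are invented for the example.

```python
# Sketch of an ACID transaction in a relational database, using the
# standard-library SQLite driver. The schema is made up for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
con.commit()

try:
    with con:  # the with-block is one atomic transaction
        con.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("crash mid-transfer")  # simulate a failure
        # the matching credit to bob is never reached
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so no money was lost.
print(dict(con.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```

NoSQL systems typically relax exactly this kind of multi-statement atomicity in exchange for easier horizontal scaling.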
== Data warehouses ==

If the data is structured and
online analytical processing is required (but not online transaction processing), then
data warehouses are a main choice. They enable data analysis, mining, and
artificial intelligence on a much larger scale than databases can allow. Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or
business intelligence software.
== Management ==

The number and variety of data processes and storage locations can become overwhelming for users. This inspired the use of a
workflow management system (e.g.
Airflow) to allow the data tasks to be specified, created, and monitored. The tasks are often specified as a
directed acyclic graph (DAG).

== Lifecycle ==