Integrated circuits are designed to handle various operations on both analog and digital signals. In computing, digital signals are the most common and are typically represented as binary numbers.
Computer hardware and software use this
binary representation to perform computations. This is done by processing
Boolean functions on the binary input, and then outputting the results for storage or further processing by other devices.
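As a minimal illustration (not drawn from the article itself), a Boolean function such as a one-bit full adder can be expressed directly over binary inputs:

```python
# One-bit full adder: a Boolean function over three binary inputs.
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    sum_bit = a ^ b ^ carry_in                  # XOR of the three inputs
    carry_out = (a & b) | (carry_in & (a ^ b))  # carry logic
    return sum_bit, carry_out

# Adding 1 + 1 with no carry-in yields sum 0, carry 1 (binary 10).
print(full_adder(1, 1, 0))  # (0, 1)
```

Chaining such adders bit by bit is exactly how wider binary addition is built up, whether in hardware gates or in software.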
==Computational equivalence of hardware and software==
Because any Turing-complete machine can compute any
computable function, it is always possible to design custom hardware that performs the same function as a given piece of software. Conversely, software can always be used to emulate the function of a given piece of hardware. Custom hardware may offer higher performance per watt for the same functions that can be specified in software.
Hardware description languages (HDLs) such as
Verilog and
VHDL can model the same
semantics as software and
synthesize the design into a
netlist that can be programmed to an FPGA or composed into the
logic gates of an ASIC.
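As a sketch of the software side of this equivalence (an illustrative example, not tied to any particular HDL tool), a gate-level netlist can be emulated in software by evaluating its gates in topological order:

```python
# Emulate a tiny netlist in software: gates listed in topological order.
# Each gate is (output_net, operation, input_nets).
NETLIST = [
    ("n1", "and", ("a", "b")),
    ("n2", "xor", ("a", "b")),
    ("out", "or", ("n1", "n2")),
]

OPS = {
    "and": lambda x, y: x & y,
    "or":  lambda x, y: x | y,
    "xor": lambda x, y: x ^ y,
}

def evaluate(netlist, inputs):
    nets = dict(inputs)  # net name -> current binary value
    for out, op, ins in netlist:
        nets[out] = OPS[op](*(nets[i] for i in ins))
    return nets

print(evaluate(NETLIST, {"a": 1, "b": 1})["out"])  # 1
```

The same netlist that this loop interprets one gate at a time could instead be programmed onto an FPGA or fabricated as ASIC gates, where all gates switch concurrently.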
==Stored-program computers==
The vast majority of software-based computing occurs on machines implementing the
von Neumann architecture, collectively known as
stored-program computers.
Computer programs are stored as data and
executed by
processors. Such processors must fetch and decode instructions, as well as
load data operands from
memory (as part of the
instruction cycle), to execute the instructions constituting the software program. Relying on a common
cache for code and data leads to the "von Neumann bottleneck", a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the
modified Harvard architecture, where instructions and data have separate caches in the
memory hierarchy, there is overhead to decoding instruction
opcodes and
multiplexing available
execution units on a
microprocessor or
microcontroller, leading to low circuit utilization. Modern processors that provide
simultaneous multithreading exploit under-utilization of available processor functional units and
instruction level parallelism between different hardware threads.
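The instruction cycle described above can be sketched as a toy stored-program interpreter (an illustrative model with a made-up three-instruction set, not any real ISA):

```python
# Toy stored-program machine: program and data share one memory,
# and every step pays the fetch/decode overhead described above.
def run(memory):
    acc, pc = 0, 0
    while True:
        opcode, operand = memory[pc]      # fetch the next instruction
        pc += 1
        if opcode == "LOAD":              # decode + execute
            acc = memory[operand]         # load a data operand from memory
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "HALT":
            return acc

# Memory holds instructions (cells 0-2) and data (cells 3-4).
memory = {0: ("LOAD", 3), 1: ("ADD", 4), 2: ("HALT", None), 3: 40, 4: 2}
print(run(memory))  # 42
```

Note that the fetch and decode steps are repeated for every instruction even though the useful work is only the load and the addition; this per-instruction overhead is what custom hardware can avoid.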
==Hardware execution units==
Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures, so they do not need to perform the instruction fetch and decode steps of an
instruction cycle or incur those stages' overhead. If the needed calculations are specified in a
register transfer level (RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses. This reclamation saves time, power, and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication, or memory, as well as increased
input/output capabilities. This comes at the cost of general-purpose utility.
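The trade-off can be illustrated in software (a sketch only, with a hypothetical instruction format): the same calculation expressed as a fixed "datapath", whose structure is decided at design time like an RTL description, versus an interpreted instruction sequence that pays fetch and decode overhead at every step.

```python
def fixed_datapath(x, y):
    # The operation structure is fixed in advance, like an RTL design:
    # no instructions are fetched or decoded at run time.
    return (x + y) * y

def interpreted(program, ):
    # A stored-program processor must fetch and decode each instruction,
    # paying per-step overhead that the fixed design avoids.
    acc = 0
    for opcode, operand in program:   # fetch
        if opcode == "SET":           # decode + execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "MUL":
            acc *= operand
    return acc

program = [("SET", 3), ("ADD", 4), ("MUL", 4)]  # computes (3 + 4) * 4
print(fixed_datapath(3, 4), interpreted(program))  # 28 28
```

Both produce the same result; the difference is where the control logic lives, and the fixed version gives up the generality of accepting arbitrary programs.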
==Emerging hardware architectures==
Greater RTL customization of hardware designs allows emerging architectures such as
in-memory computing,
transport triggered architectures (TTA) and
networks-on-chip (NoC) to further benefit from increased
locality of data to execution context, thereby reducing computing and communication latency between modules and functional units. Custom hardware is limited in parallel processing capability only by the area and
logic blocks available on the
integrated circuit die. Therefore, hardware is far freer to offer
massive parallelism than software on general-purpose processors, offering a possibility of implementing the
parallel random-access machine (PRAM) model. It is common to build
multicore and
manycore processing units out of
microprocessor IP core schematics on a single FPGA or ASIC. Similarly, specialized functional units can be composed in parallel, as
in digital signal processing, without being embedded in a processor
IP core. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little
conditional branching, especially on large amounts of data. This is the workload model targeted by
Nvidia's
CUDA-capable GPUs.
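The PRAM-style parallelism mentioned above can be sketched in software (an illustration of the workload shape, not a hardware implementation): many workers apply the same branch-free operation to independent data elements.

```python
from concurrent.futures import ThreadPoolExecutor

# PRAM-style sketch: each "processor" applies the same operation to an
# independent data element, with no conditional branching between elements.
def scale(x):
    return 2 * x

data = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    result = list(pool.map(scale, data))

print(result)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

In custom hardware this fan-out is bounded only by die area and available logic blocks, rather than by a fixed pool of processor cores.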
==Implementation metrics==
As device mobility has increased, new metrics have been developed that measure the relative performance of specific acceleration protocols, considering characteristics such as physical hardware dimensions, power consumption, and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed.

==Applications==