Stanford University stream processing projects included the Stanford Real-Time Programmable Shading Project started in 1999. A prototype called Imagine was developed in 2002. A project called Merrimac ran until about 2004.
AT&T also researched stream-enhanced processors as
graphics processing units rapidly evolved in both speed and functionality. Since these early days, dozens of stream processing languages have been developed, as well as specialized hardware.
== Programming model notes ==
The most immediate challenge in the realm of parallel processing does not lie so much in the type of hardware architecture used, but in how easy it will be to program the system in question in a real-world environment with acceptable performance. Machines like Imagine use a straightforward single-threaded model with automated dependencies, memory allocation and
DMA scheduling. This in itself is a result of the research at MIT and Stanford in finding an optimal
layering of tasks between programmer, tools and hardware. Programmers beat tools in mapping algorithms to parallel hardware, and tools beat programmers in figuring out the smartest memory allocation schemes, etc. Of particular concern are MIMD designs such as Cell, for which the programmer needs to deal with application partitioning across multiple cores and handle process synchronization and load balancing. A drawback of SIMD programming was the issue of
array-of-structures (AoS) and structure-of-arrays (SoA). Programmers often create representations of entities in memory, for example, the location of a particle in 3D space, the colour of the ball and its size, as below:

    // A particle in a three-dimensional space.
    struct Particle {
        double x;
        double y;
        double z;
        // 8 bit per channel, say we care about RGB only
        unsigned char color[3];
        float size;
        // ... and many other attributes may follow...
    };

When multiple of these structures exist in memory they are placed end to end, creating an array of structures (AoS) topology. This means that should some algorithm be applied to the location of each particle in turn, it must skip over the memory locations containing the other attributes. If these attributes are not needed, this results in wasteful usage of the CPU cache. Additionally, a SIMD instruction will typically expect the data it will operate on to be contiguous in memory; the elements may also need to be aligned. By moving the memory location of the data out of the structure, data can be better organised for efficient access in a stream and for SIMD instructions to operate on. A structure of arrays (SoA), as shown below, can allow this.

    struct Particle {
        double* x;
        double* y;
        double* z;
        unsigned char* colorRed;
        unsigned char* colorBlue;
        unsigned char* colorGreen;
        float* size;
    };

Instead of holding the data in the structure, it holds only pointers (memory locations) for the data. Shortcomings are that if multiple attributes of an object are to be operated on, they might now be distant in memory and so result in a cache miss. The aligning and any needed padding lead to increased memory usage. Overall, memory management may be more complicated if, for example, structures are added and removed.
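For illustration only (the traversal functions and the toy position update below are hypothetical, not part of the article's example), a loop that touches only the positions shows why the layout matters: with SoA each attribute is a contiguous, unit-stride array that a compiler can readily vectorise, while with AoS every iteration must also stride over the colour and size fields:

    #include <cstddef>

    struct ParticleAoS {
        double x, y, z;
        unsigned char color[3];
        float size;
    };

    struct ParticlesSoA {
        double *x, *y, *z;   // one contiguous array per attribute
    };

    // AoS: each iteration loads a whole particle; positions are interleaved with the other fields.
    void translate_aos(ParticleAoS* p, std::size_t n, double dx) {
        for (std::size_t i = 0; i < n; ++i) {
            p[i].x += dx;
            p[i].y += dx;
            p[i].z += dx;
        }
    }

    // SoA: unit-stride accesses over plain arrays, which caches and SIMD units handle well.
    void translate_soa(ParticlesSoA& p, std::size_t n, double dx) {
        for (std::size_t i = 0; i < n; ++i) {
            p.x[i] += dx;
            p.y[i] += dx;
            p.z[i] += dx;
        }
    }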
For stream processors, the usage of structures is encouraged. From an application point of view, all the attributes can be defined with some flexibility. Taking GPUs as a reference, there is a set of attributes (at least 16) available. For each attribute, the application can state the number of components and the format of the components (but only primitive data types are supported for now). The various attributes are then attached to a memory block, possibly defining a stride between 'consecutive' elements of the same attribute, effectively allowing interleaved data. When the GPU begins the stream processing, it will gather all the various attributes in a single set of parameters (usually this looks like a structure or a "magic global variable"), perform the operations and scatter the results to some memory area for later processing (or retrieving).
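As a rough sketch only, not tied to any particular GPU API (the attribute bindings, the kernel and the brightness computation below are invented for illustration), the runtime's side of this can be pictured as gathering strided attributes into a per-element parameter block, invoking the kernel, and scattering each result:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // One attribute binding: a base pointer plus a stride in bytes, allowing interleaved data.
    struct Attribute {
        const unsigned char* base;
        std::size_t stride;
    };

    // The per-element parameters as the kernel sees them (the "magic global variable").
    struct Params {
        float position[3];
        std::uint8_t color[3];
    };

    // Hypothetical per-element kernel: derive a brightness value from the gathered attributes.
    static float kernel(const Params& in) {
        return (in.color[0] + in.color[1] + in.color[2]) / (3.0f * 255.0f);
    }

    // The "runtime": gather the attributes, run the kernel, scatter the result, element by element.
    void run_stream(Attribute position, Attribute color, float* out, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            Params p;
            std::memcpy(p.position, position.base + i * position.stride, sizeof p.position);
            std::memcpy(p.color, color.base + i * color.stride, sizeof p.color);
            out[i] = kernel(p);   // the "scatter" here is simply a contiguous store of the result
        }
    }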
More modern stream processing frameworks provide a FIFO-like interface to structure data as a literal stream. This abstraction provides a means to specify data dependencies implicitly while enabling the runtime/hardware to take full advantage of that knowledge for efficient computation. One of the simplest and most efficient stream processing modalities to date for C++ is RaftLib, which enables linking independent compute kernels together as a data flow graph using C++ stream operators. As an example:

    import <raft>;
    import <raftio>;
    import std;

    using String = std::string;
    using RaftKernel = raft::kernel;
    using RaftKernelStatus = raft::kstatus;
    using RaftMap = raft::map;
    using RaftPrint = raft::print<String>;

    class HelloWorld : public RaftKernel {
    public:
        HelloWorld() {
            output.addPort<String>("0");
        }

        virtual RaftKernelStatus run() {
            output["0"].push(String("Hello World\n"));
            return raft::stop;
        }
    };

    int main(int argc, char* argv[]) {
        // instantiate print kernel
        RaftPrint p;
        // instantiate hello world kernel
        HelloWorld hello;
        // add kernels to map; both hello and p are executed concurrently
        RaftMap m;
        m += hello >> p;
        // execute the map
        m.exe();
        return 0;
    }
=== Models of computation for stream processing ===
Apart from specifying streaming applications in high-level languages, models of computation (MoCs) have also been widely used as dataflow models and process-based models.
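As a minimal, hypothetical sketch of a process-based model (the channel type and the two processes below are invented for illustration, not taken from any particular framework), kernels can be viewed as independent processes that communicate only through FIFO channels:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <thread>

    // A tiny blocking FIFO channel; closing it signals end-of-stream.
    template <typename T>
    class Channel {
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
        bool closed = false;
    public:
        void push(T v) { { std::lock_guard<std::mutex> l(m); q.push(std::move(v)); } cv.notify_one(); }
        void close()   { { std::lock_guard<std::mutex> l(m); closed = true; } cv.notify_all(); }
        std::optional<T> pop() {
            std::unique_lock<std::mutex> l(m);
            cv.wait(l, [&] { return !q.empty() || closed; });
            if (q.empty()) return std::nullopt;
            T v = std::move(q.front()); q.pop(); return v;
        }
    };

    int main() {
        Channel<int> ch;
        // Producer process: emits a finite stream of tokens.
        std::thread producer([&] { for (int i = 1; i <= 10; ++i) ch.push(i); ch.close(); });
        // Consumer process: fires whenever a token is available on its input channel.
        std::thread consumer([&] {
            long sum = 0;
            while (auto v = ch.pop()) sum += *v;
            std::cout << "sum = " << sum << "\n";
        });
        producer.join();
        consumer.join();
        return 0;
    }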
=== Generic processor architecture ===
Historically, CPUs began implementing various tiers of memory access optimizations because of their ever-increasing performance compared with the relatively slowly growing external memory bandwidth. As this gap widened, large amounts of die area were dedicated to hiding memory latencies. Since fetching information and opcodes to those few ALUs is expensive, very little die area is dedicated to actual mathematical machinery (as a rough estimation, consider it to be less than 10%).

A similar architecture exists on stream processors, but thanks to the new programming model the number of transistors dedicated to management is actually very small.

From a whole-system point of view, stream processors usually exist in a controlled environment. GPUs exist on an add-in board (this seems to also apply to Imagine), while CPUs continue to do the job of managing system resources, running applications, and such.

The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common; multi-buses were employed in the past). The exact number of memory lanes depends on the market range. As this is written, there are still 64-bit wide interconnections around (entry-level). Most mid-range models use a fast 128-bit crossbar switch matrix (4 or 2 segments), while high-end models deploy huge amounts of memory (actually up to 512 MB) with a slightly slower crossbar that is 256 bits wide. By contrast, standard processors from Intel Pentium to some Athlon 64 have only a single 64-bit wide data bus.

Memory access patterns are much more predictable. While arrays do exist, their dimension is fixed at kernel invocation. The thing which most closely matches a multiple pointer indirection is an indirection chain, which is, however, guaranteed to finally read or write from a specific memory area (inside a stream).

Because of the SIMD nature of the stream processor's execution units (ALU clusters), read/write operations are expected to happen in bulk, so memories are optimized for high bandwidth rather than low latency (this is a difference from Rambus and DDR SDRAM, for example). This also allows for efficient memory bus negotiations. Most (90%) of a stream processor's work is done on-chip, requiring only 1% of the global data to be stored to memory. This is where knowing the kernel temporaries and dependencies pays off.

Internally, a stream processor features some clever communication and management circuits, but what is interesting is the Stream Register File (SRF). This is conceptually a large cache in which stream data is stored to be transferred to external memory in bulk. A cache-like, software-controlled structure serving the various ALUs, the SRF is shared between all the ALU clusters. The key concept and innovation, demonstrated with Stanford's Imagine chip, is that the compiler is able to automate and allocate memory in an optimal way, fully transparently to the programmer. The dependencies between kernel functions and data are known through the programming model, which enables the compiler to perform flow analysis and optimally pack the SRFs. Commonly, this cache and DMA management can take up the majority of a project's schedule, something the stream processor (or at least Imagine) totally automates. Tests done at Stanford showed that the compiler scheduled memory as well as or better than hand tuning with much effort.

There is proof that there can be a lot of clusters, because inter-cluster communication is assumed to be rare. Internally, however, each cluster can efficiently exploit a much lower number of ALUs because intra-cluster communication is common and thus needs to be highly efficient. To keep those ALUs fed with data, each ALU is equipped with local register files (LRFs), which are basically its usable registers. This three-tiered data access pattern makes it easy to keep temporary data away from slow memories, thus making the silicon implementation highly efficient and power-saving.
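A purely software analogy of this three-tiered pattern (a hypothetical sketch; the tile size and the computation are invented, and on a real stream processor the compiler, not user code, manages the SRF): data is staged from slow external memory into a small on-chip buffer, processed entirely out of registers, and written back in bulk.

    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t kTile = 1024;   // invented "SRF" capacity, in elements

    // Hypothetical illustration of the DRAM -> SRF -> LRF tiers: stage a tile on-chip,
    // do all arithmetic out of registers, then write the tile back in bulk.
    void scale_stream(const float* in, float* out, std::size_t n, float gain) {
        float srf[kTile];                                     // on-chip staging buffer (SRF analogy)
        for (std::size_t base = 0; base < n; base += kTile) {
            const std::size_t len = std::min(kTile, n - base);
            std::copy(in + base, in + base + len, srf);       // bulk load: external memory -> SRF
            for (std::size_t i = 0; i < len; ++i) {
                float r = srf[i];                             // value lives in a register (LRF analogy)
                r = r * gain;                                 // all arithmetic happens on-chip
                srf[i] = r;
            }
            std::copy(srf, srf + len, out + base);            // bulk store: SRF -> external memory
        }
    }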
=== Hardware-in-the-loop issues ===
Although an order of magnitude speedup can reasonably be expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from this. Communication latencies are actually the biggest problem. Although PCI Express improved this with full-duplex communications, getting a GPU (and possibly a generic stream processor) to work can still take a long time. This means it is usually counter-productive to use them for small datasets. Because changing the kernel is a rather expensive operation, the stream architecture also incurs penalties for small streams, a behaviour referred to as the short stream effect.

Pipelining is a very widespread and heavily used practice on stream processors, with GPUs featuring pipelines exceeding 200 stages. The cost of switching settings depends on the setting being modified, but it is now considered to always be expensive. To avoid these problems at various levels of the pipeline, many techniques have been deployed, such as "über shaders" and "texture atlases". These techniques are game-oriented because of the nature of GPUs, but the concepts are interesting for generic stream processing as well.

== Examples ==