72 changes: 72 additions & 0 deletions HOPPE.md
# GPU Knowledge Log
### Alex Hoppe

## Memory Hierarchies

**Big Idea:** GPUs have orders of magnitude more compute throughput than memory bandwidth
+ Radeon HD 5870 can do 1600 MUL-ADDs/clk, which would require ~20 TB/s of memory bandwidth if every operand came from memory (see the back-of-envelope sketch below) [src](http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/lectures/08_mem_hierarchy.pdf)
+ The PCIe bus is limited to at most ~63 GB/s (x16) [src](https://en.wikipedia.org/wiki/PCI_Express)
+ GPUs have far more compute throughput than memory read bandwidth; the design copes with this in a few ways:
- Shaders have very high arithmetic intensity: many floating-point ops per texture lookup
- GPU operations are mostly independent, so they map well to SIMD and multi-stream processing
- GPU memory requests are heavily buffered, queued, and interleaved for maximum reuse
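
A back-of-envelope sketch of where a number like ~20 TB/s comes from. The 850 MHz clock and the "three 4-byte inputs plus one output per MUL-ADD" traffic model are assumptions for illustration, not figures from the source above.

```cpp
#include <cstdio>

// Rough estimate: how much bandwidth 1600 MUL-ADDs per clock would need
// if every operand had to come from (and go back to) DRAM.
int main() {
    const double mad_per_clock = 1600.0;
    const double clock_hz      = 850e6;     // assumed HD 5870 core clock
    const double bytes_per_mad = 4.0 * 4;   // assumed: 3 float inputs + 1 float output
    const double bytes_per_sec = mad_per_clock * clock_hz * bytes_per_mad;
    std::printf("Required bandwidth: %.1f TB/s\n", bytes_per_sec / 1e12);  // ~21.8 TB/s
    return 0;
}
```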


## Phong Shader Operation
+ Vertex shaders act on every (visible?) vertex in the 3D model
- an example is calculating light reflection:

```cpp
sampler mySamp;                // texture sampler state
Texture2D<float3> myTex;       // diffuse (kd) texture
float3 ks;                     // specular coefficient
float shinyExp;                // specular "shininess" exponent
float3 lightDir;               // direction to the light
float3 viewDir;                // direction to the viewer

float4 phongShader(float3 norm, float2 uv)
{
    float3 kd = myTex.Sample(mySamp, uv);                            // diffuse coefficient from the texture
    float3 reflectDir = 2 * dot(-lightDir, norm) * norm + lightDir;  // light direction mirrored about the normal
    float spec = dot(viewDir, reflectDir);
    float3 result = kd * clamp(dot(lightDir, norm), 0.0, 1.0);       // diffuse term
    result += ks * pow(spec, shinyExp);                              // specular term
    return float4(result, 1.0);
}
```
[src](http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/lectures/08_mem_hierarchy.pdf)

This vertex shader takes in a normal per vertex, then calculates the intensity of the direct reflection into the eye (specular) and the diffuse reflection, and scales them by the (color-value?) coefficients from the texture mapping.

A *Phong Shader* is an algorithm that calculates surface reflectance using interpolated vertex normals to approximate a curved 3D surface, instead of a single plane normal per face.

#### Clamp Operation
+ Clamping to [0, 1] forces out-of-range values to the nearest bound.
+ Without clamping, a light facing away from the surface (negative dot product) would subtract light from the result (see the small illustration below).
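
A tiny plain-C++ illustration (not shader code) of the point above; the numbers are made up.

```cpp
#include <algorithm>
#include <cstdio>

// A light behind the surface gives a negative dot(lightDir, norm);
// without the clamp it would *subtract* light instead of contributing nothing.
int main() {
    float n_dot_l   = -0.4f;                               // light is behind the surface
    float unclamped = n_dot_l;                             // would darken the result
    float clamped   = std::clamp(n_dot_l, 0.0f, 1.0f);     // contributes nothing instead
    std::printf("unclamped = %.1f, clamped = %.1f\n", unclamped, clamped);
    return 0;
}
```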

## Shader Core
+ Individual fetch/decode unit, maybe?
+ Several ALUs working off the same fetch unit
+ Execution contexts (register space, afaict)
+ L1 Cache
+ Texture cache (read-only, loaded at compile-time?)

![shader_core](shader_core_ex.png)
One theoretical example of a shader core.

## Graphics Processing
[CMU lecture about *GRAPHICS*, front to back](https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf)

### Interleaving
Hide latency by interleaving different independent processes while they wait on memory fetches. This works somewhat like a pipelined CPU. The goal is to maximize throughput rather than spend money, time, and area on reducing latency.
![Hiding Latency with Interleaving](hiding_latency.png)


#### Execution Contexts
Execution contexts are basically swap spaces sized to park an entire register/execution state, so the core can interleave different operations.

**EG:**
16 different execution contexts means 16 different processes can be interleaved, swapping back and forth while long operations like memory reads (texture fetches) are in flight. This hides latency much better than a processor with, say, 4 contexts. The drawback to adding more contexts is area: each one takes up a lot of *expensive & fast*-type storage.

![Multiple Execution Contexts](exec_contexts.png)
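
A rough model of how many contexts are needed, purely my own simplification with made-up numbers: if a fetch stalls for `latency` cycles and each context has `work` cycles of independent ALU math, you need roughly latency/work extra contexts to keep the ALUs busy.

```cpp
#include <cstdio>

// Toy latency-hiding model (not from the lecture): count how many contexts it
// takes so that by the time the last one runs out of work, the first fetch is back.
int main() {
    const int latency_cycles = 200;   // assumed memory/texture fetch latency
    const int work_cycles    = 20;    // assumed ALU work available per context
    int contexts_needed = latency_cycles / work_cycles + 1;  // +1 for the stalled one
    std::printf("Contexts needed to hide the stall: %d\n", contexts_needed);
    return 0;
}
```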
Binary file added Project_ Graphics Processing Unit.pdf
Binary file not shown.
264 changes: 264 additions & 0 deletions WRITEUP.md

Large diffs are not rendered by default.

Binary file added img/exec_contexts.png
Binary file added img/hiding_latency.png
Binary file added img/shader_core_ex.png
Binary file added img/shader_core_memory.png
Binary file added notes_taylor/Graphics3D_VertexFragment.png
93 changes: 93 additions & 0 deletions notes_taylor/notes_taylor.md
## SigGraph 2008 Talk

#### Overall architecture

![top level architecture](siggraph_gpu_overall.PNG)

Each shader core is like a smaller, specialized CPU. You don't need to optimize for single-thread efficiency, so you can remove a lot of specialized hardware from a CPU design.

![example CPU architecture](siggraph_cpu_design.PNG)

![example shader architecture](siggraph_slimmed_design.PNG)

Of these components:

- fetch/decode is pretty much analogous to the same unit in a CPU
- In this example, there's a single ALU, but actual GPU architectures can include many ALUs (usually specialized for vector operations)
- Execution context is a set of registers to store the current state in, to do effective multiplexing with high-latency stuff: more below

#### SIMD Processing

You can amortize the cost of managing a single instruction stream over multiple ALUs, since many "fragments" will need to be processed with the same set of instructions. Instead of scalar operations on scalar registers, you can do vector operations on vectors of registers. This is called SIMD (Single Instruction, Multiple Data) processing. You can either have explicit SIMD vector instructions, or write scalar instructions that get implicitly vectorized in hardware. This is a design decision, and different architectures do it differently.
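
A sketch of the two flavors in plain C++ (the function names and the 8-lane width are made up for illustration): the explicit style writes vector-width code directly, while the implicit style writes per-fragment scalar code that the hardware runs in lockstep across lanes.

```cpp
#include <array>
#include <cstdio>

constexpr int kLanes = 8;   // assumed ALUs per core

// Explicit SIMD flavor: one instruction stream operating on 8-wide vectors.
std::array<float, kLanes> vecMul(const std::array<float, kLanes>& a,
                                 const std::array<float, kLanes>& b) {
    std::array<float, kLanes> r{};
    for (int i = 0; i < kLanes; ++i) r[i] = a[i] * b[i];   // maps to one vector multiply
    return r;
}

// Implicit flavor: scalar per-fragment code; the hardware groups 8 fragments
// and runs this with a single shared instruction stream.
float shadeFragment(float albedo, float light) { return albedo * light; }

int main() {
    std::array<float, kLanes> albedo{}, light{};
    albedo.fill(0.5f); light.fill(0.8f);
    auto  explicitResult = vecMul(albedo, light);
    float implicitResult = shadeFragment(albedo[0], light[0]);  // one lane's worth
    std::printf("%f %f\n", explicitResult[0], implicitResult);
    return 0;
}
```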

![SIMD architecture](siggraph_multi_alu.PNG)

In this example, each ALU unit has its own single block of context data.

Now, you can do a lot of processing in parallel. With 16 cores, 8 ALUs per core, you can process 16 simultaneous instruction streams on 128 (total) fragments at the same time.

![128 fragments on 16 streams](siggraph_16_8_parallel.PNG)

Branching is kind of gross, though. Because you're processing several fragments through the same set of instructions, you have to take *every* branch that *any* of your fragments take. On the plus (?) side, this kind of obviates the benefits of a specialized branch predictor unit, as you'd rarely make use of it to skip anything.

![shader branching](siggraph_shader_branch.PNG)
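
A toy model of that cost in plain C++ (the data and the ×2 / negate operations are invented): every lane evaluates both sides of the branch, and a per-lane mask just selects which result survives.

```cpp
#include <array>
#include <cstdio>

// All 8 lanes walk through BOTH sides of the branch; the mask discards the
// result of the side each lane didn't take.
int main() {
    constexpr int kLanes = 8;
    std::array<float, kLanes> x   = {1, -2, 3, -4, 5, -6, 7, -8};
    std::array<float, kLanes> out{};
    for (int i = 0; i < kLanes; ++i) {
        bool  mask      = x[i] > 0;        // which lanes "take" the if-side
        float if_side   = x[i] * 2.0f;     // executed for every lane regardless
        float else_side = -x[i];           // also executed for every lane
        out[i] = mask ? if_side : else_side;
    }
    for (float v : out) std::printf("%.0f ", v);
    std::printf("\n");
    return 0;
}
```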

#### Stalls and Latency

Here's where the execution context becomes really helpful. For certain operations, like retrieving a texture from memory, there's a massive amount of latency during which time your core would be doing basically nothing. If you instead build in multiple complete sets of execution contexts:

![multiple contexts](siggraph_multi_context.PNG)

While you're waiting for a stall to finish, you can store the current state in a context and switch over to another set of fragments for processing. This allows you to minimize the effect of high-latency operations on throughput.

![latency hiding](siggraph_latency_hiding.PNG)

## UFMG - Architecture and micro-architecture of GPUs

Before 2000, the focus was on improving single-threaded performance (reducing latency). Pipelining in the 80s, superscalar processors using branch prediction, out-of-order execution, other tricks to improve performance in the 90s. Regularity in application behavior allows for predictions to pay out on average.

After 2000:

#### Multi-Threading
- Homogeneous multi-core
- Replication of the complete execution engine
- More cores, more parallel work
- Interleaved multi-threading
- Basically thread-level pipelining: alternate threads in one pipeline so stalls in one are covered by work from another
- Clustered multi-threading
- For each unit of data, select between above

#### Heterogeneity
- Latency-optimized multi-core: spends too many resources for the parallel portions of a program
- Throughput-optimized multi-core: not performant in sequential portions
- Heterogeneous multi-core: contains both types, can specialize

![types of multi-cores](ufmg_heterogeneity.PNG)

#### Regularity in threading

Regularity (similarity in behavior between threads) allows for greater parallelism: time-intensive tasks like memory access across multiple threads can be consolidated into a single transaction. This cooperative sharing of fetch/decode and load/store units improves area/power efficiency (Nvidia calls this SIMT, in which threads are consolidated into "warps" for improved performance).

![regularity in data](ufmg_regularity.PNG)
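
A toy calculation of why that consolidation pays off (the 32-thread warp, 4-byte elements, and 128-byte transaction size are assumed, not taken from the slides): consecutive accesses from a warp fit in one memory transaction, while a strided pattern needs many.

```cpp
#include <cstdio>

// Count memory transactions for a warp accessing 4-byte elements with
// different strides, assuming 128-byte transactions.
int main() {
    const int warp_size = 32, elem_bytes = 4, line_bytes = 128;
    for (int stride : {1, 8}) {
        int span = warp_size * stride * elem_bytes;            // bytes spanned by the warp
        int transactions = (span + line_bytes - 1) / line_bytes;
        std::printf("stride %d -> %d transaction(s)\n", stride, transactions);
    }
    return 0;
}
```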

#### Control Divergence

Traditionally, in SIMD processing:

- One thread per processing element
- All elements execute the same instruction
- Elements can be individually disabled

This is basically the structure covered in the SigGraph talk. Performance takes a larger hit the more the branches diverge, because every path that any element takes must be executed by the whole group.

![naive SIMT](ufmg_SIMT.PNG)

This is accomplished through the use of a mask stack or activity counters; a bit per thread shows which are currently active, or a counter per thread counts up for each active instruction cycle.
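
A sketch of the mask-stack idea, using one 8-bit mask for 8 threads; the bit patterns and the push/narrow/invert/restore sequence are my own simplification of how such hardware might behave.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// On entering an if/else the current active mask is pushed, narrowed for the
// taken side, inverted (within the saved mask) for the other side, and
// restored at the reconvergence point.
int main() {
    std::vector<uint8_t> stack;
    uint8_t active = 0xFF;                    // all 8 threads active
    uint8_t cond   = 0b10110010;              // threads whose condition is true

    stack.push_back(active);                  // enter if/else: save current mask
    active &= cond;                           // run the "if" side
    std::printf("if-side mask:   %02x\n", unsigned(active));
    active = stack.back() & ~cond;            // run the "else" side
    std::printf("else-side mask: %02x\n", unsigned(active));
    active = stack.back(); stack.pop_back();  // reconverge: restore saved mask
    std::printf("after join:     %02x\n", unsigned(active));
    return 0;
}
```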

Alternately, you can maintain a separate program counter per thread. This has performance benefits, but makes thread synchronization a non-trivial task. Now you need to figure out which PC to use as the "master" PC each cycle, since you still only want to do one consolidated instruction fetch operation. This can be determined via various criteria (most common, latest, deepest function call nesting level, some combination) to maximize efficiency.

![voting SIMT](ufmg_SIMT_voting.PNG)
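
A sketch of one possible voting rule mentioned above ("most common PC wins"); the PC values and the tie handling are made up for illustration.

```cpp
#include <array>
#include <cstdio>
#include <map>

// Each thread has its own PC; pick the most common one, fetch once from it,
// and only the threads at that PC execute this cycle.
int main() {
    std::array<unsigned, 8> pc = {0x40, 0x40, 0x80, 0x40, 0x80, 0x40, 0x40, 0x40};
    std::map<unsigned, int> votes;
    for (unsigned p : pc) ++votes[p];
    unsigned winner = 0; int best = 0;
    for (auto [addr, count] : votes)
        if (count > best) { best = count; winner = addr; }
    std::printf("Fetch from PC 0x%x; %d of 8 threads execute this cycle\n", winner, best);
    return 0;
}
```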

This starts looking a lot less like SIMD and a lot more like normal programs. There's no enforced lockstep of vector instructions; it's basically just different cores doing different things in parallel. You also don't need to maintain a stack of activation bits, which is nice.

This general paradigm is called SPMD (Single Program, Multiple Data). It's an idea that can be used for more than just specialized vector processors, and can also refer to parallel computation being done via message passing (e.g. MPI) over multiple independent computers, which allows for more flexibility in instruction fetches as well.
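
A minimal SPMD sketch over MPI, just to show the "same program, different data per rank" shape; it assumes an MPI installation (compile with mpic++, run with mpirun) and the 1024-element problem size is arbitrary.

```cpp
#include <mpi.h>
#include <cstdio>

// Every process runs this same program; behavior differs only through `rank`,
// which is the SPMD idea in its message-passing form.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank works on its own slice of a hypothetical 1024-element problem.
    int chunk = 1024 / size;
    std::printf("rank %d of %d handles elements [%d, %d)\n",
                rank, size, rank * chunk, (rank + 1) * chunk);

    MPI_Finalize();
    return 0;
}
```
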
Binary file added notes_taylor/siggraph_16_8_parallel.PNG
Binary file added notes_taylor/siggraph_cpu_design.PNG
Binary file added notes_taylor/siggraph_gpu_overall.PNG
Binary file added notes_taylor/siggraph_latency_hiding.PNG
Binary file added notes_taylor/siggraph_multi_alu.PNG
Binary file added notes_taylor/siggraph_multi_context.PNG
Binary file added notes_taylor/siggraph_shader_branch.PNG
Binary file added notes_taylor/siggraph_shader_core.PNG
Binary file added notes_taylor/siggraph_slimmed_design.PNG
Binary file added notes_taylor/ufmg_SIMT.PNG
Binary file added notes_taylor/ufmg_SIMT_voting.PNG
Binary file added notes_taylor/ufmg_heterogeneity.PNG
Binary file added notes_taylor/ufmg_performance.PNG
Binary file added notes_taylor/ufmg_regularity.PNG
93 changes: 93 additions & 0 deletions notes_william/Buffers.md
# Frame Buffer

The frame buffer is a set of sub buffers that are used to determine what exactly ends up on the screen.

The frame buffer breaks down into color buffers, a depth buffer, a stencil buffer, and an accumulation buffer. (There are technically some others too, but I think they aren't as important as these.)

###### Frame Buffer Visual
![Frame Buffer Visual](FrameBufferVisual.gif)

This is a simplified picture of what the frame buffer is. It simply holds all the computed information that will then be written to the screen. Before being written out, the data consists of fragments carrying values computed by the fragment shaders; once written to the screen, they become pixels. (In this example picture we are looking at a 24-bit color buffer, which will be explained in a sec.)

##### What is a buffer?
It is a matrix of values in memory that correspond to the pixels on the screen. These values also have a particular bit depth, depending on what the buffer is and how the programmer wishes to use it. For example, one could use a 1-bit-deep color buffer to represent only black and white on a screen. An important thing to remember about these buffers is that their main purpose is to save data that will be useful later; they all generally keep their data from previous computation unless explicitly cleared.
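
A minimal sketch of that idea: a color buffer as a 2D array of 24-bit pixels. The 640x480 resolution and the struct layout are arbitrary choices for illustration.

```cpp
#include <array>
#include <cstdint>

// "A matrix of values in memory that correspond to the pixels on the screen."
constexpr int kWidth = 640, kHeight = 480;

struct RGB8 { uint8_t r, g, b; };   // 24 bits per pixel

std::array<std::array<RGB8, kWidth>, kHeight> colorBuffer{};  // zero-initialized = black

int main() {
    colorBuffer[100][200] = {255, 0, 0};   // paint one pixel red
    return 0;
}
```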

**Tentative Note: (Actually I think this is wrong now.)**

~~It appears that all of these buffers other than the main color buffer, are things that you can set your self. Many graphics programs use most of them simply because they are so integral to being able to do graphics, however, one could make buffers for whatever information they want and use them in calculations. All of these buffers are stored in the graphics hardware's random access memory (RAM), and are designated, not explicitly in the hardware, but by the programmer.~~

I now think the color buffers, depth buffer, and stencil buffer are all memory areas set aside at the hardware level in the GPU's RAM (or in unified RAM).

###### (Not Sure if this is correct)

## Main Color Buffer
The main color buffer is what determines the pixels for the scene you plan to draw. Without any extra buffers active, this buffer is what the math in the shader cores produces, and the pixels are all updated from it. It usually has a bit depth of 24 bits for true-color pixels (2^8 levels each for R, G, and B), sometimes just 16 bits for less precise colors (2^5 levels for two channels and 2^6 for the third, usually green), and technically it can have fewer bits. An alpha channel can also be added to make 32 bits. The picture in the Frame Buffer section is an example of a color buffer. (A small packing sketch follows the note below.)

IMPORTANT NOTE: This is the only thing that ends up actually getting drawn to the screen; the rest of the buffers are used to edit this buffer.
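
A small sketch of the 16-bit "5-6-5" layout mentioned above, packing one pixel into a `uint16_t`; the helper name is invented.

```cpp
#include <cstdint>
#include <cstdio>

// 5 bits red | 6 bits green | 5 bits blue, packed into 16 bits.
uint16_t packRGB565(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

int main() {
    std::printf("white = 0x%04x\n", packRGB565(255, 255, 255));  // 0xFFFF
    return 0;
}
```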

##### Extra Color Buffers
There can be extra color buffers depending on the GPU setup. These allow for saving images that the programmer knows will be used for a while. They work the same as the main color buffer but are not written to the screen by default, only if the programmer specifies.

## Depth Buffer (or Z-Buffer)
The depth buffer is a representation of how far away the pixels on the screen are from the viewer if the scene existed in 3D space. This is useful for determining if an object that moves goes in front of or behind another object. These are usually 3 bytes (24 bits) deep.

###### Doom Depth Buffer Example

![Doom Depth Buffer Example](DoomExample.png)

In this example the actual game is on the left and the depth buffer values are displayed on the right (black being closer to the player, and white being further). These values allow the game to correctly display what is in front of what and not have graphics in the background lay on top of those in the foreground.

Another example: imagine a screen consisting of a 2D house and a car driving by behind said house. The depth values for the area taken up by the house would be lower (closer) than those taken up by the car. Every time the screen is updated, the car's depth and color values move towards the house. Once they overlap, the depth buffer determines that the car's colors are in fact behind those of the house, and thus the car is not displayed where it overlaps with the house. This comparison is called a depth test (see the sketch after the visuals below).

###### Depth Test Visuals
![Depth Test Visual](DepthBufferVisual.png)

![Another Depth Test Visual](z-bufferNumberExample.gif)
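
A minimal sketch of the depth test described above (names invented; real hardware does this per fragment with configurable comparison functions): keep the incoming fragment only if it is closer than what the depth buffer already stores.

```cpp
#include <cstdint>

struct RGB8 { uint8_t r, g, b; };

// Only write the color if the incoming fragment is closer than the stored depth.
void depthTest(float fragDepth, float& storedDepth, RGB8 fragColor, RGB8& storedColor) {
    if (fragDepth < storedDepth) {   // closer to the viewer than what's there
        storedDepth = fragDepth;
        storedColor = fragColor;
    }                                // otherwise the fragment is discarded
}

int main() {
    float depth = 0.3f;              // the house, already drawn (closer)
    RGB8  color = {200, 50, 50};
    depthTest(0.7f, depth, {20, 20, 200}, color);  // the car behind: rejected
    return 0;
}
```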

The programmer can also manually change information about what is happening in the scene by manually changing depth values for objects.

For example, imagine an application that lets you punch holes in your browser through which you can see your desktop. The browser is essentially just a rectangle drawn in front of your desktop background. To punch a hole in it, one could define a punch shape that the GPU can process into pixel locations, and then write the minimum depth value into the depth buffer for the fragments of the desktop background inside this shape. The desktop would then win the depth test there and appear on top of the browser.


## Stencil Buffer
This is a buffer that is normally 1 byte (8 bits) deep and simply designates whether to update the corresponding pixels. Each bit in the buffer essentially acts as a mask, so the 8 bits allow for 8 masks over the screen that stop certain color buffers from being rendered onto the screen in certain places. This is similar to how a graffiti artist uses stencils to build up a multi-colored piece; in fact, that is where the name comes from. It is a bit like the depth buffer in that it can hide parts of objects or entire objects; however, it is more manual, in that the programmer defines the shapes that are draw zones or no-draw zones (see the sketch below).
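
A small sketch of using one bit of an 8-bit stencil value as a draw/no-draw mask for a pixel; the bit assignments are arbitrary, and real APIs express this through configurable stencil tests rather than a helper like this.

```cpp
#include <cstdint>
#include <cstdio>

// Draw only where the chosen mask bit is set in the pixel's stencil value.
bool stencilAllows(uint8_t stencilValue, uint8_t maskBit) {
    return (stencilValue & maskBit) != 0;
}

int main() {
    uint8_t stencil = 0b00000101;             // this pixel belongs to masks 0 and 2
    std::printf("mask 0: %d, mask 1: %d\n",
                stencilAllows(stencil, 1 << 0),
                stencilAllows(stencil, 1 << 1));
    return 0;
}
```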

###### Antichamber Stencil Buffer Example
https://youtu.be/V9HvgmNVQGM?t=8m33s

Short clip from antichamber, great example of stencil buffers.

In this example there are small cube areas where different geometric objects exist; however, the game uses stencil buffers to overlay different scenes in what should be the same geometric area. The way this most likely works is that the game's shader code gives each of those different areas its own stencil mask and then renders the visible objects in each of those scenes. To optimize further, the game probably only loads four separate scenes, each containing all the objects on one face side, so only 4 masks are needed to correctly show the objects in the game. The only problem with all this is that a lot of extra math happens before the objects that aren't supposed to be seen are culled, since GPUs are bad at conditional logic. I'm pretty sure the setup is that any object that is visible at all is rendered entirely and then multiplied by the cull mask afterwards, so a lot of extra math is happening. This isn't very evident when this YouTuber plays the game, but when I run the game on my computer, it starts to slow down quite noticeably as soon as I walk into this room.

## Accumulation Buffer
The accumulation buffer is a sort of workspace for more complex graphics operations. It uses values from one or more color buffers to generate effects like antialiasing, motion blur, and blending. Once the buffer is done accumulating from the other buffers, it writes back to the main color buffer. This can be thought of as something like how a photographer can blur images together by taking multiple pictures of a scene without advancing the film: it creates blur for things that move, or, if done correctly, translucent-looking objects in an image, although the computer can do more complicated tasks (see the sketch below).
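
A sketch of the motion-blur use case with made-up numbers: accumulate N renderings of the same pixel with weight 1/N, and the averaged value is what gets written back to the color buffer.

```cpp
#include <array>
#include <cstdio>

// One pixel, single channel, rendered 4 times as an object moves past it.
int main() {
    std::array<float, 4> frames = {0.0f, 0.2f, 0.4f, 0.6f};   // same pixel over 4 sub-frames
    float accum = 0.0f;
    for (float f : frames) accum += f / frames.size();        // accumulate with 1/N weight
    std::printf("blurred pixel value: %.2f\n", accum);         // written back to the color buffer
    return 0;
}
```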

###### Motion Blur Accumulation Buffer Example
https://youtu.be/6zhTNYY8ehM?t=13s


## Normal Buffer

Normals are simply the direction vector perpendicular to a surface.

This buffer essentially just records the normals of the fragments in the scene. These can then be used to calculate lighting, given information about where the light is and the object's color or texture. Additionally, for more realistic lighting one can use material properties like smoothness, reflectivity, and whether the light hitting the surface is colored.

###### Visualization of the normal buffer
![Normal Buffer Example](texture_normal_buffer.png)

In the example, you can see the normals of the object denoted with RGB values: green points toward positive y (up), red toward positive x (right), and blue toward positive z (toward the camera). This is why everything in the middle is generally blue and all the values have a bluish tint: we are looking from the positive z direction. Additionally, a direction halfway between these three axes is generally grey; this is most apparent on the upper-right object with the whitish plane. The colors on the back of the sphere would be similar to those on the front but with much less blue, and there would be a dark spot on the back side.
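
A sketch of the usual normal-to-color mapping implied by the image: each component of a unit normal in [-1, 1] is remapped to [0, 1] (and then typically to 0-255). The struct and function names are invented.

```cpp
#include <cstdio>

struct Vec3 { float x, y, z; };

// Remap a unit normal from [-1, 1] per axis into displayable [0, 1] colors.
Vec3 encodeNormal(Vec3 n) {
    return { n.x * 0.5f + 0.5f, n.y * 0.5f + 0.5f, n.z * 0.5f + 0.5f };
}

int main() {
    Vec3 facingCamera = encodeNormal({0, 0, 1});   // mostly blue, as in the image
    std::printf("R=%.1f G=%.1f B=%.1f\n", facingCamera.x, facingCamera.y, facingCamera.z);
    return 0;
}
```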

This buffer isn't as universally important as the other buffers, because depending on your shader program you might calculate normals on the fly and never store them all in a buffer.

## Clearing Buffers
It's also super important to clear the buffers when needed; if not, some weird visuals can happen. For example (though I'm not sure this is exactly why it happens), one could imagine that if you don't clear the color or depth buffers, you would get a scene that appears to leave a trail behind moving objects. I'm sure some of y'all have experienced a browser window that, when moved, updates its position but doesn't clear its old one, leaving a trail of browser on your screen. (A minimal OpenGL clear is sketched below.)
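
A minimal OpenGL sketch of clearing the color and depth buffers at the start of a frame (the API the linked red-book chapter uses); it assumes a GL context already exists, so it is a fragment of a render loop rather than a complete program.

```cpp
#include <GL/gl.h>

// Called once per frame, before drawing anything.
void beginFrame() {
    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);                  // what "cleared" color means
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);    // reset color and depth
    // ... draw the scene ...
}
```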


## Closing Thoughts

It is important to note that this list of useful graphics buffers is not all-encompassing. There are quite a few other variations of buffers that are useful in different cases; these are just the most generally applicable.


# Resources

OpenGL book explanation on frame buffers can be found [here](http://www.glprogramming.com/red/chapter10.html).
Binary file added notes_william/DepthBufferVisual.png
Binary file added notes_william/DoomExample.png
Binary file added notes_william/FrameBufferPipeline.png
Binary file added notes_william/FrameBufferVisual.gif
Binary file added notes_william/GraphicsMemes.jpg
Binary file added notes_william/WinterWarmth.jpg
Binary file added notes_william/texture_normal_buffer.png
Binary file added notes_william/z-bufferNumberExample.gif
Binary file added siggraph_16_8_parallel.PNG
Binary file added siggraph_cpu_design.PNG
Binary file added siggraph_multi_alu.PNG
Binary file added siggraph_slimmed_design.PNG