5. DISCLAIMER
•I have
•published academically in these fields
•contributed code / algorithms
•built systems related to these fields
•But I do not consider myself an expert in many of these fields / topics
Source: http://www.inspirationde.com/
16. SMART DAILY
•Intelligent video summary
•Daily email on your past day’s activity
- Compressed domain analysis
- Very fast - totally I/O bound
- Detect events based on activity / sensible motion
19. RELATIONSHIP TO DATA SCIENCE?
•Rich information, lots of data (in terms of bits)
•Unstructured, usually without much context / semantics
•Difficult to process and query
•We are generating them every day
21. History of Computer Vision
Marvin Minsky, MIT
Turing award, 1969
“In 1966, Minsky hired a first-year
undergraduate student and
assigned him a problem to solve
over the summer:
connect a camera to a computer
and get the machine to describe
what it sees.”
Crevier 1993, pg. 88
23. 1960’s: interpretation of synthetic worlds
Larry Roberts
“Father of Computer Vision”
Larry Roberts PhD Thesis, MIT, 1963,
Machine Perception of Three-
Dimensional Solids
[Figure: input image → 2x2 gradient operator → computed 3D model rendered from a new viewpoint]
Slide credit: Steve Seitz
25. 1970’s: some progress on interpreting
selected images
The representation and matching of pictorial
structures
Fischler and Elschlager, 1973
36. ACKNOWLEDGEMENTS
•Many slides and materials borrowed from Jia-Bin Huang, Steve Seitz, Rich Szeliski, Andrew Zisserman, Larry Zitnick. I try to give credit whenever possible but may have a few omissions.
•Rights of illustration, pictures and other relevant materials belong to
their original creators or authors.
39. TOPICS
•Image formation and 2D image processing
•Epipolar geometry and stereo matching
•Structure from motion and tracking
•Stitching and computational photography
•Visual recognition (next talk)
40. REFERENCE BOOK
“Multiple View Geometry in Computer Vision”, Richard Hartley and Andrew Zisserman
•A good book to get started on camera
geometry
•More math heavy but very old school
43. Image formation
Let’s design a camera
• Idea 1: put a piece of film in front of an object
• Do we get a reasonable image?
44. Pinhole camera
Add a barrier to block off most of the rays
• This reduces blurring
• The opening is known as the aperture
• How does this transform the image?
45. Shrinking the aperture
Why not make the aperture as small as possible?
• Less light gets through
• Diffraction effects...
46. Adding a lens
A lens focuses light onto the film
• There is a specific distance at which objects are “in focus”
– other points project to a “circle of confusion” in the image
• Changing the shape of the lens changes this distance
47. Depth of field
Changing the aperture size affects depth of field
• A smaller aperture increases the range in which the object is
approximately in focus
[Flower images at f/5.6 and f/32, from Wikipedia: http://en.wikipedia.org/wiki/Depth_of_field]
48. Modeling projection
The coordinate system
• We will use the pin-hole model as an approximation
• Put the optical center (Center Of Projection) at the origin
• Put the image plane (Projection Plane) in front of the COP
– Why?
• The camera looks down the negative z axis
– we need this if we want right-handed coordinates
49. Modeling projection
Projection equations
• Compute intersection with PP of ray from (x,y,z) to COP
• Derived using similar triangles (on board)
• We get the projection by throwing out the last coordinate:
50. Homogeneous coordinates
Is this a linear transformation?
• no—division by z is nonlinear
Trick: add one more coordinate (homogeneous image coordinates, homogeneous scene coordinates)
Converting from homogeneous coordinates: divide by the last coordinate
51. Perspective Projection
Projection is a matrix multiply using homogeneous coordinates:
divide by third coordinate
This is known as perspective projection
• The matrix is the projection matrix
• Can also formulate as a 4x4 (today’s reading does this)
divide by fourth coordinate
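As a concrete sketch of the matrix form above (NumPy; the focal distance d = 1 and the sample point are assumed values, with the image plane in front of the COP so there is no sign flip):

```python
import numpy as np

d = 1.0  # assumed distance from the COP to the image plane

# 3x4 perspective projection matrix: the last row produces z/d,
# so dividing by the third coordinate yields (d*x/z, d*y/z)
P = np.array([[1.0, 0.0, 0.0,     0.0],
              [0.0, 1.0, 0.0,     0.0],
              [0.0, 0.0, 1.0 / d, 0.0]])

X = np.array([2.0, 4.0, 8.0, 1.0])  # homogeneous scene point (x, y, z, 1)

x_h = P @ X            # homogeneous image coordinates
x = x_h[:2] / x_h[2]   # divide by third coordinate: d*x/z = 0.25, d*y/z = 0.5
```

The division by the third coordinate is exactly the nonlinear step that homogeneous coordinates postpone, which is what lets the projection itself be a single matrix multiply.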
52. Projection equation
• The projection matrix models the cumulative effect of all parameters
• Useful to decompose into a series of operations
The projection equation $\mathbf{x} = \Pi \mathbf{X}$ decomposes as

$$
\begin{bmatrix} sx \\ sy \\ s \end{bmatrix}
=
\begin{bmatrix} * & * & * & * \\ * & * & * & * \\ * & * & * & * \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} -f s_x & 0 & x'_c \\ 0 & -f s_y & y'_c \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsics}}
\underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}}_{\text{projection}}
\underbrace{\begin{bmatrix} R_{3\times3} & 0_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix}}_{\text{rotation}}
\underbrace{\begin{bmatrix} I_{3\times3} & T_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix}}_{\text{translation}}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
$$

(the translation block contains the 3x3 identity matrix I).
Camera parameters
A camera is described by several parameters
• Translation T of the optical center from the origin of world coords
• Rotation R of the image plane
• focal length f, principal point (x’c, y’c), pixel size (sx, sy)
• blue parameters are called “extrinsics,” red are “intrinsics”
• The definitions of these parameters are not completely standardized
– especially intrinsics—varies from one book to another
54. Distortion
Radial distortion of the image
• Caused by imperfect lenses
• Deviations are most noticeable for rays that pass through the
edge of the lens
[Figure: no distortion, pin cushion, barrel]
56. Structure from Motion 20
Camera calibration
Determine camera parameters from known 3D
points or calibration object(s)
1. internal or intrinsic parameters such as focal
length, optical center, aspect ratio:
what kind of camera?
2. external or extrinsic (pose)
parameters:
where is the camera?
How can we do this?
CSE 576, Spring 2008
57. Camera matrix
Fold intrinsic calibration matrix K and extrinsic
pose parameters (R,t) together into a
camera matrix
M = K [R | t ]
(put 1 in lower r.h. corner for 11 d.o.f.)
58. Camera matrix calibration
Directly estimate 11 unknowns in the M matrix
using known 3D points (Xi,Yi,Zi) and measured
feature positions (ui,vi)
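A minimal sketch of this direct estimation (the classic DLT: each known 3D-2D pair gives two linear equations in M's 12 entries, solved up to scale by SVD; the camera matrix and point set here are made-up test data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth camera matrix (3x4), only for generating data
M_true = np.array([[800., 0., 320., 10.],
                   [0., 800., 240., 20.],
                   [0., 0., 1., 2.]])

# Known 3D points (homogeneous) and their measured projections (u_i, v_i)
X = np.hstack([rng.uniform(-1, 1, (12, 3)), np.ones((12, 1))])
x_h = (M_true @ X.T).T
uv = x_h[:, :2] / x_h[:, 2:3]

# Each point contributes two rows: u*(m3.X) = m1.X and v*(m3.X) = m2.X
A = []
for Xi, (u, v) in zip(X, uv):
    A.append(np.hstack([Xi, np.zeros(4), -u * Xi]))
    A.append(np.hstack([np.zeros(4), Xi, -v * Xi]))
A = np.array(A)

# The solution is the right singular vector with the smallest singular value
m = np.linalg.svd(A)[2][-1]
M_est = m.reshape(3, 4)
M_est /= M_est[2, 2]   # fix the overall scale (M only has 11 d.o.f.)
```

With noiseless data the recovered matrix matches `M_true` up to that scale; with real measurements you would solve the same system in a least-squares sense.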
59. Separate intrinsics / extrinsics
New feature measurement equations
Use non-linear minimization
Standard technique in photogrammetry, computer
vision, computer graphics
• [Tsai 87] – also estimates κ1 (freeware @ CMU)
http://www.cs.cmu.edu/afs/cs/project/cil/ftp/html/v-source.html
• [Bogart 91] – View Correlation
60. Multi-plane calibration
Use several images of planar target held at
unknown orientations [Zhang 99]
• Compute plane homographies
• Solve for K^-T K^-1 from the H_k’s
– 1 plane if only f unknown
– 2 planes if (f,uc,vc) unknown
– 3+ planes for full K
• Code available from Zhang and OpenCV
65. These are the estimated extrinsic parameters (camera poses)
66. 360 degree field of view…
Basic approach
• Take a photo of a parabolic mirror with an orthographic lens (Nayar)
• Or buy a lens from a variety of omnicam manufacturers…
– See http://www.cis.upenn.edu/~kostas/omni.html
67. Tilt-shift
Tilt-shift images from Olivo Barbieri
and Photoshop imitations
http://www.northlight-images.co.uk/article_pages/tilt_and_shift_ts-e.html
72. 2D Image filtering
• Linear filtering is a weighted sum/difference of pixel values
• Enhance images
• Denoise, smooth, increase contrast, etc.
• Extract information from images
• Texture, edges, distinctive points, etc.
• Detect patterns
• Template matching
112. Stereo matching 76
Rectification
Project each image onto the same plane, parallel to the line between the camera centers
Resample lines (and shear/stretch) to place lines in
correspondence, and minimize distortion
[Loop and Zhang, CVPR’99]
115. Your basic stereo algorithm
For each epipolar line
For each pixel in the left image
• compare with every pixel on same epipolar line in right image
• pick pixel with minimum match cost
Improvement: match windows
• This should look familiar...
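The basic algorithm above, with the window-matching improvement, can be sketched as follows (a deliberately brute-force NumPy version; the shifted-ramp test images are assumptions for illustration):

```python
import numpy as np

def stereo_sad(left, right, max_disp, w=1):
    """Brute-force stereo: for every left pixel, try every disparity on the
    same epipolar line (image row) and keep the one with the minimum
    sum-of-absolute-differences over a (2w+1)x(2w+1) window."""
    H, W = left.shape
    L = np.pad(left.astype(float), w)   # pad so windows stay in bounds
    R = np.pad(right.astype(float), w)
    disp = np.zeros((H, W), dtype=int)
    for y in range(H):
        for x in range(W):
            costs = [np.abs(L[y:y+2*w+1, x:x+2*w+1] -
                            R[y:y+2*w+1, x-d:x-d+2*w+1]).sum()
                     for d in range(min(max_disp + 1, x + 1))]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic pair: a horizontal ramp shifted left by 3 pixels
left = np.tile(np.arange(16.), (8, 1))
right = np.tile(np.arange(3., 19.), (8, 1))
disp = stereo_sad(left, right, max_disp=5)  # interior disparities come out as 3
```

Real systems replace the inner loops with vectorized cost volumes and smarter aggregation (mean field, graph cuts), but the matching principle is this one.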
116. Depth Map Results
[Figure: input image and depth maps from sum of absolute differences, mean field, and graph cuts]
117. Active stereo with structured light
Project “structured” light patterns onto the object
• simplifies the correspondence problem
[Figure: camera 1 + camera 2 + projector; camera 1 + projector (Li Zhang’s one-shot stereo)]
118. Data Acquisition
• Our custom-built 3D
Scanner:
• 200-400 images captured with
hand-held camera
• Geometry scanned with
structured-light
• Images registered to geometry
• Precise & inexpensive
Chen et al., Light Field Mapping, SIGGRAPH 2002
120. Finding Paths through the World’s Photos
(Photo Tourism / Photosynth)
Snavely et al., Finding Paths through the World's Photos, SIGGRAPH 2008
121. Pose estimation
Once the internal camera parameters are known,
can compute camera pose
[Tsai87] [Bogart91]
Application: superimpose 3D graphics onto video
How do we initialize (R,t)?
122. Structure from motion
Given many points in correspondence across
several images, {(uij,vij)}, simultaneously compute
the 3D location xi and camera (or motion)
parameters (K, Rj, tj)
Two main variants: calibrated, and uncalibrated
(sometimes associated with Euclidean and
projective reconstructions)
123. Structure from motion
How many points do we need to match?
• 2 frames: (R,t) has 5 d.o.f., plus 3n point locations, against 4n point measurements, so 5 + 3n ≤ 4n ⇒ n ≥ 5
• k frames: 6(k–1) – 1 + 3n ≤ 2kn
• always want to use many more
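The counting argument can be checked mechanically (a small sketch; `min_points` is a hypothetical helper name):

```python
def min_points(k):
    """Smallest n with 6(k-1) - 1 + 3n <= 2kn for k frames and n points:
    6 d.o.f. per camera after fixing the first frame, minus 1 for the
    unrecoverable global scale, versus 2 measurements per point per frame."""
    n = 1
    while 6 * (k - 1) - 1 + 3 * n > 2 * k * n:
        n += 1
    return n

# k=2 reproduces the n >= 5 result from the slide; more frames need fewer points
counts = {k: min_points(k) for k in (2, 3, 4)}
```

In practice one uses far more correspondences than the minimum and solves in a least-squares sense, since individual feature measurements are noisy.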
124. Structure [from] Motion
Given a set of feature tracks,
estimate the 3D structure and 3D (camera)
motion.
Assumption: orthographic projection
[Tomasi & Kanade, IJCV 92]
125. Wei-Chao Chen (weichao.chen@gmail.com)
Tracking of Robust Features (SURFTrac)
* Ta, Chen, Gelfand, Pulli, IEEE CVPR 2009 (oral)
} Scale-space tracking + matching of robust features
} ~10 FPS on an N95 mobile phone (compared to 0.5 FPS)
126. Tracking of Robust Features (SURFTrac)
135. Image Stitching 99
Cutout-based de-ghosting
•Select only one image per
output pixel, using spatial
continuity
•Blend across seams using
gradient continuity
(“Poisson blending”)
[Agarwala et al., SG’2004]
Richard Szeliski
136. Cutout-based compositing
Photomontage [Agarwala et al., SG’2004]
• Interactively blend different images:
group portraits
139. Computational Photography 103
Seamless Poisson cloning
Given vector field v (the pasted gradient), find the value of f in the unknown region Ω that optimizes
$$\min_f \iint_\Omega \lVert \nabla f - \mathbf{v} \rVert^2 \quad \text{with } f = f^* \text{ (the background) on } \partial\Omega$$
[Figure: pasted gradient, mask, background, unknown region]
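A 1D analogue makes the optimization concrete (a minimal sketch, not the 2D solver: with both endpoints pinned to the background, the least-squares solution simply spreads the gradient mismatch evenly across the region):

```python
import numpy as np

def poisson_blend_1d(bg_left, bg_right, v):
    """Blend a 1D 'pasted gradient' v into a region whose boundary values
    (bg_left, bg_right) come from the background.  Minimizes
    sum_i (f[i+1] - f[i] - v[i])^2 subject to f[0]=bg_left, f[-1]=bg_right."""
    n = len(v)
    c = (bg_right - bg_left - v.sum()) / n   # spread the mismatch evenly
    steps = v + c
    return bg_left + np.concatenate([[0.0], np.cumsum(steps)])

v = np.array([1.0, -2.0, 1.0, 0.5])   # gradients copied from the source image
f = poisson_blend_1d(10.0, 12.0, v)   # endpoints land exactly on 10.0 and 12.0
```

In 2D there is no closed form like this; the same objective leads to a Poisson equation solved as a sparse linear system, but the behavior is identical: the seam mismatch is absorbed smoothly over the interior.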
142. Interactive Mobile Panorama
} Automatic capture based
on camera motion tracking
(2D)
} On-site interactive
evaluation of panorama
result
} High resolution images for
panorama stitching
Mobile Augmented Reality research at NRC Palo Alto
144. Interactive Mobile Panorama
Poisson blending to generate light and
color globally for the whole panorama
image
Poisson blending to keep details
and avoid blur
[Figure: Poisson blending vs. linear blending; original images and Poisson blending results]
Poisson blending to merge images with
very different lighting and color
Mobile Augmented Reality research at NRC Palo Alto
145. High Dynamic Range Imaging (HDR)
slides borrowed from
15-463: Computational Photography
Alexei Efros, CMU, Fall 2007,
Paul Debevec, and my talks
147. Problem: Dynamic Range
Typical cameras have limited dynamic range
What can we do?
Solution: merge multiple exposures
151. Tone Mapping
[Figure: real-world and ray-traced radiance span roughly 10^-6 to 10^6, while a display/printer covers only 0 to 255 (high dynamic range)]
How can we do this?
Linear scaling? Thresholding? Suggestions?
152. Simple Global Operator
Compression curve needs to
• Bring everything within range
• Leave dark areas alone
In other words
• Asymptote at 255
• Derivative of 1 at 0
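One curve satisfying both constraints is the Reinhard-style mapping y = 255*x/(x + 255) (an illustrative choice, not the slide's specific operator): its derivative at 0 is 1, so dark areas are left alone, and it asymptotes at 255:

```python
def tonemap(x):
    """Simple global tone-mapping operator: y = 255*x / (x + 255).
    dy/dx at 0 is 255*255/255^2 = 1; y approaches 255 as x grows."""
    return 255.0 * x / (x + 255.0)

for x in (1.0, 255.0, 1e6):
    print(x, tonemap(x))
```

Any monotone curve with these two properties (unit slope at the origin, horizontal asymptote at the display maximum) would serve as a simple global operator.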
163. Interactive Local Adjustment of Tonal Values
Dani Lischinski
Zeev Farbman
The Hebrew University
Matt Uyttendaele
Rick Szeliski
Microsoft Research
SIGGRAPH 2006
170. Constraint Propagation
Approximate constraints with a function whose smoothness is determined by the underlying image:
[Equation figure: objective = data term + smoothness term]
174. ACKNOWLEDGEMENTS
•Many slides and materials borrowed from Jia-Bin Huang, Silvio Savarese, Steve Seitz, Rich Szeliski, Andrew Zisserman, Larry Zitnick. I try to give credit whenever possible but may have a few omissions.
•Rights of illustration, pictures and other relevant materials belong to
their original creators or authors.
209. 2005 HOG (histograms of oriented gradients)
For every candidate bounding box:
1. Compute HOG features
2. Linear SVM classifier
3. Non-maximal suppression (NMS)
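A toy version of the HOG feature step (one cell only; the bin count and normalization here are simplified assumptions, not the exact Dalal-Triggs scheme):

```python
import numpy as np

def hog_cell(patch, n_bins=9):
    """Toy HOG cell, for illustration only: a histogram of gradient
    orientations weighted by gradient magnitude, L2-normalized."""
    gy, gx = np.gradient(patch.astype(float))      # image gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 180), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-9)

patch = np.tile(np.arange(8.), (8, 1))  # horizontal ramp: all gradients at 0 deg
h = hog_cell(patch)                     # energy lands in the first bin
```

A real detector concatenates many such cells with overlapping block normalization, then feeds the vector to the linear SVM in the pipeline above.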
212. 2008 DPM (Deformable parts model)
Object Detection with Discriminatively Trained Part Based Model,
Felzenszwalb, Girshick, McAllester and Ramanan, PAMI, 2010
213. 2008 DPM (Deformable parts model)
216. Why it worked
• Multiple components
• Deformable parts?
• Hard negative mining
• Good balance
“How important are ‘Deformable Parts’ in the Deformable Parts Model?”, Divvala, Efros, and Hebert, Parts and Attributes Workshop, ECCV, 2012
261. Neural networks are easily
fooled
Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In CVPR ’15, IEEE, 2015
269. SEARCH BY IMAGE EXAMPLES
•Still very much an open problem
•Personally, I won’t jump in and start a company yet
•Ignore me and dream big, if you wish
•Most commercial applications use a mixture of algorithms
•Similarity, face recognition, instance recognition, OCR/text
270. See the list of “things” it can search
to get an idea about how it is done.
271. Global scale of face recognition
Super hard!
Better luck in social networks
273. Google Image search result (2016)
My guess: from image to meta data
(text), then reissue text-based
search.
274. INSTANCE RECOGNITION
•Works, but the biggest problem is
speed.
•A common approach is the bag-of-words model used in text search
•Replace ‘words’ with ‘image
features’
•Put them in a bag (search
structure)
https://gilscvblog.com/2013/08/23/bag-of-words-models-for-visual-categorization/
275. SEARCH STRUCTURE
•Words are one-dimensional
•Use binary tree
• Features are high-dimensional
•k-d tree slow at higher
dimensions (we have 256!)
source: wikipedia.org
276. POSSIBLE SOLUTIONS
•Find approximate words
•e.g. approximate nearest
neighbour (ANN)
•Find lower dimensional space to
split the data
•e.g. scalable vocabulary tree
https://www.cs.umd.edu/~mount/ANN/
306. ACKNOWLEDGEMENTS
•Many slides and materials borrowed from Jia-Bin Huang, David Nister, Steve Seitz, Rich Szeliski, Andrew Zisserman, Larry Zitnick. I try to give credit whenever possible but may have a few omissions.
•Rights of illustration, pictures and other relevant materials belong to
their original creators or authors.
307. http://www.skywatch24.com
PART 3: GPU AND
COMPUTATION
| Wei-Chao Chen
Co-Founder, Skywatch Inc. | Adjunct Faculty, National Taiwan University
weichao.chen@skywatch24.com
309. PARALLEL PROCESSING & GPU
Many slides adapted from Wei-Chao Chen, for GPU
Programming course at NTU
310. Wei-Chao Chen (weichao.chen@gmail.com)
Parallel Computing Goals
} To solve your problem in less time
} Divide one big problem into smaller pieces
} Solve smaller problems concurrently
} Allows us to solve a bigger problem
} In order to parallelize a problem
} Identify dependencies in the problem
} Identify critical paths in the algorithm
} Modify dependencies to shorten the critical paths
314. Instruction-Level Parallelism
} Multiple instructions in a serial program get executed
simultaneously
} Superscalar, etc
A=A+1
B=B+1
C=C+A
C=C+B
…
(T=1: the dependent instructions fail to issue in the same cycle)
319. Amdahl’s Law
} Named after computer architect Gene Amdahl
} Speedup of a parallel computer is limited by the amount of
serial work
[Figure: a fixed serial portion plus parallelizable work shrinking with 2x, 4x, and many processors]
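Amdahl's law in one line (the 95% parallel fraction below is an example value):

```python
def speedup(p, n):
    """Amdahl's law: p = parallelizable fraction of the work,
    n = number of processors.  Speedup = 1 / ((1 - p) + p/n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel work, speedup can never exceed 1/(1-p) = 20x
for n in (2, 4, 1024):
    print(n, speedup(0.95, n))
```

This is why shortening the critical (serial) path matters more than adding processors once n is large.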
339. GPUs Today
} GPUs are becoming more programmable
} Unified & programmable shaders
} GPUs now support 32/64 bit floating point numbers
} Almost IEEE FP compliant except for some special values
} GPUs have higher memory bandwidth than CPUs
} Multiple memory banks driven by the need of high-performance
graphics
How to make it easier to
program on?
340. NVIDIA CUDA
} “Compute Unified Device Architecture”
} Or simply, “COMPUTE”
} “A General-Purpose Parallel Computing Environment”
} minimal set of C language extensions to harness GPU’s
computational resources
} CUDA toolset includes compiler, SDK and profiler, etc
} “nvcc hello_cuda.cu” vs. “gcc hello_world.c”
344. CUDA Workflow
} Get a CUDA-enabled GPU
} Write C/C++ like code (*.cu)
} Compile with CUDA compiler (nvcc)
} Generates PTX code (“Parallel Thread Execution”)
} Applications auto-magically run on GPUs
} Many many parallel threads
} The CUDA driver translates PTX code into HW instructions
345. CUDA Overview
} CUDA C/C++ language extensions
} Small sets of extensions for writing kernels - subroutines that run multi-threaded on GPUs
} CUDA Programming model abstraction
} For fine-grained data / thread parallelism, including
} Thread group hierarchy
} Shared memories
} Synchronization barriers
346. C/C++ Language Extensions
CPU (host):
int main() {
  …
  cudaMalloc(…)
  cudaMemcpy(…)
  …
  my_kernel<<<nblock, blocksize>>>(…)
  …
  cudaMemcpy(…)
  …
}
GPU (device):
__global__ void my_kernel(…) {
  …
  __shared__ float …
  … blockIdx…
  … threadIdx…
  int i = gpu_func(…)
  …
}
__device__ int gpu_func(…) {
  …
}
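The host/device split above can be mimicked on the CPU to show the indexing logic alone (plain Python, not CUDA; `launch` is a hypothetical stand-in for the `<<<…>>>` launch syntax):

```python
# CPU simulation of CUDA's grid/block indexing: every (block, thread) pair
# handles one element, at index i = blockIdx.x * blockDim.x + threadIdx.x.
def launch(kernel, n_blocks, block_size, *args):
    for block_idx in range(n_blocks):
        for thread_idx in range(block_size):
            kernel(block_idx, block_size, thread_idx, *args)

def add_kernel(block_idx, block_dim, thread_idx, a, b, c):
    i = block_idx * block_dim + thread_idx
    if i < len(c):          # bounds guard, just as in real CUDA kernels
        c[i] = a[i] + b[i]

a = list(range(10))
b = [10] * 10
c = [0] * 10
launch(add_kernel, 3, 4, a, b, c)   # 3 blocks x 4 threads cover 10 elements
```

On a real GPU the two loops run concurrently across SMs, which is why the kernel body may only depend on its own indices and the guard.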
354. CUDA Programming Model Abstraction
} Serial code runs on the host (CPU)
} Parallel code runs on the device (GPU)
host code
device code
kernel<<<nBlocks, nThreads>>>(…)
host code
…
360. Source: NVIDIA
Threads get scheduled
round-robin based on the
number of processors
available in the device
This also means you need
sufficient # of blocks, at
least as many as the # of
SMs, to fill the pipe
366. CUDA Programming Model Abstraction
} Shared memory
[Figure: Thread Block 1 paired with Shared Memory 1; Thread Block 2 paired with Shared Memory 2]
367. CUDA Programming Model Abstraction
} Synchronization Barrier
} SIMT threads launched in a unit called a warp
[Figure: warp #1 and warp #2, both writing]
368. CUDA Programming Model Abstraction
} Synchronization Barrier
} SIMT threads launched in a unit called a warp
} Problems occur when one warp reads from another before it’s finished
[Figure: warp #1 finished; warp #2 still writing]
370. CUDA Programming Model Abstraction
} Synchronization Barrier
} SIMT threads launched in a unit called a warp
} Problems occur when one warp reads from another before it’s finished
} __syncthreads() prevents the read-after-write hazard
} BTW, a warp doesn’t branch
} data-dependent conditional branch only
[Figure: warp #1 finished and waiting; warp #2 still writing]
371. Example: Matrix Multiplication
} C = A x B
} Naïve implementation:
} Read a row of A
} Read a column of B
} Dot product
} Slow!
} Lots of global
memory reads
Source: NVIDIA CUDA Programming Guide
377. Common Program Pattern
1. Load data from device to shared memory
2. Synchronize with all other threads in the same block
3. Process data in the shared memory
4. Synchronize again if necessary
5. Write results back to the device memory
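The steps above can be simulated serially in NumPy with a tiled matrix multiply (a sketch only: the tile size is arbitrary, and the two synchronization steps are implicit because the loops run serially):

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Matrix multiply following the shared-memory pattern: load a tile of
    A and B (step 1), process it locally (step 3), accumulate, and write
    the result back (step 5).  Steps 2 and 4 (barriers) are implicit here."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile))
            for k in range(0, n, tile):
                a = A[i:i+tile, k:k+tile]   # "load into shared memory"
                b = B[k:k+tile, j:j+tile]
                acc += a @ b                # "process data in shared memory"
            C[i:i+tile, j:j+tile] = acc     # "write results back to device"
    return C

rng = np.random.default_rng(1)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(tiled_matmul(A, B), A @ B)
```

On a GPU each tile is loaded once into fast shared memory and reused by a whole thread block, which is exactly what cures the "lots of global memory reads" problem of the naive version.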
381. Simplified Graphics Pipeline (1990s)
Input Processor → Do Geometry Stuff → Do Pixel Stuff → Accumulate Pixel Result
(the accumulate stage acts as a sorting stage: Z-buffer, transparency)
382. Making It Faster (Mid 1990s)
Input Processor → parallel “Do Geometry Stuff” units → parallel “Do Pixel Stuff” units → Accumulate Pixel Result
383. Add Framebuffer Access (Late 1990s)
Input Processor → parallel geometry units → parallel pixel units → Accumulate Pixel Result
FBI (framebuffer interface) and MUX connect the pipeline to multiple memory banks
384. Add Programmability (Early 2000s)
Front End → Do Geometry Stuff (Geometry Shader ALU) → Do Pixel Stuff (Pixel Shader ALU) → Raster Operations
FBI and MUX connect to multiple memory banks
385. Add Programmability (Early 2000s)
These two shader stages look similar:
1. Get internal data
2. Get external data
3. Process data
4. Output data
386. Unified Shader (Mid 2000s)
Front End → THE SHADER (with buffer & MUX) → Raster Operations, plus FBI/MUX to multiple memory banks
Resource booking is important here
- deadlock
- throughput
387. Scaling Up Again
Front End → multiple copies of THE SHADER (with buffer & glue) → Raster Operations, plus FBI/MUX to multiple memory banks
389. John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable Parallel Programming with CUDA. Queue, 6(2):40–53, 2008.
390. Processors (same figure, Nickolls et al. 2008)
391. Sorting & Distribution (same figure)
392. Raster Operations & Memory I/O (same figure)
394. Is GPU SIMD?
} Suppose we have this program:
c[i] = a[i] + b[i]
if (c[i]>pi)
path-A
else
path-B
…
} What happens at the if... statement?
395. Divergence
} Some say yay, some say nay
c[i] = a[i] + b[i]
if (c[i]>pi)
path-A
else
path-B
…
[Figure: the two proposed execution schedules (“yay” vs. “nay”) for the divergent branch]
399. Multiple Programs – GPUs?
} SIMD-like processing with insufficient resources + multiple
programs
} Many people assume this is how GPUs work. But this isn’t the
case.
time
401. SIMT (T for Threads)
} First appeared in NVIDIA Tesla GPU Architecture, 2006
} Jump to different program upon stall, bubble, etc
} BTW, this is not exactly what really happened inside of
GPUs either, but close enough.
402. Summary: GPU Programming
} Easy to start coding
} Relatively simple programming model
} Extension of C++ language
} Difficult to master
} Memory access tend to be the bottleneck
} Optimisation still an art form
} Portability across GPUs still hard
} Use libraries / frameworks whenever possible.
404. ASK STUDENTS TO DO IT (!)
For a less corrupt lifestyle, go online and shop around
https://developer.nvidia.com/gpu-accelerated-libraries
405. MATLAB
•Supports GPUs through Parallel Computing
Toolbox
•Supports multiple processors and distributed
servers as well
•Use parfor for parallel for loops
•Use gpuArray to create an array on GPU
•Use built-in functions with gpuArray args
Look for: “Parallel Computing with MATLAB” presentation on mathworks.com to get started
407. CUDA BLAS LIBRARY
•BLAS - Basic Linear Algebra Subprograms
•The library MATLAB is built on
•Processor vendors implement their BLAS library
• e.g., Intel MKL (Math Kernel Library)
•cuBLAS - CUDA version, very fast
• No need to write your own, unless you are researching the topic
409. NVIDIA THRUST LIBRARY
•A little like C++ STL Library for CUDA
•Very few lines of code for vector manipulation
•Fast implementation of parallel primitives
• reduce
• scan
• sort
Source: https://developer.nvidia.com/thrust
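The sequential definitions of these primitives clarify what Thrust parallelizes (Python stdlib equivalents, shown for semantics only):

```python
from functools import reduce
from itertools import accumulate
import operator

data = [3, 1, 4, 1, 5]

# These have one-line sequential definitions; the point of Thrust is that
# the same semantics run as massively parallel GPU algorithms.
total = reduce(operator.add, data)    # reduce: combine everything into one value
prefix = list(accumulate(data))       # inclusive scan: running partial sums
ordered = sorted(data)                # sort
```

`reduce` and `scan` look inherently serial, but both have well-known logarithmic-depth parallel formulations, which is what makes them the workhorse primitives of GPU computing.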
411. NVIDIA CUDNN
•VERY fast, very hard to beat
•Convolution, array and tensor transformations, etc.
•Supports popular deep learning frameworks
•Caffe, TensorFlow, Torch, CNTK, etc
•Basically, you don’t have to use this directly — just get started with
one of the deep learning frameworks above (e.g., Caffe)
https://developer.nvidia.com/deep-learning-courses
412. DEEP LEARNING
GETTING STARTED ADVICE
•Borrow (steal if you must) a modern GPU
•Use Caffe for your deep learning projects
•http://caffe.berkeleyvision.org/
•Browse through the Caffe Model Zoo and try out the existing (pre-
trained) models (AlexNet, R-CNN and GoogLeNet are free to use)
•http://caffe.berkeleyvision.org/model_zoo.html