Introduction to Halide
Champ Yen
champ.yen@gmail.com
https://tinyurl.com/ubqye3y
Overview of Halide
Why Halide?
Halide's answer: decouples Algorithm from Scheduling
Algorithm: what is computed.
Schedule: where and when it's computed.
Easy for programmers to build pipelines 
• simplifies algorithm code
• improves modularity
Easy for programmers to specify & explore optimizations 
• fusion, tiling, parallelism, vectorization
• can’t break the algorithm
Easy for the compiler to generate fast code
Image Processing Tradeoffs
Experienced engineers always keep PARALLELISM, LOCALITY and REDUNDANT WORK in mind.
Processing Policies/Skills used in image processing coding
bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3
bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1))/3
Breadth-First
Sliding-Window
Fusion
Tiling
Sliding-Window with Tiling
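The two extremes in the list above, breadth-first and fusion, can be sketched in plain C++ (not Halide) for the bh/bv blur. This is a minimal illustration under stated assumptions: integer pixels, reads clamped at the image border, and hypothetical names (`Image`, `bh_at`, `blur_breadth_first`, `blur_fused`) that do not come from the slides. The counter shows the tradeoff: breadth-first evaluates bh once per pixel but needs a full intermediate buffer, while fusion needs no buffer but evaluates bh three times per output pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a row-major grayscale image of ints.
struct Image {
    int w, h;
    std::vector<int> px;
    Image(int w_, int h_) : w(w_), h(h_), px(static_cast<std::size_t>(w_) * h_, 0) {
        assert(w_ > 0 && h_ > 0);
    }
    // Reads clamp to the border (an assumption; the slides do not specify edges).
    int at(int x, int y) const {
        x = std::min(std::max(x, 0), w - 1);
        y = std::min(std::max(y, 0), h - 1);
        return px[static_cast<std::size_t>(y) * w + x];
    }
    int& ref(int x, int y) { return px[static_cast<std::size_t>(y) * w + x]; }
};

// bh(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y)) / 3
int bh_at(const Image& in, int x, int y) {
    return (in.at(x - 1, y) + in.at(x, y) + in.at(x + 1, y)) / 3;
}

// Breadth-first: compute all of bh into a full buffer, then all of bv.
Image blur_breadth_first(const Image& in, long& bh_evals) {
    Image bh(in.w, in.h), bv(in.w, in.h);
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x) {
            bh.ref(x, y) = bh_at(in, x, y);
            ++bh_evals;
        }
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x)
            bv.ref(x, y) = (bh.at(x, y - 1) + bh.at(x, y) + bh.at(x, y + 1)) / 3;
    return bv;
}

// Fusion: no intermediate buffer; bh is recomputed at each of its three uses.
Image blur_fused(const Image& in, long& bh_evals) {
    Image bv(in.w, in.h);
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                sum += bh_at(in, x, y + dy);
                ++bh_evals;
            }
            bv.ref(x, y) = sum / 3;
        }
    return bv;
}
```

Both functions produce the same image; only the amount of recomputation and intermediate storage differs, which is the property a Halide schedule is meant to vary without touching the algorithm.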
Performance - It's All about Scheduling
To optimize is to find a better schedule within the space of valid schedules.
Example – C++, Optimized C++ and Halide
How Halide works
Defining Algorithms & JIT in Halide
Types – Var, Expr, Func and RDom
Func: represents a (schedulable) pipeline stage.
Func gradient;
Var: a name used as a variable in the definition of a Func.
Var x, y;
Expr: a calculation built from Vars, other Exprs and calls to other Funcs.
Expr e = x + y;
Add a definition to the Func object:
gradient(x, y) = e;
RDom: a reduction domain; it computes a value from an area of inputs and behaves like a loop over that area.
RDom r(-1, 3); // MIN, EXTENT
Expr s = sum(f(x + r, y));
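The reduction above can be read as an ordinary loop. The following plain C++ sketch (not Halide) shows that reading, assuming `f` is stored as rows of ints indexed `f[y][x]`; the function name `sum_over_rdom` and its parameters are hypothetical. With a minimum of -1 and an extent of 3, `r` takes the values -1, 0 and 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-C++ reading of: RDom r(-1, 3); sum(f(x + r, y));
// rdom_min and rdom_extent mirror the two RDom constructor arguments.
int sum_over_rdom(const std::vector<std::vector<int>>& f, int x, int y,
                  int rdom_min = -1, int rdom_extent = 3) {
    const std::vector<int>& row = f[static_cast<std::size_t>(y)];
    // The caller must keep the whole window inside the row.
    assert(x + rdom_min >= 0);
    assert(x + rdom_min + rdom_extent <= static_cast<int>(row.size()));
    int acc = 0;  // the sum starts from zero
    for (int r = rdom_min; r < rdom_min + rdom_extent; ++r) {
        acc += row[static_cast<std::size_t>(x + r)];
    }
    return acc;
}
```

For a row `{1, 2, 3, 4, 5}`, the call `sum_over_rdom(f, 2, 0)` adds the entries at x = 1, 2 and 3.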
Advanced things in Functions
Tuple: represents a Func with multiple outputs
Func ops;
ops(x, y) = Tuple(expr_add, expr_sub, expr_mul, expr_div);
float16_t: IEEE754 16-bit float representation
bfloat16_t: truncated 16-bit version of float32
Halide::* : special math or operations, refer to https://halide-lang.org/docs/namespace_halide.html
Expr u8val;
// u8() comes from Halide::ConciseCasts
u8val = u8(clamp(out, 0, 255));
u8val = saturating_cast<uint8_t>(out);
// math: others include ceil, floor, pow, sin/cos/tan ...
Expr logval = log(x);
// select works like “?:” in C, or like switch-case in more complex cases
Expr c_nonneg = select(c < 0, 0, c);
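What `select` and a saturating cast do to a single value can be sketched in plain C++ (not Halide). The helper names `select_scalar`, `saturate_to_u8` and `wrap_to_u8` are hypothetical and exist only for this illustration. The point is the difference between clamping before narrowing and a plain narrowing cast, which wraps around.

```cpp
#include <cassert>
#include <cstdint>

// Per-element reading of select(cond, a, b): cond picks one of two values.
template <typename T>
T select_scalar(bool cond, T if_true, T if_false) {
    return cond ? if_true : if_false;
}

// Clamp to [0, 255] first, then narrow: out-of-range values stick to the ends.
std::uint8_t saturate_to_u8(int v) {
    const int clamped = select_scalar(v < 0, 0, select_scalar(v > 255, 255, v));
    return static_cast<std::uint8_t>(clamped);
}

// A plain narrowing cast keeps only the low 8 bits, so 300 becomes 44.
std::uint8_t wrap_to_u8(int v) {
    return static_cast<std::uint8_t>(v);
}
```

In a brightening pipeline, for example, the wrapped result would turn an overexposed pixel dark, which is why the clamp (or a saturating cast) comes before the conversion to 8 bits.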
JIT Image Processing Example
// load the input image
Buffer<uint8_t> input = load_image("images/rgb.png");
// function used to brighten the image
Func brighter;
// variables used to define brighter function
Var x, y, c;
// 'value' Expr is used to define the procedure of image processing
Expr value = input(x, y, c);
value = Halide::cast<float>(value);
value = value * 1.5f;
value = Halide::min(value, 255.0f);
value = Halide::cast<uint8_t>(value);
// define the function
brighter(x, y, c) = value;
// get output result (newer Halide releases take the sizes as a braced list: realize({w, h, c}))
Buffer<uint8_t> output = 
        brighter.realize(input.width(), input.height(), input.channels());
// save the output to a file
save_image(output, "brighter.png");
Put It All Together! - 3x3 Blur - In JIT
https://github.com/champyen/halide_2019.git
Scheduling in Halide
Scheduling Basics – Default Loop Structure
func_foo(a, b, c, … x, y, z) = …
// a: innermost loop, z: outermost loop
// the default schedule is equivalent to the loop nest below:
for(z = 0; z < Z_MAX; z++){
    for(y = 0; y < Y_MAX; y++){
        for(x = 0; x < X_MAX; x++){
            …
                for(a = 0; a < A_MAX; a++){
                    // computation happens here
                }
            …
        }
    }
}
Scheduling Basics - Reordering
func_foo.reorder(z, y, x, … c, b, a);
// z: innermost loop, a: outermost loop
// the reordered schedule is equivalent to the loop nest below:
for(a = 0; a < A_MAX; a++){
    for(b = 0; b < B_MAX; b++){
        for(c = 0; c < C_MAX; c++){
            …
                for(z = 0; z < Z_MAX; z++){
                    // computation happens here
                }
            …
        }
    }
}
Scheduling Basics - Splitting
func_foo(x, y) = ...
func_foo.split(y, yo, yi, 32);
// the split schedule is equivalent to the loop nest below (y = yo*32 + yi):
for(yo = 0; yo < Y_MAX/32; yo++){
    for(yi = 0; yi < 32; yi++){
        for(x = 0; x < X_MAX; x++){
            //computation is here
        }
    }
}
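The loop nest above assumes that Y_MAX is a multiple of 32. The plain C++ sketch below (not Halide) shows how the original index is rebuilt from `yo` and `yi`, and one simple way to handle an extent that does not divide evenly: round the outer extent up and guard the tail. The guard is an assumption made for illustration; Halide selects its own behavior through its tail strategies. The function name `split_order` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Visits every y in [0, y_max) with a loop split by `factor`
// and returns the order in which the values of y are visited.
std::vector<int> split_order(int y_max, int factor) {
    assert(y_max >= 0 && factor > 0);
    std::vector<int> order;
    const int outer_extent = (y_max + factor - 1) / factor;  // round up
    for (int yo = 0; yo < outer_extent; ++yo) {
        for (int yi = 0; yi < factor; ++yi) {
            const int y = yo * factor + yi;  // rebuild the original index
            if (y >= y_max) break;           // guard the partial last block
            order.push_back(y);
        }
    }
    return order;
}
```

Splitting on its own does not change the visit order; it only exposes two loop variables that later directives (reorder, vectorize, parallel) can act on separately.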
Scheduling Basics - Tiling
func_foo(x, y) = ...
func_foo.tile(x, y, xo, xi, yo, yi, 32, 32);
// the tiled schedule is equivalent to the loop nest below:
for(yo = 0; yo < Y_MAX/32; yo++){
    for(xo = 0; xo < X_MAX/32; xo++){
        for(yi = 0; yi < 32; yi++){
            for(xi = 0; xi < 32; xi++){
                //computation is here
            }
        }
    }
}
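Unlike a plain split, tiling changes the order in which pixels are visited. The plain C++ sketch below (not Halide) records that order for a small image, assuming the extents are exact multiples of the tile size; the function name `tiled_order` is hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the (x, y) visit order of a tiled traversal with tiles of tx by ty.
// Assumes x_max and y_max are multiples of the tile size.
std::vector<std::pair<int, int>> tiled_order(int x_max, int y_max, int tx, int ty) {
    assert(tx > 0 && ty > 0);
    assert(x_max % tx == 0 && y_max % ty == 0);
    std::vector<std::pair<int, int>> order;
    for (int yo = 0; yo < y_max / ty; ++yo)
        for (int xo = 0; xo < x_max / tx; ++xo)
            for (int yi = 0; yi < ty; ++yi)
                for (int xi = 0; xi < tx; ++xi)
                    order.emplace_back(xo * tx + xi, yo * ty + yi);
    return order;
}
```

For a 4x4 image with 2x2 tiles, the traversal finishes the whole top-left tile, (0,0), (1,0), (0,1), (1,1), before moving to (2,0). Keeping nearby rows together like this is what improves locality for stencils such as the 3x3 blur.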
Scheduling Basics - Fuse
func_foo(x, y) = ...
func_foo.fuse(x, y, fidx);
// the fused schedule is equivalent to the loop below:
for(fidx = 0; fidx < X_MAX*Y_MAX; fidx++){
    //computation is here
}
serialized by fidx
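Inside the fused loop the original coordinates still have to be recovered from `fidx`. A minimal plain C++ sketch (not Halide), assuming `x` is the inner (fastest-varying) dimension as in `fuse(x, y, fidx)`; the names `XY`, `fuse_index` and `unfuse` are hypothetical.

```cpp
#include <cassert>

struct XY {
    int x;
    int y;
};

// Combine (x, y) into one index, with x varying fastest.
int fuse_index(int x, int y, int x_max) {
    assert(x_max > 0 && x >= 0 && x < x_max && y >= 0);
    return y * x_max + x;
}

// Recover (x, y) from the fused index.
XY unfuse(int fidx, int x_max) {
    assert(x_max > 0 && fidx >= 0);
    return XY{fidx % x_max, fidx / x_max};
}
```

A single fused loop is useful, for example, when one dimension alone is too short to parallelize well: the fused index gives one longer loop to distribute.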
Scheduling – Vectorize, Parallel
Vectorize
func_foo(x, y) = ...
func_foo.vectorize(x, 8);
// the vectorized schedule is equivalent to the loop nest below:
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x+=8){
        // 8-lane vectorized computation
    }
}
Parallel
func_foo(x, y) = ...
func_foo.parallel(y);
// the parallel schedule is equivalent to the loop nest below:
#pragma omp parallel for
for(y = 0; y < Y_MAX; y++){
    for(x = 0; x < X_MAX; x++){
        // computation happens here
    }
}
compute_at/store_at, compute_root/store_root
● the storage location must be at the same loop level as the computation, or further out
● store_root => the stage/function stores its output in a whole-frame buffer
● compute_root => breadth-first
○ also implies store_root
● store_at(Func, Var)
○ this stage's storage is declared inside Var's loop of Func
● compute_at(Func, Var)
○ this stage is computed inside Var's loop of Func
○ also implies store_at(Func, Var)
● Var::outermost()
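The difference between computing a producer at the root and computing it inside a consumer's loop can be sketched in plain C++ (not Halide). This is an illustration under assumptions, not what Halide generates: the producer is a horizontal 3-tap sum, the consumer a vertical 3-tap sum, borders are clamped, and all names (`producer`, `consumer_compute_root`, `consumer_compute_at_y`) are hypothetical. The second function mimics `compute_at(consumer, y)`: the producer's storage is declared inside the consumer's y loop and holds only the three rows that one output row needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

int clamp_index(int v, int hi) { return std::min(std::max(v, 0), hi); }

// Producer stage: horizontal 3-tap sum with clamped borders.
int producer(const std::vector<int>& in, int w, int h, int x, int y) {
    const int row = clamp_index(y, h - 1) * w;
    return in[row + clamp_index(x - 1, w - 1)] + in[row + x] +
           in[row + clamp_index(x + 1, w - 1)];
}

// compute_root-like: the whole producer frame is computed and stored first.
std::vector<int> consumer_compute_root(const std::vector<int>& in, int w, int h,
                                       std::size_t& producer_storage) {
    assert(in.size() == static_cast<std::size_t>(w) * h);
    std::vector<int> p(in.size());  // whole-frame storage at the root
    producer_storage = p.size();
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) p[y * w + x] = producer(in, w, h, x, y);
    std::vector<int> out(in.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = p[clamp_index(y - 1, h - 1) * w + x] + p[y * w + x] +
                             p[clamp_index(y + 1, h - 1) * w + x];
    return out;
}

// compute_at(consumer, y)-like: producer rows are computed inside the y loop.
std::vector<int> consumer_compute_at_y(const std::vector<int>& in, int w, int h,
                                       std::size_t& producer_storage) {
    assert(in.size() == static_cast<std::size_t>(w) * h);
    std::vector<int> out(in.size());
    producer_storage = static_cast<std::size_t>(3) * w;
    for (int y = 0; y < h; ++y) {
        std::vector<int> rows(producer_storage);  // storage lives in the y loop
        for (int k = 0; k < 3; ++k)
            for (int x = 0; x < w; ++x)
                rows[k * w + x] = producer(in, w, h, x, y - 1 + k);
        for (int x = 0; x < w; ++x)
            out[y * w + x] = rows[x] + rows[w + x] + rows[2 * w + x];
    }
    return out;
}
```

Both variants return the same output. The second trades a smaller, cache-friendly buffer for recomputing each producer row up to three times; hoisting the storage to an outer level while keeping the computation inside the loop is the sliding-window middle ground.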
The Schedule Directives Combinations
Ahead-of-Time (AOT) Workflow
[Diagram: Halide Code + Halide Shared Library (.so) -> CodeGen Executable -> Static Library (.a + .h); Static Library + Function Implement Code + Halide Runtime Buffer (.h) -> Final Executable/Library]
AOT code structure & example
//box_aot.cpp: Box_2x2 DownSample
#include "Halide.h"
using namespace Halide;
class BoxDown2 : public Generator<BoxDown2> {
public:
    // Input/Output types are not specified here; they are set in the code-generation phase.
    Input<Buffer<>> input{"input", 3};
    Output<Func> output{"output", 3};
    void generate() {
        Func clamp_input = BoundaryConditions::repeat_edge(input);
        // Widen each sample to 16 bits so that the four-sample sum cannot overflow
        // for 8-bit inputs (the command below sets input.type=uint8).
        Func wide;
        wide(x, y, c) = cast<uint16_t>(clamp_input(x, y, c));
        output(x, y, c) = cast(output.type(),
                            ((wide(2*x, 2*y, c)+
                            wide(2*x+1, 2*y, c)+
                            wide(2*x, 2*y+1, c)+
                            wide(2*x+1, 2*y+1, c)) >> 2) );
    }
    void schedule() {
        output.vectorize(x, 16).parallel(y);
    }
private:
    Var x, y, c;
};
HALIDE_REGISTER_GENERATOR(BoxDown2, box_down2);
$ clang++ -O3 -fno-rtti -std=c++11 -o box_aot box_aot.cpp $HALIDE_ROOT/tools/GenGen.cpp -I $HALIDE_ROOT/include/ -L $HALIDE_ROOT/bin/ -lHalide -ltinfo -lpthread -ldl
// change target to "arm-64-android" for Android usage
$ LD_LIBRARY_PATH=$HALIDE_ROOT/bin/ ./box_aot -g box_down2 -o ./aot input.type=uint8 output.type=uint8 target=host
AOT code usage
//test.cpp
…
#include "halide_image_io.h"
#include "HalideBuffer.h"
#include "box_down2.h"
…
using namespace Halide::Tools;
using Halide::Runtime::Buffer;
int main(int argc, char** argv)
{
    Buffer<uint8_t> input = load_image(argv[1]);
    Buffer<uint8_t> output(input.width()/2, input.height()/2, input.channels());
    box_down2(input, output);
    save_image(output, "output.png");
}
$ clang++ -fno-rtti -std=c++11 -O3 -o test test.cpp aot/box_down2.a -I aot -I $HALIDE_ROOT/include -lpthread -ldl -ljpeg -ltinfo -lpng -lz
$ ./test input.jpg
More about Runtime Buffer Manipulation
Buffer<uint8_t> buf(width, height); // 2D buffer
// get buffer pointer
unsigned char* buf_ptr = (unsigned char*)(buf.data());
// get ROI buffer object
Buffer<uint8_t> crop_buf = buf.cropped(0, crop_x, crop_w).cropped(1, crop_y, crop_h);
…
// use external memory (from elsewhere, e.g. an OpenCV Mat) for Buffer creation
uint8_t *data = (uint8_t*)malloc(width*height*channels);
Buffer<uint8_t> external_buf(data, channels, width, height);
Put It All Together! - Matrix Multiplication
• https://github.com/champyen/halide_2019
• halide_mm
• Generator
• mm_generator.cpp
• Application
• mm.cpp
Resources
• Halide Official Tutorial
• http://halide-lang.org/tutorials/tutorial_introduction.html
• Halide Site
• http://halide-lang.org/
• Halide GitHub
• https://github.com/halide/Halide
• https://suif.stanford.edu/~courses/cs243/lectures/l14-halide.pdf
• Qualcomm Halide Software (in Hexagon SDK)
• https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools
Q & A