CODE TRANSLATION WITH COMPILER REPRESENTATIONS
(Accepted as a conference paper at ICLR 2023)
Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve
(Meta AI)
Presented by: Gebremedhin G. Maru
Kangwon National University
Programming Language & Machine Learning Lab
March 2, 2023
Presentation Outline
1) Introduction.
2) Intermediate Representations In Compilers
3) Training Objectives.
4) Data.
5) Results.
6) Discussion.
7) Conclusion.
1) Introduction
• Automatic code translation makes it possible to port old
codebases to new frameworks.
• Limitation of existing neural machine translation (NMT) for programming languages (PL):
unreliability.
 Existing models often fail to translate the semantics of the input
program accurately.
Intuition About the Proposed Work
• Leverages information from compiler toolchains (LLVM).
• Uses compilers' intermediate representations (IR).
• IR is language-agnostic pseudocode that describes the
semantics of the program.
• Benefits of IR:
 Helps align embeddings across different programming languages.
 Improves the semantic understanding of the code.
Motivational Example
Figure 1: Improvements over TransCoder.
Contributions of the paper
• IR-augmented translation (using LLVM).
 Average improvement of 5.5%.
• Particularly useful in low-data settings.
 E.g., 29.7% and 25.6% improvements when translating to and from Rust, respectively.
• Extends the TransCoder test set of 852 functions (Roziere et al.,
2020) with 343 Go functions and 280 Rust functions.
• Achieves 78% accuracy when decompiling LLVM IRs to C++.
2) Intermediate Representations in Compilers
• Compilers translate source code into machine-specific executables (machine code).
Fig 2: Compiler toolchain with LLVM.
Why use an IR?
• An IR supports the analysis and synthesis phases of
translation.
• It enables machine-independent representations and
optimizations.
• Low-resource programming languages can
benefit from IR-augmented code representations.
3) Training Objectives.
• Unsupervised NMT.
 Learning multilingual sequence embeddings.
 Aligning the embeddings and generating an output from these
embeddings.
• Source sentence: $x = x_1 \dots x_{N_{so}}$,
• Corresponding IR: $z^{(x)} = z^{(x)}_1 \dots z^{(x)}_{N_{ir}}$,
• Target sentence: $y = y_1 \dots y_{N_{ta}}$.
• The machine translation loss (seq2seq loss) function is defined on these sequences.
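The slide shows this loss as an image. A standard way to write a seq2seq loss in the notation above is the negative log-likelihood of the target tokens. This is a hedged reconstruction rather than a quotation of the slide: the name $\mathcal{L}_{MT}$, the model parameters $\theta$ and the target prefix $y_{<\ell}$ are symbols assumed here.

```latex
\mathcal{L}_{MT}(x, y) = -\sum_{\ell=1}^{N_{ta}} \log P\left(y_\ell \mid x,\; y_{<\ell};\; \theta\right)
```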
3.1 Common Objective Functions
• Masked Language Modeling (MLM):
 Trains an encoder to predict randomly masked inputs.
 Here mask(x) is a masked version of the code sentence x, and enc(t) is the encoder
output.
• Denoising Auto-Encoding (AE):
 Trains the model to retrieve an original sequence from a corrupted version.
 Here noise(x) denotes the corrupted version of x.
• Back-Translation (BT):
 Generates a noisy translation of the input sentence, then recovers the original input from that translation.
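To make mask(x) and noise(x) concrete, here is a minimal Python sketch of two corruption functions on a tokenized function. It is an illustration under assumptions, not the presented implementation: the `<mask>` symbol, the default rates, and the choice of token dropping as the only kind of noise are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def mask(tokens, rate=0.15, seed=0):
    """Replace a random subset of tokens (about `rate` of them) with MASK."""
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(rate * len(tokens))))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def noise(tokens, drop_rate=0.1, seed=0):
    """Corrupt a sequence by randomly dropping tokens (one simple kind of noise)."""
    rng = random.Random(seed)
    kept = [tok for tok in tokens if rng.random() >= drop_rate]
    return kept or tokens[:1]  # never return an empty sequence for non-empty input


tokens = "int add ( int a , int b ) { return a + b ; }".split()
masked = mask(tokens)   # same length, some tokens replaced by MASK
noisy = noise(tokens)   # a subsequence of the original tokens
```

An MLM-style objective would ask the encoder to predict the original tokens at the masked positions, while an AE-style objective would ask the full model to regenerate `tokens` from `noisy`.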
3.2 IR for Code Representations
• IRs add information about the code to be translated to the training data
through three new objective functions.
• Translation Language Modeling (TLM):
 Generates common representations for parallel sentences in different
languages.
• Translation Auto-Encoding (TAE):
 Transposes the TLM objective into a denoising auto-encoder.
• IR Generation (MT):
 Trains the model to translate the source code into the corresponding IR.
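The three objectives differ mainly in how a (source code, IR) pair is turned into a model input and a target. The sketch below shows one plausible construction in Python. It rests on assumptions: the `<sep>` separator, the `Example` container, and the toy corruption functions are hypothetical, and treating the code and its IR as one concatenated "parallel" sequence is my reading of the bullets above.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[str]
SEP = "<sep>"  # hypothetical separator between code and IR


@dataclass
class Example:
    objective: str
    model_input: Tokens
    target: Tokens


def build_examples(code: Tokens, ir: Tokens,
                   mask: Callable[[Tokens], Tokens],
                   noise: Callable[[Tokens], Tokens]) -> List[Example]:
    """Build one training example per IR-based objective for a (code, IR) pair."""
    pair = code + [SEP] + ir
    return [
        # TLM: mask the concatenated pair and predict the masked tokens.
        Example("TLM", mask(pair), pair),
        # TAE: corrupt both halves, then regenerate the clean pair.
        Example("TAE", noise(code) + [SEP] + noise(ir), pair),
        # MT (IR generation): translate the source code into its IR.
        Example("MT", code, ir),
    ]


# Toy corruption functions, for illustration only.
def toy_mask(tokens: Tokens) -> Tokens:
    return ["<mask>" if i % 3 == 0 else t for i, t in enumerate(tokens)]


def toy_noise(tokens: Tokens) -> Tokens:
    return tokens[1:]


code = "int inc ( int a ) { return a + 1 ; }".split()
ir = "define i32 @inc ( i32 %a ) { %r = add i32 %a , 1 ret i32 %r }".split()
examples = build_examples(code, ir, toy_mask, toy_noise)
```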
Figure 3: IR for code representation objectives.
3.3 Additional Losses: IR Decompilation and Pivot
• This study uses IR in two further ways:
I. IR decompilation:
 Predicts source code from IR, reversing the compiler's task.
II. IR pivot translation:
 Compiles the source program to IR, then decompiles that IR into
the target language.
 Uses a neural decompiler.
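The two uses of IR above can be sketched as two small Python functions. This is a schematic under assumptions: `compile_to_ir` and `decompile` are hypothetical callables standing in for a real compiler front end and a trained neural decompiler, and the dictionaries below are toy stand-ins.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Compiler = Callable[[str], Optional[str]]  # returns None when compilation fails


def decompilation_pairs(functions: Iterable[str],
                        compile_to_ir: Compiler) -> List[Tuple[str, str]]:
    """Build (IR, source) training pairs for a decompiler."""
    pairs = []
    for source in functions:
        ir = compile_to_ir(source)
        if ir is not None:  # skip functions that do not compile
            pairs.append((ir, source))
    return pairs


def pivot_translate(source: str, compile_to_ir: Compiler,
                    decompile: Callable[[str], str]) -> str:
    """Translate by compiling to IR, then decompiling into the target language."""
    ir = compile_to_ir(source)
    if ir is None:
        raise ValueError("source did not compile, so there is no IR to pivot on")
    return decompile(ir)


# Toy stand-ins for a compiler and a target-language decompiler.
TOY_COMPILER = {"int inc(int a) { return a + 1; }": "add i32 %a, 1"}
TOY_DECOMPILER = {"add i32 %a, 1": "fn inc(a: i32) -> i32 { a + 1 }"}

cpp = "int inc(int a) { return a + 1; }"
pairs = decompilation_pairs([cpp, "not valid code"], TOY_COMPILER.get)
rust = pivot_translate(cpp, TOY_COMPILER.get, TOY_DECOMPILER.__getitem__)
```

Note that the pivot path never shows the target model any source-language code: the IR is its only input.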
4) Data.
4.1 Training Data.
• Google BigQuery
 Indexes over 2.8 million open-source repositories from GitHub.
 Extracted all individual C++, Java, Rust and Go functions.
• CodeNet dataset
 Repository of 14 million competitive programming solutions in 55
languages.
 Used for IR decompilation.
4.2 Generating Intermediate Representations
• clang: C++ (LLVM compilation toolchain).
• JLang: Java.
• Gollvm: Go.
• rustc: Rust.
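As an illustration of what generating IR looks like for two of these tools, the Python sketch below builds the command lines that ask clang++ and rustc to emit textual LLVM IR. The flags shown (`-S -emit-llvm` for clang++, `--emit=llvm-ir` for rustc) are standard options of those compilers, but the helper `ir_command`, the file names, and the absence of optimization flags are assumptions; the slides do not show the exact invocations, and JLang and Gollvm are left out for that reason.

```python
from pathlib import Path
from typing import List


def ir_command(source: Path, out: Path) -> List[str]:
    """Return a command line that emits textual LLVM IR for a C++ or Rust file."""
    if source.suffix in {".cpp", ".cc"}:
        # clang++: -S with -emit-llvm writes human-readable LLVM IR (.ll).
        return ["clang++", "-S", "-emit-llvm", str(source), "-o", str(out)]
    if source.suffix == ".rs":
        # rustc: --emit=llvm-ir writes LLVM IR; a lib crate needs no main().
        return ["rustc", "--emit=llvm-ir", "--crate-type=lib",
                str(source), "-o", str(out)]
    raise ValueError(f"no IR command sketched for {source.suffix!r} files")


cpp_cmd = ir_command(Path("inc.cpp"), Path("inc.ll"))
rust_cmd = ir_command(Path("inc.rs"), Path("inc.ll"))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the corresponding compiler is installed.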
4.3 Evaluation
• The computational accuracy test suite from TransCoder (Roziere et al.,
2020) is reused and extended.
 852 parallel functions in C++, Java and Python from Roziere et al. (2020).
 This work adds 343 Go functions and 280 Rust functions
as test sets.
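Computational accuracy counts a translation as correct when it produces the expected outputs on the unit tests of its function. A minimal Python sketch of that metric follows. It is a simplification under assumptions: translated functions are represented here as Python callables, whereas a real harness would compile and run the generated code, and the names `passes` and `computational_accuracy` are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (arguments, expected output)


def passes(fn: Callable[..., Any], cases: List[TestCase]) -> bool:
    """A translation passes if it returns the expected output on every test."""
    try:
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False  # a crash counts as a failed translation


def computational_accuracy(translations: Dict[str, Callable[..., Any]],
                           tests: Dict[str, List[TestCase]]) -> float:
    """Fraction of functions whose translation passes all of its unit tests."""
    if not translations:
        return 0.0
    n_ok = sum(passes(fn, tests[name]) for name, fn in translations.items())
    return n_ok / len(translations)


tests = {"inc": [((1,), 2), ((-1,), 0)], "half": [((4,), 2), ((0,), 0)]}
translations = {
    "inc": lambda a: a + 1,    # correct translation
    "half": lambda a: a // 0,  # buggy translation: always raises
}
score = computational_accuracy(translations, tests)
```

With one correct and one crashing translation, `score` is 0.5.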
5 Results
5.1 Experimental Details.
• The model has 12 layers (6 in the encoder and 6 in the decoder),
8 attention heads, and a dimension of 1024.
• 15% of tokens are masked in the MLM and TLM objectives.
• 20% of tokens are masked in the AE and TAE objectives.
• Except for MLM, all objectives are trained at the function level.
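These hyperparameters can be collected into a small configuration object. The Python sketch below is only a convenience illustration: the class and field names are hypothetical, and `head_dim` assumes the model dimension is split evenly across attention heads, as is usual for Transformers but not stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    encoder_layers: int = 6
    decoder_layers: int = 6
    attention_heads: int = 8
    embedding_dim: int = 1024
    mask_rate_mlm_tlm: float = 0.15  # MLM and TLM objectives
    mask_rate_ae_tae: float = 0.20   # AE and TAE objectives

    @property
    def total_layers(self) -> int:
        return self.encoder_layers + self.decoder_layers

    @property
    def head_dim(self) -> int:
        # Assumption: the dimension is divided evenly among the heads.
        return self.embedding_dim // self.attention_heads


config = ModelConfig()
```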
Translation Results
Table 2: Translation performance (CA@1), for greedy decoding and beam size 5.
Cont’d
Table 4: Translation results with different beam sizes.
Decompilation Results
Table 5: Performance of LLVM IR decompilation: the model outperforms RetDec on
C++.
5.2 IR-Augmented Code Representations for Translation
• Leveraging IR gives the best average performance (Table 2).
• Compared to the TransCoder baseline, performance improves by
5.5% on average.
• Translations from and into Rust (a low-data language) improve by 25.6% and 29.7%, respectively.
• Although the IR-augmented objectives (TLM, TAE and MT) translate
well, the IR pivot method performs relatively poorly.
• The model generates embeddings that better capture token semantics (see
Figure 10).
Figure 6: Java to Rust translation examples.
The Java bitwise complement operator ~ corresponds to ! in Rust.
A signed int in Java corresponds to i32 in Rust.
Figure 10: Token similarities. Rank and token similarity with u32 for this
model (right) and the baseline model (left).
Table 7: Reduction of Rust error types.
6. Discussion
• Different IRs and interpreted languages:
 Although the 4 languages studied (C++, Java, Go and Rust) are compiled, IRs are also available
for interpreted languages. The front ends of both languages in a pair should produce the same IR.
• Pivot vs. embedding:
 The pivot method learns to translate using only IR-level similarities; it uses
source code only to compute the IR.
 Adding the TLM, TAE and MT objectives to the 3 UNMT objectives (MLM, AE and BT) enables the
model to learn multilingual representations of source code from similarities in both the
IR and the source code itself.
• Using the model at inference time:
 The TLM, TAE and MT objectives are used only during training to improve
multilingual code representations; at test time the process is the same as for
TransCoder.
Pivot Method Issues: IR Dialects
Solution:
• One decoder per target language.
• Use back-translation so that the model can translate from any IR dialect to any
language:
I. Create an embedding for every IR dialect (IR-C++, IR-Go, IR-Java, IR-Rust, one per source
language).
II. Generate noisy translations into the other IR dialects (e.g., IR-Go, IR-Java and IR-Rust for every C++ sequence).
III. Train the model to regenerate the C++ sequences from the noisy
translations.
7. Conclusion
• LLVM IRs are used to improve code translation.
• IR provides a semantically rich representation of compiled languages.
• Three objectives (TLM, TAE and MT) lead to a 5.5%
average translation improvement.
• A seq2seq transformer proves effective for decompilation.
• The approach can be extended to any pair of languages that share a common IR.
• In future work, IRs could be generated by compiling entire projects, to address the
current limitation on source and target sequences.
Thank You & Questions?



Editor's Notes

  1. A compiler consists of three parts. Front end: takes source code as input, lexes (tokenizes) and parses the program to produce an AST, then translates the AST to IR. Middle end: performs optimizations on the IR (independent of the source language and target machine), such as constant folding, dead-code analysis and storage reduction. Back end: converts the IR into machine-specific executable code (machine binary code).
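Constant folding, one of the middle-end optimizations named above, can be illustrated in a few lines of Python. This sketch folds constants on Python's own AST using the standard `ast` module, which is a swapped-in setting chosen for brevity: a real middle end such as LLVM's performs the same kind of rewrite on its IR, not on a source-level AST. The class and function names are hypothetical.

```python
import ast
import operator

# Arithmetic operators this sketch knows how to fold.
_FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace binary operations on two numeric literals with their result."""

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)  # fold the operands first (bottom-up)
        fold = _FOLDABLE.get(type(node.op))
        operands = (node.left, node.right)
        if fold and all(isinstance(o, ast.Constant)
                        and isinstance(o.value, (int, float)) for o in operands):
            value = fold(node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=value), node)
        return node


def fold_constants(source: str) -> str:
    """Parse `source`, fold constant sub-expressions, and return new source."""
    tree = ConstantFolder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


folded = fold_constants("x = 2 * 3 + y")
```

Here `2 * 3` is computed at "compile time", so the folded program only has to add `y` at run time.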
  2. To build retargetable compilers: we can build new back ends for an existing front end (making the source language more portable across machines), and we can build a new front end for an existing back end (so a new machine can quickly get a set of compilers for different source languages). We only have to write 2n half-compilers instead of n(n−1) full compilers. (Though this might be a bit of an exaggeration in practice!)
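As a quick worked example of the count in this note, take the two formulas as stated and plug in n = 4 (the number of languages in the presentation) and, as an arbitrary larger case, n = 10:

```latex
\begin{aligned}
n = 4:  &\quad 2n = 8  \text{ half-compilers}, &\quad n(n-1) = 12 \text{ full compilers} \\
n = 10: &\quad 2n = 20 \text{ half-compilers}, &\quad n(n-1) = 90 \text{ full compilers}
\end{aligned}
```

The gap grows quickly because the first count is linear in n while the second is quadratic.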
  3. IR decompilation consists in recovering source code corresponding to a given IR. In practice, it reverses the computations performed by the compiler. IR Pivot is a translation method built upon IR decompilation. Since LLVM can compile many languages (C++, Java, Rust, Go) into the same IR, an obvious approach to code translation consists in decompiling the IR generated from the source language into code in the target language.