This presentation is my review of the paper "Code Translation with Compiler Representations", a state-of-the-art model paper from Meta AI accepted at ICLR 2023.
1. CODE TRANSLATION WITH COMPILER REPRESENTATIONS
(Accepted as a conference paper at ICLR 2023)
Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve
(Meta AI)
Presented by: Gebremedhin G. Maru
Kangwon National University
Programming Language & Machine Learning Lab
March 2, 2023
3. 1) Introduction
• Automatic code translation makes it possible to port old
codebases to new frameworks.
• Limitations of existing NMT for programming languages:
Unreliability.
Failure to translate the semantics of the input
program accurately.
4. Intuition About the Proposed Work
• Leverages information from compiler toolchains (LLVM).
• Uses the compilers' intermediate representation (IR).
• The IR is language-agnostic pseudocode that describes the
semantics of the program (see the sketch below).
• Benefits of the IR:
Helps align embeddings across different programming languages.
Improves the semantic understanding of the code.
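To make this concrete, here is what the IR looks like for a tiny function (a hand-made sketch; the exact IR varies with compiler version and flags):

  // add.rs -- compile with: rustc -O --crate-type=lib --emit=llvm-ir add.rs
  #[no_mangle]
  pub fn add(a: i32, b: i32) -> i32 {
      a + b
  }
  // The LLVM IR rustc emits for this function is roughly:
  //   define i32 @add(i32 %a, i32 %b) {
  //     %0 = add i32 %b, %a
  //     ret i32 %0
  //   }
  // An equivalent C++ function, int add(int a, int b) { return a + b; },
  // compiled with clang++ -O1 -S -emit-llvm, lowers to essentially the same
  // IR: a shared representation the model can use to align both languages.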
6. Contributions of the Paper
• IR-augmented translation (using LLVM).
Average improvement of 5.5%.
• Especially useful in low-data situations.
E.g., 29.7% and 25.6% improvements when translating to and from Rust.
• Extends the test set of 852 functions from TransCoder (Roziere et al.,
2020) with 343 Go and 280 Rust functions.
• Achieves 78% accuracy when decompiling LLVM IR to C++.
7. 2) Intermediate Representations in Compilers
• Compilers translate source code into a machine-specific executable (machine code).
Fig 2: Compiler toolchain with LLVM.
8. Why Use an IR?
• Translation involves both analysis and synthesis requirements.
• To create machine-independent representations and
optimizations.
• Low-resource programming languages can benefit from
IR-augmented code representations.
9. 3) Training Objectives
• Unsupervised NMT:
Learning multilingual sequence embeddings.
Aligning the embeddings and generating an output from them.
• Source sentence $x = x_1 \dots x_{N_{so}}$,
• Corresponding IR $z(x) = z(x)_1 \dots z(x)_{N_{ir}}$,
• Target sentence $y = y_1 \dots y_{N_{ta}}$.
• The machine translation (seq2seq) loss is the standard cross-entropy:
$\mathcal{L}_{MT} = -\sum_{i=1}^{N_{ta}} \log P(y_i \mid y_1 \dots y_{i-1}, x)$
10. 3.1 Common Objective Functions
• Masked Language Modeling (MLM):
Trains an encoder to predict randomly masked inputs,
where mask(x) denotes the masked version of the code sentence x,
and enc(t) the encoder output.
• Denoising Auto-Encoding (AE):
Retrieves the original sequence from a corrupted version,
where noise(x) denotes the corrupted version of x.
• Back-Translation (BT):
Generates a noisy translation of the input sentence, then recovers the original input from that translation.
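These three losses can be sketched in the notation of slide 9 (standard UNMT formulations, assumed rather than quoted from the paper):

  \begin{align*}
  \mathcal{L}_{MLM} &= -\sum_{i \in \mathrm{masked}} \log P\big(x_i \mid \mathrm{enc}(\mathrm{mask}(x))\big) \\
  \mathcal{L}_{AE}  &= -\sum_{i=1}^{N_{so}} \log P\big(x_i \mid x_1 \dots x_{i-1},\, \mathrm{noise}(x)\big) \\
  \mathcal{L}_{BT}  &= -\sum_{i=1}^{N_{so}} \log P\big(x_i \mid x_1 \dots x_{i-1},\, \hat{y}(x)\big), \quad \hat{y}(x)\ \text{a model translation of}\ x
  \end{align*}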
11. 3.2 IR for Code Representations
• The IR adds information about the code being translated to the
training data, through three new objective functions (sketched after this list):
• Translation Language Modeling (TLM):
Generates common representations for parallel sentences in different
languages.
• Translation Auto-Encoding (TAE):
Transposes the TLM objective into a denoising auto-encoder.
• IR Generation (MT):
Trains the model to translate the source code into the corresponding IR.
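Analogously, a sketch of these objectives, writing $[x; z(x)]$ for the code concatenated with its IR (my notation and formulations, assumed rather than quoted):

  \begin{align*}
  \mathcal{L}_{TLM} &= -\sum_{i \in \mathrm{masked}} \log P\big([x;z(x)]_i \mid \mathrm{enc}(\mathrm{mask}([x;z(x)]))\big) \\
  \mathcal{L}_{TAE} &= -\sum_{i} \log P\big([x;z(x)]_i \mid [x;z(x)]_1 \dots [x;z(x)]_{i-1},\, \mathrm{noise}([x;z(x)])\big) \\
  \mathcal{L}_{MT}  &= -\sum_{i=1}^{N_{ir}} \log P\big(z(x)_i \mid z(x)_1 \dots z(x)_{i-1},\, x\big)
  \end{align*}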
12. Figure 3: IR for code representation objectives.
13. 3.3 Additional Losses: IR Decompilation and Pivot
• The IR is used in two additional ways in this study:
I. IR decompilation:
Predict the source code from its IR; this reverses the compiler's work.
II. IR pivot translation:
Decompile the shared IR format into one of the target languages,
using the neural decompiler (sketched below).
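A structural sketch of the pivot pipeline; both helper functions below are stand-ins I made up for illustration, not the paper's code:

  // Conceptual IR pivot pipeline (helper bodies are stubs, not real tools).
  fn compile_to_llvm_ir(cpp_source: &str) -> String {
      // Stand-in for the real compiler front-end (e.g. clang++ -S -emit-llvm).
      format!("; LLVM IR compiled from: {cpp_source}")
  }

  fn neural_decompile(ir: &str, target_lang: &str) -> String {
      // Stand-in for the trained neural decompiler.
      format!("// {target_lang} code decompiled from: {ir}")
  }

  fn ir_pivot(cpp_source: &str) -> String {
      let ir = compile_to_llvm_ir(cpp_source); // deterministic compiler step
      neural_decompile(&ir, "Rust")            // learned seq2seq step
  }

  fn main() {
      println!("{}", ir_pivot("int add(int a, int b) { return a + b; }"));
  }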
14. 4) Data
4.1 Training Data
• Google BigQuery:
Indexes over 2.8 million open-source repositories from GitHub.
All individual C++, Java, Rust, and Go functions were extracted.
• CodeNet dataset:
A repository of 14 million competitive-programming solutions in 55
languages.
Used for IR decompilation.
16. 4.3 Evaluation
• The computational-accuracy test suite used in TransCoder (Roziere et al.,
2020) is adopted and extended:
852 parallel functions in C++, Java, and Python from Roziere et al. (2020).
An additional 280 Rust and 343 Go functions were created as test sets
in this work.
17. 5) Results
5.1 Experimental Details
• The model has 12 layers (6 in the encoder and 6 in the decoder),
• 8 attention heads, and a dimension of 1024.
• 15% of tokens are masked in the MLM and TLM objectives.
• 20% of tokens are masked in the AE and TAE objectives.
• All objectives except MLM are trained at the function level.
21. 5.2 IR-Augmented Code Representations for Translation
• The best average performance is obtained by leveraging the IR (Table 2).
• Compared to the TransCoder baseline, the average improvement is 5.5%.
• Translations from and into Rust (a low-data language) improve by 25.6%.
• While the IR-augmented objectives (TLM, TAE, and MT) translate well,
the IR pivot method performs comparatively poorly.
• The model generates embeddings that better capture token semantics
(see slide 23).
22. Figure 6: Java to Rust translation examples.
The Java bitwise-complement operator ~ becomes ! in Rust.
A signed int in Java becomes an i32 in Rust.
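For instance, a faithful Rust translation of a Java complement function must swap the operator (a hand-written illustration of the mapping shown in Figure 6, not an actual model output):

  // Java source:      int complement(int x) { return ~x; }
  // Rust translation:
  pub fn complement(x: i32) -> i32 {
      !x // in Rust, ! is bitwise NOT on integers, playing the role of Java's ~
  }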
23. Figure 10: Token similarities. Rank and token similarity with u32 for this
model (right) and the baseline model (left).
25. 6) Discussion
• Different IRs and interpreted languages:
Although the four languages studied (C++, Java, Go, and Rust) are compiled,
IRs are available for interpreted languages too. The front-ends of a
language pair must use the same IR.
• Pivot vs. embedding:
The pivot method learns to translate using only IR-level similarities;
it uses the source code only to compute the IR.
Adding the TLM, TAE, and MT objectives to the three UNMT objectives lets the
model learn multilingual representations of source code from similarities in the
IR and in the source code itself.
• Using the model at inference time:
The TLM, TAE, and MT objectives are used only during training to improve the
multilingual code representation; at test time the process is the same as
TransCoder's.
26. Pivot Method Issues: IR Dialects
Solution:
• One decoder per target language.
• Use back-translation so the model learns to translate from any IR dialect into any
language:
I. Embeddings for every IR dialect (IR-C++, IR-Go, IR-Java, IR-Rust, one per
source language).
II. Noisy translations (e.g., IR-Go, IR-Java, and IR-Rust for every C++ sequence).
III. Then train the model to regenerate the C++ sequences from these noisy
translations.
27. 7) Conclusion
• LLVM IRs improve code translation.
• The IR provides a semantically rich compiled language.
• Three new objectives (TLM, TAE, and MT) lead to a 5.5%
average translation improvement.
• The seq2seq transformer proves effective at decompilation.
• The approach can be extended to any language pair that shares a common IR.
• In future work, IRs could be generated by compiling entire projects, addressing
the current limitation on source and target sequences.
A compiler consists of:
Front-end: takes source code as input.
Lexes (tokenizes) and parses the program, then produces an AST.
Translates the AST into IR.
Middle-end: performs optimizations on the IR (independent of the source language and target machine), e.g.:
Constant folding (see the sketch after this list).
Dead-code elimination and storage reduction.
Back-end: produces the machine binary code.
Converts the IR into machine-specific executable code.
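A minimal constant-folding sketch (the IR shown is simplified; exact output depends on compiler version and flags):

  // fold.rs -- compile with: rustc -O --crate-type=lib --emit=llvm-ir fold.rs
  #[no_mangle]
  pub fn area() -> i32 {
      2 * 3 // the middle-end evaluates this product at compile time
  }
  // The optimized IR contains no multiply instruction at all:
  //   define i32 @area() {
  //     ret i32 6
  //   }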
To build retargetable compilers:
We can build new back-ends for an existing front-end (making the source language more portable across machines).
We can build a new front-end for an existing back-end (so a new machine can quickly get a set of compilers for different source languages).
We only have to write 2n half-compilers instead of n(n−1) full compilers. (Though this might be a bit of an exaggeration in practice!)
IR decompilation consists of recovering the source code corresponding to a given IR; in practice, it reverses the computations performed by the compiler. IR pivot is a translation method built on top of IR decompilation: since LLVM can compile many languages (C++, Java, Rust, Go) into the same IR, an obvious approach to code translation is to decompile the IR generated from the source language into code in the target language.