My slides from "Inside PHP", a talk about how to change the syntax of the PHP programming language.
Modified PHP 5.4.4 source code (with the "until" keyword added during this presentation) is available here:
http://github.com/thomaslee/oscon2012-inside-php
2. Overview
• About me!
• New Relic’s PHP Agent escapee.
• Now on New Projects, doing unspeakably un-PHP things.
• Wannabe compiler nerd.
• Terminology & brief intro to compilers:
• Grammars, Scanners & Parsers
• General architecture of a bytecode compiler
• Hands on: Modifying the PHP language
• PHP/Zend compiler architecture & summary
• Case study in adding a new keyword
3. “Zend” vs. “Zend Engine” vs. “PHP”
• I will use all of these interchangeably throughout this talk.
• Referring to the bytecode compiler in the “Zend Engine 2” in most cases.
• The distinction doesn’t really matter here.
4. Compilers 101: Scanners
• Or lexical analyzers, or tokenizers
• Input: raw source code
• Output: a stream of tokens
• e.g. while ($x == $y) becomes:
T_WHILE
'('
T_VARIABLE("x")
T_IS_EQUAL
T_VARIABLE("y")
')'
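To make the scanner's job concrete, here is a minimal sketch in Python. It is a toy and not how the Zend scanner is built: the names scan and TOKEN_RULES are made up for illustration, and it only knows the handful of tokens on this slide.

```python
import re

# Toy rules for this one example; a real scanner has many more.
# Each entry is (token name, compiled pattern), tried in order.
TOKEN_RULES = [
    ("T_WHILE", re.compile(r"while\b")),
    ("T_VARIABLE", re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")),
    ("T_IS_EQUAL", re.compile(r"==")),
    ("CHAR", re.compile(r"[()]")),        # single characters stand for themselves
    ("WHITESPACE", re.compile(r"\s+")),   # matched, then thrown away
]


def scan(source):
    """Turn raw source text into a flat list of (type, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_RULES:
            match = pattern.match(source, pos)
            if not match:
                continue
            if name == "T_VARIABLE":
                tokens.append((name, match.group(1)))    # keep the variable name
            elif name == "CHAR":
                tokens.append((match.group(0), None))    # '(' or ')'
            elif name != "WHITESPACE":
                tokens.append((name, None))
            pos = match.end()
            break
        else:
            # No rule matched at this position.
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
    return tokens


print(scan("while ($x == $y)"))
```

Note that the result is a flat list: the scanner has no idea that the parentheses belong to the while.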
5. Compilers 101: Parsers
• Input: a stream of tokens from the scanner
• Output is implementation dependent
• Often an intermediate, in-memory representation of the program in tree form.
• e.g. Parse Tree or Abstract Syntax Tree
• Or directly generate bytecode.
• Goal of a parser is to structure the token stream.
• Parsers are frequently generated from a DSL
• See parser generators like Yacc/Bison, ANTLR, etc. or e.g. parser combinators in Haskell, Scala, ML.
• e.g. the token stream T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')' becomes:
0: ZEND_IS_EQUAL ~0 !0 !1
1: ZEND_JMPZ ~0 ->3
2: …
3: …
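As a rough illustration of "structuring the token stream", the sketch below hand-parses the while header from the scanner example into a small tree. It is a hand-written toy, not a generated parser: parse_while_header and the tuple-based tree shape are assumptions made up for this example.

```python
def parse_while_header(tokens):
    """Parse T_WHILE '(' T_VARIABLE T_IS_EQUAL T_VARIABLE ')' into a nested tuple."""
    pos = 0

    def expect(kind):
        # Consume one token of the given type, or fail with a syntax error.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        value = tokens[pos][1]
        pos += 1
        return value

    expect("T_WHILE")
    expect("(")
    left = expect("T_VARIABLE")
    expect("T_IS_EQUAL")
    right = expect("T_VARIABLE")
    expect(")")
    # The flat token list is now a tree: a while node owning a comparison node.
    return ("while", ("is_equal", ("var", left), ("var", right)))


tokens = [
    ("T_WHILE", None), ("(", None), ("T_VARIABLE", "x"),
    ("T_IS_EQUAL", None), ("T_VARIABLE", "y"), (")", None),
]
print(parse_while_header(tokens))
```

A parser that generates bytecode directly would emit instructions at the points where this one builds tuples.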
6. Compilers 101: Context-free grammars
• Or simply “grammar”
• A grammar describes the complete syntax of a (programming) language.
• Usually expressed in Extended Backus-Naur Form (EBNF)
• Or some variant thereof.
• Variants of EBNF are used by many DSL-based parser generators
• e.g. Yacc/Bison, ANTLR, etc.
8. Generalized *PHP* Compiler Architecture
• Source files → source code
• Scanner (Zend/zend_language_scanner.l) → token stream
• Parser (Zend/zend_language_parser.y) → Abstract Syntax Tree
• Code Generator (Zend/zend_compile.c) → bytecode
• Bytecode Interpreter (Zend/zend_execute.c)
• PHP compiles directly to bytecode!
9. Case Study: The “until” statement
• It’s basically while (!...) ...
<?php
$x = 5;
until ($x == 0) {
    $x--;
    echo "Oh hi, Mark [$x]\n";
}
-- output --
Oh hi, Mark [4]
Oh hi, Mark [3]
Oh hi, Mark [2]
Oh hi, Mark [1]
Oh hi, Mark [0]
10. How to add “until” to the PHP language
1. Tell the scanner how to tokenize new keyword(s)
2. Describe the syntax of the new construct
3. Emit bytecode
11. Before you start...
• You’ll need the usual gcc toolchain, GNU Bison, etc.
• Debian/Ubuntu: apt-get install build-essential
• OSX: Xcode command line tools should give you most of what you need.
• Also ensure that you have re2c
• Debian/Ubuntu: apt-get install re2c
• OSX (Homebrew): brew install re2c
• Used to generate the scanner
• Silently ignored if not found by the configure script!
• And, of course, source code for some recent version of PHP 5.
• I’m working with PHP 5.4.4
12. 1. Tell the scanner how to tokenize “until”
• Zend/zend_language_scanner.l
• Input for re2c, which will generate the Zend language scanner.
• Describes how raw source code should be converted into tokens.
• Note that no structure is implied here: that’s the parser’s job.
• Tell the scanner that the word “until” is special.
• The parser also needs to know about new tokens!
• How is this done for the while keyword?
• e.g. until ($x == $y) becomes:
T_UNTIL
'('
T_VARIABLE("x")
T_IS_EQUAL
T_VARIABLE("y")
')'
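What "special" means can be shown with a toy keyword table in Python. This is only a sketch: classify_word and the dictionaries are hypothetical, and the real change is made in the re2c input file rather than in a lookup table. The one PHP-specific assumption is that a word the scanner does not recognise as a keyword comes out as a plain identifier token (T_STRING), and that keywords are matched case-insensitively.

```python
def classify_word(word, keywords):
    """Return the token type for a bare word: a keyword token or a plain identifier."""
    return keywords.get(word.lower(), "T_STRING")


# Before the change: "until" is just another identifier.
before = {"while": "T_WHILE"}
# After the change: the scanner has been told that "until" is special.
after = {"while": "T_WHILE", "until": "T_UNTIL"}

print(classify_word("until", before))
print(classify_word("until", after))
```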
13. 2. Describe the syntax of “until”
• Zend/zend_language_parser.y
• Essentially serves as the grammar for the Zend language.
• Also describes actions to perform during parsing.
• Input for the parser generator (Bison) used to generate the PHP parser.
• Tell PHP how until statements are structured syntactically.
• How was it done for a while statement?
T_UNTIL '(' expr ')' statement
14. 3. Emit bytecode
• Add actions to Zend/zend_language_parser.y
• What should they do?
• Recall that PHP generates bytecode during the parsing process.
• Generate bytecode describing the semantics of until in terms of the PHP VM.
• Er, wait -- what bytecode do we need to generate?
15. Intermission: PHP bytecode intro
• opline: <opcode> <result?> <op1?> <op2?>
• Data structure representing a single line of PHP VM “assembly”
• Includes opcode + operands
• opline # associated with each opline
• Different variable types, differentiated by prefix:
• Variables ($)
• Compiled variables (!)
• Temporary variables (~)
• ZEND_JMP
• “goto”
• Conditional variants: ZEND_JMPZ, ZEND_JMPNZ
• opline #s used as address operand for JMP instructions (->)
ZEND_JMP <op1>
Unconditional jump to the opline # in op1
e.g. jump to opline #10
ZEND_JMP ->10
ZEND_JMPZ <op1> <op2>
Conditional jump to the opline # in op2 iff op1 is zero
e.g. jump to opline #3 if ~0 is zero
ZEND_JMPZ ~0 ->3
ZEND_IS_EQUAL <result> <op1> <op2>
result=1 if op1 == op2, otherwise result=0
e.g. set ~0=1 if !0 == 10
ZEND_IS_EQUAL ~0 !0 10
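To get a feel for how these oplines behave, here is a tiny interpreter sketch in Python. It only models the opcodes from this slide and is nothing like the real VM: run, the tuple layout, and the operand tags ("cv" for !, "tmp" for ~, "const" for literals, "jmp" for ->) are all assumptions made for this example. Jumping to an opline # one past the end simply stops the program.

```python
def run(oplines, compiled_vars):
    """Execute a list of (opcode, result, op1, op2) oplines.

    Returns the list of opline numbers executed and the temporary variables.
    """
    temps = {}
    trace = []

    def read(operand):
        kind, value = operand
        if kind == "cv":       # compiled variable, written !N on the slide
            return compiled_vars[value]
        if kind == "tmp":      # temporary variable, written ~N
            return temps[value]
        if kind == "const":    # literal value
            return value
        raise ValueError(f"cannot read operand {operand!r}")

    pc = 0
    while pc < len(oplines):
        opcode, result, op1, op2 = oplines[pc]
        trace.append(pc)
        if opcode == "ZEND_IS_EQUAL":
            temps[result[1]] = 1 if read(op1) == read(op2) else 0
            pc += 1
        elif opcode == "ZEND_JMP":
            pc = op1[1]
        elif opcode == "ZEND_JMPZ":
            pc = op2[1] if read(op1) == 0 else pc + 1
        elif opcode == "ZEND_JMPNZ":
            pc = op2[1] if read(op1) != 0 else pc + 1
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return trace, temps


PROGRAM = [
    ("ZEND_IS_EQUAL", ("tmp", 0), ("cv", 0), ("const", 10)),  # 0: ~0 = (!0 == 10)
    ("ZEND_JMPZ", None, ("tmp", 0), ("jmp", 3)),              # 1: if ~0 is zero, jump to 3
    ("ZEND_JMP", None, ("jmp", 3), None),                     # 2: jump to 3
]

print(run(PROGRAM, [10]))  # !0 is 10: opline 2 is reached
print(run(PROGRAM, [5]))   # !0 is 5: opline 1 jumps straight past opline 2
```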
27. 4. Emit bytecode (cont.)
• Zend/zend_compile.c
• The Zend language’s code generation logic lives here.
• No DSLs here: plain old C source code.
• First, let’s try to understand the bytecode for while
• How do we need to modify it for until?
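One way to think about that last question is sketched below in Python. It assumes a while loop is laid out as: condition, conditional exit jump, body, unconditional jump back to the condition. Under that assumption, until can reuse the same layout with the exit test flipped from ZEND_JMPZ to ZEND_JMPNZ. Everything here is a toy: compile_loop and run are made up, oplines are simplified tuples, and TOY_DEC and TOY_ECHO are hypothetical stand-ins for whatever the loop body compiles to.

```python
def compile_loop(keyword, cond_oplines, cond_tmp, body_oplines):
    """Lay out: condition, exit jump, body, jump back to the condition."""
    exit_jump = {"while": "ZEND_JMPZ", "until": "ZEND_JMPNZ"}[keyword]
    # The exit target is the opline # just past the final back jump.
    exit_target = len(cond_oplines) + 1 + len(body_oplines) + 1
    code = list(cond_oplines)
    code.append((exit_jump, cond_tmp, exit_target))
    code.extend(body_oplines)
    code.append(("ZEND_JMP", 0))  # back to the first condition opline
    return code


def run(code, cvs):
    """Toy interpreter: returns everything the TOY_ECHO oplines printed."""
    temps, output, pc = {}, [], 0
    while pc < len(code):
        op = code[pc]
        name = op[0]
        if name == "ZEND_IS_EQUAL":        # (op, tmp, cv, constant)
            temps[op[1]] = int(cvs[op[2]] == op[3])
            pc += 1
        elif name == "ZEND_JMP":           # (op, target)
            pc = op[1]
        elif name == "ZEND_JMPZ":          # (op, tmp, target)
            pc = op[2] if temps[op[1]] == 0 else pc + 1
        elif name == "ZEND_JMPNZ":         # (op, tmp, target)
            pc = op[2] if temps[op[1]] != 0 else pc + 1
        elif name == "TOY_DEC":            # hypothetical: cv -= 1
            cvs[op[1]] -= 1
            pc += 1
        elif name == "TOY_ECHO":           # hypothetical: record cv's value
            output.append(cvs[op[1]])
            pc += 1
        else:
            raise ValueError(f"unknown opcode {name}")
    return output


cond = [("ZEND_IS_EQUAL", 0, 0, 0)]        # ~0 = (!0 == 0)
body = [("TOY_DEC", 0), ("TOY_ECHO", 0)]   # $x--; echo $x;

print(run(compile_loop("until", cond, 0, body), [5]))
print(run(compile_loop("while", cond, 0, body), [5]))
```

With $x starting at 5, the until version counts down from 4 to 0 like the case study, while the while version never enters the loop.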
28. Demo!
• Time to build!
• The usual ./configure && make dance on Linux & OSX.
• To be thorough, regenerate data used by the tokenizer extension.
(cd ext/tokenizer && ./tokenizer_data_gen.sh)
• http://php.net/manual/en/book.tokenizer.php
• You’ll need to run make again once you’ve done this.
• With a little luck, magic happens and you get a binary in sapi/cli/php
• Take until out for a spin!
29. And exhale.
• Lots to take in, right?
• In my experience, this stuff is best learned bit-by-bit through practice.
• Ask questions!
• Google
• php-internals
• Or hey, ask me...
30. Thanks!
oscon@tomlee.co @tglee
http://newrelic.com
... and come see Inside Python @ 5pm in D135 :)