An Improved Hardware Acceleration
Scheme for Java Method Calls

Tero Säntti∗†, Joonas Tyystjärvi∗‡, and Juha Plosila†∗

∗ Dept. of Information Technology, University of Turku, Finland
† Academy of Finland, Research Council for Natural Sciences and Engineering
‡ Turku Centre for Computer Science, Finland

{teansa|jttyys|juplos}@utu.fi


Abstract— This paper presents a significantly improved strategy for accelerating method calls in the REALJava co-processor. The hardware assisted virtual machine architecture is described briefly to provide context for the method call acceleration. The strategy is implemented in an FPGA prototype, which allows measurement of real life performance gains and validates the whole co-processor concept. The system is intended to be used in embedded environments, with limited CPU performance and memory available to the virtual machine. The co-processor is designed in a highly modular fashion, especially separating the communication from the actual core. This modularity makes the co-processor more reusable and allows system level scalability. This work is part of a project focusing on the design of a hardware accelerated multicore Java Virtual Machine for embedded systems.

I. INTRODUCTION

Java is very popular and portable, as it is a write-once, run-anywhere language. This enables developers to write portable software for any platform. Java code is first compiled into bytecode, which is then run on a Java Virtual Machine (hereafter JVM). The JVM acts as an interpreter from bytecode to native machine code, or more recently uses just-in-time compilation (JIT) to achieve the same result somewhat faster at the cost of memory. This software-only approach is quite inefficient in terms of power consumption and execution time. These problems arise from the fact that executing one Java instruction requires several native instructions. Another source of inefficiency is memory usage. Software-based JVMs have to keep the internal registers of the virtual machine in the main memory of the host system. When the execution of the bytecode is performed on a hardware co-processor this is avoided and the overall number of memory accesses is reduced. Because the methods in Java are generally quite small, in terms of storage requirements for both the code that is running and the data being processed, it is possible to keep all the required items in a relatively small local memory inside the co-processor. In fact, just 128 kB of internal memory is enough to store all of the methods used in an embedded application. This includes the Java benchmark for embedded systems found in [15] and the embedded version of the CaffeineMark [14]. Since this local memory is not mirrored to the main memory, which usually resides in a physically external memory chip, it is energy efficient.

This work is a part of the VirtuES project, which focuses on fully utilizing the potential of embedded multicore systems using a virtual machine approach.

Overview of the paper. We proceed as follows. In Section II we briefly describe the structure of our hardware assisted JVM, and show how the proposed co-processor fits into the Java specifications. Section III describes methods in Java and sheds light on the different ways methods can be invoked. In Section IV the acceleration strategy is presented, with details of the hardware unit focusing on the differences from the previous solution. In Section V some benchmark results are given and analyzed. Finally, in Section VI we draw conclusions and describe future efforts related to the REALJava virtual machine.

II. JAVA VIRTUAL MACHINE

In the Java Virtual Machine Specification, Second Edition [4], the structure and behavior of all JVMs is specified at a quite abstract level. This specification can be met using several techniques. The usual solutions are software only, including some performance enhancing features such as JIT (just-in-time) compilation. We have chosen to use a HW/SW combination [7] in order to maximize the hardware usage and minimize the power consumption.

Fig. 1. Internal architecture of the REALJava JVM

The HW portion (shown on the right side of Figure 1) handles most of the actual Java bytecode execution, whereas the SW portion (the left side of Figure 1) takes care of memory management, class loading and native method calling. This partitioning makes it possible to use the co-processor with any type of host CPU(s) and operating system, as all of the platform dependent properties are implemented in software


978-1-4244-8971-8/10/$26.00 ©2010 IEEE
and most of the platform independent bytecode execution is done in hardware.

Because Java supports multithreading at the language level, it makes sense to integrate several co-processors as a SoC. This gives an ideal solution for complex systems running several Java threads, and possibly some native code, at the same time. This approach brings forth true multithreading and thus improves performance. Large systems may also contain several software subsystems, such as internet protocols, user interface controllers and so on, which can easily be coded in Java, and since they are all executed in parallel the user experience is enhanced.

The system architecture can be chosen to be a network of any kind or bus based, as suitable for the other components in the system. The structure of the underlying communication medium is rather irrelevant, as long as the lower level provides two properties: 1) the datagrams must arrive at their destination in the same order that they were sent, and 2) the datagrams arriving from two different sources at the same destination must be identifiable. The first property can be achieved with a lower level network protocol, like the ATM adaptation layer (AAL) for the internet, or by the physical structure of a bus. The second property seems quite natural, and should be present in all solutions. The communication scheme for the co-processor is discussed in more detail in [5].

The architecture of the co-processor is presented in [6], and the whole system, including the hardware and software portions, can be found in [7] and [10]. The basic design used for the FPGA implementations in this paper is the same, with only minor fine tuning of some of the units. There are 5 control registers in the execution unit. These are the program counter PC, stack top pointer ST, code offset CO, local variable pointer LV and local variable info LO. The PC holds the address of the current instruction relative to the CO. The ST and LV registers are internal addresses to the local memory. The CO contains the starting address of the current method in the method area of the co-processor. The last register holds two values, the number of parameters Nparams and the number of local variables Nlocals for the current method. After applying the new method invocation structure, the LO register is removed from the design.

The Java virtual machine also provides a rich standard library. In most current research virtual machines the GNU Classpath [16] is used. The GNU Classpath is a free implementation of the standard library, and it is constantly being developed. Currently it covers more than 95% of the methods. The missing methods are quite rare, so in most cases the GNU Classpath is sufficient. As per recommendations for Java programming, the classpath has been built from very small methods, which are invoked often during the execution of a Java program. Many of the methods in the classpath also call even smaller sub-methods. This emphasizes the importance of having a fast method invocation architecture in a virtual machine. The method size statistics for selected benchmarks are shown in Table I, and they clearly support the claim of small methods being invoked often. More statistics about Java methods can be found in [11]. An independent study can also be found in [1].

                      Salesman      Sort      Raytrace    Caffeine
  Stack frame size      8.98        4.77        7.01        5.55
  Method length        38.86        8.67        9.26       14.83
  Total invocations   991228     18412516    1957996     27779867

TABLE I
STATISTICS FROM METHOD INVOCATIONS IN SELECTED BENCHMARKS. THE FIRST TWO ROWS ARE AVERAGES AND THEY ARE MEASURED IN 32-BIT WORDS.

III. METHOD CALLS IN JAVA

The Java virtual machine specification [4] defines the types of methods that can be invoked in Java. Because Java is an object-oriented language, methods are usually invoked on objects, with the actual method implementation chosen based on the runtime type of the object. Methods that are not invoked on objects are called static methods. Besides static methods, the most important categories of methods are defined in the access flags bit field of the method definition. The most important access flags during bytecode execution are acc_synchronized and acc_native. Acc_synchronized means that when the method is invoked, the monitor (the primary synchronization construct in Java) of the object that the method is invoked on is entered, and the monitor is exited on return from the method. Acc_native means that the method is implemented in a native language of the platform. Native methods can be bound to actual native functions at runtime.

Methods are invoked using one of the four bytecode instructions invokevirtual, invokespecial, invokeinterface and invokestatic. All of these instructions perform a method lookup based on a 16-bit index into the constant pool of the currently executing class. Invokevirtual and invokeinterface then perform a further lookup based on the runtime type of the object that the method is being invoked on, while invokestatic and invokespecial invoke the method found immediately. As symbolic method resolution is very slow, it is common to modify the constant pool and the instruction data itself either during class loading or after the execution of a call instruction. A common technique for accelerating invokevirtual instructions is the use of virtual tables [2], which contain a pointer to each non-interface method that a class implements, with a fixed index for each method identifier. Performing a virtual table lookup is much faster than finding the method by symbolic lookup in the class of the object. A somewhat related technique is so-called "inline caches" [3], which enable just-in-time compilers to quickly inline the most common implementations of virtual methods into their call sites.

As Java is an object-oriented language, invokevirtual is intended to be the primary method invocation instruction. The other instructions are used for special cases: invokespecial is used to invoke object constructors, private methods (which can be hidden by subclasses) and to explicitly invoke a certain implementation of a virtual method; invokeinterface is only used to invoke methods through an interface reference; and invokestatic does not operate on an object instance.
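To make the four invocation forms concrete, the short example below (our own illustration, not from the paper) contains one call site of each kind. Compiling it and disassembling with javap -c shows, with classic javac behavior, invokestatic for the static call, invokespecial for the constructor and the private call, invokevirtual for the call on a class-typed reference, and invokeinterface for the call through the interface reference (newer compilers may differ for private methods).

```java
// Illustrative only: maps Java source constructs to the four invoke* bytecodes.
interface Greeter {
    String greet();               // call sites through this type use invokeinterface
}

public class Demo implements Greeter {
    public String greet() {       // virtual method: reached via invokevirtual
        return "hello " + suffix();
    }

    private String suffix() {     // private method: classically invokespecial
        return "world";
    }

    static int twice(int x) {     // static method: invokestatic
        return 2 * x;
    }

    public static void main(String[] args) {
        Greeter g = new Demo();   // constructor: invokespecial <init>
        assert g.greet().equals("hello world");           // invokeinterface
        assert new Demo().greet().equals("hello world");  // invokevirtual
        assert twice(21) == 42;                           // invokestatic
        System.out.println("ok");
    }
}
```

Running `javap -c Demo` on the compiled class makes the chosen instruction at each call site visible.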
It is important to notice that as long as a class has no subclasses, invokevirtual can be executed like invokespecial. The same applies to interfaces with only one implementation.

If the overload status of virtual functions is stored in the method definition and updated when new classes are loaded, three types of method invocation instructions can be executed without access to heap data or native functions: non-native invokestatic, invokespecial and invokevirtual with a single implementation. These instructions can be implemented using only a constant pool lookup and, in the case of invokevirtual, a test of the overload status of the method.

The new architecture presented in this paper makes use of an observation about Java programs we made recently. We noticed that the stack of a given Java method is always empty when a return instruction is executed. This feature is not mentioned in the Java virtual machine specification [4], but it seems that not one of the Java compilers we tried generates code where the stack would not be empty. Assuming the stack to always be empty makes the return much simpler, but we were hesitant due to the fact that a class with a non-empty stack during return would still be a legal construct. When a bytecode modification engine [12] was added to the bytecode verification phase, it was noticed that it could be used for emptying the stack if required. The bytecode verification keeps count of the stack at all points of the bytecode, so adding just the required number of pop instructions before an offending return instruction would fix the situation. So far such a case has never been observed, but the check is kept in the verification process for the sake of security and in order to be compliant with all legal Java code.

Returning from a method happens using one of six instructions: return, ireturn, freturn, areturn, lreturn or dreturn. These differ only by the data pushed to the stack of the calling method. The first one pushes nothing, while the next three push one word and the last two push two words. Even though the 32-bit versions have several bytecodes reserved, they are implemented using only one mechanism. The difference between these instructions is only used during class loading for verification purposes. The 64-bit instructions are handled similarly. Since the actual returning process is exactly the same for all of the instructions, we only consider the return instruction, and state that the data to be pushed to the calling method's stack is stored into temporary registers during the return process.

IV. INVOCATION AND RETURN PROCEDURES

First, let us have a look at what happens in the stack of the virtual machine during a method invocation. Figure 2 shows how the new stack frame is created. Before the actual invocation, the calling method pushes the required parameters to the top of its stack. In the Figure these are shown as Parameters, and the number of them is denoted with the symbol Nparams. The symbol Nlocals tells how many local variables the new method uses. Note that the parameters become a part of the local variable array for the new method. The symbol X is just a shorthand for Nlocals − Nparams.

Fig. 2. The effects of the invocation process on the stack.

Now let us review the mechanism presented in [9]. In the following formulas the CallInfo vector comes from the invoker module shown in Figures 3 and 4. In the original architecture the CallInfo was 56 bits long. The SWCTRL symbol is used for control bits that tell both the hardware and the software that some special actions are required during the return phase of the method. An example would be a return to a native method. This situation cannot be handled in the hardware, since control is returned to the native method executed by the CPU. Please notice that pushing the return info to the stack after the new register values have been calculated updates the ST accordingly.

Formulas for calculating the new register values:
PC ⇐ 0
ST ⇐ ST_OLD − CallInfo(15..0) + CallInfo(31..16)
CO ⇐ CallInfo(55..32)
LV ⇐ ST_OLD − CallInfo(15..0)
LO ⇐ CallInfo(31..0)

Data pushed to the stack frame (Return Info):
SWCTRL & PC_OLD
ST_OLD − CallInfo(15..0)
CO_OLD
LV_OLD
LO_OLD

Then we can move on to the new method invocation procedure. Here are the modified invocation formulas. These use the new architecture, so the CallInfo is only 16 bits long, and it is presented in Figure 5.

Formulas for calculating the new register values:
PC ⇐ 0
ST ⇐ ST_OLD + X
CO ⇐ CallInfo(15..0)
LV ⇐ ST_OLD − Nparams

Data pushed to the stack frame (Return Info):
SWCTRL & CO_OLD
LV_OLD & PC_OLD
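It is worth checking that the shortened CallInfo loses nothing. In the old rule the invoker supplied Nparams and Nlocals, giving ST ⇐ ST_OLD − Nparams + Nlocals; the new rule computes ST ⇐ ST_OLD + X with X = Nlocals − Nparams precomputed at class loading, which is the same value. A minimal self-check in Java (the concrete register values are hypothetical, chosen only for illustration):

```java
// Sketch: the old and new invocation rules compute identical register values.
public class InvokeRules {
    // Old architecture: CallInfo(15..0) = Nparams, CallInfo(31..16) = Nlocals.
    static int stOld(int st, int nParams, int nLocals) {
        return st - nParams + nLocals;
    }

    // New architecture: X = Nlocals - Nparams is precomputed at class loading.
    static int stNew(int st, int x) {
        return st + x;
    }

    public static void main(String[] args) {
        int st = 100, nParams = 3, nLocals = 7;          // hypothetical frame
        int x = nLocals - nParams;                        // computed once, at load time
        assert stOld(st, nParams, nLocals) == stNew(st, x);  // both yield 104
        int lv = st - nParams;                            // LV rule is unchanged: 97
        assert lv == 97;
        System.out.println("ST = " + stNew(st, x) + ", LV = " + lv);
    }
}
```

The precomputation simply moves one subtraction from every invocation to class-loading time.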

Naturally the procedures for performing a return instruction are also simplified in the new architecture. The process of returning from a method can be seen as going in the opposite direction in Figure 2. The original values of most of the registers associated with the stack frame on the left side are restored. Only the ST is modified, to reflect the fact that the parameters consumed by the invoked method have been removed from the stack. The old system performed the following actions during return instructions.

Formulas for calculating the new register values:
PC ⇐ Data0
ST ⇐ Data1
CO ⇐ Data2
LV ⇐ Data3
LO ⇐ Data4

Here the DataN symbols are retrieved from the stack frame using a separate indexing scheme, which offsets the index by Nlocals and then uses the normal local variable loading mechanism inside the co-processor. Altogether this sequence requires 5 data items to be retrieved from the data memory.

The modified architecture gets the same results in a much simpler fashion, using the following formulas:
PC ⇐ Data0(15..0)
ST ⇐ LV_OLD
CO ⇐ Data1(15..0)
LV ⇐ Data0(31..16)

Again the DataN symbols are retrieved from the data memory, but now they can be retrieved using the normal pop mechanism. This simplifies the hardware, since the unit handling local variables is no longer required to handle the additional offsetting. The amount of data to be retrieved is also decreased from 5 words to just 2. This naturally decreases the amount of memory required for return information in each stack frame from 5 to 2 words. Using the pop mechanism is possible since we are now assuming that the stack of the current method is empty when performing the return. Notice also that the new value of the ST is not calculated at all; it is simply the value of the LV in the current method.

The invoker will speed up the invocation of methods that are already loaded into the local memory of the co-processor. When an invocation command is encountered in the ALU, it sends the constant pool index of the method to the invoker module and sets query high. At this time the invoker performs a look-up in the content addressable memory (CAM) using the method id and the code offset as the key, as shown in Figure 4. In the old architecture the code offset was 24 bits long. In the new architecture the software performs a process called constant pool merging, during which the constants defined by a given class are added to a global constant pool instead of a separate pool for each class. This saves memory by merging constants already defined by other classes and also speeds up the constant pool look-up, since finding the constant pool of the current class is not required. This technique also reduces the size of the CAM key, because the code offset of the current method is no longer needed. Only the method id, now in the new unified constant pool, is required. The new structure can be seen in Figure 5. As an additional bonus, the method cache utilization is improved. This happens when one method, let us call it A, is invoked from several different classes. This scenario results in only one cache line, while the old architecture would have required a separate line for each method invoking A.

Fig. 3. The invoker connected to the ALU and the registers.

Fig. 4. The original CAM structure.

Fig. 5. The modified CAM structure.

After the key has been found in the CAM, the match address is sent to a normal RAM, which stores the information needed to perform the method call.
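The software side of constant pool merging can be sketched as a simple interning table. The sketch below is our own illustration (the paper does not publish the actual data structures): constants from all loaded classes are deduplicated into one global pool, so every class sees the same index for the same constant.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of constant pool merging: constants from all loaded classes are
// interned into one global pool, so equal constants share one global index.
public class MergedPool {
    private final Map<Object, Integer> index = new HashMap<>();
    private final List<Object> pool = new ArrayList<>();

    /** Returns the global index of a constant, adding it if unseen. */
    public int intern(Object constant) {
        Integer existing = index.get(constant);
        if (existing != null) return existing;   // merged: reuse the old entry
        pool.add(constant);
        index.put(constant, pool.size() - 1);
        return pool.size() - 1;
    }

    public int size() { return pool.size(); }

    public static void main(String[] args) {
        MergedPool global = new MergedPool();
        // Two hypothetical classes both referencing the string "java/lang/Object":
        int a = global.intern("java/lang/Object");  // from class A
        int b = global.intern("java/lang/Object");  // from class B: merged
        global.intern(42);                          // a distinct constant
        assert a == b;            // one shared entry instead of one per class
        assert global.size() == 2;
        System.out.println("entries: " + global.size());
    }
}
```

Because every class resolves to the same global index, a method id alone identifies a method, which is exactly what lets the CAM key drop the per-class code offset.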
  Processor                  REALJava (old)   REALJava (new)   Kaffe on PPC   Units   Gain %
  Engine speed                    100              100              300        MHz     N/A
  Simple call                  3125000          6666666           59453        1/s    113.3
  Instance call                 713042          1160730           19460        1/s     62.8
  Synchronized call             366473           564810           15567        1/s     54.1
  Final call                    671343          1097950           18090        1/s     63.5
  Class call                    671255          1248860           18847        1/s     86.0
  Synchronized class call       260401           350181                        1/s     34.5
  Salesman                       11438             9027           111824       ms      26.7
  Sort                           40569            31386           856684       ms      29.3
  Raytrace                        7205             5494           169646       ms      31.1
  EmbeddedCaffeineMark             156              231              10                48.1
  EmbeddedCaffeineMark ND          184              279              11                51.6

TABLE II
RESULTS FROM VARIOUS BENCHMARKS.

This RAM was 56 bits wide, and consisted of 24 bits for the code offset of the new method, 16 bits for the number of local variables, Nlocals, and finally 16 bits for the number of parameters, Nparams, taken by the new method. Our improved scheme stores only the code offset for the method to be invoked. The length is now limited to 16 bits. Instead of storing Nlocals and Nparams in the invoker cache, a different approach is chosen: Nparams and X are stored in the instruction memory, just before the actual Java code of the method. The value of X is calculated during class loading to minimize the computation required at runtime. This strategy minimizes the size of the method cache unit. The increased memory requirement on the instruction side is only one word per method, and since the stack frames have been reduced by 3 words, the net effect is positive even if each method is invoked only once. Naturally, if there are subsequent invocations of a method already loaded into the instruction memory, the net saving resulting from the new architecture is 3 words per invocation. The code offset found in the RAM is also sent to the instruction memory controller, which in turn returns the values of Nparams and X to the ALU for use in the invocation. If a key is found, the get_regs signal is set high to indicate a valid match. This triggers the ALU to capture the CallInfo, Nparams and X, and to calculate the new register values using the rules presented earlier.

In case a match is not found in the CAM, a trap is produced. To indicate this condition to the ALU, the do_trap signal is set high. Upon receiving this signal the ALU sets the trap signal high to the communication module, and finally the host CPU performs the actions needed to start execution of the new method. At the same time the invoker module saves the key to the CAM. When execution resumes after the trap, the invoker module captures the required register values and saves them to the RAM. Now the invoker is ready to speed up execution in case the same method is called again. When the invoker module saves a new key to the CAM, it uses a circular oldest algorithm to choose which entry to replace. This scheme provides a reasonably close approximation of the least recently used algorithm with very low complexity.

The invoker module can also clear its contents. This is required for situations where a virtual method has been cached in the module, and a new overloading virtual method needs to be loaded. Overloading of methods causes them to fall out of the cache, because selecting the implementation for a specific call requires access to heap data. The host CPU is better suited for this kind of task, so the task is assigned to it.

The module was integrated into our REALJava co-processor prototype with a depth of 8 entries. This depth was chosen because the statistics presented in [9] show that this size provides the highest impact on performance with the least resources. The prototype is based on a Xilinx ML410 demonstration board. This board provides all the services one might expect of a computer, such as a network controller, a hard drive controller, a PCI bus and so on. The FPGA chip is a Virtex4FX, which includes two hardcore PowerPC CPUs. The co-processor is connected to the CPU via the Processor Local Bus (PLB 3.4). The system runs the co-processor at 100 MHz, while the PowerPC CPU runs at 300 MHz. The CPU runs Linux 2.4.20 as the operating system, providing services (network, filesystem, etc.) to the virtual machine. For more details on the prototype, please see [8] and [10]. The system has also been implemented on a Virtex5 based board. This configuration used the newer PLB bus (4.6) as the communication channel and a MicroBlaze as the CPU. That CPU provides considerably less arithmetic performance, since it is a softcore processor implemented using FPGA resources and runs at 100 MHz. The larger FPGA chip allowed us to include eight co-processor cores in the system. Unfortunately we have only implemented the new invocation architecture on that platform, so we do not present detailed results for it in this paper.

V. RESULTS

The results in Table II show that the invoker module has a significant impact on the execution times of the benchmarks. In the table, REALJava (old) stands for a configuration with the original invoker, REALJava (new) stands for a configuration with the improved invoker, and Kaffe on PPC is the Kaffe Virtual Machine running on the same PowerPC processor. REALJava, even though running at a lower clock speed, clearly outperforms Kaffe in all of the benchmarks. The Gain is the percentage of improvement achieved with the improved invoker module.

The first set of benchmarks is a collection of method call tests. They measure mostly the method invocation performance, and do not include (significant amounts of) arithmetics.
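A call-rate test of this kind can be sketched in plain Java (our own reconstruction; the paper does not reproduce its benchmark harness). It repeatedly invokes an empty method and reports invocations per second:

```java
// Sketch of a "simple call" micro-benchmark: counts how many empty
// method invocations complete per second. Figures are illustrative only;
// a JIT may inline the empty call away, so results vary by VM.
public class CallRate {
    private static long sink;                 // defeats trivial dead-code removal

    static void empty() { sink++; }

    /** Returns invocations per second over roughly the given duration. */
    static long callsPerSecond(long millis) {
        long calls = 0;
        long end = System.currentTimeMillis() + millis;
        while (System.currentTimeMillis() < end) {
            for (int i = 0; i < 1000; i++) empty();   // batch to amortize clock reads
            calls += 1000;
        }
        return calls * 1000 / Math.max(millis, 1);
    }

    public static void main(String[] args) {
        long rate = callsPerSecond(200);      // short run; real tests run longer
        assert rate > 0;
        System.out.println("~" + rate + " calls/s");
    }
}
```

On a JIT-compiling desktop VM the empty call may be optimized away entirely, so absolute numbers from this sketch are not comparable to the interpreted or hardware-assisted figures in Table II.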
The first one simply calls an empty method and then returns without any processing inside the invoked method. The next 5 are taken from the Java Grande Suite [17] to show the performance gains for various method types. These benchmarks contain a few Java instructions inside the invoked methods, so some time is spent performing actual arithmetics. The arithmetic speed of the JPU is exactly the same for both versions, which explains the lower gain percentages in these tests when compared to the simple call test.

The next set of benchmarks is a collection of tests that have been written to evaluate real life performance. The benchmark programs do not contain any special optimizations for our hardware. Short descriptions of the benchmarks follow. Salesman solves the traveling salesman problem using a naive try-all-combinations method, Sort tests array handling performance by creating arrays of random numbers and then sorting them, and Raytrace renders a 3D sphere above a plane. As the benchmarks emphasize different aspects of the system, together they should give a rather good estimate of the different practical applications that might be found on an embedded Java system. The results show a 26 to 31 percent improvement in execution speed with the new invocation module.

Several websites and research papers dedicated to Java execution have used the CaffeineMark as a performance measurement. The CaffeineMark is also available as an embedded version, which omits the graphical tests of the desktop version. The test scores are calibrated so that a score of 100 equals the performance of a desktop computer with a 133 MHz Intel Pentium class processor. The individual tests

frames, thus reducing the overall memory requirements for the co-processor. Also the hardware is simplified, since the LO register and the offsetting mechanism for local variables were removed.

We plan to continue refining the REALJava virtual machine. Currently we are mostly focusing on improvements to the software partition, but the hardware is also evolving at the same time. On the hardware side, the most interesting new topic we are studying is making the co-processor core into a reconfigurable module and providing system level support for dynamically adding and removing co-processors as needed. This kind of system could better utilize the resources of a given FPGA by providing several special purpose cores to be used based on the user application.

ACKNOWLEDGMENT

The authors would like to thank the Academy of Finland for their financial support of this work through the VirtuES project.

REFERENCES

[1] S. Byrne, C. Daly, D. Gregg and J. Waldron, "Dynamic Analysis of the Java Virtual Machine Method Invocation Architecture", in Proc. WSEAS 2002, Cancun, Mexico, May 2002.
[2] O.-J. Dahl and B. Myhrhaug, "Simula Implementation Guide", Publication S 47, Norwegian Computing Center, March 1973.
[3] J. Lee, B. Yang, S. Kim, K. Ebcioğlu, E. Altman, S. Lee, Y. C. Chung, H. Lee, J. H. Lee, and S. Moon, "Reducing virtual call overheads in a Java VM just-in-time compiler", SIGARCH Comput. Archit. News, vol. 28, no. 1, pp. 21-33, March 2000.
[4] T. Lindholm and F. Yellin, "The Java Virtual Machine Specification", Second Edition, Addison-Wesley, 1997.
cover a broad spectrum of applications. Since the REALJava           [5] T. S¨ ntti and J. Plosila, “Communication Scheme for an Advanced Java
                                                                              a
                                                                         Co-Processor”, In Proc. IEEE Norchip 2004, Oslo, Norway, November
is intended for embedded systems, we also calculated the                 2004.
scores without the floating point sub-test. These scores are          [6] T. S¨ ntti and J. Plosila, “Architecture for an Advanced Java Co-
                                                                               a
reported in Table II on the line marked with ND (No Double               Processor”, In Proc. International Symposium on Signals, Circuits and
                                                                         Systems 2005, Iasi, Romania, July 2005.
arithmetics). These results are marked with italics because          [7] T. S¨ ntti, J. Tyystj¨ rvi and J. Plosila, “Java Co-Processor for Embedded
                                                                              a               a
they were measured using a new version of the software par-              Systems”, In Processor Design: System-on-Chip Computing for ASICs
tition of the REALJava virtual machine, which contains some              and FPGAs, J. Nurmi, Ed. Kluwer Academic Publishers / Springer
                                                                         Publishers, 2007, ch. 13, pp. 287-308, ISBN-10: 1402055293, ISBN-13:
modifications besides the invocation architecture. Because of             978-1402055294.
this, the results do not give an accurate view on the effect         [8] T. S¨ ntti, J. Tyystj¨ rvi and J. Plosila, “FPGA Prototype of the REALJava
                                                                             a                a
of the new invocation architecture. For reference we give                Co-Processor”, In Proc. 2007 International Symposium of System-on-
                                                                         Chip, Tampere, Finland, November 2007.
the scores for the Virtex5 based system, which are 142 and           [9] T. S¨ ntti, J. Tyystj¨ rvi and J. Plosila, “A Novel Hardware Acceleration
                                                                              a                a
198 for the embedded CaffeineMark with and without double                Scheme for Java Method Calls”, In Proc. ISCAS 2008, Seattle, Washing-
arithmetics. These results show a decrease from the PowerPC              ton, USA, May 2008.
                                                                     [10] T. S¨ ntti, “A Co-Processor Approach for Efficient Java Execution in Em-
                                                                               a
based system, which is due to the significantly slower CPU.               bedded Systems”, Ph.D. thesis, (https://oa.doria.fi/handle/10024/42248),
Naturally this test was run using only one core on that system,          University of Turku, November 2008.
although eight of them could be used in parallel. More results       [11] J. Tyystj¨ rvi, “A Virtual Machine for Embedded Systems with a Co-
                                                                                     a
                                                                         Processor”, M.Sc. Thesis, University of Turku, 2007.
can be found at our results site [13], the invocation architecture   [12] J. Tyystj¨ rvi, T. S¨ ntti and J. Plosila, “Instruction Set Enhancements for
                                                                                     a         a
was changed and fine tuned between versions 2.09 and 3.01                 High-Performance Multicore Execution on the REALJava Platform”, In
of the REALJava.                                                         Proc. NORCHIP 2008, Tallinn, Estonia, November 2008.
                                                                     [13] “BenchMark Results”, consulted 18 August 2010,
          VI. C ONCLUSIONS AND F UTURE W ORK                             http://vco.ett.utu.fi/˜teansa/REALResults.
                                                                     [14] “CaffeineMark 3.0”, consulted 18 August 2010,
  An improved strategy for accelerating of method calls in               http://www.benchmarkhq.ru/cm30/.
                                                                     [15] “Embedded Java Book Index”, consulted 18 August 2010,
Java using a hardware module was presented. The module                   http://www.practicalembeddedjava.com/.
was implemented on Xilinx FPGA to provide several bench-             [16] “GNU Classpath”, consulted 18 August 2010,
marks and show significant improvement in both specialized                http://www.gnu.org/software/classpath/.
                                                                     [17] “JavaG Benchmarking”, consulted 18 August 2010,
and more general benchmarks. In addition to the improved                 http://www2.epcc.ed.ac.uk/computing/research activities/java grande/
performance, the new architecture reduces the size of the stack
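The simple call test described in Section V, which repeatedly invokes an empty method so that the measured time is dominated by invocation overhead rather than arithmetic, can be approximated in plain Java. The sketch below is illustrative only; the class name, iteration count and timing approach are assumptions, not the authors' actual benchmark code.

```java
// Sketch of a synthetic method-invocation micro-benchmark in the
// spirit of the "simple call" test: the callee body is empty, so
// elapsed time per iteration approximates invokevirtual overhead.
public class CallBenchmark {

    // Empty instance method; invoked via invokevirtual.
    void empty() { }

    // Returns the average cost of one call, in nanoseconds.
    static double measure(int iterations) {
        CallBenchmark b = new CallBenchmark();
        // Warm-up pass so any JIT effects settle before timing.
        for (int i = 0; i < iterations; i++) b.empty();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) b.empty();
        return (double) (System.nanoTime() - start) / iterations;
    }

    public static void main(String[] args) {
        System.out.printf("%.2f ns per call%n", measure(10_000_000));
    }
}
```

On a JIT-compiling desktop JVM the empty call may be inlined away entirely, which is precisely why the paper's comparison is run on the interpreting/co-processor configurations, where each invocation exercises the full call sequence.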

43

  • 1. An Improved Hardware Acceleration Scheme for Java Method Calls Tero S¨ ntti∗† , Joonas Tyystj¨ rvi∗‡ , and Juha Plosila†∗ a a ∗ Dept. of Information Technology, University of Turku, Finland † Academy of Finland, Research Council for Natural Sciences and Engineering ‡ Turku Centre for Computer Science, Finland {teansa|jttyys|juplos}@utu.fi Abstract— This paper presents a significantly improved strat- This work is a part of the VirtuES project, which focuses egy for accelerating the method calls in the REALJava co- on fully utilizing the potential of embedded multicore systems processor. The hardware assisted virtual machine architecture is using a virtual machine approach. described shortly to provide context for the method call accelera- tion. The strategy is implemented in an FPGA prototype. It allows Overview of the paper We proceed as follows. In Section measurements of real life performance increase, and validates 2 we shortly describe the structure of our hardware assisted the whole co-processor concept. The system is intended to be JVM, and show how the proposed co-processor fits into the used in embedded environments, with limited CPU performance Java specifications. Section 3 describes the methods in Java and memory available to the virtual machine. The co-processor and sheds light on the differences in the ways methods can be is designed in a highly modular fashion, especially separating the communication from the actual core. This modularity of the invoked. In Section 4 the strategy for accelerator is presented design makes the co-processor more reusable and allows system with details of the hardware unit focusing on the differences level scalability. This work is a part of a project focusing on design to the previous solution. In Section 5 some benchmark results of a hardware accelerated multicore Java Virtual Machine for are given and analyzed. Finally in Section 6 we draw some embedded systems. 
conclusions and describe the future efforts related to the REALJava virtual machine. I. I NTRODUCTION Java is very popular and portable, as it is a write-once run- II. JAVA V IRTUAL M ACHINE any-where language. This enables coders to develop portable In the Java Virtual Machine Specification [4], Second Edi- software for any platform. Java code is first compiled into byte- tion the structure and behavior of all JVMs is specified at code, which is then run on a Java Virtual Machine (hereafter a quite abstract level. This specification can be met using JVM). The JVM acts as an interpreter from bytecode to native several techniques. The usual solutions are software only, microcode, or more recently uses just in time compilation (JIT) including some performance enhancing features, such as JIT to affect the same result a bit faster at the cost of memory. (Just In Time Compilation). We have chosen to use a HW/SW This software only approach is quite inefficient in terms of combination [7] in order to maximize the hardware usage and power consumption and execution time. These problems rise minimize the power consumption. from the fact that executing one Java instruction requires several native instructions. Another source for inefficiency is the memory usage. The software based JVMs have to keep internal registers of the virtual machine in the main memory of the host system. When the execution of the bytecode is performed on a hardware co-processor this is avoided and the overall amount of memory accesses is reduced. Because the methods in Java are generally quite small in terms of storage requirements for the code that is running and the data being processed it is possible to keep all the required items in a Fig. 1. Internal architecture of the REALJava JVM relatively small local memory inside the co-processor. Actually just 128 kb of internal memory is enough to store all of the The HW portion (shown on the right side of Figure 1) methods used in an embedded application. 
This includes the handles most of the actual Java bytecode execution, whereas Java benchmark for embedded systems found in [15] and the the SW portion (the left side of Figure 1) takes care of memory embedded version of the CaffeineMark [14]. Since this local management, class loading and native method calling. This memory is not mirrored to the main memory, which usually partitioning gives the possibility to use the co-processor with resides in a physically external memory chip, it is energy any type of host CPU(s) and operating systems, as all of the efficient. platform dependent properties are implemented in software 978-1-4244-8971-8/10$26.00 c 2010 IEEE
and most of the platform independent bytecode execution is done in hardware. Because Java supports multithreading at the language level, it makes sense to integrate several co-processors as a SoC. This gives an ideal solution for complex systems running several Java threads and possibly some native code at the same time. This approach brings forth true multithreading and thus improves performance. Also, large systems typically contain several software subsystems, such as internet protocols, user interface controllers and so on, which can easily be coded in Java, and since they are all executed in parallel the user experience is enhanced.

The system architecture can be chosen to be a network of any kind or bus based, as suitable for the other components in the system. The structure of the underlying communication medium is rather irrelevant, as long as the lower level provides two properties: 1) the datagrams must arrive at their destination in the same order that they were sent, and 2) the datagrams arriving from two different sources at the same destination must be identifiable. The first property can be achieved with a lower level network protocol, like the ATM adaptation layer (AAL) for the internet, or by the physical structure of a bus. The second property seems quite natural, and should be present in all solutions. The communication scheme for the co-processor is discussed in more detail in [5].

The architecture for the co-processor is presented in [6], and the whole system, including the hardware and software portions, can be found in [7] and [10]. The basic design used for the FPGA implementations in this paper is the same, with only minor fine tuning of some of the units. There are 5 control registers in the execution unit. These are the program counter PC, stack top pointer ST, code offset CO, local variable pointer LV and local variable info LO. The PC holds the address of the current instruction relative to the CO. The ST and LV registers are internal addresses to the local memory. The CO contains the starting address of the current method in the method area of the co-processor. The last register holds two values, the number of parameters N_params and the number of local variables N_locals for the current method. After applying the new method invocation structure, the LO register is removed from the design.

The Java virtual machine also provides a rich standard library. In most current research virtual machines the GNU Classpath [16] is used. The GNU Classpath is a free implementation of the standard library, and it is constantly being developed. Currently it covers more than 95% of the methods. The missing methods are quite rare, so in most cases the GNU Classpath is sufficient. As per recommendations for Java programming, the classpath has been built from very small methods, which are invoked often during the execution of a Java program. Also, many of the methods in the classpath call even smaller sub-methods. This emphasizes the importance of having a fast method invocation architecture in a virtual machine. The method size statistics for selected benchmarks are shown in Table I, and they clearly support the claim of small methods being invoked often. More statistics about Java methods can be found in [11]. An independent study can also be found in [1].

                       Salesman   Sort       Raytrace   Caffeine
  Stack frame size     8.98       4.77       7.01       5.55
  Method length        38.86      8.67       9.26       14.83
  Total invocations    991228     18412516   1957996    27779867

TABLE I: Statistics from method invocations in selected benchmarks. The first two rows are averages and they are measured in 32-bit words.

III. METHOD CALLS IN JAVA

The Java virtual machine specification [4] defines the types of methods that can be invoked in Java. Because Java is an object-oriented language, methods are usually invoked on objects, with the actual method implementation chosen based on the runtime type of the object. Methods that are not invoked on objects are called static methods. Besides static methods, the most important categories of methods are defined in the access flags bit field of the method definition. The most important access flags during bytecode execution are acc_synchronized and acc_native. Acc_synchronized means that when the method is invoked, the monitor (the primary synchronization construct in Java) of the object that the method is invoked on is entered, and the monitor is exited on return from the method. Acc_native means that the method is implemented in a native language of the platform. Native methods can be bound to actual native functions at runtime.

Methods are invoked using one of four bytecode instructions: invokevirtual, invokespecial, invokeinterface and invokestatic. All of these instructions perform a method lookup based on a 16-bit index into the constant pool of the currently executing class. Invokevirtual and invokeinterface then perform a further lookup based on the runtime type of the object that the method is being invoked on, while invokestatic and invokespecial invoke the method found immediately. As symbolic method resolution is very slow, it is common to modify the constant pool and the instruction data itself during either class loading or after the execution of a call instruction. A common technique for accelerating invokevirtual instructions is the use of virtual tables [2], which contain a pointer to each non-interface method that a class implements, with a fixed index for each method identifier. Performing a virtual table lookup is much faster than finding the method by symbolic lookup in the class of the object. A somewhat related technique is the so-called "inline cache" [3], which enables just-in-time compilers to quickly inline the most common implementations of virtual methods into their call sites.

As Java is an object-oriented language, invokevirtual is intended to be the primary method invocation instruction. The other instructions are used for special cases: invokespecial is used to invoke object constructors, private methods (which cannot be overridden by subclasses) and to explicitly invoke a certain implementation of a virtual method; invokeinterface is only used to invoke methods through an interface pointer; and invokestatic does not operate on an object instance.
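The virtual-table technique of [2] can be sketched with a small Python model. This is purely illustrative: the class and method names below are invented and are not part of the REALJava design. Each class copies its parent's table and overwrites the slots it overrides, so a virtual call becomes a constant-time indexed load instead of a symbolic lookup:

```python
# Toy model of virtual-table (vtable) dispatch: each class keeps a list
# of method implementations, indexed by a fixed slot per method name.
# A subclass copies its parent's vtable and overwrites overridden slots.

class VTableClass:
    def __init__(self, parent=None):
        self.vtable = list(parent.vtable) if parent else []
        self.slots = dict(parent.slots) if parent else {}

    def define(self, name, impl):
        if name in self.slots:                  # override: reuse parent's slot
            self.vtable[self.slots[name]] = impl
        else:                                   # new method: allocate a slot
            self.slots[name] = len(self.vtable)
            self.vtable.append(impl)

    def invokevirtual(self, name):
        # constant-time dispatch: slot index, then one vtable load
        return self.vtable[self.slots[name]]()

base = VTableClass()
base.define("toString", lambda: "Base")
sub = VTableClass(parent=base)
sub.define("toString", lambda: "Sub")           # overrides the same slot

assert base.invokevirtual("toString") == "Base"
assert sub.invokevirtual("toString") == "Sub"
```

The key property, used by the acceleration schemes discussed here, is that the slot index for a given method identifier is the same in every subclass, so the caller never needs a per-class search.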
It is important to notice that as long as a class has no subclasses, invokevirtual can be executed like invokespecial. The same applies for interfaces with only one implementation. If the overload status of virtual functions is stored in the method definition and updated when new classes are loaded, three types of method invocation instructions can be executed without access to heap data or native functions: non-native invokestatic, invokespecial and invokevirtual with a single implementation. These instructions can be implemented using only a constant pool lookup and, in the case of invokevirtual, a test of the overload status of the method.

The new architecture presented in this paper makes use of an observation about Java programs we made recently. We noticed that the stack of a given Java method is always empty when a return instruction is executed. This feature is not mentioned in the Java virtual machine specification [4], but it seems that not one of the Java compilers we tried generates code where the stack would not be empty. Assuming the stack to always be empty makes the return much simpler, but we were hesitant due to the fact that a class with a non-empty stack during return would still be a legal construct. When a bytecode modification engine [12] was added to the bytecode verification phase, it was noticed that it could be used for emptying the stack if required. The bytecode verification keeps count of the stack at all points of the bytecode, so adding just the required number of pop instructions before an offending return instruction would fix the situation. So far this situation has never been observed, but the check is kept in the verification process for the sake of security and in order to be compliant with all legal Java code.

Returning from a method happens using one of 6 instructions: return, ireturn, freturn, areturn, lreturn or dreturn. These differ only by the data pushed to the stack of the calling method. The first one pushes nothing, while the next three push one word and the last two push two words. Even though the 32-bit versions have several bytecodes reserved, they are implemented using only one mechanism. The difference between these instructions is only used during class loading for verification purposes. The 64-bit instructions are handled similarly. Since the actual returning process is exactly the same for all of the instructions, we only consider the return instruction, and state that the data to be pushed to the calling method stack is stored into temporary registers during the return process.

IV. INVOCATION AND RETURN PROCEDURES

First, let us have a look at what happens in the stack of the virtual machine during a method invocation. Figure 2 shows how the new stack frame is created. Before the actual invocation, the calling method pushes the required parameters to the top of its stack. In the Figure these are shown as Parameters, and the number of them is denoted with the symbol N_params. The symbol N_locals tells how many local variables the new method uses. Note that the parameters become a part of the local variable array for the new method. The symbol X is just a shorthand for N_locals − N_params.

Fig. 2. The effects of the invocation process on the stack.

Now let us review the mechanism presented in [9]. In the following formulas the CallInfo vector comes from the invoker module shown in Figures 3 and 4. In the original architecture the CallInfo was 56 bits long. The SWCTRL symbol is used for control bits that tell both the hardware and the software that some special actions are required during the return phase of the method. An example would be a return to a native method. This situation cannot be handled in the hardware, since the control is returned to the native method executed by the CPU. Please notice that pushing the return info to the stack after the new register values have been calculated updates the ST accordingly.

Formulas for calculating the new register values:

  PC ⇐ 0
  ST ⇐ ST_old − CallInfo(15..0) + CallInfo(31..16)
  CO ⇐ CallInfo(55..32)
  LV ⇐ ST_old − CallInfo(15..0)
  LO ⇐ CallInfo(31..0)

Data pushed to the stack frame (Return Info):

  SWCTRL & PC_old
  ST_old − CallInfo(15..0)
  CO_old
  LV_old
  LO_old

Then we can move on to the new method invocation procedure. Here are the modified invocation formulas. These use the new architecture, so the CallInfo is only 16 bits long, and it is presented in Figure 5.

Formulas for calculating the new register values:

  PC ⇐ 0
  ST ⇐ ST_old + X
  CO ⇐ CallInfo(15..0)
  LV ⇐ ST_old − N_params

Data pushed to the stack frame (Return Info):

  SWCTRL & CO_old
  LV_old & PC_old
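To make the two schemes concrete, the invocation formulas above can be applied to an example call in a few lines of Python. The bit-field layout follows the formulas (old CallInfo: CO in bits 55..32, N_locals in 31..16, N_params in 15..0; new CallInfo: just the 16-bit code offset), while the concrete register values and the code offset are invented for illustration:

```python
# Sketch of the invocation register updates in the old and new schemes.
# Old: everything comes from a 56-bit CallInfo word.
# New: CallInfo is only the code offset; N_params and X = N_locals - N_params
# are fetched from the instruction memory instead of the invoker cache.

def invoke_old(st_old, callinfo):
    nparams = callinfo & 0xFFFF            # CallInfo(15..0)
    nlocals = (callinfo >> 16) & 0xFFFF    # CallInfo(31..16)
    return {
        "PC": 0,
        "ST": st_old - nparams + nlocals,
        "CO": (callinfo >> 32) & 0xFFFFFF, # CallInfo(55..32)
        "LV": st_old - nparams,
        "LO": callinfo & 0xFFFFFFFF,       # N_locals and N_params packed
    }

def invoke_new(st_old, callinfo16, nparams, x):
    return {
        "PC": 0,
        "ST": st_old + x,                  # X = N_locals - N_params
        "CO": callinfo16,
        "LV": st_old - nparams,
    }

# Example: a call with 2 parameters and 5 locals at code offset 0x1234.
old = invoke_old(100, (0x1234 << 32) | (5 << 16) | 2)
new = invoke_new(100, 0x1234, 2, 5 - 2)
assert old["ST"] == new["ST"] == 103
assert old["LV"] == new["LV"] == 98
assert old["CO"] == new["CO"] == 0x1234
```

Both schemes produce the same register state; the difference is where the operands come from and how wide the cached CallInfo has to be.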
Naturally the procedures for performing a return instruction are also simplified in the new architecture. The process of returning from a method can be seen as going in the opposite direction in Figure 2. The original values for most of the registers associated with the stack frame on the left side are restored. Only the ST is modified, to reflect the fact that the parameters consumed by the invoked method have been removed from the stack. The old system performed the following actions during return instructions.

Formulas for calculating the new register values:

  PC ⇐ Data0
  ST ⇐ Data1
  CO ⇐ Data2
  LV ⇐ Data3
  LO ⇐ Data4

where the DataN symbols are retrieved from the stack frame using a separate indexing scheme, which offsets the index by N_locals and then uses the normal local variable loading mechanism inside the co-processor. Altogether this sequence requires 5 data items to be retrieved from the data memory.

The modified architecture gets the same results in a much simpler fashion, using the following formulas:

  PC ⇐ Data0(15..0)
  ST ⇐ LV_old
  CO ⇐ Data1(15..0)
  LV ⇐ Data0(31..16)

Again the DataN symbols are retrieved from the data memory, but now they can be retrieved using the normal pop mechanism. This simplifies the hardware, since the unit handling local variables is no longer required to handle the additional offsetting. The amount of data to be retrieved is also decreased from 5 to just 2 words. This naturally decreases the amount of memory required for return information in each stack frame from 5 to 2 words. Using the pop mechanism is possible since we are now assuming that the stack of the current method is empty when performing the return. Notice also that the new value for the ST is not calculated at all; it is simply the value of the LV in the current method.

Fig. 3. The invoker connected to the ALU and the registers.

The invoker will speed up the invocation of methods that are already loaded into the local memory of the co-processor. When an invocation command is encountered in the ALU, it sends the constant pool index of the method to the invoker module and sets query high. At this time the invoker performs a look-up in the content addressable memory (CAM) using the method id and the code offset as the key, as shown in Figure 4. In the old architecture the code offset was 24 bits long. In the new architecture the software performs a process called constant pool merging, during which the constants defined by a given class are added to a global constant pool instead of a separate pool for each class. This saves memory by merging constants already defined by other classes and also speeds up the constant pool look-up, since finding the constant pool for the current class is not required. This technique also reduces the size of the CAM key, because the code offset of the current method is no longer needed. Only the method id, now in the new unified constant pool, is required. The new structure can be seen in Figure 5. As an additional bonus, the method cache utilization is improved. This happens when invoking one method, let us call it A, from several different classes. The scenario results in only one cache line, while the old architecture would have required a separate line for each method invoking A.

Fig. 4. The original CAM structure.

Fig. 5. The modified CAM structure.

After the key has been found in the CAM, the match address is sent to a normal RAM, which stores the information needed to perform the method call.
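The look-up path just described can be modelled with a toy Python sketch. This is not the hardware: the CAM is modelled as a list of keys, the RAM as a parallel list of CallInfo words, and the replacement pointer mimics the circular-oldest policy used when new keys are saved after a trap. The depth, method ids and CallInfo values are made up for the example:

```python
# Toy model of the invoker: CAM (key search) + RAM (CallInfo per entry).
# A miss models the trap to the host CPU; fill() models caching the
# resolved call, replacing the oldest entry in circular order.

class Invoker:
    def __init__(self, depth):
        self.keys = [None] * depth   # CAM: one key per entry
        self.ram = [None] * depth    # RAM: CallInfo for the matching entry
        self.oldest = 0              # circular-oldest replacement pointer

    def lookup(self, method_id):
        """CAM search; None models the do_trap path to the host CPU."""
        if method_id in self.keys:
            return self.ram[self.keys.index(method_id)]
        return None

    def fill(self, method_id, callinfo):
        """After the trap: cache the key and CallInfo, replacing the oldest entry."""
        self.keys[self.oldest] = method_id
        self.ram[self.oldest] = callinfo
        self.oldest = (self.oldest + 1) % len(self.keys)

inv = Invoker(depth=8)
assert inv.lookup(7) is None         # first call: miss, CPU resolves it
inv.fill(7, 0x1234)                  # invoker caches the resolved call
assert inv.lookup(7) == 0x1234       # later calls are served by hardware
for m in range(100, 108):            # eight fills wrap the circular pointer
    inv.fill(m, m)
assert inv.lookup(7) is None         # the oldest entry has been replaced
```

The sketch also shows why the circular-oldest policy approximates least-recently-used behaviour cheaply: entries are overwritten in insertion order, with no per-access bookkeeping.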
This RAM was 56 bits wide, and consisted of 24 bits for the code offset of the new method, 16 bits for the number of local variables, N_locals, and finally 16 bits for the number of parameters, N_params, taken by the new method. Our improved scheme stores only the code offset for the method to be invoked. The length is now limited to 16 bits. Instead of storing N_locals and N_params in the invoker cache, a different approach is chosen. Namely, N_params and X are stored in the instruction memory, just before the actual Java code for the method. The value of X is calculated during class loading to minimize the computation required during runtime. This strategy minimizes the size of the method cache unit. The increased memory requirement on the instruction side is only one word per method, and since the stack frames have been reduced by 3 words, the net effect is positive even if each method is invoked only once. Naturally, if there are subsequent invocations of a method already loaded into the instruction memory, the net saving resulting from the new architecture is 3 words per invocation. The code offset found from the RAM is also sent to the instruction memory controller, which in turn returns the values of N_params and X to the ALU for use in the invocation. If a key is found, the get_regs signal is set high to indicate a valid match. This triggers the ALU to capture the CallInfo, the N_params and the X, and to calculate the new register values using the rules presented earlier.

In case a match is not found in the CAM, a trap is produced. To indicate this condition to the ALU the do_trap signal is set high. Upon receiving this signal the ALU sets the trap signal high to the communication module, and finally the host CPU performs the needed actions to start execution of the new method. At the same time the invoker module saves the key to the CAM. When the execution resumes after the trap, the invoker module captures the required register values and saves them to the RAM. Now the invoker is ready to speed up execution in case the same method is called again. When the invoker module saves a new key to the CAM, it uses a circular-oldest algorithm to choose which entry to replace. This scheme provides a reasonably close approximation of the least recently used algorithm with very low complexity.

The invoker module can also clear its contents. This is required for situations where a virtual method has been cached to the module, and a new overloading virtual method needs to be loaded. Overloading of methods causes them to fall out of the cache, because selecting the implementation for a specific call requires access to heap data. The host CPU is better suited for this kind of task, so it is assigned there.

The module was integrated into our REALJava co-processor prototype with a depth of 8 entries. This depth was chosen because the statistics presented in [9] show that this size provides the highest impact on performance with the least resources. The prototype is based on a Xilinx ML410 demonstration board. This board provides all the services one might expect of a computer, such as a network controller, a hard drive controller, a PCI bus and so on. The FPGA chip is a Virtex4FX, which includes two hardcore PowerPC CPUs. The co-processor is connected to the CPU via the Processor Local Bus (PLB 3.4). The system runs the co-processor at 100 MHz, while the PowerPC CPU runs at 300 MHz. The CPU runs Linux 2.4.20 as the operating system, providing services (network, filesystem, etc.) to the virtual machine. For more details on the prototype, please see [8] and [10]. The system has also been implemented on a Virtex5 based board. This configuration used the newer PLB bus (4.6) as the communication channel and a MicroBlaze as the CPU. That CPU provides considerably less arithmetic performance, since it is a softcore processor implemented using FPGA resources and runs at 100 MHz. The larger FPGA chip allowed us to include eight co-processor cores in the system. Unfortunately we have only implemented the new invocation architecture on that platform, so we do not present detailed results for it in this paper.

  Processor                 REALJava (old)   REALJava (new)   Kaffe on PPC   Units   Gain %
  Engine speed              100              100              300            MHz     N/A
  Simple call               3125000          6666666          59453          1/s     113.3
  Instance call             713042           1160730          19460          1/s     62.8
  Synchronized call         366473           564810           15567          1/s     54.1
  Final call                671343           1097950          18090          1/s     63.5
  Class call                671255           1248860          18847          1/s     86.0
  Synchronized class call   260401           350181                          1/s     34.5
  Salesman                  11438            9027             111824         ms      26.7
  Sort                      40569            31386            856684         ms      29.3
  Raytrace                  7205             5494             169646         ms      31.1
  EmbeddedCaffeineMark      156              231              10                     48.1
  EmbeddedCaffeineMark ND   184              279              11                     51.6

TABLE II: Results from various benchmarks.

V. RESULTS

The results in Table II show that the invoker module has a significant impact on the execution times of the benchmarks. In the table, REALJava (old) stands for a configuration with the original invoker, REALJava (new) stands for a configuration with the improved invoker, and Kaffe on PPC is the Kaffe Virtual Machine running on the same PowerPC processor. REALJava, even though running at a lower clock speed, clearly outperforms Kaffe in all of the benchmarks. The Gain is the percentage of improvement achieved with the improved invoker module.

The first set of benchmarks is a collection of method call tests. They measure mostly the method invocation performance, and do not include (significant amounts of) arithmetics.
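The flavour of these call tests can be illustrated with a short sketch, shown here in Python rather than Java and with an invented iteration count: it times a tight loop of calls to an empty method and reports calls per second, which is the unit used for the call rows of Table II.

```python
import time

class Target:
    def empty(self):
        # an empty method: the loop measures pure call overhead
        pass

def calls_per_second(n=100000):
    obj = Target()
    start = time.perf_counter()
    for _ in range(n):
        obj.empty()
    elapsed = time.perf_counter() - start
    return n / elapsed

rate = calls_per_second()
assert rate > 0   # the absolute number depends entirely on the host
```

The absolute rates are of course not comparable to the hardware figures; the sketch only shows the shape of the measurement.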
The first one simply calls an empty method and then returns without any processing inside the invoked method. The next 5 are taken from the Java Grande Suite [17] to show the performance gains for various method types. These benchmarks contain a few Java instructions inside the invoked methods, so some time is spent performing actual arithmetics. The arithmetic speed of the JPU is exactly the same for both versions, which explains the lower gain percentages in these tests when compared to the simple call test.

The next set of benchmarks is a collection of tests that have been written to evaluate real life performance. The benchmark programs do not contain any special optimizations for our hardware. Short descriptions of the benchmarks follow. Salesman solves the traveling salesman problem using a naive try-all-combinations method, Sort tests array handling performance by creating arrays of random numbers and then sorting them, and Raytrace renders a 3D sphere above a plane. As the benchmarks emphasize different aspects of the system, together they should give a rather good estimation of the different practical applications that might be found on an embedded Java system. The results show a 26 to 31 percent improvement in the execution speed with the new invocation module.

Several websites and research papers dedicated to Java execution have used the CaffeineMark as a performance measurement. The CaffeineMark is also available as an embedded version, which omits the graphical tests from the desktop version. The test scores are calibrated so that a score of 100 equals the performance of a desktop computer with a 133 MHz Intel Pentium class processor. The individual tests cover a broad spectrum of applications. Since the REALJava is intended for embedded systems, we also calculated the scores without the floating point sub-test. These scores are reported in Table II on the line marked with ND (No Double arithmetics). These results are marked with italics because they were measured using a new version of the software partition of the REALJava virtual machine, which contains some modifications besides the invocation architecture. Because of this, the results do not give an accurate view of the effect of the new invocation architecture. For reference we give the scores for the Virtex5 based system, which are 142 and 198 for the embedded CaffeineMark with and without double arithmetics. These results show a decrease from the PowerPC based system, which is due to the significantly slower CPU. Naturally this test was run using only one core on that system, although eight of them could be used in parallel. More results can be found at our results site [13]; the invocation architecture was changed and fine tuned between versions 2.09 and 3.01 of the REALJava.

VI. CONCLUSIONS AND FUTURE WORK

An improved strategy for accelerating method calls in Java using a hardware module was presented. The module was implemented on a Xilinx FPGA to provide several benchmarks and show significant improvement in both specialized and more general benchmarks. In addition to the improved performance, the new architecture reduces the size of the stack frames, thus reducing the overall memory requirements for the co-processor. Also the hardware is simplified, since the LO register and the offsetting mechanism for local variables were removed.

We plan to continue refining the REALJava virtual machine. Currently we are mostly focusing on improvements to the software partition, but the hardware is also evolving at the same time. On the hardware side the most interesting new topic we are studying is making the co-processor core into a reconfigurable module and providing system level support for dynamically adding and removing co-processors as needed. This kind of a system could better utilize the resources on a given FPGA by providing several special purpose cores to be used based on the user application.

ACKNOWLEDGMENT

The authors would like to thank the Academy of Finland for their financial support of this work through the VirtuES project.

REFERENCES

[1] S. Byrne, C. Daly, D. Gregg and J. Waldron, "Dynamic Analysis of the Java Virtual Machine Method Invocation Architecture", in Proc. WSEAS 2002, Cancun, Mexico, May 2002.
[2] O.-J. Dahl and B. Myhrhaug, "Simula Implementation Guide", Publication S 47, Norwegian Computing Center, March 1973.
[3] J. Lee, B. Yang, S. Kim, K. Ebcioğlu, E. Altman, S. Lee, Y. C. Chung, H. Lee, J. H. Lee, and S. Moon, "Reducing virtual call overheads in a Java VM just-in-time compiler", SIGARCH Comput. Archit. News 28, 1, pp. 21-33, March 2000.
[4] T. Lindholm and F. Yellin, "The Java Virtual Machine Specification", Second Edition, Addison-Wesley, 1999.
[5] T. Säntti and J. Plosila, "Communication Scheme for an Advanced Java Co-Processor", in Proc. IEEE Norchip 2004, Oslo, Norway, November 2004.
[6] T. Säntti and J. Plosila, "Architecture for an Advanced Java Co-Processor", in Proc. International Symposium on Signals, Circuits and Systems 2005, Iasi, Romania, July 2005.
[7] T. Säntti, J. Tyystjärvi and J. Plosila, "Java Co-Processor for Embedded Systems", in Processor Design: System-on-Chip Computing for ASICs and FPGAs, J. Nurmi, Ed., Kluwer Academic Publishers / Springer Publishers, 2007, ch. 13, pp. 287-308, ISBN-10: 1402055293, ISBN-13: 978-1402055294.
[8] T. Säntti, J. Tyystjärvi and J. Plosila, "FPGA Prototype of the REALJava Co-Processor", in Proc. 2007 International Symposium on System-on-Chip, Tampere, Finland, November 2007.
[9] T. Säntti, J. Tyystjärvi and J. Plosila, "A Novel Hardware Acceleration Scheme for Java Method Calls", in Proc. ISCAS 2008, Seattle, Washington, USA, May 2008.
[10] T. Säntti, "A Co-Processor Approach for Efficient Java Execution in Embedded Systems", Ph.D. thesis, University of Turku, November 2008. (https://oa.doria.fi/handle/10024/42248)
[11] J. Tyystjärvi, "A Virtual Machine for Embedded Systems with a Co-Processor", M.Sc. thesis, University of Turku, 2007.
[12] J. Tyystjärvi, T. Säntti and J. Plosila, "Instruction Set Enhancements for High-Performance Multicore Execution on the REALJava Platform", in Proc. NORCHIP 2008, Tallinn, Estonia, November 2008.
[13] "BenchMark Results", consulted 18 August 2010, http://vco.ett.utu.fi/~teansa/REALResults.
[14] "CaffeineMark 3.0", consulted 18 August 2010, http://www.benchmarkhq.ru/cm30/.
[15] "Embedded Java Book Index", consulted 18 August 2010, http://www.practicalembeddedjava.com/.
[16] "GNU Classpath", consulted 18 August 2010, http://www.gnu.org/software/classpath/.
[17] "Java Grande Benchmarking", consulted 18 August 2010, http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/.