This document describes a market data vendor that processes data from exchange feeds and distributes it to customers. It delivers millions of quotes per second using a pure Java solution, QDS Core, which parses, normalizes, and distributes the data. QDS Core relies on optimized data structures such as flat arrays and on lock-free synchronization to achieve high performance. The vendor also provides an easier-to-use API, dxFeed, built on top of QDS Core to simplify integration.
2. Market Data Rates
[Chart: market data rates over time, in messages per second on a scale from 0 to 10,000,000, for two feeds: US Equities, Indexes and Futures; and OPRA]
3. Market Data Vendor
• Process data coming from exchange data feeds
- Parse
- Normalize
• Distribute data to customers
- Gather into a single feed
- Store and retrieve (for onDemand historical requests)
- Serialize and transfer
- Scatter to multiple consumers based on actual subscription
4. dxFeed High Level Picture
[Diagram: two ticker plants connected by a 10Gbit resilient redundant connectivity infrastructure]
• Chicago ticker plant: CME, CBOT, NYMEX, COMEX, ICE Futures U.S., CBOE, TSX, TSXV, MX
• New York ticker plant: NYSE, AMEX, NASDAQ, ISE, OPRA, FINRA, PinkSheets
• Customer connection point: direct cross-connect, SFTI, TNS, SAVVIS, BT Radianz, or Internet
5. A Bit of History
• Devexperts was founded in 2002
- as an Upscale Financial IT company
• QDS project was born in 2003
- to address market data distribution problem
- in a high-performance way (the initial design goal was 1M mps)
• dxFeed service was launched in 2008
- to provide our customers with live market data directly from
exchanges, using QDS for distribution
• dxFeed API was created on top of QDS in 2009
- to provide an easier customer-facing API and enable 3rd party
developers to integrate their code with dxFeed
6. [Word cloud of Java's strengths: threads, portability, community, developers, garbage collection, libraries and frameworks, backwards compatibility, refactoring, type safety, open source, memory model, reflection, productivity, tools, readability, HotSpot JIT, byte-code manipulation, simplicity, the most popular language]
8. Java object layout
• A String[] filled with some strings in Java
[Diagram: the String[] object consists of a header, a size field, and the element slots [0], [1], [2], [3], …; each slot holds a reference to a String object (header, value reference, cached hash), which in turn references a separate char[] object (header, size, characters) holding the actual characters, e.g. 'T', 'E', 'S', 'T']
10. Memory layout solution
• Prefer array-based data structures to linked ones
- Most Java programs get an immediate performance boost just by replacing all uses of LinkedList with ArrayList
• Use Java arrays or the ByteBuffer classes where it matters
- They are guaranteed to be contiguous in memory
- Lay out your data into arrays manually
• That's how the QDS core is designed
- All its critical data structures are rolled onto int[] and Object[]
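The "lay out your data into arrays" idea can be sketched as a struct-of-arrays table. The QuoteTable class and its three-int record layout below are illustrative assumptions, not the actual QDS schema:

```java
// Sketch: storing fixed-size "quote" records in one flat int[] instead of
// allocating one object per quote. Records are addressed by index, not by
// reference, so there is one array object for the GC instead of thousands.
public class QuoteTable {
    private static final int FIELDS = 3;       // record layout: [symbolId, price, size]
    private int[] data = new int[16 * FIELDS];
    private int count;

    /** Appends a record and returns its index. */
    public int add(int symbolId, int price, int size) {
        if ((count + 1) * FIELDS > data.length) {
            int[] bigger = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, data.length);
            data = bigger;                     // grow geometrically, like ArrayList
        }
        int base = count * FIELDS;
        data[base] = symbolId;
        data[base + 1] = price;
        data[base + 2] = size;
        return count++;
    }

    public int price(int index) { return data[index * FIELDS + 1]; }
    public int size(int index)  { return data[index * FIELDS + 2]; }
}
```

Fields are accessed by computed offset, which keeps the whole table contiguous in memory and friendly to the CPU cache.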
11. byte[] vs ByteBuffer
• byte[] is always heap-based
- Faster for byte-oriented access
• ByteBuffer can be either "heap" or "direct"
- Be especially careful with direct ByteBuffers
• If you don't pool them, you may run out of native memory before the Java GC has a chance to run
- Can be faster for short-, int- or long-oriented access via the get/putXXX methods
• But make sure you use the native byte order (BIG_ENDIAN is the default)
- Direct ByteBuffers don't need an extra buffer copy when doing input/output with NIO
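A minimal sketch of the byte-order advice above; NativeOrderBuffer is an illustrative helper name, not an API from this deck:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: allocating a direct ByteBuffer in native byte order for
// int-oriented access. ByteBuffer defaults to BIG_ENDIAN, which costs a
// byte swap on every access on little-endian x86 hardware.
public class NativeOrderBuffer {
    public static ByteBuffer create(int capacity) {
        ByteBuffer buf = ByteBuffer.allocateDirect(capacity);
        buf.order(ByteOrder.nativeOrder());    // avoid per-access byte swapping
        return buf;
    }

    /** Writes and reads back an int to demonstrate the absolute get/put API. */
    public static int roundTrip(int value) {
        ByteBuffer buf = create(4);
        buf.putInt(0, value);                  // absolute put: no position change
        return buf.getInt(0);
    }
}
```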
14. Garbage collection
• Makes your code much easier
- to design
- to debug
- to maintain
• GC performs really well when
- Objects are very short-lived
• They are not promoted to old gen
• They are reclaimed by high-throughput scavenge GC
- Objects are very long-lived and are not modified, or contain only primitives
• Scavenge GC does not waste time scanning them
15. Object allocation
• Allocation of small objects is fast
- new String() is ~20 bytes on 64bit VM with compressed oops
• not counting char[] object inside of it
- ~4.5ns per allocation (on 2.6GHz i5)
• But becomes slower when you include amortized GC cost
• And can become much slower if you
- have big static memory footprint
- have “medium-lived” objects
- have lots of threads (and thus a lot of GC roots and coordination)
- use references (java.lang.ref) a lot
- mutate your memory a lot, especially references (GC card marking)
16. Manual memory management
• Wherever you would consider manual memory management (custom object pools) in native code, consider doing the same in Java
• General advice
- Pool large objects
• They are expensive to allocate and expensive for the GC to collect
- Avoid small objects
• Especially "medium-lived" ones
• Lay them out into arrays if you need to store them
18. Object allocation action plan (1)
• Watch the percentage of time your system spends doing GC
- -verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
- “jconsole” and “jvisualvm” tools show this information
- It is available programmatically via GarbageCollectorMXBean
• At Devexperts we collect this data and report (push) it in real time via MARS (Monitoring and Reporting System), using a dedicated JVMSelfMonitoring plugin
• Our support team has alerts configured for a high GC percentage in our systems
• Act when it becomes too big
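The programmatic route via GarbageCollectorMXBean mentioned above can be sketched in a few lines; GcTimeProbe is an illustrative name, and the real JVMSelfMonitoring plugin is certainly more elaborate:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: reading cumulative GC time programmatically — the same numbers
// a monitoring plugin could push to an external system in real time.
public class GcTimeProbe {
    /** Total milliseconds spent in all collectors since JVM start. */
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 if undefined for this collector
            if (t > 0)
                total += t;
        }
        return total;
    }
}
```

Sampling this value periodically and dividing the delta by wall-clock time gives the GC percentage to alert on.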
19. Object allocation action plan (2)
• Tune GC to reduce overhead without code changes
• Identify the places where most allocations take place and optimize them
- Use off-the-shelf Java profilers
- Use Devexperts aprof for a full allocation picture at production speed
http://code.devexperts.com/display/AProf/
20. Object reuse and sharing
• Pooling small objects is often a bad idea
- Unless you are trying to quickly speed up code that heavily relies on lots of small objects
- It's better to get rid of small objects altogether
• If you see boxing in performance-critical code, get rid of it
• But reusing / sharing small objects is great
- Strings are a typical candidate in data-processing code
• Common pitfalls (don't use these unless you fully understand them)
- String.intern
- WeakReference
22. String I/O
• Strings are often duplicated in memory
• Reading any string-denoted data from a database, a file, or the network produces new strings
• Where performance matters, reuse strings
- For example see StringCache class from
http://docs.dxfeed.com/dxlib/api/com/devexperts/util/StringCache.html
- The key method is get(char[])
• You can reuse char[] where data is read
• And get an instance of String from cache if it is there
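A toy version of this idea might look as follows. TinyStringCache is NOT the real com.devexperts.util.StringCache (which hashes the char[] content directly and so avoids allocating on a cache hit); it only illustrates returning a shared String instance so the read buffer can be reused:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of string reuse: look up by char[] content and return the cached
// String instance if one exists, so callers keep reusing one read buffer.
// A real implementation would hash the char[] directly to avoid even the
// temporary String allocation on a cache hit.
public class TinyStringCache {
    private final Map<String, String> cache = new HashMap<>();

    public String get(char[] chars, int offset, int length) {
        String candidate = new String(chars, offset, length); // copies the chars
        String cached = cache.putIfAbsent(candidate, candidate);
        return cached != null ? cached : candidate;           // shared instance
    }
}
```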
23. Radical object / reference elimination
• Unroll complex objects into arrays
- For example, a collection of strings can be represented in a single
byte[]
• Renumber shared object instances
- Represent string reference as int
- That's what the QDS core does for efficient String manipulation
• Faster to compare
• Faster to hash
• Avoids slower “modify reference” operations (marks GC cards)
- But requires hand-crafted memory management
• QDS does reference counting, but custom GC is also feasible
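The renumbering idea can be sketched as a registry that maps each distinct string to a small int id (SymbolRegistry is a hypothetical name; the real QDS scheme with reference counting is more involved):

```java
import java.util.ArrayList;
import java.util.HashMap;

// Sketch: renumber shared string instances so references become small ints.
// Ints are cheaper to compare and hash, and storing an int field never
// marks a GC card the way storing a reference does.
public class SymbolRegistry {
    private final HashMap<String, Integer> ids = new HashMap<>();
    private final ArrayList<String> symbols = new ArrayList<>();

    /** Returns the id for this symbol, assigning a new one on first use. */
    public int id(String symbol) {
        Integer id = ids.get(symbol);
        if (id == null) {
            id = symbols.size();
            symbols.add(symbol);
            ids.put(symbol, id);
        }
        return id;
    }

    /** Resolves an id back to its string. */
    public String symbol(int id) { return symbols.get(id); }
}
```

With this in place, the rest of the system can pass int ids around and compare them with `==` instead of `String.equals`.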
24. Hardcore optimization
• Use sun.misc.Unsafe when everything else fails
- It gives you full native speed
- But no range checks or type safety
• You are on your own!
- Good fit for integration with native data structures when needed
• The QDS core uses it in a few places
- Mainly to provide wait-free execution guarantees with an appropriate
synchronization for array-based data structures
- But there is a fallback code for cases when sun.misc.Unsafe is not
available
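The deck does not show the QDS code itself, but the portable fallback style it mentions can be sketched with AtomicIntegerArray, which provides compare-and-set on array elements without sun.misc.Unsafe (at the cost of range checks):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of the portable alternative to sun.misc.Unsafe: CAS on array
// elements via AtomicIntegerArray. Slightly more overhead than raw Unsafe,
// but type-safe and range-checked, and it works everywhere.
public class LockFreeCounters {
    private final AtomicIntegerArray counts;

    public LockFreeCounters(int slots) { counts = new AtomicIntegerArray(slots); }

    /** Lock-free increment of slot i; retries until the CAS succeeds. */
    public int increment(int i) {
        for (;;) {
            int cur = counts.get(i);
            if (counts.compareAndSet(i, cur, cur + 1))
                return cur + 1;
        }
    }

    public int get(int i) { return counts.get(i); }
}
```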
25. Even more hardcore – hand-written SMT
• If you have to use linked data structures
- Consider traversing multiple linked lists simultaneously in the same
thread
- Akin to hardware SMT, but in software
- The code becomes much more complicated
- But the performance can considerably increase
* Not a Java-specific optimization, but fun to mention here
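A minimal sketch of the software-SMT traversal: summing two linked chains in one loop, so that a cache miss while chasing one chain can overlap with useful work on the other (Node is an illustrative type; real gains depend on the hardware and the data):

```java
// Sketch: traverse two linked lists simultaneously in the same thread.
// Interleaving the pointer chases lets the CPU overlap the memory stalls,
// akin to hardware SMT but done in software.
public class DualTraversal {
    public static final class Node {
        final int value;
        final Node next;
        public Node(int value, Node next) { this.value = value; this.next = next; }
    }

    public static long sumBoth(Node a, Node b) {
        long sum = 0;
        while (a != null && b != null) {                  // interleaved traversal
            sum += a.value + b.value;
            a = a.next;
            b = b.next;
        }
        for (; a != null; a = a.next) sum += a.value;     // leftover of chain a
        for (; b != null; b = b.next) sum += b.value;     // leftover of chain b
        return sum;
    }
}
```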
27. Threads and scalability
• Share data across the threads to further reduce memory footprint
- But carefully design and implement this sharing
• Learn and love the Java Memory Model
- It makes your correctly-synchronized multi-threaded code fully
portable across CPU architectures
• The QDS core is a thread-safe data structure with a mix of lock-free, fine-grained and coarse-grained locking approaches, which makes it vertically scalable
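As a small example of the correctly-synchronized sharing that the JMM makes portable, a volatile field safely publishes an immutable object to reader threads (Publisher and Config are illustrative names, not QDS classes):

```java
// Sketch: safe publication under the Java Memory Model. The volatile
// write/read pair creates a happens-before edge, so a reader that sees
// 'published' also sees the fully constructed Config — on every CPU
// architecture the JVM runs on.
public class Publisher {
    static final class Config {
        final int port;                        // final fields strengthen this further
        Config(int port) { this.port = port; }
    }

    private volatile Config published;         // the publication point

    public void publish(int port) { published = new Config(port); }

    /** Returns -1 until a config has been published. */
    public int port() {
        Config c = published;                  // single volatile read
        return c == null ? -1 : c.port;
    }
}
```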
28. Be careful with threads and locks
• Thread switches introduce considerable latency (~20µs)
• Lock contention forces even more thread switches
• It is not a Java-specific concern, but a common Java-specific problem, since Java makes threads easy for programmers to use (and many do use them)
[Diagram: a contended lock sequence between two threads — 1. Enter lock, 2. Context switch, 3. Try to lock, 4. Context switch, 5. Exit lock, 6. Context switch and enter lock]
30. Data flow for horizontal scalability
[Diagram: a Multiplexor receives IBM and GE ticks upstream and fans them out to two downstream QDTicker nodes according to their actual subscriptions. The upstream subscription is the union: IBM, GE, QQQQ, MSFT, INTC, SPX. One QDTicker subscribes to IBM, GE, QQQQ, MSFT and receives the IBM and GE ticks; the other subscribes to GE, INTC, SPX and receives only the GE ticks.]
32. HotSpot Server VM
• Run “java -server” (it is the default on server-class machines)
• Does
- Very deep code inlining
- Loop unrolling
- Optimization of virtual and interface calls based on the collected profile
- Escape analysis for synchronization and allocation elimination
• Embrace it!
- Don't fear writing your code in a nice object-oriented way
• In most cases, that is
• Do still avoid too much "object orientation" in the most performance-sensitive places
33. HotSpot challenges
• It is harder to profile, stress-test, and tune code
- You need to "warm up" the code to get meaningful results
- Small changes in code can lead to big differences that are hard to explain
- Compilation of less busy code can trigger at any time and cause unexpected latency spikes
• Don't do micro-tests
- Test the whole system together instead
• Do micro-tests
- To learn which code patterns are better across the board
- Small savings add up
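The warm-up advice can be sketched as below; for serious work a harness such as JMH is the better tool, and this only illustrates the idea:

```java
// Sketch of a warmed-up micro-test: run the workload enough times for
// HotSpot to compile it before the timed runs, and keep the result live
// so the JIT cannot eliminate the work as dead code.
public class WarmedBenchmark {
    static long workload(int n) {              // illustrative workload
        long s = 0;
        for (int i = 0; i < n; i++) s += i * 31L;
        return s;
    }

    /** Average nanoseconds per run after warm-up. */
    public static long timeNanos(int n, int warmupRuns, int timedRuns) {
        long sink = 0;
        for (int i = 0; i < warmupRuns; i++)
            sink += workload(n);               // warm-up: let the JIT kick in
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++)
            sink += workload(n);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print("");  // keep 'sink' live, defeat DCE
        return elapsed / timedRuns;
    }
}
```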
35. Looking at generated assembly code
• -XX:+UnlockDiagnosticVMOptions
-XX:CompileCommand=print,*<class-name>.<method-name>
-XX:PrintAssemblyOptions=intel
• You will need “hsdis” library added to your JRE/JDK with the actual
disassembler code
- But you have to build it yourself:
http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/tip/src/share/tools/hsdis/README
36. Use native profilers
• Java profilers are great tools, but they don't use processor performance counters and cannot recognize problems such as memory pressure
- And they don't always produce a clear picture
- All "CPU time" is reported at the nearest "safe point", not at the actual code line that consumed the CPU
• Use native profilers to figure it out
- Sun Studio Performance Analyzer
- Intel VTune Amplifier
- AMD CodeAnalyst
38. General (1)
• Classic data structures and algorithms
- Use CPU and memory efficient data structures and algorithms
- Know and love hash tables
• They are the most useful data structure in a typical business
application
• Lock-free data structures will help you to scale vertically
• Every byte counts. Remember about bytes.
- QDS core compactly represents data as 4-byte integers while working
with them in memory
- QDS uses compact byte-level compression on the wire
- Even more compact bit-level compression is used in long-term store
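The deck does not specify the wire format, but byte-level compact encoding of ints is commonly done with a base-128 varint like the sketch below (an assumption for illustration, not the actual QDS format):

```java
// Sketch of byte-level compact encoding: small non-negative ints take one
// byte, larger ones take more. Each byte carries 7 value bits; the high
// bit flags that more bytes follow (a standard base-128 varint).
public class VarInt {
    /** Writes v into buf starting at pos; returns the new position. */
    public static int write(byte[] buf, int pos, int v) {
        while ((v & ~0x7F) != 0) {
            buf[pos++] = (byte) ((v & 0x7F) | 0x80);  // continuation bit set
            v >>>= 7;
        }
        buf[pos++] = (byte) v;                        // final byte, high bit clear
        return pos;
    }

    /** Reads the varint starting at pos and returns its value. */
    public static int read(byte[] buf, int pos) {
        int shift = 0, result = 0;
        while (true) {
            byte b = buf[pos++];
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
            shift += 7;
        }
    }
}
```

Values below 128 cost one byte instead of four, which is where the wire-level savings come from for typical small sizes and deltas.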
39. General (2)
• Burst handling
- Process data in batches to amortize batch overhead across messages
- QDS increases batch size under load to decrease overhead
• Architecture
- Use layers
- Lower layers of the architecture should generally be used in more places and be more optimized
- The outer layer, the dxFeed API, is the easiest one to use and understand, and the most object-oriented, but the least optimized
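The batch-handling point above can be sketched as a drain loop that takes whatever has accumulated in one go, so per-batch overhead is shared across more messages under load (BatchDrainer is an illustrative helper, not a QDS class):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch: amortizing per-batch overhead by draining everything queued so
// far into one batch. Under load more messages accumulate between drains,
// so batches grow automatically — the effect described in the slide.
public class BatchDrainer {
    /** Drains up to maxBatch queued messages into one list. */
    public static <T> List<T> nextBatch(Queue<T> queue, int maxBatch) {
        List<T> batch = new ArrayList<>();
        T msg;
        while (batch.size() < maxBatch && (msg = queue.poll()) != null)
            batch.add(msg);                    // grab everything available now
        return batch;
    }
}
```

The caller then pays the fixed cost (locking, syscalls, headers) once per batch instead of once per message.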
40. Architecture layers
[Diagram: architecture layers, top to bottom]
• JS API
• dxFeed API, Tools, Gateways
• QDS Core
• Transport Protocol (ZLIB, SSL)
• Sockets, NIO, Files, etc.
44. QDS API Summary
• Pros
- High-performance design
- Flexible (can be used in various ways)
• QDS Multiplexor is an application on top of QDS API
• As well as all other command-line QDS tools
- Extensible with clear separation of interfaces and implementation
• Cons
- Verbose, lots of code to do simple things
- Error-prone (easy to get wrong and to introduce subtle bugs)
• Everybody needs Quote, Trade, etc. with an easy-to-use API
- Hence, dxFeed API was born