An embedded system usually involves low-level languages like C and highly customized hardware. This talk presents a use case: a soft real-time system developed with a very different approach and written in Go. We will look at the advantages of this choice, along with its limits.
4. Quality Classifier
- Industrial machines
- Quality features
  - Color
  - Weight
  - Defects
  - Shape
  - ...
- Classification
  - Grouping items together, according to quality features
Photo by Kate Buckley on flickr
6. Specs & Outline
- 100 lanes
- 20 items/sec per lane
- 2000 items/sec overall (100 lanes x 20 items/sec)
- 10 exits per lane
- Industrial scale
Photo by Chris Chadd from Pexels
[Diagram labels: Feeder, Sensor, Classify, Rotary Encoder, Ejector, Exit, Lane]
7. Need for Precision
- Items are eventually ejected
  - Precise timing of ejection
  - Precision of 250 us
  - Multiple exits
- Real-time operating systems are usually used
  - Higher determinism
9. Our Machine Layout
[Diagram labels: sensors, IO, BL (data in, data out), checkpoint, IO, ejectors, exit 1, exit 2]
- Boards
  - BL: Business Logic
  - IO: Input/Output
- Business Logic
  - Acquires data from sensors
  - Manages every lane
- Network traffic is heavy
  - Up to 250 sensors
  - Up to 2000 items per second
- Checkpoint
  - Trigger for classification
10. The Challenge
Canonical Way
- RTOS kernel
- Custom hardware & boards
- CAN bus communication
- Single board
- Bare-metal C firmware
11. The Challenge: Linux and Go
Our Solution
- Standard GNU/Linux kernel
- Standard hardware components
- Ethernet-based communication
- Distributed system
- Go language
12. Why Linux?
GNU/Linux
- Real time processes
- Microprocessor boards
- No Safety Certifications
- Plenty of Drivers
- Separation of competences
- Debug on desktop with tools
- Many Languages & Libraries
RTOS
- Tasks with priorities
- Microcontrollers (no MMU)
- Safety and Certifications
- Limited number of Drivers
- Single big application
- Debug on hardware boards
- Few Languages and Libraries
13. Network connections
- "BL": single Business Logic board
  - Freescale i.MX 6, Quad Core ARM Cortex A9 @ 1.2 GHz
  - Performs the item classification for every lane
- "IO": multiple Input/Output boards
  - Develboard Atmel, Single Core ARM @ 600 MHz
  - Digital inputs and outputs
  - Multiple sensors
- Ethernet bus with standard switches/routers
[Diagram: star topology, BL and three IO boards connected through an Ethernet switch]
14. Different topology with Linux software bridge
[Diagram: star topology vs. serial topology, each with one BL and three IO boards]
- Simplified cabling
- Software bridge: 15% CPU
15. Latency for Soft Real Time
- Kernel driver with a precision of 250 us
  - DMA + double buffering
  - Buffer has a duration of 100 ms
  - Actual precision of 66 us
- Queue of scheduled activations
  - User space software writes activations to the kernel driver
- Soft real time latency
  - 100 ms + queue management ~= 150 ms
  - System can't react faster than 150 ms (e.g. change of speed)
16. Rotary encoder
- Lanes are physically bound
  - Multiple encoders in case of big machines
- Encoder steps
  - Square waves
  - 2000 steps/revolution
- Kernel driver
  - Parameters exported in sysfs
[Diagram: encoder output signals A, B, Z]
17. Linear Interpolation
- We cannot synchronize thousands of times per second over Ethernet
  - Synchronization every 100 ms
- Linear interpolation
  - Encoder accelerations are "slow" because the encoder is bound to a mechanical transport
  - Works around the lack of a real time protocol
[Plot: encoder #step vs. time t, real curve vs. interpolated curve]
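A minimal sketch of the idea, not the code used on the machine: between two synchronizations the IO board can estimate the current encoder step from the last two samples, assuming the speed stays constant over the 100 ms window. The `syncSample` type and `interpolateStep` function are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// syncSample is a hypothetical record of one encoder synchronization:
// the step counter and the (synchronized) time at which it was read.
type syncSample struct {
	at   time.Duration // time since an arbitrary epoch
	step float64       // encoder step counter at that time
}

// interpolateStep estimates the encoder step at time t from the two most
// recent samples, assuming constant speed since the last synchronization.
func interpolateStep(prev, last syncSample, t time.Duration) float64 {
	dt := last.at - prev.at
	if dt <= 0 {
		return last.step // no usable slope, fall back to the last known value
	}
	return last.step + (last.step-prev.step)*float64(t-last.at)/float64(dt)
}

func main() {
	prev := syncSample{at: 0, step: 1000}
	last := syncSample{at: 100 * time.Millisecond, step: 1200}
	// 50 ms after the last synchronization, at 200 steps per 100 ms.
	fmt.Println(interpolateStep(prev, last, 150*time.Millisecond)) // prints "1300"
}
```

Because the transport accelerates slowly, the error of this straight-line estimate stays small until the next synchronization corrects it.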
18. BL/IO Clock Synchronization
- Activation messages are marked with a specific timestamp
  - We need to synchronize clocks
- Usually, NTP is used
  - Precision of ~milliseconds => not enough for us
  - We need a precision of (at least) 250 us
- Precision Time Protocol, IEEE 1588 (PTP)
  - Two PTP timestamping models: hardware or software
  - Software timestamping: kernel interrupt => precision of ~microseconds
  - Hardware timestamping: Ethernet interface => precision of ~nanoseconds
  - Develboard supports IEEE 1588, but software timestamping is enough
20. Basic Advantages
- Simple language, clients got used to it very quickly
- Simple documentation and maintenance
- Static binaries
- Large ecosystem of libraries
- Concurrent programming
- Easy cross compilation
  - Embedded (ARM)
  - Windows
  - Linux
21. Embedded: Go vs C++
- Stack trace and data race analysis
  - Valgrind slows down performance
- Debug tools
  - Remote debugging and system analysis (gdb vs pprof)
- Linter and code analysis
  - Easier to integrate static analysis tools (e.g. golint, go vet)
- Tags (go build -tags ...)
  - Useful for embedded apps and stubs
  - Cleaner approach compared to #ifdef
22. Fine tuning: Disassembly
- go tool objdump -s main.join -S <binary_name>

func join(strings []string) string {
  0x8bae0 e59a1008 MOVW 0x8(R10), R1
  0x8bae4 e15d0001 CMP R1, R13
  0x8bae8 9a00001a B.LS 0x8bb58
  0x8baec e52de024 MOVW.W R14, -0x24(R13)
  0x8baf0 e3a00000 MOVW $0, R0
  0x8baf4 e3a01000 MOVW $0, R1
  0x8baf8 e3a02000 MOVW $0, R2
	for _, str := range strings {
  0x8bafc ea00000f B 0x8bb40
  0x8bb00 e58d0020 MOVW R0, 0x20(R13)
  0x8bb04 e59d3028 MOVW 0x28(R13), R3
  0x8bb08 e7934180 MOVW (R3)(R0<<3), R4
  0x8bb0c e0835180 ADD R0<<$3, R3, R5
  0x8bb10 e5955004 MOVW 0x4(R5), R5

package main

import (
	"fmt"
	"os"
)

func join(strings []string) string {
	var ret string
	for _, str := range strings {
		ret += str
	}
	return ret
}

func main() {
	fmt.Println(join(os.Args[1:]))
}
23. How do we perform tests on Embedded?
1. Unit tests
2. Full integration tests
   - Integration framework
   - Mocking board/instruments as Goroutines
   - Easier than C++
   - Fast prototyping for tests
3. Continuous integration
   - The real embedded system was simulated on CircleCI
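As an illustration of mocking a board as a Goroutine, here is a minimal sketch. The `reading` type, `mockIOBoard`, and `countItems` are hypothetical names, and the real integration framework is not shown in the slides; the point is only that a simulated board is an ordinary Goroutine feeding a channel.

```go
package main

import "fmt"

// reading is a hypothetical sensor message sent by an IO board.
type reading struct {
	lane  int
	value int
}

// mockIOBoard simulates an IO board: it emits n readings for one lane
// and then closes its output, as a real board would on disconnect.
func mockIOBoard(lane, n int) <-chan reading {
	out := make(chan reading)
	go func() {
		defer close(out)
		for i := 0; i < n; i++ {
			out <- reading{lane: lane, value: i}
		}
	}()
	return out
}

// countItems stands in for the logic under test: it drains every board
// in turn and counts the items seen per lane.
func countItems(boards ...<-chan reading) map[int]int {
	counts := make(map[int]int)
	for _, b := range boards {
		for r := range b {
			counts[r.lane]++
		}
	}
	return counts
}

func main() {
	counts := countItems(mockIOBoard(0, 3), mockIOBoard(1, 5))
	fmt.Println(counts[0], counts[1]) // prints "3 5"
}
```

The same test can later run against real boards by swapping the channel source, which is what makes prototyping tests fast.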
24. Avoid Performance Regression
- Monitoring of performance
  - Metrics
  - Profiling
- Google pprof upstream version:
  - go get -u github.com/google/pprof
  - Small CPU profile file => 10 minutes execution => just 185 KiB
  - Stand-alone, no binary required
  - Can read either from a local file or over HTTP
  - pprof -http :8081 http://localhost:8080/debug/pprof/profile?seconds=30
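For the pprof command above to fetch a profile over HTTP, the Go program has to expose the `/debug/pprof/` endpoints. A minimal sketch, assuming the standard `net/http/pprof` package is used (the slides do not say how the endpoint was set up); `pprofIndexStatus` is a hypothetical helper that checks the handlers in-process, without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
)

// pprofIndexStatus sends an in-memory request for the pprof index page
// to the default mux and returns the HTTP status code.
func pprofIndexStatus() int {
	req := httptest.NewRequest("GET", "/debug/pprof/", nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("pprof index status:", pprofIndexStatus())
	// On the target, the program would instead keep a server running,
	// for example: http.ListenAndServe("localhost:8080", nil)
}
```

The blank import is the only change needed in the application; the address `localhost:8080` simply mirrors the one in the command line above.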
25. Hardware in the Loop
- Automatic performance monitoring
- We have a real hardware test bench
- We want to deploy our system directly to the test bench
- Results from the test bench are retrieved by CircleCI
[Diagram labels: Repo, CI, Hardware, Metrics]
26. Remote Introspection via Browser
- Uncommon in embedded apps
- Expvar
  - Standard interface for public variables
  - Exports figures about the program
  - JSON format
import (
	"expvar"
	"net/http"
)

// At global scope
var requestCount = expvar.NewInt("RequestCount")
...
func myHandler(w http.ResponseWriter, r *http.Request) {
	requestCount.Add(1) // published as JSON under /debug/vars
	...
}
28. Metrics
- Performance analysis
  - We don't want performance regressions
  - Refactoring
  - Test suites don't help
- "Tachymeter" library to monitor metrics
  - Low impact, samples are added to a circular buffer
  - Average, Standard Deviation, Percentiles, Min, Max, ...
- Multiple outputs
  - Formatted string, JSON string, Histogram text and html
  - HTTP endpoint for remote analysis
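To show why a circular buffer keeps the measurement overhead low, here is a small sketch of the idea. It is not the Tachymeter API: the `ring` type and its methods are hypothetical, and the percentile uses the simple nearest-rank definition.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// ring is a hypothetical fixed-size sample window (not the Tachymeter API).
type ring struct {
	samples []time.Duration
	next    int  // index of the slot to overwrite next
	full    bool // true once every slot has been written
}

func newRing(size int) *ring {
	return &ring{samples: make([]time.Duration, size)}
}

// add overwrites the oldest sample: O(1) and no allocation per sample.
func (r *ring) add(d time.Duration) {
	r.samples[r.next] = d
	r.next = (r.next + 1) % len(r.samples)
	if r.next == 0 {
		r.full = true
	}
}

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of the samples currently stored, or 0 if there are none.
func (r *ring) percentile(p float64) time.Duration {
	n := r.next
	if r.full {
		n = len(r.samples)
	}
	if n == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), r.samples[:n]...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(math.Ceil(p/100*float64(n))) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= n {
		idx = n - 1
	}
	return sorted[idx]
}

func main() {
	r := newRing(4) // keeps only the 4 most recent samples
	for i := 1; i <= 10; i++ {
		r.add(time.Duration(i) * time.Millisecond)
	}
	fmt.Println(r.percentile(50), r.percentile(100)) // prints "8ms 10ms"
}
```

The sorting cost is paid only when statistics are requested, not on the hot path that records samples.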
29. Checkpoint Margin
- Average
  - Avg 2.301660948s
  - StdDev 176.75148ms
- Percentiles
  - P75 2.222552667s
  - P95 1.921699001s
  - P99 1.721095s
  - P999 1.575430001s
- Limits
  - Max 2.916016667s
  - Min 1.464427001s
[Diagram labels: sensors, checkpoint, margin, activation; figures from a 2 minutes run]
30. How is Checkpoint Margin affected?
- I/O bound
  - Reading packets from connections
  - We need to read fairly from 250 TCP sockets
[Diagram: sensors (S) connected to BL over eth/tcp]
31. Standard Network Loop
- One Goroutine per connection
  1. Read data from network
  2. Decode packets
  3. Send to main loop via channel
- chan packet
  - Sending one packet at a time to the main loop
  - Can we do better?
[Diagram: concurrent Goroutines, one per TCP connection, read and send to the main loop over chan packet]
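The loop described above can be sketched as follows. This is an illustration under assumptions, not the production code: connections are modeled as `io.Reader` values, framing is a hypothetical fixed size of 4 bytes, and `readLoop` and `runMainLoop` are made-up names.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

const frameSize = 4 // hypothetical fixed-size framing

type packet struct {
	conn    int
	payload string
}

// readLoop is the per-connection Goroutine: read, decode, and forward
// one packet at a time to the main loop.
func readLoop(id int, conn io.Reader, out chan<- packet, wg *sync.WaitGroup) {
	defer wg.Done()
	buf := make([]byte, frameSize)
	for {
		if _, err := io.ReadFull(conn, buf); err != nil {
			return // io.EOF or a broken connection ends the loop
		}
		out <- packet{conn: id, payload: string(buf)}
	}
}

// runMainLoop starts one Goroutine per connection and returns how many
// packets the main loop received before all connections closed.
func runMainLoop(conns []io.Reader) int {
	out := make(chan packet)
	var wg sync.WaitGroup
	for id, c := range conns {
		wg.Add(1)
		go readLoop(id, c, out, &wg)
	}
	go func() { wg.Wait(); close(out) }()

	n := 0
	for range out { // one channel receive per packet
		n++
	}
	return n
}

func main() {
	conns := []io.Reader{strings.NewReader("aaaabbbb"), strings.NewReader("cccc")}
	fmt.Println(runMainLoop(conns)) // prints "3"
}
```

Every packet costs one channel operation here, which is the overhead the following slides try to reduce.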
32. Batched Channel
- chan packet vs chan []packet
  - Sending one packet at a time is too slow
  - Use a single channel write operation to send all packets received from a single TCP read
  - Minimizing channel writes is good
[Diagram: concurrent Goroutines, one per TCP connection, read and send to the main loop over chan []packet]
33. Number of Channel Writes
- Channel
  - Buffered
  - Slice of packets
- Writes per second
  - 2000 to 25000 w/s
- Total GC STW time
  - 2.28 to 11.50 s
[Plot: Checkpoint Margin [s] and GC Time [s] vs. Channel Writes per Second [w/s], 2 minutes run]
34. Failed Test: Using a Mutex
- Goroutines will block on a mutex
  - High contention
  - Go scheduler is cooperative
- Deadline missed
  - Checkpoint event is delayed

Conn.Read()        Channel   Mutex
Min                13 us     13 us
Max                773 us    1.15 s
P99                64 us     510 ms

Activation margin  Channel   Mutex
P99                466 ms    -1.13 s

[Diagram: Goroutines, one per TCP connection, read and share a mutex with the main loop; timeline labels: checkpoint, margin, activation, delay]
35. Alternative: Using EPOLL
- The EPOLL syscall allows using a single Goroutine
- MultiPacketReader Go interface
  - Reading from multiple connections
  - Monitoring of multiple file descriptors
- Drawbacks
  - It can't be used on Windows
  - Cannot use the standard net.Conn API
  - Maintenance
[Diagram: a single MultiRead Goroutine reads all TCP connections and sends to the main loop over chan []packet]

type MultiPacketReader interface {
	// Register adds a connection (TCP connection with framing).
	Register(conn PacketConn)
	// ReadPackets reads from one of the registered connections.
	ReadPackets(packets [][]byte) (n int, conn PacketConn, err error)
}
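The implementation behind this interface is not shown in the slides. As a rough, Linux-only illustration of the underlying mechanism, the sketch below registers several file descriptors with epoll and lets a single Goroutine wait for whichever becomes readable. Pipes stand in for TCP connections, and `waitReadable` and `readFirstReady` are hypothetical names; only standard `syscall` package calls are used.

```go
package main

import (
	"fmt"
	"syscall"
)

// waitReadable blocks until one of the registered descriptors is
// readable and returns it. Linux only.
func waitReadable(epfd int) (int, error) {
	events := make([]syscall.EpollEvent, 8)
	for {
		n, err := syscall.EpollWait(epfd, events, -1)
		if err == syscall.EINTR {
			continue // interrupted by a signal, retry
		}
		if err != nil {
			return -1, err
		}
		if n > 0 {
			return int(events[0].Fd), nil
		}
	}
}

// readFirstReady registers two pipes (stand-ins for two connections),
// writes to the second one, and reads from whichever epoll reports.
func readFirstReady() (string, error) {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(epfd)

	var pipes [2][2]int // [i][0] is the read end, [i][1] the write end
	for i := range pipes {
		if err := syscall.Pipe(pipes[i][:]); err != nil {
			return "", err
		}
		defer syscall.Close(pipes[i][0])
		defer syscall.Close(pipes[i][1])
		ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(pipes[i][0])}
		if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, pipes[i][0], &ev); err != nil {
			return "", err
		}
	}

	// Only the second "connection" has data available.
	if _, err := syscall.Write(pipes[1][1], []byte("ping")); err != nil {
		return "", err
	}
	fd, err := waitReadable(epfd)
	if err != nil {
		return "", err
	}
	buf := make([]byte, 16)
	n, err := syscall.Read(fd, buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	msg, err := readFirstReady()
	fmt.Println(msg, err)
}
```

Working at the file descriptor level like this is what rules out Windows and the standard connection types, as the drawbacks above note.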
36. CPU Usage: EPOLL vs Go
- 4 CPUs in total
  - Graph shows just one CPU (for simplicity)
- Go impl
  - CPU usage is higher...
  - ... but more "uniform"
- EPOLL impl
  - CPU cores are switched more frequently
[Plot: CPU usage over time (2 minutes), EPOLL vs. Go]
37. Conclusions
- Standard Linux OS and hardware
  - Faster development
  - Distributed system
- Testing and monitoring
  - Fast prototyping for tests
  - Profiling and metrics
  - Performance tests on real hardware
- Optimizations
  - Goroutines management
  - Packets reception
- Drawbacks
  - GC impact must be reduced
  - Mutex contention can be a problem
  - Network APIs are not flexible enough
- Go can be used for embedded apps!
Thanks
mirko@develer.com