Wix has scaled to 30 million registered users and more than 1 billion user media files by evolving its architecture and processes over time. Key changes included splitting the monolithic application into separate editor and public segments, building a dedicated media storage system fronted by a CDN, adopting continuous delivery practices, and moving to managed hosting and cloud infrastructure. People and culture changes, such as empowering developers and releasing frequently, were also important for increasing velocity.
5. Wix in Numbers
• Wix was founded in 2006
• 30M registered users from most countries
• Over 1,000,000 new users every month
• Over 1,000,000 new websites every month
• Over 150 TByte of user media files
– More than 1 billion user media files
– More than 1.5 TByte of files uploaded daily
• Over 300 Servers in 2+1 datacenters + Google + Amazon
6. Wix Initial Architecture
[Diagram: Wix server (Tomcat) backed by a MySQL DB]
• Tomcat, Hibernate, custom web framework
– Everything generated from HBM files
– Built for fast development
– Stateful login (Tomcat session), EHCache, file uploads
– Not considering performance, scalability, fast feature rollout, testing
– It reflected the fact that we didn’t really know what our business was
– We knew that we would need to replace it as we grew.
– However, we failed to understand how difficult that can be!
[Timeline figure: years 2006 to 2013, labeled Flash and HTML 5]
7. Wix Initial Architecture
After two years, we found that
• Our initial architecture allowed us to progress very fast
• However, as we progressed, we slowed down
• So, we learned that
– Don’t worry about ‘building it right from the start’ – you won’t
– You are going to replace stuff you are building in the initial stages
– Be ready to do it
– Get it up to customers as fast as you can. Get feedback. Evolve.
– Our mistake was not planning for a gradual re-write
– Build for a gradual re-write as you learn the problems and find the right solutions
8. Distributed Cache
Next we added EHCache as Hibernate 2nd-level cache
• Why?
– Because it was in the design
• How was it?
– Black Box cache
– How do we know what is the state of our system?
– How to invalidate the cache?
– When to invalidate it?
– How does “operations” manage the cache?
• Did we really need it? No!
• We eventually dropped it
9. Distributed Cache
So we have learned (the hard way) that
• You don’t need a Cache
• Really, you don’t
• Cache is not part of an architecture
It is a means to make something more efficient
• Architect while ignoring caching
Introduce caching only as needed to solve real performance problems
• When introducing a cache, think about
– Cache management – how do you know what is in the cache? How do you find
invalid data in the cache?
– Invalidation – who invalidates the cache? When? Why?
– Cache Reset – can your architecture stand a cache restart?
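The three questions above (management, invalidation, reset) can be made concrete with a small sketch. This is not Wix's code; the class name `ManagedCache` and the in-memory `database` dictionary are hypothetical stand-ins, and the point is only that a cache added later should be inspectable, explicitly invalidated by the writer, and safe to wipe.

```python
from typing import Callable, Dict, Hashable, List


class ManagedCache:
    """Read-through cache that can be inspected, invalidated and reset."""

    def __init__(self, loader: Callable[[Hashable], object]) -> None:
        self._loader = loader  # source of truth, e.g. a DB query
        self._entries: Dict[Hashable, object] = {}

    def get(self, key: Hashable) -> object:
        # On a miss, fall back to the source of truth and remember the result.
        if key not in self._entries:
            self._entries[key] = self._loader(key)
        return self._entries[key]

    def keys(self) -> List[Hashable]:
        # Cache management: operations can see what is cached right now.
        return list(self._entries)

    def invalidate(self, key: Hashable) -> None:
        # Invalidation: called by whoever writes the underlying data.
        self._entries.pop(key, None)

    def reset(self) -> None:
        # Cache reset: the system must still work (just slower) after this.
        self._entries.clear()


database = {"site-1": "<xml>v1</xml>"}  # hypothetical source of truth
cache = ManagedCache(loader=lambda key: database[key])

first = cache.get("site-1")
database["site-1"] = "<xml>v2</xml>"  # a write to the source of truth...
stale = cache.get("site-1")           # ...is not visible until invalidated
cache.invalidate("site-1")
fresh = cache.get("site-1")
cache.reset()
after_reset = cache.get("site-1")     # reloads from the source of truth
```

The stale read in the middle is the failure mode the slide warns about: without a clear owner for invalidation, the cache silently serves old data.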
10. Editor & Public Segments
• The Challenge - Updates to our Server imposed downtime on our
customers’ websites
– Any Server or Database update had the potential to bring down all Wix sites
– This is a symptom of a larger issue
• The Server served two different concerns
– Wix Users editing websites
– Viewing Wix Sites, the sites created by the Wix editor
• The two concerns require different SLAs
– Wix Sites should never ever have a downtime!
– Wix Sites should work as fast as possible, always!
– However, an editing system does not require this level of SLA.
11. Editor & Public Segments
• The two concerns evolve independently
– Releases of Editing features should have no impact on existing Wix sites operations!
• Our Solution
– Split the Server into two Segments – Public and Editor
[Diagram: Public (Tomcat) with its own Public DB; Editor (Tomcat) with its own Editor DB]
• The Public segment targets serving websites for Wix Users
– Has mostly read-only usage pattern – only updated
when a site is published
– Simple publishing system
– Simple and read-only means it is easier to provide a higher SLA and DRP
– MySQL used as NoSQL – single large table with XML text fields
• The Editor segment
– Exposes the Wix Editing APIs, as well as user account and galleries
management APIs.
– Has different release schedule compared to the Public segment
12. Editor & Public Segments
What we have learned
• Architecture is inspired by aspects such as
– SLA
– Release Cycles – deployment flexibility
• Separate Segments for discrete concerns
– Editing (Editor Segment)
– Publishing (Public Segment)
• Modularity – SOA pattern (not WSDL!)
– Enabler for gradual re-write
– Enabler for continuous delivery
– Simplifies QA, Operations & Release Cycles
– Introduces build architecture concerns
• Different Architectures
– Build, System, Data
13. Editor & Public Segments
What we have learned
• MySQL is a damn good NoSQL engine
– Our public DB was (mainly) one huge table
– Queries & Updates are by primary key
– Instead of relations, we use text/xml or text/json columns
– No updates for Blobs – immutable data
– No Transactions
• Use indirection table to blob table
– Insert a new blob value, update the pointer to the new blob, async delete
• MySQL auto-generated keys cause problems
– Locks on key generation
– Requires a single instance to generate keys
• We use GUID keys
– Can be generated by any client
– No locks in key value generation
– Enabler for Master-Master replication
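A minimal sketch of the pattern described above: immutable blobs, an indirection (pointer) table, primary-key access only, and client-generated GUID keys. It uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table names `sites` and `blobs` and all function names are assumptions made for illustration, not Wix's schema.

```python
import sqlite3
import uuid

# sqlite3 stands in for MySQL. isolation_level=None means autocommit, so every
# statement is its own write and no multi-statement transaction is used.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE blobs (blob_id TEXT PRIMARY KEY, body TEXT NOT NULL)")
db.execute("CREATE TABLE sites (site_id TEXT PRIMARY KEY, blob_id TEXT NOT NULL)")


def publish(site_id, body):
    """Insert a new immutable blob, then repoint the site to it.

    Returns the previous blob id (or None) so it can be deleted later.
    """
    new_blob_id = str(uuid.uuid4())  # GUID generated by the client, no DB lock
    db.execute("INSERT INTO blobs (blob_id, body) VALUES (?, ?)", (new_blob_id, body))
    row = db.execute("SELECT blob_id FROM sites WHERE site_id = ?", (site_id,)).fetchone()
    old_blob_id = row[0] if row else None
    # The pointer flip is a single primary-key write: readers see either the
    # old blob or the new one, never a half-written value.
    db.execute(
        "INSERT OR REPLACE INTO sites (site_id, blob_id) VALUES (?, ?)",
        (site_id, new_blob_id),
    )
    return old_blob_id


def read_site(site_id):
    """Two primary-key lookups: pointer first, then the blob it points to."""
    pointer = db.execute("SELECT blob_id FROM sites WHERE site_id = ?", (site_id,)).fetchone()
    if pointer is None:
        return None
    blob = db.execute("SELECT body FROM blobs WHERE blob_id = ?", (pointer[0],)).fetchone()
    return blob[0] if blob else None


def delete_blob(blob_id):
    # In a real system this would run asynchronously, after the pointer flip.
    db.execute("DELETE FROM blobs WHERE blob_id = ?", (blob_id,))


site_id = str(uuid.uuid4())
publish(site_id, "<site>v1</site>")
old_blob = publish(site_id, "<site>v2</site>")
delete_blob(old_blob)
```

Because keys are generated by the client, two masters can accept inserts concurrently without coordinating on an auto-increment counter, which is the link to Master-Master replication made in the last bullet.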
14. Wix by 2009
• We introduced a Billing Segment
– So that customers can pay us…
• Dropped Hibernate sessions
– Sessions make it harder to separate software into different segments
• They require a shared library or single sign-on
– They require a stateful load balancer
– They require syncing sessions between segments
• Cookie based authentication
– It’s the standard way
– Implement stronger security solution only where really required
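One common way to get cookie-based authentication without server-side sessions is an HMAC-signed cookie. The slides do not say how Wix built its cookies, so the sketch below is an assumption: the secret, the `user|expiry|signature` layout and both function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"example-shared-secret"  # hypothetical; each segment would hold the same key


def issue_cookie(user_id: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Build a cookie value of the form user|expiry|signature."""
    issued_at = time.time() if now is None else now
    payload = f"{user_id}|{int(issued_at + ttl_seconds)}"
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"


def verify_cookie(cookie: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the cookie is authentic and unexpired, else None."""
    try:
        payload, signature = cookie.rsplit("|", 1)
        user_id, expires = payload.rsplit("|", 1)
        expires_at = int(expires)
    except ValueError:
        return None  # malformed cookie
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison, done on bytes so odd characters cannot raise.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    current = time.time() if now is None else now
    return user_id if current < expires_at else None


cookie = issue_cookie("user-42", ttl_seconds=60, now=1000)
```

Any segment that holds the secret can verify the cookie on its own, so no session syncing and no stateful load balancer are needed.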
15. Wix on Managed Hosting
• Co-Location
– Own and maintain your own hardware
– Provisioning == buy and deliver your new server
– Reliable software on reliable hardware
• Managed Hosting
– Lease both hardware and maintenance
– Overnight provisioning
– Reliable software on reliable hardware
• Cloud
– Instantly lease hardware
– Instant provisioning
– Unlimited resources
– Reliable software on unreliable hardware
16. Data Centers
• Austin (Managed Hosting)
– Our first Data Center
• Chicago (Managed Hosting)
– Data DRP, then Active Active with Austin
• Amsterdam (Managed Hosting)
– The idea was 3xActive
– However, it failed – it is too complex to have 3 Active data centers
(3 way replication)
• Amazon, Google (Cloud)
– 2nd vendor, Service Disruption DRP
17. Wix Media Segment
• The Challenge – Our static storage reached over 500 GByte of small files
– The “upload to app server, post process files, copy to lighttpd server, serve by
lighttpd” pattern proved inefficient, slow and error prone
– Disk IO became slow and inefficient as the number of files increased
– We needed a solution that could grow with our
• HTTP connections
• Number of files
– We needed control over caching and HTTP headers
• We needed dynamic image manipulation
– Rebuilding a few million media files is not simple
18. Prospero – Wix Media Storage
• Our Solution
– Lighttpd based
– Sharded on the file name
– Two copies of each file
[Diagram: "get 37D815B5.jpg" goes to the servers for the range containing prefix 37, with a fallback to the other copy if the file is not found. Ranges and servers shown: 00-1f (0.static, 1.static), 20-3f (2.static, 3.static), 40-5f (4.static, 5.static), 60-7f (6.static, 7.static).]
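Sharding on the file name can be sketched as a pure function from name to hosts. The mapping below is read off the slide's diagram and is an assumption: ranges are taken to be 0x20 wide, each range is taken to own two consecutive `N.static` hosts, and which of the two is tried first is a guess. The function name `servers_for` is hypothetical.

```python
from typing import Tuple

RANGE_WIDTH = 0x20  # assumed: each range spans 32 prefix values (00-1f, 20-3f, ...)


def servers_for(file_name: str) -> Tuple[str, str]:
    """Return the two hosts assumed to hold the two copies of a file."""
    prefix = int(file_name[:2], 16)  # "37D815B5.jpg" -> 0x37
    shard = prefix // RANGE_WIDTH    # 0x37 falls in the second range
    return f"{2 * shard}.static", f"{2 * shard + 1}.static"


primary, fallback = servers_for("37D815B5.jpg")
```

Because the host is derived from the name alone, no lookup table or directory service is needed to find a file.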
19. Prospero – Wix Media Storage
• Dynamic Image processing
– Picture Pyramid
– Picture resize, crop and sharpen “on the fly”
– Thumbnail generation
• Eventual Consistency solutions scale
– But you have to build for when eventual consistency is not consistent
• Media files caching headers are critical
– Max-age, ETag, if-modified-since, etc.
– Think how to tune those parameters for media files, as per your specific needs
• We tried Amazon S3 and Google for secondary storage
– However, Amazon proved unreliable (connections, availability)
• We found that using a CDN in front of Prospero is very effective
• Initially, files were stored on the filesystem
• We added a Tokyo Tyrant backend for small files (marked T in the architecture diagram)
• We added a Memcached (Redis) layer for “in transit” files (marked M in the architecture diagram)
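The caching headers mentioned above (Max-age, ETag) work together in a conditional GET. The sketch below is illustrative only: `serve_media` is a hypothetical function, the one-year `max_age` default is an arbitrary choice, and header names are matched exactly as given rather than case-insensitively as a real server would.

```python
import hashlib
from typing import Dict, Tuple


def serve_media(body: bytes, request_headers: Dict[str, str],
                max_age: int = 31536000) -> Tuple[int, Dict[str, str], bytes]:
    """Answer a GET for a media file, honoring If-None-Match."""
    # The ETag is derived from the content, so it changes only when the file does.
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'
    headers = {"ETag": etag, "Cache-Control": f"max-age={max_age}"}
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""  # client copy is still valid, send no body
    return 200, headers, body


status, headers, _ = serve_media(b"image-bytes", {})
revalidated, _, revalidated_body = serve_media(
    b"image-bytes", {"If-None-Match": headers["ETag"]}
)
```

A long `max-age` keeps most requests away from the origin; the ETag makes the remaining revalidation requests cheap.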
20. Prospero – Wix Media Storage
• Our current architecture
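[Diagram: a request such as "get 37D815B5.jpg" goes to the CDN. If the file is not in the CDN, the request goes to Austin. The first fallback is Chicago and the second fallback is Google Cloud Storage. Austin and Chicago are each drawn as three server groups labeled x36, x36 and x32, with M (Memcached) and T (Tokyo Tyrant) layers.]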
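The lookup order in this architecture (CDN, then Austin, then Chicago, then Google Cloud Storage) is an ordered fallback chain. A minimal sketch, with in-memory dictionaries standing in for the real tiers and a hypothetical `fetch_with_fallback` helper:

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(file_name: str,
                        tiers: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each tier in order and return (tier name, content) from the first hit."""
    for name, fetch in tiers:
        content = fetch(file_name)
        if content is not None:
            return name, content
    raise FileNotFoundError(file_name)


# In-memory stand-ins for the real tiers (hypothetical contents).
cdn: Dict[str, bytes] = {}
austin: Dict[str, bytes] = {}
chicago: Dict[str, bytes] = {"37D815B5.jpg": b"jpeg-bytes"}
google: Dict[str, bytes] = {"37D815B5.jpg": b"jpeg-bytes"}

tiers = [
    ("CDN", cdn.get),
    ("Austin", austin.get),
    ("Chicago", chicago.get),
    ("Google Cloud Storage", google.get),
]
served_by, content = fetch_with_fallback("37D815B5.jpg", tiers)
```

Here the file is missing from the CDN and Austin, so it is served by the first fallback.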
21. CDN
• Use a CDN!
• CDN acts as a great connection manager
– We have CDN hit ratios of over 99.9%
• Use the “Cache Killer” pattern
– http://static.wix.com/client/css/viewer.css?v=327
– http://static.wix.com/client/1.3.2/css/viewer.css
– Makes flushing files from the CDN redundant
– Enabler for longer caching periods
• There are many vendors
– We started with 1 CDN vendor
– We are now working with two CDN vendors
– Different CDN vendors have advantages in different geographies
• Tune HTTP Headers per CDN Vendor
– CDN Vendors interpret HTTP headers differently
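The “Cache Killer” pattern above puts a version into every static URL, so a new release produces new URLs instead of requiring a CDN flush. The two helpers below reproduce the two URL shapes shown on the slide; the function names are hypothetical.

```python
def versioned_query_url(base: str, path: str, version: str) -> str:
    """Version as a query parameter, e.g. .../css/viewer.css?v=327"""
    return f"{base}/{path}?v={version}"


def versioned_path_url(base: str, version: str, path: str) -> str:
    """Version as a path segment, e.g. .../1.3.2/css/viewer.css"""
    return f"{base}/{version}/{path}"


BASE = "http://static.wix.com/client"
query_style = versioned_query_url(BASE, "css/viewer.css", "327")
path_style = versioned_path_url(BASE, "1.3.2", "css/viewer.css")
```

Since a given URL never changes content, it can be served with a very long caching period.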
22. Development Velocity
• The Challenge – Our codebase became large and entangled
– Feature rollout became harder over time, requiring longer and longer manual
regression
– The longer the regression was, the harder it became to make “a good release”
– Strange full-table scans queries generated by Hibernate, which we still have no
idea what code is responsible for…
• The solution
– Mid 2010 – Wix Framework – modern base libraries
– Beginning 2011 – CI / CD / TDD techniques + DevOps culture
– Mid 2011 – Scala
– SOA Architecture (not WSDL)
23. People are the key
• Train the people you already have
– We sent our entire QA department to learn Java
– Developers learn TDD and CI/CD methodologies.
• Hiring the right people is key to success
– Hire only the best developers (only seniors)
– Don’t count only on the interview, you need to test actual coding
– Anyone who interviews can drop a candidate
– Hire people who will challenge you (no “yes men”)
– Get people you can trust with “root” access to production
• Never stop hiring
– If we find an excellent person, we will create a position for them even if we do
not have one open.
• Wix is doubling its size every year
– Yes, we are currently hiring.
– We’re considering hiring and training junior developers.
24. Wix-Framework
• The Wix Framework
– Java, Spring, Jetty, MySQL, MongoDB
– Spring MVC based
• Adjustments for Flash
– Flash imposes some restrictions on HTTP which require special handling
• DevOps support
– Built-in support for monitoring, configuration, usage tracking, profiling and
Self-Test in every app server
• TDD Support
– Unit-Testing helpers
– Multi-browser JavaScript Unit-Testing support, integrated with IDE
– Integration Testing framework
– Embedded MySQL & Embedded MongoDB
• We are now re-evaluating our framework
– Netty? The Play Framework? Open Garden?
25. SOA Architecture
• SOA – as in Service Oriented, not WSDL
– Started getting more and more services in 2010
• We started with XML / HTTP
• Then moved to Hessian
– Native RPC support with Spring
• Then moved to JSON/RPC (Fjarr)
– Hessian is no longer maintained
– Jackson is almost as efficient as binary protocols (Protobuf, Thrift)
• Dispatcher (Smart load balancer)
• Considering moving to client side LB
– Similar to Finagle, Hystrix
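To show what JSON-based RPC looks like on the wire, here is a minimal dispatcher using the public JSON-RPC 2.0 envelope. This is an assumption for illustration only: the slides do not describe Fjarr's actual message format, and `handle_request` and the `add` method are hypothetical.

```python
import json
from typing import Callable, Dict


def handle_request(raw: str, methods: Dict[str, Callable]) -> str:
    """Dispatch one JSON-RPC 2.0 request that uses positional params."""
    request = json.loads(raw)
    request_id = request.get("id")
    method = methods.get(request.get("method"))
    if method is None:
        # -32601 is the JSON-RPC 2.0 code for "Method not found".
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "error": error, "id": request_id})
    result = method(*request.get("params", []))
    return json.dumps({"jsonrpc": "2.0", "result": result, "id": request_id})


methods = {"add": lambda a, b: a + b}
ok = json.loads(handle_request(
    '{"jsonrpc": "2.0", "method": "add", "params": [2, 3], "id": 1}', methods))
missing = json.loads(handle_request(
    '{"jsonrpc": "2.0", "method": "nope", "params": [], "id": 2}', methods))
```

Plain JSON keeps requests readable in logs and debuggable with ordinary HTTP tools, which is part of the appeal over a binary protocol.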
26. Motivations for CI / CD / TDD + DevOps
• We were working in traditional waterfall
• With fear of change
– It is working, why touch it?
– Uploading a release means downtime and bugs!
• With low product quality
– Want to risk fixing this bug? Who knows what may break?
• With slow development velocity
– From “I have a great new product idea” to “it is working” takes too much time
• With traditional enterprise development lifecycle
– Three months of a “VERSION” development and QA
– Six months of crisis mode cleaning bugs and stabilizing system
• With traditional operations
– Developers create “problems” for operations
– Operations have to “defend” from developers
27. Wix’s CI / CD / TDD + DevOps model
• Abandon the “VERSION” paradigm – move to a feature-centric life cycle
• Make small and frequent releases as soon as possible
– Today we release about 10 times a day, gaining velocity
• Empower the developer
– The developer is responsible from product idea to 10,000 active users
– Remove every obstacle in the developer’s path
– Big cultural change from waterfall – affects the whole company
– The developer is responsible for their app’s operations
• Automate everything – CI/CD/TDD
– CI – Continuous Integration
– CD – Continuous Delivery / Deployment
– TDD – Automated unit-tests, integration tests, GUI tests
• Measure Everything
– A/B test every new feature
– Monitor real KPIs (business, not CPU)
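A/B testing every feature needs a stable way to assign users to variants. One common approach, shown here as a sketch and not as Wix's mechanism, is to hash the user id together with the experiment name; `variant_for` and the experiment name are hypothetical.

```python
import hashlib
from typing import Sequence


def variant_for(user_id: str, experiment: str,
                variants: Sequence[str] = ("A", "B")) -> str:
    """Deterministically assign a user to one variant of an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


assignment = variant_for("user-42", "new-editor-toolbar")
```

The same user always gets the same variant without storing any state, and including the experiment name keeps assignments independent across experiments.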
28. CI / CD @ Wix – Release Process
• Make an RC
– Runs build, unit-tests, integration tests
29. CI / CD @ Wix – Release Process
• Deploy as GA
– Using Chef, Noah, Artifactory
– Runs Self-Tests
30. CI / CD @ Wix – Release Process
• Monitor
– Deployment, NewRelic, App-Info, Recent Events
• Rollback
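The release steps on these slides (make an RC, deploy as GA, run self-tests, roll back) can be summarized as a small control flow. The sketch below is an assumption about how the steps fit together; the `release` function and its callable parameters are hypothetical and stand in for the real build, Chef deployment and self-test tooling.

```python
from typing import Callable, List


def release(version: str, previous: str,
            checks: List[Callable[[str], bool]],
            deploy: Callable[[str], None],
            self_test: Callable[[str], bool]) -> str:
    """Run one release and return the version left running in production."""
    # Make an RC: build, unit tests and integration tests must all pass.
    if not all(check(version) for check in checks):
        return previous
    deploy(version)         # Deploy as GA
    if self_test(version):  # Self-tests run against the deployed version
        return version
    deploy(previous)        # Rollback
    return previous


deployed: List[str] = []
running = release("1.1", "1.0", [lambda v: True], deployed.append, lambda v: True)
```

Making rollback an ordinary, automated step is what makes releasing many times a day low-risk.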
31. Automated Deployment
• We use Chef for deployment
– Automation platform for deployment
• Noah for topology
– Lightweight node/service registry
• Started with deploying our Media grid
• Then, App Servers
– Still improving support for service routing, gradual deployment,
self-test integrations
– We had to build quite a bit on Chef to make it work
– Overall, Chef works great for us
32. Products we built
• Wix Mobile
– Mobile presence for Flash sites
• Wix HTML5
– Full HTML 5 support – total rewrite of our Flash product
• Third Party Applications (TPAs)
– With over 200,000 installations in the first 3 months
• Answers
– Wix’s unique support system
• Wix Billing System (PCI Compliant)
– Support complex business models for TPAs
– Support diverse geo
• eCommerce
– Based on Magento
[Timeline figure labels: Mobile, Answers, HTML 5, App Builder, eCommerce, TPA, Billing]
33. BI
• Started with export of DBs to Prism
– Once a day
• Then, we introduced Flogger
– Realtime analytics sent from our editor + viewer
– Stored in MySQL, MS SQL
– Enabled BI and error reporting
• Hadoop + HBase + MS Reporting Services
– When MySQL & MS SQL could not scale
– When we needed more complex analytics, more flexibility
– When the number of consumers grew