This document evaluates the impact of a content delivery network (CDN) on an e-commerce environment. It finds that the CDN improved user-perceived performance and scalability by distributing content to edge servers located closer to users. It also identifies drawbacks, such as added server-side latency from the DNS redirects the CDN introduces, and longer turnaround times for configuration and content changes. The document outlines the research methodology, presents results on key metrics such as response times and resource utilization, and concludes that the CDN provided benefits along with downsides that could be addressed in future work.
Slide 1
Evaluating the Impact of Content Delivery Networks on N-tier E-commerce Environments
Witold Rzepnicki
March 27th, 2007
Slide 2
Short Bio
• Moved to U.S. from Poland circa 1995
• Completed undergraduate studies in Computer
Information Systems at Missouri State University
• I have worked for Hallmark Cards since 1998 as a Java EE
developer, project manager/lead and a technology
architect
• PMP and SCEA certifications… 811, 816 and 818 came in handy
• Hobbies: travel, foreign languages, tennis (outdoors and
on Nintendo Wii)
Slide 6
The Non-technical Introduction
What Matters to Consumers?
• Are you happy with the web sites you visit?
• Consumers cite website performance and responsiveness as key challenges for E-commerce environments (Nielsen Research)
• Role of content and content delivery
Satisfaction Level      2005   2004   2003   2002   2002 vs 2005 change
Very Satisfied          40%    37%    40%    37%    +3%
Somewhat Satisfied      24%    24%    23%    22%    +2%
Neutral                 31%    32%    30%    33%    -3%
Somewhat Dissatisfied    4%     5%     5%     5%    -1%
Very Dissatisfied        2%     3%     3%     3%    -1%
Slide 7
Typical Hourly Downtime Costs
• Brokerage operations $6,450,000
• Credit card authorization $2,600,000
• Ebay $225,000
• Amazon.com $180,000
• Package shipping services $150,000
• Home shopping channel $113,000
• Catalog sales center $90,000
• Airline reservation center $89,000
Source: Pp. 185-188 of the Proceedings of LISA '02: Sixteenth Systems Administration Conference,
(Berkeley, CA: USENIX Association, 2002).
Slide 13
Problem Statement
• Insufficient performance and scalability during peaks
• Tactics to-date do not fully address the content
delivery layer
– Last-mile, first-mile, peering and backbone problems
– Upper limit to bandwidth scalability for content
delivery (single hosting site)
– Cost factors
• Symptom: performance degrades as Web servers get
overloaded with requests
Slide 15
Content Delivery Networks
• CDNs offload some or all of the content delivery from the origin Web servers.
• A CDN is a large set of replica servers, called edge servers, that deliver content on behalf of the origin server.
• CDNs claim to address
– Client perceived latency (e.g. Web browsers)
– Capacity management of the servers
– Static content caching requirements
Slide 16
Research Focus
• Quality attribute evaluation of the CDN claim
– Performance
– Scalability
– Availability
– Maintainability
• Consumer and server-side measurements
• Infrastructure footprint impact
– Potential cost savings can be significant
– One hosting center versus two
– Resilience of a geographically dispersed network
• Research to-date focuses on network impacts alone
Slide 19
Tactics Implemented To-date
• Horizontal and vertical scalability strategies
implemented to-date
– Clustering
– Origin server caching – content and application
– Scaling individual nodes’ CPU and memory capacity
– Application and database tuning
– Additional bandwidth and switching improvements
– Considered introducing another hosting site to
further improve bandwidth
Slide 21
Why a CDN?
• Server-side caching approaches not sufficient
• Fewer “hops” and more efficient routing
• Ease of implementation versus establishing a
set of secondary hosting facilities
• CDNs (e.g., Akamai) improve web performance
by
– Performing extensive network & server
measurements
– DNS redirection to the most efficient servers
Slide 23
Content Delivery Network
• Browser requests are redirected to the most suitable edge server
• Browser gets the web site’s DNS CNAME entry with a domain name in the CDN network
• A hierarchy of the CDN’s DNS servers directs the client to a “nearby” server
• Based on current network conditions as measured by the CDN
Slide 24
CDN Selection and Implementation
• Redirect method selection: URL rewrite vs. URL redirect, partial-site vs. full-site
• DNS changes
– Local name server
• CDN configuration changes
Slide 25
How To Measure Quality Attribute Impacts?
• Performance
– Page response times
– Java EE component processing times
– Data center network latency
• Scalability
– Ability to sustain traffic spikes while maintaining the
same resource footprint
– Resource utilization (bandwidth, CPU, etc.)
• Other QA impacts
– Availability and maintainability
Slide 26
Experimental Challenges
• Scalability
– Requires sufficient load to test elasticity of
resources
– Need to simulate fast transactional bursts
– Gather production environment data during the
February peak
• Performance
– Establish pre-CDN and post-CDN baselines under
steady state
– Eliminate outside “noise” by isolating transactions in
a non-production environment
Slide 27
Monitoring and Measurement Framework
• Consumer perspective
– Real-time user monitoring
– Browse versus shop transactions
– Geographic distribution
– Consistent and sustained rate
• Application perspective
– URI stem-level performance measurement
– Host, network and end-to-end times
• System perspective
– Vmstat and bandwidth utilization
Slide 28
Consumer Transaction Emulation
• Response times before and after CDN
• Real-time user monitoring
• Transaction characteristics and frequency
ISP        City and State
Level3     Los Angeles, CA
Savvis     Santa Clara, CA
Verizon    Denver, CO
MFN        Washington, D.C.
Internap   Miami, FL
Level3     Chicago, IL
Sprint     New York, NY
Slide 30
Browse and Shop Transaction Characteristics
Transaction workload characteristic                          Browse    Shop
Number of transaction steps                                  9         6
Number of images retrieved                                   163       94
Number of scripts, HTML, CSS, Flash components               57        39
Number of server-side J2EE components accessed               12        15
Average image size                                           2.9 KB    2.8 KB
Average size of HTML, script and Flash                       4.9 KB    5.8 KB
Total number of bytes retrieved per connection               250 KB    98 KB
Number of web-server connections initiated from the browser  4         5
Slide 38
Web Tier Scale Factor
• Maximum concurrent Web server socket threads
• Maximum object “hits” in Akamai
• 16,000 hits / 3,600 threads
• Equivalent to 4x of our Web server farm
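As a quick sanity check of this ratio (back-of-the-envelope, not a figure from the study): 16,000 hits / 3,600 threads ≈ 4.4, which is consistent with the quoted 4x equivalent of our Web server farm.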
Slide 42
Performance: HTML Object Download Time
• Browse transaction (chart)
• Shop transaction (chart)
• Why the discrepancy between the RTUM and server performance?
Slide 43
Maintainability and Availability
• Configuration management
– 2 hours on average to deploy configuration changes
• Content management
– 7-10 minutes to propagate content across edge
servers
• Achieved 100% availability during the observed
February peak
Slide 48
Future Work
• Edge computing
– Edge delivery of applications
• Impact of edge delivery on media streaming
and protocols other than HTTP
– RTSP, MMS
Speaker Notes

Question for the audience:
Why is this still a problem after all these years?
Focus on how little progress was made from 2002 to 2005 in terms of customer satisfaction and discuss the whys:
Traffic growth, exponential growth of online transactions and infrastructures not always keeping up with demand
Discuss content as a driver to the website and enabler of shopping transactions
Now on to the dollars and cents of what it costs to be unavailable… these figures are from real research and they are likely to be much higher these days
Surprising that the airline reservation center would have lower downtime costs than a home shopping channel
This presentation will focus on an e-commerce environment similar to the ones on slide 6 although we can’t really say what it costs per hour to be unavailable
This is a quick overview of our e-commerce environment from the architecture, workload and content delivery perspective
Seasonal spikes between 6-10x for different metrics: visits or page views
In subsequent slides, we’ll cover the architectural views and the content delivery model and its potential shortcomings
Typical things to consider in content delivery and management.
On first bullet bring up AJAX, RIAs and heavy Flash usage on some sites
This is a generic model of architecture. We’ll discuss potential problems with content delivery that result from this type of architecture.
Define STATIC and DYNAMIC content
Define performance and scalability as key quality attributes
Consumer and server-side views of performance
Static = non-unique to a particular consumer (images, article pages)
Dynamic = based on individual consumer characteristics (JSPs)
Describe interactions, differences between static and dynamic elements and how they’re served
Server-side caching helps offload repetitive requests for dynamic content
Function of load balancing in the context of scalability and performance
Describe where the content delivery problems from scalability and performance perspectives may reside
Internet cloud and its role in content delivery
Web servers - static
Application servers – dynamic
DB - dynamic
The First Mile bottleneck refers to the limitations in the website’s connectivity to
the Internet via its Internet Service Provider (ISP). In order to reach desired scalability
it needs to continuously expand its connectivity to the ISP. The ISP, in turn, must also
expand its capacity in order to meet its customers’ scalability requirements.
Peering points also represent potential bottlenecks as large networks are not economically motivated
to scale up the number of peering points with the networks of their competitors,
especially since a significant portion of the traffic handled by the peering points
is transit traffic with packets originating on other networks. This lack of competitive
and financial motivation over time has resulted in a limited number of peering points
across major networks.
The Backbone Problem refers to the fact that the ISPs’ router
capacity has historically not kept up with growth of traffic demands.
Finally, the Last Mile problem reflects the limited capacity of a typical user’s connection to their ISP.
85% of our website’s consumers have broadband access, so this is less of a problem for our website.
It’s important to note that just solving one of the above bottlenecks, such as the Last
Mile, by increasing the reach of broadband connectivity at home will not automatically
address the other limitations. These need to be treated as separate problems
that, if addressed, would help solve the problem as a whole.
The problems with the Internet cloud compound the other potential scalability and performance problems we discussed earlier.
Let’s talk about workload in terms of page views.
Traffic spikes several times a year and it’s “bursty” in nature. The weekly picture does not reflect hourly spikes we experience. Quick slide!
This slide suggests the need to scale 5x based on page views alone.
Can’t talk about content delivery without discussing content management and publishing
This is a generic content management model….
Describes differences between static and dynamic content and catalog data vs. article pages. Static content tags are embedded in the JSPs which are rendered within the application server and usually contain static and dynamic content elements.
Refer to outages during peaks from slide 15
With single hosting facility we cannot control the efficiency of content delivery once it leaves our network
We could create our own network of geographically dispersed servers, but it would be cost prohibitive
We have attempted to scale horizontally and vertically (define each)
A Web page download consists of the following basic steps: server name resolution, TCP connection establishment, transmission of the HTTP request, reception of the HTTP response, reception of data packets, and TCP connection termination. Using HTTP/1.0 results in repeating the above steps for each embedded object within a composite page. Note that when the embedded objects are stored on another server (e.g., servers in a content distribution service), having HTTP/1.1 support for persistent TCP connections across multiple HTTP requests does not eliminate the first two steps, but it reduces them by a factor of 2 to 10.
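To make these steps concrete, here is a minimal timing sketch in Python (standard library only; the hostname is a hypothetical stand-in, and the breakdown approximates the phases above rather than the study's instrumentation):

    import socket
    import time

    HOST = "www.example.com"  # hypothetical origin or edge server

    t0 = time.monotonic()
    ip = socket.gethostbyname(HOST)                        # server name resolution
    t_dns = time.monotonic()

    sock = socket.create_connection((ip, 80), timeout=10)  # TCP connection establishment
    t_conn = time.monotonic()

    req = f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
    sock.sendall(req.encode("ascii"))                      # transmission of the HTTP request

    first = sock.recv(4096)                                # reception of the HTTP response (first packet)
    t_first = time.monotonic()

    while sock.recv(4096):                                 # reception of the remaining data packets
        pass
    sock.close()                                           # TCP connection termination
    t_done = time.monotonic()

    print(f"DNS lookup:  {t_dns - t0:.3f}s")
    print(f"TCP connect: {t_conn - t_dns:.3f}s")
    print(f"First byte:  {t_first - t_conn:.3f}s")
    print(f"Download:    {t_done - t_first:.3f}s")

With HTTP/1.0 (or Connection: close, as here) every embedded object repeats the resolution and connect phases; HTTP/1.1 persistent connections let subsequent objects fetched from the same server skip them.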
Our challenge is not only how many connections we have open, but also for how long…large video files
We’ll discuss significance from two perspectives:
The impact on our e-commerce environment and other e-commerce environments – practical value
The additional CDN research aspects evaluated in this work – research value
When is a website suitable for a CDN?
- It has a high ratio of reads compared to writes
- Client access patterns tend to access some set of objects more frequently
- Limited windows of inconsistent data are acceptable
- Data updates occur relatively slowly
CDN stands for Content Delivery Network
What do CDNs claim to help with?
The spikes create extra load on our infrastructure that causes outages. It’s worth noting that 2006 is the year with the CDN in place…just a little preview of the results.
Hypothesis: could we reduce resource utilization with a CDN?
Here’s what we could address if we were to solve the problem…..
This chart is showing CPU utilization spike in the web tier, but we experience similar curves for bandwidth and memory.
Why do we even need to explore a new tactic?
Refer to the definition of a tactic from Bass et al.:
A design decision that is influential in the control of a quality attribute response. Tactics tell you what to do in order to affect a quality attribute response measure. Unlike sensitivity points, tactics are independent of any specific system.
How we went about determining the criteria to measure impact of a CDN.
Akarouting promises one-hop routing
DNS is essentially a distributed database that follows
the client-server architecture. Adequate performance of DNS is achieved through
replication and caching. The server side portion of a request is handled by programs
called name servers. They contain information about a portion of the global database
and are capable of forwarding requests to other authoritative servers if necessary. The
information is made available to the client-side software components called resolvers.
A typical domain name on the Internet consists of two or more parts separated by
dots such as my.yahoo.com. Top-level domain (TLD) represents the rightmost portion,
.com in our case, while the subdomain(s) are represented by the labels to the left
of the top-level domain.
In our example, my.yahoo.com is a subdomain of yahoo.com, which in turn
belongs to the .com top-level domain. Finally, the hostname refers to a domain name
that has one or more IP addresses associated with it. Each domain or subdomain has
an authoritative server associated with it. It contains and publishes information about
the domain and any other domains it encapsulates.
Root nameservers reside at the top of the DNS hierarchy and they are queried first to resolve the TLD. Caching and time-to-live (TTL) are very important concepts in DNS and, as we will later discover, in CDN implementations. IP mappings obtained from DNS can be stored in the local resolver for a period of time as defined by the TTL. This greatly reduces the load on the DNS servers.
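As a small illustration of TTL-driven caching, the following sketch (assuming the third-party dnspython package; the hostname is just the example used above) fetches a record and prints the TTL a resolver would honor before re-querying:

    import dns.resolver  # third-party package: dnspython

    # Resolve an A record and inspect the TTL that governs how long
    # a resolver may cache this answer before asking again.
    answer = dns.resolver.resolve("my.yahoo.com", "A")
    print("TTL (seconds):", answer.rrset.ttl)
    for record in answer:
        print("address:", record.address)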
Figure 1 illustrates how a client typically finds the address
of a service using DNS. The client application uses a resolver,
usually implemented as a set of operating system library routines,
to make a recursive query to its local nameserver. The
local nameserver may be configured statically (e.g., in a system
file), or dynamically using protocols like DHCP or PPP. After
making the request, the client waits as the local nameserver iteratively
tries to resolve the name (www.service.com in this
example). The local nameserver first sends an iterative query to
the root to resolve the name (steps 1 and 2), but since the subdomain
service.com has been delegated, the root server responds
with the address of the authoritative nameserver for the
sub-domain, i.e., ns.service.com (step 3). The client’s nameserver then queries ns.service.com and receives the IP address of www.service.com (steps 4 and 5). Finally, the nameserver returns the address to the client (step 6) and the client is able to connect to the server (step 7).
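The iterative walk in Figure 1 can be mimicked by hand; here is a sketch with dnspython (www.service.com is the figure's hypothetical name, so a real domain would be needed to run this end-to-end):

    import dns.message
    import dns.query
    import dns.rdatatype

    name = "www.service.com."   # hypothetical name from Figure 1
    server = "198.41.0.4"       # a.root-servers.net (steps 1 and 2)

    while True:
        query = dns.message.make_query(name, dns.rdatatype.A)
        response = dns.query.udp(query, server, timeout=5)
        if response.answer:     # authoritative answer reached (steps 4 and 5)
            print(response.answer[0])
            break
        # Referral: follow an IPv4 glue record for the delegated
        # nameserver (step 3) and query it next.
        for rrset in response.additional:
            if rrset.rdtype == dns.rdatatype.A:
                server = rrset[0].address
                break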
Very simple extension of the DNS redirection mechanism, but the complexity lies in the algorithm that measures current network conditions.
In essence, Akamai performs a highly complex translation of a customer’s domain
to the IP address of the most suitable edge server.
First, the Web browser requests an HTML object. In order to accommodate this request, the local DNS resolver has
to translate the domain name into an IP address. The resolver issues a query to the
customer’s DNS server which in turn forwards the request to the Akamai network.
This is enabled via a configuration of a canonical name record (CNAME) in the origin
site’s DNS name server. The CNAME triggers the request redirection to the CDN.
Next, a hierarchy of Akamai servers responds to the request using the requestor’s
IP address, the name of the CDN customer, and the name of the requested content
as seeds for its DNS resolution. The CDN name resolution step is perhaps the most
critical in this sequence of events. Configuration of the Akamai CDN is described
in [4].
The steps for our deployment can be summarized as follows:
1. Create origin hostname
2. Activate Akamai edge hostname
3. Activate content delivery configuration
4. Point website to Akamai network
In our case, this process begins with configuration of a CNAME in our DNS name
server. A CNAME record maps an alias or nickname to the real name which may lie
outside the current zone. Typical format of a CNAME entry is as follows:
name ttl class rr canonical name
www IN CNAME joe.example.com.
We need to set up an origin server hostname that will resolve to our content server.
This server will be used by Akamai edge servers to retrieve our content, so it can be
made available to all of the nodes in the CDN. The naming convention for the origin
server is:
origin-<website>
where “website” refers to the hostname for our content that will be delivered
from Akamai. Our website stores all of its static content in the generic images folder,
so we will define the following origin server name:
origin-images.example.com for images.example.com
Next, we will create a DNS record for our origin server hostname on our authoritative
name server. We will use the CNAME record type for this step.
origin-www.example.com IN CNAME loadbalancer.example.com
We are now pointing our website to the Akamai network. An edge hostname will
need to be activated on an Akamai domain for our website using the CDN’s configuration
console. It will resolve to the Akamai network. For example,
www.example.com
would have to point to
www.example.edgesuite.net
and www.example.edgesuite.net would in turn resolve to individual servers on
the Akamai network since it owns the edgesuite.net domain. The remaining configuration
steps need to be performed in the configuration console of Akamai and they
are covered in-depth in [4].
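Once these records are in place, the wiring can be sanity-checked by following the alias chain; a sketch using dnspython (the hostnames are the illustrative ones from this section):

    import dns.resolver

    def follow_cname_chain(name: str) -> None:
        """Print each CNAME hop, then the final A records."""
        while True:
            try:
                answer = dns.resolver.resolve(name, "CNAME")
            except dns.resolver.NoAnswer:
                break  # no more aliases; name should now have A records
            target = str(answer[0].target)
            print(f"{name} -> {target}")
            name = target
        for record in dns.resolver.resolve(name, "A"):
            print(f"{name} -> {record.address}")

    # Expected shape: www.example.com -> www.example.edgesuite.net
    # -> (Akamai-internal names) -> edge server IPs
    follow_cname_chain("www.example.com")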
The main purpose of a CDN is to direct consumer requests for objects to a server at
the optimal Internet location relative to the consumer’s location. The key components
of a CDN architecture are described in [37]. They are defined as: overlay network formation,
client request redirection, content routing and last-mile content delivery.
The two most common techniques employed by the networks are DNS redirection and URL
rewriting. The DNS redirection technique utilizes a series of DNS resolutions based on
several factors such as server availability and network conditions with the purpose
of identifying the most suitable server.
The end result is a DNS response with the IP address to the content server. The
response includes a time-to-live value that is usually limited to less than a minute (in
the case of Akamai it is 20 seconds). The TTL has to be set to a relatively low value
because the network conditions and server availability change constantly and quick
IP re-mapping is key.
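The effect of the short TTL can be observed directly by re-resolving a CDN-mapped name at TTL-sized intervals and logging IP changes (a sketch with dnspython; the name is illustrative, and a caching stub resolver in between can mask the re-mapping):

    import time
    import dns.resolver

    NAME = "www.example.edgesuite.net"  # illustrative CDN-mapped name

    last_ip = None
    for _ in range(10):
        answer = dns.resolver.resolve(NAME, "A")
        ip = answer[0].address
        if ip != last_ip:
            ttl = answer.rrset.ttl
            print(f"{time.strftime('%H:%M:%S')} mapped to {ip} (TTL={ttl}s)")
            last_ip = ip
        time.sleep(answer.rrset.ttl)    # wait out the short TTL, then re-resolve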
The DNS redirection technique can facilitate either a full- or partial-site delivery.
With full-site delivery, all requests to the origin server are directed using DNS to a
CDN server. If the CDN server can’t fulfill the request it simply routes it back to the
origin server. Several networks, including Adero and NetCaching, employ this delivery
model. The main shortcoming of this model is the additional routing overhead
of wasted DNS requests that could have been handled by the origin server to begin
with.
With partial-site content delivery, on the other hand, the origin site modifies the
URLs for certain objects or object directory locations to be resolved by the CDN’s DNS
server. This approach seems to be well suited for our website due to its combination of
static digital assets and dynamically generated server-side presentation components.
URL rewriting is another potential solution for server lookups. With this technique,
the origin server continuously rewrites the URL links for dynamically generated
pages in order to redirect them to the appropriate CDN server. The DNS functionality remains on the origin site with this approach. When a page is requested by the user it will be served from the origin server. However, before it is served, all of the embedded links will be rewritten to point to the CDN’s DNS. Figure 3.1 shows a typical
rewrite approach. The main drawback to the URL rewrite approach from the measurement
standpoint is the fact that the rewrites usually take place at the Web server
tier. Hence, the rewrite steps would inevitably introduce additional background noise
to our performance measurements. Therefore, we decided to avoid this approach for
the purpose of our study.
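For concreteness, the rewrite step amounts to a transformation like the following at page-generation time (a sketch; the hostnames and pattern are illustrative, not any CDN's actual mechanism):

    import re

    CDN_HOST = "cdn.example.net"  # hypothetical CDN-resolved hostname

    def rewrite_links(html: str) -> str:
        """Point embedded static-object links at the CDN before serving the page."""
        return re.sub(
            r'(src|href)="https?://www\.example\.com/(images|css|js)/',
            rf'\1="https://{CDN_HOST}/\2/',
            html,
        )

    page = '<img src="https://www.example.com/images/logo.gif">'
    print(rewrite_links(page))
    # <img src="https://cdn.example.net/images/logo.gif">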
At the time of writing of this thesis, we counted 18 different networks on Davison’s website. It
is not our primary purpose to evaluate tradeoffs between the various networks and
their implementations. The choice we made does not reflect a belief in superiority of
one network over others - it is merely a reflection of the need to get our experimental
test bed up and running as quickly as possible within boundaries imposed on us by
our existing hosting facility. For our implementation, we settled on partial-site, DNS
redirection-based CDN implementation using the Akamai delivery network.
Why are availability and maintainability important?
Bandwidth utilization
Hosting facility
CDN
Server resource utilization
CPU and run queues
Memory page-ins and page-outs
Measured in the context of traffic and page views
Research to-date focused on network impacts alone
This is different from research to-date
What does a transaction consist of?
Why geographically dispersed locations for testing are important
We ran over 1K transactions over a period of 48 hrs before and after
DNS look-up: The process of calling a DNS server to look up and convert a hostname to an IP address; for instance, converting www.foo.com to 10.0.0.1.
Connect time: The time it takes to connect to a Web server (or CDN edge server in our case) across a network from a client browser or an RTUM agent.
Secure sockets layer time: The time it takes to create an SSL TCP/IP connection with a website.
First byte time: The time between the completion of the TCP connection with the destination server that will provide the displayed page’s HTML, graphic, or other component and the reception of the first packet (also known as first byte) for that object.
Content download time: The time in seconds that measures the actual time to deliver content (images, HTML, or other objects) from the Web server to the browser.
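These metrics map almost one-to-one onto libcurl's timing counters, so a client-side approximation of the RTUM breakdown could look like this (a sketch assuming the third-party pycurl package; the URL is hypothetical):

    from io import BytesIO
    import pycurl

    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "https://www.example.com/")  # hypothetical page
    c.setopt(pycurl.WRITEDATA, buf)
    c.perform()

    dns = c.getinfo(pycurl.NAMELOOKUP_TIME)                                           # DNS look-up
    connect = c.getinfo(pycurl.CONNECT_TIME) - dns                                    # connect time
    ssl = c.getinfo(pycurl.APPCONNECT_TIME) - c.getinfo(pycurl.CONNECT_TIME)          # SSL time
    first = c.getinfo(pycurl.STARTTRANSFER_TIME) - c.getinfo(pycurl.APPCONNECT_TIME)  # first byte time
    download = c.getinfo(pycurl.TOTAL_TIME) - c.getinfo(pycurl.STARTTRANSFER_TIME)    # content download time
    c.close()

    for label, value in [("DNS look-up", dns), ("Connect", connect), ("SSL", ssl),
                         ("First byte", first), ("Content download", download)]:
        print(f"{label:18s}{value * 1000:8.1f} ms")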
The application perspective will be captured using an appliance based application
monitoring solution. The network location of this appliance is depicted in Figure 3.3.1.
We will configure “watchpoints” using the appliance’s configuration tool to capture
server-side response times of Java EE components corresponding to the transaction
steps defined in the RTUM service. The appliance uses passive traffic analysis to capture
actual transactions from the RTUM within our hosting environment and measures
performance and availability of our e-commerce application as a whole. The important
difference between this and other approaches is that our appliance does not
generate any traffic and the only performance overhead it introduces is reading the
copy of traffic from the network connection. The data is assembled into requests for
objects, pages and user sessions. Performance metrics include host, SSL and redirect
times. This solution also measures server errors or prematurely terminated connections
due to increase in traffic. Figure 3.3.1 depicts the measurement timeline for a
sample request that would be captured by our appliance [26].
The appliance solution groups latency into the following six categories and defines
them as follows:
Host time: This is the combined time the Web, application, and database servers take
to process a request. Host time is a key measure to assess performance implications
of implementing a CDN on performance of our Java EE components
(servlet, EJBs, etc.). It can be very short in the case of a static image or long
in cases of long reports and complex server-side transactions such as adding a
list of items to the shopping basket.
Network time: This is the time spent traveling across intervening networks. Once the
server has prepared its response, host time is over and network time begins. A
small object might be delivered quickly; a large one might take a long time. This
time is highly dependent on the type of consumer’s connection. Low-bandwidth
connections will result in higher network times and vice versa with broadband
connections. Our monitoring appliance also records additional information on
packet loss, out-of-order delivery, and round-trip time to help with this diagnosis.
SSL time: The appliance will record the time spent negotiating the encryption of encrypted
transactions. This portion of the SSL time represents the server-side
latency elements of the handshake versus the client-side SSL time captured by
the RTUM.
Redirect time: This is the time the site spends sending a request on to other pages. In
some applications, a request for a page results in a redirect that usually points
elsewhere. This delay is recorded as redirect time.
Idle time: When a browser is retrieving a page, but there is no activity between objects
on the same page, the HTTP interaction is defined as “idle”. This measurement
is key to understanding the amount of time spent processing client-side
scripts such as JavaScript. When there is inactivity in the middle of rendering
the page within the browser, our appliance will measure it as idle time.
End-to-end: This is the total time for the object or page, from the moment the first
packet of a request is seen until the browser acknowledges delivery of the last
packet.
Differentiate between the two types and discuss why we expect the browse transaction to benefit more. Also discuss the number of web server connections.
What does the appliance do?
Physical memory is a finite resource on any system. The UNIX memory handler
manages the memory allocations. The kernel is responsible for freeing up physical
memory of idle processes by saving it to disk until it is needed again. Paging and
swapping are used to accomplish this task. Paging refers to writing portions, termed
pages, of a process’ memory to disk. Swapping refers to writing the entire process, not
just part, to disk. Page-out represents the event of writing pages to disk, while page-in
is defined as retrieving memory data from disk. Page-ins are common and under normal
circumstances are not a cause for concern. However, if page-ins become excessive
the kernel can reach a point where it’s actually spending more time managing paging
activity than running the applications, and system performance suffers.
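The system perspective watches exactly these counters; a small sampling sketch follows (column positions vary across UNIX flavors — the indices below assume Linux-style vmstat output, where the si/so columns approximate the page-in/page-out activity described above, while on Solaris the pi/po columns are the direct counterparts):

    import subprocess

    # Sample vmstat once per second, five times, and pull out the
    # run queue (field 0) and swap-in/swap-out rates (fields 6 and 7).
    out = subprocess.run(["vmstat", "1", "5"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[2:]:   # skip the two header rows
        fields = line.split()
        runq, page_in, page_out = fields[0], fields[6], fields[7]
        print(f"runq={runq}  page-in={page_in}/s  page-out={page_out}/s")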
We decided to look at all tiers. The application servers experienced a spike in CPU utilization – probably due to the elimination of the Web server bottleneck and more traffic going to the app servers. The Web servers experienced the most benefit. The DB server improvements were not related to the CDN, but rather a DB tuning exercise we undertook.
Another way to look at the network efficiencies gained from offloading: high-content pages experienced the largest drops in packet counts. This results in lower resource utilization of the network gear (routers, switches, etc.), though we didn’t measure that as part of the experiments.
Note the spike over a period of just a couple of hours.
Discuss implications from hosting and cost perspective. For example, we could avoid start-up costs of a new hosting center.
Our bandwidth utilization went up because we eliminated the bottleneck in the Web tier
Akamai offloaded the equivalent of one Gbps connection to our hosting facility
First time in a few years we had an unqualified success in terms of availability.