Implementation of a real-time web log capture and reporting system, developed to provide real-time reports on traffic parameters such as page views, visits and unique visitors.
A Real Time Web Analytics System
1. A real-time Web Analytics System
Mahesh Patwardhan
Digital and New Media Consultant
2. Contents
1. Introduction
2. The Requirements
3. The Architecture
4. The Reports
5. The Implementation
6. Conclusion
3. Introduction
This document describes an implementation of a real-time
web log capture and reporting system.
The system was developed to provide real-time reports on
traffic parameters such as page views, visits and unique
visitors.
It was designed and built to replace the batch-process
system, which generated reports in a deferred mode.
It was also built to allow real-time monitoring of, and action
on, the various online services.
4. Requirements
◦ Shortcomings of the existing system
It generated reports from the previous day’s logs, not in real time
It could not be scaled up
It was not equipped to handle heavy traffic
It had no scope for adding new services
It had no scope for adding or editing logs
◦ The new system was required to provide
Real-time web log capture from web servers at geographically dispersed locations
A robust web log data warehouse
Extensive real-time reports from the web logs
◦ The advantages of the new system would be
Data can be accessed in “real time”
The process can be scaled up to handle more traffic
A new service can be added, or an existing one deleted; a new service can be accessed
from the very next day
Logs can be added and modified
5. …Requirements
◦ The system was required to capture, collate and aggregate the web logs
that accumulate on the web-app servers.
◦ The aggregates needed to be produced in near-real time.
◦ A multi-layer architecture needed to be deployed
a layer of capture agents, one deployed on every web-app server
a layer of collation server applications, which collate data from the capture agents
a layer of computation servers, which aggregate data at high speed
◦ This multi-layer architecture would aggregate data in industry-standard
RDBMS tables, which could then be queried and viewed through user
interface screens.
◦ The aggregate tables were to be updated in near-real time.
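The requirement that aggregates live in ordinary RDBMS tables and are updated in near-real time can be illustrated with a minimal sketch. The slides do not name the database or the schema, so everything here is an assumption made for illustration: SQLite stands in for the RDBMS, and the table `pageviews_by_minute`, its columns and the function names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) aggregate table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pageviews_by_minute ("
        " service TEXT NOT NULL,"
        " page TEXT NOT NULL,"
        " minute TEXT NOT NULL,"
        " views INTEGER NOT NULL,"
        " PRIMARY KEY (service, page, minute))"
    )
    return conn


def add_views(conn, service, page, minute, count):
    """Fold a small batch of page views into the running aggregate."""
    # Try to increment an existing row first; insert one if none exists.
    cur = conn.execute(
        "UPDATE pageviews_by_minute SET views = views + ?"
        " WHERE service = ? AND page = ? AND minute = ?",
        (count, service, page, minute),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO pageviews_by_minute (service, page, minute, views)"
            " VALUES (?, ?, ?, ?)",
            (service, page, minute, count),
        )
    conn.commit()


def views_for(conn, service, minute):
    """Query the aggregate the way a reporting screen might."""
    return conn.execute(
        "SELECT page, views FROM pageviews_by_minute"
        " WHERE service = ? AND minute = ? ORDER BY page",
        (service, minute),
    ).fetchall()
```

Because each batch only increments a counter row, the reporting side can read the table at any moment and see totals that are at most one batch behind, which is one plausible reading of "near-real time".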
7. …Architecture
◦ The architecture has four layers, backed by a database server
Collation clients (L1)
Collation servers (L2)
Computation servers (L3)
Reporting server (L4)
A database server stores the aggregated results.
◦ By design, the architecture is scalable in the
first three layers: L1, L2 and L3.
◦ All the layers communicate with each other over TCP/IP.
8. …Architecture
Each collation client in L1 will connect to one collation server in L2.
◦ A maximum of 30 collation clients can connect to one collation server.
◦ Primary/back-up fail-over features will be provided: if one of the
collation servers fails, the clients connected to it will automatically
shift to other servers in the cluster.
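The connection limit and the fail-over behaviour can be sketched in a few lines. This is only an illustration under stated assumptions: the limit of 30 comes from the slides, but the class `CollationCluster`, its methods and the "least-loaded server first" placement rule are hypothetical, since the slides do not say how a client chooses its server.

```python
MAX_CLIENTS_PER_SERVER = 30  # limit stated in the slides


class CollationCluster:
    """Tracks which collation clients (L1) are attached to which server (L2)."""

    def __init__(self, server_names):
        self.clients = {name: set() for name in server_names}
        self.alive = {name: True for name in server_names}

    def _least_loaded(self):
        # Assumed placement rule: pick the live server with the fewest clients.
        candidates = [
            s for s in self.clients
            if self.alive[s] and len(self.clients[s]) < MAX_CLIENTS_PER_SERVER
        ]
        if not candidates:
            raise RuntimeError("no collation server with spare capacity")
        return min(candidates, key=lambda s: (len(self.clients[s]), s))

    def connect(self, client_id):
        """Attach a client to one server and return that server's name."""
        server = self._least_loaded()
        self.clients[server].add(client_id)
        return server

    def fail(self, server):
        """Mark a server as failed and move its clients to the survivors."""
        self.alive[server] = False
        orphans = sorted(self.clients[server])
        self.clients[server].clear()
        return {client: self.connect(client) for client in orphans}
```

In this sketch a failure raises an error once the surviving servers are full, which makes the capacity planning question visible: the cluster needs enough headroom to absorb the clients of a failed server.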
Computation is distributed across the computation servers (L3) by
service.
◦ The computation required for a service will be handled by that service's computation
server.
◦ Primary/back-up fail-over is not possible in this layer.
◦ If required, the architecture will allow the computation for a single service to be distributed
(for example, there can be two servers performing computations for a
service like e-mail).
The computed (aggregated) information is stored in a database, which
is used by the L4 (Reporting) layer.
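Routing by service, with the option of splitting one service over two computation servers, might look like the sketch below. The server names, the `ROUTES` table and the choice to split a service by hashing the visitor identifier are all assumptions made for illustration; the slides only say that such a split is possible, not how it is done.

```python
import zlib

# Hypothetical routing table: each service maps to its computation server(s).
ROUTES = {
    "email": ["comp-email-1", "comp-email-2"],  # one service split over two servers
    "news": ["comp-news-1"],
}


def computation_server(service, visitor_id, routes=ROUTES):
    """Pick the computation server (L3) for one log record."""
    servers = routes[service]
    # A stable hash keeps all records of a visitor on the same server,
    # so per-visitor figures such as visits can be computed locally.
    index = zlib.crc32(visitor_id.encode("utf-8")) % len(servers)
    return servers[index]
```

Hashing on the visitor rather than distributing records round-robin is a design choice of this sketch: visit and unique-visitor counts depend on seeing every record from a given visitor in one place.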
9. Reports
◦ Hits by time
◦ Page Views by time, by pages
◦ Visits by time, by page
◦ Unique visitor by time, by page
◦ Return frequency
◦ Return visit
◦ Visiting frequency by visitor
◦ Average time spent
◦ By page average time spent
◦ Referrer by domains, URL
10. …Reports
◦ Search engines
◦ Search engine keywords
◦ By search engine by keyword
◦ Browser type, version, OS
◦ Parameter analysis
◦ Country, city, state wise reports
◦ By country top pages
◦ By ISP
◦ Top entry pages
◦ Top exit pages
◦ Path reporting (across service)
◦ Directory filter based reporting
◦ Fall-out reports
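The three headline measures behind many of these reports (page views, visits and unique visitors) can be derived from the same stream of log records. The sketch below is illustrative only: the record layout `(timestamp, visitor_id, page)` is hypothetical, and the rule that a visit ends after 30 minutes of inactivity is a common convention assumed here, not something the slides specify.

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a visit (assumed)


def summarize(records):
    """Compute page views, visits and unique visitors.

    records: iterable of (timestamp_seconds, visitor_id, page) tuples.
    """
    pageviews = 0
    visits = 0
    last_seen = {}  # visitor_id -> timestamp of that visitor's latest record
    for timestamp, visitor, _page in sorted(records):
        pageviews += 1
        previous = last_seen.get(visitor)
        # A new visit starts on first sight or after a long enough gap.
        if previous is None or timestamp - previous > SESSION_TIMEOUT:
            visits += 1
        last_seen[visitor] = timestamp
    return {
        "pageviews": pageviews,
        "visits": visits,
        "unique_visitors": len(last_seen),
    }
```

Grouping the same counters by time bucket or by page would give the "by time, by page" variants listed above.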
11. Implementation
The solution was implemented on
an incremental basis. Deliverables were planned
for each increment based on the requirements
specified. There were five development cycles,
detailed below.
Incremental cycle 1
◦ Setting up the framework for real-time log capture
◦ Health monitoring system
◦ Hits by time
◦ Page Views by time, by pages
12. …Implementation
Incremental cycle 2
◦ Visits by time, by page
◦ Unique visitor by time, by page
◦ Return frequency
◦ Return visit
◦ Visiting frequency by visitor
◦ Average time spent
◦ By page average time spent
Incremental cycle 3
◦ Referrer by domains, URL
◦ Search engines
◦ Search engine keywords
◦ By search engine by keyword
◦ Browser type, version, OS
◦ Parameter analysis
13. …Implementation
◦ Incremental cycle 4
Country, city, state wise reports
By country top pages
By ISP
Top entry pages
Top exit pages
Path reporting (across service)
◦ Incremental cycle 5
Directory filter based reporting
Fall-out reports
◦ The deliverables in each cycle required elements of every layer to be
developed, implemented, tested and deployed. For instance, a few
tables of the final aggregate schema had to be
designed in the first cycle itself, along with the corresponding
reports.
14. Conclusion
◦ This document describes an implementation of a real-time web log capture and
reporting system.
◦ The system was developed to provide real-time reports on traffic
parameters such as page views, visits and unique visitors.
◦ It was designed and built to replace the batch-process system, which
generated reports in a deferred mode and did not allow real-time
monitoring of, and action on, the various online services.
◦ The architecture of the system consists of four layers: the Collation client
agents, the Collation layer, the Computation layer and the Reporting layer.
◦ The existing system was not scalable and provided reports in a deferred mode.
◦ The present system overcomes these shortcomings with a scalable
architecture that provides reports in real time.