2. Data Analytics: An Introduction
Collection
Processing
Modelling
Inference
Visualization
3. Data Analytics: Use Cases
Business Intelligence
Social Networks
Astronomy and Astrophysics
Finance and Stock Market
Medical Imaging
Computer Graphics
Computer Vision
Energy Exploration
Maps
Retail
4. Data Analytics: Why Testing is Important
Volume
Domain
Complexity
Variety
Computations
Testing
15. Data Analytics - Testing
Extract
Transform
Load
Source
Data
Post-ETL Tests
Meta-data
Data transformation
Data quality checks
Business-specific validations
22. Learnings
Analyse → Code → Test
Initial Data Flow
• Pre-defined data template
• Pre-ETL data validations
Domain Knowledge
• KT sessions involving SMEs
• Core computations
Business Involvement
• Test data closer to real-time data
• User flows prioritization
24. Learnings
Testing Process
• Step-wise data validation
• Defect investigation
Test Automation
• Data combinations
• XML test data
Test Execution
• CI test execution
• Execution frequency
Testing Tools
• SpreadsheetGear
• Excel macros
Data Analytics :
The process of collecting and examining data with the goal of discovering useful information.
Exploratory data analytics : e.g. log-file analysis
Driven by a specific problem statement : e.g. Market Basket Analysis
Not always a decision-making system, but sometimes a decision-support system.
Process :
Collection: data gathered in raw format from various sources such as online sources, surveys, satellites, etc.
Processing: organize the data into a standard format
Analysis: build mathematical models fitting the existing data; use these models to infer results for new data
Visualization: communicate the results in the form of tables, graphs and charts
Let's take some examples:
Banks : analyze withdrawal and spending patterns to prevent fraud or identity theft
E-commerce : companies examine navigation patterns to determine customers' buying patterns based upon their previous purchases
Energy : industries are looking into how energy consumption and operating costs can be optimized within a facility
Yes, data analysis is the lifeline of any business; no business can sustain itself without analyzing the available data.
Data analytics is used in many industries to allow companies and organizations to make better business decisions.
Testing plays a very crucial role in building a data analytics product.
Lifeline of any progressive business
Critical in making informed decisions for business planning
The complexities of domain, computation, volume and variety need to be tackled with a planned testing approach
Data Validation :
Ensuring that the data is of the right quality throughout the process
Various stages of data flow : gathering, representing, cleansing, transforming, etc.
Model Implementation :
This is a very crucial part, and in-depth domain knowledge is needed
Validate whether the model chosen is relevant for the respective domain
Understand the statistical model thoroughly, with every parameter involved in the computation
Validate whether the computations are implemented as required, with the right understanding
Business perspective :
Data is available, analysis is performed and some results are out. Now, how do we share them with the business?
We need a clear vision of the business problem that we are trying to solve
It's very important to have the business perspective here to ensure that the data represented serves the purpose
What kinds of charts/graphs are to be displayed, what level of data aggregation is required, and is the UI intuitive?
Raw Data : Gather data in raw format
ETL : Process and organize data:
Extract data from multiple sources
Transform it into the required format
Load the data into the database
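The Extract/Transform/Load steps above can be sketched minimally in Python. The source records, table name and columns here are illustrative assumptions, not from the deck; SQLite stands in for the target database.

```python
import sqlite3

def extract(rows):
    # Extract: pull raw records from a source (an in-memory list here,
    # standing in for a CSV file, API feed or other source system).
    return list(rows)

def transform(records):
    # Transform: normalize text casing and coerce numeric fields.
    return [
        {"city": r["city"].strip().title(), "sales": float(r["sales"])}
        for r in records
    ]

def load(records, conn):
    # Load: insert the cleaned records into the target database table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (city TEXT, sales REAL)")
    conn.executemany(
        "INSERT INTO sales (city, sales) VALUES (:city, :sales)", records
    )
    conn.commit()

raw = [{"city": " bangalore ", "sales": "120.5"}, {"city": "DELHI", "sales": "90"}]
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT city FROM sales ORDER BY city").fetchall())
# [('Bangalore',), ('Delhi',)]
```

A real pipeline would read from files or feeds rather than an in-memory list, but the three-stage shape is the same.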
Modelling:
Initial analysis results in a model, which in turn yields the model parameters
Model implementation : applying the statistical models or algorithms & computations
Aggregation :
Data analysis and computations happen at the granular level
Data needs to be aggregated at various hierarchies & different levels as per the business requirement
Visualization :
Communicate results of the analyzed data through visualization techniques
Effective visual communication through tables, graphs and charts
Format :
Is the data provided in the required format - CSV or Excel, for example?
How many files or worksheets, what sort of data is in each sheet, what data types?
Text casing, date formats, number formats, etc.
Consistency :
* Data needs to be consistent across sources and tables
e.g. there is sales data for a particular city, but the city entry is not present in the reference data;
a cheque is cleared, but there is no corresponding money transaction
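The sales-vs-reference-data example can be expressed as a small check. The city names and record layout are illustrative assumptions:

```python
def check_consistency(sales, reference_cities):
    # Every city appearing in the sales data must exist in the
    # reference data; return the rows that violate this.
    return [row for row in sales if row["city"] not in reference_cities]

reference_cities = {"Mumbai", "Chennai"}
sales = [{"city": "Mumbai", "amount": 10}, {"city": "Kochi", "amount": 5}]
print(check_consistency(sales, reference_cities))
# [{'city': 'Kochi', 'amount': 5}]
```

The same pattern applies to the cheque example: join the cleared cheques against the money transactions and report the orphans.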
Completeness :
Data is complete as expected : every dataset has mandatory and optional aspects.
For instance, in customer data, name, phone & email are mandatory while address might be optional
For example, in retail data an inventory table might show 5 units reduced, whereas the corresponding sales data might not reflect the sale, so some data might be missing here.
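A mandatory-field completeness check, as a sketch using the customer-data example above (the field names are the ones the deck mentions; the records are made up):

```python
def check_completeness(customers, mandatory=("name", "phone", "email")):
    # Flag records whose mandatory fields are missing or empty;
    # optional fields like address are not checked.
    problems = []
    for i, rec in enumerate(customers):
        missing = [f for f in mandatory if not rec.get(f)]
        if missing:
            problems.append((i, missing))
    return problems

customers = [
    {"name": "Asha", "phone": "99999", "email": "a@x.com", "address": ""},
    {"name": "Ravi", "phone": "", "email": "r@x.com"},  # phone missing
]
print(check_completeness(customers))  # [(1, ['phone'])]
```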
Post-ETL Validations :
Meta Data:
Ensuring the data model design is aligned with the real-world domain
Includes testing of data type checks, data length checks and index/constraint checks
Validating the data modelling : dimensions & facts
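Data type and constraint checks can be automated by reading the target schema. A sketch using SQLite's `PRAGMA table_info` as a stand-in for whatever catalog queries the actual DB offers; the `dim_city` table is an illustrative assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_city (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

def column_types(conn, table):
    # Read the declared schema so tests can assert each column's
    # data type and NOT NULL constraint against the data model design.
    return {row[1]: (row[2], bool(row[3]))
            for row in conn.execute(f"PRAGMA table_info({table})")}

print(column_types(conn, "dim_city"))
# {'id': ('INTEGER', False), 'name': ('TEXT', True)}
```

On other databases the same check would query `information_schema.columns` instead.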
Transformation :
Validate whether the transformed data values are the expected data values.
Validate the data transformation rules and the source-to-target mapping
Usually performed by validating counts, aggregates and actual data between the source and target
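A source-to-target reconciliation of counts and aggregates might look like this. The table names, columns and rows are illustrative assumptions, with SQLite standing in for the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_sales (city TEXT, amount REAL);
    CREATE TABLE target_sales (city TEXT, amount REAL);
    INSERT INTO source_sales VALUES ('Pune', 100), ('Pune', 50), ('Goa', 30);
    INSERT INTO target_sales VALUES ('Pune', 100), ('Pune', 50), ('Goa', 30);
""")

def reconcile(conn, src, tgt):
    # Compare row counts and per-city aggregates between source and target;
    # a mismatch indicates lost or mis-transformed rows.
    n_src = conn.execute(f"SELECT COUNT(*) FROM {src}").fetchone()[0]
    n_tgt = conn.execute(f"SELECT COUNT(*) FROM {tgt}").fetchone()[0]
    agg = "SELECT city, SUM(amount) FROM {} GROUP BY city ORDER BY city"
    return (n_src == n_tgt and
            conn.execute(agg.format(src)).fetchall() ==
            conn.execute(agg.format(tgt)).fetchall())

print(reconcile(conn, "source_sales", "target_sales"))  # True
```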
Quality :
Includes data checks (text case, special characters, number checks/precision, date format, etc.)
Data constraint checks – ensuring the data transformation conforms to the model: foreign key constraints, unique key constraints, null values, etc.
Ensure all the expected data is loaded into the DB completely
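A couple of the quality checks listed above (unique keys, null values, date format), sketched over made-up rows:

```python
import re

def quality_checks(rows):
    # Basic data-quality checks: non-null ids, unique ids,
    # and dates in ISO YYYY-MM-DD format.
    errors = []
    seen_ids = set()
    for r in rows:
        if r["id"] is None:
            errors.append(("null id", r))
        elif r["id"] in seen_ids:
            errors.append(("duplicate id", r))
        else:
            seen_ids.add(r["id"])
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", r["date"]):
            errors.append(("bad date format", r))
    return errors

rows = [
    {"id": 1, "date": "2015-03-01"},
    {"id": 1, "date": "01/03/2015"},  # duplicate id AND wrong date format
]
print(len(quality_checks(rows)))  # 2
```

In practice many of these checks are pushed down into SQL over the loaded tables rather than run row-by-row in code.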
Business Specific :
Business-side validations, domain specific, possible values
Client agnostic as well as client-specific data checks
Model Validation :
Validating whether the model chosen is relevant to the domain
Performed by applying the model to past historical data
Uses statistical metrics like R² etc.
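The R² metric mentioned above compares the model's predictions on historical data against the actual values; the numbers below are made up for illustration:

```python
def r_squared(actual, predicted):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    # 1.0 means a perfect fit; values near 0 mean the model
    # explains no more than the mean of the actuals.
    mean = sum(actual) / len(actual)
    ss_tot = sum((y - mean) ** 2 for y in actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.2, 8.9]  # back-tested model output on historic data
print(round(r_squared(actual, predicted), 3))  # 0.995
```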
Implementation :
Understanding the logic behind the model/algorithms
Getting the right values for the model parameters
Computation :
Validating the core analytics engine's step-wise computations
Aggregation :
Data should be aggregated at the required hierarchy level
Relevant data, as per the scope, has to be considered for aggregation
The summarized values computed for the selected data should be validated
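One useful aggregation check: roll the granular data up at each hierarchy level and verify that the totals agree. The region/city hierarchy and figures here are illustrative assumptions:

```python
from collections import defaultdict

def aggregate(rows, level):
    # Roll granular sales up to the requested hierarchy level.
    totals = defaultdict(float)
    for r in rows:
        totals[r[level]] += r["sales"]
    return dict(totals)

granular = [
    {"region": "South", "city": "Chennai", "sales": 10.0},
    {"region": "South", "city": "Bangalore", "sales": 20.0},
    {"region": "North", "city": "Delhi", "sales": 15.0},
]
by_city = aggregate(granular, "city")
by_region = aggregate(granular, "region")
# The grand total must match at every level of the hierarchy.
assert sum(by_city.values()) == sum(by_region.values()) == 45.0
print(by_region)  # {'South': 30.0, 'North': 15.0}
```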
UI Validations :
Ensuring correct data representation in the form of tables, charts and graphs
Validating the format of representation – units, scale, alignment, unit conversion, etc.
Usability testing w.r.t. the tables, graphs and charts : color combinations, filtering, UI interaction, etc.
Initial Client data flow :
Setting a pre-defined data template
Pre-validations before data handover
Domain Knowledge :
Domain intensive : KT sessions within the team, and validating the understanding with SMEs
Mimicking the simulation calculations in Excel with a smaller dataset to build a thorough understanding
Business involvement :
Providing a test dataset closer to real-time data
Prioritizing the test scenarios to reflect the real user experience
Implementation
No easy way to come up with the expected data, so we decided on a parallel implementation
Business involvement in testing the model implementation
Computation/performance
Understanding the transformations, data explosions, data representation & the table joins
Analyzing the factors involved in computation which influence the time/memory
Test data :
What subset of data would suffice to get the best data distribution, bridging the gap between ideal & real-world data
Coming up with edge-case datasets
Testing process :
Testing data at every stage of data transformation
Defect investigation with QA/Dev pairing
Tools :
Choice of tools to fit the purpose and the intended users of the tool
SpreadsheetGear, Excel macros, App manager
Automation :
The DB structure varies per client; we needed generic (metadata SQLs) as well as client-specific tests, and there were too many data combinations – so we built a data-driven framework
XML test data to segregate the data for the various clients
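A sketch of what client-segregated XML test data could look like for such a data-driven framework. The element names, clients and expected values are illustrative assumptions, not the deck's actual format:

```python
import xml.etree.ElementTree as ET

XML = """
<testdata>
  <client name="clientA">
    <case table="sales" expected_rows="120"/>
  </client>
  <client name="clientB">
    <case table="sales" expected_rows="80"/>
  </client>
</testdata>
"""

def cases_for(xml_text, client):
    # Pull only the test cases that apply to the given client,
    # so one suite can drive many client-specific DB structures.
    root = ET.fromstring(xml_text)
    return [
        (c.get("table"), int(c.get("expected_rows")))
        for node in root.findall(f"client[@name='{client}']")
        for c in node.findall("case")
    ]

print(cases_for(XML, "clientA"))  # [('sales', 120)]
```

The test runner would iterate over these tuples and compare each expectation against the loaded database.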
Execution :
Due to h/w, memory and time constraints, the test execution in CI had to be organized cautiously
Though automation was implemented at every stage, we carefully decided to what extent automation coverage was required at each stage, and accordingly decided the test execution frequency
Divide & conquer
QA/Dev pairing
Data combinations : the system is used by multiple users with differing backgrounds – varying metadata
Test data in XML to support this
20% of the possible dataset to cover 80% of the common use cases
SME involvement in edge cases
Automation at every layer : cautious in deciding the extent of automation
Execution frequency : resource usage & computation time, and SME availability
Choice of tools