This document proposes extending the Object Constraint Language (OCL) with predefined aggregation functions to allow specification of complex queries in conceptual data warehouse schemas. It defines three categories of aggregation functions - distributive, algebraic, and holistic - and provides OCL definitions for examples like max, min, sum, count, average, variance and rank. The extension is validated using a modeling tool and two model-driven development scenarios are described for when the target platform does or does not directly support the new functions. The proposal aims to make conceptual modeling more complete by integrating query specification.
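The three categories can be illustrated with a small sketch. It assumes the usual data-cube reading of the terms: a distributive function can be computed from partial results, an algebraic one from a fixed number of distributive results, and a holistic one needs all the values. The partitions and their values below are hypothetical.

```python
from statistics import median

# Two hypothetical partitions of one measure (e.g. miles per flight leg).
part_a = [100, 250, 400]
part_b = [50, 300]
whole = part_a + part_b

# Distributive: combining the partial aggregates gives the global aggregate.
sum_from_parts = sum([sum(part_a), sum(part_b)])
max_from_parts = max([max(part_a), max(part_b)])

# Algebraic: derived from a fixed number of distributive results (sum and count).
avg_from_parts = (sum(part_a) + sum(part_b)) / (len(part_a) + len(part_b))

# Holistic: partial results are not enough; the median of the partial
# medians generally differs from the median of all the values.
median_whole = median(whole)
median_of_medians = median([median(part_a), median(part_b)])
```

In this sketch the sum, max and average computed from the partitions match the values computed over the whole collection, while the median does not.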
2. Introduction. Conceptual modeling has proved very useful in the development of data warehouse systems. Main benefits of conceptual modeling: an implementation-independent view of the system; the possibility of (semi)automatic code generation; better maintainability and evolution; … Several proposals go in this direction: a UML profile for multidimensional modeling of data warehouses [Luján et al., DKE 2007] and a model-driven approach for the development of data warehouses [Mazón & Trujillo, DSS 2008].
3. Conceptual Modeling of DWH (1/2). Modeling multidimensional concepts at the conceptual level: data is structured in a multidimensional space. Dimensions specify the different ways the data can be viewed, aggregated, and sorted (e.g., according to time, store, customer, product, etc.). Events of interest to an analyst are represented as facts, which are associated with cells or points in the multidimensional space and are described in terms of a set of measures. Logical details are abstracted away: technology (relational, multidimensional, ...) and logical variations (star schema, snowflake schema, ...). A logical representation is then obtained automatically through a model-driven approach.
4. Conceptual Modeling of DWH. Example: an airline’s marketing department wants to analyze the flight activity of each member of its frequent flyer program.
14. Some examples (1/3)
MAX: Returns the element with the highest value in a non-empty collection of objects of type T.
    context Collection::max(): T
    pre: self->notEmpty()
    post: result = self->any(e | self->forAll(e2 | e >= e2))
COUNT DISTINCT: Returns the number of different elements in a collection.
    context Collection::countDistinct(): Integer
    post: result = self->asSet()->size()
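To make the pre- and postconditions of these two operations concrete, here is a minimal Python sketch that mirrors them step by step. The function names `ocl_max` and `count_distinct` are illustrative and not part of the proposal; the sketch assumes the elements are comparable with `>=` and hashable.

```python
def ocl_max(collection):
    """Mirror of Collection::max(): an element e with e >= e2 for every e2."""
    items = list(collection)
    # pre: self->notEmpty()
    if not items:
        raise ValueError("max() requires a non-empty collection")
    # post: result = self->any(e | self->forAll(e2 | e >= e2))
    for e in items:
        if all(e >= e2 for e2 in items):
            return e
    # Only reachable if the elements are not totally ordered.
    raise ValueError("no element is >= all the others")


def count_distinct(collection):
    """Mirror of Collection::countDistinct()."""
    # post: result = self->asSet()->size()
    return len(set(collection))
```

The quadratic search in `ocl_max` follows the declarative postcondition literally; an implementation would normally use a single pass.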
15. Some examples (2/3)
AVG: Returns the arithmetic average of the elements in a non-empty collection.
    context Collection::avg(): Real
    pre: self->notEmpty()
    post: result = self->sum() / self->size()
COVARIANCE: Returns the covariance between two ordered sets of the same size.
    context OrderedSet::covariance(Y: OrderedSet): Real
    pre: self->size() = Y->size() and self->notEmpty()
    post: let avgY: Real = Y->avg() in
          let avgSelf: Real = self->avg() in
          result = (1 / self->size()) *
              self->iterate(e; acc: Real = 0 |
                  acc + ((e - avgSelf) * (Y->at(self->indexOf(e)) - avgY)))
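The following Python sketch mirrors the two definitions, assuming numeric elements. The covariance postcondition divides by the collection size, so this is the population covariance. Pairing elements with `zip` corresponds to `Y->at(self->indexOf(e))` because the elements of an OrderedSet are unique; the function names are illustrative only.

```python
def avg(values):
    """Mirror of Collection::avg()."""
    values = list(values)
    # pre: self->notEmpty()
    if not values:
        raise ValueError("avg() requires a non-empty collection")
    # post: result = self->sum() / self->size()
    return sum(values) / len(values)


def covariance(xs, ys):
    """Mirror of OrderedSet::covariance(Y), pairing elements by position."""
    xs, ys = list(xs), list(ys)
    # pre: self->size() = Y->size() and self->notEmpty()
    if len(xs) != len(ys) or not xs:
        raise ValueError("covariance() requires two non-empty collections of equal size")
    avg_x, avg_y = avg(xs), avg(ys)
    # iterate(e; acc: Real = 0 | acc + (e - avgSelf) * (y - avgY))
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += (x - avg_x) * (y - avg_y)
    return acc / len(xs)
```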
16. Some examples (3/3)
MODE: Returns the most frequent value in a non-empty collection.
    context Collection::mode(): T
    pre: self->notEmpty()
    post: result = self->any(e | self->forAll(e2 | self->count(e) >= self->count(e2)))
DESCENDING RANK: Returns the position (i.e., ranking) of an element within a collection.
    context Collection::rankDescending(e: T): Integer
    pre: self->includes(e)
    post: result = self->size() - self->select(e2 | e >= e2)->size() + 1
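A Python sketch of the same two postconditions is given below, with illustrative function names. The OCL `any` leaves the choice open when several values are equally frequent; this sketch assumes the first such value in iteration order is acceptable.

```python
from collections import Counter


def mode(collection):
    """Mirror of Collection::mode()."""
    items = list(collection)
    # pre: self->notEmpty()
    if not items:
        raise ValueError("mode() requires a non-empty collection")
    counts = Counter(items)
    highest = max(counts.values())
    # post: any e whose count is >= the count of every other element
    for e in items:
        if counts[e] == highest:
            return e


def rank_descending(collection, e):
    """Mirror of Collection::rankDescending(e)."""
    items = list(collection)
    # pre: self->includes(e)
    if e not in items:
        raise ValueError("rankDescending() requires e to be in the collection")
    # post: result = self->size() - self->select(e2 | e >= e2)->size() + 1
    return len(items) - sum(1 for e2 in items if e >= e2) + 1
```

With this definition the highest element gets rank 1 and the lowest element of a collection of distinct values gets a rank equal to the collection size.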
17. Using our new aggregate functions. Our functions can be used wherever a standard OCL function can be used, and they are called in exactly the same way. Example: using the avg function to compute the average number of miles earned by a customer in each flight leg.
    context Customer::avgMilesPerFlightLeg(): Real
    body: self.frequentFlyerLegs.Miles->avg()
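As a rough illustration of what this query computes, the sketch below builds a tiny object model in Python. The class and attribute names follow the slide (Customer, frequent flyer legs, miles); the sample customer and mile values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FrequentFlyerLeg:
    miles: int


@dataclass
class Customer:
    name: str
    frequent_flyer_legs: list = field(default_factory=list)

    def avg_miles_per_flight_leg(self):
        # Navigation self.frequentFlyerLegs.Miles: collect the miles of every leg.
        miles = [leg.miles for leg in self.frequent_flyer_legs]
        # avg() precondition: the collection must not be empty.
        if not miles:
            raise ValueError("customer has no frequent flyer legs")
        return sum(miles) / len(miles)


# Hypothetical sample data.
customer = Customer(
    "Ann",
    [FrequentFlyerLeg(500), FrequentFlyerLeg(1500), FrequentFlyerLeg(1000)],
)
average = customer.avg_miles_per_flight_leg()
```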
18. MDD of our “enriched” DWH CSs. To be useful, conceptual schemas (CSs) that use our new aggregate functions must be usable as input to MDD processes and tools. Current MDD methods do NOT need to be extended to cope with enriched CSs: our library is written in OCL itself (platform-independent), and complex functions can be reduced to standard OCL functions. There are two scenarios, depending on whether the target implementation platform directly supports our functions. When it does not, the functions must be preprocessed to re-express them in terms of standard OCL operations. Existing OCLtoX (X = Java, SQL, …) tools can help in the process.
20. MDD Scenario 2: Normalization/unfolding
Operation as specified with the new function:
    context Customer::avgMilesPerFlightLeg(): Real
    body: self.frequentFlyerLegs.Miles->avg()
After unfolding avg() into standard OCL operations:
    context Customer::avgMilesPerFlightLeg(): Real
    post: result = self.frequentFlyerLegs.Miles->sum() / self.frequentFlyerLegs.Miles->size()
(b) Java code:
    class Customer {
        int id;
        String name;
        Vector<FrequentFlyerLegs> f;
        ...
        public float avgMiles() {
            // sumMiles is a helper that is not shown on the slide
            return sumMiles(f) / f.size();
        }
    }
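The unfolding step can be sketched as a textual rewrite. This is only an illustration under strong assumptions: it handles a plain navigation expression followed by `->avg()`, whereas an actual tool would presumably work on the parsed OCL expression rather than on strings. The function name `unfold_avg` and the regular expression are hypothetical.

```python
import re

# A navigation expression (names, dots, optional "->" hops) followed by ->avg().
_AVG_CALL = re.compile(r"([\w.]+(?:->[\w.]+)*)->avg\(\)")


def unfold_avg(ocl_expression: str) -> str:
    """Rewrite `src->avg()` as `src->sum() / src->size()`."""
    return _AVG_CALL.sub(r"\1->sum() / \1->size()", ocl_expression)


body = "self.frequentFlyerLegs.Miles->avg()"
unfolded = unfold_avg(body)
print(unfolded)
```

Expressions that do not call `avg()` are returned unchanged, so the rewrite can be applied to every operation body of a schema before handing it to an existing OCL-to-code generator.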
21. Validation. Our OCL extension has been validated using the UML-based Specification Environment (USE) tool. Our functions have been added to USE as new user-defined functions. The analysis has two phases. Syntactic analysis: USE parses the OCL operations and checks their syntactic correctness. Semantic analysis: USE executes the operations on sample scenarios; by analyzing the results, we can check whether the operations behave as expected.
23. Conclusions. Complex aggregation functions should be part of the predefined constructs provided by modeling languages. We made this possible by extending OCL. Queries written with this “extended OCL” can be animated and validated at design time and automatically implemented along with the rest of the DWH CS.
24. Further Work. Providing mechanisms for defining and validating multidimensional queries at the conceptual level in a more intuitive manner. Natural language: OCL <-> Semantics of Business Vocabulary and Business Rules (SBVR) [Cabot et al., Inf. Syst. 2010]. Verifying the proper use of the aggregation function chosen by the designer: the kind of aggregation function that can be applied depends on the kind of measure and the kind of dimension. E.g., temperatures cannot be aggregated along either the time or the location dimension.
25. Continuing the discussion jtrujillo@dlsi.ua.es jordi.cabot@inria.fr http://modeling-languages.com @softmodeling