Finding something different:
        Arrays in database systems,
             the next frontier ?

                   Martin Kersten
                       CWI



© M Kersten 2012
Science applications




Public database of 4-40 TB
Relational schema of around 200 pages SQL
Relational tables up to 20B elements
Finding closely related sky objects

[Figure: two tables, one with 446 columns and >585 million rows, one with 6 columns and >20 billion rows]
The LOFAR radio telescope
 Complex image processing pipeline (Blue Gene)
 Transient Sky Objects database (50 TB/yr)
 Finding transients within a 4-second timeframe




Data warehouse of seismic data
Highly compressed file repository
 (>3.5M files and 15-150 TB)
About to explode due to sensor network
Finding warning signals




Remote sensing
Processing pipeline to interpret images, < 1 TB/yr


Finding and detecting forest fires




[Diagram: the science software landscape]
Languages and tools: Matlab, Python, C, R
RDBMS: SQL, *-API
SciQL
Interdependent software libraries
File formats: FITS, mSEED, geoTIFF, HDF5, NetCDF, …
Datavault
Agenda

Array support in database systems


SciQL array query language


A crash course on column-stores


SciQL implementation approach


What is an array?
An array is a systematic arrangement of objects
 addressed by dimension values.
      Get(A, X, Y,…) => Value
      Set(A, X, Y,…) <= Value


There are many species:
 vector, bit array, dynamic array, parallel array,
 sparse array, variable length array, jagged array
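
The Get/Set view of an array can be sketched in a few lines of Python. This is an illustrative model only: the class name `DenseArray` and its methods are hypothetical and do not belong to any system discussed in this talk.

```python
from itertools import product


class DenseArray:
    """Hypothetical model: a bounded array addressed by dimension values."""

    def __init__(self, shape, default=None):
        self.shape = tuple(shape)
        # Every cell exists up front and holds the default value.
        self.cells = {idx: default
                      for idx in product(*(range(n) for n in self.shape))}

    def get(self, *idx):
        # Get(A, X, Y, ...) => Value
        return self.cells[idx]

    def set(self, *idx, value):
        # Set(A, X, Y, ...) <= Value
        if idx not in self.cells:
            raise IndexError(f"{idx} is outside the array bounds {self.shape}")
        self.cells[idx] = value


a = DenseArray((2, 3), default=0)
a.set(1, 2, value=7)
print(a.get(1, 2), a.get(0, 0))
```

A dictionary keyed by index tuples is used here for clarity; a dense implementation would more likely compute an offset into one contiguous block.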



Who needs them anyway ?
Seismology         – partial time-series
Climate simulation – temporal ordered grid
Astronomy          – temporal ordered images
Remote sensing     – image processing
Social networks    – graph algorithms
Genomics           – ordered strings
Forensics          – images, strings, graphs
Scientists ‘love them’: MSEED, NETCDF, FITS, CSV, …
Arrays in DBMS
Relational prototype built on arrays: Peterlee IS Vehicle (1975)


Persistent programming languages: Astral (1980), Plain (1980)


Object orientation and persistent languages were the make-believe
 answer for handling them: O2 (1992)


Several array algebras: AML (2002), AQuery (2003), RAM (2004),
  SRAM (2012)

Array declarations:
CREATE TABLE sal_emp ( name text, pay_by_quarter integer[], schedule text[][]);
CREATE TABLE tictactoe ( squares integer[3][3] );



Array operations: denotation ([]), contains (@>), is
  contained in (<@), append, concat (||),
  dimension, lower, upper, prepend, to-string,
  from-string, …


Array constraints: none, no enforcement of
  dimensions.
SQL 2003
Arrays are attribute type constructors
Arrays can be declared without a maximum cardinality
Array nesting is unrestricted.
Query results can be converted into arrays.


CREATE TABLE listbox( choices CHAR(3) ARRAY[1000] NOT NULL);
INSERT INTO listbox_choices
VALUES( 'Department Names',
ARRAY(SELECT name FROM sales.depts ORDER BY 1));




Breaks large C++ arrays (rasters) into disjoint chunks

Maps chunks into large binary objects (blob)

Provide function interface to access them

RASQL, a SQL92 extension

Known to work up to 12 TB.
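
The chunk-and-blob approach can be illustrated with a small Python sketch. It is an assumption-laden toy, not the storage layout of any real system: the functions `to_chunks` and `get_cell`, the tile size, and the blob format (a width field followed by the cell values) are all invented for illustration.

```python
import struct


def to_chunks(raster, tile):
    """Split a 2-D list of floats into disjoint tile x tile chunks, packed as bytes."""
    rows, cols = len(raster), len(raster[0])
    chunks = {}
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            block = [raster[r][c0:c0 + tile]
                     for r in range(r0, min(r0 + tile, rows))]
            width = len(block[0])
            flat = [v for row in block for v in row]
            # Hypothetical blob layout: chunk width, then the cell values.
            chunks[(r0 // tile, c0 // tile)] = struct.pack(
                f"<I{len(flat)}d", width, *flat)
    return chunks


def get_cell(chunks, tile, r, c):
    """Function-style access: locate the chunk, then the cell inside its blob."""
    blob = chunks[(r // tile, c // tile)]
    (width,) = struct.unpack_from("<I", blob, 0)
    offset = 4 + 8 * ((r % tile) * width + (c % tile))
    (value,) = struct.unpack_from("<d", blob, offset)
    return value


raster = [[float(10 * r + c) for c in range(5)] for r in range(5)]
chunks = to_chunks(raster, tile=2)
print(len(chunks), get_cell(chunks, 2, 4, 3), get_cell(chunks, 2, 3, 4))
```

Every cell access has to go through the function interface and decode a blob, which is one reason the query language of the host DBMS cannot see inside the array.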


Breaks large C++ arrays (rasters) into overlapping
  chunks

Built storage manager from scratch

Map-reduce processing model

Provide function interface to access them

AQL, a crippled SQL92


What is the problem?

-  Appropriate array denotations? Query language
-  Functionally complete operation set?
-  Mature implementations? Systems
-  Size limitations due to (blob) representations?
-  Scale out?
-  Community awareness? Education



Agenda

Array support in database systems


SciQL array query language


A crash course on column-stores


SciQL implementation approach


MonetDB SciQL

SciQL (pronounced ‘cycle’ )
•  A backward compatible extension of SQL’03
•  Symbiosis of relational and array paradigm
•  Flexible structure-based grouping
•  Capitalizes on the MonetDB physical array storage
  •  Recycling, an adaptive ‘materialized view’
  •  Zero-cost attachment contract for cooperative clients
                   http://www.scilens.org/Resources/SciQL


Table vs arrays

CREATE TABLE tmp                  CREATE ARRAY tmp
A collection of tuples            A collection of a priori defined tuples
Indexed by a (primary) key        Indexed by dimension expressions
Default handling                  Implicitly defined by default value
Explicitly created using          To be updated with INS/DEL/UPD
  INS/UPD/DEL
SciQL examples
CREATE TABLE matrix (             CREATE ARRAY matrix (
  x integer,                        x integer DIMENSION[2],
  y integer,                        y integer DIMENSION[2],
  value float,                      value float DEFAULT 0);
  PRIMARY KEY (x,y) );


INSERT INTO matrix VALUES
(0,0,0),(0,1,0),(1,1,0),(1,0,0);

Table contents (x, y, value):     Array contents (2 x 2 grid; cells outside the bounds are null):
  0  0  0                           y=1 |  0    0
  0  1  0                           y=0 |  0    0
  1  1  0                                 x=0  x=1
  1  0  0
SciQL examples
CREATE TABLE matrix (             CREATE ARRAY matrix (
  x integer,                        x integer DIMENSION[2],
  y integer,                        y integer DIMENSION[2],
  value float,                      value float DEFAULT 0);
  PRIMARY KEY (x,y) );


DELETE FROM matrix WHERE y=1      DELETE FROM matrix WHERE y=1
                                  A hole in the array

Table contents (x, y, value):     Array contents:
  0  0  0                           y=1 | null  null
  1  0  0                           y=0 |  0     0
                                          x=0   x=1
SciQL examples
CREATE TABLE matrix (             CREATE ARRAY matrix (
  x integer,                        x integer DIMENSION[2],
  y integer,                        y integer DIMENSION[2],
  value float,                      value float DEFAULT 0);
  PRIMARY KEY (x,y) );


INSERT INTO matrix VALUES         INSERT INTO matrix VALUES
(0,1,1), (1,1,2)                  (0,1,1), (1,1,2)

Table contents (x, y, value):     Array contents:
  0  0  0                           y=1 |  1    2
  1  0  0                           y=0 |  0    0
  0  1  1                                 x=0  x=1
  1  1  2
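
The array side of the preceding examples can be mimicked in Python. This is a hedged sketch of the behaviour shown on the slides (cells exist from creation, DELETE leaves a hole, INSERT overwrites); the class `SciArray` and its methods are hypothetical and say nothing about how MonetDB implements arrays.

```python
class SciArray:
    """Hypothetical model of CREATE ARRAY with bounded dimensions and a default."""

    def __init__(self, nx, ny, default):
        # All cells exist from the start and carry the default value.
        self.cells = {(x, y): default for x in range(nx) for y in range(ny)}

    def delete_where(self, pred):
        # DELETE leaves a hole: the cell stays, its value becomes null (None).
        for (x, y) in self.cells:
            if pred(x, y):
                self.cells[(x, y)] = None

    def insert(self, rows):
        # INSERT overwrites the value of an existing cell.
        for x, y, value in rows:
            if (x, y) not in self.cells:
                raise IndexError((x, y))
            self.cells[(x, y)] = value


matrix = SciArray(2, 2, default=0)
matrix.delete_where(lambda x, y: y == 1)
after_delete = dict(matrix.cells)
matrix.insert([(0, 1, 1), (1, 1, 2)])
print(after_delete)
print(matrix.cells)
```

The number of cells never changes in this model, which is the key contrast with the table on the left, where rows appear and disappear.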
SciQL unbounded arrays
CREATE TABLE matrix (       CREATE ARRAY matrix (
  x integer,                  x integer DIMENSION,
  y integer,                  y integer DIMENSION,
  value float,                value float DEFAULT 0);
  PRIMARY KEY (x,y) );


INSERT INTO matrix VALUES   INSERT INTO matrix VALUES
(0,2,1), (0,1,2)            (0,2,1), (0,1,2)


         0      2    1                  2      1   0

         0      1    2                  1      0   0
                                        0      0   2
                                               0   1
SciQL Dimensions
Unbounded Dimensions
  scalar-type DIMENSION


Bounded Dimensions
  scalar-type DIMENSION[stop]
  scalar-type DIMENSION[first: step: stop]
  scalar-type DIMENSION[*: *: *]


timestamp DIMENSION [ ‘2010-01-19’ : ‘1’ minute : *]
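
A dimension range can be read as a generator of index values. The sketch below assumes a positive step and a right-open interval (the stop value is excluded), which is consistent with `DIMENSION[2]` producing the indices 0 and 1 in the earlier examples. The function `dimension_values` is hypothetical.

```python
from datetime import datetime, timedelta


def dimension_values(first, step, stop):
    """Enumerate the index values of DIMENSION[first:step:stop], stop excluded.

    Assumes step is positive and stop is a concrete bound (not '*').
    """
    values = []
    current = first
    while current < stop:
        values.append(current)
        current = current + step
    return values


print(dimension_values(0, 1, 2))     # DIMENSION[2], assuming first=0 and step=1
print(dimension_values(-1, 1, 5))    # DIMENSION[-1:5], assuming step=1
ticks = dimension_values(datetime(2010, 1, 19),
                         timedelta(minutes=1),
                         datetime(2010, 1, 19, 0, 3))
print(ticks)
```

An unbounded bound (`*`) has no counterpart here; in that case the set of valid indices would follow from the data rather than from the declaration.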

SciQL table queries
-- Dimension names make query formulation easier
CREATE ARRAY matrix (
  x integer DIMENSION,
  y integer DIMENSION,
  value float DEFAULT 0 );


-- simple checker boarding aggregation
SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0
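
For readers who think in loops, the checkerboard query corresponds to the following Python sketch. The 4 x 4 sample data is made up for illustration.

```python
# Hypothetical sample array: cell (x, y) holds the value 10*x + y.
matrix = {(x, y): float(10 * x + y) for x in range(4) for y in range(4)}

# SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0
total = sum(v for (x, y), v in matrix.items() if (x + y) % 2 == 0)
print(total)
```

The dimension names `x` and `y` play the role of ordinary columns in the predicate, which is what makes the query easy to formulate.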




SciQL array queries
CREATE ARRAY matrix (           CREATE ARRAY result(
  x integer DIMENSION,            x integer DIMENSION,
  y integer DIMENSION,            value float DEFAULT 0 );
  value float DEFAULT 0 );



-- group based aggregation to construct an unbounded vector
SELECT [x], sum(value) FROM matrix
  WHERE (x + y) % 2 = 0
  GROUP BY x;
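
The same query in plain Python, as a sketch over made-up sample data: the `[x]` in the select list marks `x` as the dimension of the result, so the output is a vector indexed by `x`.

```python
from collections import defaultdict

# Hypothetical sample array: cell (x, y) holds the value 10*x + y.
matrix = {(x, y): float(10 * x + y) for x in range(4) for y in range(4)}

# SELECT [x], sum(value) FROM matrix WHERE (x + y) % 2 = 0 GROUP BY x
sums = defaultdict(float)
for (x, y), v in matrix.items():
    if (x + y) % 2 == 0:
        sums[x] += v

# The result is a one-dimensional array indexed by x.
result = [sums[x] for x in sorted(sums)]
print(result)
```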

SciQL array views
CREATE ARRAY vmatrix (
  x integer DIMENSION[-1:5],
  y integer DIMENSION[-1:5],
  value float DEFAULT -1 )
AS SELECT x, y, value FROM matrix;


                   -1   -1   -1    -1
                   -1   0      0   -1
                   -1   0      0   -1
                   -1   -1   -1    -1



SciQL tiling examples
                   V0,3   V1,3   V2,3   V3,3


                   V0,2   V1,2   V2,2   V3,2


                   V0,1   V1,1   V2,1   V3,1

Anchor
Point              V0,0   V1,0   V2,0   V3,0




           SELECT x, y, avg(value)
           FROM matrix
           GROUP BY matrix[x : 1 : x+2][y : 1 : y+2];
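
A minimal Python sketch of this structural grouping, under two assumptions that the slide does not spell out: the slice bounds are right-open (so each tile is 2 x 2), and cells that fall outside the array count as null and are ignored by `avg`. The function `tile_avg` and the sample data are hypothetical.

```python
def tile_avg(matrix, x, y, dx=2, dy=2):
    """Average over matrix[x : x+dx][y : y+dy]; cells outside the array are skipped."""
    values = [matrix[(i, j)]
              for i in range(x, x + dx)
              for j in range(y, y + dy)
              if (i, j) in matrix]
    return sum(values) / len(values)


# Hypothetical 4 x 4 array: cell (x, y) holds the value x + 4*y.
matrix = {(x, y): float(x + 4 * y) for x in range(4) for y in range(4)}

# One tile per anchor point, so neighbouring tiles overlap.
result = {(x, y): tile_avg(matrix, x, y) for (x, y) in matrix}
print(result[(0, 0)], result[(3, 0)], result[(3, 3)])
```

Every cell acts as an anchor point, so the result has as many cells as the input array.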


SciQL tiling examples
                   V0,3   V1,3   V2,3   V3,3


                   V0,2   V1,2   V2,2   V3,2


                   V0,1   V1,1   V2,1   V3,1

Anchor
Point              V0,0   V1,0   V2,0   V3,0




         SELECT x, y, avg(value)
         FROM matrix
         GROUP BY DISTINCT matrix[x:1:x+2][y:1:y+2];
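
With DISTINCT the tiles no longer overlap. The sketch below assumes that tiling starts at the array origin and that slice bounds are right-open; the function `distinct_tiles` and the sample data are hypothetical.

```python
def distinct_tiles(matrix, nx, ny, dx=2, dy=2):
    """Average per non-overlapping dx x dy tile, keyed by the tile's anchor point."""
    result = {}
    for x in range(0, nx, dx):
        for y in range(0, ny, dy):
            values = [matrix[(i, j)]
                      for i in range(x, min(x + dx, nx))
                      for j in range(y, min(y + dy, ny))]
            result[(x, y)] = sum(values) / len(values)
    return result


# Hypothetical 4 x 4 array: cell (x, y) holds the value x + 4*y.
matrix = {(x, y): float(x + 4 * y) for x in range(4) for y in range(4)}
result = distinct_tiles(matrix, 4, 4)
print(result)
```

A 4 x 4 array yields four 2 x 2 tiles, so this form of grouping reduces the resolution of the array.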


SciQL tiling examples
                   V0,3   V1,3   V2,3   V3,3

       Anchor
       Point       V0,2   V1,2   V2,2   V3,2


                   V0,1   V1,1   V2,1   V3,1
           null

                   V0,0   V1,0   V2,0   V3,0
           null                                null



     SELECT x, y, avg(value)
     FROM matrix
     GROUP BY DISTINCT matrix[x-1:1:x+1][y:1:y+2];


SciQL tiling examples
                   V0,3   V1,3   V2,3   V3,3

  Anchor
  Point            V0,2   V1,2   V2,2   V3,2


                   V0,1   V1,1   V2,1   V3,1


                   V0,0   V1,0   V2,0   V3,0




           SELECT x, y, avg(value)
           FROM matrix
           GROUP BY matrix[x][y],
            matrix[x-1][y], matrix[x+1][y],
            matrix[x][y-1], matrix[x][y+1];
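
Listing individual cells in the GROUP BY clause describes a stencil: here the anchor cell plus its four direct neighbours. A Python sketch, assuming that neighbours outside the array are null and ignored by `avg`; the function name and sample data are hypothetical.

```python
def neighbourhood_avg(matrix, x, y):
    """Average of cell (x, y) and its left, right, lower and upper neighbours."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    values = [matrix[(x + dx, y + dy)]
              for dx, dy in offsets
              if (x + dx, y + dy) in matrix]
    return sum(values) / len(values)


# Hypothetical 4 x 4 array: cell (x, y) holds the value x + 4*y.
matrix = {(x, y): float(x + 4 * y) for x in range(4) for y in range(4)}
result = {(x, y): neighbourhood_avg(matrix, x, y) for (x, y) in matrix}
print(result[(1, 1)], result[(0, 0)])
```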
SciQL, A Query Language for Science Applications


•  Seamless integration of array, set, and sequence
   semantics.
•  Dimension constraints as a declarative means for
   indexed access to array cells.
•  Structural grouping to generalize the value-based
   grouping towards selective access to groups of cells
   based on positional relationships for aggregation.




Agenda
Array support in database systems

SciQL array query language

Use-case exercise

A crash course on column-stores

SciQL implementation approach

Seismology use case
Rietbrock: Chile earthquake
  … 2 TB of wave fronts
  … filter by sta/lta
  … remove false positives
  … window-based 3 min cuts
  … heuristic tests
  … interactive response required …


How can a database system help?
  Scanning 2 TB on a modern PC takes >3 hours
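
As a quick sanity check on that figure, the scan time can be turned into a sustained read rate. The calculation assumes decimal terabytes and takes exactly 3 hours as the bound.

```python
size_bytes = 2 * 10**12      # 2 TB, decimal terabytes assumed
scan_seconds = 3 * 3600      # the slide's lower bound of 3 hours

# Sustained throughput implied by scanning everything in 3 hours.
throughput_mb_s = size_bytes / scan_seconds / 10**6
print(round(throughput_mb_s))
```

A scan that takes more than 3 hours therefore runs at less than roughly 185 MB/s, far too slow for the interactive response the use case asks for.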

Use case, a SciQL dream
Rietbrock: Chile earthquake
create array mseed (
 tick     timestamp dimension['2010':*],
 data decimal(8,6),
 station string );




Use case, a SciQL dream
Rietbrock: … filter by sta/lta


-- average over a window of 5 seconds
select A.tick, avg(A.data)
from mseed A
group by A[tick:1:tick + 5 seconds]




Use case, a SciQL dream
Rietbrock: … filter by sta/lta
select A.tick
from mseed A, mseed B
where A.tick = B.tick
and avg(A.data) / avg(B.data) > delta
group by A[tick:tick + 5 seconds],
  B[tick:tick + 15 seconds]
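
The STA/LTA (short-term average over long-term average) filter expressed by this query can be sketched in Python. The sketch assumes one sample per second, so the 5-second and 15-second windows become 5 and 15 samples, and it uses forward-looking windows that start at the same tick, as the query does. The function name, the sample data and the threshold are hypothetical.

```python
def sta_lta_candidates(data, short=5, long=15, delta=2.0):
    """Ticks whose short-window average exceeds delta times the long-window average."""
    hits = []
    for tick in range(len(data) - long + 1):
        sta = sum(data[tick:tick + short]) / short   # avg over A[tick:tick + 5 s]
        lta = sum(data[tick:tick + long]) / long     # avg over B[tick:tick + 15 s]
        if lta > 0 and sta / lta > delta:
            hits.append(tick)
    return hits


# Hypothetical signal: quiet background with a 5-second burst starting at tick 20.
data = [1.0] * 20 + [10.0] * 5 + [1.0] * 20
hits = sta_lta_candidates(data)
print(hits)
```

Practical detectors usually work on absolute or squared amplitudes; the sketch follows the slide and averages the raw values.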



Use case, a SciQL dream
Rietbrock: … filter by sta/lta
create view candidates(
  station string,
  tick timestamp,
  ratio float ) as
select A.station, A.tick, avg(A.data) / avg(B.data) as ratio
  from mseed A, mseed B
  where A.tick = B.tick
  and avg(A.data) / avg(B.data) > delta
  group by A[tick:tick + 5 seconds],
   B[tick:tick + 15 seconds]
Use case, a SciQL dream
Rietbrock: … remove false positives
-- remove isolated errors by checking the direct environment
-- using wave propagation statistics

create table neighbors(
  head string,
  tail string,
  delay timestamp,
  weight float)

Use case, a SciQL dream
Rietbrock: … remove false positives
select A.tick, B.tick
  from candidates A, candidates B, neighbors N
 where A.station = N.head
 and B.station = N.tail
 and B.tick = A.tick + N.delay
 and B.ratio * N.weight < A.ratio;




Use case, a SciQL dream
Rietbrock: … remove false positives
delete from candidates
 where tick in (
  select A.tick
  from candidates A, candidates B, neighbors N
  where A.station = N.head
  and B.station = N.tail
  and B.tick = A.tick + N.delay
  and B.ratio * N.weight < A.ratio );
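
The effect of this false-positive filter can be traced on a tiny example in Python. All data below is invented, `delay` is simplified to a number of ticks, and candidates are identified by station plus tick, which is a choice made for this sketch rather than something the slide specifies.

```python
# Hypothetical candidate detections and one neighbour relation.
candidates = [
    {"station": "S1", "tick": 100, "ratio": 5.0},
    {"station": "S2", "tick": 103, "ratio": 4.0},
    {"station": "S1", "tick": 200, "ratio": 6.0},
    {"station": "S2", "tick": 203, "ratio": 1.0},
]
neighbors = [{"head": "S1", "tail": "S2", "delay": 3, "weight": 1.5}]


def unconfirmed(candidates, neighbors):
    """Candidates A selected by the join and ratio predicate of the query."""
    return {(a["station"], a["tick"])
            for a in candidates
            for b in candidates
            for n in neighbors
            if a["station"] == n["head"]
            and b["station"] == n["tail"]
            and b["tick"] == a["tick"] + n["delay"]
            and b["ratio"] * n["weight"] < a["ratio"]}


doomed = unconfirmed(candidates, neighbors)
kept = [c for c in candidates if (c["station"], c["tick"]) not in doomed]
print(doomed)
print(len(kept))
```

The detection at S1, tick 200 is dropped because the neighbouring station sees only a weak signal at the expected delay, while the detection at tick 100 is confirmed and kept.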



Use case, a SciQL dream
Rietbrock: … window-based 3 min cuts
  … heuristic tests


select B.station, myfunction(B.data)
  from candidates A, mseed B
 where A.tick = B.tick
 group by distinct B[tick:tick + 3 minutes];


-- using a User Defined Function written in C.

Agenda

Array support in database systems


SciQL array query language


A crash course on column-stores


SciQL implementation approach


Storing Relations in MonetDB




[Figure: five columns, each stored with a Void head whose OIDs start at 1000]

Virtual OID: seqbase=1000 (increment=1)
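
Because the OIDs form a dense sequence, they do not have to be stored: a lookup by OID becomes an array offset. A Python sketch of that idea, with a hypothetical class `VoidColumn` that is not MonetDB code:

```python
class VoidColumn:
    """Hypothetical model of a column whose OIDs are virtual (not stored)."""

    def __init__(self, seqbase, values):
        self.seqbase = seqbase       # OID of the first value
        self.values = list(values)   # only the tail values are materialized

    def oids(self):
        # The head is implied: seqbase, seqbase + 1, ...
        return range(self.seqbase, self.seqbase + len(self.values))

    def fetch(self, oid):
        # Positional lookup: the OID is turned into an array offset.
        position = oid - self.seqbase
        if not 0 <= position < len(self.values):
            raise KeyError(oid)
        return self.values[position]


name = VoidColumn(1000, ["Ann", "Bob", "Cid"])
age = VoidColumn(1000, [31, 45, 27])
print(list(name.oids()), name.fetch(1001), age.fetch(1001))
```

Columns of the same table share the seqbase, so the values at equal positions belong to the same tuple.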
BAT Data Structure




BAT: binary association table
BUN: binary unit

Head | Tail

Head & Tail BUN heap:
  - consecutive memory block (array)
  - memory-mapped file

Tail heap:
  - best-effort duplicate elimination for strings (~ dictionary encoding)

Hash tables, T-trees, R-trees, ...
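
The dictionary-encoding idea behind the string heap can be sketched as follows. This toy version eliminates every duplicate and stores list positions, whereas the slide describes a best-effort scheme; the function `encode_strings` is hypothetical.

```python
def encode_strings(values):
    """Store each distinct string once; the column keeps only references."""
    heap = []        # distinct strings, in order of first appearance
    position = {}    # string -> position in the heap
    tail = []        # per-row reference into the heap
    for s in values:
        if s not in position:
            position[s] = len(heap)
            heap.append(s)
        tail.append(position[s])
    return heap, tail


heap, tail = encode_strings(["NL", "DE", "NL", "NL", "FR"])
print(heap, tail)
print([heap[i] for i in tail])   # decoding restores the original column
```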
Processing Model (MonetDB Kernel)

•  Bulk processing:
  •  full materialization of all intermediate results

•  Binary (i.e., 2-column) algebra core:
  •  select, join, semijoin, outerjoin
  •  union, intersection, diff (BAT-wise & column-wise)
  •  group, count, max, min, sum, avg
  •  reverse, mirror, mark

•  Runtime operational optimization:
  •  Choosing the optimal algorithm & implementation according to
      input properties and system status


The Software Stack

Front-ends:    SQL 03 (strategic optimization), produces MAL
Optimizers:    tactical optimization, MAL -> MAL rewrites
Back-end(s):   MonetDB 5, consumes MAL
Kernel:        MonetDB kernel (runtime operational optimization)
MonetDB Front-end: SQL
    EXPLAIN SELECT a, z FROM t, s WHERE t.c = s.x;
                   function user.s2_1():void;
                   barrier _73 := language.dataflow();
                     _2:bat[:oid,:int] := sql.bind("sys","t","c",0);
                     _7:bat[:oid,:int] := sql.bind("sys","s","x",0);
                     _10 := bat.reverse(_7);
                     _11 := algebra.join(_2,_10);
                     _13 := algebra.markT(_11,0@0);
                     _14 := bat.reverse(_13);
                     _15:bat[:oid,:int] := sql.bind("sys","t","a",0);
                     _17 := algebra.leftjoin(_14,_15);
                     _18 := bat.reverse(_11);
                     _19 := algebra.markT(_18,0@0);
                     _20 := bat.reverse(_19);
                     _21:bat[:oid,:int] := sql.bind("sys","s","z",0);
                     _23 := algebra.leftjoin(_20,_21);
                   exit _73;
                     _24 := sql.resultSet(2,1,_17);
                     sql.rsColumn(_24,"sys.t","a","int",32,0,_17);
                     sql.rsColumn(_24,"sys.s","z","int",32,0,_23);
                     _33 := io.stdout();
                     sql.exportResult(_33,_24);
                   end s2_1;
Agenda

Array support in database systems


SciQL array query language


A crash course on column-stores


SciQL implementation approach


SciQL implementation
•  Use the complete MonetDB software stack
  •  Extend the SQL catalog to support SciQL
  •  Extend the Kernel to support array processing
  •  Extend the optimizer stack for performance


•  Aim for a functional implementation first
  •  Use tabular representation of arrays
  •  Reuse the SQL code generator




Slicing a portion of an array is a ‘selection’
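
With a tabular representation of arrays, a slice is a filter on the dimension columns. A Python sketch over invented data, assuming right-open slice bounds; the function `slice_array` is hypothetical.

```python
# Tabular representation: one (x, y, value) row per array cell.
cells = [(x, y, float(x + 4 * y)) for x in range(4) for y in range(4)]


def slice_array(cells, x_range, y_range):
    """matrix[x_lo:x_hi][y_lo:y_hi] as a selection on the dimension columns."""
    return [(x, y, v) for (x, y, v) in cells if x in x_range and y in y_range]


# Slice x in [1, 3) and y in [0, 2).
part = slice_array(cells, range(1, 3), range(0, 2))
print(part)
```

This is why the existing relational selection operators and the SQL code generator can be reused for a first functional implementation.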




It works




Conclusions
•  The language definition is ‘finished’
•  Functional prototype is ‘around the corner’
•  Exposure to real life cases and external libraries
•  MonetDB’s core technology was essential
•  Challenge:
                               ARRAYS




                      FILES
Science DBMS landscape
                    MonetDB 5.23                  SciDB 0.5              Rasdaman
Architecture        Server approach               Server approach        Plugin (Oracle, DB2, Informix, MySQL, PostgreSQL)
Open source         Mozilla License               GPL 3.0, commercial    GPL 3.0, dual license
Downloads           >12,000/month                 Tens up to now         ??
SQL                 SQL 2003                      ??                     SQL92++
Interoperability    {J,O}DBC, C(++), Python, …    C++ UDF                C++, Java, OGC
Array language      SciQL                         AQL                    RASQL
Array model         Fixed+variable bounds         Fixed arrays           Fixed+variable bounds
Science             Linked libraries              Linked libraries       Linked libraries
Foreign files       Vaults of CSV, FITS,          ??                     TIFF, PNG, JPG, …, CSV, NETCDF, HDF4
                    NETCDF, MSEED
Distribution        50-200 node cluster           4 node cluster         20-node
Distribution tech   Dynamic partial replication   Static fragmentation   Static fragmentation
Executor            Various schemes               Map-reduce             Tile streaming
Largest demo        Skyserver SDSS 6 3TB          ---                    12TB, IGN –F (on PostgreSQL)
Storage tuning      Query adaptive                Schema definitions     Workload driven
Optimization        Heuristics + cost based       ??                     Heuristics + cost based

Pay-as-you-go Reconciliation in Schema Matching NetworksPay-as-you-go Reconciliation in Schema Matching Networks
Pay-as-you-go Reconciliation in Schema Matching Networks
 
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstreamDemo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
Urbanopoly: Collection and Quality Assessment of Geo-spatial Linked Data via ...
 
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatch
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatchLinking Smart Cities Datasets with Human Computation: the case of UrbanMatch
Linking Smart Cities Datasets with Human Computation: the case of UrbanMatch
 
CLODA: A Crowdsourced Linked Open Data Architecture
CLODA: A Crowdsourced Linked Open Data ArchitectureCLODA: A Crowdsourced Linked Open Data Architecture
CLODA: A Crowdsourced Linked Open Data Architecture
 
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduce
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduceScalable Nonmonotonic Reasoning over RDF Data Using MapReduce
Scalable Nonmonotonic Reasoning over RDF Data Using MapReduce
 
Data and Knowledge Evolution
Data and Knowledge Evolution  Data and Knowledge Evolution
Data and Knowledge Evolution
 
Evolution of Workflow Provenance Information in the Presence of Custom Infere...
Evolution of Workflow Provenance Information in the Presence of Custom Infere...Evolution of Workflow Provenance Information in the Presence of Custom Infere...
Evolution of Workflow Provenance Information in the Presence of Custom Infere...
 
Access Control for RDF graphs using Abstract Models
Access Control for RDF graphs using Abstract ModelsAccess Control for RDF graphs using Abstract Models
Access Control for RDF graphs using Abstract Models
 
Abstract Access Control Model for Dynamic RDF Datasets
Abstract Access Control Model for Dynamic RDF DatasetsAbstract Access Control Model for Dynamic RDF Datasets
Abstract Access Control Model for Dynamic RDF Datasets
 
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of FactsTowards Parallel Nonmonotonic Reasoning with Billions of Facts
Towards Parallel Nonmonotonic Reasoning with Billions of Facts
 
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
Automation in Cytomics: A Modern RDBMS Based Platform for Image Analysis and ...
 
Heuristic based Query Optimisation for SPARQL
Heuristic based Query Optimisation for SPARQLHeuristic based Query Optimisation for SPARQL
Heuristic based Query Optimisation for SPARQL
 
Adaptive Semantic Data Management Techniques for Federations of Endpoints
Adaptive Semantic Data Management Techniques for Federations of EndpointsAdaptive Semantic Data Management Techniques for Federations of Endpoints
Adaptive Semantic Data Management Techniques for Federations of Endpoints
 

Último

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Último (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Finding patterns in scientific data arrays

  • 1. Finding something different: Arrays in database systems, the next frontier ? Martin Kersten CWI © M Kersten 2012
  • 3. Public database of 4-40 TB Relational schema of around 200 pages SQL Relational tables up to 20B elements Finding closely related sky objects 446 columns, >585 million rows; 6 columns, >20 billion rows
  • 4. The LOFAR radio telescope Complex image processing pipeline (Blue Gene) Transient Sky Objects database (50 TB/yr) Finding transients within a 4-second timeframe
  • 5. Data warehouse of seismic data Highly compressed file repository (>3.5M files and 15-150 TB) About to explode due to sensor network Finding warning signals
  • 6. Remote sensing Processing pipeline to interpret images < 1 TB/yr Finding and detecting forest fires
  • 7. [Diagram labels: RDBMS (SQL, *-API); Matlab, Python, C, R; SciQL; interdependent software libraries; FITS, mSEED, geoTIFF, …, HDF5, NetCDF; Datavault]
  • 8. Agenda Array support in database systems SciQL array query language A crash course on column-stores SciQL implementation approach
  • 9. What is an array? An array is a systematic arrangement of objects addressed by dimension values. Get(A, X, Y,…) => Value Set(A, X, Y,…) <= Value There are many species: vector, bit array, dynamic array, parallel array, sparse array, variable length array, jagged array
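The Get/Set view of an array on this slide can be illustrated with a minimal Python sketch. The class name `SparseArray`, its methods, and the dictionary-backed storage are assumptions made for illustration only; they are not part of SciQL or MonetDB.

```python
class SparseArray:
    """Array addressed by dimension values; unset cells read as a default."""

    def __init__(self, default=None):
        self.default = default
        self.cells = {}  # maps an index tuple such as (x, y) to a value

    def get(self, *index):
        # Get(A, X, Y, ...) => Value
        return self.cells.get(index, self.default)

    def set(self, *index, value):
        # Set(A, X, Y, ...) <= Value
        self.cells[index] = value


a = SparseArray(default=0)
a.set(1, 2, value=42)
print(a.get(1, 2), a.get(0, 0))
```

Only the cells that were set occupy storage here, which is one way to picture the "sparse array" species named on the slide.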
  • 10. Who needs them anyway? Seismology – partial time-series Climate simulation – temporal ordered grid Astronomy – temporal ordered images Remote sensing – image processing Social networks – graph algorithms Genomics – ordered strings Forensics – images, strings, graphs Scientists ‘love them’: MSEED, NETCDF, FITS, CSV, ...
  • 11. Arrays in DBMS Relational prototype built on arrays, Peterlee IS Vehicle (1975) Persistent programming languages, Astral (1980), Plain (1980) Object-orientation and persistent languages were the make-believe to handle them, O2 (1992) Several array algebras: AML (2002), AQuery (2003), RAM (2004), SRAM (2012)
  • 12. Array declarations: CREATE TABLE sal_emp ( name text, pay_by_quarter integer[], schedule text[][]); CREATE TABLE tictactoe ( squares integer[3][3] ); Array operations: denotation ([]), contains (@>), is contained in (<@), append, concat (||), dimension, lower, upper, prepend, to-string, from-string, … Array constraints: none, no enforcement of dimensions.
  • 13. SQL 2003 Arrays are attribute type constructors Arrays can be declared without a maximum cardinality Array nesting is unrestricted. Query results can be converted into arrays. CREATE TABLE listbox( choices CHAR(3) ARRAY[1000] NOT NULL); INSERT INTO listbox_choices VALUES( 'Department Names', ARRAY(SELECT name FROM sales.depts ORDER BY 1));
  • 14. Breaks large C++ arrays (rasters) into disjoint chunks Maps chunks into large binary objects (blob) Provide function interface to access them RASCAL, a SQL92 extension Known to work up to 12 TBs.
  • 15. Breaks large C++ arrays (rasters) into overlapping chunks Built storage manager from scratch Map-reduce processing model Provide function interface to access them AQL, a crippled SQL92
  • 16. What is the problem? Query language: appropriate array denotations? Functionally complete operation set? Systems: mature implementations? Size limitations due to (blob) representations? Scale out? Education: community awareness?
  • 17. Agenda Array support in database systems SciQL array query language A crash course on column-stores SciQL implementation approach
  • 18. MonetDB SciQL SciQL (pronounced ‘cycle’) •  A backward compatible extension of SQL’03 •  Symbiosis of relational and array paradigm •  Flexible structure-based grouping •  Capitalizes on the MonetDB physical array storage •  Recycling, an adaptive ‘materialized view’ •  Zero-cost attachment contract for cooperative clients http://www.scilens.org/Resources/SciQL
  • 19. Table vs Arrays CREATE TABLE tmp A collection of tuples Indexed by a (primary) key Default handling Explicitly created using INS/UPD/DEL
  • 20. Table vs arrays CREATE TABLE tmp: A collection of tuples; Indexed by a (primary) key; Default handling; Explicitly created using INS/UPD/DEL. CREATE ARRAY tmp: A collection of a priori defined tuples; Indexed by dimension expressions; Implicitly defined by default value; To be updated with INS/DEL/UPD.
  • 21. SciQL examples CREATE TABLE matrix ( x integer, y integer, value float, PRIMARY KEY (x,y) ); INSERT INTO matrix VALUES (0,0,0),(0,1,0),(1,1,0),(1,0,0); [Figure: the resulting table rows (0,0,0), (0,1,0), (1,1,0), (1,0,0)]
  • 22. SciQL examples Table: CREATE TABLE matrix ( x integer, y integer, value float, PRIMARY KEY (x,y) ); INSERT INTO matrix VALUES (0,0,0),(0,1,0),(1,1,0),(1,0,0); Array: CREATE ARRAY matrix ( x integer DIMENSION[2], y integer DIMENSION[2], value float DEFAULT 0 ); [Figure: the table rows next to the array grid, with cells outside the array bounds shown as null]
  • 23. SciQL examples Table: CREATE TABLE matrix ( x integer, y integer, value float, PRIMARY KEY (x,y) ); DELETE FROM matrix WHERE y=1; Array: CREATE ARRAY matrix ( x integer DIMENSION[2], y integer DIMENSION[2], value float DEFAULT 0 ); DELETE FROM matrix WHERE y=1; A hole in the array [Figure: the remaining table rows next to the array grid with null cells]
  • 24. SciQL examples Table: CREATE TABLE matrix ( x integer, y integer, value float, PRIMARY KEY (x,y) ); INSERT INTO matrix VALUES (0,1,1), (1,1,2); Array: CREATE ARRAY matrix ( x integer DIMENSION[2], y integer DIMENSION[2], value float DEFAULT 0 ); INSERT INTO matrix VALUES (0,1,1), (1,1,2); [Figure: the table rows next to the array grid after the insert]
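The array side of slides 22 to 24 can be mimicked with a short Python sketch: every cell exists up front with the default value, DELETE leaves a hole (a null cell) instead of removing the cell, and INSERT overwrites a cell. The class `BoundedArray2D` and its method names are hypothetical, and the sketch assumes that `DIMENSION[2]` means the index values 0 and 1.

```python
class BoundedArray2D:
    """Fixed-size 2-D array whose cells all exist from creation time."""

    def __init__(self, nx, ny, default):
        self.nx, self.ny = nx, ny
        # CREATE ARRAY: every cell is implicitly defined by the default value.
        self.cells = {(x, y): default for x in range(nx) for y in range(ny)}

    def delete_where(self, predicate):
        # DELETE does not shrink the array; matching cells become holes (None).
        for (x, y) in self.cells:
            if predicate(x, y):
                self.cells[(x, y)] = None

    def insert(self, rows):
        # INSERT overwrites the addressed cells; no new cells are created.
        for x, y, value in rows:
            if not (0 <= x < self.nx and 0 <= y < self.ny):
                raise IndexError(f"cell ({x}, {y}) is outside the array bounds")
            self.cells[(x, y)] = value


m = BoundedArray2D(2, 2, default=0)
m.delete_where(lambda x, y: y == 1)   # DELETE FROM matrix WHERE y=1
holes = [cell for cell, value in m.cells.items() if value is None]
m.insert([(0, 1, 1), (1, 1, 2)])      # INSERT INTO matrix VALUES (0,1,1), (1,1,2)
print(holes, m.cells)
```

The number of cells stays at four throughout, which is the contrast with the table, where rows appear and disappear.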
  • 25. SciQL unbounded arrays Table: CREATE TABLE matrix ( x integer, y integer, value float, PRIMARY KEY (x,y) ); INSERT INTO matrix VALUES (0,2,1), (0,1,2); Array: CREATE ARRAY matrix ( x integer DIMENSION, y integer DIMENSION, value float DEFAULT 0 ); INSERT INTO matrix VALUES (0,2,1), (0,1,2); [Figure: the table rows next to the array grid after the insert]
  • 26. SciQL Dimensions Unbounded Dimensions scalar-type DIMENSION Bounded Dimensions scalar-type DIMENSION[stop] scalar-type DIMENSION[first: step: stop] scalar-type DIMENSION[*: *: *] timestamp DIMENSION [ ‘2010-01-19’ : ‘1’ minute : *]
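A bounded dimension `[first: step: stop]` can be pictured as the list of index values it generates. The Python sketch below is an illustration only: the helper name `dimension_values` is hypothetical, and it assumes a right-open range (the `stop` value itself is excluded), which the slide does not state explicitly.

```python
from datetime import datetime, timedelta


def dimension_values(first, step, stop):
    """Enumerate the index values of a bounded dimension [first: step: stop].

    Assumption: the range is right-open, so `stop` itself is not included.
    """
    if not step > type(step)():  # zero of the step type: 0 or timedelta(0)
        raise ValueError("step must be positive")
    values = []
    current = first
    while current < stop:
        values.append(current)
        current = current + step
    return values


# Integer dimension, comparable to DIMENSION[0:1:2].
ints = dimension_values(0, 1, 2)

# Timestamp dimension with a one-minute step, bounded here for illustration.
ticks = dimension_values(datetime(2010, 1, 19, 0, 0),
                         timedelta(minutes=1),
                         datetime(2010, 1, 19, 0, 3))
print(ints, ticks)
```

An unbounded dimension (`*` as stop) has no such finite list, so a real system would have to derive the bounds from the cells actually present.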
  • 27. SciQL table queries -- Dimension names make query formulation easier CREATE ARRAY matrix ( x integer DIMENSION, y integer DIMENSION, value float DEFAULT 0 ); -- simple checker boarding aggregation SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0
  • 28. SciQL array queries CREATE ARRAY matrix ( x integer DIMENSION, y integer DIMENSION, value float DEFAULT 0 ); CREATE ARRAY result ( x integer DIMENSION, value float DEFAULT 0 ); -- group based aggregation to construct an unbounded vector SELECT [x], sum(value) FROM matrix WHERE (x + y) % 2 = 0 GROUP BY x;
  • 29. SciQL array views CREATE ARRAY vmatrix ( x integer DIMENSION[-1:5], y integer DIMENSION[-1:5], value float DEFAULT -1 ) AS SELECT x, y, value FROM matrix; [Figure: the matrix values surrounded by cells holding the default value -1]
  • 30. SciQL tiling examples [Figure: 4×4 grid of cells V0,0 to V3,3 with the tile's anchor point marked] SELECT x, y, avg(value) FROM matrix GROUP BY matrix[x : 1 : x+2][y : 1 : y+2];
  • 31. SciQL tiling examples [Figure: 4×4 grid of cells V0,0 to V3,3 with the tile's anchor point marked] SELECT x, y, avg(value) FROM matrix GROUP BY DISTINCT matrix[x:1:x+2][y:1:y+2];
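The two tiling queries on slides 30 and 31 can be approximated in plain Python. This is a sketch under assumptions, not the SciQL implementation: the tile `[x:1:x+2][y:1:y+2]` is taken to be right-open (a 2×2 tile), cells falling outside the array are skipped, and `GROUP BY DISTINCT` is read as producing non-overlapping tiles. The function names are hypothetical.

```python
def tile_avg(matrix, x, y, width=2, height=2):
    """Average the cells of the tile anchored at (x, y).

    Cells of the tile that fall outside the array are skipped.
    """
    values = [matrix[i][j]
              for i in range(x, x + width)
              for j in range(y, y + height)
              if 0 <= i < len(matrix) and 0 <= j < len(matrix[0])]
    return sum(values) / len(values)


def overlapping_tiles(matrix):
    # GROUP BY matrix[x:1:x+2][y:1:y+2]: one tile per anchor cell.
    return {(x, y): tile_avg(matrix, x, y)
            for x in range(len(matrix))
            for y in range(len(matrix[0]))}


def distinct_tiles(matrix, width=2, height=2):
    # GROUP BY DISTINCT ...: anchors advance by the tile size, so tiles
    # do not share cells (assumed reading of DISTINCT).
    return {(x, y): tile_avg(matrix, x, y, width, height)
            for x in range(0, len(matrix), width)
            for y in range(0, len(matrix[0]), height)}


matrix = [[x * 4 + y for y in range(4)] for x in range(4)]
overlap = overlapping_tiles(matrix)
distinct = distinct_tiles(matrix)
print(len(overlap), len(distinct))
```

On a 4×4 array the overlapping form yields one result per cell, while the distinct form yields one result per 2×2 block.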
  • 32. SciQL tiling examples [Figure: 4×4 grid of cells V0,0 to V3,3 with the tile's anchor point marked; tile cells outside the array are null] SELECT x, y, avg(value) FROM matrix GROUP BY DISTINCT matrix[x-1:1:x+1][y:1:y+2];
  • 33. SciQL tiling examples [Figure: 4×4 grid of cells V0,0 to V3,3 with the tile's anchor point marked] SELECT x, y, avg(value) FROM matrix GROUP BY matrix[x][y], matrix[x-1][y], matrix[x+1][y], matrix[x][y-1], matrix[x][y+1];
  • 34. SciQL, A Query Language for Science Applications •  Seamless integration of array, set, and sequence semantics. •  Dimension constraints as a declarative means for indexed access to array cells. •  Structural grouping to generalize the value-based grouping towards selective access to groups of cells based on positional relationships for aggregation.
  • 35. Agenda Array support in database systems SciQL array query language Use-case exercise A crash course on column-stores SciQL implementation approach
  • 36. Seismology use case Rietbrock: Chile earthquake … 2TB of wave fronts … filter by sta/lta … remove false positives … window-based 3 min cuts … heuristic tests … interactive response required … How can a database system help? Scanning 2TB on a modern PC takes >3 hours
  • 37. Use case, a SciQL dream Rietbrock: Chile earthquake create array mseed ( tick timestamp dimension[ ‘2010’:*], data decimal(8,6), station string );
  • 38. Use case, a SciQL dream Rietbrock: … filter by sta/lta --- average by window of 5 seconds select A.tick, avg(A.data) from mseed A group by A[tick:1:tick + 5 seconds]
  • 39. Use case, a SciQL dream Rietbrock: … filter by sta/lta select A.tick from mseed A, mseed B where A.tick = B.tick and avg(A.data) / avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]
  • 40. Use case, a SciQL dream Rietbrock: … filter by sta/lta create view candidates( station string, tick timestamp, ratio float ) as select A.station, A.tick, avg(A.data) / avg(B.data) as ratio from mseed A, mseed B where A.tick = B.tick and avg(A.data) / avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]
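The sta/lta filter of slides 38 to 40 compares a short-window average with a long-window average that starts at the same tick. A minimal Python sketch of that ratio test follows; it assumes one sample per second for a single station, forward-looking windows of 5 and 15 samples as in the queries, and a threshold `delta` of 2.0 chosen only for the example. The function names are hypothetical.

```python
def window_avg(samples, start, length):
    """Average of samples[start : start + length]."""
    window = samples[start:start + length]
    return sum(window) / len(window)


def sta_lta_candidates(samples, short=5, long=15, delta=2.0):
    """Return (tick, ratio) pairs where short-term / long-term average > delta.

    Assumes one sample per tick; ticks without a full long window are skipped.
    """
    hits = []
    for tick in range(len(samples) - long + 1):
        sta = window_avg(samples, tick, short)
        lta = window_avg(samples, tick, long)
        if lta > 0 and sta / lta > delta:
            hits.append((tick, sta / lta))
    return hits


# Synthetic signal: quiet background with a 5-sample burst starting at tick 20.
samples = [1.0] * 20 + [10.0] * 5 + [1.0] * 20
hits = sta_lta_candidates(samples)
print([tick for tick, _ in hits])
```

Each tick needs both windows, so a naive evaluation touches every sample many times; this is the kind of repeated window scan that structural grouping in the query is meant to express declaratively.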
  • 41. Use case, a SciQL dream Rietbrock: … remove false positives -- remove isolated errors by direct environment -- using wave propagation statistics create table neighbors( head string, tail string, delay timestamp, weight float)
  • 42. Use case, a SciQL dream Rietbrock: … remove false positives select A.tick, B.tick from candidates A, candidates B, neighbors N where A.station = N.head and B.station = N.tail and B.tick = A.tick + N.delay and B.ratio * N.weight < A.ratio;
  • 43. Use case, a SciQL dream Rietbrock: … remove false positives delete from candidates select A.tick from candidates A, candidates B, neighbors N where A.station = N.head and B.station = N.tail and B.tick = A.tick + N.delay and B.ratio * N.weight < A.ratio;
  • 44. Use case, a SciQL dream Rietbrock: … window-based 3 min cuts … heuristic tests select B.station, myfunction(B.data) from candidates A, mseed B where A.tick = B.tick group by distinct B[tick:tick + 3 minutes]; -- using a User Defined Function written in C.
  • 45. Agenda Array support in database systems SciQL array query language A crash course on column-stores SciQL implementation approach
  • 46. Storing Relations in MonetDB [Figure: a relation stored as five columns, each with a void head starting at 1000] Virtual OID: seqbase=1000 (increment=1)
  • 47. BAT Data Structure BAT: binary association table (Head, Tail) BUN: binary unit Head & Tail: consecutive memory blocks (arrays), memory-mapped files BUN heap: consecutive memory block (array), memory-mapped file Tail heap: best-effort duplicate elimination for strings (~ dictionary encoding) Hash tables, T-trees, R-trees, ...
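The idea of a void head with a virtual OID can be sketched in a few lines of Python: the head is never stored, because the OID of a value is its position plus `seqbase`. The class below is a hypothetical illustration and does not reflect MonetDB's actual C data structures or API.

```python
class BAT:
    """Toy binary association table with a virtual (void) head.

    Only the tail is stored; the OID of position i is seqbase + i.
    """

    def __init__(self, tail, seqbase=0):
        self.seqbase = seqbase
        self.tail = list(tail)

    def find(self, oid):
        # Positional lookup: no search is needed for a dense, ordered head.
        position = oid - self.seqbase
        if not 0 <= position < len(self.tail):
            raise KeyError(oid)
        return self.tail[position]

    def select(self, predicate):
        # Column-at-a-time selection: returns the OIDs of qualifying values.
        return [self.seqbase + i
                for i, value in enumerate(self.tail) if predicate(value)]


# Two columns of the same relation share the same seqbase.
names = BAT(["ann", "bob", "cid"], seqbase=1000)
ages = BAT([30, 41, 25], seqbase=1000)

oids = ages.select(lambda age: age > 28)
result = [(names.find(oid), ages.find(oid)) for oid in oids]
print(result)
```

Because all columns of a relation use the same OID sequence, a tuple is reassembled by looking up the same OID in each column.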
  • 48. Processing Model (MonetDB Kernel) Bulk processing: full materialization of all intermediate results. Binary (i.e., 2-column) algebra core: select, join, semijoin, outerjoin; union, intersection, diff (BAT-wise & column-wise); group, count, max, min, sum, avg; reverse, mirror, mark. Runtime operational optimization: choosing optimal algorithm & implementation according to input properties and system status.
  • 49. The Software Stack [Diagram] Front-ends (SQL 03): strategic optimization; MAL; Back-end(s) (MonetDB 5): optimizers, tactical optimization (MAL -> MAL rewrites); MAL; Kernel (MonetDB kernel): runtime operational optimization
  • 50. MonetDB Front-end: SQL EXPLAIN SELECT a, z FROM t, s WHERE t.c = s.x;
      function user.s2_1():void;
      barrier _73 := language.dataflow();
      _2:bat[:oid,:int] := sql.bind("sys","t","c",0);
      _7:bat[:oid,:int] := sql.bind("sys","s","x",0);
      _10 := bat.reverse(_7);
      _11 := algebra.join(_2,_10);
      _13 := algebra.markT(_11,0@0);
      _14 := bat.reverse(_13);
      _15:bat[:oid,:int] := sql.bind("sys","t","a",0);
      _17 := algebra.leftjoin(_14,_15);
      _18 := bat.reverse(_11);
      _19 := algebra.markT(_18,0@0);
      _20 := bat.reverse(_19);
      _21:bat[:oid,:int] := sql.bind("sys","s","z",0);
      _23 := algebra.leftjoin(_20,_21);
      exit _73;
      _24 := sql.resultSet(2,1,_17);
      sql.rsColumn(_24,"sys.t","a","int",32,0,_17);
      sql.rsColumn(_24,"sys.s","z","int",32,0,_23);
      _33 := io.stdout();
      sql.exportResult(_33,_24);
      end s2_1;
  • 51. Agenda Array support in database systems SciQL array query language A crash course on column-stores SciQL implementation approach
  • 52. SciQL implementation •  Use the complete MonetDB software stack •  Extend the SQL catalog to support SciQL •  Extend the Kernel to support array processing •  Extend the optimizer stack for performance •  Aim for a functional implementation first •  Use tabular representation of arrays •  Reuse the SQL code generator
  • 59. Slicing a portion of an array is a ‘selection’
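With a tabular representation of arrays, a slice reduces to a selection on the dimension columns. The Python sketch below illustrates that idea with three parallel lists standing in for the `x`, `y`, and `value` columns; the function names are hypothetical, and this is not how MonetDB stores or scans its columns.

```python
def to_columns(matrix):
    """Flatten a 2-D array into three parallel columns: x, y, value."""
    xs, ys, values = [], [], []
    for x, row in enumerate(matrix):
        for y, value in enumerate(row):
            xs.append(x)
            ys.append(y)
            values.append(value)
    return xs, ys, values


def slice_columns(xs, ys, values, x_range, y_range):
    """Select the rows whose dimension values fall inside both ranges."""
    keep = [i for i in range(len(values))
            if xs[i] in x_range and ys[i] in y_range]
    return [(xs[i], ys[i], values[i]) for i in keep]


matrix = [[0, 1, 2],
          [3, 4, 5],
          [6, 7, 8]]
xs, ys, values = to_columns(matrix)

# Slice x in [1, 3) and y in [0, 2), expressed as a selection on x and y.
sliced = slice_columns(xs, ys, values, range(1, 3), range(0, 2))
print(sliced)
```

The slice needs no array-specific operator here, which matches the plan on slide 52 to reuse the SQL code generator on top of a tabular representation.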
  • 61. It works
  • 62. Conclusions •  The language definition is ‘finished’ •  Functional prototype is ‘around the corner’ •  Exposure to real life cases and external libraries •  MonetDB’s core technology was essential •  Challenge: ARRAYS FILES
  • 66. Science DBMS landscape
      Feature | MonetDB 5.23 | SciDB 0.5 | Rasdaman
      Architecture | Server approach | Server approach | Plugin (Oracle, DB2, Informix, Mysql, Postgresql)
      Open source | Mozilla License | GPL 3.0, Commercial | GPL 3.0, Dual license
      Downloads | >12.000 /month | Tens up to now | ??
      SQL | SQL 2003 | ?? | SQL92++
      Interoperability | {JO}DBC, C(++), Python, … | C++ UDF | C++, Java, OGC
      Array language | SciQL | AQL | RASQL
      Array model | Fixed+variable bounds | Fixed arrays | Fixed+variable bounds
      Science | Linked libraries | Linked libraries | Linked libraries
      Foreign files | Vaults of csv, FITS, NETCDF, MSEED | ?? | Tiff, png, jpg, .., csv, NETCDF, HDF4
      Distribution | 50-200 node cluster | 4 node cluster | 20-node
      Distribution tech | Dynamic partial replication | Static fragmentation | Static fragmentation
      Executor | Various schemes | Map-reduce | Tile streaming
      Largest demo | Skyserver SDSS 6, 3TB | --- | 12TB, IGN-F (on Postgresql)
      Storage tuning | Query adaptive | Schema definitions | Workload driven
      Optimization | Heuristics + cost based | ?? | Heuristics + cost based