Andy Cobley
School of Computing
University of Dundee
Twitter: @andycobley
Who am I?
 Lecturer at University of Dundee
 Program director of Business Intelligence and of the new Data Science program (http://goo.gl/ljl6N and http://goo.gl/uwHSi)
 Geek and Hacker
So what is Big Data?
From evil Wikipedia
 “In information technology, big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.”
 Which doesn’t tell us much
 Any definition that relies on data “size” will become obsolete very quickly as data storage capabilities grow.
Let’s try something different
 The Three V’s
 Volume
    How big is the data? Terabytes? Petabytes?
 Variety
    Is it all the same sort of data? What about blobs? Does it change?
 Velocity
    How fast is it coming in? Can we store it fast enough and then use it?
http://nosql.mypopescu.com/post/5547192335/bigdata-the-three-vs-volume-variety-velocity
The Twitter problem
 Twitpocalypse
 Overflow of status ids stored as 32-bit signed integers
 But beyond that, can we physically store data fast enough?
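The overflow behind the Twitpocalypse can be illustrated with a small sketch. It assumes a client that keeps status ids in a 32-bit signed field; `to_int32` is a hypothetical helper written for this illustration and is not part of any Twitter API.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def to_int32(value: int) -> int:
    """Wrap an integer the way a 32-bit signed field would (two's complement)."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2**32 if value >= 2**31 else value


last_safe_id = INT32_MAX
next_id = last_safe_id + 1  # the first status id that no longer fits

print(to_int32(last_safe_id))  # 2147483647
print(to_int32(next_id))       # -2147483648
```

Any code that sorted or compared ids stored this way would suddenly see the newest status as the smallest one.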
 Suppose we are storing rows of 16 columns of 16 bytes each
 At 100 rows per second that’s
 0.7 terabytes per year
 And at 1 million rows per second that’s
 7 petabytes per year
 This is volume
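The figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 365-day year and binary units (1 terabyte taken as 2^40 bytes, 1 petabyte as 2^50 bytes); the function name `bytes_per_year` is only for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def bytes_per_year(columns: int, bytes_per_column: int, rows_per_second: int) -> int:
    """Raw bytes written in a year, ignoring indexes, replication and overhead."""
    return columns * bytes_per_column * rows_per_second * SECONDS_PER_YEAR


low = bytes_per_year(16, 16, 100)
high = bytes_per_year(16, 16, 1_000_000)

print(f"{low / 2**40:.2f} TiB per year")   # 0.73 TiB per year
print(f"{high / 2**50:.2f} PiB per year")  # 7.17 PiB per year
```

In decimal units the same totals come to roughly 0.8 TB and 8 PB, so the rounded numbers depend on which convention is used.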
Variability
 Data is sparse and can be different sizes
 Over time the type of data changes
 Consider click-through data: as pages evolve, new data types and fields need to be stored
What about
id     MassSpec   Meta data   Meta data
1
2
We need UDFs
 User-defined functions inside the database
 Or a different way of dealing with it, such as Hadoop or MRSQL.
So what is NoSQL?
 Throws away everything you know about databases
 Is a family of different databases
 Lots of different “products”
 BUT!
 http://nosql.mypopescu.com/post/1016320617/mongodb-is-web-scale (warning: might offend)
 They should only be used when it’s sensible; they are not magic sauce.
NoSQL types
 Key-Value
 Column-family
 Document databases
    Allow sharding across nodes
 Graph
    Fast for graph-like data and operations
Some NoSQL databases
 CouchDB
 MongoDB
 Cassandra
 Riak
 HBase
 Neo4j
 http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Sharding?
 Distribution of data across nodes
 Allows the load to be spread across multiple machines
 SQL databases can be sharded
 Not all NoSQL databases can be sharded
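As a rough illustration of distributing data across nodes, the sketch below assigns each row key to a shard by hashing it. The node names, the `shard_for` helper and the choice of MD5 are assumptions made for this example; real databases each use their own partitioning scheme.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


nodes = ["node-a", "node-b", "node-c"]  # hypothetical machines

for user in ["jsmith", "jbrown", "acobley"]:
    print(user, "->", nodes[shard_for(user, len(nodes))])
```

A plain modulo scheme like this moves most keys when the number of nodes changes, which is one reason distributed stores tend to prefer ring-based schemes.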
CAP theorem
 The CAP (or Brewer’s) theorem says:
 It’s impossible for a web service to provide all three of the following at the same time
    Consistency
    Availability
    Partition tolerance
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
But see: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed and http://codahale.com/you-cant-sacrifice-partition-tolerance/
http://blog.nahurst.com/visual-guide-to-nosql-systems
Partitions?
 Essentially, failing to achieve consistency within a set time is treated as a partition.
 You can sacrifice availability to ensure consistency
 Partitions are rare, and if you have only one server they almost never happen
 Partitions are caused by network failures and failed nodes
Eventual Consistency
 Eventually all nodes will tell the same story
 Isn’t this a mad idea?
 Facebook (actually not)
 The Internet is based on an eventually consistent database
 DNS
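The idea that all nodes eventually tell the same story can be sketched with a toy model. Everything here is an assumption for illustration: the `Replica` class, the explicit timestamps and the last-write-wins rule are a simplification and not how any particular database implements replication.

```python
class Replica:
    """A toy replica that stores key -> (timestamp, value)."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Last write wins: keep the entry with the newest timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Push every replica's entries to every other replica."""
    for source in replicas:
        for key, (timestamp, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, timestamp)


a, b, c = Replica("a"), Replica("b"), Replica("c")
a.write("status", "hello", timestamp=1)        # only replica a has seen this
b.write("status", "hello again", timestamp=2)  # only replica b has seen this

before = [r.read("status") for r in (a, b, c)]
anti_entropy([a, b, c])
after = [r.read("status") for r in (a, b, c)]

print(before)  # ['hello', 'hello again', None]
print(after)   # ['hello again', 'hello again', 'hello again']
```

Between the writes and the synchronisation step, a reader can get a different answer from each replica; that window is the price paid for staying available.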
Introducing Cassandra
 Distributed / Decentralized
 Column Orientated
 Key Value Store
 Fault Tolerant
Network topology of a Cassandra db
 Multiple nodes
 Cassandra can be Rack Aware
 Keys are replicated across nodes
 It’s essentially a DHT (Distributed Hash Table)
 Think BitTorrent
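A distributed hash table with replicated keys can be sketched as a hash ring. This is a minimal toy, assuming MD5 tokens, one token per node and replicas placed on the next nodes around the ring; the `Ring` class and node names are hypothetical, and Cassandra's real partitioners and rack-aware placement are more involved.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Position of a node name or row key on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, replication_factor=2):
        if not nodes:
            raise ValueError("a ring needs at least one node")
        self.replication_factor = min(replication_factor, len(nodes))
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key: str):
        """The first node clockwise from the key's token, plus its successors."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=2)
for key in ["jsmith", "jbrown"]:
    print(key, "->", ring.replicas_for(key))
```

Because any node can compute the same placement from the key alone, no central directory is needed to find the data.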
CQL
 Cassandra 0.8 introduced CQL (Cassandra Query Language)
 Almost looks like SQL!
 http://crlog.info/2011/09/17/cassandra-query-language-cql-v2-0-reference/ Language ref
 http://www.datastax.com/docs/0.8/dml/using_cql
Demo
 Start Cassandra
 Open cqlsh
 Create a keyspace
 Create a column family
 Now we can insert!
So why does this work?
 Jsmith
    Password: ch@ngem3a
 Jbrown
    Gender: Male
    Phone: 01382 345078
Column store, keys with name: value pairs underneath
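The "keys with name: value pairs underneath" layout can be modelled with nested dictionaries. This is a plain-Python sketch of the data model only, not a Cassandra client; `users` and `insert` are names chosen for this illustration.

```python
from collections import defaultdict

# Row key -> {column name: value}. Each row carries only the columns it has.
users = defaultdict(dict)


def insert(column_family, row_key, **columns):
    """Add or overwrite columns on one row without touching any other row."""
    column_family[row_key].update(columns)


insert(users, "jsmith", password="ch@ngem3a")
insert(users, "jbrown", gender="male")
insert(users, "jbrown", phone="01382 345078")  # a column no other row has

for row_key in sorted(users):
    print(row_key, users[row_key])
# jbrown {'gender': 'male', 'phone': '01382 345078'}
# jsmith {'password': 'ch@ngem3a'}
```

Rows are sparse: jsmith has no gender or phone column at all, rather than a set of empty cells.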
Interfacing to Cassandra
 Based on Thrift
    http://thrift.apache.org/
 Large number of languages supported
    http://wiki.apache.org/cassandra/ClientOptions
 I’ve used Java and Hector
    http://prettyprint.me/
 Although there is a C# version
    http://hectorsharp.com/
Cassandra JDBC
 Very new, difficult to know how stable it is
 Needs compiling, and needs libraries that are not in Cassandra!
http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
Astyanax
 From Netflix
 Based on Hector but said to be a lot simpler!
 https://github.com/Netflix/astyanax/wiki
jBloggyAppy, a demo app for Cassandra
 All source code on GitHub
 https://github.com/acobley/jBoggyAppy
 Feel free to use and abuse
 Simple blogging app
A word on using open source software
 Versioning!
 Things change!
 Documentation is wrong!
    http://prettyprint.me/
 You end up reading the unit tests to work out how to actually program.
One last thing
 Dundee DDD, 17th November, Big Data track
 Anyone interested in speaking?

Editor's notes

  1. http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
  2. http://news.softpedia.com/news/Twitpocalypse-039-s-Aftermath-114084.shtml
  3. http://www.dbshards.com/dbshards/
  4. http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
     Picture: http://www.datacenterknowledge.com/archives/2009/11/04/inside-a-cloud-computing-data-center/
  5. Larry Ellison must be mad that his “free” software MySQL is used on the biggest website in the world.
  6. create keyspace test with strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor=1;
     use test;
     create columnfamily users (KEY varchar Primary key, password varchar, gender varchar);
     INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');
     Select * from users;
     INSERT INTO users (KEY, gender) VALUES ('jbrown', 'male');
     INSERT INTO users (KEY, phone) VALUES ('jbrown', '01382 345078');
     What are we going to get?