This presentation covers how open source technologies are being used to meet the specific needs of large-scale problems on the Internet. No one solution meets all needs, but open source provides a variety of solutions for different use cases.
Used by Mozilla in their Test Pilot project, where they expected 1 million users to write 1.2 TB of data, mostly over a two-day period, with sustained write loads of 75 GB/hr.
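To put those figures in perspective, here is a small back-of-the-envelope calculation. It assumes decimal units (1 TB = 1000 GB) and uses only the numbers quoted above; the function names are made up for illustration. One possible reading is that the 75 GB/hr figure describes peak periods, since it is three times the two-day average.

```python
# Figures quoted in the notes (decimal units assumed: 1 TB = 1000 GB).
TOTAL_GB = 1200              # 1.2 TB expected in total
WINDOW_HOURS = 2 * 24        # "mostly over a two-day period"
SUSTAINED_GB_PER_HOUR = 75   # quoted sustained write load


def average_rate_gb_per_hour(total_gb, hours):
    """Average write rate if the data arrived evenly over the window."""
    return total_gb / hours


def hours_to_ingest(total_gb, gb_per_hour):
    """Hours needed to write everything at a constant rate."""
    return total_gb / gb_per_hour


def gb_per_hour_to_mb_per_second(gb_per_hour):
    """Convert GB/hr to MB/s (decimal units)."""
    return gb_per_hour * 1000 / 3600


if __name__ == "__main__":
    avg = average_rate_gb_per_hour(TOTAL_GB, WINDOW_HOURS)
    print(f"Average over the window: {avg:.1f} GB/hr")
    print(f"Hours to ingest at 75 GB/hr: "
          f"{hours_to_ingest(TOTAL_GB, SUSTAINED_GB_PER_HOUR):.1f}")
    print(f"75 GB/hr in MB/s: "
          f"{gb_per_hour_to_mb_per_second(SUSTAINED_GB_PER_HOUR):.2f}")
```

Running the same arithmetic against your own expected volumes is a quick way to check whether a published use case is really comparable to yours.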
Use where records are very sparse, for example a form with only a handful of “required” fields and many optional ones. Versioning is also a very powerful feature: HBase stores a timestamp as part of each cell's coordinates. Imagine a record of a person with a column for location; as the location changes over time, you can keep that history along with the times it was updated.
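The sparse, versioned record idea can be sketched with a small in-memory model. This is only an illustration of the data model, not the HBase client API; the class name, method names, and the "info:location" column name are all hypothetical.

```python
class VersionedSparseTable:
    """Toy model of sparse rows whose cells keep timestamped versions.

    Illustrative only: this mimics the idea of addressing a value by
    (row, column, timestamp); it is not how HBase is implemented.
    """

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), oldest first
        self._rows = {}

    def put(self, row, column, value, timestamp):
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])

    def get(self, row, column, as_of=None):
        """Return the newest value, or the newest at or before `as_of`."""
        versions = self._rows.get(row, {}).get(column, [])
        if as_of is not None:
            versions = [tv for tv in versions if tv[0] <= as_of]
        return versions[-1][1] if versions else None

    def history(self, row, column):
        """Return every stored (timestamp, value) pair, oldest first."""
        return list(self._rows.get(row, {}).get(column, []))


if __name__ == "__main__":
    table = VersionedSparseTable()
    table.put("person:1", "info:location", "Boston", timestamp=100)
    table.put("person:1", "info:location", "Denver", timestamp=200)
    print(table.get("person:1", "info:location"))             # latest value
    print(table.get("person:1", "info:location", as_of=150))  # earlier value
    print(table.get("person:1", "info:fax"))                  # never set
```

Columns that were never written take no space in the model, which is the property that makes this layout attractive for forms with many optional fields.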
The evolution of memcached. It is also a good way to store data for IPC, depending on how your application is set up, as an alternative to a more traditional ESB or message queue application. An example in production is GitHub, which uses it to store the routing information between their smoke and chimney processes, used for finding a specific user's repos on disk.
LAMP of the next generation -> MongoDB replacing MySQL? It is used in production today by Business Insider and by Boxed Ice for their Server Density monitoring product. Another group is actively working on support in Drupal 7.
Originally developed by Facebook and in use there for inbox search. Also in use at Twitter for geo, user base data mining, real-time analytics, and more (it is not what they use to store tweets). It supports a rich set of features for a NoSQL DB: ColumnFamilies and indexes mean you don’t have to implement as much of the data manipulation in your application as you would with a more basic key-value store.
Unlike the other data stores I mentioned, this one is based on SQL. It is relational, but it doesn’t support all of the features you may expect from an “enterprise database” because it isn’t trying to be one; it is being optimized for the web. No more 32-bit, no more 4-bit integer fields, no more bloat. You can try out Drizzle as a stand-alone DB and integrated with WordPress at http://www.standingcloud.com
How big is your big data? Are you selecting a platform with known use cases above and beyond what you’re planning? Do you have the specifics of their configuration?
While a system may scale horizontally, make sure you know how long it takes to add a node to the cluster and what the load impact on the cluster is during that addition.
Reliability is about both the platform's ability to protect the data once it is stored and its ability to stay online all the time. Which of these do you need for your use case? For a web application, 24/7 availability is probably more important than a 100% guarantee of data integrity for records such as blog comments.
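When weighing how much "online all the time" really matters, it can help to translate an availability target into permitted downtime. The sketch below assumes a 365-day year; the function name is made up for illustration, and the percentages in the usage section are common examples rather than figures from any of the platforms discussed.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def downtime_hours_per_year(availability_percent):
    """Hours per year a system may be offline at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% available -> about {hours:.2f} hours down per year")
```

Comparing those numbers with what your users would actually tolerate gives a concrete basis for the availability versus integrity trade-off.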
Does all of your data need to be “live” all the time? How “hot” does it need to be: in memory, on local disk, or in a tiered archive in remote cloud storage? What are the latency requirements for accessing data?
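One way to make these questions concrete is to write down each storage tier with its access latency and relative cost, then pick the cheapest tier that still meets the latency requirement. The sketch below is a hypothetical helper, and the tier names, latencies, and costs are placeholder values for illustration, not measurements of any real system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    max_latency_ms: float  # worst-case access latency you have measured
    cost_per_gb: float     # relative cost, in whatever unit you budget in


def cheapest_tier(tiers: Sequence[Tier],
                  required_latency_ms: float) -> Optional[Tier]:
    """Return the lowest-cost tier that meets the latency requirement."""
    candidates = [t for t in tiers if t.max_latency_ms <= required_latency_ms]
    if not candidates:
        return None  # nothing is fast enough; revisit the requirement
    return min(candidates, key=lambda t: t.cost_per_gb)


# Placeholder numbers only; substitute your own measurements.
EXAMPLE_TIERS = [
    Tier("memory", max_latency_ms=1.0, cost_per_gb=100.0),
    Tier("local disk", max_latency_ms=20.0, cost_per_gb=10.0),
    Tier("remote archive", max_latency_ms=5000.0, cost_per_gb=1.0),
]

if __name__ == "__main__":
    choice = cheapest_tier(EXAMPLE_TIERS, required_latency_ms=50)
    print(choice.name if choice else "no tier is fast enough")
```

Running this per data set (recent records, older records, audit logs) tends to show that only a fraction of the data needs the most expensive tier.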
It is great to choose a system that can store all of your data, but how do your users need to access it?
This covers both the hardware and the people required to manage it.
What requirements do you have around security for reading and writing data? What type of input validation does it require on data from untrusted sources?
This platform may be ideal for the problem at hand, but will you be able to use it to solve future problems? If you changed out components of the proposed system, would you still choose it?
How important is the ability to adapt or migrate to a new platform if your current one is not working out?
The amount of time you spend making the decision will be inversely proportional to the amount of time you spend reworking what you end up deploying.