Distributed hash tables (DHTs) are nowadays commonly used as the underlying storage infrastructure for many applications because of their decentralized design, scalability, and fault tolerance. Unlike traditional storage systems, they offer a remarkably simple interface to store and retrieve streams of bytes, leaving the entire responsibility for data semantics and manipulation to the application layer. This brief report proposes a series of improvements to the existing DHT design in order to provide type awareness and extended access semantics to such systems, transferring a significant part of the data management logic to the storage layer and relieving applications of the complexity derived from the nature of the data. The major goals of this approach are to improve the traditional programming style used when relying on DHTs and, thanks to the new level of flexibility introduced, to facilitate the sharing of a single storage system by multiple applications with diverse requirements.
Active disks allow code handlers (disklets) to be associated with streams of data, such as a file or a set of files, and these handlers are triggered in response to an access operation. The aim of this approach was to reduce central processor load and bus usage by running the disklets on the hard disk's processing unit. It has been used in the context of satellite image repositories for composing images from different sources, and in database applications for early processing of tuples. Active networks associate code with network data that is executed by the infrastructure itself, for example routers executing code carried by packets. Extended access semantics have also been explored in file systems, where they are used for compression schemes, thumbnail generation, user-dependent display, and virtual files or file systems, in which a file is just a proxy used to access information returned by the handler.
Applications have different needs in terms of access control, data availability, performance, and consistency. Because the key-value store treats all data as a stream of bits, functionality that could live at the data store level, such as aggregation operators and functions, must instead be implemented at the application or database level. Traditional database systems, by contrast, offer type awareness by defining types for columns as well as functions that operate on those types.
MIME (Multipurpose Internet Mail Extensions) types offer a large list of data types for files, but users could also define their own types, including atomic ones such as integers. All keys with the same data type would be associated with the same handler. In that context, one could define custom types such as big integer or JPEG image, or application-dependent types such as shopping cart, whose behaviour is described in the handlers. To provide more flexibility, a key policy can be defined to override the type policy.
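The type-to-handler association with a per-key override described above can be sketched as follows. This is a minimal illustration; the names (`HandlerRegistry`, `resolve`) and the MIME-like type strings are assumptions, not part of the proposed system's interface.

```python
# Hypothetical sketch of a type-aware handler registry: every key with the
# same data type shares a handler, and a per-key policy overrides the
# type-level one. Class and method names are illustrative.

class HandlerRegistry:
    def __init__(self):
        self.type_handlers = {}   # type name -> handler
        self.key_handlers = {}    # key -> handler (key policy override)

    def register_type(self, type_name, handler):
        self.type_handlers[type_name] = handler

    def register_key(self, key, handler):
        self.key_handlers[key] = handler

    def resolve(self, key, type_name):
        # The key policy takes precedence over the type policy.
        if key in self.key_handlers:
            return self.key_handlers[key]
        return self.type_handlers.get(type_name)


registry = HandlerRegistry()
# A user-defined "big integer" type whose handler parses the raw bytes.
registry.register_type("application/x-bigint", lambda v: int(v))
# An application-dependent key with its own behaviour.
registry.register_key("cart:42", lambda v: v.upper())

print(registry.resolve("counter:1", "application/x-bigint")("123"))  # 123
print(registry.resolve("cart:42", "application/x-bigint")("abc"))    # ABC
```

A lookup for an unregistered type simply returns no handler, in which case the store would fall back to the traditional opaque stream-of-bytes behaviour.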
The code handler in this case might apply an efficient text compression algorithm when we want to optimize storage. Moreover, in the case of a vector image, some operations (such as rotations or translations) involve only small changes to the file's content. A key-value pair could therefore store data as a base content plus a series of diffs, facilitating access to different versions of the item. Statistics and information about who accessed a piece of data, as well as basic access control, can be implemented trivially.
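The base-plus-diffs idea can be sketched as below. The diff representation used here, a `(offset, length, replacement)` triple, is an assumption chosen for brevity; a real handler would likely use a binary delta format.

```python
# Illustrative sketch: a value stored as a base content plus a series of
# diffs, so any earlier version remains cheap to reconstruct.

class VersionedValue:
    def __init__(self, base: str):
        self.base = base
        self.diffs = []  # each diff: (offset, length, replacement)

    def put(self, offset, length, replacement):
        # Append a new diff instead of rewriting the whole value.
        self.diffs.append((offset, length, replacement))

    def get(self, version=None):
        # version=None yields the latest; version=k applies the first k diffs.
        n = len(self.diffs) if version is None else version
        content = self.base
        for offset, length, repl in self.diffs[:n]:
            content = content[:offset] + repl + content[offset + length:]
        return content


v = VersionedValue("hello world")
v.put(0, 5, "goodbye")
print(v.get(0))  # hello world
print(v.get())   # goodbye world
```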
- Code handlers should run in such a way that they do not interfere with the code handlers of other keys within the same node. Moreover, they should be executed so as to minimize the effect of malicious handlers; we therefore need isolation as well as a set of security policies.
- In the presence of replication, all replicas should execute the code handler in the case of a put operation.
- The impact on performance of executing the handlers, and of the possible lookup for the handler code, should be negligible.
- With standard DHTs, application developers have to use the storage system in such a way that values remain small.

Benefit measures: a single extended DHT instance versus one instance per application, and the performance penalty due to handler execution and lookup.
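The replication requirement above, that every replica runs the handler on a put, can be sketched as follows. The counting handler is a stand-in for real handler logic (statistics, compression, and so on), and all names are illustrative.

```python
# Hedged sketch: on a put, every replica executes the code handler, not
# just the primary, so handler side effects stay consistent across replicas.

class Replica:
    def __init__(self):
        self.store = {}
        self.put_count = 0  # example handler side effect: access statistics

    def handler(self, key, value):
        self.put_count += 1
        return value

    def put(self, key, value):
        self.store[key] = self.handler(key, value)


class ReplicatedStore:
    def __init__(self, replication_level):
        self.replicas = [Replica() for _ in range(replication_level)]

    def put(self, key, value):
        # Every replica runs the handler on the incoming value.
        for r in self.replicas:
            r.put(key, value)


s = ReplicatedStore(replication_level=3)
s.put("k", "v")
print([r.put_count for r in s.replicas])  # [1, 1, 1]
```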
Nodes and values are given keys in a huge random identifier space. Each node keeps information about its neighbours. If a node receives a lookup for a key, it will either serve the request itself or forward it to a node closer to the key. For simplicity, we assume random identifiers, a huge and sparse identifier space, routing tables of logarithmic size (even though the proposed changes are independent of this), and a configurable level of replication.
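The lookup behaviour just described can be sketched on a toy identifier ring. The ring size, the full-mesh neighbour sets, and the successor placement rule are simplifying assumptions for the example, not properties required by the proposal.

```python
# Minimal sketch of DHT lookup on an identifier ring: each node knows a
# few neighbours and forwards a lookup to the node closest to the key.

SPACE = 2 ** 16  # toy identifier space


def distance(a, b):
    return (b - a) % SPACE  # clockwise distance on the ring


class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.neighbours = []  # other Node objects this node knows about
        self.store = {}

    def lookup(self, key_id):
        # The responsible node is the key's successor: the node at the
        # smallest clockwise distance from the key. Serve locally if no
        # known node is closer, otherwise forward.
        closest = min([self] + self.neighbours,
                      key=lambda n: distance(key_id, n.id))
        if closest is self:
            return self
        return closest.lookup(key_id)


nodes = [Node(i) for i in (100, 8000, 30000, 51000)]
for n in nodes:
    n.neighbours = [m for m in nodes if m is not n]

owner = nodes[0].lookup(29000)
print(owner.id)  # 30000, the successor of key 29000
```

With full-mesh neighbour sets every lookup resolves in at most one hop; with logarithmic routing tables, as assumed above, the same forwarding rule yields the usual logarithmic hop count.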