SAS is both a language for processing data and an application for doing analytics. SAS has adapted to the Hadoop ecosystem and intends to be a good citizen among the choices for processing large volumes of data on your cluster. As more people inside an organization want to access and process the accumulated data, the “schema on read” approach can degenerate into “redo work someone else might have done already”.
This talk begins by comparing and contrasting different data storage strategies and describes the flexibility SAS provides to accommodate different approaches. These storage techniques are ranked according to convenience, performance, and interoperability, weighing both the practicality and the cost of translating between formats. Techniques considered include the following (a short SAS sketch contrasting the first two approaches follows the list):
· Storing the raw data (weblogs, CSVs)
· Storing table metadata in the Hive metastore, then querying with Hive/Impala/HAWQ
· Storing in Hadoop-optimized formats (Avro, protobufs, RCFile, Parquet)
· Storing in proprietary formats (such as SAS’s SASHDAT)
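As a rough illustration of the first two approaches, the sketch below first reads a raw CSV straight out of HDFS with the FILENAME Hadoop access method (every reader must re-supply the schema), then reads the same data through a Hive table with a SAS/ACCESS LIBNAME (the metastore supplies the schema once, for every tool). The host names, paths, and the weblog table are placeholders, not examples from the talk itself.

    /* Raw approach: the schema lives in this program, so every  */
    /* consumer of the file must re-declare it (cfg= points to a */
    /* Hadoop client configuration file; path and file are       */
    /* placeholders)                                             */
    filename rawlog hadoop '/data/raw/weblog.csv'
        cfg='/etc/hadoop/conf/hadoop-client.xml';

    data weblog;
        infile rawlog dsd dlm=',' missover;
        input ip :$15. tstamp :$25. url :$200. status bytes;
    run;

    /* Metastore approach: Hive already knows the columns, so    */
    /* SAS (or Impala, or anything else) reads the table as-is   */
    libname hdp hadoop server='hive.example.com' port=10000
        schema=weblogs user=sasuser;

    proc sql;
        select status, count(*) as hits
            from hdp.access_log
            group by status;
    quit;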
The talk finishes by discussing the array of analytical techniques that SAS has converted to run on your cluster, with particular mention of situations where HDFS is just plain better than the RDBMS that preceded it.
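As one hedged sketch of what “running on the cluster” can look like, SAS high-performance procedures accept a PERFORMANCE statement that distributes the work across the grid nodes rather than dragging the rows back to the SAS server first; the libref and table below reuse the placeholders above and are assumptions, not the talk’s actual examples.

    /* Summarize the data where it lives: the PERFORMANCE        */
    /* statement asks the procedure to execute distributed       */
    /* across the cluster nodes                                  */
    proc hpsummary data=hdp.access_log;
        performance nodes=all details;
        class status;
        var bytes;
        output out=bytes_by_status sum=total_bytes mean=avg_bytes;
    run;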