3. Office Hours IS Simply put, Office Hours is a program that enables a technical audience to interact with AWS technical experts. We look to improve this program by soliciting feedback from Office Hours attendees. Please let us know what you would like to see.
4. Office Hours is NOT Support. If you have a problem with your current deployment, please visit our forums or our support website: http://aws.amazon.com/premiumsupport/ It is also NOT a place to find out about upcoming services; we do not typically disclose information about our services until they are available.
5. Agenda: What’s New; How-to Demonstrations (Resize a running job flow; Launch a Hive-based Data Warehouse – Contextual Advertising Example); Question and Answer. Please begin submitting questions now.
9. Parallel uploads and an earlier start mean data-intensive applications finish significantly faster. Usage: ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"
11. Reference the same IP address for a long-running job flow even if it has to be restarted
12. Reference transient job flows in a consistent way each time they are launched. Usage: ./elastic-mapreduce --create --eip [existing_ip]
16. Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight). [Diagram: a data warehouse job flow is allocated 4 instances, expands to 25 instances for batch processing, and shrinks back to 9 instances at steady state; the remaining processing time drops from 14 hours to 7 hours to 3 hours as the cluster is resized.]
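The resize itself is driven from the same CLI used to launch the job flow. A hedged sketch only — the job flow ID and group names below are illustrative, and the exact flags may differ by CLI version (check ./elastic-mapreduce --help):

```shell
# Grow the running job flow to 25 instances for the batch window...
./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group core --instance-count 25
# ...then shrink the task group back to steady-state size afterwards.
./elastic-mapreduce --jobflow j-ABABABABAB --modify-instance-group task --instance-count 9
```

Note that core instance groups host HDFS data and can generally only be expanded; shrinking a cluster is typically done through task instance groups.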
22. Launch a Hive Cluster (Contextual Advertising Example)
// Launch a Hive cluster with cluster compute nodes
./elastic-mapreduce --create --alive --hive-interactive --name "Hive Job Flow" --instance-type cc1.4xlarge
Created job flow j-ABABABABAB
// SSH to Master Node
./elastic-mapreduce --ssh --jobflow j-ABABABABAB
// Run a Hive Session on the Master Node
hadoop@domU-12-31-39-07-D2-14:~$ hive -d SAMPLE=s3://elasticmapreduce/samples/hive-ads -d DAY=2009-04-13 -d HOUR=08 -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 -d OUTPUT=s3://mybucket/samples/output
hive>
23. Define Impressions Table
Show Tables
hive> show tables;
OK
Time taken: 3.51 seconds
hive>
Define Impressions Table
ADD JAR ${SAMPLE}/libs/jsonserde.jar ;
CREATE EXTERNAL TABLE impressions (
  requestBeginTime string, adId string, impressionId string, referrer string,
  userAgent string, userCookie string, ip string )
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION '${SAMPLE}/tables/impressions' ;
24. Recover Partitions The table is partitioned based on time, so we can tell Hive about the existence of a single partition using the following statement. ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05') ; If we were to query the table at this point, the results would contain data from just that single partition. We can instruct Hive to recover all partitions by inspecting the data stored in Amazon S3 using the RECOVER PARTITIONS statement. ALTER TABLE impressions RECOVER PARTITIONS ;
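RECOVER PARTITIONS works because the table's S3 location is laid out with one dt=... prefix per partition, which Hive enumerates and registers. A hypothetical listing to show the layout — the hadoop fs invocation and the specific prefixes are illustrative, not captured output:

```shell
# Each dt=<timestamp> prefix under the table LOCATION becomes one Hive partition.
hadoop fs -ls 's3://elasticmapreduce/samples/hive-ads/tables/impressions/'
# .../tables/impressions/dt=2009-04-13-08-00/
# .../tables/impressions/dt=2009-04-13-08-05/
# .../tables/impressions/dt=2009-04-13-08-10/
```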
25. Define Clicks Table We follow the same process to define the clicks table and recover its partitions. CREATE EXTERNAL TABLE clicks ( impressionId string ) PARTITIONED BY (dt string) ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde' WITH SERDEPROPERTIES ( 'paths'='impressionId' ) LOCATION '${SAMPLE}/tables/clicks' ; ALTER TABLE clicks RECOVER PARTITIONS ;
26. Define Output Table We are going to combine the clicks and impressions tables so that we have a record of whether or not each impression resulted in a click. We'd like this data stored in Amazon S3 so that it can be used as input to other job flows. CREATE EXTERNAL TABLE joined_impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string, clicked boolean ) PARTITIONED BY (day string, hour string) STORED AS SEQUENCEFILE LOCATION '${OUTPUT}/joined_impressions' ;
27. Define Local Impressions Table Next, we create some temporary tables in the job flow's local HDFS partition to store intermediate impression and click data. CREATE TABLE tmp_impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) STORED AS SEQUENCEFILE ; We insert data from the impressions table for the time duration we're interested in. Note that because the impressions table is partitioned, only the relevant partitions will be read. INSERT OVERWRITE TABLE tmp_impressions SELECT from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int)) requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip FROM impressions i WHERE i.dt >= '${DAY}-${HOUR}-00' AND i.dt < '${NEXT_DAY}-${NEXT_HOUR}-00' ;
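The cast-and-divide in the SELECT above converts requestBeginTime from milliseconds since the epoch into the seconds value that Hive's from_unixtime() expects. The same arithmetic can be sketched with GNU date — the sample millisecond value here is hypothetical, not taken from the dataset:

```shell
# requestBeginTime is logged in milliseconds since the epoch;
# dividing by 1000 gives seconds, which format to a readable timestamp.
ms=1239609600000                             # hypothetical sample value
secs=$((ms / 1000))
date -u -d "@${secs}" '+%Y-%m-%d %H:%M:%S'   # prints 2009-04-13 08:00:00
```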
28. Define Local Clicks Table For clicks, we extend the period of time over which we join by 20 minutes; that is, we accept a click that occurred up to 20 minutes after the impression. CREATE TABLE tmp_clicks ( impressionId string ) STORED AS SEQUENCEFILE ; INSERT OVERWRITE TABLE tmp_clicks SELECT impressionId FROM clicks c WHERE c.dt >= '${DAY}-${HOUR}-00' AND c.dt < '${NEXT_DAY}-${NEXT_HOUR}-20' ;
29. Join Tables Now we combine the impressions and clicks tables using a left outer join. This way any impressions that did not result in a click are preserved. INSERT OVERWRITE TABLE joined_impressions PARTITION (day='${DAY}', hour='${HOUR}') SELECT i.requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip, (c.impressionId is not null) clicked FROM tmp_impressions i LEFT OUTER JOIN tmp_clicks c ON i.impressionId = c.impressionId ;
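The left-outer-join semantics can be illustrated with coreutils join on toy data — the file names and impression IDs below are made up. Unmatched impressions survive with an empty click side, which is exactly what makes (c.impressionId is not null) usable as a clicked flag.

```shell
# Three impressions, only one of which (i2) received a click.
printf 'i1\ni2\ni3\n' > impressions.txt   # join requires sorted input
printf 'i2\n'         > clicks.txt
# -a 1 keeps unpaired lines from impressions.txt (the LEFT OUTER part);
# -e NULL fills the missing click column, mirroring Hive's NULL.
join -a 1 -o 1.1,2.1 -e NULL impressions.txt clicks.txt
# i1 NULL
# i2 i2
# i3 NULL
```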