JDK.IO 2016 (http://jdk.io)
Java EE 7 introduced a new batch processing API. This session shows how to use it: the API makes it easy to implement long-running, data- or compute-intensive jobs that need to be scheduled or initiated on demand. Basics of the API are demonstrated via code samples, and the API is compared with Spring Batch and Hadoop to provide context and guidance on when each technology is appropriate.
2. What is Batch Processing?
Batch jobs are typically:
• Bulk-oriented
• Non-interactive
• Potentially compute intensive
• May require parallel execution
• May be invoked ad hoc, scheduled, or on demand
3. Batching Examples
• Monthly reports/statements
• Daily data cleanup
• One-time data migrations
• Data synchronization
• Data analysis
• Portfolio rebalancing
4. Introducing Java EE Batching
• Introduced in Java EE 7
• JSR 352 - https://jcp.org/en/jsr/detail?id=352
• Reference implementation:
https://github.com/WASdev/standards.jsr352.jbatch
• Batch Framework:
• Batch container for execution of jobs
• XML Job Specification Language
• Batch annotations and interfaces
• Supporting classes and interfaces for interacting with the
container
• Depends on CDI
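The framework's jobs are described in the XML Job Specification Language, in a file under META-INF/batch-jobs. A minimal sketch (the job/step ids and the batchlet ref are illustrative names, not from the deck):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- META-INF/batch-jobs/simpleJob.xml; ids and ref are illustrative -->
<job id="simpleJob" xmlns="http://xmlns.jcp.org/xml/ns/javaee" version="1.0">
    <step id="step1">
        <batchlet ref="myBatchlet"/>
    </step>
</job>
```

The batchlet ref is resolved to a CDI-managed batch artifact, which is why the framework depends on CDI.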
6. Java EE Batching Overview
A Job has many JobInstances, and each JobInstance has many JobExecutions:
• Job – the job definition, e.g. EndOfDayJob
• JobInstance – one logical run of a job, e.g. EndOfDayJob for 9/1/2016
• JobExecution – one attempt to run a JobInstance, e.g. the first attempt at the EndOfDay job for 9/1/2016
7. Java EE Batching Features
• Fault-tolerant – checkpoints and job persistence
• Transactions - chunks execute within a JTA transaction
• Listeners – notification of job status/completion/etc.
• Resource management – limits concurrent jobs
• Starting/stopping/restarting – job control API
8. Java EE Batching Deployment
• Deploy batch jobs in a WAR, EAR, or JAR.
• Manage jobs by splitting the application into modules, e.g.:
• Server A – frontend.war
• Server B – app.war (End of Day Job, Cleanup Job)
• Server C – app2.war (Analytics Job)
10. Exit Codes
Code Description
STARTING Job has been submitted to the runtime.
STARTED Batch job has started executing.
STOPPING Job termination has been requested.
STOPPED Job has been stopped.
FAILED Job has thrown an error, or a failure was triggered by <fail>.
COMPLETED Job has completed normally.
ABANDONED Job cannot be restarted.
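These values correspond to the javax.batch.runtime.BatchStatus enum. As a container-free illustration of the lifecycle, here is a plain-Java mirror with two helper predicates (the predicates are our own shorthand, not part of the standard API):

```java
// Mirrors the values of javax.batch.runtime.BatchStatus; the helper
// predicates below are illustrative shorthand, not part of the API.
enum Status {
    STARTING, STARTED, STOPPING, STOPPED, FAILED, COMPLETED, ABANDONED;

    // Per the table: only STOPPED and FAILED executions may be restarted.
    boolean restartable() {
        return this == STOPPED || this == FAILED;
    }

    // Terminal states: the execution's status will never change again.
    boolean terminal() {
        return this == STOPPED || this == FAILED
                || this == COMPLETED || this == ABANDONED;
    }
}
```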
17. IDs and Names
instanceId
• ID that represents an instance of a job.
• Created when the JobOperator start method is invoked.
executionId
• ID that represents one attempt to run a particular job instance.
• Created when a job is started/restarted.
• Only one executionId per job instance can be running at a time.
stepExecutionId
• ID for an attempt to execute a particular step in a job.
jobName
• Name of the job from the XML (actually the id): <job id="">
jobXMLName
• Name of the config file in META-INF/batch-jobs.
19. Managing Jobs
• JobOperator – interface for operating on batch jobs.
• BatchRuntime.getJobOperator()
• JobOperator:
• Provides information on current and completed jobs
• Used to start/stop/restart/abandon jobs
• Security is implementation dependent
• JobOperator interacts with JobRepository
• JobRepository
• Implementation outside the scope of the JSR
• No API for deleting old jobs
• Reference implementation provides no API for cleanup!
20. JobOperator Methods
Type Method
void abandon(long executionId)
JobExecution getJobExecution(long executionId)
List<JobExecution> getJobExecutions(JobInstance instance)
JobInstance getJobInstance(long executionId)
int getJobInstanceCount(String jobName)
List<JobInstance> getJobInstances(String jobName, int start, int count)
Set<String> getJobNames()
Properties getParameters(long executionId)
List<Long> getRunningExecutions(String jobName)
List<StepExecution> getStepExecutions(long jobExecutionId)
long restart(long executionId, Properties restartParams)
long start(String jobXMLName, Properties jobParams)
void stop(long executionId)
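A sketch combining a few of these methods – stopping every running execution of a job. This only runs inside a container that provides a batch runtime, and the job name is illustrative:

```java
import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;
import javax.batch.runtime.JobExecution;

public class JobMonitor {

    // Stop all running executions of the named job ("endOfDayJob" is
    // an illustrative name, not one defined by the deck's samples).
    public static void stopAll(String jobName) {
        JobOperator operator = BatchRuntime.getJobOperator();
        for (long executionId : operator.getRunningExecutions(jobName)) {
            JobExecution execution = operator.getJobExecution(executionId);
            System.out.println(jobName + " " + executionId + ": "
                    + execution.getBatchStatus());
            operator.stop(executionId); // asynchronous: requests termination
        }
    }
}
```

Note that stop() only requests termination (status STOPPING); the job moves to STOPPED once the running artifact actually yields.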
22. Chunking
• Chunking is the primary pattern for batch processing in JSR-352.
• Encapsulates the ETL pattern:
• Pieces: Reader/Processor/Writer
• Reader/Processor invoked until an entire chunk of data is processed.
• Output is written atomically
• Implementation:
• Interfaces: ItemReader/ItemWriter/ItemProcessor
• Classes: AbstractItemReader/AbstractItemWriter
Reader → Processor → Writer
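The read/process/write loop can be illustrated without a container. The sketch below mimics the JSR-352 chunk loop in plain Java – the interfaces here mirror, but are not, the javax.batch ones, and item-count plays the checkpoint role:

```java
import java.util.ArrayList;
import java.util.List;

class ChunkLoop {

    // Simplified stand-ins for javax.batch.api.chunk interfaces.
    interface ItemReader<T> { T readItem(); }            // null = end of data
    interface ItemProcessor<T, R> { R processItem(T item); } // null = filter out
    interface ItemWriter<R> { void writeItems(List<R> items); }

    // Read and process items one at a time; write each completed chunk
    // of itemCount items in a single writeItems call (the atomic write).
    static <T, R> void run(ItemReader<T> reader, ItemProcessor<T, R> processor,
                           ItemWriter<R> writer, int itemCount) {
        List<R> chunk = new ArrayList<>();
        T item;
        while ((item = reader.readItem()) != null) {
            R processed = processor.processItem(item);
            if (processed != null) {
                chunk.add(processed);
            }
            if (chunk.size() == itemCount) { // chunk boundary = checkpoint
                writer.writeItems(chunk);
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            writer.writeItems(chunk);        // final partial chunk
        }
    }
}
```

In a real step the container begins a JTA transaction for each chunk and takes a checkpoint after each writeItems call, which is what makes the pattern restartable.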
24. Chunk Configuration
Parameter Description
checkpoint-policy Possible values: item or custom.
item-count Number of items to be processed per chunk. Default is 10.
time-limit Time in seconds before taking a checkpoint. Default is 0 (checkpoint only at chunk boundaries).
skip-limit Number of exceptions a step will skip, given configured skippable exception classes.
retry-limit Number of times a step will be retried if it throws a retryable exception.
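These parameters are set on the <chunk> element in the JSL. A sketch reusing the storm reader/processor/writer artifacts shown on the split slide (the attribute values and the exception class are illustrative):

```xml
<step id="processStorms">
    <chunk item-count="50" time-limit="60" skip-limit="5" retry-limit="3">
        <reader ref="stormReader"/>
        <processor ref="stormProcessor"/>
        <writer ref="stormWriter"/>
        <!-- exceptions listed here count against skip-limit -->
        <skippable-exception-classes>
            <include class="java.text.ParseException"/>
        </skippable-exception-classes>
    </chunk>
</step>
```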
42. Flow & Splits JSL
• <flow> element is used to implement process workflows.
• <split> element is used to run flows in parallel.
[Diagram: a split runs updateExisting and processNewStorms in parallel; steps shown include retrieveTracking, processDecider, the stormReader/stormProcessor/stormWriter chunk, and updateExistingStorms.]
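A hedged sketch of the JSL for such a split – two flows run in parallel, using the artifact names from the diagram (the structure around those names is assumed):

```xml
<split id="stormSplit">
    <flow id="updateFlow">
        <step id="updateExisting">
            <batchlet ref="updateExistingStorms"/>
        </step>
    </flow>
    <flow id="processFlow">
        <step id="processNewStorms">
            <chunk>
                <reader ref="stormReader"/>
                <processor ref="stormProcessor"/>
                <writer ref="stormWriter"/>
            </chunk>
        </step>
    </flow>
</split>
```

The split completes only when both flows have finished, after which control passes to the split's next element.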
46. Hadoop Overview
• Massively scalable storage and batch data processing
system
• Written in Java
• Huge ecosystem
• Meant for massive data processing jobs
• Horizontally scalable
• Uses MapReduce programming model
• Handles processing of petabytes of data
• Started at Yahoo! in 2005.
48. Hadoop
Typically Hadoop is used when:
• Analysis is performed on unstructured datasets
• Data is stored across multiple servers (HDFS)
• Non-Java processes are fed data and managed
Ex. https://svi.nl/HuygensSoftware
49. Spring vs. Java EE Batching
• Spring Batch 3.0 implements JSR-352!
• Batch artifacts developed against JSR-352 won’t work
within a traditional Spring Batch Job
• Same two processing models as Spring Batch:
• Item – aka chunking
• Task - aka Batchlet
51. Scaling Batch Jobs
• Traditional Spring Batch Scaling:
• Split – running multiple steps in parallel
• Multiple threads – executing a single step via multiple threads
• Partitioning – dividing data up for parallel processing
• Remote Chunking – executing the processor logic remotely
• JSR-352 Job Scaling
• Split – running multiple steps in parallel
• Partitioning – dividing data up – implementation slightly different.
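In JSR-352, partitioning is declared on the step via a <partition> plan; each partition runs the same chunk with its own properties. A minimal sketch (the partition counts and the property names/values are illustrative):

```xml
<step id="processStorms">
    <chunk>
        <reader ref="stormReader"/>
        <writer ref="stormWriter"/>
    </chunk>
    <partition>
        <!-- two partitions sharing two threads; each gets its own properties -->
        <plan partitions="2" threads="2">
            <properties partition="0">
                <property name="region" value="atlantic"/>
            </properties>
            <properties partition="1">
                <property name="region" value="pacific"/>
            </properties>
        </plan>
    </partition>
</step>
```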
52. JSR-352/Spring/Hadoop
Hadoop
• Massively parallel / large jobs
• Processing petabytes of data (BIG DATA)
JSR-352/Spring
• Traditional batch processing jobs
• Structured data/business processes
JSR-352 vs. Spring
• Java EE versus Spring containers
• Spring has better job scaling capabilities
54. Best Practices
• Package/deploy batch jobs separately
• Implement logic to cleanup old jobs
• Implement logic for auto-restart
• Test restart and checkpoint logic
• Configure database to store jobs
• Configure thread pool for batch jobs
• Only invoke batch jobs from logic that is secured (@RolesAllowed etc.)
55. Resources
• JSR-352
https://jcp.org/en/jsr/detail?id=352
• Java EE Support
http://javaee.support/contributors/
• Spring Batch
http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html
• Spring JSR-352 Support
http://docs.spring.io/spring-batch/reference/html/jsr-352.html
56. Resources
• Java EE 7 Batch Processing and World of Warcraft
http://tinyurl.com/gp8yls8
• Three Key Concepts for Understanding JSR-352
http://tinyurl.com/oxe2dhu
• Java EE Tutorial
https://docs.oracle.com/javaee/7/tutorial/batch-processing.htm
We’ve got our class; next we need to create an XML configuration file for the job.
Starting a job is relatively easy: we access the batch runtime and get a job operator.
We initialize the properties, which carry configuration settings.
Then we start the job – the “simpleJob”.
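Put together, the sequence above looks like this. It assumes a job definition in META-INF/batch-jobs/simpleJob.xml and only runs inside a container; the parameter key and value are illustrative:

```java
import java.util.Properties;
import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;

public class JobStarter {

    public static long startSimpleJob() {
        // Access the batch runtime and get the job operator.
        JobOperator operator = BatchRuntime.getJobOperator();

        // Job parameters travel as Properties; this key is illustrative.
        Properties params = new Properties();
        params.setProperty("inputFile", "/tmp/input.csv");

        // Looks up META-INF/batch-jobs/simpleJob.xml and returns the
        // executionId of the newly created JobExecution.
        return operator.start("simpleJob", params);
    }
}
```

The returned executionId is the handle you later pass to getJobExecution, stop, or restart.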