This workshop presentation was given by Rich Dill, Solutions Engineer at SnapLogic, at the GigaOm Structure Data Conference, March 20-21, 2013, in New York City.
What are the Top Ten Challenges?
1. A miracle occurs here - Of course we can connect to it…
2. There is always more data than you expected - Unless there is not enough data to be meaningful
3. Never mistake a memo for reality - Did you hear what I said or what I meant?
4. It is logically impossible to schedule for the unknown
5. There is life beyond American English - Eventually you will have to deal with other languages
6. Of course the data is accurate, clean and ready - Data quality issues can kill project schedules
7. Dealing with unstructured data is fun - Somewhere buried inside is your delimiter where you least expect it
8. The data and process is subject to… Pick your acronym PCI, FIX, HIPAA, SOX
9. The requirements once defined are set in stone - Requirements almost always evolve
10. The most critical data will be on the most difficult platform to access - “a good deal of our case data is on Notes running on AS400”
Top 10 Challenges of Making Big Data Real and Tips to Overcome Them
1. Top 10 challenges of making big data real
– and tips to overcome them
Rich Dill
Solutions Engineer, SnapLogic
rdill@snaplogic.com
2. A play on Dave Letterman’s top 10
• 1. A miracle occurs here
- Of course we can connect to it…
• 2. There is always more data than you expected
- Unless there is not enough data to be meaningful
• 3. Never mistake a memo for reality
- Did you hear what I said or what I meant?
• 4. It is logically impossible to schedule for the unknown
- Or the relationship between developers and weathermen
• 5. There is life beyond American English
- Eventually you will have to deal with other languages
3. A play on Dave Letterman’s top 10
• 6. Of course the data is accurate, clean and ready
- Data quality issues can kill project schedules
• 7. Dealing with unstructured data is fun
- Somewhere buried inside is your delimiter where you least
expect it
• 8. The data and process is subject to…
- Pick your acronym PCI, FIX, HIPAA, SOX
• 9. The requirements once defined are set in stone
- Requirements almost always evolve
• 10. The most critical data will be on the most difficult
platform to access
- “a good deal of our case data is on Notes running on AS400”
6. SnapLogic Solution
[Diagram labels: Users, ESB, RDBMS, Data Center, Mobile, Enterprise, Amazon Redshift, Cloud, Big Data]
7. There is always more data than you expected
• Unless there is not enough data to be
meaningful
- It’s feast or famine
- Distributed systems replicate data
• At the site level and at the network level
- 3x at the data center in Houston and 3x in Chicago
- Replicated data can increase the cost of hardware,
network and software
- We are far from normal
• Data is organized for performance and reliability
not space efficiency
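The replication point above can be sketched as simple arithmetic. This is a minimal illustration, not a sizing tool: the function name `replicated_storage_tb` and the 100 TB figure are hypothetical, and it assumes every site keeps the same number of full copies.

```python
def replicated_storage_tb(logical_tb, copies_per_site, sites):
    """Raw storage needed when each site keeps `copies_per_site` full copies.

    Hypothetical helper for illustration; real systems add overhead for
    indexes, logs and free-space headroom that is not modeled here.
    """
    if logical_tb < 0 or copies_per_site < 1 or sites < 1:
        raise ValueError("logical_tb must be >= 0; copies and sites must be >= 1")
    return logical_tb * copies_per_site * sites


# 3x in Houston and 3x in Chicago, as on the slide.
# The 100 TB of logical data is a made-up number.
print(replicated_storage_tb(100, copies_per_site=3, sites=2))  # 600
```

In this sketch, 100 TB of logical data becomes 600 TB of raw capacity before any network or licensing cost is counted.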
8. It is logically impossible to schedule for the unknown
• Or my theory of the relationship between developers
and weathermen
• The accuracy of an estimate is a function of the
number of variables and the length of the project
9. Never mistake a memo for reality
• Did you hear what I said or what I meant?
• Are you a literal listener?
- Psycholinguistics should be required reading for project managers
• Waterfall process
- Allows you to build something the user wants today that you deliver in
9 months or two years
• Iterative process
- We’ll figure it out as we go along
- Not really suited for deep architectural designs
• Process
- Listen
- Process
- Repeat back “this is what I heard you say”
• Nothing beats showing a functioning prototype, demo or wireframe
10. There is life beyond American English
• Eventually you will have to deal with other languages
- German will test your user interface spacing
- Cyrillic will add to the character set
• Middle Eastern languages
- Read right to left
- Some languages don’t have consistent spelling
• Far Eastern languages
- There is no such thing as Chinese
• Mandarin is the “Speech of Officials”
• Cantonese is used in Hong Kong
• Hangul is used in Korea
• Japanese
- Kanji is adopted Chinese characters
- Kana is a combination of Hiragana & Katakana
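A small sketch of why these languages matter to a data pipeline: the same label takes a different number of characters and bytes per language, and some scripts read right to left. The sample strings are illustrative translations of "Save" chosen for this example, and `describe` is a hypothetical helper; only Python's standard `unicodedata` module is used.

```python
import unicodedata

# Illustrative translations of a "Save" button label (assumed for this sketch).
samples = {
    "English": "Save",
    "German": "Speichern",
    "Russian": "Сохранить",
    "Arabic": "حفظ",
    "Japanese": "保存",
}


def describe(text):
    """Report character count, UTF-8 size and reading direction of a string."""
    first = unicodedata.bidirectional(text[0])
    return {
        "chars": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "R" (e.g. Hebrew) and "AL" (Arabic letters) are right-to-left classes.
        "right_to_left": first in ("R", "AL"),
    }


for language, label in samples.items():
    print(language, describe(label))
```

Column widths sized for English characters or single-byte encodings tend to break first on exactly these cases.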
11. Of course the data is accurate, clean and ready
• How good is the data?
- Profiling the data is key to accurate project estimates
- What percentage of the data is null, blank, invalid?
• Data lifecycle includes
- Acquisition or creation
- Validation
• Business rules
• Which may result in…
• Data cleansing
- Zip code tables, barcodes, D & B credit ratings
- Public data resources: www.data.gov
• Storage in an accessible format/location
• Archiving
- Industry or legal rules for archiving
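The profiling question above ("What percentage of the data is null, blank, invalid?") can be sketched in a few lines. The names `profile_column` and `looks_like_us_zip`, the validation rule, and the sample values are all assumptions made for illustration.

```python
import re


def profile_column(values, is_valid):
    """Return the percentage of null, blank, invalid and ok values in a column."""
    counts = {"null": 0, "blank": 0, "invalid": 0, "ok": 0}
    for value in values:
        if value is None:
            counts["null"] += 1
        elif str(value).strip() == "":
            counts["blank"] += 1
        elif not is_valid(value):
            counts["invalid"] += 1
        else:
            counts["ok"] += 1
    total = len(values)
    return {name: (100.0 * n / total if total else 0.0) for name, n in counts.items()}


def looks_like_us_zip(value):
    """Assumed business rule: five digits, optionally followed by -NNNN."""
    return re.fullmatch(r"\d{5}(-\d{4})?", str(value).strip()) is not None


# Made-up sample column.
zips = ["10001", None, "", "1000A", "94105-1234", "  ", "60601", "77002"]
print(profile_column(zips, looks_like_us_zip))
# {'null': 12.5, 'blank': 25.0, 'invalid': 12.5, 'ok': 50.0}
```

Running a profile like this on a sample before estimating gives a rough sense of how much cleansing work the schedule has to absorb.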
12. Dealing with unstructured data is fun
• Somewhere buried inside is your delimiter where you
least expect it
• Email is one of the most complex to handle
• Hierarchical data structures must be mapped or
navigated
• XML is not the end all, be all of structured data
formatting
- JSON
- BSON
- SomethingImissedSON
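The "delimiter where you least expect it" problem can be shown with a comma inside a quoted field. This is a minimal sketch with a made-up record; it compares a naive split against Python's standard `csv` module, which honors quoting.

```python
import csv
import io

# Made-up record: the name and the comment both contain commas,
# and the comment contains escaped double quotes.
raw = 'id,name,comment\n1,"Smith, Jane","said ""hello"", then left"\n'

# Naive approach: split the data line on every comma.
naive_fields = raw.splitlines()[1].split(",")

# Quote-aware approach: let the csv module parse the same text.
rows = list(csv.reader(io.StringIO(raw)))
parsed_fields = rows[1]

print(len(naive_fields))   # 5
print(len(parsed_fields))  # 3
print(parsed_fields)       # ['1', 'Smith, Jane', 'said "hello", then left']
```

The naive split reports five fields for a three-column record, which is the kind of silent corruption that surfaces late in a project.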
13. Big Data Reference Architecture
[Diagram: three stages, 1 Collect, 2 Translate & Enrich, 3 Distribute. Labels: DB, Structured Data, Unstructured Data, Data View]
14. The data and process is subject to…
• Pick your acronym: PCI, FIX, HIPAA, SOX
• Almost every industry has some form or another of data
handling protocols that must be addressed
• These protocols are a combination of
- Data creation
- Data access
- Technology and workflow
- It is not just encryption and access
• Know your customer's requirements!
15. The requirements once defined are set in stone
• What your users know today is not what they will know
tomorrow…
• Requirements evolve
• Why do you think they call them users?
- If you are successful they will want more
• Things change
- Economy
- Budgets
- Timeframe
- Management
• Feature creep is not a bad thing if budgets and
timelines also creep
16. The most critical data will be on the most difficult
platform to access
• “A good deal of our case data is on Notes running on AS400”
• Discover where the data is first
• When can you access it?
- 24x7, after hours, on demand
• Throughput is key
- Either during business hours or afterwards
• What conditions?
- One time download
- Scheduled
- Event based
- Stream
• What about security requirements?
- There is a performance impact of encryption during transmission
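The throughput and access-window questions above can be combined into a back-of-the-envelope check. This is a sketch under stated assumptions: the function names, the 500 GB volume, the 100 Mbps link, the 12-hour window and the 10% encryption overhead are all hypothetical numbers, and decimal units are used (1 GB = 8,000 megabits).

```python
def transfer_hours(data_gb, link_mbps, encryption_overhead=0.0):
    """Estimate hours needed to move `data_gb` over a link of `link_mbps`.

    `encryption_overhead` is the assumed fraction of throughput lost to
    encryption (0.10 means 10% slower); it is not a measured value.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0.0 <= encryption_overhead < 1.0:
        raise ValueError("invalid input")
    effective_mbps = link_mbps * (1.0 - encryption_overhead)
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return megabits / effective_mbps / 3600


def fits_window(data_gb, link_mbps, window_hours, encryption_overhead=0.0):
    """True if the transfer is estimated to finish inside the access window."""
    return transfer_hours(data_gb, link_mbps, encryption_overhead) <= window_hours


print(round(transfer_hours(500, 100), 2))                        # 11.11
print(fits_window(500, 100, 12))                                 # True
print(fits_window(500, 100, 12, encryption_overhead=0.10))       # False
```

In this made-up case the unencrypted load just fits an after-hours window, and a modest encryption penalty pushes it past the window, which is why access hours, throughput and security requirements are best asked about together.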
17. Containerization with Snaps
BUY
• SnapStore
• Certified and supported by SnapLogic
BUILD
• SDK + API
• Java, Python
• Customer, Partner or SnapLogic
18. The eleventh rule
• Free software sometimes is worth the cost
- Or the money you save on licenses is multiplied by
the cost of training and consultants
- In most cases labor is one of the biggest costs of
most software projects
• Open source is NOT the same as free!
- Subscription vs. perpetual licenses
- Does the customer need to expense or capitalize software licenses?
19. Thank you
For more information
www.snaplogic.com
BDaaS - BigData as a Service
Editor's Notes
1990s
• Valuable data was being generated but was really living in siloed environments. The term MDM was not even coined until 2003.
• As long as you could connect different systems together via a nightly, or sometimes even a weekly, feed, that was pretty darn awesome!
• Technologies like ESBs, EAIs, ETLs… flourished.
• Data was mostly structured, sitting in RDBMS systems.
2000s
• Network speeds increased
• Costs went down
• Players like Salesforce and NetSuite started getting traction from the SMB market
• Immense value on cost and agility
• Flexibility to subscribe vs. perpetual licenses
2005: Consumer / Social data
• FB, Twitter, LinkedIn, amazon.com consumer reviews…
• Humans generating massive amounts of preference data, likes and dislikes
• Data was different: non-relational, unstructured, real-time data
• Huge volumes: petabytes
• Providing immense value to the business on their customers
2010: Machine
• RFID tags, various other sensors, weblogs
• ArcSight got bought out for $1.5B by HP
• Massive amounts of data: exabytes
• Splunk had a successful IPO last month
SnapLogic
• These 4 sources create an impedance mismatch!
• Good luck doing all of this with an ESB
• Structured vs. unstructured
• Streaming vs. batch
• Petabytes and exabytes vs. gigabytes
• Pull vs. push
• Hub and spoke
• Unprecedented opportunity & desire to use data
• Data silos (data fragmentation) unavoidable
• Legacy Apps, Cloud Apps, and Hadoop are driving this
• Different locations, protocols, formats, and architectures
• Data is more distributed & less accessible (less useful)
• Compounding due to volume & variety of apps & data
• ESB is just another connection
• Enterprises must share data between their apps
• Collect, combine, process data into valuable information
• Competitive advantage will become necessity for survival
• SnapLogic = data sharing platform
• Apple-like model: we offer an API and about 200 Snaps
• Build or buy
• Easy to build with Java or Python: an intern out of school built Snaps in 4 days
• Build or buy: containerization of access
• Abstraction of the endpoint, so you do not need to know everything