High Level System Overview
Crawler → Preprocessor → DB Processor → Store in DB
Infrastructure Requirements
• Component Independence
• Messaging
• Scalability
• Minimize code written; use as much open source code as possible
Infrastructure Choices
• JBoss/Java vs .NET
• Spring Framework vs Plain-old Java
• Oracle vs MySQL
• Hibernate ORM vs Plain-old SQL
Logging Structure
• Entering and exiting methods (of reasonable importance)
• Catching Java checked exceptions
• Uniform structure
• org.fydproject.component.mainMethod.subMethod1.subMethod2…
• Example: org.projectnlp.preprocessor.stemmer
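The logging conventions above could be sketched with java.util.logging roughly as follows. The slides do not say which logging library the project used, so the choice of java.util.logging, the Stemmer class, and both of its methods are assumptions made for illustration; only the logger name comes from the slide.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical component class; only the logger name is taken from the slide.
final class Stemmer {
    static final String LOGGER_NAME = "org.projectnlp.preprocessor.stemmer";
    private static final Logger LOG = Logger.getLogger(LOGGER_NAME);

    /** Toy stemming rule, present only so there is a method worth logging. */
    String stem(String word) {
        LOG.entering(Stemmer.class.getName(), "stem", word);
        String result = word.endsWith("ing") && word.length() > 5
                ? word.substring(0, word.length() - 3)
                : word;
        LOG.exiting(Stemmer.class.getName(), "stem", result);
        return result;
    }

    /** Reads one line; the checked IOException is caught and logged. */
    String firstLine(BufferedReader reader) {
        LOG.entering(Stemmer.class.getName(), "firstLine");
        String line = null;
        try {
            line = reader.readLine();
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Could not read line", e);
        }
        LOG.exiting(Stemmer.class.getName(), "firstLine", line);
        return line;
    }
}
```

Because logger names are hierarchical, a level set on the parent name org.projectnlp.preprocessor is inherited by the stemmer logger unless the child sets its own. The entering and exiting calls log at level FINER, so they stay silent under the default configuration.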
Pseudo-Code for Crawler Manager
• Begin infinite loop
– For each messageBoard in List
• crawlAll
– End For loop
• End infinite loop
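A minimal Java sketch of the pseudo-code above might look like this. The MessageBoard interface, the keepRunning hook, and the decision to keep going when one board fails are assumptions added for illustration; the slide only specifies the two nested loops and the crawlAll step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical interface: the slide only names a crawlAll step per message board.
interface MessageBoard {
    void crawlAll();
}

final class CrawlerManager {
    private final List<MessageBoard> boards;

    CrawlerManager(List<MessageBoard> boards) {
        this.boards = new ArrayList<>(boards);
    }

    /** One pass of the inner "for each messageBoard" loop. */
    void crawlOnce() {
        for (MessageBoard board : boards) {
            try {
                board.crawlAll();
            } catch (RuntimeException e) {
                // Assumption: one failing board should not stop the others.
                System.err.println("crawl failed: " + e.getMessage());
            }
        }
    }

    /** The outer loop; keepRunning is an added hook so the loop can be stopped. */
    void run(BooleanSupplier keepRunning) {
        while (keepRunning.getAsBoolean()) {
            crawlOnce();
        }
    }
}
```

Passing () -> true as keepRunning reproduces the infinite loop from the slide; a supplier that eventually returns false makes the manager easier to shut down and to test.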
High-Level Crawler Strategy
• Failed messages are persisted
• Message markers (right-hand side labels) are persisted
• Algorithm prevents crawling duplicate messages
[Diagram: message-id axis running from Lowest Message Id to Highest Message Id. Markers: Old Message Threshold, Oldest Message Crawled, Last Successful Crawl, Last Successful Message Extracted, Newest Message. Regions: Old Messages Yet to be Crawled, Old Successfully Crawled Messages, Messages from Crash, Newly Crawled Messages.]
Crawler Strategy Algorithm
• Crawl all previous failed messages
• Crawl ‘crashed messages’
• Crawl new messages
• Crawl new failed messages
• Crawl old messages
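The five steps above can be sketched as an ordered pass over candidate message ids. Everything beyond the step order is an assumption for illustration: the CrawlPass class, the in-memory sets standing in for persisted state, the fetch callback, and the choice to have both retry steps re-read the current failure list. How the crashed, new, and old id ranges are derived from the persisted markers is not shown in the slides and is left to the caller here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongPredicate;

final class CrawlPass {
    /** The five steps, in the order the slide lists them. */
    enum Phase { PREVIOUS_FAILED, CRASHED, NEW, NEW_FAILED, OLD }

    private final Set<Long> crawled = new LinkedHashSet<>(); // ids fetched successfully
    private final Set<Long> failed = new LinkedHashSet<>();  // stand-in for the persisted failure list

    CrawlPass(Collection<Long> persistedFailures) {
        failed.addAll(persistedFailures);
    }

    /** fetch returns true when the message with that id was crawled successfully. */
    void run(List<Long> crashedIds, List<Long> newIds, List<Long> oldIds, LongPredicate fetch) {
        for (Phase phase : Phase.values()) {
            List<Long> ids;
            switch (phase) {
                case CRASHED: ids = crashedIds; break;
                case NEW:     ids = newIds;     break;
                case OLD:     ids = oldIds;     break;
                default:      ids = new ArrayList<>(failed); // snapshot: attempt() edits the set
            }
            for (long id : ids) {
                attempt(id, fetch);
            }
        }
    }

    private void attempt(long id, LongPredicate fetch) {
        if (crawled.contains(id)) {
            return; // duplicate prevention: never fetch a message twice
        }
        if (fetch.test(id)) {
            crawled.add(id);
            failed.remove(id);
        } else {
            failed.add(id); // kept for the next retry step or the next pass
        }
    }

    Set<Long> crawled() { return crawled; }
    Set<Long> failed() { return failed; }
}
```

An id that appears in more than one range (for example in both the crashed and the new list) is fetched only once, and anything still in failed() at the end of the pass is what would be persisted for the next run.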
Preprocessor Block Diagram
[Block diagram: In from Crawler → Lowercase → HTML Parser Cleanup → Contractions Dictionary → Slang Dictionary → Punctuation Cleaner → Stop Words Dictionary → Negation Engine → Stemmer → Out to DB Processor]
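The stages named in the block diagram above could be chained as a list of text transformations, as in the sketch below. The stage order follows the order in which the labels appear, which is itself an assumption. The Preprocessor class, the tiny dictionaries, the regex-based HTML cleanup, the "not_" prefix used by the negation step, and the suffix-stripping stemmer are all hypothetical stand-ins, not the project's actual components.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class Preprocessor {
    // Tiny illustrative dictionaries; real ones would be loaded from files or a database.
    private static final Map<String, String> CONTRACTIONS = new HashMap<>();
    private static final Map<String, String> SLANG = new HashMap<>();
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "is", "i"));
    static {
        CONTRACTIONS.put("don't", "do not");
        CONTRACTIONS.put("isn't", "is not");
        SLANG.put("gr8", "great");
        SLANG.put("u", "you");
    }

    // One entry per block in the diagram, applied in order.
    static final List<UnaryOperator<String>> STAGES = Arrays.<UnaryOperator<String>>asList(
            text -> text.toLowerCase(Locale.ROOT),                         // Lowercase
            text -> text.replaceAll("<[^>]*>", " "),                       // HTML cleanup (regex stand-in)
            text -> mapWords(text, w -> CONTRACTIONS.getOrDefault(w, w)),  // Contractions dictionary
            text -> mapWords(text, w -> SLANG.getOrDefault(w, w)),         // Slang dictionary
            text -> text.replaceAll("[^a-z0-9\\s]", " "),                  // Punctuation cleaner
            text -> mapWords(text, w -> STOP_WORDS.contains(w) ? "" : w),  // Stop words dictionary
            Preprocessor::markNegation,                                    // Negation engine
            text -> mapWords(text, Preprocessor::stem));                   // Stemmer

    /** Applies f to each whitespace-separated word and drops words that become empty. */
    private static String mapWords(String text, UnaryOperator<String> f) {
        return Arrays.stream(text.trim().split("\\s+"))
                .map(f)
                .filter(w -> !w.isEmpty())
                .collect(Collectors.joining(" "));
    }

    /** Assumed behaviour: fold "not" into the next word; a trailing "not" is dropped. */
    static String markNegation(String text) {
        StringBuilder out = new StringBuilder();
        boolean negate = false;
        for (String w : text.trim().split("\\s+")) {
            if (w.equals("not")) {
                negate = true;
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(negate ? "not_" + w : w);
            negate = false;
        }
        return out.toString();
    }

    /** Toy stemmer: strips a trailing "ing" from longer words. */
    static String stem(String w) {
        return w.endsWith("ing") && w.length() > 5 ? w.substring(0, w.length() - 3) : w;
    }

    static String process(String raw) {
        String text = raw;
        for (UnaryOperator<String> stage : STAGES) {
            text = stage.apply(text);
        }
        return text;
    }
}
```

With these toy dictionaries, Preprocessor.process("<p>U don't like the CRAWLING, gr8 stuff!</p>") returns "you do not_like crawl great stuff". Keeping each stage as an independent function means stages can be reordered or swapped without touching the others, which fits the component-independence requirement listed earlier.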
