5. § Public Dataset: Phishing Websites Dataset
§ Some examples
§ Using URL Shortening Services “TinyURL”:
§ bit.ly/19DXSk4
§ Age of Domain:
§ The minimum age of a legitimate domain is 6 months
§ Adding Prefix or Suffix Separated by (-) to the Domain:
§ http://www.confirme-paypal.com/
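The three example features above could be computed roughly as follows. This is a minimal sketch, not the dataset's actual extraction code: the shortener host list is a tiny hypothetical sample, and the encoding (1 for legitimate-looking, -1 for suspicious) is an assumption made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately small list of shortener hosts (assumption).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}


def _host(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    # urlparse only fills in the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def uses_shortener(url):
    """-1 if the URL points at a known shortening service, else 1."""
    return -1 if _host(url) in SHORTENER_HOSTS else 1


def has_hyphen_in_domain(url):
    """-1 if the domain contains a '-' (prefix/suffix trick), else 1."""
    return -1 if "-" in _host(url) else 1


def domain_age_feature(age_in_months):
    """1 if the domain is at least 6 months old, else -1."""
    return 1 if age_in_months >= 6 else -1
```

With these definitions, `uses_shortener("bit.ly/19DXSk4")` and `has_hyphen_in_domain("http://www.confirme-paypal.com/")` both return -1.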
8. § Offline
§ Retrain the model with historical data + new data
§ Model fits the global distribution of the data better
§ Can be impractical for large data sets
§ Online
§ Use each new observation to further train your model
§ Model is more influenced by the recent data
§ Adapts to new trends faster
§ Batch / Mini-batch
§ Wait for a batch of observations to further train your model
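The three update styles can be compared on one small model. The sketch below uses a plain-Python logistic regression trained with stochastic gradient descent; the class and function names are hypothetical, and using SGD for the offline retrain is an assumption made only to keep the three modes comparable.

```python
import math


class OnlineLogisticRegression:
    """Tiny logistic regression with SGD updates (illustrative only)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Online: one gradient step on a single observation (y is 0 or 1)."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error

    def update_batch(self, batch):
        """Mini-batch: one gradient step on the average gradient of the batch."""
        if not batch:
            return
        grad_w = [0.0] * len(self.weights)
        grad_b = 0.0
        for x, y in batch:
            error = self.predict_proba(x) - y
            grad_b += error
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
        n = len(batch)
        self.weights = [w - self.learning_rate * g / n
                        for w, g in zip(self.weights, grad_w)]
        self.bias -= self.learning_rate * grad_b / n


def retrain_offline(historical, new, n_features, epochs=50):
    """Offline: discard the old model and refit on historical + new data."""
    model = OnlineLogisticRegression(n_features)
    for _ in range(epochs):
        for x, y in historical + new:
            model.update(x, y)
    return model
```

The offline path revisits every stored observation on each retrain, which is why it becomes costly as the data grows; the online and mini-batch paths only touch the newly arrived data.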
13. § Online algorithms Pros:
§ Computationally much faster
§ Useful when dataset is too big
§ Adapt to new trends faster
§ Online algorithms Cons:
§ The majority of algorithms only work in batch mode
§ Some feature extractions are slow
§ IP Geo lookup
§ Hard to always get it right in an automated way
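One common way to soften the cost of a slow feature extraction such as an IP geo lookup is to cache results, so repeated IPs in the stream do not trigger repeated lookups. Caching is not mentioned on the slide; it is shown here only as one possible mitigation. `slow_geo_lookup`, its lookup table, and the region codes are hypothetical stand-ins for a real network or database call.

```python
from functools import lru_cache

# Records every uncached lookup so the effect of the cache is visible.
LOOKUP_CALLS = []

# Hypothetical table using documentation-only IP ranges and made-up region codes.
_FAKE_GEO_TABLE = {"192.0.2.1": "region-a", "198.51.100.7": "region-b"}


def slow_geo_lookup(ip):
    """Stand-in for a slow IP-to-location lookup (assumption, not a real API)."""
    LOOKUP_CALLS.append(ip)
    return _FAKE_GEO_TABLE.get(ip, "unknown")


@lru_cache(maxsize=10_000)
def cached_geo_lookup(ip):
    """Same result as slow_geo_lookup, but each distinct IP is looked up once."""
    return slow_geo_lookup(ip)
```

The trade-off is staleness: a cached answer can outlive a change in the underlying data, so the cache size and lifetime would need tuning for a real pipeline.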
20. § Challenges from data stream itself
§ How to handle time depends on the use case
§ Event time: reflects real life but is harder to implement
§ Challenges from your infrastructure
§ Exactly-once delivery is critical for accurate streaming analytical results
§ You would probably want that for your online model
§ Streaming gives you more timely results
§ Not everything needs to be real-time
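The difference between event time and processing time shows up as soon as an event arrives late. The sketch below counts events per one-minute tumbling window under both notions of time; the event records and field names are made up for illustration, and timestamps are plain integers in seconds.

```python
from collections import defaultdict


def window_start(timestamp, window_seconds):
    """Start of the tumbling window that contains the timestamp."""
    return timestamp - (timestamp % window_seconds)


def count_per_window(events, time_field, window_seconds=60):
    """Count events per window, keyed by the chosen time field."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[time_field], window_seconds)] += 1
    return dict(counts)


# Hypothetical events: event_time is when the click happened,
# processing_time is when the pipeline actually saw it.
events = [
    {"event_time": 10, "processing_time": 12},
    {"event_time": 50, "processing_time": 65},  # delayed by 15 seconds
    {"event_time": 70, "processing_time": 71},
]

by_event_time = count_per_window(events, "event_time")
by_processing_time = count_per_window(events, "processing_time")
```

Here `by_event_time` is `{0: 2, 60: 1}` while `by_processing_time` is `{0: 1, 60: 2}`: the delayed event is attributed to the wrong minute under processing time. Getting the event-time answer in a live stream requires holding windows open for late data, which is the part that is harder to implement.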
21. BATCH OR STREAMING?
§ Model Updating
§ Batch: improve accuracy every time you retrain the model
§ Online: adapt to new data points as they come in
§ Latency & Correctness
§ Batch: high latency but more control over the data
§ Streaming: low latency but less control
§ Monitoring
§ Maintainability
§ It is easier to maintain one pipeline rather than two
§ Lambda vs. Kappa
22. Data Science:
• Find the right features
• Get labeled data
  • Manual labeling
• Develop a model
Data Engineering:
• Guarantee “exactly once” for the streaming pipeline
• Tuning Spark
  • # of executors
  • Memory
• Event & Processing Time
• Spark Programming
  • Spark connector or Python connector
  • Use MLlib
23. § GitHub Repo for the example
§ https://github.com/keiraqz/StreamingLogisticRegression
§ Phishing Websites Data Set
§ https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#
§ Spark Streaming ML Algorithm
§ https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
§ The World Beyond Batch: Streaming 101
§ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
§ Lambda Architecture
§ http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
§ Kappa Architecture
§ https://www.oreilly.com/ideas/questioning-the-lambda-architectur