Counting Fast
(Part I)

Sergei Vassilvitskii
Columbia University
Computational Social Science
March 8, 2013
Computers are fast!

  Servers:
   – 3.5+ GHz

  Laptops:
   – 2.0-3.0 GHz

  Phones:
   – 1.0-1.5 GHz



  Overall: Executes billions of operations per second!




But Data is Big!

  Datasets are huge:
   – Social Graphs (Billions of nodes, each with hundreds of edges)
      • Terabytes (million million bytes)
   – Pictures, Videos, associated metadata:
      • Petabytes (million billion bytes!)




Computers are getting faster

  Moore’s law (1965!):
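   – Number of transistors on a chip doubles about every two years (Moore’s 1965 paper projected yearly doubling; he revised the rate to every two years in 1975).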



  For a few decades:
   – The speed of chips doubled every 24 months.


  Now:
   – The number of cores is doubling
   – Speed is staying roughly the same




But Data is Getting Even Bigger

  Unknown author, 1981 (?):
   – “640K ought to be enough for anyone”



  Eric Schmidt, March 2013:
   – “There were 5 exabytes of information created between the dawn of
     civilization through 2003, but that much information is now created
     every 2 days, and the pace is increasing.”




Data Sizes
  What is Big Data? (relative to hard drive capacity)
   – MB in the 1980s
   – GB in the 1990s
   – TB in the 2000s
   – PB in the 2010s




Working with Big Data

  Two datasets of numbers:
   – Want to find the intersection (common values)
   – Why?
      • Data cleaning (these are missing values)
      • Data mining (these are unique in some way)


   – How long should it take?
      • Each dataset has 10 numbers?
      • Each dataset has 10K numbers?
      • Each dataset has 10M numbers?
      • Each dataset has 10B numbers?
      • Each dataset has 10T numbers?




How to Find Intersections?




Idea 1: Scanning

 For each element in dataset 1, scan through dataset 2, see if it’s present


 common_elements = 0
 for number1 in dataset1:
    for number2 in dataset2:
       # one comparison per (number1, number2) pair
       if number1 == number2:
          common_elements += 1


 Analysis: How many times is the if statement executed?
 – |dataset2| times for every iteration of the outer loop
 – |dataset1| * |dataset2| times in total
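
 The count above can be checked on a toy input. This is a minimal sketch, not the lecture's code: the function name scan_intersection and the sample values are made up for illustration, and the values within each list are assumed to be distinct.

```python
def scan_intersection(dataset1, dataset2):
    """Count common elements by brute-force scan; also count comparisons."""
    common_elements = 0
    comparisons = 0
    for number1 in dataset1:
        for number2 in dataset2:
            comparisons += 1          # the if statement runs once per pair
            if number1 == number2:
                common_elements += 1
    return common_elements, comparisons


# Made-up sample data (values within each list are distinct).
dataset1 = [3, 8, 15, 42, 7]
dataset2 = [42, 1, 3, 99]
common, comparisons = scan_intersection(dataset1, dataset2)
print(common, comparisons)
```

 The comparison counter always ends at len(dataset1) * len(dataset2), regardless of how many values match.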




 Running time:
 – 100M * 100M = 10^16 comparisons in total
 – At 1B (10^9) comparisons / second
 – 10^7 seconds ~ 4 months!


 – Even with 1000 computers: 10^4 seconds ~ 2.8 hours!
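
 The arithmetic behind these estimates can be spelled out as a short calculation. It uses the slide's figures (100M numbers per dataset, 1B comparisons per second) and assumes, optimistically, that the work divides perfectly across 1000 machines; the variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

n = 100_000_000                 # 100M numbers per dataset
rate = 1_000_000_000            # 1B comparisons per second (the slide's figure)

comparisons = n * n                                 # 10^16 comparisons
seconds_one_machine = comparisons // rate           # 10^7 seconds
days_one_machine = seconds_one_machine / SECONDS_PER_DAY

machines = 1000                 # assumption: the work splits perfectly
seconds_cluster = seconds_one_machine // machines   # 10^4 seconds
hours_cluster = seconds_cluster / SECONDS_PER_HOUR

print(f"{comparisons:.0e} comparisons")
print(f"one machine: {days_one_machine:.0f} days")
print(f"{machines} machines: {hours_cluster:.1f} hours")
```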




Idea 2: Sorting

  Suppose both sets are sorted
   – Keep pointers to each
   – Check for match, increase the smaller pointer



  [Blackboard]




Idea 2: Sorting

sorted1 = sorted(dataset1)
sorted2 = sorted(dataset2)
pointer1, pointer2 = 0, 0
common_elements = 0
while pointer1 < len(sorted1) and pointer2 < len(sorted2):
   if sorted1[pointer1] == sorted2[pointer2]:
      # match: count it and advance both pointers
      common_elements += 1
      pointer1 += 1; pointer2 += 1
   elif sorted1[pointer1] < sorted2[pointer2]:
      pointer1 += 1   # advance the pointer at the smaller value
   else:
      pointer2 += 1

Analysis:
– How many times is the if statement executed?
– Every iteration advances at least one pointer: at most |dataset1|+|dataset2| iterations
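
The bound can be checked by counting loop iterations on a toy input. This is a sketch under assumptions: the function name sorted_intersection and the sample values are hypothetical, and values within each list are assumed distinct.

```python
def sorted_intersection(dataset1, dataset2):
    """Two-pointer intersection count; also counts while-loop iterations."""
    sorted1 = sorted(dataset1)
    sorted2 = sorted(dataset2)
    pointer1 = pointer2 = 0
    common_elements = 0
    iterations = 0
    while pointer1 < len(sorted1) and pointer2 < len(sorted2):
        iterations += 1               # every pass advances at least one pointer
        if sorted1[pointer1] == sorted2[pointer2]:
            common_elements += 1
            pointer1 += 1
            pointer2 += 1
        elif sorted1[pointer1] < sorted2[pointer2]:
            pointer1 += 1
        else:
            pointer2 += 1
    return common_elements, iterations


# Made-up sample data (values within each list are distinct).
dataset1 = [15, 3, 42, 8, 7]
dataset2 = [99, 42, 1, 3]
common, iterations = sorted_intersection(dataset1, dataset2)
print(common, iterations)
```

Because each pass moves at least one pointer forward and neither pointer moves back, the iteration count cannot exceed len(dataset1) + len(dataset2).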

Running time:
– At most 100M + 100M comparisons
– At 1B comparisons/second ~ 0.2 seconds
– Plus cost of sorting! ~1 second per list
– Total time = 2.2 seconds




Reasoning About Running Times (1)

  Worry about the computation as a function of input size:
  – “If I double my input size, how much longer will it take?”
     • Linear time (comparisons after sorting): twice as long!
     • Quadratic time (scan): four (2^2) times as long
     • Cubic time (very slow): eight (2^3) times as long
     • Exponential time (untenable)
     • Sublinear time (uses sampling, skips over input)
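
  The doubling question can be answered numerically for one representative cost function per class. This is an illustrative sketch: sqrt(n) stands in for sublinear time and 2^n for exponential time, both chosen here as assumptions rather than taken from the lecture.

```python
import math


def growth_when_doubled(cost, n):
    """Return cost(2n) / cost(n): how much longer a doubled input takes."""
    return cost(2 * n) / cost(n)


# Hypothetical cost functions (operation counts), one per growth class.
costs = {
    "sublinear (sqrt n)": lambda n: math.sqrt(n),
    "linear (n)": lambda n: n,
    "quadratic (n^2)": lambda n: n ** 2,
    "cubic (n^3)": lambda n: n ** 3,
    "exponential (2^n)": lambda n: 2 ** n,
}

n = 20
factors = {name: growth_when_doubled(cost, n) for name, cost in costs.items()}
for name, factor in factors.items():
    print(f"{name:>20}: {factor:,.2f}x")
```

  For the polynomial classes the factor is the same at every n. For 2^n the factor is 2^n itself, so it keeps growing as the input grows.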




Reasoning About Running Times (2)

  Worry about the computation as a function of input size.
  Worry about order of magnitude, not exact running time:
  – The difference between 2 seconds and 4 seconds is much smaller than
    the difference between 2 seconds and 3 months!
     • The sorting-based algorithm does more work in each pass of its while loop
       (but only a constant amount more): 3 comparisons instead of 1.
     • Therefore, we still call it linear time




Reasoning about running time

  Worry about the computation as a function of input size.
  Worry about order of magnitude, not exact running time.



  Captured by the Order notation: O(.)
  – For an input of size n, approximately how long will it take?
  – Scan: O(n^2)
  – Comparisons after sorting: O(n)
  – Sorting: O(n log n)
     • Slightly more than n,
     • But much less than n^2.
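
  To see how n log n sits between n and n^2, the three quantities can be tabulated for a few input sizes. A rough sketch: the logarithm is taken base 2 as an assumption, constant factors are ignored, and the function name operation_counts is made up for illustration.

```python
import math


def operation_counts(n):
    """Rough operation counts for input size n, ignoring constant factors."""
    return {
        "n": n,
        "n log n": n * math.log2(n),
        "n^2": n ** 2,
    }


for n in (10, 10_000, 10_000_000, 100_000_000):
    counts = operation_counts(n)
    print(f"n = {n:>11,}: n log n = {counts['n log n']:.2e}, n^2 = {counts['n^2']:.2e}")
```

  At n = 100M the n log n count is only a few dozen times n, while n^2 is 100 million times n.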




Avoiding Sort: Hashing

  Idea 3.
   – Store each number in list1 in a location unique to it
   – For each element in list2, check if its unique location is empty


  [Blackboard]




Idea 3: Hashing

  table = set()
  for number in dataset1:
     table.add(number)          # store each number from dataset1
  common_elements = 0
  for number in dataset2:
     if number in table:        # constant-time lookup on average
        common_elements += 1

  Analysis:
   – Number of additions to the table: |dataset1|
   – Number of comparisons: |dataset2|
   – If additions to the table and comparisons each run at 1B/second
   – Total running time is: 0.2s (for 100M numbers per dataset)
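
  The two counts in this analysis can be confirmed on a toy input. A minimal sketch, assuming Python's built-in set as the hash table; the function name hash_intersection and the sample values are hypothetical.

```python
def hash_intersection(dataset1, dataset2):
    """Hash-based intersection count; also counts table additions and lookups."""
    table = set()                 # Python's set is backed by a hash table
    additions = 0
    for number in dataset1:
        table.add(number)
        additions += 1
    common_elements = 0
    lookups = 0
    for number in dataset2:
        lookups += 1
        if number in table:
            common_elements += 1
    return common_elements, additions, lookups


# Made-up sample data (values within each list are distinct).
dataset1 = [3, 8, 15, 42, 7]
dataset2 = [42, 1, 3, 99]
common, additions, lookups = hash_intersection(dataset1, dataset2)
print(common, additions, lookups)
```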




Lots of Details

  Hashing, Sorting, Scanning:
   – All have their advantages
   – Scanning: in place, just passing through the data
   – Sorting: in place (no extra storage), much faster
   – Hashing: not in place, even faster
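
  A small experiment can put the three approaches side by side. This is only a sketch: the helper names, the random test data, and the input size are assumptions, and timings on an interpreter at this scale say little about behavior at 100M elements. Values within each list are distinct, so all three counts should agree.

```python
import random
import time


def scan_count(a, b):
    # Quadratic: compare every pair.
    return sum(1 for x in a for y in b if x == y)


def sort_count(a, b):
    # Sort both lists, then walk them with two pointers.
    a, b = sorted(a), sorted(b)
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def hash_count(a, b):
    # Needs extra memory for the table, so it is not in place.
    table = set(a)
    return sum(1 for y in b if y in table)


random.seed(0)
n = 1_000
dataset1 = random.sample(range(10 * n), n)   # distinct values
dataset2 = random.sample(range(10 * n), n)

results = {}
for name, method in [("scan", scan_count), ("sort", sort_count), ("hash", hash_count)]:
    start = time.perf_counter()
    results[name] = method(dataset1, dataset2)
    elapsed = time.perf_counter() - start
    print(f"{name}: {results[name]} common values in {elapsed:.4f} s")
```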


  Reasoning about algorithms:
   – Non-trivial (and hard!)
   – A large part of computer science
   – Luckily, mostly abstracted away




Break




Distributed Computation

  Working with large datasets:
  – Most datasets are skewed
  – A few keys are responsible for most of the data
  – Must take skew into account, since averages are misleading
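
  A made-up example shows why the average misleads on skewed data. The per-key record counts below are invented for illustration (two heavy keys and 998 light ones) and do not come from any dataset in the lecture.

```python
import statistics

# Hypothetical number of records per key: a few heavy keys, many light ones.
records_per_key = [1_000_000, 250_000] + [10] * 998

total = sum(records_per_key)
mean = statistics.mean(records_per_key)
median = statistics.median(records_per_key)
top_two_share = sum(sorted(records_per_key, reverse=True)[:2]) / total

print(f"mean records per key:     {mean:,.1f}")
print(f"median records per key:   {median:,.1f}")
print(f"share held by top 2 keys: {top_two_share:.1%}")
```

  The mean is more than a hundred times the median here, so a machine that happens to receive one heavy key does far more work than the average suggests.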




Additional Cost

  Communication cost
   – Prefer to do more on a single machine (even if it’s doing more work) rather
     than constantly communicating


   – Why? If you have 1000 machines talking to 1000 machines, that’s
     1M channels of communication
   – The overall communication cost grows quadratically, which we have
     seen does not scale...
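
  The quadratic growth is easy to tabulate. This sketch follows the slide's 1000 x 1000 count, that is, every sender paired with every receiver; the function name communication_channels is an assumption for illustration.

```python
def communication_channels(machines):
    """Sender-to-receiver pairs when every machine talks to every machine."""
    return machines * machines


for machines in (10, 100, 1000):
    print(f"{machines:>5} machines -> {communication_channels(machines):>9,} channels")
```

  Doubling the number of machines quadruples the number of channels.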




Analysis at Scale




Doing the study

  Suppose you had the data available. What would you do?


  If you have a hypothesis:
   – “Taking both Drug A and Drug B causes a side effect C”?




  Look at the ratio of observed symptoms over expected:
   – Expected: fraction of people who took drug A and saw effect C.
   – Observed: fraction of people who took drugs A and B and saw effect C.

  This is just counting!

  [Figure: diagram labeled with drugs A and B and side effect C]
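
  The observed-over-expected ratio can be computed with a handful of counts. This is a sketch on invented records, not real patient data. It reads "fraction of people who took drug A and saw effect C" as the share of drug A takers who saw C, which is an assumed interpretation of the slide.

```python
# Hypothetical patient records: (took_A, took_B, saw_C).
records = [
    (True, True, True), (True, True, True), (True, True, False),
    (True, False, True), (True, False, False), (True, False, False),
    (True, False, False), (False, True, False), (False, False, False),
]

took_a = [r for r in records if r[0]]
took_a_and_b = [r for r in records if r[0] and r[1]]

# Expected: fraction of people who took drug A and saw effect C.
expected = sum(1 for r in took_a if r[2]) / len(took_a)
# Observed: fraction of people who took drugs A and B and saw effect C.
observed = sum(1 for r in took_a_and_b if r[2]) / len(took_a_and_b)

ratio = observed / expected
print(f"expected={expected:.3f} observed={observed:.3f} ratio={ratio:.2f}")
```

  A ratio well above 1 would flag the pair for closer study. On its own it does not establish that the combination causes the side effect.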




Doing the study

  Suppose you had the data available. What would you do?


  Discovering hypotheses to test:
   – Many pairs of drugs, some co-occur very often
   – Some side effects are already known




                                   33                Sergei Vassilvitskii

Mais conteúdo relacionado

Destaque

Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIjakehofman
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part Ijakehofman
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part Ijakehofman
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIjakehofman
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overviewjakehofman
 
Sentidos pablo j
Sentidos pablo jSentidos pablo j
Sentidos pablo jrosayago
 
Netegem el nostre pati
Netegem el nostre patiNetegem el nostre pati
Netegem el nostre patiPabloLopez9716
 
Ejercicios.especificacion 01 29
Ejercicios.especificacion 01 29Ejercicios.especificacion 01 29
Ejercicios.especificacion 01 29jgguevara2010
 
La anorexia. vera
La anorexia. veraLa anorexia. vera
La anorexia. verarosayago
 
Material De Laboratorio
Material De LaboratorioMaterial De Laboratorio
Material De Laboratorioguest12be2d8
 
Tic ted bravo nicolás
Tic ted bravo nicolásTic ted bravo nicolás
Tic ted bravo nicolásNico Bravo
 

Destaque (20)

Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part IIComputational Social Science, Lecture 06: Networks, Part II
Computational Social Science, Lecture 06: Networks, Part II
 
Computational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part IComputational Social Science, Lecture 05: Networks, Part I
Computational Social Science, Lecture 05: Networks, Part I
 
Computational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part IComputational Social Science, Lecture 03: Counting at Scale, Part I
Computational Social Science, Lecture 03: Counting at Scale, Part I
 
Computational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part IIComputational Social Science, Lecture 04: Counting at Scale, Part II
Computational Social Science, Lecture 04: Counting at Scale, Part II
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Counting
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Sentidos pablo j
Sentidos pablo jSentidos pablo j
Sentidos pablo j
 
Grupos en Google
Grupos en GoogleGrupos en Google
Grupos en Google
 
Sistema solar
Sistema solarSistema solar
Sistema solar
 
Netegem el nostre pati
Netegem el nostre patiNetegem el nostre pati
Netegem el nostre pati
 
лабар7
лабар7лабар7
лабар7
 
Ejercicios.especificacion 01 29
Ejercicios.especificacion 01 29Ejercicios.especificacion 01 29
Ejercicios.especificacion 01 29
 
La anorexia. vera
La anorexia. veraLa anorexia. vera
La anorexia. vera
 
Material De Laboratorio
Material De LaboratorioMaterial De Laboratorio
Material De Laboratorio
 
Tic ted bravo nicolás
Tic ted bravo nicolásTic ted bravo nicolás
Tic ted bravo nicolás
 
Making Data Work Better
Making Data Work BetterMaking Data Work Better
Making Data Work Better
 

Semelhante a Computational Social Science, Lecture 07: Counting Fast, Part I

Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)
Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)
Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)EUDAT
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...huguk
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysisData Science London
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...NoSQLmatters
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big datajins0618
 
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Codemotion
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
World widetelescopetecfest
World widetelescopetecfestWorld widetelescopetecfest
World widetelescopetecfestPREMKUMAR
 
Neural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep LearningNeural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep Learningcomifa7406
 
2.03.Asymptotic_analysis.pptx
2.03.Asymptotic_analysis.pptx2.03.Asymptotic_analysis.pptx
2.03.Asymptotic_analysis.pptxssuser1fb3df
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Grigory Yaroslavtsev
 
Telling Stories With Data: Class 1
Telling Stories With Data: Class 1Telling Stories With Data: Class 1
Telling Stories With Data: Class 1David Newbury
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchSylvain Wallez
 

Semelhante a Computational Social Science, Lecture 07: Counting Fast, Part I (20)

Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)
Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)
Data streaming fundamentals- EUDAT Summer School (Giuseppe Fiameni, CINECA)
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Bringing back the excitement to data analysis
Bringing back the excitement to data analysisBringing back the excitement to data analysis
Bringing back the excitement to data analysis
 
Data structures
Data structuresData structures
Data structures
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
Uwe Friedrichsen – Extreme availability and self-healing data with CRDTs - No...
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big data
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
 
World widetelescopetecfest
World widetelescopetecfestWorld widetelescopetecfest
World widetelescopetecfest
 
Neural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep LearningNeural Networks for Machine Learning and Deep Learning
Neural Networks for Machine Learning and Deep Learning
 
2.03.Asymptotic_analysis.pptx
2.03.Asymptotic_analysis.pptx2.03.Asymptotic_analysis.pptx
2.03.Asymptotic_analysis.pptx
 
Lecture 1 (bce-7)
Lecture   1 (bce-7)Lecture   1 (bce-7)
Lecture 1 (bce-7)
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)
 
Telling Stories With Data: Class 1
Telling Stories With Data: Class 1Telling Stories With Data: Class 1
Telling Stories With Data: Class 1
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 

Mais de jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classificationjakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationjakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in Rjakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbitjakehofman
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10jakehofman
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09jakehofman
 
Using Data to Understand the Brain
Using Data to Understand the BrainUsing Data to Understand the Brain
Using Data to Understand the Brainjakehofman
 

Mais de jakehofman (17)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale



Computational Social Science, Lecture 07: Counting Fast, Part I

  • 1. Counting Fast (Part I) Sergei Vassilvitskii Columbia University Computational Social Science March 8, 2013
  • 2. Computers are fast! Servers: – 3.5+ GHz Laptops: – 2.0-3 GHz Phones: – 1.0-1.5 GHz Overall: Executes billions of operations per second!
  • 3. But Data is Big! Datasets are huge: – Social Graphs (Billions of nodes, each with hundreds of edges) • Terabytes (million million bytes) – Pictures, Videos, associated metadata: • Petabytes (million billion bytes!)
  • 4. Computers are getting faster Moore’s law (1965!): – Number of transistors on a chip doubles every two years.
  • 5. Computers are getting faster Moore’s law (1965!): – Number of transistors on a chip doubles every two years. For a few decades: – The speed of chips doubled every 24 months. Now: – The number of cores doubling – Speed staying roughly the same
  • 6. But Data is Getting Even Bigger Unknown author, 1981 (?): – “640K ought to be enough for anyone” Eric Schmidt, March 2013: – “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing.”
  • 7. Data Sizes What is Big Data? – MB in 1980s – GB in 1990s – TB in 2000s – PB in 2010s [Figure: Hard Drive Capacity]
  • 8. Working with Big Data Two datasets of numbers: – Want to find the intersection (common values) – Why? • Data cleaning (these are missing values) • Data mining (these are unique in some way)
  • 9. Working with Big Data Two datasets of numbers: – Want to find the intersection (common values) – Why? • Data cleaning (these are missing values) • Data mining (these are unique in some way) – How long should it take? • Each dataset has 10 numbers? • Each dataset has 10k numbers? • Each dataset has 10M numbers? • Each dataset has 10B numbers? • Each dataset has 10T numbers?
  • 10. How to Find Intersections?
  • 11. Idea 1: Scan Look at every number in list 1: – Scan through dataset 2, see if you find a match
      common_elements = 0
      for number1 in dataset1:
          for number2 in dataset2:
              if number1 == number2:
                  common_elements += 1
  • 12. Idea 1: Scanning For each element in dataset 1, scan through dataset 2, see if it’s present
      common_elements = 0
      for number1 in dataset1:
          for number2 in dataset2:
              if number1 == number2:
                  common_elements += 1
    Analysis: Number of times if statement executed? – |dataset2| for every iteration of outer loop – |dataset1| * |dataset2| in total
  • 13. Idea 1: Scanning Analysis: Number of times if statement executed? – |dataset2| for every iteration of outer loop – |dataset1| * |dataset2| in total Running time: – 100M * 100M = 10^16 comparisons in total – At 1B (10^9) comparisons / second
  • 14. Idea 1: Scanning Analysis: Number of times if statement executed? – |dataset2| for every iteration of outer loop – |dataset1| * |dataset2| in total Running time: – 100M * 100M = 10^16 comparisons in total – At 1B (10^9) comparisons / second – 10^7 seconds ~ 4 months! – Even with 1000 computers: 10^4 seconds, about 2.8 hours!
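The scan and its back-of-the-envelope cost can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function names are hypothetical, and the default rate of 10^9 comparisons per second is the slide's assumption, not a measurement.

```python
def count_common_scan(dataset1, dataset2):
    """Compare every element of dataset1 with every element of dataset2."""
    common_elements = 0
    comparisons = 0  # counts how often the `if` statement runs
    for number1 in dataset1:
        for number2 in dataset2:
            comparisons += 1
            if number1 == number2:
                common_elements += 1
    return common_elements, comparisons


def estimated_scan_seconds(n1, n2, comparisons_per_second=10**9, machines=1):
    """Estimate scan time, assuming work splits evenly across machines."""
    return n1 * n2 / (comparisons_per_second * machines)


common, comparisons = count_common_scan([1, 2, 3, 4], [3, 4, 5])
single_machine = estimated_scan_seconds(10**8, 10**8)
thousand_machines = estimated_scan_seconds(10**8, 10**8, machines=1000)
```

With 4 and 3 elements the `if` statement runs 4 * 3 = 12 times, which is the |dataset1| * |dataset2| count from the analysis.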
  • 15. Idea 2: Sorting Suppose both sets are sorted – Keep pointers to each – Check for match, increase the smaller pointer [Blackboard]
  • 16. Idea 2: Sorting
      sorted1 = sorted(dataset1)
      sorted2 = sorted(dataset2)
      pointer1 = pointer2 = 0
      common_elements = 0
      while pointer1 < len(sorted1) and pointer2 < len(sorted2):
          if sorted1[pointer1] == sorted2[pointer2]:
              common_elements += 1
              pointer1 += 1; pointer2 += 1
          elif sorted1[pointer1] < sorted2[pointer2]:
              pointer1 += 1
          else:
              pointer2 += 1
    Analysis: – Number of times if statement executed? – At least one pointer advances each time: at most |dataset1|+|dataset2|
  • 17. Idea 2: Sorting Analysis: – Number of times if statement executed? – At least one pointer advances each time: at most |dataset1|+|dataset2| Running time: – At most 100M + 100M comparisons – At 1B comparisons/second ~ 0.2 seconds – Plus cost of sorting! ~1 second per list – Total time = 2.2 seconds
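A runnable sketch of the two-pointer pass over sorted lists, assuming the values within each dataset are distinct (the slides do not say how duplicates should be counted). The `steps` counter is an addition for illustration, used to check the |dataset1| + |dataset2| bound.

```python
def count_common_sorted(dataset1, dataset2):
    """Sort both datasets, then walk them in step with two pointers."""
    sorted1 = sorted(dataset1)
    sorted2 = sorted(dataset2)
    pointer1 = pointer2 = 0
    common_elements = 0
    steps = 0  # loop iterations; at most len(sorted1) + len(sorted2)
    while pointer1 < len(sorted1) and pointer2 < len(sorted2):
        steps += 1
        if sorted1[pointer1] == sorted2[pointer2]:
            common_elements += 1
            pointer1 += 1
            pointer2 += 1
        elif sorted1[pointer1] < sorted2[pointer2]:
            pointer1 += 1  # the smaller value cannot match anything later
        else:
            pointer2 += 1
    return common_elements, steps


common, steps = count_common_sorted([5, 1, 3, 9], [9, 2, 3, 7])
```

Every iteration advances at least one pointer, so the loop cannot run more than eight times on these two four-element lists.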
  • 18. Reasoning About Running Times (1) Worry about the computation as a function of input size: – “If I double my input size, how much longer will it take?” • Linear time (comparisons after sorting): twice as long! • Quadratic time (scan): four (2^2) times as long • Cubic time (very slow): eight (2^3) times as long • Exponential time (untenable) • Sublinear time (uses sampling, skips over input)
  • 19. Reasoning About Running Times (2) Worry about the computation as a function of input size. Worry about order of magnitude, not exact running time: – Difference between 2 seconds and 4 seconds much smaller than between 2 seconds and 3 months! • The sorted-list comparison does more work in its while loop than the scan does per step (but only a constant more work): 3 comparisons instead of 1. • Therefore, still call it linear time
  • 20. Reasoning about running time Worry about the computation as a function of input size. Worry about order of magnitude, not exact running time. Captured by the Order notation: O(.) – For an input of size n, approximately how long will it take? – Scan: O(n^2) – Comparisons after sorted: O(n)
  • 21. Reasoning about running time Worry about the computation as a function of input size. Worry about order of magnitude, not exact running time. Captured by the Order notation: O(.) – For an input of size n, approximately how long will it take? – Scan: O(n^2) – Comparisons after sorted: O(n) – Sorting = O(n log n) • Slightly more than n, • But much less than n^2.
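The "double the input" question can be checked numerically. The sketch below is an illustration under simple assumptions: each cost function is an idealized operation count with constants dropped, and the names `GROWTH_RATES` and `doubling_factor` are hypothetical.

```python
import math

# Idealized operation counts for an input of size n (constants ignored).
GROWTH_RATES = {
    "linear, O(n)": lambda n: n,
    "sorting, O(n log n)": lambda n: n * math.log2(n),
    "quadratic, O(n^2)": lambda n: n ** 2,
    "cubic, O(n^3)": lambda n: n ** 3,
}


def doubling_factor(cost, n):
    """How many times more work is needed when the input grows from n to 2n."""
    return cost(2 * n) / cost(n)


factors = {name: doubling_factor(cost, 10**6) for name, cost in GROWTH_RATES.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
```

The n log n factor comes out just above 2, which matches the slide's description of sorting as slightly more than n but much less than n^2.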
  • 22. Avoiding Sort: Hashing Idea 3. – Store each number in list1 in a location unique to it – For each element in list2, check if its unique location is empty [Blackboard]
  • 23. Idea 3: Hashing
      table = set()
      for i in range(len(dataset1)):
          table.add(dataset1[i])
      common_elements = 0
      for i in range(len(dataset2)):
          if dataset2[i] in table:
              common_elements += 1
    Analysis: – Number of additions to the table: |dataset1| – Number of lookups: |dataset2| – If additions to the table and lookups run at 1B/second – Total running time is: 0.2s
  • 24. Lots of Details Hashing, Sorting, Scanning: – All have their advantages – Scanning: in place, just passing through the data – Sorting: in place (no extra storage), much faster – Hashing: not in place, even faster
  • 25. Lots of Details Hashing, Sorting, Scanning: – All have their advantages – Scanning: in place, just passing through the data – Sorting: in place (no extra storage), much faster – Hashing: not in place, even faster Reasoning about algorithms: – Non trivial (and hard!) – A large part of computer science – Luckily mostly abstracted
  • 26. Break
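A short sketch of the hashing idea using Python's built-in `set`, which is implemented as a hash table. The function name is hypothetical, and the sketch assumes each value in `dataset2` should be counted once per occurrence.

```python
def count_common_hashed(dataset1, dataset2):
    """Store dataset1 in a hash table, then look up each element of dataset2."""
    table = set()
    for number in dataset1:
        table.add(number)      # |dataset1| additions
    common_elements = 0
    for number in dataset2:
        if number in table:    # |dataset2| lookups, constant time on average
            common_elements += 1
    return common_elements


common = count_common_hashed([5, 1, 3, 9], [9, 2, 3, 7])
```

The table holds a copy of every distinct value in `dataset1`, which is the extra storage the slides mean by "not in place".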
  • 27. Distributed Computation Working with large datasets: – Most datasets are skewed – A few keys are responsible for most of the data – Must take skew into account, since averages are misleading
  • 28. Additional Cost Communication cost – Prefer doing more on a single machine (even if it does more work) over constantly communicating – Why? If you have 1000 machines talking to 1000 machines, that’s 1M channels of communication – The overall communication cost grows quadratically, which we have seen does not scale...
  • 29. Analysis at Scale
  • 30. Doing the study Suppose you had the data available. What would you do? If you have a hypothesis: – “Taking both Drug A and Drug B causes a side effect C”?
  • 31. Doing the study If you have a hypothesis: – “Taking both Drug A and Drug B causes a side effect C”? Look at the ratio of observed symptoms over expected - Expected: fraction of people who took drug A and saw effect C. - Observed: fraction of people who took drugs A and B and saw effect C. [Figure: diagram of groups A, B, and C]
  • 32. Doing the study If you have a hypothesis: – “Taking both Drug A and Drug B causes a side effect C”? Look at the ratio of observed symptoms over expected - Expected: fraction of people who took drug A and saw effect C. - Observed: fraction of people who took drugs A and B and saw effect C. This is just counting! [Figure: diagram of groups A, B, and C]
  • 33. Doing the study Suppose you had the data available. What would you do? Discovering hypotheses to test: – Many pairs of drugs, some co-occur very often – Some side effects are already known
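The observed-over-expected ratio reduces to a few counts, as sketched below. Several things here are assumptions and not from the slides: "fraction" is read as the share of people who took the drug(s) and then reported the effect, each record is a (drugs taken, effects seen) pair, and the records themselves are made-up toy data.

```python
def fraction_with_effect(records, drugs, effect):
    """Among people who took all of `drugs`, the fraction who saw `effect`."""
    effects_seen = [effects for taken, effects in records if drugs <= taken]
    if not effects_seen:
        return None  # nobody took this combination
    return sum(effect in effects for effects in effects_seen) / len(effects_seen)


def observed_over_expected(records, drug_a, drug_b, effect):
    """Ratio of the effect rate under A and B together to the rate under A."""
    expected = fraction_with_effect(records, {drug_a}, effect)
    observed = fraction_with_effect(records, {drug_a, drug_b}, effect)
    if not expected or observed is None:
        return None  # undefined when the expected rate is zero or missing
    return observed / expected


# Hypothetical toy records: (set of drugs taken, set of effects seen).
records = [
    ({"A"}, set()),
    ({"A"}, set()),
    ({"A", "B"}, {"C"}),
    ({"A", "B"}, {"C"}),
    ({"B"}, set()),
]
ratio = observed_over_expected(records, "A", "B", "C")
```

In this toy data, 2 of the 4 people who took A saw C, while both people who took A and B saw C, so the ratio is 1.0 / 0.5.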