This document discusses machine learning and large data sets. It covers types of algorithms, including deterministic and adaptive models. It describes the enormous growth of digital data and challenges of data analysis. Machine learning is defined, and applications like credit analysis, autonomous vehicles and medical diagnosis are mentioned. Analyzing large data sets through manual and automated methods like data mining is also discussed. Examples of large data set analysis include images, videos and medical areas like mammography and colonoscopy. The conclusion is that computational analysis of large amounts of data presents opportunities.
1 Types of algorithms
1. Deterministic (or classical, conventional)
2. Adaptive (or stochastic, "advanced")
1.1 Deterministic
• Collision detection
• Prime factorization
• Inversion of (sparse) matrices
• Sorting (quicksort, mergesort)
• PageRank
• Slightly more advanced
– A*
– "Game tree"
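As an illustration of what "deterministic" means in the list above, here is a minimal sketch of mergesort in Python: the same input always goes through the same steps and yields the same output. The function name `merge_sort` is an illustrative choice, not something taken from the original slides.

```python
def merge_sort(items):
    """Return a new sorted list; the input list is left unchanged."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    left = merge_sort(items[:middle])    # sort each half recursively
    right = merge_sort(items[middle:])
    merged = []
    i = j = 0
    # Merge step: repeatedly take the smaller head element of the two halves.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two is non-empty
    merged.extend(right[j:])
    return merged


example = merge_sort([5, 2, 9, 1, 5, 6])
print(example)
```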
1.1.1 "Game tree"
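• Tic-tac-toe
– What is the total number of possibilities?
∗ 9 × 8 × . . . × 1 = 9! = 362,880 (an upper bound: it counts every ordering of the nine squares, although a game stops as soon as one player wins)
• Checkers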
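The 9! figure for tic-tac-toe counts every ordering of the nine squares. The sketch below, assuming a game ends as soon as a player completes a line or the board is full, enumerates the actual game tree and compares the two numbers. The board representation and the function names are illustrative assumptions.

```python
from math import factorial

# The eight winning lines on a 3x3 board whose squares are indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def has_winner(board):
    """Return True if either player has completed a line."""
    return any(board[a] is not None and board[a] == board[b] == board[c]
               for a, b, c in LINES)


def count_games(board, player):
    """Count the leaves (finished games) of the game tree below a position."""
    if has_winner(board) or all(cell is not None for cell in board):
        return 1  # terminal position: exactly one finished game
    total = 0
    for square in range(9):
        if board[square] is None:
            board[square] = player
            total += count_games(board, "O" if player == "X" else "X")
            board[square] = None  # undo the move before trying the next one
    return total


upper_bound = factorial(9)
complete_games = count_games([None] * 9, "X")
print(upper_bound, complete_games)
```

Under these rules the enumeration finds 255,168 complete games, below the 362,880 upper bound, because many games end before the ninth move.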
• What is the total number of possibilities in chess?¹
– For a depth P and a branching factor R, the number of possible nodes N can be computed with the formula
$N = R^P$
– The average chess game lasts 50 moves, that is, 100 plies: 50 played by White and 50 by Black.
– Since the branching factor is about 35 on average, the number of nodes in the tree for one game can be estimated as $N = 35^{100} \approx 2.55155207 \times 10^{154}$.
– If a computer examined two million positions per second, it would need more than $5.3 \times 10^{109}$ years to exhaust the whole tree (dividing N by $2 \times 10^{6}$ positions per second gives roughly $4 \times 10^{140}$ years).
• This raises the famous question: what is an "intelligent" program?
• Who remembers the man (Garry Kasparov) versus machine (IBM Deep Blue) match [7, 8]?
• One more question: is chess, in this sense, the "hardest" game ever created by man?
¹ Answer obtained from the internet.
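The chess estimate can be checked with exact integer arithmetic. The sketch below uses only the figures given in the text (branching factor 35, depth 100 plies, two million positions per second); the 365-day year and the helper `sci` are assumptions made for this illustration.

```python
R = 35                                   # average branching factor (from the text)
P = 100                                  # plies in an average game (from the text)
POSITIONS_PER_SECOND = 2_000_000         # search speed (from the text)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # assumption: a 365-day year

N = R ** P                               # exact integer, N = R^P
seconds = N // POSITIONS_PER_SECOND      # time to visit every node once
years = seconds // SECONDS_PER_YEAR


def sci(n, digits=3):
    """Format a large positive integer as 'mantissa x 10^exponent' (truncated)."""
    text = str(n)
    mantissa = int(text[:digits]) / 10 ** (digits - 1)
    return f"{mantissa} x 10^{len(text) - 1}"


print("nodes:  ", sci(N))
print("seconds:", sci(seconds))
print("years:  ", sci(years))
```

With these inputs the number of years has 141 digits, that is, on the order of 10^140, which is consistent with (and far above) the "more than 5.3 × 10^109 years" bound.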
• Go [11]
• See also [9, 10]
• There are signs of hope [12]
1.2 Adaptive
• What is an "intelligent" program?
• Is it a program "that learns"?
1.2.1 Some examples
• Face recognition
• Credit analysis
• Autonomous navigation
• Medical diagnosis
• Financial projection (forecasting)
• Recommender systems
• Logistics
• Text processing
– Spam
– News
– Plagiarism
• Machine learning
– Supervised (learns from examples), which has 2 phases: training and operation
∗ NN
∗ Classification (Linear Discriminant - LD)
∗ Regression [66, 67]
– Unsupervised (learns on its own), which has only the operation phase
∗ Cluster analysis (K-means clustering)
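To make the unsupervised case concrete, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. No labels are supplied: the algorithm groups the points on its own. The sample points, the choice of the first k points as initial centroids, and all function names are assumptions made for this illustration.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mean(cluster):
    """Centroid (coordinate-wise mean) of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)


def kmeans(points, k, iterations=20):
    """Group 2-D points into k clusters; returns (centroids, clusters)."""
    # Assumption: the first k points act as initial centroids. Practical
    # implementations usually choose the starting centroids at random.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [mean(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Made-up sample data: two visually separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```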
2.1 Data centers
• Google [73]
• Facebook [72]
2.2 Data processing
• What should be done with these data? Just store them? Index them?
• Or should useful information be extracted?
3 Machine Learning
• Definition of Machine Learning (ML): see [38]
• Another definition of ML: see [39]
• On Support Vector Machines (SVM): see [38, 51]
– Support vector machines represent a powerful new class of models invented by Vladimir Vapnik in the early 1990s
• 3 examples of ML applications [39]
3.1 Typical data mining task
• Credit risk analysis
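Credit risk analysis can be framed as supervised classification with the two phases mentioned earlier, training and operation. The sketch below trains a linear discriminant with the perceptron rule. This is one possible technique chosen for illustration, not necessarily the one used in the cited material. The features (normalized income, debt ratio), the six labeled applicants, and the function names are all made up for this example.

```python
def score(w, b, x):
    """Value of the linear discriminant w . x + b for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train(samples, labels, epochs=100, rate=0.1):
    """Training phase: fit weights w and bias b with the perceptron rule."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            predicted = 1 if score(w, b, x) > 0 else 0
            if predicted != y:
                # Nudge the decision boundary toward the misclassified example.
                step = rate * (y - predicted)
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step
                errors += 1
        if errors == 0:
            break  # every training example is classified correctly
    return w, b


def classify(w, b, x):
    """Operation phase: apply the learned discriminant to a new applicant."""
    return 1 if score(w, b, x) > 0 else 0


# Hypothetical training set: (normalized income, debt ratio) -> 1 = good risk.
samples = [(0.9, 0.1), (0.8, 0.3), (0.7, 0.2), (0.2, 0.8), (0.3, 0.9), (0.1, 0.6)]
labels = [1, 1, 1, 0, 0, 0]

w, b = train(samples, labels)
print(classify(w, b, (0.6, 0.4)), classify(w, b, (0.25, 0.7)))
```

The perceptron rule only converges when the classes are linearly separable, which holds for this toy data; real credit data would normally call for a more robust method.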
3.2 Problems too hard to be programmed
• The DARPA Grand Challenge competition: urban version [42, 43, 44, 45]
• The Google Car experience [41]
• A few more details
• A small problem?
• Other references [52, 54, 55]
4 Large data sets
• Data analysis
– Manual
– Automatic
4.1 Other related topics
• Data mining
– Manual
∗ Visual data mining [63]
– Automatic
4.2 Examples
• Credit risk analysis
• The IBM Watson experience [40, 46, 47]
4.2.2 Images
• Content-based access [13, 14, 15, 16, 17, 20, 21, 24]
• PhotoLib [19]
• Games with a purpose (GWAP) [18, 26]
• Pixazza → Luminate
• Semantics [22, 23]
• Learning [23, 25]
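Content-based access means retrieving images by comparing their visual content rather than textual tags. A minimal sketch, assuming gray-level histograms as the content descriptor and the L1 distance as the similarity metric (both are illustrative choices, as are the tiny made-up "images" and the function names):

```python
def histogram(image, bins=4, levels=256):
    """Normalized gray-level histogram of an image given as rows of pixel values."""
    pixels = [p for row in image for p in row]
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // levels] += 1   # map 0..levels-1 onto 0..bins-1
    return [c / len(pixels) for c in counts]


def l1_distance(h1, h2):
    """Sum of absolute differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def rank(query, database):
    """Return database image names ordered from most to least similar to the query."""
    q = histogram(query)
    return sorted(database,
                  key=lambda name: l1_distance(q, histogram(database[name])))


# Hypothetical 2x2 gray-scale images (pixel values 0..255).
database = {
    "dark": [[10, 20], [30, 40]],
    "bright": [[200, 220], [240, 250]],
    "mixed": [[10, 250], [20, 240]],
}
query = [[15, 25], [35, 60]]

ranking = rank(query, database)
print(ranking)
```

Real systems use richer descriptors (color, texture, shape) and indexing structures, but the query-by-example pattern is the same: describe, compare, rank.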
4.2.3 Videos
• Analysis
4.2.4 Medical area
• Mammography
• Colonoscopy [30, 31, 35]
– The generations of computed tomography equipment
– Types: conventional and "virtual" - advantages and drawbacks / limitations
– Simple visualization [29]
• Display modes for CT colonography [32, 33]
• Computer-Aided Diagnosis (CAD): detecting polyps at CT colonography [34]
• Quantification of Distention in CT Colonography [36]
• Computerized Detection of Colonic Polyps at CT Colonography [37]
5 Conclusions
• Computational processing of large amounts of data is an opportunity, according to the consulting firm McKinsey [27, 28]
References
[1] Richard G. Baraniuk. More is less: Signal processing and the data deluge. Science, 331:717–719,
February 2011.
[2] Peter Fox and James Hendler. Changing the equation on scientific data visualization. Science,
331:705–708, February 2011.
[3] Trudie Lang. Advancing global health research through digital technology and sharing data. Science,
331:714–717, February 2011.
[4] Kenneth Cukier. Data, data everywhere. The Economist, February 2010.
[5] Philip Ross. The meaning of computers and chess. Spectrum, March 2003.
[6] Philip E. Ross. Silicon shows its mettle. Spectrum, 40(3):24–26, March 2003.
[7] Feng-Hsiung Hsu. Behind Deep Blue. Princeton University Press, 2002.
[8] Yasser Seirawan, Herbert A. Simon, and Toshinori Munakata. The implications of Kasparov vs. Deep
Blue. Communications of the ACM, 40(8):21–25, August 1997.
[9] John A. Bate. A beginners introduction to Go. Technical report, Department of Computer Science,
University of Manitoba, 1997.
[10] James Hendler. Computers play chess; humans play Go. IEEE Intelligent Systems, 21(4):2–3,
July/August 2006.
[11] Feng-Hsiung Hsu. Cracking Go. Spectrum, 44(10):44–49, October 2007.
[12] Kirk L. Kroeker. A new benchmark for artificial intelligence. Communications of the ACM, 54(8):13–
15, August 2011.
[13] Charles A. Jacobs, Adam Finkelstein, and David H. Salesin. Fast multiresolution image querying.
Computer Graphics, pages 277 – 286, August 6 – 11 1995. ACM SIGGRAPH Annual Conference
Series.
[14] Everest Mathias and Aura Conci. Comparing the influence of color spaces and metrics in content-based
image retrieval. In Luciano da Fontoura Costa and Gilberto Câmara, editors, SIBGRAPI 98
– International Symposium on Computer Graphics, Image Processing and Vision, pages 371 – 378,
Rio de Janeiro – RJ, October 20 – 23 1998. Sociedade Brasileira de Computação (SBC) – Instituto
de Matemática Pura e Aplicada (IMPA).
[15] N. Sebe, M. Lew, and D. P. Huijsmans. Towards optimal ranking metrics. In Luciano da Fontoura
Costa and Gilberto Câmara, editors, SIBGRAPI 98 – International Symposium on Computer
Graphics, Image Processing and Vision, pages 379–386, Rio de Janeiro – RJ, October 20 – 23 1998.
Sociedade Brasileira de Computação (SBC) – Instituto de Matemática Pura e Aplicada (IMPA).
[16] Henry Lieberman, Elizabeth Rozenweig, and Push Singh. Aria: An agent for annotating and re-
trieving images. Computer, 34(7):57–62, July 2001.
[17] Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom,
Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, and Peter Yanker. Query
by image content: the QBIC system. Computer, 28(9):23 – 32, September 1995.
[18] Luis von Ahn. Games with a purpose. Computer, 39(6):92–94, June 2006.
[19] Ben Shneiderman, Benjamin B. Bederson, and Steven M. Drucker. Find that photo!: interface
strategies to annotate, browse, and share. Communications of the ACM, 49(4):69–71, April 2006.
[20] Sixto Ortiz Jr. Searching the visual web. Computer, 40(6):12–14, June 2007.
[21] Ricardo da Silva Torres and Alexandre Xavier Falcão. Content-based image retrieval: Theory and
applications. Revista de Informática Teórica e Aplicada (RITA), 13(2):161–185, 2006.
[22] Nuno Vasconcelos. From pixels to semantic spaces: Advances in content-based image retrieval.
Computer, 40(7):20–26, July 2007.
[23] Victor Lavrenko, R. Manmatha, and Jiwoon Jeon. A model for learning the semantics of pictures. In
Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information
Processing Systems 16. MIT Press, 2003.
[24] Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain.
Content-based image retrieval at the end of the early years. Transactions on Pattern Analysis and
Machine Intelligence, 22(12):1349–1380, December 2000.
[25] Tao Wang, Yong Rui, Shi min Hu, and Jia guang Sun. Adaptive tree similarity learning for image
retrieval. Multimedia Systems, 9(2):131–143, August 2003.
[26] Luis von Ahn and Laura Dabbish. Designing games with a purpose. Communications of the ACM,
51(8):58–67, August 2008.
[27] McKinsey Global Institute. The challenge – and opportunity – of ’big data’. The McKinsey Quarterly,
May 2011.
[28] McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity.
McKinsey & Company, May 2011.
[29] Suya You, Lichan Hong, Ming Wan, Kittiboon Junyaprasert, Arie Kaufman, Shigeru Muraki,
Yong Zhou, Mark Wax, and Zhengrong Liang. Interactive volume rendering for virtual colonoscopy.
In Roni Yagel and Hans Hagen, editors, Visualization ’97, pages 433 – 436, Phoenix, AZ, October
19 – 24 1997. IEEE Computer Society Press.
[30] Lichan Hong, Shigeru Muraki, Arie Kaufman, Dirk Bartz, and Taosong He. Virtual voyage: Inter-
active navigation in the human colon. Computer Graphics, pages 27 – 34, August 3 – 8 1997. ACM
SIGGRAPH Annual Conference Series.
[31] David Essex. Did somebody say virtual colonoscopy? Communications of the ACM, 52(4):16–18,
April 2009.
[32] Chandu Karadi, Christopher F. Beaulieu, R. Brooke Jeffrey Jr., David S. Paik, and Sandy Napel.
Display modes for CT colonography: Part I. Synthesis and insertion of polyps into patient CT data.
Radiology, 212(1):195–201, July 1999.
[33] Christopher F. Beaulieu, R. Brooke Jeffrey Jr., Chandu Karadi, David S. Paik, and Sandy Napel.
Display modes for CT colonography: Part II. Blinded comparison of axial CT and virtual endoscopic
and panoramic endoscopic volume-rendered studies. Radiology, 212(1):203–212, July 1999.
[34] Hiroyuki Yoshida, Janne Nappi, Peter MacEneaney, David T. Rubin, and Abraham H. Dachman.
Computer-aided diagnosis scheme for detection of polyps at CT colonography. RadioGraphics,
22(4):963–979, July–August 2002.
[35] Arie E. Kaufman, Sarang Lakare, Kevin Kreeger, and Ingmar Bitter. Virtual colonoscopy. Commu-
nications of the ACM, 48(2):37–41, February 2005.
[36] Peter W. Hung, David S. Paik, Sandy Napel, Judy Yee, R. Brooke Jeffrey Jr., Andreas Steinauer-Gebauer,
Juno Min, Ashwin Jathavedam, and Christopher F. Beaulieu. Quantification of distention
in CT colonography: Development and validation of three computer algorithms. Radiology,
222:543–554, February 2002.
[37] Hiroyuki Yoshida, Yoshitaka Masutani, Peter MacEneaney, David T. Rubin, and Abraham H.
Dachman. Computerized detection of colonic polyps at CT colonography on the basis of volumetric
features: Pilot study. Radiology, 222:327–336, February 2002.
[38] Lutz H. Hamel. Knowledge Discovery with Support Vector Machines. Wiley-Interscience, 2009.
[39] Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
[40] Kirk L. Kroeker. Weighing Watson’s impact. Communications of the ACM, 54(7):13–15, July 2011.
[41] Alex Wright. Automotive autonomy. Communications of the ACM, 54(7):16–18, July 2011.
[42] L. D. Jackel, Douglas Hackett, Eric Krotkov, Michal Perschbacher, James Pippine, and Charles
Sullivan. How DARPA structures its robotics programs to improve locomotion and navigation.
Communications of the ACM, 50(11):55–59, November 2007.
[43] Sebastian Thrun. Why we compete in DARPA’s Urban Challenge autonomous robot race.
Communications of the ACM, 50(10):29–31, October 2007.
[44] Jean Kumagai. Dusted: No winners in DARPA’s $1 million robotic race across the Mojave Desert.
Spectrum, March 2004.
[45] Guna Seetharaman, Arun Lakhotia, and Erik Philip Blasch. Unmanned vehicles come of age: The
DARPA Grand Challenge. Computer, 39(12):26–29, December 2006.
[46] Stephen Baker. The programmer’s dilemma: Building a Jeopardy! champion. The McKinsey
Quarterly, February 2011.
[47] Greg Lindsay. Changing the game: "How I beat Watson and came out a different player". The
McKinsey Quarterly, February 2011.
[48] Gary Anthes. Automated translation of Indian languages. Communications of the ACM, 53(1):24–26,
January 2010.
[49] Thomas Lengauer, Andre Altman, Alexander Thielen, and Rolf Kaiser. Chasing the AIDS virus.
Communications of the ACM, 53(3):66–74, March 2010.
[50] Joseph MacInnes, Stephanie Santosa, and William Wright. Visual classification: Expert knowledge
guides machine learning. IEEE Computer Graphics and Applications, 30(1):8–14, January / February
2010.
[51] Robert P. Schumaker and Hsinchun Chen. A discrete stock price prediction engine based on financial
news. Computer, 43(1):51–56, January 2010.
[52] Leslie Pack Kaelbling. New bar set for intelligent vehicles. Communications of the ACM, 53(4):98,
April 2010.
[53] Gregory Goth. Turning data into knowledge. Communications of the ACM, 53(11):13–15, November
2010.
[54] Sebastian Thrun. Toward robotic cars. Communications of the ACM, 53(4):99–106, April 2010.
[55] Kathy Kowalenko. Keeping cars from crashing. The Institute, 34(3):5, September 2010.
[56] Ariel Bleicher. Eyes in the sky that see too much. Spectrum, 47(10):16, October 2010.
[57] Yair Weiss and Judea Pearl. Belief propagation. Communications of the ACM, 53(10):94, October
2010.
[58] Erik B. Sudderth, Alexander T. Ihler, Michael Isard, William T. Freeman, and Allan S. Willsky.
Nonparametric belief propagation. Communications of the ACM, 53(10):95–103, October 2010.
[59] Fernando Pereira. A model sequence memoizer. Communications of the ACM, 54(2):90, February
2011.
[60] Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James, and Yee Whye Teh. The sequence
memoizer. Communications of the ACM, 54(2):91–98, February 2011.
[61] Gabor Szabo and Bernardo A. Huberman. Predicting the popularity of online content. Communi-
cations of the ACM, 53(8):80–88, August 2010.
[62] J. M. Mendel and K. S. Fu. Adaptive, Learning and Pattern Recognition Systems. Academic Press,
1970.
[63] Kwan-Liu Ma. Machine learning to boost the next generation of visualization technology. IEEE
Computer Graphics and Applications, 27(5):6–9, September / October 2007.
[64] Fabio A. Gonzalez and Eduardo Romero. Biomedical Image Analysis and Machine Learning Tech-
nologies: Applications and Techniques. IGI Global, 2010.
[65] Amos J. Storkey. Machine learning and pattern recognition: Introduction. Technical report, Institute
for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, 2009.
[66] Amos J. Storkey. Machine learning and pattern recognition: Regression and linear parameter models.
Technical report, Institute for Adaptive and Neural Computation, School of Informatics, University
of Edinburgh, 2009.
[67] Leônidas Conceição Barroso, Magali Maria de Araújo Barroso, Frederico Ferreira Campos Filho,
Márcio Luiz Bunte de Carvalho, and Miriam Lourenço Maia. Cálculo Numérico (com Aplicações).
Editora Harbra Ltda., 1987.
[68] Amos J. Storkey. Machine learning and pattern recognition: Preliminaries. Technical report,
Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, 2009.
[69] Garrett Birkhoff and Saunders MacLane. Álgebra Moderna Básica. Guanabara Dois, 1980.
[70] Jieping Ye, Teresa Wu, Jing Li, and Kewei Chen. Machine learning approaches for the neuroimaging
study of Alzheimer’s disease. Computer, 44(4):99–101, April 2011.
[71] Ting Liu, Charles Rosenberg, and Henry A. Rowley. Clustering billions of images with large scale
nearest neighbor search. In Eighth IEEE Workshop on Applications of Computer Vision (WACV),
2007.
[72] David Schneider. Under the hood at Google and Facebook. Spectrum, 48(5):54–59, May 2011.
[73] Randy H. Katz. Tech titans building boom. Spectrum, 46(2):36–39; 46–49, February 2009.