Large-scale information extraction and integration infrastructure
for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu




                  Text Mining and Text Stream
                        Mining Tutorial
                                                                Miha Grčar
                                                                miha.grcar@ijs.si

                                        Department of Knowledge Technologies
                                           Jožef Stefan Institute, Ljubljana
                                                                    http://kt.ijs.si
Text and text stream mining
                            tutorial


• Part I: Text mining

• Part II: Text stream mining




Lucca, Oct 2012          Miha Grčar: Text and text stream mining   2
PART I • PART II


                      Part I:
                   Text mining
INTRO • BOW • ML • EVAL • APP



                                What is text mining?
        • Text mining provides a set of methodologies and tools for
          discovering, presenting, and evaluating knowledge from
          large collections of textual documents
        • Text mining adopts and adapts methodologies and
          tools from …
                   –   Data mining (DM)
                   –   Machine learning (ML)
                   –   Information retrieval (IR)
                   –   Natural language processing (NLP)
                   –   Visualization
                   –   Social network analysis and graph mining
                   –   Knowledge management
                   –   …




                       Typical text mining process
        Data acquisition → Text pre-processing → Modeling → Evaluation / validation and Application,
        with a feedback loop from the later stages back to the earlier ones.

        • Data acquisition: acquisition, cleaning
        • Text pre-processing: transformation
        • Modeling: discover, extract, and organize knowledge
        • Evaluation / validation: performance and utility assessment, feedback loop
        • Application: presentation, interaction



                    What do we cover in Part 1?
        The same process (data acquisition → text pre-processing → modeling → evaluation / validation
        and application, with a feedback loop), annotated with the topics of Part I:

        • Text pre-processing: vector space model (bags-of-words)
        • Modeling: machine learning (classification, clustering)
        • Evaluation / validation: cross validation, precision, recall, …
        • Application: search & browse, categorization, recommendation, advertising,
          spam detection, summarization, visualization, …



                                   Bags of words
                    • Tokenize • Remove stop words

        Text:   The quick brown dog jumps over the lazy dog.
        Tokens: the | quick | brown | dog | jumps | over | the | lazy | dog

                                   Bags of words
                    • Tokenize • Remove stop words • Lemmatize • Compute weights

        Text:         The quick brown dog jumps over the lazy dog.
        Tokens:       the | quick | brown | dog | jumps | over | the | lazy | dog
        Lemmatized:   jumps → jump
        Bag of words: quick 1 | brown 1 | dog 2 | jump 1 | lazy 1
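
The four steps named on this slide (tokenize, remove stop words, lemmatize, compute weights) can be sketched in a few lines. This is only an illustration: the stop-word set and the lemma dictionary below are hypothetical toy resources sized for the example sentence, and the weights are plain term counts.

```python
import re
from collections import Counter

# Hypothetical toy resources, just large enough for the example sentence.
STOP_WORDS = {"the", "over"}
LEMMAS = {"jumps": "jump"}

def bag_of_words(text):
    """Tokenize, drop stop words, lemmatize, and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenize
    kept = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    lemmas = [LEMMAS.get(t, t) for t in kept]          # lemmatize (dictionary lookup)
    return Counter(lemmas)                             # term counts as weights

bow = bag_of_words("The quick brown dog jumps over the lazy dog.")
print(dict(bow))  # {'quick': 1, 'brown': 1, 'dog': 2, 'jump': 1, 'lazy': 1}
```

A real pipeline would use a full stop-word list and a proper lemmatizer; the dictionary lookup only stands in for one.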

                                     Bags of words
                                Tokenization & stop word removal
        Original text:
          After ripping 14% higher from June until the first week of October, stocks ran
          headfirst into a wall of worry seemingly too large to climb. Europe, China, the
          fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real.
          Investors suddenly care and are behaving accordingly, selling some of their more
          aggressive names and rotating into defensives.

        Simple tokenizer (alphanumeric strings only):
          After | ripping | 14 | higher | from | June | until | the | first | week | of |
          October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly |
          too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren | t |
          new | concerns | but | that | doesn | t | mean | they | aren | t | real |
          Investors | suddenly | care | and | are | behaving | accordingly | selling |
          some | of | their | more | aggressive | names | and | rotating | into | defensives

                                     Bags of words
                                Tokenization & stop word removal
        Original text:
          After ripping 14% higher from June until the first week of October, stocks ran
          headfirst into a wall of worry seemingly too large to climb. Europe, China, the
          fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real.
          Investors suddenly care and are behaving accordingly, selling some of their more
          aggressive names and rotating into defensives.

        Regex tokenizer ([\p{L}']+):
          After | ripping | higher | from | June | until | the | first | week | of |
          October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly |
          too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't |
          new | concerns | but | that | doesn't | mean | they | aren't | real |
          Investors | suddenly | care | and | are | behaving | accordingly | selling |
          some | of | their | more | aggressive | names | and | rotating | into | defensives
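
The difference between the two tokenizers can be reproduced with Python's standard `re` module. This is a sketch under one assumption: `re` has no Unicode letter class `\p{L}`, so the second function approximates "letters and apostrophes" with `[^\W\d_]` (word characters that are neither digits nor underscore). The function names are made up for this example.

```python
import re

def simple_tokenize(text):
    """Runs of ASCII letters and digits; an apostrophe splits a token."""
    return re.findall(r"[A-Za-z0-9]+", text)

def letter_tokenize(text):
    """Runs of letters and apostrophes; digits and punctuation are dropped."""
    # [^\W\d_] approximates the Unicode letter class, which `re` does not offer.
    return re.findall(r"(?:[^\W\d_]|')+", text)

sample = "After ripping 14% higher, they aren't real."
print(simple_tokenize(sample))
# ['After', 'ripping', '14', 'higher', 'they', 'aren', 't', 'real']
print(letter_tokenize(sample))
# ['After', 'ripping', 'higher', 'they', "aren't", 'real']
```

The second variant keeps contractions such as "aren't" whole and discards the number, which matches the behaviour shown on the slides.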

                                 Bags of words
                                       Lemmatization
        Original text:
          After ripping 14% higher from June until the first week of October, stocks ran
          headfirst into a wall of worry seemingly too large to climb. Europe, China, the
          fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real.
          Investors suddenly care and are behaving accordingly, selling some of their more
          aggressive names and rotating into defensives.

        Lemmatized:
          After | rip | high | from | June | until | the | first | week | of | October |
          stock | run | headfirst | into | a | wall | of | worry | seemingly | too | large |
          to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concern |
          but | that | doesn't | mean | they | aren't | real | Investor | suddenly | care |
          and | are | behave | accordingly | sell | some | of | their | more | aggressive |
          name | and | rotate | into | defensive

                                  Bags of words
                                         Lemmatization
        Original text (Italian):
          È uno dei punti più contestati della legge di Stabilità approvata da poco dal
          governo: il taglio alle detrazioni fiscali, ossia gli "sconti" che ogni
          contribuente può vantare sulla propria dichiarazione dei redditi. Secondo una
          bozza aggiornata del disegno di legge, il taglio si applicherebbe a decorrere dal
          periodo di imposta al 31 dicembre 2012. Un dettaglio che aveva creato, nei giorni
          scorsi, non poche polemiche.

        Lemmatized:
          E | uno | dei | puntare | più | contestato | della | legge | di | Stabilità |
          approvare | da | poco | dal | governo | il | tagliare | alle | detrazione |
          fiscale | ossia | gli | scontare | che | ogni | contribuire | può | vantare |
          sulla | proprio | dichiarazione | dei | reddito | Secondo | una | bozzare |
          aggiornare | del | disegnare | di | legge | il | tagliare | si | applicare | a |
          decorrere | dal | periodare | di | impostare | al | dicembre | Un | dettagliare |
          che | aveva | creare | nei | giorno | scorrere | non | poca | polemico



                                 Computing weights
        • TF
                   – Term Frequency
                   – The number of times a lemma (stem) occurs in a document
        • DF
                   – Document Frequency
                   – The number of documents in which a lemma (stem) occurs at least
                     once
        • TFIDF
                   – Higher TF means higher TFIDF
                   – Higher DF means lower TFIDF







                                 Computing weights
        Documents: "The quick brown dog jumps over the lazy dog." (one document)

                   Lemma     TF    DF    IDF    TFIDF
                   quick      1     1     0       0
                   brown      1     1     0       0
                   dog        2     1     0       0
                   jump       1     1     0       0
                   lazy       1     1     0       0

                                 Computing weights
        Documents: "The quick brown dog jumps over the lazy dog." and a second document, "jump"

                   Lemma     TF    DF    IDF    TFIDF
                   quick      1     1    0.69    0.69
                   brown      1     1    0.69    0.69
                   dog        2     1    0.69    1.39
                   jump       1     2    0       0
                   lazy       1     1    0.69    0.69
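
The weighting formula itself is not legible in this copy of the slides. The table values are consistent with TFIDF = TF × ln(N / DF), where N is the number of documents (0.69 ≈ ln 2), so the sketch below assumes that form. The second document is taken to contain only the lemma "jump", which is also an assumption read off the table.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of lemma lists. Returns one {lemma: weight} dict per document.

    Assumed weighting: TF * ln(N / DF), with N the number of documents.
    """
    n = len(docs)
    # DF: in how many documents does each lemma occur at least once?
    df = Counter(lemma for doc in docs for lemma in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # TF: occurrences of each lemma in this document
        weighted.append({l: tf[l] * math.log(n / df[l]) for l in tf})
    return weighted

d1 = ["quick", "brown", "dog", "jump", "lazy", "dog"]
d2 = ["jump"]  # assumed content of the second document
w1, w2 = tfidf([d1, d2])
for lemma in ["quick", "brown", "dog", "jump", "lazy"]:
    print(lemma, round(w1[lemma], 2))
# quick 0.69, brown 0.69, dog 1.39, jump 0.0, lazy 0.69
```

With a single document every IDF is ln(1/1) = 0, which reproduces the all-zero table for the one-document case.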

                                     Cosine similarity
        [Figure: document vectors d1 and d2 drawn from the origin 0, together with their
        normalized versions d1' and d2' of length 1]
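
Cosine similarity is the cosine of the angle between two document vectors. Once the vectors are normalized to length 1, it reduces to a plain dot product. A minimal sketch, assuming sparse vectors stored as {term: weight} dictionaries (the representation and the example weights are choices made for this illustration):

```python
import math

def norm(v):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in v.values()))

def normalize(v):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    n = norm(v)
    return {t: w / n for t, w in v.items()} if n else dict(v)

def dot(a, b):
    """Dot product over the terms the two vectors share."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cosine(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is zero."""
    na, nb = norm(a), norm(b)
    return dot(a, b) / (na * nb) if na and nb else 0.0

d1 = {"dog": 2.0, "lazy": 1.0}
d2 = {"dog": 1.0, "quick": 1.0}
print(round(cosine(d1, d2), 3))                        # 0.632
print(round(dot(normalize(d1), normalize(d2)), 3))     # 0.632, same value
```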



                                  Centroids




                                                                          • Determine characteristic
                                                                            words in a cluster
                                                                          • Nearest centroid classifier
                                                                          • k-means clustering
                                                                          • …
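
A centroid is the component-wise average of a set of document vectors, and its highest-weighted terms serve as the characteristic words of the cluster. A small sketch, assuming sparse {term: weight} vectors. The example weights are invented for illustration.

```python
from collections import defaultdict

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    return {term: w / len(vectors) for term, w in total.items()}

def top_terms(c, k=3):
    """The k terms with the largest weight in centroid c."""
    return sorted(c, key=c.get, reverse=True)[:k]

cluster = [
    {"merger": 0.8, "buyout": 0.6},
    {"merger": 0.6, "acquisition": 0.8},
    {"merger": 0.8, "acquisition": 0.6},
]
c = centroid(cluster)
print(top_terms(c, 2))  # ['merger', 'acquisition']
```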






                                       Where are we?
        [Figure: the process diagram from "What do we cover in Part 1?", repeated as a
        progress marker before the machine learning (Modeling) slides]



                                Machine learning
        • Machine learning is concerned with the development of
          algorithms that allow computer programs to learn from past
          experience [Mitchell]
        • Machine learning refers to a collection of algorithms that take
          as input empirical data (e.g., from databases or sensors) and
          try to discover some characteristics (rules, constraints,
          patterns, features) of the process that generated the data
          [Wikipedia]
        • Learning from past experience = learning from past examples
        • Examples (instances) = document vectors (normalized sparse
          vectors)




                                   Machine learning
        • We will look at two commonly used
          machine learning techniques
                   – Classification
                      • Assigning instances (documents) to two or
                        more predefined (discrete) classes
                      • Supervised learning method

                   – Clustering
                      • Arranging instances (documents) into
                        groups (clusters) so that instances in the
                        same group are more similar to each other
                        than to those in other groups
                      • Unsupervised learning method





                                                    Classification
              • Labeled documents
                   Mergers & Acquisitions • Ingram Wraps Up Brightpoint Buyout
                   Mergers & Acquisitions • State Street completes acquisition of Goldman Sachs Administration Services
                   Economy & Government • Gasoline fuels inflation, but Fed policy seen steady
                   Economy & Government • Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
                   ...
                   Investing Picks • Smith & Wesson Holding Corp. Enters Oversold Territory
                   Investing Picks • The Fresh Market: A Strong Buy


              • Learn to classify
                   Labeled dataset → Training algorithm → Classification model

              • Classify unlabeled documents
                   Unlabeled dataset + Classification model → Classification algorithm → Predictions (labels)
                   Example: "Fresh Del Monte Produce Inc. Enters Oversold Territory" → Investing Picks

                                        Classification
                                     with k-Nearest Neighbors
        [Figure: an unlabeled document among labeled documents of the classes Investing Picks,
        Mergers & Acquisitions, and Economy & Government]
        Votes of the nearest neighbors:
                   Investing Picks: 4
                   Mergers & Acquisitions: 1
                   Economy & Government: 0
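
The voting scheme can be sketched as follows: rank the labeled documents by cosine similarity to the query and let the k most similar ones vote. The training vectors below are invented binary bags of words, chosen so that the vote mirrors the 4 : 1 : 0 split above. k = 5 is an assumption based on the vote total.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, labeled, k=5):
    """labeled: list of (vector, label). Majority vote among the k most similar."""
    ranked = sorted(labeled, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes

IP, MA, EG = "Investing Picks", "Mergers & Acquisitions", "Economy & Government"
training = [  # hypothetical toy data
    ({"oversold": 1, "territory": 1}, IP),
    ({"oversold": 1, "buy": 1}, IP),
    ({"strong": 1, "buy": 1, "oversold": 1}, IP),
    ({"enters": 1, "oversold": 1}, IP),
    ({"acquisition": 1, "enters": 1}, MA),
    ({"buyout": 1, "merger": 1}, MA),
    ({"inflation": 1, "fed": 1}, EG),
]
query = {"enters": 1, "oversold": 1, "territory": 1}
label, votes = knn_classify(query, training, k=5)
print(label, votes[IP], votes[MA], votes[EG])  # Investing Picks 4 1 0
```

There is no training step: the "model" is the labeled dataset itself, so each classification has to scan all of it.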

                                      Classification
                                with Nearest Centroid Classifier
        [Figure: similarities s1, s2, s3 between an unlabeled document and the centroids of the
        three classes]
                   s1: Investing Picks
                   s2: Mergers & Acquisitions
                   s3: Economy & Government
                   Similarity s2 > s1 > s3
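
A nearest centroid classifier stores one centroid per class and assigns a document to the class whose centroid is most similar to it. A minimal sketch with cosine similarity. The function names and the toy training data are assumptions for this example.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def train(labeled):
    """Training: compute one centroid per class label."""
    by_label = defaultdict(list)
    for vector, label in labeled:
        by_label[label].append(vector)
    return {label: centroid(vs) for label, vs in by_label.items()}

def classify(query, centroids):
    """Return the label of the most similar centroid."""
    return max(centroids, key=lambda label: cosine(query, centroids[label]))

training = [  # hypothetical toy data
    ({"buyout": 1.0, "merger": 1.0}, "Mergers & Acquisitions"),
    ({"acquisition": 1.0, "merger": 1.0}, "Mergers & Acquisitions"),
    ({"oversold": 1.0, "buy": 1.0}, "Investing Picks"),
    ({"inflation": 1.0, "fed": 1.0}, "Economy & Government"),
]
model = train(training)
print(classify({"acquisition": 1.0, "buyout": 1.0}, model))  # Mergers & Acquisitions
```

The model is just one vector per class, which is why the model is small and both training and classification are fast.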

                                            Classification
                                with Support Vector Machine (SVM)
        [Figure: Investing Picks and Mergers & Acquisitions instances on either side of a
        separating boundary; the quantity w is marked in the figure]
                                                                                       • Maximize w
                                                                                       • Minimize     tradeoff

                          Classification algorithms

                                                 k-NN     Nearest centroid    SVM (linear kernel)
                   Multiclass?                   yes           yes                  no
                   Explains decisions?           no            yes                  yes
                   Explains model?               no            yes                  yes
                   Number of parameters          1             0                    1
                   Model size                    big           small                small
                   Training speed                0             fast                 slow
                   Classification speed          slow          fast                 fast
                   Accuracy (on texts)           low           medium               high



                                 Clustering

        • k-means clustering

        • Agglomerative hierarchical clustering







                                k-means clustering
        Input: k
        Output: k clusters (and their centroids)
        1. Randomly select k instances for initial centroids
        2. Assign step
           Assign each instance to the nearest centroid
        3. If the assignments did not change, end the
           algorithm
        4. Update step
           Recompute (update) centroids
        5. Go back to Step 2
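
The five steps above can be sketched as follows. Two details are assumptions of this sketch rather than part of the slide: "nearest" is taken to mean highest cosine similarity, and the `initial`, `seed`, and `max_iter` parameters are conveniences added for reproducibility and as a safety cap.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(instances, k, initial=None, seed=0, max_iter=100):
    # Step 1: pick k instances as initial centroids (random unless given).
    centroids = list(initial) if initial else random.Random(seed).sample(instances, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2 (assign): each instance goes to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda j: cosine(x, centroids[j])) for x in instances
        ]
        # Step 3: stop when the assignments did not change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4 (update): recompute each centroid from its members.
        for j in range(k):
            members = [x for x, a in zip(instances, assignment) if a == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = centroid(members)
        # Step 5: the loop continues with Step 2.
    return assignment, centroids

docs = [  # hypothetical toy data
    {"merger": 1.0, "buyout": 1.0},
    {"merger": 1.0, "acquisition": 1.0},
    {"inflation": 1.0, "fed": 1.0},
    {"inflation": 1.0, "euro": 1.0},
]
labels, cents = kmeans(docs, 2, initial=[docs[0], docs[2]])
print(labels)  # [0, 0, 1, 1]
```

The result depends on the initial centroids. With a random start, two seeds drawn from the same topic can lead to a poorer split, which is why k-means is commonly run several times.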




                                k-means clustering



                    This video is available at http://first.ijs.si/tutorial/video/kmeans.html







            Agglomerative hierarchical clustering

                                1.     Find the two most similar instances
                                2.     Connect them
                                3.     Replace them with their centroid
                                4.     Repeat …




                                [Figure: the resulting tree of merges, a “dendrogram”]
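
The merge loop can be sketched as below, returning the dendrogram as nested tuples. Two choices are assumptions of this sketch: similarity is cosine similarity, and the centroid of a merged pair is weighted by how many original instances each side contains, so that it equals the centroid of all instances underneath it.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(names, vectors):
    """Merge the two most similar nodes until one tree (the dendrogram) remains."""
    # Each node is (tree, centroid vector, number of instances it covers).
    nodes = [(name, vec, 1) for name, vec in zip(names, vectors)]
    while len(nodes) > 1:
        # 1. Find the two most similar nodes.
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: cosine(nodes[p[0]][1], nodes[p[1]][1]))
        (ta, va, na), (tb, vb, nb) = nodes[i], nodes[j]
        # 2.-3. Connect them and replace them with their (size-weighted) centroid.
        merged = {t: (va.get(t, 0.0) * na + vb.get(t, 0.0) * nb) / (na + nb)
                  for t in set(va) | set(vb)}
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)]
        nodes.append(((ta, tb), merged, na + nb))
        # 4. Repeat with the reduced set of nodes.
    return nodes[0][0]

tree = agglomerate(
    ["A", "B", "C"],  # hypothetical toy documents
    [{"merger": 1.0, "buyout": 1.0},
     {"merger": 1.0, "acquisition": 1.0},
     {"inflation": 1.0, "fed": 1.0}],
)
print(tree)  # ('C', ('A', 'B'))
```

A and B share a term and are merged first; C joins last, at the root of the tree.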

                                       Where are we?
        [Figure: the process diagram from "What do we cover in Part 1?", repeated as a
        progress marker before the Evaluation slide]



                                               Evaluation
        • Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)
                   – 10-fold cross validation
                   – Stratified
        • Accuracy
        • Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall |
              http://en.wikipedia.org/wiki/F1_Score)

        • Micro- and macro-averaging
              (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html |
              http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)

        • Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
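
As an illustration of the measures listed above, the sketch below computes precision, recall, and F1 per class for a single-label classifier, then averages them in two ways: macro-averaging takes the mean of the per-class scores, while micro-averaging pools the counts of all classes before computing the scores. The function names and the tiny gold/predicted label lists are made up for this example.

```python
from collections import Counter

def counts(gold, pred):
    """Per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g should have been predicted
    return tp, fp, fn

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def evaluate(gold, pred):
    tp, fp, fn = counts(gold, pred)
    classes = sorted(set(gold) | set(pred))
    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in classes}
    # Macro: average the per-class scores, so every class counts equally.
    macro = tuple(sum(s[i] for s in per_class.values()) / len(classes) for i in range(3))
    # Micro: pool the counts, so every document counts equally.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro

gold = ["A", "A", "A", "B"]
pred = ["A", "A", "B", "B"]
per_class, macro, micro = evaluate(gold, pred)
print(per_class["A"])  # precision 1.0, recall 2/3, F1 0.8
print(micro)           # (0.75, 0.75, 0.75)
```

In this single-label setting the micro-averaged precision, recall, and F1 all equal accuracy (3 of 4 correct), whereas the macro average gives the small class B the same weight as class A.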

                                       Where are we?
        [Figure: the process diagram from "What do we cover in Part 1?", repeated as a
        progress marker before the Applications slide]



                                Applications
        • Enhanced Web search (SearchPoint)
        • Social browsing (LiveNetLife)
        • Content categorization
        • Content-based recommender systems
        • Advertising
        • Blogging assistance (Zemanta)
        • Spam detection
        • Visualization / summarization of large corpora
        • Text summarization
              Leskovec et al. (2005): Extracting Summary Sentences Based on the Document
              Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
        • Sentiment analysis (demo later)
        • News aggregation
              http://emm.newsexplorer.eu
        • Knowledge engineering
              http://ontogen.ijs.si
        • …

                                Enhanced Web search (http://www.searchpoint.com)

                                Social browsing (http://www.livenetlife.com) @ http://videolectures.net

                                Content categorization @ http://videolectures.net

                                Recommender system @ http://videolectures.net

                                Contextualized advertising

                                Blogging assistant (http://www.zemanta.com)



                                Pump & dump
                                   Siering, Muntermann, Grčar (2012)







                                Visualizations
        • Document space visualization
        • Canyon flows
        • Tag clouds (http://www.jasondavies.com/wordcloud/)




                                                   Recap
        • Basics
                   – What is text mining?
                   – TF-IDF bag-of-words vectors
                   – Cosine similarity
                   – Centroids
        • Machine learning
                   – k-NN
                   – Nearest centroid classifier
                   – SVM
                   – k-means
                   – Agglomerative clustering
        • Applications
                   – Enhanced Web search (SearchPoint)
                   – Social browsing (LiveNetLife)
                   – Content categorization
                   – Content-based recommender systems
                   – Advertising
                   – Writing assistance (Zemanta)
                   – Spam detection
                   – Visualization / summarization of large corpora …



                          Part II:
                   Text stream mining
INTRO • DACQ • BOW • ML • APP



                     What is text stream mining?

                          Same as text mining but on streams

                      Text stream mining provides a set of
                    methodologies and tools for discovering,
                   presenting, and evaluating knowledge from
                         streams of textual documents



                          Remember: typical text mining process
        Data acquisition → Text pre-processing → Modeling → Evaluation / validation, Application
        (with a feedback loop)
        • Data acquisition
                   – Acquisition
                   – Cleaning
        • Text pre-processing
                   – Transformation
        • Modeling
                   – Discover
                   – Extract
                   – Organize knowledge
        • Evaluation / validation
                   – Performance and utility assessment
                   – Feedback loop
        • Application
                   – Presentation
                   – Interaction

                     Typical text stream mining process
        Stream data acquisition → Text pre-processing → Modeling → Evaluation / validation, Application
        (with a feedback loop)
        • Stream data acquisition
                   – Acquisition
                   – Cleaning
        • Text pre-processing
                   – Transformation
        • Modeling
                   – Discover
                   – Extract
                   – Organize knowledge
        • Evaluation / validation
                   – Performance and utility assessment
                   – Obtaining new labels
                   – Feedback loop
        • Application
                   – Presentation
                   – Interaction

                     Text stream mining pipelines
        • Pipelining and parallelization
                   – Enables concurrent processing
                   – Increases throughput
                   – Enables distributed execution (cluster)
          [Figure: pipelining vs. parallelization]
        • Near-realtime online systems
                   – Stream cannot be paused or slowed down
                     (e.g., newsfeeds)
                   – [Near-realtime] Time between reception and
                     utilization of data should be as short as possible
                   – [Online] Stream is infinite and (sooner or later)
                     outdated data needs to be deleted
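
To make the pipelining idea concrete, here is a minimal sketch in which each stage runs in its own thread and hands its output to the next stage through a queue, so a new document can enter stage one while the previous one is still in stage two. The helper `start_stage`, the `STOP` sentinel and the two toy stages are assumptions made for this example, not part of the tutorial's software; parallelization would amount to running several such pipelines side by side behind a load balancer.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down

def start_stage(func, inbox, outbox):
    """Apply func to every item from inbox in a dedicated thread; push results to outbox."""
    def run():
        while True:
            item = inbox.get()
            if item is STOP:
                outbox.put(STOP)   # propagate shutdown to the next stage
                return
            outbox.put(func(item))
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread

# Hypothetical two-stage pipeline: strip whitespace, then lowercase.
q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
start_stage(str.strip, q_in, q_mid)
start_stage(str.lower, q_mid, q_out)

# Feed a (finite, for the demo) stream of documents into the pipeline.
for doc in ["  Breaking NEWS ", " Market Update  "]:
    q_in.put(doc)
q_in.put(STOP)

# Collect the output until the sentinel arrives.
results = []
while True:
    item = q_out.get()
    if item is STOP:
        break
    results.append(item)
```

With one worker per stage the output order matches the input order; adding more workers per stage raises throughput but then requires an explicit synchronization step.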




                     What do we cover in Part II?
        Stream data acquisition → Text pre-processing → Modeling → Evaluation / validation, Application
        (with a feedback loop)
        • Stream data acquisition
                   – RSS feeds
                   – Boilerplate remover
                   – Language detection
        • Text pre-processing
                   – Online BOW
        • Modeling
                   – Online ML: incremental NCC, incremental k-means, incremental SVM
        • Application
                   – Online document space visualization
                   – Online Twitter sentiment classification


                        Text stream acquisition and preprocessing
        RSS readers → Load balancing → Boilerplate removers → Language detectors → Sync → Online BOW → ...
        • Several RSS readers acquire the stream
        • Boilerplate remover → Language detector chains form parallel preprocessing pipelines
        • The pipeline outputs are synchronized before the online BOW component




                   RSS (Really Simple Syndication)




                          <rss version="2.0">
                          <channel>
                          <generator>NFE/1.0</generator>
                          <title>Top Stories - Google News</title>
                          <link>http://news.google.com/news?pz=1&amp;ned=us&amp;hl=en</link>
                          <language>en</language>
                          <webMaster>news-feedback@google.com</webMaster>
                          <copyright>&amp;copy;2011 Google</copyright>
                          <item>
                                    <title>Egypt Analysts Comment on Next Steps After Mubarak’s Ouster -
                                    Bloomberg</title>
                                    <link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNEF9B
                                    7Q8C7_TBDKPEMFjb83fcuNfQ&amp;url=http://www.bloomberg.com/news/2011-
                                    02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
                                    <category>Top Stories</category>
                                    <pubDate>Fri, 11 Feb 2011 20:15:40 GMT+00:00</pubDate>
                                    <description>The ouster of Hosni Mubarak from Egypt’s presidency today, after
                                    protests that started Jan. 25, prompted the following comments from analysts:
                                    “The army needs to move quickly to remove obstacles to ...</description>
                          </item>
                          ...
                          </channel>
                          </rss>
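
A feed like the one above can be read with Python's standard XML parser. The sketch below is only an illustration: `FEED` is a trimmed copy of the slide's example (shortened link and description, plain apostrophes), and `read_items` is a helper name invented here. A production reader would also fetch the feed periodically and skip items it has already seen.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the feed shown on the slide (assumption: shortened for readability).
FEED = """<rss version="2.0">
<channel>
  <title>Top Stories - Google News</title>
  <language>en</language>
  <item>
    <title>Egypt Analysts Comment on Next Steps After Mubarak's Ouster - Bloomberg</title>
    <link>http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
    <category>Top Stories</category>
    <pubDate>Fri, 11 Feb 2011 20:15:40 GMT</pubDate>
    <description>The ouster of Hosni Mubarak from Egypt's presidency today ...</description>
  </item>
</channel>
</rss>"""

def read_items(rss_text):
    """Return one dict per <item>, keeping the fields a text stream pipeline typically needs."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return items

items = read_items(FEED)
```

Note that the `<description>` usually holds only a teaser, so the full article still has to be downloaded from the `<link>` and cleaned of boilerplate.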



                                Boilerplate removal
                   http://www.bbc.co.uk/news/world-us-canada-15051554


                                Boilerplate removal
                                     URL tree
        URL structure:  protocol :// domain / path / file ? query
        Example:        http://  kt.ijs.si  /a/b/  c.html  ?pg=0

        Tree branch:    # → si → ijs → kt → a → b
                        (root: #; domain: si, ijs, kt; path: a, b)

        http://www.bbc.co.uk/news/world-us-canada-15051554
                        # → uk → co → bbc → www → news
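
The mapping from a URL to its tree branch can be sketched in a few lines. The function name `url_branch` is made up for this example; it follows the two examples on the slide by reversing the domain labels and keeping only the path directories (the last path segment is treated as the file and dropped, as is the query).

```python
from urllib.parse import urlsplit

def url_branch(url):
    """Turn a URL into a root-to-leaf branch: '#', reversed domain labels, path directories."""
    parts = urlsplit(url)
    domain = parts.hostname.split(".")[::-1]             # kt.ijs.si -> si, ijs, kt
    # Drop the last path segment (the file) and any empty segments.
    dirs = [seg for seg in parts.path.split("/")[:-1] if seg]
    return ["#"] + domain + dirs

branch_kt = url_branch("http://kt.ijs.si/a/b/c.html?pg=0")
branch_bbc = url_branch("http://www.bbc.co.uk/news/world-us-canada-15051554")
```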



                                Boilerplate removal
                                     URL tree
        • Documents from the stream are placed into the tree: Stream → Root (#) → Domain → Path
        • For each piece of text: how many times did I see “About Us” in this part of the tree?
        • This method is …
                   – Unsupervised
                   – Online
                   – Incremental (consumes one document at a time)
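
The counting idea can be sketched as follows. This is an assumption-laden illustration rather than the tutorial's actual algorithm: the class name `BoilerplateCounter`, the fixed threshold `min_count`, and the rule "a text block seen at least `min_count` times at the document's tree node is boilerplate" are all choices made for this example. The branch is passed in as a list such as `["#", "uk", "co", "bbc", "www", "news"]`.

```python
from collections import defaultdict

class BoilerplateCounter:
    """Unsupervised, online, incremental: consumes one document at a time."""

    def __init__(self, min_count=3):
        self.min_count = min_count        # hypothetical decision threshold
        self.seen = defaultdict(int)      # (tree node, text block) -> number of documents

    def process(self, branch, blocks):
        """Update the counts with one document and return the blocks kept as content."""
        # Every prefix of the branch is a node of the URL tree.
        nodes = [tuple(branch[:i]) for i in range(1, len(branch) + 1)]
        for block in set(blocks):         # count each block once per document
            for node in nodes:
                self.seen[node, block] += 1
        deepest = nodes[-1]
        # Keep a block only if it has been rare so far in this part of the tree.
        return [b for b in blocks if self.seen[deepest, b] < self.min_count]

counter = BoilerplateCounter(min_count=3)
branch = ["#", "uk", "co", "bbc", "www", "news"]
out1 = counter.process(branch, ["About Us", "Story one"])
out2 = counter.process(branch, ["About Us", "Story two"])
out3 = counter.process(branch, ["About Us", "Story three"])
```

Counts are stored for every node on the branch so that a fuller implementation could back off to a shallower node when the deepest one has seen too few documents; this sketch only queries the deepest node. The first documents from a new site pass through uncleaned, which is the usual cold-start cost of an unsupervised online method.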



                                Language detection
        • Motivation: language-specific text analysis
          components and applications
        • Solutions based on word lists and word or
          character sequences (n-grams)
        • Character n-gram model
                   – Build character n-gram histograms for many
                     languages (language models)
                   – Compare text document histogram to language
                     models
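
A minimal sketch of the character n-gram approach is given below. The slide does not say how histograms are compared; this example assumes the rank-based "out-of-place" distance, which is one common choice. The function names, the profile size `top`, and the two one-sentence training texts are invented for illustration; real language models are built from much larger corpora.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank (1 = most frequent)."""
    padded = "_" + "_".join(text.upper().split()) + "_"   # '_' marks word boundaries
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip("_"):           # skip n-grams made only of boundary markers
                counts[gram] += 1
    ranked = counts.most_common(top)
    return {gram: rank for rank, (gram, _) in enumerate(ranked, start=1)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed maximum penalty."""
    missing = len(lang_profile) + 1
    return sum(abs(rank - lang_profile[g]) if g in lang_profile else missing
               for g, rank in doc_profile.items())

def detect(text, models):
    """Return the language whose model is closest to the text's n-gram profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))

# Toy training texts (assumption: far too small for real use).
ENGLISH = "the dog and the cat go into the garden and the man sees the dog"
GERMAN = "der hund und die katze gehen in den garten und der mann sieht den hund"
models = {"english": ngram_profile(ENGLISH), "german": ngram_profile(GERMAN)}

guess = detect("the man and the dog", models)
```

A text's distance to a model built from that same text is zero, which is a handy sanity check when wiring up new language models.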




                                Language detection
                                      English                         German
                                  n-gram   rank                  n-gram   rank
                                  E        1                     E        1
                                  T        2                     N        2
                                  O        3                     R        3
                                  A        4                     I        4
                                  N        5                     T        5
                                  I        6                     S        6
                                  H        7                     A        7
                                  S        8                     D        8
                                  R        9                     U        9
                                  D        10                    EN       10
                                  E_       11                    G        11
                                  L        12                    ER       12
                                  _T       13                    H        13
                                  TH       14                    L        14
                                  HE       15                    N_       15
                                  U        16                    O        16
                                  W        17                    M        17
                                  C        18                    _D       18
                                  M        19                    C        19
                                  ...      ...                   ...      ...
                                  (slide callouts: THE for English; DER, DEN for German)




                                Language detection
        [Figure: article “Egypt rejoices at Mubarak departure”. Two scatter plots of the
        English article's n-gram ranks: left, against the English language model (n-gram rank);
        right, against the German language model (n-gram rank).]




                                Online BOW
        Stream → Add → Queue of TF vectors → Remove → Outdated
        DF values (kept alongside the queue)




                                Online BOW
        Stream → Queue of TF vectors → Outdated
        TF (from the queue) and DF (from the DF values) → TF-IDF
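
The queue-plus-DF-table scheme can be sketched as below. Several details are assumptions made for this example rather than facts from the slides: the class name `OnlineBow`, a window limited by document count (`max_docs`) instead of document age, the weighting tf · log(N / df), and L2 normalization of the resulting vector.

```python
import math
from collections import Counter, deque

class OnlineBow:
    def __init__(self, max_docs=1000):
        self.max_docs = max_docs      # hypothetical window size
        self.queue = deque()          # TF vectors, oldest first
        self.df = Counter()           # term -> number of queued documents containing it

    def add(self, tokens):
        """Add a new document: enqueue its TF vector and increase the DF values."""
        tf = Counter(tokens)
        self.queue.append(tf)
        self.df.update(tf.keys())     # each distinct term counts once per document
        while len(self.queue) > self.max_docs:
            self._remove_oldest()
        return tf

    def _remove_oldest(self):
        """Remove the outdated document and decrease the DF values accordingly."""
        old = self.queue.popleft()
        for term in old:
            self.df[term] -= 1
            if self.df[term] == 0:
                del self.df[term]

    def tfidf(self, tf):
        """Combine a TF vector with the current DF values into a normalized TF-IDF vector."""
        n = len(self.queue)
        vec = {t: f * math.log(n / self.df[t]) for t, f in tf.items() if self.df.get(t)}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else vec

bow = OnlineBow(max_docs=2)
bow.add(["a", "b"])
bow.add(["a", "c"])
last_tf = bow.add(["c", "d"])         # pushes the first document out of the window
vec = bow.tfidf(last_tf)
```

Because the DF values change with every added or removed document, TF-IDF vectors are computed on demand from the stored TF vectors rather than stored themselves.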






           Batch, incremental, offline, online
        • Batch learning
                   Consuming all training examples at once
        • Incremental learning
                   Consuming one example at a time
        • Mini-batch learning
                   Consuming several examples at a time
        • Offline learning (for datasets/finite streams)
                   All data is stored and can be accessed repeatedly
        • Online learning (for infinite streams)
                   Each example is discarded after being processed




         Incremental nearest centroid classifier
        [Figure labels: Outdated instance, New instance]
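
One way to make the nearest centroid classifier incremental is to keep, for every class, the running sum of its instance vectors and an instance count, so that adding a new instance or removing an outdated one is a cheap update. The sketch below assumes sparse vectors stored as dicts and cosine similarity; the class name `IncrementalNCC` and the toy vectors are invented for this example.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class IncrementalNCC:
    def __init__(self):
        self.sums = defaultdict(lambda: defaultdict(float))  # label -> term -> summed weight
        self.counts = defaultdict(int)                       # label -> number of instances

    def add(self, vec, label):
        """A new instance arrives: add it to its class sum."""
        for t, w in vec.items():
            self.sums[label][t] += w
        self.counts[label] += 1

    def remove(self, vec, label):
        """An instance becomes outdated: subtract it from its class sum."""
        for t, w in vec.items():
            self.sums[label][t] -= w
        self.counts[label] -= 1

    def predict(self, vec):
        """Return the label of the most similar centroid (None if no class has instances)."""
        labels = [l for l, n in self.counts.items() if n > 0]
        if not labels:
            return None
        # Cosine is scale-invariant, so the class sum can stand in for the centroid.
        return max(labels, key=lambda l: cosine(vec, self.sums[l]))

ncc = IncrementalNCC()
ncc.add({"goal": 1.0, "match": 1.0}, "sports")
ncc.add({"vote": 1.0, "party": 1.0}, "politics")
first = ncc.predict({"goal": 1.0})

# The stream moves on: the sports instance is outdated, a new politics instance arrives.
ncc.remove({"goal": 1.0, "match": 1.0}, "sports")
ncc.add({"goal": 1.0, "vote": 1.0}, "politics")
second = ncc.predict({"goal": 1.0})
```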







                   Incremental k-means clustering




                                Converges in only a few iterations (warm start)
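
The warm-start effect can be illustrated with a plain k-means loop that accepts its initial centroids as an argument: after the window of documents changes slightly, clustering restarts from the previous centroids instead of from scratch. The sketch below is a simplification under stated assumptions: it uses squared Euclidean distance on small dense tuples (text clustering would more likely use cosine similarity on sparse TF-IDF vectors), and the function names and toy points are invented.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm from the given initial centroids; returns (centroids, iterations)."""
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid.
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:              # assignments are stable: converged
            return new, iteration
        centroids = new
    return centroids, max_iter

# Cold start on the first window, from poorly chosen initial centroids.
window1 = [(0, 0), (0, 1), (10, 10), (10, 11)]
cold_centroids, cold_iters = kmeans(window1, [(0, 0), (0, 1)])

# The window slides: one outdated point leaves, one new point arrives.
window2 = [(0, 1), (10, 10), (10, 11), (10, 12)]
warm_centroids, warm_iters = kmeans(window2, cold_centroids)
```

Since consecutive windows share most of their documents, the previous centroids are usually already close to the new optimum, which is why few iterations are needed.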





                     Other incremental methods
        • Incremental SVM
          A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers
          with Online and Active Learning, Journal of Machine Learning Research, vol. 6,
          pp. 1579–1619
        • Incremental perceptron
          www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
        • Incremental winnow
          http://en.wikipedia.org/wiki/Winnow_%28algorithm%29

        Lucca, Oct 2012         Miha Grčar: Text and text stream mining   69
PART I • PART II
INTRO • DACQ • BOW • ML • APP



                                   Where are we?
        Pipeline stages (with a feedback loop):
        • Stream data acquisition
                   – RSS feeds
                   – Boilerplate remover
                   – Language detection
        • Text preprocessing
                   – Online BOW
        • Modeling
                   – Online ML
                   – Incr. NCC
                   – Incr. k-means
                   – Incr. SVM
        • Evaluation / validation
        • Application
                   – Online document space visualization
                   – Online Twitter sentiment classification




                   Document space visualization




        [Figure: documents mapped from several thousand dimensions to 2D]






                     Document space visualization


        Document corpus → Corpus preprocessing → k-means clustering →
        {Neighborhoods computation, Stress majorization} → Least-squares interpolation → Layout







                   Document space visualization







                                 Document space visualization
        Online version of the pipeline:
        • Document corpus
        • Corpus preprocessing (online BOW)
        • k-means clustering (warm start)
        • Neighborhoods computation (maintaining sorted lists)
        • Stress majorization (warm start)
        • Least-squares interpolation (warm start)
        • Layout

        Across the pipeline: parallelization, pipelining







                   Document space visualization



                     This video is available at http://first.ijs.si/tutorial/video/ameba.html







                                      Twitter

                                                • Platform for sending
                                                  short messages
                                                  (similar to SMS)
                                                • Est. 225 million users
                                                • 100 million accounts
                                                  added in 2010
                                                • 65 million tweets per day





                                Financial tweets
                                               • Informal $ sign convention
                                               • Some examples (March 19):
                                                       –   User#1: $AAPL is making an announcement at 9am
                                                           on what it plans to do with its 97 billion in cash.We
                                                           expect a dividend announcement
                                                       –   User#2: $AAPL over 600.00 a share in the pre-market
                                                           on news of a dividend.
                                                       –   User#3: Will there be any other news besides $AAPL
                                                           dividend?

                                               • We acquire ~13,000 tweets per
                                                 weekday, for ~1,800 NASDAQ/NYSE
                                                 stocks ($GOOG, $MSFT…)
                                               • We analyze tweets to determine
                                                 whether they contain positive or
                                                 negative vocabulary
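
The informal $ sign convention makes it straightforward to link a tweet to the stocks it mentions. A small sketch, assuming ticker symbols of one to six letters; the pattern and the function name `cashtags` are illustrative choices, not the tutorial's implementation.

```python
import re

# "$" followed by 1-6 letters and a word boundary; amounts such as "$600" do not match.
CASHTAG = re.compile(r"\$([A-Za-z]{1,6})\b")


def cashtags(tweet):
    """Return the upper-cased ticker symbols mentioned in a tweet, in order of appearance."""
    return [symbol.upper() for symbol in CASHTAG.findall(tweet)]


print(cashtags("Will there be any other news besides $AAPL dividend?"))
```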





                              Sentiment classification
              • Labeled documents
                   POS Financial markets are now officially open :)
                   POS market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
                   POS $AAPL : trust me -- AAPL will soar tomorrow
                   NEG Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
                   NEG omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
                   NEG @aekins that's just too bad
                   ...

              • Learn to classify
                         Labeled dataset → Training algorithm → Classification model

              • Classify unlabeled documents
                         Unlabeled dataset → Classification algorithm (using the classification model) → Predictions (labels)

                         Example: "So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be it's last." → NEG
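
The two phases above (train a model on labeled documents, then apply it to unlabeled ones) can be sketched end to end. To keep the example self-contained, a small multinomial Naive Bayes with add-one smoothing stands in for the training algorithm; it is not the classifier used in the tutorial, and the whitespace tokenizer and the toy training texts are assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """Training algorithm: (label, text) pairs -> classification model."""
    word_counts = defaultdict(Counter)   # label -> term -> count
    doc_counts = Counter()               # label -> number of documents
    vocab = set()
    for label, text in labeled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Classification algorithm: model + unlabeled text -> predicted label."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_logp = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        n_words = sum(counts.values())
        logp = math.log(n_docs / total_docs)          # class prior
        for tok in tokenize(text):
            if tok in model["vocab"]:                 # ignore unseen terms
                logp += math.log((counts[tok] + 1) / (n_words + vocab_size))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


model = train([
    ("POS", "markets are up and will soar"),
    ("NEG", "they went bankrupt and closed too bad"),
])
print(classify(model, "the stock will soar"), classify(model, "bankrupt"))
```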




                           Sentiment classification
        • Emoticons & SVM classifier
        • Neutral zone
        • Explanations
        • Accuracy

        Example tweets with positive emoticons:
                Goodnight everyoneeee :) Love yall
                I have a good feeling about today ;)
                ooo the ice cream van is here... yaaaaaay :D
                in the garden in the sun! Just about to fill the pool! happy days! :D
                Finally got JSON in #processing to work. More playing around coming :)

        Example tweets with negative emoticons:
                @oanhLove I hate when that happens... :-/
                No jobs, no money. how in the hell is min wage here 4 f'n clams an hour? :(
                I hate when I have to call and wake people up :(
                I don't have any chalk! :-/ MY CHALKBOARD IS USELESS
                UGHHHHHHHHHHHHHHH.. life is NOT good all the time!!!!!! ;(





                           Sentiment classification
        • Emoticons & SVM classifier
                [Figure: scatter of negative (–) and positive (+) examples]




                           Sentiment classification
        • Neutral zone
                [Figure: the same scatter, with some examples between the negative (–) and positive (+) groups relabeled as neutral (0)]
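
One common way to obtain a neutral zone from a binary classifier is to treat low-magnitude decision scores as neutral. The sketch below assumes the classifier exposes a signed score (positive values meaning positive sentiment); the function name and the `margin` threshold are hypothetical, and the tutorial does not state how its zone is defined.

```python
def label_with_neutral_zone(score, margin=0.5):
    """Map a signed decision score to POS / NEG / NEUTRAL.

    Scores within `margin` of the decision boundary (score 0) are
    considered too uncertain and are labeled NEUTRAL.
    """
    if score > margin:
        return "POS"
    if score < -margin:
        return "NEG"
    return "NEUTRAL"


for s in (1.7, 0.2, -0.1, -2.3):
    print(s, label_with_neutral_zone(s))
```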




                           Sentiment classification
        • Explanations
                Example: “Sovereign debt and unemployment are big issues in EU.”
                Highlighted terms: unemployed, issues, debt, eu
                                   sovereign, big





                           Sentiment classification
        • Accuracy

        Preprocessing steps (X = applied, - = not applied):
                User  = replace usernames with a token
                URL   = replace URLs with a token
                Rep   = remove letter repetition
                Neg   = replace negations with a token
                Excl  = replace exclamation marks with a token
                Quest = replace question marks with a token

        User  URL  Rep  Neg  Excl  Quest   Accuracy   Precision/recall   Average accuracy, 10-fold cross validation
         -     X    X    -    -     -      81.06%     81.32%/81.32%      76.98%
         X     X    X    X    X     X      80.22%     82.08%/78.02%      77.43%
         -     X    X    -    -     X      79.94%     77.78%/84.62%      77.10%
         -     X    X    X    -     -      79.94%     76.70%/86.81%      77.53%
         -     X    X    -    X     -      79.67%     80.79%/78.57%      76.85%
         -     -    X    -    -     -      78.83%     77.60%/81.87%      77.29%
         X     -    X    -    -     -      78.55%     75.86%/84.62%      76.91%
         -     -    -    -    -     -      78.55%     77.78%/80.77%      76.93%
         -     X    X    -    X     X      78.27%     80.23%/75.82%      76.93%
         X     X    X    -    -     -      78.27%     76.53%/82.42%      77.04%
         X     X    X    -    X     X      77.44%     75.12%/82.97%      76.86%
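
The six preprocessing steps compared in the table can be sketched as switchable regular-expression substitutions. Everything specific here is an assumption: the token names, the patterns, the negation word list, and the choice to shorten repeated letters to two. The defaults enable URL replacement and letter-repetition removal, the combination with the highest accuracy in the table.

```python
import re


def preprocess(tweet, usernames=False, urls=True, repetition=True,
               negations=False, exclamations=False, questions=False):
    """Apply the selected normalization steps to one tweet."""
    text = tweet
    if urls:           # done first, since URLs may contain '?' or '@'
        text = re.sub(r"https?://\S+|www\.\S+", " URL_TOKEN ", text)
    if usernames:
        text = re.sub(r"@\w+", " USER_TOKEN ", text)
    if repetition:     # "yaaaaaay" -> "yaay": runs of 3+ identical letters become 2
        text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)
    if negations:
        text = re.sub(r"\b(?:not|no|never|cannot)\b|\b\w+n't\b", " NOT_TOKEN ",
                      text, flags=re.I)
    if exclamations:
        text = re.sub(r"!+", " EXCL_TOKEN ", text)
    if questions:
        text = re.sub(r"\?+", " QUEST_TOKEN ", text)
    return " ".join(text.split())    # collapse the extra whitespace


print(preprocess("Goodnight everyoneeee :) http://t.co/abc"))
```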




        Chart legend:
        • Grey: Netflix stock closing price
        • Blue: the number of positive tweets
        • Red: the number of negative tweets
        • Yellow: the difference between the positive and negative tweets
        • Green dots: relevant events concerning Netflix




        Chart annotations:
        • First-quarter earnings release
        • Plans to launch in 43 countries in Latin America and the Caribbean
        • Netflix loses TV shows and films, Netflix loses the Starz deal
        • Volume peaks likely represent important events
        Chart annotation:
        • Sentiment cross-over happens before price plunge
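
A sentiment cross-over of this kind can be detected from daily tweet counts by watching the sign of the positive-minus-negative difference. A minimal sketch, assuming one positive and one negative count per day; the function name and the sample numbers are made up for illustration.

```python
def downward_crossovers(pos_counts, neg_counts):
    """Return the day indices at which (positive - negative) turns from > 0 to <= 0."""
    diffs = [p - n for p, n in zip(pos_counts, neg_counts)]
    return [t for t in range(1, len(diffs))
            if diffs[t - 1] > 0 and diffs[t] <= 0]


# Hypothetical daily counts: sentiment turns negative on day 3.
positive = [10, 12, 9, 4, 3]
negative = [5, 6, 8, 7, 9]
print(downward_crossovers(positive, negative))
```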







                                Presidential elections                         http://predsedniskevolitve.si








                                                Recap
        • Basics
                   – What is text stream mining?
                   – Pipelining, parallelization
                   – Web data acquisition
                   – Online BOWs
        • Machine learning
                   – Batch, incremental, offline, online
                   – Incremental nearest centroid classifier
                   – Incremental k-means
                   – Warm start
        • Applications
                   – Online document space visualization
                   – Online Twitter sentiment classifier
                              • Stock sentiment monitoring
                              • Presidential elections

        Lucca, Oct 2012                 Miha Grčar: Text and text stream mining                        88

Mais conteúdo relacionado

Mais procurados

خدمات المكتبات : رؤية للمهنية
خدمات المكتبات : رؤية للمهنية خدمات المكتبات : رؤية للمهنية
خدمات المكتبات : رؤية للمهنية Mohamed Mahdy
 
Տարեկան հաշվետվություն
Տարեկան հաշվետվությունՏարեկան հաշվետվություն
Տարեկան հաշվետվությունdavtyansusanna
 
Изобретенията на Леонардо Да Винчи
Изобретенията на Леонардо Да ВинчиИзобретенията на Леонардо Да Винчи
Изобретенията на Леонардо Да ВинчиГеорги Петров
 
แผนการเรียนรู้งานช่าง 2
แผนการเรียนรู้งานช่าง 2แผนการเรียนรู้งานช่าง 2
แผนการเรียนรู้งานช่าง 2Utsani Yotwilai
 
المكتبة الرقمية
 المكتبة الرقمية المكتبة الرقمية
المكتبة الرقميةHeyam hayek
 
Adjektiivien vertailu kertaus
Adjektiivien vertailu kertausAdjektiivien vertailu kertaus
Adjektiivien vertailu kertausOlli Eloranta
 
Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...
Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...
Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...abedelaziz benzine
 
բնապահպանական խնդիրներ
բնապահպանական խնդիրներբնապահպանական խնդիրներ
բնապահպանական խնդիրներEduard Bakoyan
 
Droga i psihofizičko zdravlje
Droga i psihofizičko zdravljeDroga i psihofizičko zdravlje
Droga i psihofizičko zdravljeSimonida Vukobrat
 
อาเซียนศึกษา
อาเซียนศึกษาอาเซียนศึกษา
อาเซียนศึกษาArt Nan
 
เรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdf
เรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdfเรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdf
เรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdfpeter dontoom
 
เอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดี
เอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดีเอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดี
เอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดีtearchersittikon
 
วิถีชีวิตคนเมืองในประเทศไทย
วิถีชีวิตคนเมืองในประเทศไทยวิถีชีวิตคนเมืองในประเทศไทย
วิถีชีวิตคนเมืองในประเทศไทยFURD_RSU
 
العوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلوماتالعوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلوماتالدكتور طلال ناظم الزهيري
 
Filogeneza i ontogeneza
Filogeneza i ontogenezaFilogeneza i ontogeneza
Filogeneza i ontogenezaMiconi doo
 

Mais procurados (20)

خدمات المكتبات : رؤية للمهنية
خدمات المكتبات : رؤية للمهنية خدمات المكتبات : رؤية للمهنية
خدمات المكتبات : رؤية للمهنية
 
Տարեկան հաշվետվություն
Տարեկան հաշվետվությունՏարեկան հաշվետվություն
Տարեկան հաշվետվություն
 
Изобретенията на Леонардо Да Винчи
Изобретенията на Леонардо Да ВинчиИзобретенията на Леонардо Да Винчи
Изобретенията на Леонардо Да Винчи
 
แผนการเรียนรู้งานช่าง 2
แผนการเรียนรู้งานช่าง 2แผนการเรียนรู้งานช่าง 2
แผนการเรียนรู้งานช่าง 2
 
Emocije dajana
Emocije dajanaEmocije dajana
Emocije dajana
 
المكتبة الرقمية
 المكتبة الرقمية المكتبة الرقمية
المكتبة الرقمية
 
Adjektiivien vertailu kertaus
Adjektiivien vertailu kertausAdjektiivien vertailu kertaus
Adjektiivien vertailu kertaus
 
Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...
Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...
Raport de stage-تقرير تربص في مجال علم المكتبات-بن الزين عبد العزيز--منهجية ا...
 
ตรุษจีน
ตรุษจีนตรุษจีน
ตรุษจีน
 
Psihologija тт
Psihologija ттPsihologija тт
Psihologija тт
 
բնապահպանական խնդիրներ
բնապահպանական խնդիրներբնապահպանական խնդիրներ
բնապահպանական խնդիրներ
 
Droga i psihofizičko zdravlje
Droga i psihofizičko zdravljeDroga i psihofizičko zdravlje
Droga i psihofizičko zdravlje
 
อาเซียนศึกษา
อาเซียนศึกษาอาเซียนศึกษา
อาเซียนศึกษา
 
เรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdf
เรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdfเรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdf
เรื่องรายงานวิจัยในชั้นเรียนสีไม้64.pdf
 
เอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดี
เอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดีเอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดี
เอกสารประกอบการสอนอิเล็กทรอนิกส์ขั้นพื้นฐาน โดย อ.นาถวดี
 
วิถีชีวิตคนเมืองในประเทศไทย
วิถีชีวิตคนเมืองในประเทศไทยวิถีชีวิตคนเมืองในประเทศไทย
วิถีชีวิตคนเมืองในประเทศไทย
 
แผ่นพับ
แผ่นพับแผ่นพับ
แผ่นพับ
 
Moralna ocecanja
Moralna ocecanjaMoralna ocecanja
Moralna ocecanja
 
العوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلوماتالعوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
العوامل المؤثرة في كفاءة عمليات استرجاع المعلومات
 
Filogeneza i ontogeneza
Filogeneza i ontogenezaFilogeneza i ontogeneza
Filogeneza i ontogeneza
 

Destaque

Towards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information VisualizationTowards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information VisualizationNiklas Elmqvist
 
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12cProcessing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12cGuido Schmutz
 
Web 2 0 Projects Elementary
Web 2 0 Projects ElementaryWeb 2 0 Projects Elementary
Web 2 0 Projects ElementaryCinci0987
 
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...Jonas Traub
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Presentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecPresentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecTiago Henriques
 
An Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical VisualizationAn Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical VisualizationNoeska Smit
 
What Is Visualization?
What Is Visualization?What Is Visualization?
What Is Visualization?OneSpring LLC
 
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Mia Yuan Cao
 
Theius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop ClustersTheius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop Clustersjtedesco5
 
Information Visualization for Medical Informatics
Information Visualization for Medical Informatics Information Visualization for Medical Informatics
Information Visualization for Medical Informatics University of Maryland
 
Info vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneidermanInfo vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneidermanUniversity of Maryland
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan confluent
 

Destaque (14)

Towards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information VisualizationTowards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information Visualization
 
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12cProcessing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
 
Web 2 0 Projects Elementary
Web 2 0 Projects ElementaryWeb 2 0 Projects Elementary
Web 2 0 Projects Elementary
 
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Presentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecPresentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresec
 
An Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical VisualizationAn Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical Visualization
 
What Is Visualization?
What Is Visualization?What Is Visualization?
What Is Visualization?
 
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
 
Theius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop ClustersTheius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop Clusters
 
Information Visualization for Medical Informatics
Information Visualization for Medical Informatics Information Visualization for Medical Informatics
Information Visualization for Medical Informatics
 
Info vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneidermanInfo vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneiderman
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan
 

Text and text stream mining tutorial

  • 1. Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu Text Mining and Text Stream Mining Tutorial Miha Grčar miha.grcar@ijs.si Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana http://kt.ijs.si
  • 2. Text and text stream mining tutorial • Part I: Text mining • Part II: Text stream mining Lucca, Oct 2012 Miha Grčar: Text and text stream mining 2
  • 3. Part I: Text mining
  • 4. What is text mining? • Text mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from large collections of textual documents • Text mining adopts and adapts methodologies and tools from data mining (DM), machine learning (ML), information retrieval (IR), natural language processing (NLP), visualization, social network analysis and graph mining, knowledge management, …
  • 5. Typical text mining process [Diagram: stages Data acquisition, Text pre-processing, Modeling, Evaluation / validation, and Application, connected by a feedback loop. Stage labels: Acquisition, Cleaning, Transformation; Discover, Extract, and Organize knowledge; Performance and utility assessment; Presentation, Interaction.]
  • 6. What do we cover in Part 1? [Diagram: the text mining process annotated with the covered topics. Text pre-processing: vector space model (bags-of-words). Modeling: machine learning, classification, clustering. Evaluation / validation: cross validation, precision, recall, … Application: search & browse, categorization, recommendation, advertising, spam detection, summarization, visualization, …]
  • 7. Bags of words • Tokenize • Remove stop words. Example: "The quick brown dog jumps over the lazy dog." → the | quick | brown | dog | jumps | over | the | lazy | dog
  • 8. Bags of words • Tokenize • Remove stop words • Lemmatize • Compute weights. Example: "The quick brown dog jumps over the lazy dog." → the | quick | brown | dog | jumps | over | the | lazy | dog; lemmatize jumps → jump; weights: quick 1, brown 1, dog 2, jump 1, lazy 1
  • 9. Bags of words: tokenization & stop word removal
    Original text: After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
    Simple tokenizer (alphanumeric strings only): After | ripping | 14 | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren | t | new | concerns | but | that | doesn | t | mean | they | aren | t | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
  • 10. Bags of words: tokenization & stop word removal
    Original text: After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
    Regex tokenizer ([\p{L}']+): After | ripping | higher | from | June | until | the | first | week | of | October | stocks | ran | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concerns | but | that | doesn't | mean | they | aren't | real | Investors | suddenly | care | and | are | behaving | accordingly | selling | some | of | their | more | aggressive | names | and | rotating | into | defensives
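The two tokenizers on slides 9 and 10 can be sketched in Python as follows. This is a minimal illustration, not the tutorial's implementation: Python's built-in `re` module has no `\p{L}` class, so the letter class is approximated with `[^\W\d_]`, and the `STOP_WORDS` set is a tiny hypothetical list chosen only for the example.

```python
import re

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "to", "and", "but", "that", "into", "over"}

def simple_tokenize(text):
    # Runs of ASCII letters and digits; an apostrophe splits a token in two.
    return re.findall(r"[A-Za-z0-9]+", text)

def letter_tokenize(text):
    # Approximates [\p{L}']+ : runs of letters or apostrophes.
    # [^\W\d_] is "a word character that is neither a digit nor an underscore".
    return re.findall(r"(?:[^\W\d_]|')+", text)

def remove_stop_words(tokens):
    # Case-insensitive lookup so that "The" is removed as well as "the".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

text = "Europe, China, the fiscal cliff, etc aren't new concerns, up 14%."
print(simple_tokenize(text))
print(letter_tokenize(text))
print(remove_stop_words(letter_tokenize(text)))
```

The difference between the two patterns shows up on contractions and numbers: the simple tokenizer splits "aren't" into "aren" and "t" and keeps "14", while the letter-based one keeps "aren't" whole and drops "14".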
  • 11. Bags of words: lemmatization
    Original text: After ripping 14% higher from June until the first week of October, stocks ran headfirst into a wall of worry seemingly too large to climb. Europe, China, the fiscal cliff, etc aren't new concerns but that doesn't mean they aren't real. Investors suddenly care and are behaving accordingly, selling some of their more aggressive names and rotating into defensives.
    Lemmatized: After | rip | high | from | June | until | the | first | week | of | October | stock | run | headfirst | into | a | wall | of | worry | seemingly | too | large | to | climb | Europe | China | the | fiscal | cliff | etc | aren't | new | concern | but | that | doesn't | mean | they | aren't | real | Investor | suddenly | care | and | are | behave | accordingly | sell | some | of | their | more | aggressive | name | and | rotate | into | defensive
  • 12. Bags of words: lemmatization (Italian example)
    Original text: È uno dei punti più contestati della legge di Stabilità approvata da poco dal governo: il taglio alle detrazioni fiscali, ossia gli "sconti" che ogni contribuente può vantare sulla propria dichiarazione dei redditi. Secondo una bozza aggiornata del disegno di legge, il taglio si applicherebbe a decorrere dal periodo di imposta al 31 dicembre 2012. Un dettaglio che aveva creato, nei giorni scorsi, non poche polemiche.
    Lemmatized: E | uno | dei | puntare | più | contestato | della | legge | di | Stabilità | approvare | da | poco | dal | governo | il | tagliare | alle | detrazione | fiscale | ossia | gli | scontare | che | ogni | contribuire | può | vantare | sulla | proprio | dichiarazione | dei | reddito | Secondo | una | bozzare | aggiornare | del | disegnare | di | legge | il | tagliare | si | applicare | a | decorrere | dal | periodare | di | impostare | al | dicembre | Un | dettagliare | che | aveva | creare | nei | giorno | scorrere | non | poca | polemico
  • 13. Computing weights • TF (term frequency): the number of times a lemma (stem) occurs in a document • DF (document frequency): the number of documents in which a lemma (stem) occurs at least once • TFIDF: higher TF means higher TFIDF; higher DF means lower TFIDF
  • 14. Computing weights. Document: "The quick brown dog jumps over the lazy dog."
             TF  DF  IDF  TFIDF
    quick     1   1    0      0
    brown     1   1    0      0
    dog       2   1    0      0
    jump      1   1    0      0
    lazy      1   1    0      0
  • 15. Computing weights. Documents: "The quick brown dog jumps over the lazy dog." and "jump". Weights for the first document:
             TF  DF   IDF  TFIDF
    quick     1   1  0.69   0.69
    brown     1   1  0.69   0.69
    dog       2   1  0.69   1.39
    jump      1   2     0      0
    lazy      1   1  0.69   0.69
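The numbers on slides 14 and 15 are consistent with the common variant TFIDF = TF × ln(N / DF), where N is the number of documents (ln 2 ≈ 0.69). The slides' own formula did not survive extraction, so that formula is an assumption here; the sketch below uses it and reproduces the slide 15 values for the two-document example. The function name `tfidf` is illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists, already lemmatized with stop words removed.
    # Returns one {term: weight} dict per document, weight = TF * ln(N / DF).
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["quick", "brown", "dog", "jump", "lazy", "dog"],  # "The quick brown dog jumps over the lazy dog."
    ["jump"],                                          # the second document
]
for term, weight in tfidf(docs)[0].items():
    print(f"{term:6s} {weight:.2f}")
```

With a single document every IDF is ln(1/1) = 0, which is why all TFIDF values on slide 14 are zero; adding the second document makes "jump" the only term with DF = 2 and therefore the only one whose weight stays at zero.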
  • 16. Cosine similarity [Figure: document vectors d1 and d2 drawn from the origin 0.]
  • 17. Cosine similarity [Figure: document vectors d1 and d2 with their counterparts d1' and d2', drawn from the origin 0 with an axis mark at 1.]
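Cosine similarity is the cosine of the angle between two document vectors, i.e. their dot product divided by the product of their lengths. A minimal sketch over sparse `{term: weight}` dictionaries follows; the representation and the helper names are assumptions made for illustration. It also shows that once both vectors are scaled to unit length, the similarity reduces to a plain dot product.

```python
import math

def norm(v):
    # Euclidean length of a sparse vector.
    return math.sqrt(sum(w * w for w in v.values()))

def normalize(v):
    # Scale to unit length; an all-zero vector is returned unchanged.
    n = norm(v)
    return {t: w / n for t, w in v.items()} if n else dict(v)

def dot(a, b):
    # Iterate over the shorter vector; terms missing from the other count as 0.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cosine(a, b):
    na, nb = norm(a), norm(b)
    return dot(a, b) / (na * nb) if na and nb else 0.0

d1 = {"quick": 0.69, "dog": 1.39}
d2 = {"dog": 0.69, "lazy": 0.69}
print(cosine(d1, d2))
print(dot(normalize(d1), normalize(d2)))  # same value, up to rounding
```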
  • 18. Centroids • Determine characteristic words in a cluster • Nearest centroid classifier • k-means clustering • …
  • 19. Where are we? [Diagram: the text mining process overview from slide 6, repeated.]
  • 20. Machine learning • Machine learning is concerned with the development of algorithms that allow computer programs to learn from past experience [Mitchell] • Machine learning refers to a collection of algorithms that take as input empirical data (e.g., from databases or sensors) and try to discover some characteristics (rules, constraints, patterns, features) of the process that generated the data [Wikipedia] • Learning from past experience = learning from past examples • Examples (instances) = document vectors (normalized sparse vectors)
  • 21. Machine learning • We will look at two commonly used machine learning techniques
    Classification: assigning instances (documents) to two or more predefined (discrete) classes; a supervised learning method
    Clustering: arranging instances (documents) into groups (clusters) so that instances in the same group are more similar to each other than to those in other groups; an unsupervised learning method
  • 22. Classification
    • Labeled documents:
        Mergers & Acquisitions: Ingram Wraps Up Brightpoint Buyout
        Mergers & Acquisitions: State Street completes acquisition of Goldman Sachs Administration Services
        Economy & Government: Gasoline fuels inflation, but Fed policy seen steady
        Economy & Government: Euro Leads Majors Higher as Spanish Bailout Looks Increasingly Likely
        ...
        Investing Picks: Smith & Wesson Holding Corp. Enters Oversold Territory
        Investing Picks: The Fresh Market: A Strong Buy
    • Learn to classify: Labeled dataset → Training algorithm → Classification model
    • Classify unlabeled documents: Unlabeled dataset → Classification algorithm (using the classification model) → Predictions (labels)
        Example: "Fresh Del Monte Produce Inc. Enters Oversold Territory" → Investing Picks
  • 23. Classification with k-Nearest Neighbors [Figure: documents labeled Investing Picks, Mergers & Acquisitions, and Economy & Government; votes among the nearest neighbors: Investing Picks 4, Mergers & Acquisitions 1, Economy & Government 0.]
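The vote counting on slide 23 can be sketched as below: rank the labeled documents by similarity to the query, keep the k most similar, and return the majority label. Cosine similarity as the distance measure, the toy vectors, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, labeled, k=5):
    # labeled: list of (vector, label) pairs. There is no training step;
    # the "model" is the whole labeled set, which is why it is large.
    ranked = sorted(labeled, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes

# Hypothetical bag-of-words vectors loosely based on the slide 22 headlines.
labeled = [
    ({"oversold": 1.0, "territory": 1.0}, "Investing Picks"),
    ({"strong": 1.0, "buy": 1.0}, "Investing Picks"),
    ({"acquisition": 1.0, "buyout": 1.0}, "Mergers & Acquisitions"),
    ({"inflation": 1.0, "fed": 1.0}, "Economy & Government"),
]
query = {"enters": 1.0, "oversold": 1.0, "territory": 1.0}
label, votes = knn_classify(query, labeled, k=3)
print(label, dict(votes))
```

Every classification compares the query against all stored examples, which matches the "slow classification, big model" entries for k-NN in the comparison on slide 26.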
  • 24. Classification with Nearest Centroid Classifier [Figure: the query document compared with the class centroids; s1 = similarity to Investing Picks, s2 = similarity to Mergers & Acquisitions, s3 = similarity to Economy & Government.] Similarity: s2 > s1 > s3, so the classes are ranked Mergers & Acquisitions, Investing Picks, Economy & Government.
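A nearest centroid classifier keeps one averaged vector per class and assigns a new document to the class whose centroid is most similar. The sketch below assumes sparse `{term: weight}` vectors and cosine similarity; `train_centroids`, `classify`, and `top_terms` are illustrative names. The last helper shows the "characteristic words" use of centroids mentioned on slide 18.

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Component-wise mean of a non-empty list of sparse vectors.
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def train_centroids(labeled):
    # The whole model is one centroid per class, hence the small model size.
    by_label = defaultdict(list)
    for vec, label in labeled:
        by_label[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def classify(query, centroids):
    return max(centroids, key=lambda label: cosine(query, centroids[label]))

def top_terms(c, n=3):
    # The highest-weighted terms of a centroid characterize its class or cluster.
    return sorted(c, key=c.get, reverse=True)[:n]

# Hypothetical training vectors for illustration.
labeled = [
    ({"oversold": 1.0, "territory": 1.0}, "Investing Picks"),
    ({"strong": 1.0, "buy": 1.0}, "Investing Picks"),
    ({"acquisition": 1.0, "buyout": 1.0}, "Mergers & Acquisitions"),
]
model = train_centroids(labeled)
print(classify({"oversold": 1.0}, model))
```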
  • 25. Classification with Support Vector Machine (SVM) [Figure: Investing Picks and Mergers & Acquisitions documents, with a quantity labeled w.] • Maximize w • Minimize tradeoff
  • 26. Classification algorithms
                            k-NN   Nearest centroid   SVM (linear kernel)
    Multiclass?             yes    yes                no
    Explains decisions?     no     yes                yes
    Explains model?         no     yes                yes
    Number of parameters    1      0                  1
    Model size              big    small              small
    Training speed          0      fast               slow
    Classification speed    slow   fast               fast
    Accuracy (on texts)     low    medium             high
  • 27. Clustering
  • 28. Clustering • k-means clustering • Agglomerative hierarchical clustering
  • 29. k-means clustering. Input: k. Output: k clusters (and their centroids).
    1. Randomly select k instances for initial centroids
    2. Assign step: assign each instance to the nearest centroid
    3. If the assignments did not change, end the algorithm
    4. Update step: recompute (update) centroids
    5. Repeat from Step 2
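The five steps on slide 29 map directly onto a short loop. The sketch below is one possible reading under stated assumptions: "nearest" is taken to mean highest cosine similarity, an empty cluster keeps its previous centroid, and the optional `init` parameter (not part of the slide) lets the caller fix the initial centroids so that a run is reproducible.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, k, init=None, seed=0, max_iter=100):
    # Step 1: pick k instances as initial centroids (random unless init is given).
    centroids = list(init) if init is not None else random.Random(seed).sample(vectors, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2 (assign): each instance goes to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Step 3: stop when nothing moved.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4 (update): recompute centroids; an empty cluster keeps its old one.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
        # Step 5: loop back to the assign step.
    return assignment, centroids

vectors = [
    {"dog": 1.0}, {"dog": 1.0, "puppy": 1.0},
    {"stock": 1.0}, {"stock": 1.0, "market": 1.0},
]
assignment, _ = kmeans(vectors, 2, init=[vectors[0], vectors[2]])
print(assignment)  # [0, 0, 1, 1]
```

With random initialization the result depends on which instances are drawn, so in practice the algorithm is often run several times with different seeds.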
  • 30. k-means clustering. This video is available at http://first.ijs.si/tutorial/video/kmeans.html
  • 31. Agglomerative hierarchical clustering
    1. Find the two most similar instances
    2. Connect them
    3. Replace them with their centroid
    4. Repeat …
    The result is a "dendrogram".
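The merge loop on slide 31 can be sketched as follows. Assumptions made for illustration: similarity is cosine similarity, the centroid of a merged pair is the mean of all original instances underneath it, and the dendrogram is returned as nested tuples of instance indices. The pairwise search is quadratic per merge, so this is only suitable for small collections.

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def agglomerate(vectors):
    # Each cluster is (tree, member_vectors, centroid); leaves are instance indices.
    clusters = [(i, [v], v) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        # 1. Find the two most similar clusters (compared via their centroids).
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: cosine(clusters[p[0]][2], clusters[p[1]][2]),
        )
        tree_a, members_a, _ = clusters[i]
        tree_b, members_b, _ = clusters[j]
        members = members_a + members_b
        # 2. Connect them, 3. replace them with their centroid.
        merged = ((tree_a, tree_b), members, centroid(members))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        # 4. Repeat until a single tree remains.
    return clusters[0][0]

vectors = [
    {"dog": 1.0}, {"dog": 1.0, "puppy": 1.0},
    {"stock": 1.0}, {"stock": 1.0, "market": 1.0},
]
print(agglomerate(vectors))  # ((0, 1), (2, 3))
```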
  • 32. Where are we? [Diagram: the text mining process overview from slide 6, repeated.]
  • 33. Evaluation
    • Cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29): 10-fold cross validation; stratified
    • Accuracy
    • Precision, recall, F1 (http://en.wikipedia.org/wiki/Precision_and_recall | http://en.wikipedia.org/wiki/F1_Score)
    • Micro and macro-averaging (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html | http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization)
    • Statistical tests (http://en.wikipedia.org/wiki/Statistical_hypothesis_testing)
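As a small worked example of the measures listed on slide 33, the sketch below computes per-class precision, recall, and F1 for a single-label classifier and then averages them in the two usual ways. Macro-averaging takes the mean of the per-class scores, so every class counts equally; micro-averaging pools the true positive, false positive, and false negative counts first, so frequent classes dominate. Function names and labels are illustrative.

```python
def per_class_counts(y_true, y_pred):
    # For each label: true positives, false positives, false negatives.
    counts = {
        label: {"tp": 0, "fp": 0, "fn": 0}
        for label in sorted(set(y_true) | set(y_pred))
    }
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted p, but it was not p
            counts[t]["fn"] += 1  # missed an instance of t
    return counts

def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_average(counts):
    scores = [prf(**c) for c in counts.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

def micro_average(counts):
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return prf(tp, fp, fn)

y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
counts = per_class_counts(y_true, y_pred)
print("macro:", macro_average(counts))
print("micro:", micro_average(counts))
```

In the single-label setting every error is one false positive and one false negative, so micro-averaged precision, recall, and F1 all equal accuracy (3 of 4 correct in this example).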
  • 34. Where are we? [Diagram: the text mining process overview from slide 6, repeated.]
  • 35. Applications
    • Enhanced Web search (SearchPoint)
    • Social browsing (LiveNetLife)
    • Content categorization
    • Content-based recommender systems
    • Advertising
    • Blogging assistance (Zemanta)
    • Spam detection
    • Visualization / summarization of large corpora
    • Text summarization: Leskovec et al. (2005): Extracting Summary Sentences Based on the Document Semantic Graph. Microsoft Research Technical Report MSR-TR-2005-07.
    • Sentiment analysis (demo later)
    • News aggregation: http://emm.newsexplorer.eu
    • Knowledge engineering: http://ontogen.ijs.si
    • …
  • 36. Enhanced Web search (http://www.searchpoint.com)
  • 37. Social browsing (http://www.livenetlife.com) @ http://videolectures.net
  • 38. Content categorization @ http://videolectures.net
  • 39. Recommender system @ http://videolectures.net
  • 40. Contextualized advertising
  • 41. Blogging assistant (http://www.zemanta.com)
  • 42. Pump & dump: Siering, Muntermann, Grčar (2012)
  • 43. Visualizations • Document space visualization • Canyon flows • Tag clouds (http://www.jasondavies.com/wordcloud/)
  • 44. Recap
      • Basics
          – What is text mining?
          – TF-IDF bag-of-words vectors
          – Cosine similarity
          – Centroids
      • Machine learning
          – k-NN
          – Nearest centroid classifier
          – SVM
          – k-means
          – Agglomerative clustering
      • Applications
          – Enhanced Web search (SearchPoint)
          – Social browsing (LiveNetLife)
          – Content categorization
          – Content-based recommender systems
          – Advertising
          – Writing assistance (Zemanta)
          – Spam detection
          – Visualization / summarization of large corpora
      …
  • 45. Part II: Text stream mining
  • 46. What is text stream mining?
      Same as text mining, but on streams: text stream mining provides a set of methodologies and tools for discovering, presenting, and evaluating knowledge from streams of textual documents.
  • 47. Remember: typical text mining process
      Data acquisition → Text preprocessing → Modeling → Application, with Evaluation / validation and a feedback loop
      • Data acquisition: acquisition, cleaning
      • Text preprocessing: transformation
      • Modeling: discover, extract, and organize knowledge
      • Evaluation / validation: performance and utility assessment, feedback loop
      • Application: presentation, interaction
  • 48. Typical text stream mining process
      Stream data acquisition → Text preprocessing → Modeling → Application, with Evaluation / validation and a feedback loop
      • Stream data acquisition: acquisition, cleaning
      • Text preprocessing: transformation
      • Modeling: discover, extract, and organize knowledge
      • Evaluation / validation: performance and utility assessment, obtaining new labels, feedback loop
      • Application: presentation, interaction
  • 49. Text stream mining pipelines
      • Pipelining and parallelization
          – Enables concurrent processing
          – Increases throughput
          – Enables distributed execution (cluster)
      • Near-realtime online systems
          – Stream cannot be paused or slowed down (e.g., newsfeeds)
          – [Near-realtime] Time between reception and utilization of data should be as short as possible
          – [Online] Stream is infinite and (sooner or later) outdated data needs to be deleted
      [Figure labels: Parallelization, Pipelining]
  • 50. What do we cover in Part II?
      Stream data acquisition → Text preprocessing → Modeling → Application, with Evaluation / validation and a feedback loop
      • Stream data acquisition: RSS feeds, boilerplate remover, language detection
      • Text preprocessing: online BOW
      • Modeling: online ML (incr. NCC, incr. k-means, incr. SVM)
      • Application: online document space visualization, online Twitter sentiment classification
  • 51. Text stream acquisition and preprocessing
      [Diagram: several parallel preprocessing pipelines, each RSS reader → Boilerplate remover → Language detector, with load balancing and a sync step feeding the Online BOW component]
  • 52. RSS (Really Simple Syndication)
  • 53. RSS (Really Simple Syndication)
      <rss version="2.0">
        <channel>
          <generator>NFE/1.0</generator>
          <title>Top Stories - Google News</title>
          <link>http://news.google.com/news?pz=1&amp;ned=us&amp;hl=en</link>
          <language>en</language>
          <webMaster>news-feedback@google.com</webMaster>
          <copyright>&amp;copy;2011 Google</copyright>
          <item>
            <title>Egypt Analysts Comment on Next Steps After Mubarak’s Ouster - Bloomberg</title>
            <link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNEF9B7Q8C7_TBDKPEMFjb83fcuNfQ&amp;url=http://www.bloomberg.com/news/2011-02-11/egypt-analysts-comment-on-next-steps-after-mubarak-s-ouster.html</link>
            <category>Top Stories</category>
            <pubDate>Fri, 11 Feb 2011 20:15:40 GMT+00:00</pubDate>
            <description>The ouster of Hosni Mubarak from Egypt’s presidency today, after protests that started Jan. 25, prompted the following comments from analysts: “The army needs to move quickly to remove obstacles to ...</description>
          </item>
          ...
        </channel>
      </rss>
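An RSS reader component mostly needs the title, link, publication date, and description of each item in a feed like the one above. A minimal sketch using Python's standard library follows; `SAMPLE_RSS` and `parse_items` are hypothetical names, and fetching the feed over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# A small stand-in feed with the same structure as the slide's example.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/a?x=1&amp;y=2</link>
      <pubDate>Fri, 11 Feb 2011 20:15:40 GMT</pubDate>
      <description>Short summary ...</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> element found in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),  # &amp; is unescaped by the parser
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["published"], entry["title"], entry["link"])
```

The description usually holds only a teaser, so a real reader would follow the link and pass the downloaded page on to boilerplate removal.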
  • 54. Text stream acquisition and preprocessing [diagram as on slide 51]
  • 55. Boilerplate removal (example page: http://www.bbc.co.uk/news/world-us-canada-15051554)
  • 56. Boilerplate removal: URL tree
      URL structure:  protocol :// domain / path / file ? query
      Example:        http:// kt.ijs.si /a/b/ c.html ?pg=0
      Tree branch:    # → si → ijs → kt → a → b   (root, then domain, then path)
      Example:        http://www.bbc.co.uk/news/world-us-canada-15051554 gives # → uk → co → bbc → www → news
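The mapping from a URL to its tree branch can be sketched as follows. The function name `url_branch` is hypothetical; the rule (reverse the domain labels, keep the path directories, drop the file and the query) is read off the two examples on the slide.

```python
from urllib.parse import urlsplit


def url_branch(url):
    """Map a URL to its branch in the URL tree: root, reversed domain, path."""
    parts = urlsplit(url)
    # kt.ijs.si -> si, ijs, kt (most general label first)
    domain = [label for label in reversed((parts.hostname or "").split(".")) if label]
    segments = [s for s in parts.path.split("/") if s]
    # The last segment is the file unless the path ends with a slash.
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]
    return ["#"] + domain + segments


print(" -> ".join(url_branch("http://kt.ijs.si/a/b/c.html?pg=0")))
print(" -> ".join(url_branch("http://www.bbc.co.uk/news/world-us-canada-15051554")))
```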
  • 57. Boilerplate removal: URL tree
      “How many times did I see ‘About Us’ in this part of the tree?”
      [Diagram labels: Stream, Root (#), Domain, Path]
      This method is …
      • Unsupervised
      • Online
      • Incremental (consumes one document at a time)
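The slide gives only the guiding question, so the following is a sketch under stated assumptions, not the method used in the tutorial: each text block is counted at every node along the document's tree branch, and a block is treated as boilerplate once it has already been seen `min_count` times at the document's own node. The class name, the threshold, and the decision rule are all hypothetical.

```python
from collections import Counter, defaultdict


class BoilerplateCounter:
    """Counts how often each text block was seen in each part of the URL tree."""

    def __init__(self, min_count=3):
        self.min_count = min_count        # assumed threshold, not from the slides
        self.seen = defaultdict(Counter)  # tree node (branch prefix) -> block -> count

    def process(self, branch, blocks):
        """Consume one document; return the blocks not judged to be boilerplate."""
        node = tuple(branch)
        kept = []
        for block in dict.fromkeys(blocks):  # count each block once per document
            if self.seen[node][block] < self.min_count:
                kept.append(block)
            # Update the counts at the root, every domain level, and every path level.
            for depth in range(1, len(node) + 1):
                self.seen[node[:depth]][block] += 1
        return kept


counter = BoilerplateCounter(min_count=2)
branch = ["#", "uk", "co", "bbc", "www", "news"]
for blocks in (["About Us", "Story one"], ["About Us", "Story two"], ["About Us", "Story three"]):
    print(counter.process(branch, blocks))
```

Because counts are updated one document at a time and no labels are needed, the sketch shares the three properties listed on the slide: unsupervised, online, incremental.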
  • 58. Text stream acquisition and preprocessing [diagram as on slide 51]
  • 59. Language detection
      • Motivation: language-specific text analysis components and applications
      • Solutions based on word lists and word or character sequences (n-grams)
      • Character n-gram model
          – Build character n-gram histograms for many languages (language models)
          – Compare text document histogram to language models
  • 60. Language detection: most frequent character n-grams by rank
      Rank   English   German
       1     E         E
       2     T         N
       3     O         R
       4     A         I
       5     N         T
       6     I         S
       7     H         A
       8     S         D
       9     R         U
      10     D         EN
      11     E_        G
      12     L         ER
      13     _T        H
      14     TH        L
      15     HE        N_
      16     U         O
      17     W         M
      18     C         _D
      19     M         C
      ...    ...       ...
      [Callouts: THE (English); DER, DEN (German)]
  • 61. Language detection: article “Egypt rejoices at Mubarak departure”
      [Figure: two scatter plots of the English article's n-gram ranks (y-axis), one against the English language model n-gram ranks and one against the German language model n-gram ranks (x-axis)]
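The character n-gram approach from the language detection slides can be sketched as below. The slides say only that the document histogram is compared to the language models; the rank-based "out-of-place" distance used here is one common choice and is an assumption, as are the function names and the tiny training texts.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the most frequent character n-grams (1..max_n); '_' marks word boundaries."""
    letters = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    padded = "_" + "_".join(letters.split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip("_"):  # skip n-grams made of boundary marks only
                counts[gram] += 1
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked, start=1)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(abs(rank - lang_profile[g]) if g in lang_profile else penalty
               for g, rank in doc_profile.items())


def detect_language(text, models):
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


# Toy language models; real ones would be built from large corpora.
EN_TEXT = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
DE_TEXT = "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte"
MODELS = {"en": ngram_profile(EN_TEXT), "de": ngram_profile(DE_TEXT)}
print(detect_language("the dog and the fox", MODELS))
```

A document written in the model's language yields ranks close to the diagonal in plots like those on slide 61, and therefore a small distance.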
  • 62. Text stream acquisition and preprocessing [diagram as on slide 51]
  • 63. Online BOW
      [Diagram: documents from the stream enter a queue of TF vectors and leave it when outdated; the DF values are updated on every add and remove]
  • 64. Online BOW
      [Diagram: TF vectors from the queue are combined with the DF values to give TF-IDF vectors]
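The online BOW diagrams can be sketched as a queue of TF vectors plus a table of DF values that is kept in step with the queue. This is an illustration under assumptions: the window is bounded by document count (the slides only say "outdated", which may well be time-based), the weighting is the plain tf · log(N/df) variant, and the class name is hypothetical.

```python
import math
from collections import Counter, deque


class OnlineBow:
    """Sliding window of TF vectors with incrementally maintained DF values."""

    def __init__(self, max_docs):
        self.max_docs = max_docs  # assumed count-based window
        self.queue = deque()      # TF vectors currently in the window
        self.df = Counter()       # term -> number of window documents containing it

    def add(self, tokens):
        tf = Counter(tokens)
        self.queue.append(tf)
        self.df.update(tf.keys())          # "Add" arrow: raise DF for the new document
        if len(self.queue) > self.max_docs:
            outdated = self.queue.popleft()
            for term in outdated:          # "Remove" arrow: lower DF for the old one
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]
        return tf

    def tfidf(self, tf):
        """Weight a TF vector with the current DF values."""
        n = len(self.queue)
        return {t: f * math.log(n / self.df[t]) for t, f in tf.items() if self.df.get(t)}


bow = OnlineBow(max_docs=2)
for doc in (["a", "b"], ["a", "c"], ["c", "d"]):
    latest = bow.add(doc)
print(bow.tfidf(latest))
```

TF-IDF vectors are computed on demand because every add or remove changes the DF values, and with them the weights of all vectors still in the queue.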
  • 65. Where are we? [diagram as on slide 50]
  • 66. Batch, incremental, offline, online
      • Batch learning: consuming all training examples at once
      • Incremental learning: consuming one example at a time
      • Mini-batch learning: consuming several examples at a time
      • Offline learning (for datasets/finite streams): all data is stored and can be accessed repeatedly
      • Online learning (for infinite streams): each example is discarded after being processed
  • 67. Incremental nearest centroid classifier [figure labels: Outdated instance, New instance]
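One way to make a nearest centroid classifier incremental is to keep a running sum of the vectors of each class, so that a new instance is added and an outdated instance is subtracted without revisiting the rest of the data. The sketch below assumes sparse vectors stored as dicts and cosine similarity; the class and method names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


class IncrementalNCC:
    def __init__(self):
        self.sums = defaultdict(lambda: defaultdict(float))  # class -> summed vector
        self.counts = defaultdict(int)                       # class -> number of instances

    def add(self, vec, label):
        """New instance enters the window."""
        for term, weight in vec.items():
            self.sums[label][term] += weight
        self.counts[label] += 1

    def remove(self, vec, label):
        """Outdated instance leaves the window."""
        for term, weight in vec.items():
            self.sums[label][term] -= weight
        self.counts[label] -= 1

    def predict(self, vec):
        # Cosine similarity ignores vector length, so the summed vector can
        # stand in for the centroid (sum divided by count).
        classes = [c for c, n in self.counts.items() if n > 0]
        return max(classes, key=lambda c: cosine(vec, self.sums[c]))


ncc = IncrementalNCC()
ncc.add({"good": 1.0, "great": 1.0}, "pos")
ncc.add({"bad": 1.0, "awful": 1.0}, "neg")
print(ncc.predict({"good": 1.0}))
```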
  • 68. Incremental k-means clustering: converges in only a few iterations (warm start)
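A warm start means that, after the window of documents changes, k-means is restarted from the centroids of the previous run instead of from a fresh initialization. The sketch below uses dense vectors and squared Euclidean distance for brevity; both are assumptions, as are the function names.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's k-means from the given centroids; returns (centroids, assignment, iterations)."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for iteration in range(1, max_iter + 1):
        new_assignment = [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
                          for p in points]
        if new_assignment == assignment:  # nothing moved: converged
            return centroids, assignment, iteration
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment, max_iter


# Cold start on the initial window.
window = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, _, cold_iters = kmeans(window, [(0, 0), (0, 1)])

# The window slides: one point becomes outdated, one new point arrives.
window = [(0, 1), (10, 10), (10, 11), (10, 12)]
centroids, _, warm_iters = kmeans(window, centroids)  # warm start from the old centroids
print(cold_iters, warm_iters)
```

Since a single stream update changes the data only slightly, the previous centroids are usually already close to the new solution.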
  • 69. Other incremental methods
      • Incremental SVM: A. Bordes, S. Ertekin, J. Weston, and L. Bottou (2005): Fast Kernel Classifiers with Online and Active Learning, Journal of Machine Learning Research, vol. 6, pp. 1579–1619
      • Incremental perceptron: www.cs.columbia.edu/~jebara/4771/tutorials/perceptron.pdf
      • Incremental winnow: http://en.wikipedia.org/wiki/Winnow_%28algorithm%29
  • 70. Where are we? [diagram as on slide 50]
  • 71. Document space visualization: from several thousand dimensions to 2D
  • 72. Document space visualization
      [Diagram components: Document corpus, Corpus preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Layout]
  • 73. Document space visualization
  • 74. Document space visualization
      [Diagram components: Document corpus, Online BOW, Corpus preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Layout. Annotations: Maintaining sorted lists, Warm start (three times), Parallelization, Pipelining]
  • 75. Document space visualization: video available at http://first.ijs.si/tutorial/video/ameba.html
  • 76. Twitter
      • Platform for sending short messages (similar to SMS)
      • Est. 225 million users
      • 100 million accounts added in 2010
      • 65 million tweets per day
  • 77. Financial tweets
      • Informal $ sign convention
      • Some examples (March 19):
          – User#1: $AAPL is making an announcement at 9am on what it plans to do with its 97 billion in cash.We expect a dividend announcement
          – User#2: $AAPL over 600.00 a share in the pre-market on news of a dividend.
          – User#3: Will there be any other news besides $AAPL dividend?
      • We acquire ~13,000 tweets per weekday, for ~1,800 NASDAQ/NYSE stocks ($GOOG, $MSFT…)
      • We analyze tweets to determine whether they contain positive or negative vocabulary
  • 78. Sentiment classification
      • Labeled documents
          POS  Financial markets are now officially open :)
          POS  market intelligence GMI Interactive and Mintel Win ARF Great Minds Award for Quality in Research
          POS  $AAPL : trust me -- AAPL will soar tomorrow
          NEG  Oh how I miss the days with GBP was at least 2 times the AUD. Sterling forecast to hit all-time lows soon
          NEG  omg! did you know BORDERS closed?! they went bankrupt last month and closed!! awww, too bad! i love borders!!
          NEG  @aekins that's just too bad ...
      • Learn to classify: Labeled dataset → Training algorithm → Classification model
      • Classify unlabeled documents: Unlabeled dataset → Classification algorithm (using the classification model) → Predictions (labels)
          Example: “So Nickelodeon filed for bankruptcy and announced that the next Kids Choice Awards will be it's last.” → NEG
  • 79. Sentiment classification
      Outline: Emoticons & SVM classifier • Neutral zone • Explanations • Accuracy
      Example tweets with emoticons:
          – Goodnight everyoneeee :) Love yall
          – I have a good feeling about today ;)
          – ooo the ice cream van is here... yaaaaaay :D
          – in the garden in the sun! Just about to fill the pool! happy days! :D
          – Finally got JSON in #processing to work. More playing around coming :)
          – @oanhLove I hate when that happens... :-/
          – No jobs, no money. how in the hell is min wage here 4 f'n clams an hour? :(
          – I hate when I have to call and wake people up :(
          – I don't have any chalk! :-/ MY CHALKBOARD IS USELESS
          – UGHHHHHHHHHHHHHHH.. life is NOT good all the time!!!!!! ;(
  • 80. Sentiment classification (outline: Emoticons & SVM classifier • Neutral zone • Explanations • Accuracy)
      [Figure: positive (+) and negative (–) examples separated by the SVM classifier]
  • 81. Sentiment classification (outline: Emoticons & SVM classifier • Neutral zone • Explanations • Accuracy)
      [Figure: the same examples with a neutral zone; some of the examples are now labeled 0]
  • 82. Sentiment classification (outline: Emoticons & SVM classifier • Neutral zone • Explanations • Accuracy)
      Example: “Sovereign debt and unemployment are big issues in EU.”
      Word groups shown: unemployed, issues, debt, eu | sovereign, big
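The neutral zone and the explanations can both be illustrated with a linear classifier. The slides do not say how either is computed, so this is a sketch under assumptions: the neutral zone is taken to be a symmetric band around the decision boundary, and an explanation is taken to be the per-word contribution to the score. The weights, the margin, and the function names are hypothetical, not values from the tutorial.

```python
def contributions(weights, features):
    """Per-feature contribution to the linear decision score."""
    return {f: weights.get(f, 0.0) * v for f, v in features.items()}


def classify(weights, bias, features, neutral_margin=0.5):
    """Return '+', '-', or '0' (neutral zone) together with the raw score."""
    score = sum(contributions(weights, features).values()) + bias
    if score > neutral_margin:
        return "+", score
    if score < -neutral_margin:
        return "-", score
    return "0", score  # too close to the decision boundary to call


def explain(weights, features):
    """Features sorted from most negative to most positive contribution."""
    return sorted(contributions(weights, features).items(), key=lambda kv: kv[1])


# Hypothetical weights of a trained linear model.
WEIGHTS = {"debt": -0.8, "issues": -0.6, "big": 0.2, "great": 0.9}
tweet = {"debt": 1, "issues": 1, "big": 1}
label, score = classify(WEIGHTS, 0.0, tweet)
print(label, explain(WEIGHTS, tweet))
```

Widening `neutral_margin` moves more borderline tweets into the neutral class, trading coverage for confidence in the remaining positive and negative predictions.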
  • 83. Sentiment classification (outline: Emoticons & SVM classifier • Neutral zone • Explanations • Accuracy)
      Preprocessing options compared (one column each, marked with X per row): replace usernames with a token; replace URLs with a token; remove letter repetition; replace negations with a token; replace exclamation marks with a token; replace question marks with a token.
      [The column positions of the X marks are not recoverable from this transcript; the result columns are listed in the original row order.]
      Accuracy   Precision/recall   Average accuracy (10-fold cross validation)
      81.06%     81.32%/81.32%      76.98%
      80.22%     82.08%/78.02%      77.43%
      79.94%     77.78%/84.62%      77.10%
      79.94%     76.70%/86.81%      77.53%
      79.67%     80.79%/78.57%      76.85%
      78.83%     77.60%/81.87%      77.29%
      78.55%     75.86%/84.62%      76.91%
      78.55%     77.78%/80.77%      76.93%
      78.27%     80.23%/75.82%      76.93%
      78.27%     76.53%/82.42%      77.04%
      77.44%     75.12%/82.97%      76.86%
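The six preprocessing options in the table can be sketched as switchable regular-expression rewrites. The token names (USER, URL, NEGATION, EXCL, QUES), the negation word list, and the rule that collapses three or more repeated letters to two are assumptions made for illustration; the slides name the options but not their implementation.

```python
import re


def preprocess_tweet(text, usernames=True, urls=True, repetition=True,
                     negations=True, exclamations=True, questions=True):
    """Apply the selected normalization steps and return the cleaned tweet."""
    if urls:  # done first so that '?' inside URLs is not touched later
        text = re.sub(r"https?://\S+", " URL ", text)
    if usernames:
        text = re.sub(r"@\w+", " USER ", text)
    if repetition:  # "baaaad" -> "baad"
        text = re.sub(r"([a-z])\1{2,}", r"\1\1", text, flags=re.IGNORECASE)
    if negations:
        text = re.sub(r"\b(?:not|no|never|cannot|\w+n't)\b", " NEGATION ", text,
                      flags=re.IGNORECASE)
    if exclamations:
        text = re.sub(r"!+", " EXCL ", text)
    if questions:
        text = re.sub(r"\?+", " QUES ", text)
    return " ".join(text.split())  # normalize whitespace


print(preprocess_tweet("@aekins that's just too baaaad!!! see http://t.co/x?y=1"))
print(preprocess_tweet("I don't know why???"))
```

Each keyword argument corresponds to one X column of the table, so a row of the experiment can be reproduced by switching the matching options on and leaving the others off.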
  • 84. [Chart legend] Grey: Netflix stock closing price. Blue: the number of positive tweets. Red: the number of negative tweets. Yellow: the difference between the positive and negative tweets. Green dots: relevant events concerning Netflix.
  • 85. Volume peaks likely represent important events: first-quarter earnings release; plans to launch in 43 countries in Latin America and the Caribbean; Netflix loses TV shows and films, Netflix loses the Starz deal.
  • 86. Sentiment cross-over happens before price plunge [figure label: Sentiment cross-over]
  • 87. Presidential elections (http://predsedniskevolitve.si)
  • 88. Recap
      • Basics
          – What is text stream mining?
          – Pipelining, parallelization
          – Web data acquisition
          – Online BOWs
      • Machine learning
          – Batch, incremental, offline, online
          – Incremental nearest centroid classifier
          – Incremental k-means
          – Warm start
      • Applications
          – Online document space visualization
          – Online Twitter sentiment classifier
              • Stock sentiment monitoring
              • Presidential elections

Editor's Notes

  1. Applet at http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html
  2. Vegas77 Entertainment SE. Spam normally sent on weekends, lines drawn at Fridays (exceptions: 28.3. and 28.4.). Price on Monday higher in many cases.
  3. http://www.bbc.co.uk/news/world-us-canada-15051554
  4. Taken from http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-2