Project group knowAAN
   Final presentation

         Adrian Wilke
   info[REMOVE]@adrianwilke.de


 Computer Science Education Group
     University of Paderborn


     October 20th 2011

Overview



    Introduction
    System components & Work flow
    Demonstration
    Development process
    Summary & Outlook
    Time for further detailed questions







Overview: First part



    Goals
    Extraction & Storage (of data)
    Exploration (of data)
    System components & Work flow
    Analysis & Visualization (of data)







Goals

    Explore research networks
    Based on: Artifacts (scientific publications) and metadata
    Combination and analysis of data
    Computation of similarities of full texts
    Support for conference management system Ginkgo
    Data visualization
    Recommendations

              (Source: PG knowAAN project description)
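
One way to read the goal "Computation of similarities of full texts" is as a comparison of term-frequency vectors. The following is a minimal sketch under that assumption, using cosine similarity; the slides do not state which measure the project used, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercase ASCII word tokens in a text (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def text_similarity(text_a: str, text_b: str) -> float:
    """Similarity of two full texts, between 0.0 (no shared terms) and 1.0."""
    return cosine_similarity(term_frequencies(text_a), term_frequencies(text_b))
```

Texts that share no terms score 0.0, and identical texts score 1.0 up to floating-point rounding. A real system would likely add stop-word removal and term weighting on top of this.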





Imagine you are interested in a conference.
You have downloaded the papers from 2 or 3 years.
  Now you have nearly 100 publications.
       How do you explore them?




   100 publications. Do you know any tools for that?



Extraction & Storage




           First step: Extract data and store it.







Exploration




               Second step: Explore data.







Exploring a conference







Exploration




      Which extracted data is available for a publication?
                     → Database schema




Database schema (tables and columns)

 Entity tables
     publication    id GUID, lucuid VARCHAR(512), title VARCHAR(512),
                    booktitle VARCHAR(512), normtitle VARCHAR(512),
                    date VARCHAR(512), editor VARCHAR(512),
                    journal VARCHAR(512), note VARCHAR(512),
                    pages VARCHAR(512), publisher VARCHAR(512),
                    tech VARCHAR(512), volume VARCHAR(512),
                    number VARCHAR(512), rawstring VARCHAR(4096),
                    xmlfile VARCHAR(512), pdffile VARCHAR(512),
                    topicfile VARCHAR(512), created BIGINT, modified BIGINT
     author         id GUID, text VARCHAR(512), normtext VARCHAR(512),
                    firstname VARCHAR(512), lastname VARCHAR(512),
                    created BIGINT, modified BIGINT
     affiliation    id GUID, text VARCHAR(512), location_id GUID
     address        id GUID, text VARCHAR(512), location_id GUID
     location       id GUID, latitude DOUBLE, longitude DOUBLE, text VARCHAR(512)
     discipline     id GUID, text VARCHAR(512), parent_id GUID
     keyword        id GUID, text VARCHAR(512)
     concept        id GUID, text VARCHAR(512)
     category       id GUID, text VARCHAR(512)
     event          id GUID, text VARCHAR(512), filepath VARCHAR(512),
                    predecessor_id GUID, successor_id GUID
     eventseries    id GUID, text VARCHAR(512), filepath VARCHAR(512)

 Link tables
     pub_aut        publication_id GUID, author_id GUID
     pub_aff        publication_id GUID, affiliation_id GUID
     pub_add        publication_id GUID, address_id GUID
     pub_dis        publication_id GUID, discipline_id GUID
     pub_key        publication_id GUID, keyword_id GUID, score DOUBLE, source VARCHAR(512)
     pub_con        publication_id GUID, concept_id GUID, score DOUBLE, source VARCHAR(512)
     pub_cat        publication_id GUID, category_id GUID, score DOUBLE, source VARCHAR(512)
     pub_evt        publication_id GUID, event_id GUID
     aut_aff        author_id GUID, affiliation_id GUID
     aut_add        author_id GUID, address_id GUID
     evt_evs        event_id GUID, eventseries_id GUID
     citation       publication1_id GUID, publication2_id GUID

 Tables shown without columns
     category_count, concept_count, discipline_count, keyword_count,
     evt_pub_aut_count, bib_coupling, co_author, co_citation
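
To illustrate how the link tables connect the entity tables, here is a minimal sketch that rebuilds publication, author, and pub_aut in an in-memory SQLite database and counts shared publications per author pair. Several points are assumptions: GUIDs are stored as TEXT, only two columns per entity table are kept, the helper names and sample rows are hypothetical, and the slides do not say how the project's co_author table was actually populated.

```python
import sqlite3
import uuid

# In-memory database with a reduced version of three tables from the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publication (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE author      (id TEXT PRIMARY KEY, text TEXT);
    CREATE TABLE pub_aut     (publication_id TEXT REFERENCES publication(id),
                              author_id      TEXT REFERENCES author(id));
""")


def add_author(name: str) -> str:
    """Insert an author and return its generated GUID."""
    author_id = str(uuid.uuid4())
    conn.execute("INSERT INTO author VALUES (?, ?)", (author_id, name))
    return author_id


def add_publication(title: str, author_ids: list) -> str:
    """Insert a publication and link it to its authors via pub_aut."""
    pub_id = str(uuid.uuid4())
    conn.execute("INSERT INTO publication VALUES (?, ?)", (pub_id, title))
    conn.executemany("INSERT INTO pub_aut VALUES (?, ?)",
                     [(pub_id, author_id) for author_id in author_ids])
    return pub_id


def co_author_counts() -> dict:
    """Map each unordered pair of author names to their number of joint publications."""
    rows = conn.execute("""
        SELECT a1.text, a2.text, COUNT(*)
        FROM pub_aut p1
        JOIN pub_aut p2 ON p1.publication_id = p2.publication_id
                       AND p1.author_id < p2.author_id
        JOIN author a1 ON a1.id = p1.author_id
        JOIN author a2 ON a2.id = p2.author_id
        GROUP BY a1.id, a1.text, a2.id, a2.text
    """).fetchall()
    return {frozenset((n1, n2)): count for n1, n2, count in rows}


# Hypothetical sample data.
a, b, c = add_author("Author A"), add_author("Author B"), add_author("Author C")
add_publication("Paper 1", [a, b])
add_publication("Paper 2", [a, b, c])
counts = co_author_counts()
```

The self-join on pub_aut with `p1.author_id < p2.author_id` lists each author pair once per shared publication, so the grouped count is the number of joint papers.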



System components & Work flow




           How is our system structured?
                  → Some examples.







Components

 Components (component diagram)
     Backend, Clustering, DB, Roundtrip, PDFToText, Recommendation,
     Parscit, TrendDetection, TF-Component, TopicExtraction, xmlBuilder,
     ParscitTrainer, FrontendReferenceExtraction, DocBrowser, DataBase,
     Solr, FileStorage

 Interfaces between components
     Model, WebServices, JDBC, FileSystem




Sequence diagram

 Participants
     DocumentBrowser, RoundTrip, RoundTripExecutor, PDFToText, Parscit,
     Languagedetection, Lemmatizer, NounExtraction, Solr, DB

 Messages (call → return value)
     a / 1)  .addPDF
     a / 2)  .writeToFS → Path
     a / 3)  .createThread, .submitThread
     b / 1)  .run
     b / 2)  .getText → Text
     b / 3)  .ParseFullText → ParscitXML
     b / 4)  .extractBodyAndAstract → BodyAndAbstract
     b / 5)  .getLanguage → LanguageString
     b / 6)  .lemmatize → LemmatizedText
     b / 7)  .extractNouns → NounsList
     b / 8)  .lemmatizeNounslist → LemmatizedNouns
     b / 9)  .ReduceToTopNouns → TopNouns
     b / 10) .writeToFiles → Paths
     b / 11) .addTexts → Solrid
     b / 12) .addPublication
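
The message sequence can be sketched as a chain of stages that a worker thread runs after a PDF has been added. This is only an illustration under assumptions: every signature and the toy stages are hypothetical stand-ins, Parscit parsing (b / 3, b / 4), the second lemmatization (b / 8), file writing, Solr, and the database are left out, and ranking nouns by frequency is a guess at what .ReduceToTopNouns does.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stages:
    """Stand-ins for diagram participants; all signatures are assumptions."""
    get_text: Callable[[str], str]              # PDFToText: file path -> plain text
    get_language: Callable[[str], str]          # Languagedetection: text -> language code
    lemmatize: Callable[[str, str], str]        # Lemmatizer: (text, language) -> text
    extract_nouns: Callable[[str], List[str]]   # NounExtraction: text -> list of nouns


def reduce_to_top_nouns(nouns: List[str], limit: int) -> List[str]:
    """Keep the most frequent nouns (assumed ranking criterion)."""
    return [noun for noun, _ in Counter(nouns).most_common(limit)]


def run_round_trip(pdf_path: str, stages: Stages, limit: int = 2) -> List[str]:
    """Run the stages in the order of the b / ... messages."""
    text = stages.get_text(pdf_path)                 # b / 2
    language = stages.get_language(text)             # b / 5
    lemmatized = stages.lemmatize(text, language)    # b / 6
    nouns = stages.extract_nouns(lemmatized)         # b / 7
    return reduce_to_top_nouns(nouns, limit)         # b / 9


# Toy stages so the sketch runs without any external tools.
toy = Stages(
    get_text=lambda path: "Graph graph graph paper paper author",
    get_language=lambda text: "en",
    lemmatize=lambda text, language: text.lower(),
    extract_nouns=lambda text: text.split(),
)

# a / 3: hand the work to a worker thread, as .createThread / .submitThread suggest.
with ThreadPoolExecutor(max_workers=1) as pool:
    top_nouns = pool.submit(run_round_trip, "paper.pdf", toy).result()
```

Passing the stages in as callables keeps the chain testable without PDF tools or a running Solr instance.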



Work flow







Analysis & Visualization




           Third step: Analyze and visualize data.







Analysis of authors







Analysis of scientific publications







Demonstration




                            Now: Demo.
           Image: http://www.flickr.com/photos/plaisanter/5525977163/


Development process



Technologies




                            Jersey






Methods of agile software development



    FDD (Feature-Driven Development)
    XP (Extreme Programming)
    Scrum








    Weekly meetings
    Sit together (as much as possible)
    Automated building system
    Continuous integration
    Issue tracking


Summary and Outlook



Summary and future work

 Summary
     Integrated processing of scientific papers
     Aggregated visualization of authors, publications and events
     Computation of various analyses over the data
     Cleaning functionality for automatically processed data

 Future work
     Parallelized Clustering
     Additional graphical visualization
     Improve extraction of metadata from PDF files



Thank you for your attention




                           Questions?

knowAAN final presentation

  • 1. Project group knowAAN Final presentation Adrian Wilke info[REMOVE]@adrianwilke.de Computer Science Education Group University of Paderborn October 20th 2011
  • 2. Overview Overview Introduction System components & Work flow Demonstration Development process Summary & Outlook Time for further questions of detail PG knowAAN 2
  • 3. Overview Overview: First part Goals Extraction & Storage (of data) Exploration (of data) System components & Work flow Analysis & Visualization (of data) PG knowAAN 3
  • 4. Goals Goals Explore research networks Based on: Artifacts (scientific publications) and metadata Combination and analysis of data Computation of similarities of full texts Support for conference management system Ginkgo Data visualization Recommendations (Source: PG knowAAN project description) PG knowAAN 4
  • 5. Goals Imagine you are interested in a conference. You downloaded the papers of 2 or 3 years. Now you have nearly 100 publications. How do you explore them? 100 publications. Do you know tools? PG knowAAN 5
  • 6. Extraction & Storage Extraction & Storage First step: Extract data and store it. PG knowAAN 6
  • 8. Exploration Exploration Second step: Explore data. PG knowAAN 8
  • 10. Exploration Exploration Which extracted data is available for a publication? → Database schema PG knowAAN 10
  • 11. Database schema. Tables and columns (each table box ends with an "Indexes" section header):
      publication: id GUID, lucuid VARCHAR(512), title VARCHAR(512), booktitle VARCHAR(512), normtitle VARCHAR(512), date VARCHAR(512), editor VARCHAR(512), journal VARCHAR(512), note VARCHAR(512), pages VARCHAR(512), publisher VARCHAR(512), tech VARCHAR(512), volume VARCHAR(512), number VARCHAR(512), rawstring VARCHAR(4096), xmlfile VARCHAR(512), pdffile VARCHAR(512), topicfile VARCHAR(512), created BIGINT, modified BIGINT
      author: id GUID, text VARCHAR(512), normtext VARCHAR(512), firstname VARCHAR(512), lastname VARCHAR(512), created BIGINT, modified BIGINT
      affiliation: id GUID, text VARCHAR(512), location_id GUID
      address: id GUID, text VARCHAR(512), location_id GUID
      location: id GUID, latitude DOUBLE, longitude DOUBLE, text VARCHAR(512)
      discipline: id GUID, text VARCHAR(512), parent_id GUID
      keyword: id GUID, text VARCHAR(512)
      concept: id GUID, text VARCHAR(512)
      category: id GUID, text VARCHAR(512)
      event: id GUID, text VARCHAR(512), filepath VARCHAR(512), predecessor_id GUID, successor_id GUID
      eventseries: id GUID, text VARCHAR(512), filepath VARCHAR(512)
      citation: publication1_id GUID, publication2_id GUID
      pub_aut: publication_id GUID, author_id GUID
      pub_aff: publication_id GUID, affiliation_id GUID
      pub_add: publication_id GUID, address_id GUID
      pub_dis: publication_id GUID, discipline_id GUID
      pub_evt: publication_id GUID, event_id GUID
      pub_key: publication_id GUID, keyword_id GUID, score DOUBLE, source VARCHAR(512)
      pub_con: publication_id GUID, concept_id GUID, score DOUBLE, source VARCHAR(512)
      pub_cat: publication_id GUID, category_id GUID, score DOUBLE, source VARCHAR(512)
      aut_aff: author_id GUID, affiliation_id GUID
      aut_add: author_id GUID, address_id GUID
      evt_evs: event_id GUID, eventseries_id GUID
      Shown without columns: category_count, discipline_count, concept_count, keyword_count, evt_pub_aut_count, bib_coupling, co_author, co_citation
  • 12. System components & Work flow: How is our system structured? → Some examples.
  • 13. System components & Work flow: Components. Labels in the component diagram: Model, Backend, ParscitTrainer, Parscit, Clustering, WebServices, Frontend, ReferenceExtraction, DB, TrendDetection, DocBrowser, Roundtrip, TF-Component, JDBC, PDFToText, TopicExtraction, DataBase, Recommendation, xmlBuilder, Solr, FileSystem, FileStorage
  • 14. Sequence diagram. Participants: DocumentBrowser, RoundTrip, RoundTripExecutor, PDFToText, Parscit, Languagedetection, Lemmatizer, NounExtraction, Solr, DB. Messages (call → return value):
      a/1) .addPDF
      a/2) .writeToFS → Path
      a/3) .createThread, .submitThread
      b/1) .run
      b/2) .getText → Text
      b/3) .ParseFullText → ParscitXML
      b/4) .extractBodyAndAstract → BodyAndAbstract
      b/5) .getLanguage → LanguageString
      b/6) .lemmatize → LemmatizedText
      b/7) .extractNouns → NounsList
      b/8) .lemmatizeNounslist → LemmatizedNouns
      b/9) .ReduceToTopNouns → TopNouns
      b/10) .writeToFiles → Paths
      b/11) .addTexts → Solrid
      b/12) .addPublication
  • 15. System components & Work flow: Work flow
  • 16. Analysis & Visualization: Third step: Analyze and visualize data.
  • 17. Analysis & Visualization: Analysis of authors
  • 18. Analysis & Visualization: Analysis of scientific publications
  • 19. Demonstration: Now: Demo. Image: http://www.flickr.com/photos/plaisanter/5525977163/
  • 20. Development process: Technologies: Jersey
  • 21. Development process: Methods of agile software development: FDD, XP, Scrum
  • 22. Development process: Methods of agile software development: Weekly meetings; Sit together (as much as possible); Automated building system; Continuous integration; Issue tracking
  • 23. Summary and Outlook: Summary and future work. Summary: Integrated processing of scientific papers; Aggregated visualization of authors, publications and events; Computation of various analyses over the data; Cleaning functionality for automatically processed data. Future work: Parallelized clustering; Additional graphical visualization; Improved extraction of metadata from PDF files
  • 24. Summary and Outlook: Thank you for your attention. Questions?
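
As a rough illustration of how the database schema on slide 11 could be queried, the following Python sketch builds a three-table subset (publication, author, pub_aut) in an in-memory SQLite database and counts shared publications per author pair. The slides name neither the database system nor the way co_author is computed, so the TEXT storage for GUIDs, the reduced column set, the sample rows, and the query itself are assumptions made for illustration only.

```python
import sqlite3

# Assumption: GUID columns are stored as TEXT; only a few columns from the
# slide are kept so the example stays short.
SCHEMA = """
CREATE TABLE publication (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE author (id TEXT PRIMARY KEY, lastname TEXT);
CREATE TABLE pub_aut (
    publication_id TEXT REFERENCES publication(id),
    author_id TEXT REFERENCES author(id)
);
"""

# Hypothetical co-author query: self-join pub_aut on the publication and keep
# each unordered author pair once (a.author_id < b.author_id).
CO_AUTHOR_SQL = """
SELECT a.author_id, b.author_id, COUNT(*) AS shared
FROM pub_aut AS a
JOIN pub_aut AS b
  ON a.publication_id = b.publication_id AND a.author_id < b.author_id
GROUP BY a.author_id, b.author_id
ORDER BY a.author_id, b.author_id
"""


def co_authors(conn):
    """Return (author_id, author_id, shared publication count) rows."""
    return conn.execute(CO_AUTHOR_SQL).fetchall()


def demo():
    """Load made-up sample rows and run the co-author query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO publication VALUES (?, ?)",
        [("p1", "Paper one"), ("p2", "Paper two")],
    )
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [("a1", "Ada"), ("a2", "Bob"), ("a3", "Cy")],
    )
    conn.executemany(
        "INSERT INTO pub_aut VALUES (?, ?)",
        [("p1", "a1"), ("p1", "a2"), ("p2", "a1"), ("p2", "a2"), ("p2", "a3")],
    )
    return co_authors(conn)
```

Calling `demo()` returns one row per author pair together with the number of publications they share. A similar self-join over the citation table could serve as a starting point for bibliographic coupling or co-citation counts.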
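
The round trip on slide 14 has two phases: phase a accepts a PDF and submits a background task, and phase b runs the text-processing steps in that task. The Python sketch below mirrors only this shape. Every function body is a simplified placeholder, not the project's implementation: the real system calls PDFToText, Parscit, a language detector, a lemmatizer and a noun extractor, none of which are reproduced here, and the stopword list and sample input are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stopword list standing in for real noun extraction.
STOPWORDS = {"of", "cite", "grow"}


def get_text(pdf_bytes):
    # Placeholder for b/2 (.getText): the "PDF" here is already UTF-8 text.
    return pdf_bytes.decode("utf-8")


def lemmatize(text):
    # Placeholder for b/6 (.lemmatize): lowercase, strip punctuation, and
    # drop a trailing plural "s" from words longer than three characters.
    words = [w.strip(".,;:").lower() for w in text.split()]
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]


def extract_nouns(words):
    # Placeholder for b/7 (.extractNouns): keep everything not in STOPWORDS.
    return [w for w in words if w and w not in STOPWORDS]


def reduce_to_top_nouns(nouns, n):
    # Sketch of b/9 (.ReduceToTopNouns): keep the n most frequent nouns.
    return [word for word, _ in Counter(nouns).most_common(n)]


def run(pdf_bytes, n=2):
    # Phase b: the steps executed inside the background task (b/1 .run).
    text = get_text(pdf_bytes)
    words = lemmatize(text)
    nouns = extract_nouns(words)
    return reduce_to_top_nouns(nouns, n)


def process_in_background(pdf_bytes, n=2):
    # Phase a: submit the task (a/3) and wait for its result.
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(run, pdf_bytes, n)
        return future.result()
```

For example, `process_in_background(b"Networks of authors. Authors cite authors; networks grow.")` yields the two most frequent remaining words. The steps that write files, index texts in Solr, and store the publication in the database (b/10 to b/12) are left out because the slides give no detail on them.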