Using Lucene/Solr to Build CiteSeerX and Friends

Dr. C. Lee Giles
Information Sciences and Technology
Computer Science and Engineering
The Pennsylvania State University
University Park, PA, USA
giles@ist.psu.edu
http://clgiles.ist.psu.edu

Prof. C. Lee Giles
•  Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
     –  Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
     –  Large heterogeneous data and information systems
     –  Specialty search engines and portals for knowledge integration
           •  CiteSeerX (computer and information science)
           •  ChemXSeer (e-chemistry portal)
           •  GrantSeer (grant search)
           •  RefSeer (recommendation of paper references)
•  Scalable intelligent tools/agents/methods/algorithms
     –  Information, knowledge and data integration
     –  Information and metadata extraction; entity disambiguation
     –  Unique search, knowledge discovery, information integration, data mining algorithms
     –  Web 2.0 methods
           •  Automated tagging for search and information retrieval
           •  Social network analysis
  
SeerSuite Contributors/Collaborators: recent past and present (incomplete list)
Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …
•  P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, others.
•  Current funding: NSF, Dow Chemical
  
Outline
•  Motivation
    –  Data science; Cyberinfrastructure
    –  Vast growth in domain science data and documents
•  SeerSuite
    –  Tool for creating Seers
    –  Specialized data and document search and recommendations
          •  Tables, formulae, figures, references …
    –  Use of Solr/Lucene
•  Disciplinary sciences, indexes & information extraction (the Seers)
    –  Computer science
    –  Chemistry
    –  Briefly other Seers
•  Opportunities for Research
•  Conclusions and Directions
  
The Evolution of Science - the 4th Paradigm
Jim Gray’s paradigm
•  Observational Science
     –  Scientist gathers data by direct observation
     –  Scientist analyzes data
•  Analytical Science
     –  Scientist builds analytical model
     –  Makes predictions.
•  Computational Science
     –  Simulate analytical model
     –  Validate model and make predictions
•  Data Driven Science
     –  Data captured from the web, by instruments, or from documents
     –  Data generated by simulation
     –  Placed in data structures / files
     –  Scientist(s) analyze(s) data
     –  Access & search crucial
  
Data Access Varies with Discipline or Small vs Big Science
•  Small vs Big science
       –  “Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”
              •  ‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)
       –  Data is local
       –  Data will not be shared
•  At some point there will be a need for
       –  indices to control search
       –  parallel data search and analysis
•  Cyberinfrastructure can help
       –  If you can’t move the data around, take the analysis to the data!
       –  Bandwidth of a van loaded with disks
       –  Do all data manipulations locally
              •  Build custom procedures and functions locally
  
SeerSuite
•  Open source search engine and digital library tool kit used to build search engines and digital libraries
     –  CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
•  Supports research in
     –  Indexing and search
     –  Digital libraries
     –  Data mining & structures
     –  Information and knowledge extraction
     –  Social networks
     –  Scientometrics/infometrics
     –  Systems engineering, User design
     –  Software engineering and management
     –  Web crawling
•  Trains students in search and software systems
     –  Educational tool for search engine creation
     –  Students highly sought in industry and government
  
SeerSuite - properties
•  Modular, scalable, extensible, robust design
      –  Extensible to many problems and disciplines
•  Integrated features
      –  Focused crawler - Heritrix
      –  Indexer - Solr/Lucene
      –  Metadata extraction - modular
      –  Ranked results
•  Builds on experience with other domain engines and OS tools
      –  Lucene and Solr
      –  The MySQL Database and InnoDB Storage Engine
      –  Apache Tomcat
      –  Spring Framework
      –  Acegi Security
      –  ActiveMQ
      –  ActiveBPEL Open Source Engine
      –  Apache Commons Libraries
      –  SVMlight support vector machine package
      –  CRF++ conditional random field package
•  Hardware independent; Linux
•  Reuse not reinvent
  
Data Mining & Information Extraction in Seers
•  Data acquisition
    •  SeerSuite systems often crawl the public web for new data
    •  Many data types available
•  Richness of data offers unique data mining features
     •  CiteSeerX as testbed/sandbox
        •  Large scale data resources
              •  Millions of documents, authors, etc.
              •  Some common features/metadata
        •  Commercial grade indexer (Solr/Lucene)
              •  Scalable to G’s of documents and M’s of users
              •  “Watson”
        •  Modular design
        •  Cloudable
•  State of the art algorithms (machine learning) for large scale
unique metadata (information) extraction & mining
    •  Unique parsers and indexing
    •  Quality of extraction
    •  Precision/recall
    •  Ranking
    •  Architecture/integration
Seer Friends
•  In various stages of the system lifecycle with various data resources and indexes:
     –  Mature and developing, code released
           •  CiteSeer, now CiteSeerX
           •  ChemXSeer
           •  TableSeer
           •  YouSeer
     –  New, future TBD, not all aspects public
           •  ArchSeer
           •  AlgoSeer
           •  CollabSeer
           •  RefSeer
           •  SeerSeer
           •  GrantSeer
     –  Dead or limping by (could be revived)
           •  AckSeer (acknowledgement indexing) (revived!)
           •  BizSeer
           •  BotSeer
     –  Proposed, but do not exist
           •  BrainSeer
           •  CensorSeer
           •  ArXivSeer
  
Why Solr/Lucene?
•  Only open source considered - cost
•  Competitors:
    –  Indri
    –  Wumpus
    –  Terrier
    –  Others?
•  Must scale for both number of documents and users
•  Easily integrable and customizable
    –  Other indexes, crawlers, ingestion, metadata extractors
•  Well used (Watson)
•  Active community of support
    –  Enterprise platform a plus
•  Easy to transition to government/industry/academia
    –  Apache license
  
Next Generation CiteSeer, CiteSeerX

•  2 M documents
•  40 M citations
•  2 to 5 M authors
•  2 to 4 M hits per day
•  800K individual users
•  entire data shared
•  Index - 50 G
http://citeseerx.ist.psu.edu

History: CiteSeer (aka ResearchIndex)
       Project at NEC Research Institute, Princeton
             1st academic document search engine
             Very popular with computer science
       Hosted at NEC from 1997 - 2004.
             Moved to Penn State as collaborators left.
       Provided a broad range of unique services including
             Automatic citation indexing, reference linking, full text indexing, similar documents listing, automated metadata extraction and several other pioneering features.
       Refactored and redesigned as CiteSeerX
             Released 2008
             Lucene based indexing
       CiteSeer continuously running for 15 years!
       (Photos: C. Lee Giles, Kurt Bollacker, Steve Lawrence)

SeerSuite/CiteSeerX Architecture
                         •  Web Application
                         •  Focused Crawler
                         •  Document Conversion and
                            Extraction
                         •  Document Ingestion
                         •  Data Storage
                         •  Maintenance Services
                         •  Federated Services

 Teregowda, USENIX ‘10
4 systems:

•  Production
•  Crawling
•  Staging
•  Research

All or some can be cloudized
                Teregowda, USENIX 2010

CiteSeerX Services
    CiteSeerX is a very automated system:
          Full OAI metadata if available
          Full text Indexing (many different indexes)
                 Documents
                 Citations
                 Tables
                 More forthcoming (Algorithms, Figures, Acknowledgements).
          Citation Graph.
                 Ranking based on citations.
                 Linking documents
                   -    Co-citations
                   -    Citing documents
          Author Disambiguation
                 Distinguish between authors with similar names.
                 Profiles and publication information for author.
          Automatic crawling from list and submissions
          Personalization
             -    Login based access to features on CiteSeerX.
             -    Corrections to metadata.
             -    Storage of queries.
             -    Collection of papers
             -    Follows document metadata changes.
  
Focused Crawling
•  Maintain a list of parent URLs where documents were previously found
      –  Parent URLs are usually academic homepages.
             •  300,000 unique parent URLs, as of summer 2011
      –  Parent URLs are stored in a database table with two additional fields for scheduling:
             •  Last time changed, get new documents from the page.
             •  Estimated change rate according to previous crawls of this page.
•  The crawling process starts with the scheduler selecting 1000 parent URLs which have the highest probability of having new documents available.
      –  Assume Poisson process for the change behavior of a parent page.
             •  Suppose a parent page P’s last observed change occurred at time t1, and its estimated change rate is R, then at time t2 (t2 = t1 + Δ), the probability that it has changed again since t1 is 1 - exp(-R*Δ)
             •  Larger R or larger Δ will give larger probability.
             •  After each crawl, the change rate of the scheduled parent URL should be recalculated.
•  Crawling runs incrementally daily (invoked by a Linux cron job at 12 am)
      –  Most discovered documents have been crawled before.
             •  Use hash table comparison for detection of new documents
             •  Normally retrieve a few thousand NEW documents per day, sometimes less than 1k.
•  Moved to whitelist vs blacklist
                                                                                                     Zheng, CIKM’09
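
The scheduling rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CiteSeerX crawler code: the function names, the tuple layout (url, last_change_time, rate), the example URLs and the use of days as the time unit are all hypothetical. How the change rate is re-estimated after each crawl is not given on the slide, so it is left out.

```python
import heapq
import math


def change_probability(rate, last_change_time, now):
    """Poisson model from the slide: P(changed since t1) = 1 - exp(-R * delta)."""
    delta = max(0.0, now - last_change_time)
    return 1.0 - math.exp(-rate * delta)


def select_parent_urls(parents, now, batch_size=1000):
    """Pick the batch_size parent URLs most likely to have changed.

    parents is an iterable of (url, last_change_time, rate) tuples;
    this layout is an assumption made for the sketch.
    """
    return heapq.nlargest(
        batch_size,
        parents,
        key=lambda p: change_probability(p[2], p[1], now),
    )


# Hypothetical parent pages: times in days, rates in changes per day.
parents = [
    ("http://example.edu/~a/", 0.0, 0.10),
    ("http://example.edu/~b/", 5.0, 0.10),
    ("http://example.edu/~c/", 0.0, 0.50),
]
schedule = [url for url, _, _ in select_parent_urls(parents, now=10.0, batch_size=2)]
```

With these made-up numbers, page c (high rate) and page a (long time since its last change) are scheduled ahead of page b, matching the slide's remark that a larger R or a larger Δ gives a larger probability.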
Documents from crawled URLs
•  90% of all citations come from the first 550 sites
•  90% of all documents come from the first 1250 sites
How will we get metadata for fields?
     Now... that should clear up a few things around here
Metadata Extraction
•  Documents are converted from PDF/PS to text using converters.
     –  Converters include TET, PDFBox, pdftotext, gs.
•  Documents are filtered by checking for the existence of references and for duplication (checksum).
•  Use tools or build your own
     –  Metadata extraction system uses machine learning methods like SVM (Header Parser), CRF (ParsCit) to extract various entities from the document.
•  Rule based templates are applied before extraction.
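
The filtering step (keep only documents that contain references and are not duplicates) might look roughly like the sketch below. It is an assumption-laden illustration: the slide only says "checksum" and "existence of references", so the choice of SHA-1, the heading regular expression and the function names are hypothetical.

```python
import hashlib
import re

# Assumed heuristic: a line consisting only of "References" or "Bibliography".
REFERENCE_HEADING = re.compile(
    r"^\s*(references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE
)


def checksum(text):
    """Checksum of the converted text (SHA-1 is an assumption, not from the slide)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def filter_documents(texts):
    """Keep documents that have a reference section and were not seen before."""
    seen = set()
    kept = []
    for text in texts:
        if not REFERENCE_HEADING.search(text):
            continue  # probably not an academic paper
        digest = checksum(text)
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        kept.append(text)
    return kept


paper = "A Title\nSome body text.\nReferences\n[1] A. Author. A paper."
notes = "Meeting notes without any citation list."
kept = filter_documents([paper, notes, paper])
```

A checksum only catches byte-identical text; near-duplicates (for example, the same paper converted by two different tools) would need a fuzzier comparison.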
  
Automatically Created DB of paper in CSX
“Tensor Decompositions and Applications”, SIAM REVIEW, 2009, pp 455-500
Abstract: This ….
Cited 34 times, 6 times by Author

     Field            Value
     id               10.1.1.130.782
     title            Tensor Decompositions and Applications
     abstract         This ..
     year             2009
     pages            455-500
     publisher        SIAM
     venue            SIAM REVIEW
     venueType        JOURNAL
     version          2
     cluster          9248987
     repositoryID     10
     crawldate        12/30/2008
     n-cites          34
     selfCites        6
     public           True

Assigned by (diagram legend): System; Extractor/User/Inference; Inference/User
3 Tier Architecture
[Diagram. Components shown: User Request; Load Balancer (two); Web Application (Web 1, Web 2); Queries; Requests; Index; Index - Tables; Repository; Database; Storage; Crawler; Extraction; Ingestion]
CiteSeerX Software Overview
•    Ingestion Process: Responsible for obtaining and preparing a document and the related metadata.
        –       Process the document
                      •       Submitted by the user or Crawler
        –       Extract Metadata
                      •       Header
                      •       Citations
                      •       Acknowledgements
        –       Store the metadata and documents.
•    Citation Matching
        –       Identifying the underlying graph structure - documents citing this document and the relationship between documents and citations
                      •    Inference matching and graph generation
        –       User Corrections (Version Maintenance)
        –       Determine and accept valid user corrections
        –       Regular Notification Mechanisms
        –       Ensure that the user is notified when new documents are added to the collection
                      •       Linked to MyCiteSeer.
•    Update and Maintenance
        –       Update and make valid the full text index and various statistics.
        –       Statistics
        –       Index updates
  
CiteSeerX Search
                  Enabling Search
                        Fulltext
                        Fields created
                          -    Title
                          -    Authors
                          -    Citations
                          -    Venue
                          -    Keywords
                          -    Abstract
                          -    Range (Publication)
                          -    Citations
  
Field Schema
     Field                    Type                   Indexed/Stored
     DOI                      String                 Y/Y - Unique
     Citation/Document        String                 Y/Y
     Title                    Text                   Y/Y
     Author                   A Text *               Y/Y
     Authors Normalized       A Text *               Y/N
     ncites (# cited by)      Integer                Y/Y
     URL                      String                 Y/Y
     cites                    Tokens ^               Y/N
     citedby                  Tokens ^               Y/N
     Timestamp                Date                   Y/Y

* - A Text is a Text field which does not have a stopword filter or stemming
^ - Tokens are a Text field with only duplicate removal and whitespace tokenizer
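
To make the two footnoted field types concrete, here is a small Python imitation of what such analysis chains do to a value. It is not Solr code and not the CiteSeerX schema: the function names are hypothetical, and the whitespace splitting and lowercasing in the "A Text" case are assumptions, since the footnote only says that stopword filtering and stemming are absent.

```python
def a_text_analyze(value):
    """'A Text': tokenized, but with no stopword filter and no stemming.

    Lowercasing and whitespace splitting are assumptions of this sketch.
    """
    return value.lower().split()


def tokens_analyze(value):
    """'Tokens': whitespace tokenizer followed by duplicate removal."""
    seen = set()
    unique = []
    for token in value.split():
        if token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


author_terms = a_text_analyze("The Who and C. Lee Giles")
cites_terms = tokens_analyze("10.1.1.1.1 10.1.1.2.2 10.1.1.1.1")
```

Skipping the stopword filter keeps words such as "the" and "and" in author fields, which matters for names; the Tokens type suits the cites and citedby fields, which hold lists of document identifiers rather than prose.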
CiteSeerX Search Results
                          Results Sorting
                                Relevance (default)
                                   -    Based on dismax query handling with boosting.
                                Citations
                                   -    Citations received by the document in collection plus default
                                Year
                                   -    Publication date.
                                Recency
                                   -    Date of acquisition.
  
CiteSeerX Citation Graph
     Relationships
           Citation graph
             -    Store Cited by and Cites in index
           Build
             -    Build document graph by querying index for relationship.
[Diagram: documents A, B, C, D, E connected by "Cites" and "Cited by" edges]
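
The idea of storing both directions of the citation relationship and then answering graph questions by lookup can be sketched with plain dictionaries standing in for the index. The edge list below is hypothetical (the slide's diagram only shows five documents A to E), and the function names are illustrative rather than part of SeerSuite.

```python
from collections import defaultdict


def build_relationship_fields(citations):
    """Build the two per-document relationship fields from (citing, cited) pairs."""
    cites = defaultdict(set)     # document -> documents it cites
    citedby = defaultdict(set)   # document -> documents that cite it
    for citing, cited in citations:
        cites[citing].add(cited)
        citedby[cited].add(citing)
    return cites, citedby


def co_cited_with(doc_id, cites, citedby):
    """Documents that share at least one citing document with doc_id."""
    related = set()
    for citing in citedby.get(doc_id, ()):
        related |= cites[citing]
    related.discard(doc_id)
    return related


# Hypothetical edges: B and E cite A; A cites C and D; B also cites C.
edges = [("B", "A"), ("E", "A"), ("A", "C"), ("A", "D"), ("B", "C")]
cites, citedby = build_relationship_fields(edges)
co_cited = co_cited_with("A", cites, citedby)
```

Keeping both fields means "who cites A" and "what does A cite" are each a single lookup, at the cost of updating two entries for every new citation.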
  
Adding documents
    Ingest documents for new crawls
      -  Add metadata to collection
      -  Add full text to system
      -  Link metadata in collection
    Run maintenance scripts
      -  Poll updates and post to Solr.
              Fulltext
              Metadata
              Relationships
    Challenge: Maintain data freshness.
  
Query Response
•    Query forwarded to Solr from the presentation layer (JSP)
•    Solr generates ranked response in JSON
•    Build each record in xml with the database (Add fields: Abstract)
•    Presentation layer (JSP) formats records based on ranking.
[Diagram: Web, Web Interface, Database, Index]
  
Ranking with Boosting (Relevance)
    Use of Boost Function, Minimum Match, Query Fields
         Boost Function - the effect of citations
           -    Map number of citations > 1 to 500
         Minimum Match - 2
         Query Fields
           -    Text (1)
           -    Title (4)
           -    Abstract (2)
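
As a rough illustration, the settings on this slide correspond to standard Solr DisMax request parameters (`defType`, `qf`, `mm`, `bf`). The sketch below only assembles a query string; the lowercase field names `text`, `title` and `abstract` are assumptions based on the slide's labels, and the exact boost function used by CiteSeerX is not given, so `bf` is left as an optional argument rather than guessed.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query, boost_function=None):
    """Assemble DisMax parameters matching the weights shown on the slide."""
    params = {
        "q": user_query,
        "defType": "dismax",
        # Query fields with boosts: Text (1), Title (4), Abstract (2).
        "qf": "text^1 title^4 abstract^2",
        # Minimum Match - 2.
        "mm": "2",
        "wt": "json",
    }
    if boost_function is not None:
        # Citation-based boost; the concrete function is not stated on the slide.
        params["bf"] = boost_function
    return params


params = build_dismax_params("tensor decomposition")
query_string = urlencode(params)
```

The resulting `query_string` would be appended to a Solr select URL. Weighting the title four times as heavily as the body text means a query term in the title counts for more than the same term buried in the full text.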
  
Query Response
     Query at Interface (JSP)
     Hand over to Web application (Java/Spring)
     Hand over to Solr
     Ranked response from Solr (JSON)
     Response unwrapped and more details included with information from DB
     Present response at Interface (JSP)
[Diagram labels: Web Interface, Web Application, Index, DB, Q, R, F, Text, HashMap, JSON]
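
The "unwrap and enrich" step can be sketched as follows. The real system does this in Java/Spring with a HashMap; this Python version is only an analogy. It assumes the usual Solr JSON layout (`response` containing `docs`), a lowercase `doi` key as the unique field, and a plain dictionary in place of the database; the second document identifier is made up for the example.

```python
import json


def build_records(solr_json, database):
    """Unwrap a Solr JSON response and add database fields, keeping Solr's rank order."""
    payload = json.loads(solr_json)
    records = []
    for doc in payload["response"]["docs"]:
        row = database.get(doc["doi"], {})
        record = dict(doc)
        # The abstract comes from the database, not from the index.
        record["abstract"] = row.get("abstract", "")
        records.append(record)
    return records


solr_json = json.dumps({"response": {"numFound": 2, "start": 0, "docs": [
    {"doi": "10.1.1.130.782", "title": "Tensor Decompositions and Applications"},
    {"doi": "10.1.1.0.0", "title": "A hypothetical second hit"},
]}})
database = {"10.1.1.130.782": {"abstract": "This ..."}}
records = build_records(solr_json, database)
```

Fetching display-only fields from the database keeps the index smaller, at the price of one extra lookup per displayed result.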
  
Name Disambiguation
•  Name disambiguation (NER)
      –  A person can be referred to in different ways with different attributes in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together
•  Three types of name ambiguities:
      –  Aliases - one person with multiple aliases, name variations, or name changed
              e.g. CL Giles & Lee Giles, Superman & Clark Kent
      –  Common Names - more than one person shares a common name,
              e.g. Jian Huang - 103 papers in DBLP
      –  Typography Errors - resulting from human input or automatic extraction
•  Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.
  
Efficient Large Scale Entity Disambiguation
Testbed: CiteSeerX and PubMedSeer
Huang, et.al PKDD 2006; Treeratpituk, et.al JCDL 2009
•    Entity disambiguation problem
      –  Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, information from crawling such as host server, etc.
      –  Entity normalization
•    Motivation
      –  Enhance search functionalities for digital repositories
               •     Fielded search by author name
      –  Improve metadata quality
      –  Improved social network analysis
      –  Government and business intelligence
               •     E.g. census data and credit records
•    Challenges
      –  Accuracy
      –  Scalability
      –  Expandability
•    Key features
      –  LASVM distance function
               •    Active learning
                      –  Simpler and more accurate model
                      –  Better generalization power
               •    Online learning
                      –  Expandable to new training data
      –  DBSCAN clustering
               •    Ameliorate labeling inconsistency (transitivity problem)
               •    Efficient solution to find name clusters
               •    N logN scaling
[Diagram: documents; Metadata Extraction Module; Blocking Module; Annotator; Online SVM with Active Learning; Distance Learner; Jaccard Similarity; Soft-TFIDF Similarity; Similarity Function; SVM Distance Function; DBSCAN Clustering Module; Candidate Class (Author 1, Paper 3; Author 2, Paper 4); Actors, entities]
  
Author Disambiguation Field
•  Currently uses author fields
    –  For author search (both for author mentions and for disambiguated authors)

•  Future direction
    –  Use Lucene index for blocking in author disambiguation: creating a candidate set of author mentions that could belong to the same cluster
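The blocking idea above can be illustrated with a minimal sketch. It uses a plain Python dictionary as a stand-in for the Lucene index, and the blocking key (last name plus first initial) is an assumption made for illustration; the slides do not say which key is used.

```python
from collections import defaultdict

def blocking_key(name):
    """Hypothetical blocking key: lower-cased last name plus first initial."""
    parts = name.replace(".", " ").split()
    return f"{parts[-1].lower()}_{parts[0][0].lower()}"

def build_blocks(mentions):
    """Group (paper_id, author_name) mentions into candidate sets by key."""
    blocks = defaultdict(list)
    for paper_id, name in mentions:
        blocks[blocking_key(name)].append((paper_id, name))
    return dict(blocks)

# Illustrative mentions only; not data from the slides.
mentions = [(1, "C. Lee Giles"), (2, "Clyde L. Giles"), (3, "Lee Giles"), (4, "B. Sun")]
for key, block in build_blocks(mentions).items():
    print(key, block)
```

Only mentions inside the same block would then be compared pairwise. Note that "Lee Giles" lands in a different block from "C. Lee Giles" under this simple key, which is the kind of recall loss a real blocking scheme has to guard against.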
  
Author Disambiguation
•    Random Forest (RF)
       –     Uses random feature selection plus bootstrap sampling to construct multiple decision trees from one training data set
       –     Aggregates the votes of a collection of decision trees as the final decision
       –     The more independent each tree is, the better the improvement over a single decision tree
•    Author disambiguation with Random Forest
       –     Various metadata are used as features in the Random Forest to determine whether two author names from two papers refer to the same person
                •  E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc.
       –     Multiple distance functions are used for each type of metadata
                •  E.g. TF-IDF and Jaccard distance for comparing affiliations
•    Compared with the previous SVM-based approach
       –     Shown to provide higher accuracy than SVM in the pairwise author disambiguation task
       –     Easy parameterization in the training phase (only the number of trees and the randomness at each node; no kernel function to choose), and performance is not sensitive to the parameters chosen
       –     Provides a measure of the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for an SVM with a non-linear kernel
       –     Training time and classification time are linear in the number of trees and the data size
•    Also provides higher disambiguation accuracy than other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree)
  


                                 Treeratpituk, Giles, JCDL09
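As a rough illustration of the pairwise features mentioned on this slide, the sketch below computes a few similarity values for two author mentions. The record layout (`coauthors`, `affiliation`, `year`) and the choice of features are assumptions for illustration, not the feature set of the cited work. It computes Jaccard similarity; the Jaccard distance named on the slide is one minus this value.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def tokens(text):
    """Lower-cased whitespace tokens, with commas treated as separators."""
    return text.lower().replace(",", " ").split()

def pair_features(m1, m2):
    """Feature dictionary for one pair of author mentions (hypothetical layout)."""
    return {
        "coauthor_jaccard": jaccard(m1["coauthors"], m2["coauthors"]),
        "affiliation_jaccard": jaccard(tokens(m1["affiliation"]), tokens(m2["affiliation"])),
        "year_gap": abs(m1["year"] - m2["year"]),
    }

# Illustrative records only.
m1 = {"coauthors": ["a", "b"], "affiliation": "Penn State University", "year": 2007}
m2 = {"coauthors": ["b", "c"], "affiliation": "Pennsylvania State University", "year": 2009}
print(pair_features(m1, m2))
```

A vector like this, one per candidate pair, is what a pairwise classifier such as a random forest would be trained on.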
Data and Publications in the Field of Chemistry
Chemistry
   • not physics - no arXiv – or computer science - no CiteSeer
          • Legacy of early information access - Chem Abstracts
     • Cheminformatics is not bioinformatics

Until recently, chemistry has been a data-poor field
     Data sharing tradition just being established
     Data creation is exploding - local (small science)

Journals and societies sensitive to their IP issues dominate the field
     Unsubstantiated IP claims such as data in the paper belongs to the publisher
     Discourage online versions of publications - ACS

Large powerful international companies have a vested interest in research
     Chemical information extraction tools are easily monetized
     Standards exist - CML, InChI

“Fixing the past so we can fix the future.” Jeremy Frey
 Chemistry is an old discipline with publications going back 100 years

Chemistry is compound centric, not algorithmic centric
     Search is about the compound!
     Compounds have a rich data environment
          3D graph structure, energies, etc.
ChemXSeer Architecture
Integrate and implement well-used open source tools
     Use CiteSeerX tools when possible
     Integrate into SeerSuite
     Search
          Chemical formulae unique search
           Table search
           Figure search
           More data (grey literature) than documents

•  Automated information extraction modules based on machine learning methods
•  Lucene/Solr indices for extracted fields
•  Relational databases for datasets

Work closely with chemists to understand their needs
    Tools for data conversion

Provide a public portal and repository for easy use
     User access controls

Integrated visualization tools like JMOL for Gaussian data residing in
our repository

APIs for users to access extracted data

De facto data and document standards: XML, PDF, etc.
chemxseer.ist.psu.edu
ChemXSeer Formula Search


• Extraction and search of chemical formulae in scientific
documents has been shown to be very useful.

• Intersection of two research areas:
     • Information retrieval
     • Chemoinformatics

•  Formulae cannot be treated as text.
    • Domain knowledge (formula identification)
    • Structural knowledge (substructure finding and search)

                                   B. Sun, WWW’07, WWW’08, TOIS’11
                                   D. Yuan, ICDE’12
Challenges in Formula Search

How to identify a formula in scientific documents?

Non-Formula
“… This work was funded under NIH grants …”
“ … YSI 5301, Yellow Springs, OH, USA …”
“… action and disease. He has published over …”

Formula
“… such as hydroxyl radical OH, superoxide O2- …”
“ and the other He emissions scarcely changed …”

Machine learning algorithms (SVM + CRF) yield high
accuracies for correct formula identification.
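To see why identification needs more than pattern matching, here is a small sketch of a naive candidate detector. It flags any word that can be read entirely as element symbols with optional counts. The element list is deliberately partial and the whole approach is an illustration; it is not the SVM + CRF method named above.

```python
import re

# A small, deliberately incomplete set of element symbols (assumption for this sketch).
ELEMENTS = {"H", "He", "C", "N", "O", "S", "Cl", "Na", "Co", "I", "Y"}

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
WORD = re.compile(r"[A-Za-z0-9]+")

def is_formula_candidate(word):
    """True if the word parses fully as element symbols with optional counts."""
    pos = 0
    while pos < len(word):
        m = TOKEN.match(word, pos)
        if not m or m.group(1) not in ELEMENTS:
            return False
        pos = m.end()
    return len(word) > 0

def candidates(text):
    """All words in the text that look like formulae to the naive detector."""
    return [w for w in WORD.findall(text) if is_formula_candidate(w)]

print(candidates("such as hydroxyl radical OH, superoxide O2-"))
print(candidates("YSI 5301, Yellow Springs, OH, USA"))
print(candidates("This work was funded under NIH grants"))
```

The detector accepts "OH" in both the chemistry sentence and the postal address, and also accepts "NIH" and "YSI". Surface patterns only produce candidates; deciding which are real formulae needs the surrounding context, which is what the learned models use.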
Segmenting chemical names
•  Goal: to discover semantically meaningful sub-terms in chemical names
     –  Methylethyl alcohol
  
     –  methionylglutaminylarginyltyrosylglutamylserylleucyl	
  
        phenylalanylalanylglutaminylleucyllysylglutamylarginyl	
  
        lysylglutamylglycylalanylphenylalanylvalylprolylphenyl	
  
        alanylvalylthreonylleucylglycylaspartylprolylglycylisol	
  
        eucylglutamylglutaminylserylleucyllysylisoleucylaspartyl	
  
        threonylleucylisoleucylglutamylalanylglycylalanylaspartyl	
  
        alanylleucylglutamylleucylglycylisoleucylprolylphenyl	
  
        alanylserylaspartylprolylleucylalanylaspartylglycylprolyl	
  
        threonylisoleucylglutaminylasparaginylalanylthreonylleucyl	
  
        arginylalanylphenylalanylalanylalanylglycylvalylthreonyl	
  
        prolylalanylglutaminylcysteinylphenylalanylglutamyl	
  
        methionylleucylalanylleucylisoleucylarginylglutaminyllysyl	
  
        histidylprolylthreonylisoleucylprolylisoleucylglycylleucyl
  
        leucylmethionyltyrosylalanylasparaginylleucylvalylphenyl	
  
        alanylasparaginyllysylglycylisoleucylaspartylglutamylphenyl	
  
        alanyltyrosylalanylglutaminylcysteinylglutamyllysylvalyl	
  
        glycylvalylaspartylserylvalylleucylvalylalanylaspartylvalyl	
  
        prolylvalylglutaminylglutamylserylalanylprolylphenylalanyl	
  
        arginylglutaminylalanylalanylleucylarginylhistidylasparaginyl
  
        valylalanylprolylisoleucylphenylalanylisoleucylcysteinyl	
  
        prolylprolylaspartylalanylaspartylaspartylaspartylleucyl	
  
        leucylarginylglutaminylisoleucylalanylseryltyrosylglycyl	
  
        arginylglycyltyrosylthreonyltyrosylleucylleucylserylarginyl	
  
Chemical Search Aspects
•    Parsing
•    Extraction and tagging
•    Indexing
•    Ranking
  
Chemical Entity Extraction and Tagging
•    Name tagging
      –  Each chemical name can be a phrase
      –  Example
              •  "... Determination of lactic acid and ..."
              •  "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..."
•    Formula tagging
      –  Each formula is a single term
      –  Example
              •  "... such as hydroxyl radical OH, superoxide ..."
      –  Non-formula example
              •  "... YSI 5301, Yellow Springs, OH, USA ..."
•    Tagging examples
      –  Name tagging:
              "... of <name-type>lactic acid</name-type> and ..."

      –  Formula tagging:
              "... radical <formula-type>OH</formula-type>, superoxide ..."
  
Textual Chemical Molecule Information Indexing and Search
   •  Index schemes:
         –  Which tokens to index?
         –  Indexing all subsequences generates a very large index

•  Segmentation-based index scheme
     –  Used for indexing chemical names
     –  First segment a chemical name hierarchically and then index the substrings at each node
     –  Example segmentation tree:
            methylethyl
              methyl
                meth
                  me
                  th
                yl
              ethyl
                eth
                yl

 •  Frequency-and-discrimination-based index scheme
       –  Used for indexing chemical formulas
       –  Sequentially select frequent and discriminative subsequences of a formula, from the shortest to the longest
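The segmentation-based scheme can be sketched as a tree walk. The tree below is transcribed by hand from the "methylethyl" example on this slide; how the segmentation itself is computed is not shown here, and the `(substring, children)` tuple layout is an assumption made for illustration.

```python
# Segmentation tree for "methylethyl": each node is (substring, [children]).
TREE = ("methylethyl", [
    ("methyl", [("meth", [("me", []), ("th", [])]), ("yl", [])]),
    ("ethyl", [("eth", []), ("yl", [])]),
])

def index_terms(node):
    """Collect the substring stored at every node of the tree (pre-order)."""
    term, children = node
    terms = [term]
    for child in children:
        terms.extend(index_terms(child))
    return terms

def all_substrings(name):
    """Every distinct substring of the name, for size comparison only."""
    return {name[i:j] for i in range(len(name)) for j in range(i + 1, len(name) + 1)}

terms = index_terms(TREE)
print(terms)
print(len(set(terms)), "segment terms vs", len(all_substrings("methylethyl")), "substrings")
```

Indexing only the node substrings keeps the index far smaller than indexing every substring, while still allowing a query such as "ethyl" or "meth" to hit the name.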
  
Features for Formula Indexing
•  Formula
    –  A sequence of chemical elements or partial formulas with corresponding frequencies
    –  E.g. CH3(CH2)2OH
•  Partial formula
    –  Partial formula: a subsequence of a formula
    –  E.g. C, H, O, CH3, CH2, OH, CH3(CH)2, H3(CH)2, CH3(CH)2O, etc.
•  Index construction
    –  Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc.
    –  Too many partial formulas; feature selection is needed
  
Criteria of Feature Selection

•  Criteria of feature selection
    –  Frequent features ($Freq_s \ge Freq_{min}$)

    –  Discriminative features ($\alpha_s \ge \alpha_{min}$)
         •  If a sequence's selected subsequences are enough to distinguish formulas containing them from other formulas, this sequence is redundant.
         •  Discrimination score
               $\alpha_s = \left| \bigcap_{s' \in F \wedge s' \preceq s} D_{s'} \right| / |D_s|$

            where $F$ is the selected feature set, $D_s$ is the set of formulas containing $s$, and $s' \preceq s$ means that $s'$ is a subsequence of $s$.
  
An Example for Formula Indexing

•  Data set:
       –  1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH
•  Parameters:
       –  $Freq_{min} = 2$, $\alpha_{min} = 1.1$
•  Steps:
      –  Length=1, Candidates={C,H,O}, F={C,H,O}
      –  Length=2, Candidates={CH3,H3C,CO,OO,OH,CH2}, Frequent Candidates={CH3,CO,OO,OH,CH2}
          $\alpha_{CH_3} = |\{1,2,3\}_C \cap \{1,2,3\}_H| / |\{1,2,3\}_{CH_3}| = 1$
          $\alpha_{CO} = |\{1,2,3\}_C \cap \{1,2,3\}_O| / |\{1,3\}_{CO}| = 1.5$
          Frequent & Discriminative Candidates={CO,OO,CH2}
          F={C,H,O,CO,OO,CH2}
       –  Length=3, …
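The length-2 step of this example can be reproduced with a short sketch. The posting sets are copied from the example; the table of which selected features are contained in each candidate is written out by hand, and frequency is taken to mean the number of formulas containing the candidate. Both are assumptions for illustration, since a real indexer would derive them by parsing the formulas.

```python
# Posting sets D_s from the example (formula ids 1..3).
D = {
    "C": {1, 2, 3}, "H": {1, 2, 3}, "O": {1, 2, 3},
    "CH3": {1, 2, 3}, "CO": {1, 3}, "OO": {1, 3},
    "OH": {1, 2, 3}, "CH2": {2, 3},
}
# Already-selected features (from F = {C, H, O}) contained in each candidate.
# Hand-written here; hypothetical stand-in for a subsequence test.
SUBFEATURES = {
    "CH3": ["C", "H"], "CO": ["C", "O"], "OO": ["O"],
    "OH": ["O", "H"], "CH2": ["C", "H"],
}

def discrimination(candidate):
    """alpha_s: size of the intersection of sub-feature postings over |D_s|."""
    covered = set.intersection(*(D[s] for s in SUBFEATURES[candidate]))
    return len(covered) / len(D[candidate])

def select(candidates, freq_min=2, alpha_min=1.1):
    """Keep candidates that are both frequent and discriminative."""
    return [c for c in candidates
            if len(D[c]) >= freq_min and discrimination(c) >= alpha_min]

print(select(["CH3", "CO", "OO", "OH", "CH2"]))
```

CH3 and OH score 1.0 because every formula containing C and H (or O and H) also contains them, so they add no discriminating power; CO, OO and CH2 score 1.5 and are kept, matching the set F given in the example.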
  
Formula Search
 •  SF.IEF: Subsequence Frequency & Inverse Entity Frequency
           $SF(s,e) = \frac{Freq(s,e)}{|e|}$, $IEF(s) = \log \frac{|C|}{|\{e \mid s \preceq e\}|}$
 •  Exact formula search
      –  Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2.
 •  Frequency formula search
      –  Full frequency search: search for formulas with specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2.
      –  Partial frequency search: similar, but allows unspecified elements. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well.
      –  Ranking function
           $score(q,e) = \sum_{s \in q} SF(s,e)\, IFF(s)^2 / \left( |f| \times \sum_{s \in q} IFF(s)^2 \right)$
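The SF and IEF definitions at the top of this slide can be tried out on a toy collection. The sketch below restricts a "subsequence" to a single element symbol and handles only flat formulae without parentheses; both restrictions, and the use of the natural logarithm, are simplifying assumptions.

```python
import math
import re
from collections import Counter

def element_counts(formula):
    """Count elements in a flat formula such as 'CH3COOH' (no parentheses)."""
    counts = Counter()
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(digits) if digits else 1
    return counts

def sf(s, formula):
    """SF(s, e): frequency of element s in e over the total element frequency |e|."""
    counts = element_counts(formula)
    return counts[s] / sum(counts.values())

def ief(s, collection):
    """IEF(s): log of collection size over the number of formulae containing s."""
    containing = sum(1 for e in collection if element_counts(e)[s] > 0)
    return math.log(len(collection) / containing)

collection = ["CH4", "C2H6", "CH4O"]   # illustrative collection
print(sf("C", "CH4"), ief("C", collection), ief("O", collection))
```

An element that occurs in every formula (C here) gets an IEF of zero, so it contributes nothing to ranking, while a rarer element (O) is weighted up. `ief` assumes the element occurs at least once in the collection.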
Formula Search substructure
•  Substructure formula search
       –  Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
       –  Ranking function
              $score(s,e) = W_{match(s,f)}\, SF(s,e)\, IFF(s) / |e|$
          where $W_{match(q,f)}$ is the weight for exact match, reverse match, and parsed match
•  Similarity formula search
       –  Search for formulas with a structure similar to that of the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
       –  Ranking function
              $score(q,e) = \sum_{s \preceq q} W_{match(q,e)}\, W(s)\, SF(s,q)\, SF(s,e)\, IFF(s) / |e|$
•  Conjunctive search of the basic types of formula searches
       –  E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH.
•  Document query rewriting
       –  E.g. the document query "atom formula:=CH4" is rewritten to "atom (CH4 OR CD4)" if the formula search =CH4 matches CH4 and CD4.
  
Formula Search - Query Models
Many models are possible, from exact to semantic
       Models discriminated by matching algorithms

•    Exact search
       –  Search for exact representations
       –  E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2
•    Frequency searches
       –  Full frequency search: search for formulae with specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements
       –  E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2
       –  Partial frequency search: similar, but allows unspecified elements
       –  E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well
•    Substructure search
       –  Search for formulae that may have a substructure
       –  E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score).
•    Similarity search
       –  Search for formulae with a structure similar to that of the query formula. Feature-based approach using partial formulae matching.
       –  E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
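The two frequency query models can be sketched in a few lines. The sketch covers only full and partial frequency search on flat formulae without parentheses; the query parser and function names are illustrative assumptions, not the ChemXSeer implementation.

```python
import re
from collections import Counter

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
RANGE = re.compile(r"([A-Z][a-z]?)(\d+)(?:-(\d+))?")

def element_counts(formula):
    """Count elements in a flat formula such as 'CH3CH3' (no parentheses)."""
    counts = Counter()
    for symbol, digits in TOKEN.findall(formula):
        counts[symbol] += int(digits) if digits else 1
    return counts

def parse_query(query):
    """'C1-2H4-6' -> {'C': (1, 2), 'H': (4, 6)}; one number means an exact count."""
    ranges = {}
    for symbol, low, high in RANGE.findall(query):
        ranges[symbol] = (int(low), int(high) if high else int(low))
    return ranges

def frequency_match(query, formula):
    """Full frequency search, or partial frequency search if the query starts with '*'."""
    partial = query.startswith("*")
    ranges = parse_query(query.lstrip("*"))
    counts = element_counts(formula)
    in_range = all(low <= counts[el] <= high for el, (low, high) in ranges.items())
    no_extras = set(counts) <= set(ranges)
    return in_range and (partial or no_extras)

for f in ["CH4", "C2H6", "H6C2", "CH3CH3", "CH4O", "C2H6O2"]:
    print(f, frequency_match("C1-2H4-6", f), frequency_match("*C1-2H4-6", f))
```

Because matching works on element counts, order is ignored, which is why H6C2 and CH3CH3 both match. The only difference between the two models is whether elements outside the query are tolerated.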
  
Ranking formulae
•    Ranking formulae has to depend on need and importance
•    Focus on structural methods and frequency
•    Importance can be introduced by citation rank, PageRank, or others
•    SF.IFF
      –  Substructure frequency and inverse formula frequency
•    Frequency searches
      –  $score(q,f) = \sum_{e \in q} SF(e,f)\, IFF(e)^2 / \left( |f| \times \sum_{e \in q} IFF(e)^2 \right)$
      –  where $|f|$ is the total frequency of elements

•    Substructure search
      –  $score(q,f) = W_{match(q,f)}\, SF(q,f)\, IFF(q) / |f|$
      –  where $W_{match(q,f)}$ is the weight for exact match, reverse match, and parsed match

•    Similarity search
      –  $score(q,f) = \sum_{s \preceq q} W_{match(q,f)}\, W(s)\, SF(s,q)\, SF(s,f)\, IFF(s) / |f|$
Chemical compounds as graphs
•  Chemical compound modeled as a semantic graph with properties
  




 Atom: vertex/node in the graph
 Bond: edge in the graph
 Dimensions: 3 or 4
                                         Above figures are copied from
                                         eMolecules.com
What’s Chemical Structure Search
•  Substructure Search
   –  Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure.
•  Superstructure Search
   –  Given an input chemical structure sketch, find all the important descriptors (substructures/functional groups) contained in the input.
•  Similarity Search
   –  Given an input chemical structure sketch, find all the chemical compounds “similar” to the input.
  	
  
Table Search

Tables are widely used to present experimental results or statistical
data in scientific documents; some data only exists in these tables.

Current search engines treat tabular data as regular text
    •  Structural information and semantics not preserved.

Goal: automatically identify tables, extract table metadata from pdf
documents into xml and rank data


Table Metadata Representation:
•  Environment metadata: (document specifics: type, title,…)
•  Frame metadata: (border left, right, top, bottom, …)
•  Affiliated metadata: (Caption, footnote, …)
•  Layout metadata: (number of rows, columns, headers,…)
•  Cell content metadata: (values in cells)
•  Type metadata: (numeric, symbolic, hybrid, …)
                                 Y. Liu AAAI’07, JCDL’07.
Tables	
  
•  A history that pre-dates that of sentential text
    –  Cuneiform clay tablets
•  Have not received the same level of formal characterization
   enjoyed by sentential text
•  Varying and irregular formats
•  Different intuitive understanding of what a “table” is.
    –    Is the Periodic Table of the Elements a table?
    –    Tables vs. Lists?
    –    Tables vs. Forms?
    –    Tables vs. Figures?
    –    Genuine table vs. non-genuine table? [12]
•  Our definition: scientific genuine table
    –  Caption + tabular structure
    –  Ruling lines are not required
TableSeer	
  
Beta design of a table search engine
TableSeer System Architecture
Page Box-Cutting Algorithm
•  Improves table detection performance by excluding more than 93.6% of the document content at the beginning
  
Sample Table Metadata Extracted File

•    <Table>
•    <DocumentOrigin>Analyst</DocumentOrigin>
•    <DocumentName>b006011i.pdf</DocumentName>
•    <Year>2001</Year>
•    <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors </DocumentTitle>
•    <Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R . Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, C T 06269-3060</Author>
•    <TheNumOfCiters></TheNumOfCiters>
•    <Citers></Citers>
•    <TableCaption>Table 1 Temperature effect o n r esistance change ( D R ) and response timeof tin oxide thin film with 1 % C Cl 4</TableCaption>
•    <TableColumnHeading>D R Temperature/ °C D R a / W ( R ,O 2 ) (%) R esponse time Reproducibiliy </TableColumnHeading>
•    <TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes 300 1027 21 < 2 0 s Yes 400 993 31 ~ 1 0 s No </TableContent>
•    <TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote>
•    <ColumnNum>5</ColumnNum>
•    <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
•    <PageNumOfTable>3</PageNumOfTable>
•    <Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
•    </Table>
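A metadata file of this shape can be read with Python's standard XML parser. The sketch below uses a shortened, hand-cleaned sample with the same tag names; it assumes the file is well-formed XML, which is not guaranteed for raw extractor output.

```python
import xml.etree.ElementTree as ET

# Shortened, cleaned-up sample using tag names from the slide (illustrative only).
SAMPLE = """<Table>
  <DocumentName>b006011i.pdf</DocumentName>
  <Year>2001</Year>
  <TableCaption>Table 1 Temperature effect on resistance change</TableCaption>
  <ColumnNum>5</ColumnNum>
  <PageNumOfTable>3</PageNumOfTable>
</Table>"""

def table_fields(xml_text):
    """Return {tag: text} for each child of <Table>, with whitespace trimmed."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

fields = table_fields(SAMPLE)
print(fields["DocumentName"], fields["Year"], fields["ColumnNum"])
```

Fields such as the caption, footnote and reference text would then be fed to the index as separate fields. A bare "<" inside a value, as in the table content of the sample above, would have to be escaped as `&lt;` before a strict parser accepts the file.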
  
TableRank	
  

• Rank tables by rating the <query, table> pairs instead of the
<query, document> pairs; this prevents many of the false-positive hits
that frequently occur when current web search engines are used
for table search
• The similarity between a <table, query> pair: the cosine of the
angle between vectors




• Tailored term vector space => table vectors:
    • Query vectors and table vectors, instead of document
    vectors
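The cosine similarity between a query vector and a table vector can be sketched as follows. For brevity the vectors hold raw term counts; TableRank uses its own term weighting, so the weights here are a stand-in assumption.

```python
import math
from collections import Counter

def cosine(query_terms, table_terms):
    """Cosine of the angle between two bag-of-words vectors (raw counts)."""
    q, t = Counter(query_terms), Counter(table_terms)
    dot = sum(q[w] * t[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in t.values()))
    return dot / norm if norm else 0.0

# Illustrative query and table metadata terms.
query = ["tin", "oxide", "resistance"]
table = ["temperature", "effect", "resistance", "tin", "oxide", "film"]
print(round(cosine(query, table), 3))
```

The table vector is built from the table's own metadata (caption, headings, reference text) rather than from the whole document, which is what lets a table be ranked independently of the paper it appears in.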
Table Index
    Index
         Captions
         Footnotes
         Reference Text
    Boosting
         Captions (2)
         Function:
           -    Inversely (reciprocally) proportional to #cites.
  
Term Weighting for Tables
–  TTF-ITTF: Table Term Frequency-Inverse Table Term Frequency

–  TLB: Table Level Boost factors (e.g., table frequency)
–  DLB: Document Level Boost factors (e.g., journal/proceeding order, document citation)
  	
  
Table term ranking
  




• A term occurring in a few tables is likely to be a better discriminator than a term
appearing in most or all tables
• Similar to document abstract, table metadata and table query should be treated as
semi-structured text
       • Not complete sentences and express a summary
       • P = 0.5 (G. Salton 1988)
•  b is the total number of tables
• IDF(ijk): the number of tables in which term t(i) occurs in the metadata m(k)
Table Level Boost and Document Level Boost
  


Btbf is the boost value of the table frequency
Btrt is the boost value of the table reference text (e.g., the normalized length), and
Btp is the boost value of the table position. r is a parameter, which is 1 if users
specify the table position in the query. Otherwise, r = 0.




IVj: document Importance Value (IV). If a table comes from a document with
a high IV , all the table terms of this document should get a high document
level boost
ICj: the inherited citation value (ICj)
DOj: source value (the rank of the journal/conference proceeding)
DFj: document freshness
Table citation network
•  Similar to the PageRank network
     –  Documents construct a network from the citations
     –  The “incoming links”: the documents that cite the document in which the table is located
     –  Exponential decay used to deal with the impact of the propagated importance
•  Unlike the PageRank network
     –  Directed Acyclic Graph
     –  Importance Value (IV) of a document not decreased as the number of citations increases
     –  IV not divided by the number of outbound links
•  A document may have multiple, one, or no tables
•  Each table consists of a set of metadata
•  Same keywords may appear in different metadata in different tables
  	
  
Table Search Summary
•  A novel table ranking algorithm, the first of its kind: TableRank
•  A tailored table term vector space
•  A table term weighting scheme: TTF-ITTF
    –  Aggregating impact factors from three levels: the term, the table, and the document
•  Index table referenced texts, term locations, and document backgrounds
•  Design and implement the first table search engine, TableSeer, to evaluate TableRank and compare with popular web search engines
•  Code released
•  Currently implemented in CiteSeerX: millions of tables
•  Improving extraction: Dow Chemical support
  
Automated Figure Data Extraction and Search
•     Large amount of results in digital documents are recorded in figures, time series, experimental
      results (e.g., NMR spectra, income growth) and this is the only record of the data

•     Extraction for purposes of:
       –      Further modeling using presented data
       –      Indexing, meta-data creation for storage & search on figures for data reuse


•     Current extraction done manually!




[Figure: system diagram with components Digital Library, Documents, Extracted Plot, Extracted Info., Document Index, Plot Index, Merged Index, User]
  
Seer Figure/Plot Data Extraction and Search

 Numerical data in
 scientific publications
 are often found in figures.




 Tools that automate the data extraction from figures
 provide the following:
 •  Increases our understanding of key concepts of papers
 •  Provides data for automatic comparative analyses.
 •  Enables regeneration of figures in different contexts.
 •  Enables search for documents with figures containing
 specific experiment results.
                      X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08
Metadata & data to extract: 2 Dimensional Plot
[Figure: snapshot of a document beside the extracted 2D plot, annotated with the items to extract: Y-axis labels, legend, data points, ticks, axis units, X-axis label]
Our Approach to Plot Data Extraction
• Identify and extract figures from digital documents
    • ASCII and image extraction (xpdf)
    • OCR - bitmap, raster PDFs
• Identify figures as images of 2D plots using SVM (only for bitmap images)
    • Hough transform
    • Wavelet coefficients of image
    • Surrounding text features
• Binarization of the 2D plots identified for preprocessing (no need for vectorized images)
    • Adaptive thresholding
•  Image segmentation to identify regions
     • Profiling or Image Signature
•  Text block detection
     • Nearest Neighbor
•  Data point detection
     • K-means Filtering
•  Data point disambiguation for overlapping points
     • Simulated Annealing
Future Directions
•  System integration within ChemXSeer or CiteSeerX
   –  XML data generation
   –  Open source tool in Lucene/Solr

•  Extension to other figures (3D, …)
ChemXSeer Highlights
•  Portal for academic researchers in environmental chemistry that integrates the scientific
literature with experimental, analytical and simulation results and tools

•  Provides unique metadata extraction, indexing and searching pertinent to the chemical
literature by using heuristics combined with machine learning
     •  Chemical formulae and names
     •  Tables
     •  Figures
     •  Publication functions as in CiteSeerX
     •  Interoperability: ORE-Chem development
     •  Novel ranking required

•  After extraction, data is stored as API-accessible XML for users

•  Hybrid repository (not fully open): serves as a federated information interoperational system
      •  Scientific papers crawled and indexed from the web
      •  User-submitted papers and datasets (e.g. Excel worksheets, Gaussian and CHARMM toolkit outputs)
      •  Scientific documents and metadata from publishers (e.g. Royal Society of Chemistry)

•  Access control for publisher-provided content and user-submitted experiment data

•  Takes advantage of developments in other funded cyberinfrastructure and open source
projects
     •  CiteSeerX, PlanetLab, Lucene/Solr, ORE, others
     •  Some released open source
Experimental Collaborator recommendation system

•  CollabSeer currently supports 400k authors
•  http://collabseer.ist.psu.edu
Collaboration recommendation
•  Metadata of authors and coauthors and topics of interest (similar to expert recommendation)
•  Use social network and topics to recommend collaborators of collaborators (FOF)
•  Devise SN index and ranking scheme
•  Explore models of vertex similarity
•  Built on SeerSuite
•  Other recommendations?
     –  Experimental methods
     –  Chemicals?
Gou JCDL’10, Gou MIR’10; Chen JCDL’11, SAC’12
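The idea of recommending collaborators of collaborators by vertex similarity can be sketched as follows. This is only an illustration: Jaccard similarity over coauthor sets is one common vertex similarity measure and is assumed here, the slides do not say which measures CollabSeer uses, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_coauthor_graph(papers):
    """Map each author to the set of people they have coauthored with."""
    graph = defaultdict(set)
    for authors in papers:
        for author in authors:
            graph[author].update(a for a in authors if a != author)
    return graph


def jaccard(graph, a, b):
    """Vertex similarity: shared coauthors over all coauthors of a or b."""
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0


def recommend(graph, author, top_k=3):
    """Rank collaborators of collaborators who are not yet coauthors."""
    candidates = set()
    for coauthor in graph[author]:
        candidates |= graph[coauthor]
    candidates -= graph[author] | {author}
    scored = [(jaccard(graph, author, c), c) for c in candidates]
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, round(score, 3)) for score, name in scored[:top_k]]


# Toy author lists, one per paper (hypothetical data).
papers = [["ann", "bob"], ["bob", "cy"], ["ann", "dee"], ["dee", "cy"], ["bob", "eve"]]
graph = build_coauthor_graph(papers)
suggestions = recommend(graph, "ann")
```

Here "cy" ranks above "eve" because "cy" shares both of ann's coauthors, while "eve" shares only one.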
  
Recommendation list and user’s topic of interest
•    Users refine the recommendation list by clicking on their topic of interest (left: refined by “query processing”; right: default recommendation list)
•    How two potential collaborators are linked by common collaborators
CollabSeer Framework
Integration of Vertex Similarity and Textual Similarity

•  [Equation: combined score, a product of exponential functions of S and S_C.O.T.]
         –  S: vertex similarity
         –  S_C.O.T.: collaborator’s contribution to a specified topic
         –  Use the product of exponential functions so that a zero vertex similarity score or a zero contribution (textual similarity) score does not turn the whole measure into zero
•  Other measures?
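The reason for multiplying exponentials rather than raw scores can be shown with a small sketch. The exact formula is not preserved in this transcript, so the unweighted form exp(S) * exp(S_C.O.T.) below is an assumption based only on the description above, and the function names are hypothetical.

```python
import math


def combined_score(vertex_similarity, topic_contribution):
    """Assumed form: product of exponentials of the two component scores."""
    return math.exp(vertex_similarity) * math.exp(topic_contribution)


def plain_product(vertex_similarity, topic_contribution):
    """Naive alternative, shown only for contrast."""
    return vertex_similarity * topic_contribution


# A candidate with no shared coauthors (S = 0) but a strong topical match.
naive = plain_product(0.0, 0.8)        # collapses to zero
combined = combined_score(0.0, 0.8)    # still reflects the topical match
weaker = combined_score(0.0, 0.2)      # ranks below the stronger topical match
```

With the plain product, any candidate scoring zero on one component is indistinguishable from every other such candidate; with exponentials the other component still orders them.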
  
•  RefSeerX: recommend citations for papers

[Figure: a paper goes in, and RefSeerX returns citations to use]

                 The authors are unaware of related work they do not know they are looking for;
                 RefSeerX recommends related citations
•  Based on
    –    Existing citations
    –    Citation context
    –    Venue and importance
    –    Contemporary vs seminal
He, WWW ’10, WSDM ’11; Kataria, CIKM ’10, IJCAI ’11
 
Expert Search




• Expert search for authors, currently in alpha
 
Keyphrase Extraction for experts
•  Input: Text Document
•  Section Parser: parse document into sections with regular expression
•  Candidate Extractor (input: DBLP data): use DBLP statistics to extract keyphrase candidates
•  Random Forest (input: Training Data): train random forest to classify & rank whether a phrase is a keyphrase
•  Output: Top Keyphrases

Treeratpituk, P., Teregowda, P., Huang, J. and Giles, CL. SEERLAB: A System for Extracting Keyphrases from
Scholarly Documents, Semeval-2010 task 5: Automatic keyphrase extraction from scientific article. ACL workshop
on Semantic Evaluations (SemEval 2010), Sweden, July 2010.
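The first two stages of the keyphrase pipeline above can be sketched roughly as follows. This is not the SEERLAB implementation: the heading pattern, the stopword list, and the use of plain n-gram counts in place of DBLP statistics are all assumptions made for illustration, and the random forest ranking stage is left out.

```python
import re
from collections import Counter

# Assumed heading names; real papers need a much richer pattern.
SECTION_HEADING = re.compile(
    r"^(?:\d+(?:\.\d+)*\.?[ \t]+)?"
    r"(Abstract|Introduction|Related Work|Methods?|Experiments?|Results|Conclusions?|References)"
    r"[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

# Tiny illustrative stopword list.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "we", "with"}


def split_sections(text):
    """Split a document into {heading: body} using the heading regex."""
    sections = {}
    matches = list(SECTION_HEADING.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections


def candidate_phrases(text, max_len=3):
    """Count 1..max_len word n-grams that do not start or end with a stopword."""
    words = re.findall(r"[a-z][a-z\-]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts


document = (
    "Abstract\nWe study keyphrase extraction.\n"
    "1 Introduction\nKeyphrase extraction helps search.\n"
)
sections = split_sections(document)
candidates = candidate_phrases(" ".join(sections.values()))
```

In a fuller version, each candidate would be turned into a feature vector and passed to a trained classifier, which is the role the random forest plays on the slide.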
GrantSeer
•  Prototype search engine for PI profiles and their grant information to assist funding agencies, deans of research, foundations
•  Link PIs with their
     –    Grants
     –    Publications
     –    Citations
     –    Organization
     –    Expertise
     –    Others?
•  Data that can be shared
     –  CiteSeerX or Google Scholar data
     –  Database of funded research
                                                           Funded by NSF – Julia Lane
Cover page NSF XML extraction
GrantSeer: PI profile
[Figure: PI profile page showing grants awarded, publications + citations, and the PI’s expertise]
Algorithm Search




• Homepage search for authors, currently in alpha
AlgorithmSeer

Algorithm Search
- Extraction
- Indexing
- Ranking

Suite Workshop, ICSE ’11
Algorithm Search
Metadata extraction
• Extract
    • Pseudo-codes and their metadata
      • Captions
      • Reference sentences
      • Synopsis
      • Etc.
• Index metadata using Solr to make the pseudo-codes searchable
• Each search result has a pointer to the page in the document where the pseudo-code appears
Index Fields
id <string>
caption <text>
reftext <text> (Reference Sentences)
synopsis <text> (Summarizing Text)
page <sint> (Page Number)
paperid <string> (Document ID)
year <sint> (Year of Publication)
ncites <sint> (Number of Citations)
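To make the field list concrete, here is a hedged sketch of how a record with these fields might be prepared for Solr and queried. The host, core name (`algorithms`), helper names, and sample values are hypothetical; only the field names come from the slide. The sketch builds the JSON body and the select URL without contacting a server.

```python
import json
from urllib.parse import urlencode

# Hypothetical Solr location and core name; not taken from the slides.
SOLR_BASE = "http://localhost:8983/solr/algorithms"


def make_document(doc_id, caption, reftext, synopsis, page, paperid, year, ncites):
    """Build one index record using the field names listed above."""
    return {
        "id": doc_id,
        "caption": caption,
        "reftext": reftext,
        "synopsis": synopsis,
        "page": page,
        "paperid": paperid,
        "year": year,
        "ncites": ncites,
    }


def update_payload(documents):
    """JSON array of documents, the shape Solr's JSON update handler accepts."""
    return json.dumps(documents)


def search_url(terms, rows=10):
    """Search the three text fields and sort hits by citation count.

    `terms` is inserted as-is; real code should escape Solr query syntax.
    """
    query = " OR ".join(f"{field}:({terms})" for field in ("caption", "reftext", "synopsis"))
    params = {
        "q": query,
        "fl": "id,caption,page,paperid,year,ncites",
        "sort": "ncites desc",
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"


# Made-up sample record for illustration only.
record = make_document(
    "example-paper-1:alg1", "Algorithm 1: Shortest path search",
    "Algorithm 1 computes the shortest path.", "Finds shortest paths in a graph.",
    7, "example-paper-1", 2009, 12,
)
payload = update_payload([record])
url = search_url("shortest path", rows=5)
```

Returning `page` and `paperid` in the field list is what lets each hit point back to the page where the pseudo-code appears.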
AckSeer
Funding Agency Impact

Funding agency impact
•  based on acknowledgement indexing
•  # of acknowledgements
•  total citations
•  #Citation / #ack metric

Based on acknowledgment entities extracted from 150K acknowledgements in CiteSeer

New system available this spring: AckSeer

Giles, PNAS, 2004

Name                                                           Number of          Total       C/A
                                                               Acknowledgements   Citations   Metric
Funding Agencies
National Science Foundation                                    12287              144643      11.77
Defense Advanced Research Projects Agency                      4712               80659       17.12
Office of Naval Research                                       3080               48873       15.87
Deutsche Forschungsgemeinschaft                                2780               9782        3.52
National Aeronautics and Space Administration                  2408               21242       8.82
Engineering and Physical Science Research Council              2007               16582       8.26
Air Force Office of Scientific Research                        1657               16850       10.17
National Sciences and Engineering Research Council of Canada   1422               12050       8.47
Department of Energy                                           1054               5562        5.28
Australian Research Council                                    1010               5464        5.41
European Union Information Technologies Program                825                9594        11.63
National Institutes of Health                                  709                7279        10.27
Army Research Office                                           666                7709        11.58
Netherlands Organization for Scientific Research               646                2843        4.4
Science and Engineering Research Council                       489                6976        14.27
Companies
International Business Machines                                1380               23948       17.35
Intel Corporation                                              962                14441       15.01

[Second table, cut off at the right edge of the slide; names only]
Educational Institutions: Carnegie Mello… University; Massachusetts… of Technology; California Inst… Technology; Santa Fe Institu…; French Nationa… Institute for Re… Computer Scie…; Stanford Unive…; University of C… at Berkeley; National Cente… Supercomputin… Applications; International C… Science Institu…; Cornell Univer…; University of I… Urbana-Champ…; USC Informati… Sciences Instit…; University of C… Los Angeles; McGill Univer…; Australian Nat… University
Individuals: Olivier Danvy; Oded Goldreic…
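The "#Citation / #ack" metric in the funding agency table is a simple ratio, and a short sketch makes the arithmetic explicit. The function names are hypothetical; the three rows are copied from the table above.

```python
def ca_metric(total_citations, acknowledgements):
    """Citations per acknowledgement: total citations / number of acknowledgements."""
    if acknowledgements <= 0:
        raise ValueError("acknowledgements must be positive")
    return total_citations / acknowledgements


def rank_by_metric(rows):
    """Sort (name, acknowledgements, citations) rows by C/A metric, highest first."""
    scored = [(name, round(ca_metric(cites, acks), 2)) for name, acks, cites in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# (name, number of acknowledgements, total citations), taken from the table.
funding_rows = [
    ("National Science Foundation", 12287, 144643),
    ("Defense Advanced Research Projects Agency", 4712, 80659),
    ("Deutsche Forschungsgemeinschaft", 2780, 9782),
]
ranking = rank_by_metric(funding_rows)
```

Sorting by this ratio rather than by raw acknowledgement count changes the order: the agency acknowledged most often is not the one with the highest citations per acknowledgement.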
Most Acknowledged Authors and Impact Factor

Author                Citations   Acknowledgements   C/A Metric
Olivier Danvy         847         268                29.85
Oded Goldreich        3277        259                17.82
Luca Cardelli         3847        247                43.91
Tom Mitchell          3336        226                24.31
Martin Abadi          3507        222                43.46
Phil Wadler           3780        181                40.07
Moshe Vardi           3786        180                33.86
Peter Lee             1790        167                53.54
Avi Wigderson         2566        160                18.13
Matthias Felleisen    1622        154                30.55
Benjamin Pierce       1484        152                30.53
Noga Alon             2640        152                15.71
John Ousterhout       3693        152                41.9
Frank Pfenning        1639        148                13.84
Andrew Appel          2064        144                52.99

Interviewed by Nature as to why he was the most acknowledged computer scientist

Who is most acknowledged?
Mom or dad
Theorists or experimentalists
Who has a better metric?
Clouding CiteSeerX
•    Hosting cloud CiteSeerX instances
        •    Economic issues
                •    Cost of hosting
                •    Cost of refactoring the source to be hosted in the cloud.
        •    Computational/technical issues
                •    What workflow to cloudize
                •    Component modification for efficient operation
                •    VM size: storage, memory and CPU sizing as a function of needs
                •    Establishing computational needs and availability clusters
                •    Appropriate load balancing across multiple sites.
                •    Security of data stored including metadata and user data.
        •    Policy issues
               •     Privacy of user data
               •     Copyright issues.
                                                    Teregowda Cloud’10 USENIX’10
SeerSuite Research/Development Opportunities
•     Old Seers
        –  Improve or revive old systems and port them into competitive SeerX space
                 •    eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX
•     New Seers
        –  New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes)
        –  MyCiteSeerX
•     Better features
        –    Parsing
        –    Entity disambiguation
        –    Citation analysis
        –    Ranking, ranking, ranking
•     New features
        –  New parsing, indexing, ranking
                 •    Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc.
        –    Homepage linking
        –    ORE search and data integration
        –    Collaborative spaces
        –    API/web services
        –    Integration with DL such as Fedora
        –    New clusters
                 •    Topics, venues, affiliations
        –  Recommender systems
        –  SNA analysis
        –  Others
Collaborations welcomed!
Data and software available
Research SeerSuite supports
•  Many uses as a research testbed and support structure
     –    Scaling of algorithms for IR, IE, data mining, social networks, ...
     –    NLP methods on large text collections
     –    ML methods to automatically extract data
     –    Novel indexing and ranking
     –    Federated search
     –    Collaborative and social networks
     –    Focused crawling – new data resources
     –    Interface design and integration
     –    Systems analysis

•  Many development/applied research issues
     –    Integration with other DLs
     –    Automated feature development
     –    Transfer to nontechnical use
     –    Cloud based delivery
Summary
•    Propose an infrastructure for academic and scientific search engine/digital library creation - SeerSuite
      –  Modular, scalable, extensible, robust
      –  Based on commercial grade open source (Solr/Lucene); easy to use
      –  Easy to apply to other domains (separable indexes and projects - integration)
•    Allows scalable data mining and information extraction for actual systems
      –  Unique information extraction plugins
      –  Focus on unique scalable extraction/data mining methods
              •    Most methods less than N^2 complexity
      –  Automatically populates databases or data structures
•    Demonstrate with beta systems in
      –  Computer science, Archaeology, Chemistry, Robots.txt, PubMed, YouSeer, Tables, Figures, Maps, References, Collaborations, Disambiguation
      –  Personal features
•    Systems are reasonably easy to build; issues are
      –  Data collection or data access
      –  Information extraction, indexing, ranking
•    Many uses as a research testbed
      –  Data sharing models
•  Want to find a Seer, search Google or use my homepage.
Opportunities
•  Science is being flooded with data
    –  Simulations, sensors, web
•  Digital humanities is right behind
•  Needs in
    –  Large scale data management (tera to peta)
          •  NoSQL databases: graphs, documents, floating point
    –  Large scale
          •  data mining
          •  information extraction
          •  search
•  Domain expertise crucial
•  Reuse not reinvent (much is out there)
•  Solr/Lucene is great for demos, production and research.
“Human attention is the scarce
  resource, not information.” Herbert
  A. Simon, Nobel Laureate, 1997.




For more information
•  clgiles.ist.psu.edu
•  giles@ist.psu.edu
•  SourceForge.com

Mais conteúdo relacionado

Mais procurados

How Portable Are the Metadata Standards for Scientific Data?
How Portable Are the Metadata Standards for Scientific Data?How Portable Are the Metadata Standards for Scientific Data?
How Portable Are the Metadata Standards for Scientific Data?Jian Qin
 
Data management basics, for UC Davis EDU 292
Data management basics, for UC Davis EDU 292Data management basics, for UC Davis EDU 292
Data management basics, for UC Davis EDU 292Phoebe Ayers
 
Ownership, intellectual property, and governance considerations for academic ...
Ownership, intellectual property, and governance considerations for academic ...Ownership, intellectual property, and governance considerations for academic ...
Ownership, intellectual property, and governance considerations for academic ...Rebekah Cummings
 
Role of libraries in research and scholarly communication
Role of libraries in research and scholarly communicationRole of libraries in research and scholarly communication
Role of libraries in research and scholarly communicationNikesh Narayanan
 
Bushra bioinformatic Presentation
Bushra bioinformatic PresentationBushra bioinformatic Presentation
Bushra bioinformatic PresentationNaveed Akhtar Isamu
 
Data Management for Undergraduate Researchers (updated - 02/2016)
Data Management for Undergraduate Researchers (updated - 02/2016)Data Management for Undergraduate Researchers (updated - 02/2016)
Data Management for Undergraduate Researchers (updated - 02/2016)Rebekah Cummings
 
Research Data Management and Sharing for the Social Sciences and Humanities
Research Data Management and Sharing for the Social Sciences and HumanitiesResearch Data Management and Sharing for the Social Sciences and Humanities
Research Data Management and Sharing for the Social Sciences and HumanitiesRebekah Cummings
 
Recommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and DatoRecommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and DatoAshok Venkatesan
 
Small Science: First Impressions of Curation Needs. Presentation at Digital L...
Small Science: First Impressions of Curation Needs. Presentation at Digital L...Small Science: First Impressions of Curation Needs. Presentation at Digital L...
Small Science: First Impressions of Curation Needs. Presentation at Digital L...Sarah Shreeves
 

Mais procurados (17)

Rdm slides march 2014
Rdm slides march 2014Rdm slides march 2014
Rdm slides march 2014
 
Llauferseiler "OU Libraries: Opportunities Supporting Research and Education"
Llauferseiler "OU Libraries: Opportunities Supporting Research and Education"Llauferseiler "OU Libraries: Opportunities Supporting Research and Education"
Llauferseiler "OU Libraries: Opportunities Supporting Research and Education"
 
How Portable Are the Metadata Standards for Scientific Data?
How Portable Are the Metadata Standards for Scientific Data?How Portable Are the Metadata Standards for Scientific Data?
How Portable Are the Metadata Standards for Scientific Data?
 
A Guide for Reproducible Research
A Guide for Reproducible ResearchA Guide for Reproducible Research
A Guide for Reproducible Research
 
Open Science and Open Data for Librarians
Open Science and Open Data for LibrariansOpen Science and Open Data for Librarians
Open Science and Open Data for Librarians
 
Data management basics, for UC Davis EDU 292
Data management basics, for UC Davis EDU 292Data management basics, for UC Davis EDU 292
Data management basics, for UC Davis EDU 292
 
Ownership, intellectual property, and governance considerations for academic ...
Ownership, intellectual property, and governance considerations for academic ...Ownership, intellectual property, and governance considerations for academic ...
Ownership, intellectual property, and governance considerations for academic ...
 
Role of libraries in research and scholarly communication
Role of libraries in research and scholarly communicationRole of libraries in research and scholarly communication
Role of libraries in research and scholarly communication
 
Curating Humanities Data: Law, technology and reality
Curating Humanities Data: Law, technology and realityCurating Humanities Data: Law, technology and reality
Curating Humanities Data: Law, technology and reality
 
Data Management 101
Data Management 101Data Management 101
Data Management 101
 
Bushra bioinformatic Presentation
Bushra bioinformatic PresentationBushra bioinformatic Presentation
Bushra bioinformatic Presentation
 
Data Management for Undergraduate Researchers (updated - 02/2016)
Data Management for Undergraduate Researchers (updated - 02/2016)Data Management for Undergraduate Researchers (updated - 02/2016)
Data Management for Undergraduate Researchers (updated - 02/2016)
 
Realizing Semantic Web - Light Weight semantics and beyond
Realizing Semantic Web - Light Weight semantics and beyondRealizing Semantic Web - Light Weight semantics and beyond
Realizing Semantic Web - Light Weight semantics and beyond
 
Research Data Management and Sharing for the Social Sciences and Humanities
Research Data Management and Sharing for the Social Sciences and HumanitiesResearch Data Management and Sharing for the Social Sciences and Humanities
Research Data Management and Sharing for the Social Sciences and Humanities
 
Recommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and DatoRecommending Semantic Nearest Neighbors Using Storm and Dato
Recommending Semantic Nearest Neighbors Using Storm and Dato
 
Small Science: First Impressions of Curation Needs. Presentation at Digital L...
Small Science: First Impressions of Curation Needs. Presentation at Digital L...Small Science: First Impressions of Curation Needs. Presentation at Digital L...
Small Science: First Impressions of Curation Needs. Presentation at Digital L...
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 

Destaque

Using Lucene/Solr to Build CiteSeerX and Friends
Using Lucene/Solr to Build CiteSeerX and FriendsUsing Lucene/Solr to Build CiteSeerX and Friends
Using Lucene/Solr to Build CiteSeerX and Friendslucenerevolution
 
Building big social network search system using lucene
Building big social network search system using luceneBuilding big social network search system using lucene
Building big social network search system using lucenelucenerevolution
 
Using Lucene/Solr to Surface the Big Data of Social Media
Using Lucene/Solr to Surface the Big Data of Social MediaUsing Lucene/Solr to Surface the Big Data of Social Media
Using Lucene/Solr to Surface the Big Data of Social Medialucenerevolution
 
Clustering Technique for Collaborative Filtering Recommendation and Applicat...
Clustering Technique for Collaborative  Filtering Recommendation and Applicat...Clustering Technique for Collaborative  Filtering Recommendation and Applicat...
Clustering Technique for Collaborative Filtering Recommendation and Applicat...Pham Cuong
 
[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systems[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systemsFalitokiniaina Rabearison
 

Destaque (6)

Using Lucene/Solr to Build CiteSeerX and Friends
Using Lucene/Solr to Build CiteSeerX and FriendsUsing Lucene/Solr to Build CiteSeerX and Friends
Using Lucene/Solr to Build CiteSeerX and Friends
 
Building big social network search system using lucene
Building big social network search system using luceneBuilding big social network search system using lucene
Building big social network search system using lucene
 
Using Lucene/Solr to Surface the Big Data of Social Media
Using Lucene/Solr to Surface the Big Data of Social MediaUsing Lucene/Solr to Surface the Big Data of Social Media
Using Lucene/Solr to Surface the Big Data of Social Media
 
Clustering Technique for Collaborative Filtering Recommendation and Applicat...
Clustering Technique for Collaborative  Filtering Recommendation and Applicat...Clustering Technique for Collaborative  Filtering Recommendation and Applicat...
Clustering Technique for Collaborative Filtering Recommendation and Applicat...
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systems[Final]collaborative filtering and recommender systems
[Final]collaborative filtering and recommender systems
 

Semelhante a Using Lucene/Solr to Build CiteSeerX and Friends

Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Using Lucene/Solr to Build CiteSeerX and Friends

  • 1. Using Lucene/Solr to Build CiteSeerX and Friends
      Dr. C. Lee Giles
      Information Sciences and Technology
      Computer Science and Engineering
      The Pennsylvania State University
      University Park, PA, USA
      giles@ist.psu.edu
      http://clgiles.ist.psu.edu
  • 2. http://clgiles.ist.psu.edu Prof. C. Lee Giles
      • Intelligent and specialty search engines; cyberinfrastructure for science, academia and government
          – Modular, scalable, robust, automatic cyberinfrastructure and search engine creation and maintenance
          – Large heterogeneous data and information systems
          – Specialty search engines and portals for knowledge integration
              • CiteSeerX (computer and information science)
              • ChemXSeer (e-chemistry portal)
              • GrantSeer (grant search)
              • RefSeer (recommendation of paper references)
      • Scalable intelligent tools/agents/methods/algorithms
          – Information, knowledge and data integration
          – Information and metadata extraction; entity disambiguation
          – Unique search, knowledge discovery, information integration, data mining algorithms
          – Web 2.0 methods
              • Automated tagging for search and information retrieval
              • Social network analysis
  • 3. SeerSuite Contributors/Collaborators: recent past and present (incomplete list)
      Projects: CiteSeer, CiteSeerX, ChemXSeer, ArchSeer, CollabSeer, GrantSeer, SeerSeer, RefSeer, AlgoSeer, AckSeer, BotSeer, YouSeer, …
      • P. Mitra, V. Bhatnagar, L. Bolelli, J. Carroll, I. Councill, F. Fonseca, J. Jansen, D. Lee, W-C. Lee, H. Li, J. Li, E. Manavoglu, A. Sivasubramaniam, P. Teregowda, H. Zha, S. Zheng, D. Zhou, Z. Zhuang, J. Stribling, D. Karger, S. Lawrence, J. Gray, G. Flake, S. Debnath, H. Han, D. Pavlov, E. Fox, M. Gori, E. Blanzieri, M. Marchese, N. Shadbolt, I. Cox, S. Gauch, A. Bernstein, L. Cassel, M-Y. Kan, X. Lu, Y. Liu, A. Jaiswal, K. Bai, B. Sun, Y. Sung, J. Z. Wang, K. Mueller, J. Kubicki, B. Garrison, J. Bandstra, Q. Tan, J. Fernandez, P. Treeratpituk, W. Brouwer, U. Farooq, J. Huang, M. Khabsa, M. Halm, B. Urgaonkar, Q. He, D. Kifer, J. Pei, S. Das, S. Kataria, D. Yuan, T. Suppawong, others.
      • Current funding: NSF, Dow Chemical
  • 4. Outline
      • Motivation
          – Data science; Cyberinfrastructure
          – Vast growth in domain science data and documents
      • SeerSuite
          – Tool for creating Seers
          – Specialized data and document search and recommendations
              • Tables, formulae, figures, references …
          – Use of Solr/Lucene
      • Disciplinary sciences, indexes & information extraction (the Seers)
          – Computer science
          – Chemistry
          – Briefly other Seers
      • Opportunities for Research
      • Conclusions and Directions
  • 5. The Evolution of Science - the 4th Paradigm (Jim Gray's paradigm)
      • Observational Science
          – Scientist gathers data by direct observation
          – Scientist analyzes data
      • Analytical Science
          – Scientist builds analytical model
          – Makes predictions
      • Computational Science
          – Simulate analytical model
          – Validate model and make predictions
      • Data Driven Science
          – Data captured from the web, by instruments, or from documents
          – Data generated by simulation
          – Placed in data structures / files
          – Scientist(s) analyze(s) data
          – Access & search crucial
  • 6. Data Access Varies with Discipline or Small vs Big Science
      • Small vs Big science
          – "Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science."
              • 'Lost in a Sea of Science Data', S. Carlson, The Chronicle of Higher Education (23/06/2006)
          – Data is local
          – Data will not be shared
      • At some point the following will be needed
          – indices to control search
          – parallel data search and analysis
      • Cyberinfrastructure can help
          – If you can't move the data around, take the analysis to the data!
          – Bandwidth of a van loaded with disks
          – Do all data manipulations locally
              • Build custom procedures and functions locally
  • 7. SeerSuite
      • Open source search engine and digital library toolkit used to build search engines and digital libraries
          – CiteSeerX, ChemXSeer, RefSeer, YouSeer, CollabSeer, etc.
      • Supports research in
          – Indexing and search
          – Digital libraries
          – Data mining & structures
          – Information and knowledge extraction
          – Social networks
          – Scientometrics/infometrics
          – Systems engineering, User design
          – Software engineering and management
          – Web crawling
      • Trains students in search and software systems
          – Educational tool for search engine creation
          – Students highly sought in industry and government
  • 8. SeerSuite - properties
      • Modular, scalable, extensible, robust design
          – Extensible to many problems and disciplines
      • Integrated features
          – Focused crawler - Heritrix
          – Indexer - Solr/Lucene
          – Metadata extraction - modular
          – Ranked results
      • Builds on experience with other domain engines and OS tools
          – Lucene and Solr
          – The MySQL Database and InnoDB Storage Engine
          – Apache Tomcat
          – Spring Framework
          – Acegi Security
          – ActiveMQ
          – ActiveBPEL Open Source Engine
          – Apache Commons Libraries
          – SVMlight support vector machine package
          – CRF++ conditional random field package
      • Hardware independent; Linux
      • Reuse not reinvent
  • 9. Data Mining & Information Extraction in Seers
      • Data acquisition
      • SeerSuite systems often crawl the public web for new data
      • Many data types available
      • Richness of data offers unique data mining features
      • CiteSeerX as testbed/sandbox
      • Large scale data resources
      • Millions of documents, authors, etc.
      • Some common features/metadata
      • Commercial grade indexer (Solr/Lucene)
      • Scalable to G's of documents and M's of users
      • "Watson"
      • Modular design
      • Cloudable
      • State of the art algorithms (machine learning) for large scale unique metadata (information) extraction & mining
      • Unique parsers and indexing
      • Quality of extraction
      • Precision/recall
      • Ranking
      • Architecture/integration
  • 10. Seer Friends
      • In various stages of the system lifecycle with various data resources and indexes:
          – Mature and developing, code released
              • CiteSeer, now CiteSeerX
              • ChemXSeer
              • TableSeer
              • YouSeer
          – New, future TBD, not all aspects public
              • ArchSeer
              • AlgoSeer
              • CollabSeer
              • RefSeer
              • SeerSeer
              • GrantSeer
          – Dead or limping by (could be revived)
              • AckSeer (acknowledgement indexing) (revived!)
              • BizSeer
              • BotSeer
          – Proposed, but do not exist
              • BrainSeer
              • CensorSeer
              • ArXivSeer
  • 11. Why Solr/Lucene?
      • Only open source considered – cost
      • Competitors:
          – Indri
          – Wumpus
          – Terrier
          – Others?
      • Must scale for both number of documents and users
      • Easily integrable and customizable
          – Other indexes, crawlers, ingestion, metadata extractors
      • Well used (Watson)
      • Active community of support
          – Enterprise platform a plus
      • Easy to transition to government/industry/academia
          – Apache license
  • 12. Next Generation CiteSeer, CiteSeerX
      • 2 M documents
      • 40 M citations
      • 2 to 5 M authors
      • 2 to 4 M hits per day
      • 800K individual users
      • entire data shared
      • Index - 50 G
      http://citeseerx.ist.psu.edu
  • 13. History: CiteSeer (aka ResearchIndex)
      • Project at NEC Research Institute, Princeton
      • 1st academic document search engine
      • Very popular with computer science
      • Hosted at NEC from 1997 – 2004.
      • Moved to Penn State as collaborators left.
      • Provided a broad range of unique services including
          – Automatic citation indexing, reference linking, full text indexing, similar documents listing, automated metadata extraction and several other pioneering features.
      • Refactored and redesigned as CiteSeerX
          – Released 2008
          – Lucene based indexing
      • CiteSeer continuously running for 15 years!
      (Photos: C. Lee Giles, Kurt Bollacker, Steve Lawrence)
  • 14. SeerSuite/CiteSeerX Architecture •  Web Application •  Focused Crawler •  Document Conversion and Extraction •  Document Ingestion •  Data Storage •  Maintenance Services •  Federated Services Teregowda, USENIX ‘10
  • 15. 4 systems: •  Production •  Crawling •  Staging •  Research All or some can be cloudized Teregowda, USENIX 2010
  • 16. CiteSeerX Services • CiteSeerX is a highly automated system: • Full OAI metadata if available • Full text indexing (many different indexes) - Documents - Citations - Tables - More forthcoming (algorithms, figures, acknowledgements) • Citation graph • Ranking based on citations • Linking documents - Co-citations - Citing documents • Author disambiguation - Distinguish between authors with similar names - Profiles and publication information for each author • Automatic crawling from list and submissions • Personalization - Login-based access to features on CiteSeerX - Corrections to metadata - Storage of queries - Collection of papers - Follows document metadata changes
  • 17. Focused Crawling • Maintain a list of parent URLs where documents were previously found – Parent URLs are usually academic homepages. • 300,000 unique parent URLs, as of summer 2011 – Parent URLs are stored in a database table with two additional fields for scheduling: • Time of the last observed change, i.e. when new documents were last obtained from the page. • Estimated change rate according to previous crawls of this page. • The crawling process starts with the scheduler selecting the 1000 parent URLs with the highest probability of having new documents available. – Assume a Poisson process for the change behavior of a parent page. • Suppose a parent page P’s last observed change occurred at time $t_1$ and its estimated change rate is $R$; then at time $t_2 = t_1 + \Delta$, the probability that it has changed again since $t_1$ is $1 - e^{-R\Delta}$. • Larger $R$ or larger $\Delta$ gives a larger probability. • After each crawl, the change rate of the scheduled parent URL should be recalculated. • Crawling runs incrementally each day (invoked by a Linux cron job at 12 am) – Most discovered documents have been crawled before. • Use hash table comparison to detect new documents • Normally retrieves a few thousand NEW documents per day, sometimes fewer than 1k. • Moved to whitelist vs blacklist Zheng, CIKM’09
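The scheduling rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the CiteSeerX crawler code: the `ParentUrl` record, its field names, and the use of days as the time unit are assumptions made for the example; only the formula $1 - e^{-R\Delta}$ and the batch size of 1000 come from the slide.

```python
import math
from dataclasses import dataclass


@dataclass
class ParentUrl:
    """Hypothetical row of the parent-URL table (field names are assumed)."""
    url: str
    last_change: float   # t1: time of the last observed change, in days
    change_rate: float   # R: estimated changes per day


def change_probability(parent: ParentUrl, now: float) -> float:
    """Poisson model: P(changed since t1) = 1 - exp(-R * delta)."""
    delta = max(0.0, now - parent.last_change)
    return 1.0 - math.exp(-parent.change_rate * delta)


def schedule(parents, now, batch_size=1000):
    """Pick the batch_size parent URLs most likely to hold new documents."""
    ranked = sorted(parents, key=lambda p: change_probability(p, now), reverse=True)
    return ranked[:batch_size]
```

A page with a higher estimated rate, or one that has not changed for longer, moves toward the front of the batch, which matches the "larger R or larger Δ" remark on the slide.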
  • 18. Documents from crawled URLs • 90% of all citations come from the first 550 sites • 90% of all documents come from the first 1250 sites
  • 19. How  will  we  get  metadata  for  fields?   Now... that should clear up a few things around here
  • 20. Metadata Extraction • Documents are converted from PDF/PS to text using converters. – Converters include TET, PDFBox, pdftotext, gs. • Documents are filtered by checking for the existence of references and for duplication (checksum). • Use tools or build your own – The metadata extraction system uses machine learning methods such as SVM (Header Parser) and CRF (ParsCit) to extract various entities from the document. • Rule based templates are applied before extraction.
  • 21. Automatically Created DB of a Paper in CSX • Example record as displayed: “Tensor Decompositions and Applications”, SIAM REVIEW, 2009, pp 455-500. Abstract: This …. Cited 34 times, 6 times by author • Fields (values): id (10.1.1.130.782), title (Tensor Decompositions and Applications), abstract (This ..), year (2009), pages (455-500), publisher (SIAM), venue (SIAM REVIEW), venueType (JOURNAL), version (2), cluster (9248987), repositoryID (10), crawldate (12/30/2008), public (True), n-cites (34), selfCites (6) • Values are assigned by: System, Extractor/User, Inference, Inference/User, User
  • 22. 3 Tier Architecture [Diagram labels: User Request, Load Balancer, Web 1, Web 2, Web Application, Load Balancer, Index, Index - Tables, Repository, Database, Storage, Crawler, Ingestion, Extraction; flows labelled Queries and Requests]
  • 23. CiteSeerX Software Overview • Ingestion process: responsible for obtaining and preparing a document and the related metadata. – Process the document • Submitted by the user or crawler – Extract metadata • Header • Citations • Acknowledgements – Store the metadata and documents. • Citation matching – Identifying the underlying graph structure: documents citing this document and the relationship between documents and citations • Inference matching and graph generation – User corrections (version maintenance) – Determine and accept valid user corrections – Regular notification mechanisms – Ensure that the user is notified when new documents are added to the collection • Linked to MyCiteSeer. • Update and maintenance – Update and make valid the full text index and various statistics. – Statistics – Index updates
  • 24. CiteSeerX Search • Enabling search • Full text • Fields created - Title - Authors - Citations - Venue - Keywords - Abstract - Range (publication) - Citations
  • 25. Field Schema • Columns: Field, Type, Indexed/Stored • DOI: String, Y/Y (unique) • Citation/Document: String, Y/Y • Title: Text, Y/Y • Author: A Text, Y/Y • Authors Normalized: A Text, Y/N • ncites (# cited by): Integer, Y/Y • URL: String, Y/Y • cites: Tokens, Y/N • citedby: Tokens, Y/N • Timestamp: Date, Y/Y • Notes: A Text is a Text field which does not have a stopword filter or stemming; Tokens is a Text field with only duplicate removal and a whitespace tokenizer
  • 26. CiteSeerX Search Results • Results sorting • Relevance (default) - Based on dismax query handling with boosting • Citations - Citations received by the document in the collection, plus default • Year - Publication date • Recency - Date of acquisition
  • 27. CiteSeerX Citation Graph • Relationships: Cited by, Cites • Citation graph - Store Cited by and Cites in the index • Build - Build the document graph by querying the index for relationships [Diagram: example citation graph with nodes A, B, C, D, E]
  • 28. Adding Documents • Ingest documents for new crawls - Add metadata to collection - Add full text to system - Link metadata in collection • Run maintenance scripts - Poll updates and post to Solr • Full text • Metadata • Relationships • Challenge: maintain data freshness.
  • 29. Query Response • Query forwarded to Solr from the presentation layer (JSP) • Solr generates a ranked response in JSON • Each record is built in XML with the database (added fields: abstract) • Presentation layer (JSP) formats records based on ranking. [Diagram labels: Web, Web Interface, Database, Index]
  • 30. Ranking with Boosting (Relevance) • Use of boost function, minimum match, query fields • Boost function: the effect of citations - Map number of citations > 1 to 500 • Minimum match: 2 • Query fields (boost) - Text (1) - Title (4) - Abstract (2)
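As a rough illustration of how these settings could be passed to Solr, the sketch below builds a DisMax request using the standard `defType`, `qf`, `mm` and `bf` parameters. The lowercase field names (`text`, `title`, `abstract`, `ncites`) are assumptions based on the field list in this deck, and the slide does not give the exact citation boost function, so it is left as a caller-supplied argument rather than guessed.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query: str, boost_function: str) -> dict:
    """Assemble DisMax parameters mirroring the weights on the slide."""
    return {
        "q": user_query,
        "defType": "dismax",
        # Query fields with boosts: Text (1), Title (4), Abstract (2).
        "qf": "text^1 title^4 abstract^2",
        # Minimum match of 2, as stated on the slide.
        "mm": "2",
        # Citation-based boost; the actual function is not given in the deck.
        "bf": boost_function,
        "wt": "json",
    }


# "ncites" alone is only a placeholder boost (raw citation count).
params = build_dismax_params("tensor decomposition", "ncites")
query_string = urlencode(params)
```

The resulting `query_string` would be appended to a select handler URL; the host and core name depend on the deployment and are not shown here.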
  • 31. Query Response • Query at interface (JSP) • Hand over to web application (Java/Spring) • Hand over to Solr • Ranked response from Solr (JSON) • Response unwrapped and more details included with information from DB • Present response at interface (JSP) [Diagram labels: Web Interface, Web Application, Index, DB; messages: Q (Text), R (HashMap), F (Text), JSON]
  • 32. Name Disambiguation • Name disambiguation (NER) – A person can be referred to in different ways with different attributes in multiple records; the goal of name disambiguation is to resolve such ambiguities, linking and merging all the records of the same entity together • Three types of name ambiguities: – Aliases: one person with multiple aliases, name variations, or a changed name, e.g. CL Giles & Lee Giles, Superman & Clark Kent – Common names: more than one person shares a common name, e.g. Jian Huang, 103 papers in DBLP – Typographic errors: resulting from human input or automatic extraction • Goal: disambiguate, cluster and link names in a large digital library or bibliographic resource such as Medline, CiteSeerX, etc.
  • 33. Efficient Large Scale Entity Disambiguation • Testbed: CiteSeerX and PubMedSeer (Huang et al., PKDD 2006; Treeratpituk et al., JCDL 2009) • Entity disambiguation problem – Determine the real identity of the authors using metadata of the research papers, including co-authors, affiliation, physical address, email address, information from crawling such as host server, etc. – Entity normalization • Motivation – Enhance search functionalities for digital repositories • Fielded search by author name – Improve metadata quality – Improved social network analysis – Government and business intelligence • E.g. census data and credit records • Challenges – Accuracy – Scalability – Expandability • Key features – LASVM distance function • Active learning – Simpler and more accurate model – Better generalization power • Online learning – Expandable to new training data – DBSCAN clustering • Ameliorate labeling inconsistency (transitivity problem) • Efficient solution to find name clusters • N log N scaling [Diagram labels: Actors, entities, documents; Metadata Extraction Module; Blocking Module; Candidate Class (Author 1 Paper 3, Author 2 Paper 4); Similarity Module with Soft-TFIDF Similarity and Jaccard Similarity functions; Annotator; Distance Learner (Online SVM with Active Learning); SVM Distance Function; DBSCAN Clustering]
  • 34. Author Disambiguation Field • Currently uses author fields – For author search (both for author mentions and for disambiguated authors) • Future direction – Use the Lucene index for blocking in author disambiguation: creating a candidate set of author mentions that could belong to the same cluster
  • 35. Author Disambiguation • Random Forest (RF) – Uses random feature selection plus bootstrap sampling to construct multiple decision trees from one training data set – Aggregates the votes of a collection of decision trees as the final decision – The more independent each tree is, the better the improvement over a single decision tree • Author disambiguation with Random Forest – Various metadata is used as features in the Random Forest to determine whether two author names from two papers refer to the same person • E.g. author names, affiliation, coauthors, keywords, journal information, year of publication, etc. – Multiple distance functions are used for each type of metadata • E.g. TFIDF, Jaccard distance, for comparing affiliations • Compared with the previous SVM-based approach – Shown to provide higher accuracy than SVM in the pair-wise author disambiguation task – Easy parameterization in the training phase (only the number of trees and the randomness at each node; no decision on a kernel function is needed), and performance is not sensitive to the parameters chosen – Provides a measurement of the importance of each individual feature (how informative each feature is, and how sensitive the decision is to noise in a particular feature), which is not trivial for SVM with a non-linear kernel – Training time and classification time are linear in the number of trees and the data size • Also provides higher disambiguation accuracy when compared with other traditional methods (Logistic Regression, Naïve Bayes, Decision Tree) Treeratpituk, Giles, JCDL09
  • 36. Data and Publications in the Field of Chemistry • Chemistry is not physics (no arXiv) or computer science (no CiteSeer) • Legacy of early information access - Chem Abstracts • Cheminformatics is not bioinformatics • Chemistry has been up to recently a data poor field • Data sharing tradition just being established • Data creation is exploding - local (small science) • Journals and societies sensitive to their IP issues dominate the field • Unsubstantiated IP claims, such as that data in the paper belongs to the publisher • Discourage online versions of publications - ACS • Large powerful international companies have a vested interest in research • Chemical information extraction tools are easily monetized • Standards exist - CML, InChI • “Fixing the past so we can fix the future.” Jeremy Frey • Chemistry is an old discipline with publications going back 100 years • Chemistry is compound centric, not algorithmic centric • Search is about the compound! • Compounds have a rich data environment: 3D graph structure, energies, etc.
  • 37. ChemXSeer Architecture • Integrate and implement well-used open source tools • Use CiteSeerX tools when possible • Integrate into SeerSuite • Search: chemical formulae unique search, table search, figure search • More data (grey literature) than documents • Automated information extraction modules based on machine learning methods • Lucene/Solr indices for extracted fields • Relational databases for datasets • Work closely with chemists to understand their needs • Tools for data conversion • Provide a public portal and repository for easy use • User access controls • Integrated visualization tools like JMOL for Gaussian data residing in our repository • APIs for users for extracted data • Data and document standards are de facto: xml, pdf, etc.
  • 39. ChemXSeer Formula Search • Extraction and search of chemical formulae in scientific documents has been shown to be very useful. • Intersection of two research areas: • Information retrieval • Chemoinformatics •  Formulae cannot be treated as text. • Domain knowledge (formula identification) • Structural knowledge (substructure finding and search) B. Sun, WWW’07, WWW’08, TOIS’11 D. Yuan, ICDE’12
  • 40. Challenges in Formula Search How to identify a formula in scientific documents? Non-Formula “… This work was funded under NIH grants …” “ … YSI 5301, Yellow Springs, OH, USA …” “… action and disease. He has published over …” Formula “… such as hydroxyl radical OH, superoxide O2- …” “ and the other He emissions scarcely changed …” Machine learning algorithms (SVM + CRF) yield high accuracies for correct formula identification.
  • 41. Segmenting Chemical Names • Goal: to discover semantically meaningful sub-terms in chemical names – Methylethyl alcohol – methionylglutaminylarginyltyrosylglutamylserylleucyl phenylalanylalanylglutaminylleucyllysylglutamylarginyl lysylglutamylglycylalanylphenylalanylvalylprolylphenyl alanylvalylthreonylleucylglycylaspartylprolylglycylisol eucylglutamylglutaminylserylleucyllysylisoleucylaspartyl threonylleucylisoleucylglutamylalanylglycylalanylaspartyl alanylleucylglutamylleucylglycylisoleucylprolylphenyl alanylserylaspartylprolylleucylalanylaspartylglycylprolyl threonylisoleucylglutaminylasparaginylalanylthreonylleucyl arginylalanylphenylalanylalanylalanylglycylvalylthreonyl prolylalanylglutaminylcysteinylphenylalanylglutamyl methionylleucylalanylleucylisoleucylarginylglutaminyllysyl histidylprolylthreonylisoleucylprolylisoleucylglycylleucyl leucylmethionyltyrosylalanylasparaginylleucylvalylphenyl alanylasparaginyllysylglycylisoleucylaspartylglutamylphenyl alanyltyrosylalanylglutaminylcysteinylglutamyllysylvalyl glycylvalylaspartylserylvalylleucylvalylalanylaspartylvalyl prolylvalylglutaminylglutamylserylalanylprolylphenylalanyl arginylglutaminylalanylalanylleucylarginylhistidylasparaginyl valylalanylprolylisoleucylphenylalanylisoleucylcysteinyl prolylprolylaspartylalanylaspartylaspartylaspartylleucyl leucylarginylglutaminylisoleucylalanylseryltyrosylglycyl arginylglycyltyrosylthreonyltyrosylleucylleucylserylarginyl
  • 42. Chemical Search Aspects • Parsing • Extraction and tagging • Indexing • Ranking
  • 43. Chemical Entity Extraction and Tagging • Name tagging – Each chemical name can be a phrase – Example • "... Determination of lactic acid and ..." • "... insecticide promecarb (3-isopropyl-5-methylphenyl methylcarbamate) acts against ..." • Formula tagging – Each formula is a single term – Example • "... such as hydroxyl radical OH, superoxide ..." – Non-formula example • "... YSI 5301, Yellow Springs, OH, USA ..." • Tagging examples – Name tagging: "... of <name-type>lactic acid</name-type> and ..." – Formula tagging: "... radical <formula-type>OH</formula-type> , superoxide ..."
  • 44. Textual Chemical Molecule Information Indexing and Search • Index schemes: – Which tokens to index? – Indexing all subsequences generates a large index • Segmentation-based index scheme – Used for indexing chemical names – First segment a chemical name hierarchically and then index the substrings at each node – [Example segmentation tree, level by level: methylethyl; methyl, ethyl; meth, yl, eth, yl; me, th] • Frequency-and-discrimination-based index scheme – Used for indexing chemical formulas – Sequentially select frequent and discriminative subsequences of a formula, from the shortest to the longest
  • 45. Features for Formula Indexing • Formula – A sequence of chemical elements or partial formulas with corresponding frequencies – E.g. CH3(CH2)2OH • Partial formula – A subsequence of a formula – E.g. C, H, O, CH3, CH2, OH, CH3(CH)2, H3(CH)2, CH3(CH)2O, etc. • Index construction – Partial formulas with frequencies: e.g. <C,3>, <H,6>, <CH2,2>, etc. – Too many partial formulas; feature selection is needed
  • 46. Criteria of Feature Selection • Criteria of feature selection – Frequent features ($Freq_s \ge Freq_{min}$) – Discriminative features ($\alpha_s \ge \alpha_{min}$) • If a sequence’s selected subsequences are enough to distinguish formulas containing them from other formulas, this sequence is redundant. • Discrimination score: $\alpha_s = \left|\bigcap_{s' \in F \wedge s' \preceq s} D_{s'}\right| / |D_s|$, where $F$ is the selected feature set and $D_s$ is the set of formulas containing $s$.
  • 47. An Example for Formula Indexing • Data set: – 1. CH3COOH, 2. CH3(CH2)2OH, 3. CH3(CH2)3COOH • Parameters: – $Freq_{min} = 2$, $\alpha_{min} = 1.1$ • Steps: – Length = 1: Candidates = {C, H, O}, F = {C, H, O} – Length = 2: Candidates = {CH3, H3C, CO, OO, OH, CH2}, Frequent Candidates = {CH3, CO, OO, OH, CH2}; $\alpha_{CH3} = |\{1,2,3\}_C \cap \{1,2,3\}_H| / |\{1,2,3\}_{CH3}| = 1$; $\alpha_{CO} = |\{1,2,3\}_C \cap \{1,2,3\}_O| / |\{1,3\}_{CO}| = 1.5$; Frequent & Discriminative Candidates = {CO, OO, CH2}; F = {C, H, O, CO, OO, CH2} – Length = 3, …
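The length-2 step of this example can be reproduced with a short Python sketch. It is only an illustration under simplifying assumptions: formulas are plain strings, "contains" is a raw substring test (which would confuse symbols such as C and Cl on real data), and the candidate lists are taken from the slide rather than generated. The function names are hypothetical.

```python
def containing(formulas, s):
    """D_s: 1-based ids of the formulas whose string contains s."""
    return {i for i, f in enumerate(formulas, start=1) if s in f}


def select_features(formulas, candidates, selected, freq_min, alpha_min):
    """Keep candidates that are frequent and discriminative w.r.t. `selected`."""
    chosen = []
    for s in candidates:
        d_s = containing(formulas, s)
        if len(d_s) < freq_min:          # not frequent enough
            continue
        # Intersect D_{s'} over already selected features s' contained in s.
        common = set(range(1, len(formulas) + 1))
        for sub in selected:
            if sub in s:
                common &= containing(formulas, sub)
        alpha = len(common) / len(d_s)   # discrimination score
        if alpha >= alpha_min:
            chosen.append(s)
    return chosen


formulas = ["CH3COOH", "CH3(CH2)2OH", "CH3(CH2)3COOH"]
length1 = ["C", "H", "O"]
length2 = ["CH3", "H3C", "CO", "OO", "OH", "CH2"]
picked = select_features(formulas, length2, length1, freq_min=2, alpha_min=1.1)
print(picked)  # ['CO', 'OO', 'CH2']
```

CH3 and OH are frequent but score exactly 1, so they add no discrimination beyond C, H and O; H3C occurs in only one formula and is dropped as infrequent.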
  • 48. Formula Search • SF.IEF: Subsequence Frequency & Inverse Entity Frequency: $SF(s,e) = Freq(s,e) / |e|$, $IEF(s) = \log\left(|C| / |\{e \mid s \preceq e\}|\right)$ • Exact formula search – Search for exact representations. E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2. • Frequency formula search – Full frequency search: search for formulas with the specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements. E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2. – Partial frequency search: similar, but allows unspecified elements. E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well. – Ranking function: $score(q,e) = \sum_{s \in q} SF(s,e)\,IEF(s)^2 / \left(|e| \times \sum_{s \in q} IEF(s)^2\right)$
  • 49. Formula Search: Substructure • Substructure formula search – Search for formulas that may have a substructure. E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score). – Ranking function: $score(s,e) = W_{match(s,f)}\,SF(s,e)\,IFF(s) / |e|$, where $W_{match(s,f)}$ is the weight for exact match, reverse match, and parsed match • Similarity formula search – Search for formulas with a structure similar to the query formula. Feature-based approach using partial formula matching. E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc. – Ranking function: $score(q,e) = \sum_{s \preceq q} W_{match(q,e)}\,W(s)\,SF(s,q)\,SF(s,e)\,IFF(s) / |e|$ • Conjunctive search of the basic types of formula searches – E.g. [*C2H4-6 -COOH] matches CH3COOH, not C2H4O or CH3CH2COOH. • Document query rewriting – E.g. the document query atom formula:=CH4 is rewritten to atom (CH4 OR CD4), if the formula search of =CH4 matches CH4 and CD4.
  • 50. Formula Search: Query Models • Many models are possible, from exact to semantic • Models discriminated by matching algorithms • Exact search – Search for exact representations – E.g. =C1-2H4-6 matches CH4 and C2H6, not H4C or H6C2 • Frequency searches – Full frequency search: search for formulae with the specified chemical elements and frequency ranges, ignoring the order, with no unspecified elements – E.g. C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, not CH4O, C2H6O2 – Partial frequency search: similar, but allows unspecified elements – E.g. *C1-2H4-6 matches CH4, C2H6, H6C2, CH3CH3, and CH4O and C2H6O2 as well • Substructure search – Search for formulae that may have a substructure – E.g. -COOH matches CH3COOH (exact match: high score), HOOCCH3 (reverse match: medium score), and CH3CHO2 (parsed match: low score). • Similarity search – Search for formulae with a structure similar to the query formula. Feature-based approach using partial formulae matching. – E.g. ~CH3COOH matches CH3COOH, (CH3COO)2Co, CH3COO-, etc.
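The full and partial frequency searches can be illustrated with a small matcher. This is a sketch under stated assumptions, not the ChemXSeer implementation: formulas are assumed to contain no parentheses or charges, every element in the query carries an explicit count or range, and the function names are made up for the example. Only the matching decision is shown; ranking is not.

```python
import re
from collections import Counter

ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")            # e.g. "C", "H4", "Cl2"
RANGE = re.compile(r"([A-Z][a-z]?)(\d+)(?:-(\d+))?")   # e.g. "C1-2", "H4"


def element_counts(formula):
    """Total count of each element, ignoring order (CH3CH3 -> C2 H6)."""
    counts = Counter()
    for symbol, number in ELEMENT.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def parse_query(query):
    """Return (is_partial, {element: (low, high)}); a leading '*' means partial."""
    partial = query.startswith("*")
    ranges = {}
    for symbol, low, high in RANGE.findall(query.lstrip("*")):
        ranges[symbol] = (int(low), int(high) if high else int(low))
    return partial, ranges


def frequency_match(query, formula):
    partial, ranges = parse_query(query)
    counts = element_counts(formula)
    for symbol, (low, high) in ranges.items():
        if not low <= counts.get(symbol, 0) <= high:
            return False
    # A full frequency search rejects formulas with unspecified elements.
    return partial or set(counts) <= set(ranges)
```

With the slide's query, `frequency_match("C1-2H4-6", "CH3CH3")` is true and `frequency_match("C1-2H4-6", "CH4O")` is false, while the partial query `"*C1-2H4-6"` accepts both.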
  • 51. Ranking Formulae • Ranking formulae has to depend on need and importance • Focus on structural methods and frequency • Importance can be introduced by citation rank, PageRank or others • SF.IFF – Substructure frequency and inverse formula frequency • Frequency searches – $score(q,f) = \sum_{e \in q} SF(e,f)\,IFF(e)^2 / \left(|f| \times \sum_{e \in q} IFF(e)^2\right)$ – where $|f|$ is the total frequency of elements • Substructure search – $score(q,f) = W_{match(q,f)}\,SF(q,f)\,IFF(q) / |f|$ – where $W_{match(q,f)}$ is the weight for exact match, reverse match, and parsed match • Similarity search – $score(q,f) = \sum_{s \preceq q} W_{match(q,f)}\,W(s)\,SF(s,q)\,SF(s,f)\,IFF(s) / |f|$
  • 52. Chemical Compounds as Graphs • Chemical compound modeled as a semantic graph with properties • Atom: vertex/node in the graph • Bond: edge in the graph • Dimensions: 3 or 4 • Figures copied from eMolecules.com
  • 53. What Is Chemical Structure Search? • Substructure search – Given an input chemical structure sketch, find all the chemical compounds containing the input as a substructure. • Superstructure search – Given an input chemical structure sketch, find all the important descriptors (substructures/functional groups) contained in the input. • Similarity search – Given an input chemical structure sketch, find all the chemical compounds “similar” to the input.
  • 54. Table Search Tables are widely used to present experimental results or statistical data in scientific documents; some data only exists in these tables. Current search engines treat tabular data as regular text •  Structural information and semantics not preserved. Goal: automatically identify tables, extract table metadata from pdf documents into xml and rank data Table Metadata Representation: •  Environment metadata: (document specifics: type, title,…) •  Frame metadata: (border left, right, top, bottom, …) •  Affiliated metadata: (Caption, footnote, …) •  Layout metadata: (number of rows, columns, headers,…) •  Cell content metadata: (values in cells) •  Type metadata: (numeric, symbolic, hybrid, …) Y. Liu AAA’07, JCDL’07.
  • 55. Tables   •  A history that pre-dates that of sentential text –  Cuneiform clay tablets •  Not received the same level of formal characterization enjoyed by sentential text •  Varying and irregular formats •  Different intuitive understanding of what a “table” is. –  Is the Periodic Table of the Elements a table? –  Tables vs. Lists? –  Tables vs. Forms? –  Tables vs. Figures? –  Genuine table vs. non-genuine table? [12] •  Our definition: scientific genuine table –  Caption + tabular structure –  Ruling lines are not required
  • 56. TableSeer   Beta design of a table search engine
  • 57. TableSeer   System     Architecture  
  • 58. Page Box-Cutting Algorithm • Improves table detection performance by excluding more than 93.6% of the document content at the outset
  • 59. Sample Table Metadata Extracted File • <Table> • <DocumentOrigin>Analyst</DocumentOrigin> • <DocumentName>b006011i.pdf</DocumentName> • <Year>2001</Year> • <DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors </DocumentTitle> • <Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R . Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, C T 06269-3060</Author> • <TheNumOfCiters></TheNumOfCiters> • <Citers></Citers> • <TableCaption>Table 1 Temperature effect o n r esistance change ( D R ) and response timeof tin oxide thin film with 1 % C Cl 4</TableCaption> • <TableColumnHeading>D R Temperature/ ° C D R a / W ( R ,O 2 ) (%) R esponse time Reproducibiliy </TableColumnHeading> • <TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes 300 1027 21 < 2 0 s Yes 400 993 31 ~ 1 0 s No </TableContent> • <TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote> • <ColumnNum>5</ColumnNum> • <TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText> • <PageNumOfTable>3</PageNumOfTable> • <Snapshot>b006011i/b006011i_t1.jpg</Snapshot> • </Table>
  • 60. TableRank   • Rank tables by rating the <query, table> pairs, instead of the <query, document> pairs: preventing a lot of false positive hits for table search, which frequently occur in current web search engines • The similarity between a <table, query> pair: the cosine of the angle between vectors • Tailored term vector space => table vectors: • Query vectors and table vectors, instead of document vectors
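The cosine comparison between a query vector and table vectors can be sketched as follows. The term weights here are arbitrary placeholders; in TableRank they would come from the table term weighting scheme, which this example does not implement, and the function names are hypothetical.

```python
import math


def cosine(query_vec, table_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * table_vec.get(term, 0.0) for term, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_t = math.sqrt(sum(w * w for w in table_vec.values()))
    if norm_q == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_q * norm_t)


def rank_tables(query_vec, tables):
    """Order table ids by similarity to the query, best first."""
    return sorted(tables, key=lambda tid: cosine(query_vec, tables[tid]), reverse=True)
```

Because each table has its own vector, a document containing the query terms only in its body text does not surface unless one of its tables matches, which is the point of rating <query, table> pairs.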
  • 61. Table Index • Index • Captions • Footnotes • Reference text • Boosting • Captions (2) • Function: - Inversely (recip) proportional to #cites.
  • 62. Term Weighting for Tables – TTF-ITTF: Table Term Frequency-Inverse Table Term Frequency – TLB: Table Level Boost factors (e.g., table frequency) – DLB: Document Level Boost factors (e.g., journal/proceeding order, document citation)
  • 63. Table Term Ranking • A term occurring in a few tables is likely to be a better discriminator than a term appearing in most or all tables • Similar to a document abstract, table metadata and table queries should be treated as semi-structured text • They are not complete sentences and express a summary • P = 0.5 (G. Salton 1988) • b is the total number of tables • IDF(ijk): the number of tables in which term t(i) occurs in the metadata m(k)
  • 64. Table Level Boost and Document Level Boost • $B_{tbf}$ is the boost value of the table frequency • $B_{trt}$ is the boost value of the table reference text (e.g., the normalized length) • $B_{tp}$ is the boost value of the table position • $r$ is a parameter, which is 1 if users specify the table position in the query; otherwise, $r = 0$ • $IV_j$: document Importance Value (IV). If a table comes from a document with a high IV, all the table terms of this document should get a high document level boost • $IC_j$: the inherited citation value • $DO_j$: source value (the rank of the journal/conference proceeding) • $DF_j$: document freshness
  • 65. Table Citation Network • Similar to the PageRank network – Documents construct a network from the citations – The “incoming links” are the documents that cite the document in which the table is located – Exponential decay used to deal with the impact of the propagated importance • Unlike the PageRank network – Directed acyclic graph – Importance Value (IV) of a document is not decreased as the number of citations increases – IV is not divided by the number of outbound links • A document may have multiple, one, or no tables • Each table consists of a set of metadata • The same keywords may appear in different metadata in different tables
  • 66. Table Search Summary • A novel, first table ranking algorithm: TableRank • A tailored table term vector space • A table term weighting scheme: TTF-ITTF – Aggregating impact factors from three levels: the term, the table, and the document • Index table reference texts, term locations, and document backgrounds • Designed and implemented the first table search engine, TableSeer, to evaluate TableRank and compare it with popular web search engines • Code released • Currently implemented in CiteSeerX: millions of tables • Improving extraction: Dow Chemical support
  • 67. Automated Figure Data Extraction and Search • A large amount of results in digital documents is recorded in figures, time series and experimental results (e.g., NMR spectra, income growth), and this is the only record of the data • Extraction for purposes of: – Further modeling using presented data – Indexing, metadata creation for storage & search on figures for data reuse • Current extraction done manually! [Diagram labels: Documents, Extracted Plot, Extracted Info., Document Index, Plot Index, Merged Index, Digital Library, User]
  • 68. Seer Figure/Plot Data Extraction and Search Numerical data in scientific publications are often found in figures. Tools that automate the data extraction from figures provide the following: •  Increases our understanding of key concepts of papers •  Provides data for automatic comparative analyses. •  Enables regeneration of figures in different contexts. •  Enables search for documents with figures containing specific experiment results. X. Lu JCDL’06 & IJDAR’09, Brouwer JCDL’08, Kataria AAAI’08
  • 69. Metadata & data to extract: 2 Dimensional Plot [Figure labels: Y-axis labels, legend, data points, ticks, axis units, X-axis label; panels: snapshot of a document, extracted 2D plot]
  • 70. Our  Approach  to  Plot  Data  ExtracEon   • Identify and extract figures from digital documents • Ascii and image extraction (xpdf) • OCR - bit map, raster pdfs • Identify figures as images of 2D plots using SVM (Only for Bit map images) • Hough transform • Wavelets coefficients of image • Surrounding text features • Binarization of the 2D plots identified for preprocessing (No need for Vectorized Images) • Adaptive Thresholding •  Image segmentation to identify regions • Profiling or Image Signature •  Text block detection • Nearest Neighbor •  Data point detection • K-means Filtering •  Data point disambiguation for overlapping points • Simulated Annealing
  • 71. Future Directions • System integration within ChemXSeer or CiteSeerX – XML data generation – Open source tool in Lucene/Solr • Extension to other figures (3D, …) [Figure: 3D plot]
  • 72. ChemXSeer Highlights • Portal for academic researchers in environmental chemistry that integrates the scientific literature with experimental, analytical, and simulation results and tools • Provides unique metadata extraction, indexing, and searching pertinent to the chemical literature by combining heuristics with machine learning – Chemical formulae and names – Tables – Figures – Publication functions as in CiteSeerX – Interoperability: ORE-Chem development – Novel ranking required – After extraction, data are stored as XML that users can access through an API • Hybrid repository (not fully open): serves as a federated information interoperational system – Scientific papers crawled and indexed from the web – User-submitted papers and datasets (e.g., Excel worksheets, Gaussian and CHARMM toolkit outputs) – Scientific documents and metadata from publishers (e.g., Royal Society of Chemistry) – Access control for publisher-provided content and user-submitted experiment data • Takes advantage of developments in other funded cyberinfrastructure and open source projects – CiteSeerX, PlanetLab, Lucene/Solr, ORE, others • Some released as open source
  • 73. Experimental Collaborator Recommendation System • CollabSeer currently supports 400k authors • http://collabseer.ist.psu.edu
  • 74. Collaboration recommendation • Metadata of authors and coauthors and topics of interest (similar to expert recommendation) • Use social network and topics to recommend collaborators of collaborators (FOF) • Devise SN index and ranking scheme • Explore models of vertex similarity • Built on SeerSuite • Other recommendations? – Experimental methods – Chemicals? • Gou JCDL’10, Gou MIR’10; Chen JCDL’11, SAC’12
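As a rough illustration of recommending collaborators of collaborators, the sketch below ranks an author's second-degree neighbors in a coauthor graph by Jaccard similarity of their neighbor sets. Jaccard is only one possible vertex similarity model and is an assumption here, as are the function name `recommend_collaborators`, the dictionary-of-sets graph representation, and the toy author names; the slides do not say which measures CollabSeer uses.

```python
def recommend_collaborators(coauthors, author, top_k=3):
    """Rank collaborators-of-collaborators by Jaccard vertex similarity.

    `coauthors` maps each author to the set of people they have published
    with. Candidates are second-degree neighbors who are not already direct
    collaborators. Hypothetical helper, for illustration only.
    """
    direct = coauthors.get(author, set())
    candidates = set()
    for collaborator in direct:
        candidates |= coauthors.get(collaborator, set())
    candidates -= direct | {author}

    scored = []
    for candidate in candidates:
        neighbors = coauthors.get(candidate, set())
        union = direct | neighbors
        jaccard = len(direct & neighbors) / len(union) if union else 0.0
        scored.append((candidate, jaccard))
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Toy coauthor graph with made-up author labels.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D", "E"},
    "D": {"B", "C"},
    "E": {"C"},
}
suggestions = recommend_collaborators(graph, "A")
```

Here author D shares both of A's collaborators and so ranks above E, who shares only one.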
  • 75. Recommendation list and user’s topic of interest
  • 76. • Users refine the recommendation list by clicking on their topic of interest (left: refined by “query processing”; right: default recommendation list)
  • 77. • How two potential collaborators are linked by common collaborators
  • 79. Integration of Vertex Similarity and Textual Similarity • [Equation image] – S: vertex similarity – S_C.O.T.: collaborator’s contribution to a specified topic – Uses the product of exponential functions so that a zero vertex similarity score or a zero contribution (textual similarity) score does not turn the whole measure into zero • Other measures?
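The slide's equation did not survive extraction, so the sketch below only illustrates the stated idea: multiplying exponentials of the two scores keeps the combined measure positive even when one score is zero. The exact form exp(S) · exp(S_C.O.T.), the function names, and the sample values are assumptions, not the published formula.

```python
import math


def combined_score(vertex_similarity, topic_contribution):
    """Product of exponentials of the two scores (assumed form)."""
    return math.exp(vertex_similarity) * math.exp(topic_contribution)


def plain_product(vertex_similarity, topic_contribution):
    """Naive alternative, shown for contrast: collapses to zero if either score is zero."""
    return vertex_similarity * topic_contribution


# A candidate with no shared neighbors but a relevant topic contribution.
with_exponentials = combined_score(0.0, 0.5)
without_exponentials = plain_product(0.0, 0.5)
```

With the plain product the candidate's topic relevance is discarded entirely; with the exponential form it still contributes to the ranking.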
  • 80. • RefSeerX: recommends citations for papers – Use these paper citations – The authors are unaware of related work they do not know they are looking for – RefSeerX recommends related citations • Based on – Existing citations – Citation context – Venue and importance – Contemporary vs. seminal
  • 81. He, WWW ’10, WSDM ’11; Kataria, CIKM ’10, IJCAI ’11
  • 82.
  • 83.   Expert  Search • Expert search for authors, currently in alpha
  • 84.   Expert  Search • Expert search for authors, currently in alpha
  • 85. Keyphrase Extraction for experts • Text Document → Section Parser: parse the document into sections with regular expressions • Candidate Extractor (with DBLP data): use DBLP statistics to extract keyphrase candidates • Random Forest (with Training Data): train a random forest to classify and rank whether a phrase is a keyphrase • Output: Top Keyphrases • Treeratpituk, P., Teregowda, P., Huang, J. and Giles, C.L. SEERLAB: A System for Extracting Keyphrases from Scholarly Documents. SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles. ACL Workshop on Semantic Evaluations (SemEval 2010), Sweden, July 2010.
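The first stage of the pipeline, splitting a document into sections with regular expressions, might look roughly like the sketch below. The heading pattern, the list of heading words, and the function name `split_sections` are assumptions for illustration; the actual SEERLAB patterns are not given on the slide.

```python
import re

# Hypothetical heading pattern: an optional section number followed by a
# known heading word, alone on its line.
HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(abstract|introduction|related work|methods?|results|conclusions?|references)"
    r"[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text):
    """Return a dict mapping each lower-cased heading to its section body."""
    sections = {}
    matches = list(HEADING.finditer(text))
    for index, match in enumerate(matches):
        # A section runs from the end of its heading to the next heading.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections


paper = (
    "Abstract\n"
    "We extract keyphrases.\n"
    "1 Introduction\n"
    "Keyphrases help search.\n"
    "2. Methods\n"
    "Random forest.\n"
)
parsed = split_sections(paper)
```

Later stages could then weight candidate phrases by the section they appear in, for example treating phrases from the abstract differently from phrases in the references.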
  • 86. GrantSeer • Prototype search engine for PI profiles and their grant information to assist funding agencies, deans of research, and foundations • Link PIs with their – Grants – Publications – Citations – Organization – Expertise – Others? • Data that can be shared – CiteSeerX or Google Scholar data – Database of funded research • Funded by NSF (Julia Lane)
  • 87. Cover page NSF XML extraction
  • 88. GrantSeer: PI profile [Screenshot labels: grants awarded; PI’s expertise; publications + citations]
  • 89. Algorithm  Search   • Homepage search for authors, currently in alpha
  • 90. AlgorithmSeer: Algorithm Search – Extraction – Indexing – Ranking • Suite Workshop, ICSE ’11
  • 92. Metadata extraction • Extract – Pseudo-codes and their metadata – Captions – Reference sentences – Synopsis – Etc. • Index metadata using Solr to make the pseudo-codes searchable • Each search result has a pointer to the page in the document where the pseudo-code appears
  • 93. Index Fields • id <string> • caption <text> • reftext <text> (Reference Sentences) • synopsis <text> (Summarizing Text) • page <sint> (Page Number) • paperid <string> (Document ID) • year <sint> (Year of Publication) • ncites <sint> (Number of Citations)
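To show how these fields could be used at query time, here is a minimal sketch that builds a Solr select URL over the text fields and sorts by citation count. The field names come from the slide; `q`, `fq`, `fl`, `sort`, `rows`, and `wt` are standard Solr query parameters. The base URL, the core name `algorithms`, the function name, and the choice to search caption, synopsis, and reftext together are assumptions, not the AlgorithmSeer configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; host, port, and core name are placeholders.
BASE_URL = "http://localhost:8983/solr/algorithms/select"


def build_query_url(base_url, terms, min_year=None, rows=10):
    """Build a Solr select URL that searches the pseudo-code text fields."""
    params = {
        # Match the terms in any of the three text fields from the slide.
        "q": f"caption:({terms}) OR synopsis:({terms}) OR reftext:({terms})",
        # Return the page number so the result can point into the document.
        "fl": "id,caption,page,paperid,year,ncites",
        # Most-cited documents first (a simple stand-in for real ranking).
        "sort": "ncites desc",
        "rows": rows,
        "wt": "json",
    }
    if min_year is not None:
        params["fq"] = f"year:[{min_year} TO *]"
    return f"{base_url}?{urlencode(params)}"


url = build_query_url(BASE_URL, "shortest path", min_year=2005)
query = parse_qs(urlsplit(url).query)
```

Sorting purely by `ncites` ignores textual relevance; it is used here only because the field is listed on the slide.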
  • 96. Funding Agency Impact • Funding agency impact is based on acknowledgement indexing – Number of acknowledgements – Total citations – C/A metric: #citations / #acknowledgements • Based on acknowledgement entities extracted from 150K acknowledgements in CiteSeer (Giles, PNAS, 2004) • New system available this spring: AckSeer
    Funding Agencies
    Acks   Citations  C/A    Name
    12287  144643     11.77  National Science Foundation
    4712   80659      17.12  Defense Advanced Research Projects Agency
    3080   48873      15.87  Office of Naval Research
    2780   9782       3.52   Deutsche Forschungsgemeinschaft
    2408   21242      8.82   National Aeronautics and Space Administration
    2007   16582      8.26   Engineering and Physical Science Research Council
    1657   16850      10.17  Air Force Office of Scientific Research
    1422   12050      8.47   National Sciences and Engineering Research Council of Canada
    1054   5562       5.28   Department of Energy
    1010   5464       5.41   Australian Research Council
    825    9594       11.63  European Union Information Technologies Program
    709    7279       10.27  National Institutes of Health
    666    7709       11.58  Army Research Office
    646    2843       4.4    Netherlands Organization for Scientific Research
    489    6976       14.27  Science and Engineering Research Council
    Companies
    Acks   Citations  C/A    Name
    1380   23948      17.35  International Business Machines
    962    14441      15.01  Intel Corporation
    Educational Institutions (names truncated in the source): Carnegie Mello… University; Massachusetts … of Technology; California Inst… Technology; Santa Fe Institu…; French Nationa… Institute for Re… Computer Scie…; Stanford Unive…; University of C… at Berkeley; National Cente… Supercomputin… Applications; International C… Science Institu…; Cornell Univer…; University of I… Urbana-Champ…; USC Informati… Sciences Instit…; University of C… Los Angeles; McGill Univer…; Australian Nat… University
    Individuals (names truncated in the source): Olivier Danvy; Oded Goldreic…
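The C/A metric on this slide is the total number of citations divided by the number of acknowledgements. A minimal sketch of that arithmetic, using three rows from the funding agency table, is below; the function name `citations_per_ack` and the two-decimal rounding are illustrative choices, and the real figures come from acknowledgement indexing rather than a hard-coded list.

```python
def citations_per_ack(citations, acknowledgements):
    """C/A metric: total citations divided by number of acknowledgements."""
    return round(citations / acknowledgements, 2)


# (name, acknowledgements, citations), copied from the slide's table.
agencies = [
    ("National Science Foundation", 12287, 144643),
    ("Defense Advanced Research Projects Agency", 4712, 80659),
    ("Office of Naval Research", 3080, 48873),
]

metrics = {
    name: citations_per_ack(citations, acks)
    for name, acks, citations in agencies
}

# Rank agencies by impact per acknowledgement rather than by raw counts.
ranked = sorted(metrics, key=metrics.get, reverse=True)
```

Ranking by C/A instead of raw acknowledgement counts changes the order: the agency with the most acknowledgements is not the one with the most citations per acknowledgement.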
  • 97. Most Acknowledged Authors and Impact Factor • Olivier Danvy: interviewed by Nature as to why he was the most acknowledged computer scientist • Who is most acknowledged? – Mom or dad – Theorists or experimentalists • Who has a better metric?
    Citations  Acks  C/A    Author
    847        268   29.85  Olivier Danvy
    3277       259   17.82  Oded Goldreich
    3847       247   43.91  Luca Cardelli
    3336       226   24.31  Tom Mitchell
    3507       222   43.46  Martin Abadi
    3780       181   40.07  Phil Wadler
    3786       180   33.86  Moshe Vardi
    1790       167   53.54  Peter Lee
    2566       160   18.13  Avi Wigderson
    1622       154   30.55  Matthias Felleisen
    1484       152   30.53  Benjamin Pierce
    2640       152   15.71  Noga Alon
    3693       152   41.9   John Ousterhout
    1639       148   13.84  Frank Pfenning
    2064       144   52.99  Andrew Appel
  • 98. Clouding CiteSeerX • Hosting cloud CiteSeerX instances • Economic issues – Cost of hosting – Cost of refactoring the source to be hosted in the cloud • Computational/technical issues – Which workflows to move to the cloud – Component modification for efficient operation – VM size: storage, memory, and CPU sizing as a function of needs – Establishing computational needs and availability clusters – Appropriate load balancing across multiple sites – Security of stored data, including metadata and user data • Policy issues – Privacy of user data – Copyright issues • Teregowda Cloud’10, USENIX’10
  • 99. SeerSuite Research/Development Opportunities • Old Seers – Improve or revive old systems and port them into the competitive SeerX space • eBizSeer to eBizSeerX; BotSeer to BotSeerX; ArchSeer to ArchSeerX • New Seers – New domains such as physics, neuroscience, biology, algorithms, TBD (build new indexes) – MyCiteSeerX • Better features – Parsing – Entity disambiguation – Citation analysis – Ranking, ranking, ranking • New features – New parsing, indexing, ranking • Tables, figures, equations, algorithms, maps, carbon dating, chemical formulae, etc. – Homepage linking – ORE search and data integration – Collaborative spaces – API/web services – Integration with DLs such as Fedora – New clusters • Topics, venues, affiliations – Recommender systems – SNA analysis – Others • Collaborations welcomed! Data and software available
  • 100. Research SeerSuite supports • Many uses as a research testbed and support structure – Scaling of algorithms for IR, IE, data mining, social networks, ... – NLP methods on large text collections – ML methods to automatically extract data – Novel indexing and ranking – Federated search – Collaborative and social networks – Focused crawling: new data resources – Interface design and integration – Systems analysis • Many development and applied research issues – Integration with other DLs – Automated feature development – Transfer to nontechnical use – Cloud-based delivery
  • 101. Summary • Propose an infrastructure for academic and scientific search engine/digital library creation: SeerSuite – Modular, scalable, extensible, robust – Based on commercial-grade open source (Solr/Lucene); easy to use – Easy to apply to other domains (separable indexes and projects; integration) • Allows scalable data mining and information extraction for actual systems – Unique information extraction plugins – Focus on unique scalable extraction/data mining methods • Most methods have less than O(N^2) complexity – Automatically populates databases or data structures • Demonstrated with beta systems in – Computer science, Archaeology, Chemistry, Robots.txt, PubMed, YouSeer, Tables, Figures, Maps, References, Collaborations, Disambiguation – Personal features • Systems are reasonably easy to build; issues are – Data collection or data access – Information extraction, indexing, ranking • Many uses as a research testbed – Data sharing models • Want to find a Seer? Search Google or use my homepage.
  • 102. Opportunities • Science is being flooded with data – Simulations, sensors, web • Digital humanities is right behind • Needs in – Large-scale data management (tera to peta) • NoSQL databases: graphs, documents, floating point – Large-scale • data mining • information extraction • search • Domain expertise crucial • Reuse, not reinvent (much is out there) • Solr/Lucene is great for demos, production, and research.
  • 103. “Human attention is the scarce resource, not information.” Herbert A. Simon, Nobel Laureate, 1997. For more information • clgiles.ist.psu.edu • giles@ist.psu.edu • SourceForge.com