How we built the largest open database of companies in the world




Thursday, 7 June 2012
A simple (huge) goal: an entry (and URI) for every corporate legal entity in the world
URI is based on the company register ID, meaning it’s open and IP-free




Also importing public data – trademarks, government spending, official registers & gazette notices...




All Openly Licensed, allowing free reuse, even commercially

5 core uses




1. An open identifying system
               URIs can be used as common identifiers among a variety of organisations
               Can be used without reference to OpenCorporates
               Because they map to the ID issued by the company register, the corresponding entry in the registry (and associated info) can be found, and vice versa
               Fits the new EU Business Vocabulary
               Can even be used for companies in jurisdictions we haven’t yet imported
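
The two-way mapping between a URI and a register ID can be sketched as below. This is a minimal illustration, not OpenCorporates code: the base URL, the `companies/<jurisdiction>/<number>` path layout, and the function names `company_uri` and `parse_company_uri` are assumptions, and the company number is a made-up placeholder.

```python
from typing import Tuple
from urllib.parse import urlparse

# Assumed base URL and path layout; the slides only say the URI is
# derived from the ID issued by the company register.
BASE = "http://opencorporates.com/companies"


def company_uri(jurisdiction_code: str, company_number: str) -> str:
    """Build a URI from a jurisdiction code and the register-issued ID."""
    return f"{BASE}/{jurisdiction_code.lower()}/{company_number}"


def parse_company_uri(uri: str) -> Tuple[str, str]:
    """Recover (jurisdiction_code, company_number) from a company URI."""
    parts = urlparse(uri).path.strip("/").split("/", 2)
    if len(parts) != 3 or parts[0] != "companies":
        raise ValueError(f"not a company URI: {uri}")
    return parts[1], parts[2]


# Placeholder register ID, for illustration only.
uri = company_uri("GB", "00000001")
print(uri)
print(parse_company_uri(uri))
```

Because the URI is a pure function of jurisdiction and register ID, anyone can mint it for a company that has not been imported yet, which is the point of the last bullet.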

2. The simple search

               Not to be underestimated
               Massively reduces friction (how long will it take you to find and search multiple jurisdictions?)
               Allows what-if questions
               Potentially generates stories in its own right

3. Source for additional info
               Addresses, filings, status, websites...
               Intl trademarks, UK govt spending, official notices, health & safety violations...
               Other IDs: SEC, CAGE, etc – allows reverse mapping queries, e.g. show me the legal entity mapped to a CIK code
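
A reverse-mapping query of this kind can be sketched as an index from (scheme, identifier) to company URI. This is an illustrative in-memory version under stated assumptions: the record layout, the function names `build_reverse_index` and `lookup`, and every identifier value are hypothetical placeholders, not real data or the real storage model.

```python
from typing import Dict, Iterable, Optional, Tuple

# (identifier scheme, identifier value) -> company URI
ReverseIndex = Dict[Tuple[str, str], str]


def build_reverse_index(companies: Iterable[dict]) -> ReverseIndex:
    """Index every external identifier back to the company that holds it."""
    index: ReverseIndex = {}
    for company in companies:
        for scheme, identifier in company["other_ids"].items():
            index[(scheme.upper(), identifier)] = company["uri"]
    return index


def lookup(index: ReverseIndex, scheme: str, identifier: str) -> Optional[str]:
    """Return the company URI mapped to an external ID, or None if unknown."""
    return index.get((scheme.upper(), identifier))


# Made-up records: URIs and ID values are placeholders.
companies = [
    {"uri": "http://opencorporates.com/companies/us_de/0000001",
     "other_ids": {"CIK": "0000000001", "CAGE": "XXXX1"}},
    {"uri": "http://opencorporates.com/companies/gb/00000002",
     "other_ids": {"CIK": "0000000002"}},
]
index = build_reverse_index(companies)
print(lookup(index, "cik", "0000000002"))
```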
4. Reconciliation
         (matching names to legal entities)

         Clean up messy company names (& prev names) to legal entity, and from there to other data
         Google Refine reconciliation service (specific to jurisdiction)
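
The idea of matching a messy name to a legal entity within one jurisdiction can be sketched as follows. This is not the Google Refine reconciliation protocol or the actual matching logic: the suffix list, the 0.85 threshold, the use of `difflib` similarity, and the candidate records are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal-form suffixes to ignore when comparing names.
SUFFIXES = {"ltd", "limited", "plc", "inc", "incorporated", "llc", "corp", "corporation"}


def normalise(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal-form suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def reconcile(name, candidates, jurisdiction, threshold=0.85):
    """Return (best candidate or None, score), checking current and previous names."""
    target = normalise(name)
    best, best_score = None, 0.0
    for candidate in candidates:
        if candidate["jurisdiction"] != jurisdiction:
            continue  # reconciliation is specific to a jurisdiction
        for known in [candidate["name"]] + candidate.get("previous_names", []):
            score = SequenceMatcher(None, target, normalise(known)).ratio()
            if score > best_score:
                best, best_score = candidate, score
    return (best, best_score) if best_score >= threshold else (None, best_score)


# Made-up candidates for illustration.
candidates = [
    {"uri": "http://opencorporates.com/companies/gb/00000001", "jurisdiction": "gb",
     "name": "Acme Widgets Limited", "previous_names": ["Acme Gadgets Ltd"]},
    {"uri": "http://opencorporates.com/companies/us_de/0000002", "jurisdiction": "us_de",
     "name": "Acme Widgets Inc"},
]
match, score = reconcile("ACME GADGETS LTD.", candidates, "gb")
print(match["uri"], score)
```

Once a name resolves to a legal entity, its URI is the key for joining to the other datasets mentioned above.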

5. The platform

               API: allows all information to be retrieved as data, even searches
               Users can now add data too
               Coming soon: the option to match data to companies
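
Retrieving a search as data might look like the sketch below. The host name, the `/companies/search` path, the `q` and `jurisdiction_code` parameters, and the JSON response shape are all assumptions made for illustration and should be checked against the current API documentation; the sample response is hand-written, and no network request is made.

```python
import json
from urllib.parse import urlencode

API_BASE = "http://api.opencorporates.com"  # assumed host


def search_url(query, jurisdiction_code=None):
    """Build a company-search URL (path and parameter names are assumed)."""
    params = {"q": query}
    if jurisdiction_code:
        params["jurisdiction_code"] = jurisdiction_code
    return f"{API_BASE}/companies/search?{urlencode(params)}"


def company_names(response_text):
    """Pull company names out of a response with the assumed JSON shape."""
    payload = json.loads(response_text)
    return [item["company"]["name"] for item in payload["results"]["companies"]]


# Hand-written sample in the assumed response shape, not real API output.
sample = json.dumps({"results": {"companies": [
    {"company": {"name": "ACME WIDGETS LIMITED", "jurisdiction_code": "gb",
                 "company_number": "00000001"}},
]}})
print(search_url("acme widgets", "gb"))
print(company_names(sample))
```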
New feature: directors/officers

         We’ve just started importing & indexing company directors & officers, allowing search by name, & finding links between them and other companies
         (Slide callouts: other resources; similarly named)

How have we done it?
         1. Started small, with just three countries and 3 million companies
         2. Increasingly using official sources, where this is possible (i.e. the company registers are open and make data available)

How have we done it?
          3. Leveraged the open data community and ScraperWiki to scrape company registers around the world
          4. Worked with governments to help understand the problems – EU, World Bank, G20 Financial Stability Board, etc

The technology
         Vanilla, commodity open-source software, hosted on our own UK-based servers

         Database     MySQL (but considering PostgreSQL)
         Search       Solr (but considering ElasticSearch)
         Code         Ruby (Ruby on Rails main app, Sinatra API, vanilla Ruby for various internal libraries)
         Webserver    Nginx (webserver) + Memcached (caching) + Redis (queue + persistence)

How do we pay for all this?


               Unlike many open data projects, we’re a for-profit company – the open data movement needs successful companies if it’s going to have a diverse ecosystem
               But we’re a company whose business model is dependent on making more data open, and we have an advisory board to make sure we do the right thing
               Not yet looking for customers, but...


How do we pay for all this?
         Two projected sources of income

               Services model, especially around cleansing data/reconciliation. Of course, you can use our API and reconciliation service without asking us, but it may be cheaper to pay us to do it. Ditto custom extracts, and verticals
               Dual-licence model – contribute back to the community either with data or with financial support, e.g. if you have a proprietary database you may not want to be bound by the share-alike attribution restrictions
               And we already have some (small) customers

The problems




         Getting the data: Company registers have forgotten their main role is as public record, and actively work to prohibit free and open access to the data
The problems




         Understanding the data: Language, legal and cultural issues, not to mention the complexity of the subject
The problems




         Normalising the data: How do we abstract company types, status, industry codes, addresses, etc?
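
One way to picture the abstraction problem is a lookup from each register's own wording to a small shared vocabulary. The sketch below does this for company status only; the `Status` values, the raw status strings, and the jurisdiction codes in the table are illustrative assumptions, not the mapping actually used.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    UNKNOWN = "unknown"


# (jurisdiction code, raw register wording) -> shared status.
# Every entry here is a made-up example.
RAW_STATUS = {
    ("gb", "active"): Status.ACTIVE,
    ("gb", "dissolved"): Status.INACTIVE,
    ("us_de", "good standing"): Status.ACTIVE,
    ("us_de", "void"): Status.INACTIVE,
}


def normalise_status(jurisdiction_code: str, raw: str) -> Status:
    """Map a register-specific status string to the shared vocabulary."""
    key = (jurisdiction_code.lower(), raw.strip().lower())
    return RAW_STATUS.get(key, Status.UNKNOWN)


print(normalise_status("GB", " Dissolved "))
print(normalise_status("us_de", "Good Standing"))
print(normalise_status("gb", "in administration"))
```

Keeping the raw value alongside the normalised one, and defaulting to an explicit unknown, avoids losing information while the mapping is still evolving.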
W3C Business Vocabulary

               What are we doing?
               Why are we doing it?
               What does it mean?
               Where is it going?

The problems




         Handling the data: Over 150 million rows in some tables (slow schema changes), heavy reading and writing, evolving understanding of the problems and solutions
Now over 43 million companies in 50 jurisdictions
Including 23 US states





Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

EDF2012 Chris Taggart - How the biggest Open Database of Companies was built

  • 1. How we built the largest open database of companies in the world Thursday, 7 June 2012
  • 2. A simple (huge) goal: an entry (and URI) for every corporate legal entity in the world. URI is based on the company register ID, meaning it’s open and IP-free. Also importing public data: trademarks, government spending, official registers & gazette notices...
  • 3. All Openly Licensed, allowing free reuse, even commercially
  • 4. 5 core uses
  • 5. 1. An open identifying system: URIs can be used as common identifiers among a variety of organisations. Can be used without reference to OpenCorporates. Because they map to the ID issued by the company register, the corresponding entry in the registry (and associated info) can be found, and vice versa. Fits the new EU Business Vocabulary. Can even be used for companies in jurisdictions we haven’t yet imported.
  • 6. 2. The simple search: Not to be underestimated. Massively reduces friction (how long will it take you to find and search multiple jurisdictions?). Allows what-if questions. Potentially generates stories in its own right.
  • 7. 3. Source for additional info: Addresses, filings, status, websites... Intl trademarks, UK govt spending, official notices, health & safety violations... Other IDs: SEC, CAGE, etc; allows reverse mapping queries, e.g. show me the legal entity mapped to a CIK code.
  • 8. 4. Reconciliation (matching names to legal entities): Clean up messy company names (& prev names) to legal entity, and from there to other data. Google Refine reconciliation service (specific to jurisdiction).
  • 9. 5. The platform. API: allows all information to be retrieved as data, even searches. Users can now add data too. Coming soon: the option to match data to companies.
  • 10. New feature: directors/officers. We’ve just started importing & indexing company directors & officers, allowing search by name, & other resources finding links between them and other similarly named companies.
  • 11. How have we done it? 1. Started small, with just three countries and 3 million companies. 2. Increasingly using official sources, where this is possible (i.e. the company registers are open and make data available).
  • 12. How have we done it? 3. Leveraged the open data community and ScraperWiki to scrape company registers around the world. 4. Worked with governments to help understand the problems: EU, World Bank, G20 Financial Stability Board, etc.
  • 13. The technology: Vanilla, commodity open-source software, hosted on our own UK-based servers. Database: MySQL (but considering PostgreSQL). Search: Solr (but considering ElasticSearch). Code: Ruby (RubyOnRails main app, Sinatra API, vanilla Ruby for various internal libraries). Webserver: Nginx (webserver) + Memcached (caching) + Redis (queue + persistence).
  • 14. How do we pay for all this? Unlike many open data projects, we’re a for-profit company: the open data movement needs successful companies if it’s going to have a diverse ecosystem. But we’re a company whose business model is dependent on making more data open, and we have an advisory board to make sure we do the right thing. Not yet looking for customers, but...
  • 15. How do we pay for all this? Two projected sources of income. Services model, especially around cleansing data/reconciliation: of course, you can use our API and reconciliation service without asking us, but it may be cheaper to pay us to do it. Ditto custom extracts, and verticals. Dual-licence model: contribute back to the community either with data or with financial support, e.g. if you have a proprietary database you may not want to be bound by the share-alike attribution restrictions. And we already have some (small) customers.
  • 16. The problems: Getting the data. Company registers have forgotten their main role is as public record, and actively work to prohibit free and open access to the data.
  • 17. The problems: Understanding the data. Language, legal and cultural issues, not to mention the complexity of the subject.
  • 18. The problems: Normalising the data. How do we abstract company types, status, industry codes, addresses, etc?
  • 19. W3C Business Vocabulary: What are we doing? Why are we doing it? What does it mean? Where is it going?
  • 20. The problems: Handling the data. Over 150 million rows in some tables (slow schema changes), heavy reading and writing, evolving understanding of the problems and solutions.
  • 21. Now over 43 million companies in 50 jurisdictions, including 23 US states
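
Slides 2 and 5 describe URIs built from the ID that a company register issues, so that a URI can be mapped back to the register entry and vice versa. The following is a minimal Python sketch of that round trip. The base URL, the `/companies/<jurisdiction>/<register ID>` path layout, and the function names `company_uri` and `parse_company_uri` are assumptions made for illustration; the slides do not spell out the exact URI format.

```python
from urllib.parse import quote, unquote, urlparse

# Assumed base URL and path layout; not stated in the slides.
BASE = "https://opencorporates.com/companies"


def company_uri(jurisdiction_code: str, register_id: str) -> str:
    """Build a URI from a jurisdiction code and the ID issued by its register."""
    jurisdiction = jurisdiction_code.strip().lower()
    number = register_id.strip()
    if not jurisdiction or not number:
        raise ValueError("jurisdiction code and register ID are both required")
    # Percent-encode each segment so IDs with spaces or slashes stay one segment.
    return f"{BASE}/{quote(jurisdiction, safe='')}/{quote(number, safe='')}"


def parse_company_uri(uri: str) -> tuple[str, str]:
    """Recover (jurisdiction code, register ID) from a URI built as above."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "companies":
        raise ValueError(f"not a company URI: {uri}")
    return unquote(parts[1]), unquote(parts[2])
```

Because the identifier is derived only from the jurisdiction and the register's own ID, two organisations can arrive at the same URI independently, which is what lets it serve as a shared identifier even for jurisdictions that have not been imported yet.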