BUSINESS INTELLIGENCE SOLUTION USING SEARCH ENGINE Minor project prepared for the partial fulfillment of the Bachelor of Engineering. Prepared by: Ankur Mukherjee, Prateek Barapatre, Shetanshu Parihar. Guided by: IshwarLal Deshmukh Sir.
BUSINESS INTELLIGENCE SOLUTION Business Intelligence (BI) solutions are designed to let companies turn the volumes of data they collect and store into meaningful information, so that they can best manage their operations. When key information is readily accessible, a company can make better and timelier business decisions.
"Business Intelligence" (BI) is a business management term that refers to the applications and technologies used to gather, provide access to, and analyze data and information about a company's operations. Business intelligence systems can give companies more comprehensive knowledge of the factors affecting their business, such as metrics on sales, production, and internal operations, and can help them make better business decisions.
The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging for appropriate information challenging. BI solutions have for many years been a hot topic among companies because of their optimization and decision-making capabilities in business processes.
WEB SEARCH ENGINES Index-based engines search the Web, index Web pages, and build and store huge keyword-based indices. They help locate sets of Web pages containing certain keywords. Deficiencies: 1. A topic of any breadth may easily contain hundreds of thousands of documents. 2. Many documents that are highly relevant to a topic may not contain the keywords defining them (polysemy).
WEB TEXT MINING The WWW is a huge directed graph, with documents as nodes and hyperlinks as directed edges. Apart from the text itself, this graph structure carries a lot of information about the "usefulness" of the nodes. For example: 10 random, average people on the street say Mr. XYZ is a good dentist. 5 reputable doctors, including dentists, recommend Mr. ABC as a better dentist. Whom to choose?
BI – TEXT MINING With documents, especially text, now widespread in business systems, business executives cannot get useful details out of large collections of unstructured and semi-structured natural-language material using traditional business intelligence systems. It is the right time to develop a powerful tool that expands the scope of business intelligence and gains more competitive advantage for the business.
Data mining has been touted as the solution for business intelligence. Its strength is illustrated by the classic example in which data mining scans a large volume of retail sales records to find profitable purchasing patterns among consumers, and those patterns decide which products are placed close together on shelves.
Text mining is a variation of data mining and is a relatively new discipline. As with many new research areas, it is hard to give a generally agreed-upon definition. Commonly, text mining is the discovery by computer of previously unknown knowledge in text, by automatically extracting information from different written resources. Text mining can represent flexible approaches to information management, research, and analysis. Thus text mining extends the reach of data mining to textual materials.
MINING THE WORLD WIDE WEB The WWW is a huge, widely distributed, global information service centre for: 1. Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. 2. Hyperlink information. 3. Access and usage information. The WWW provides rich sources for data mining. Challenges faced are: 1. It is too huge for effective data warehousing and data mining. 2. It is too complex and heterogeneous: there are no standards and no uniform structure. 3. The data is growing and changing rapidly.
 
CRAWLERS The crawlers are implemented as multi-threaded objects in Java. Each crawler has many (possibly hundreds of) threads of execution sharing a single synchronized frontier that lists the unvisited URLs. Each thread of the crawler follows a crawling loop that involves picking the next URL to crawl from the frontier, fetching the page corresponding to the URL through HTTP, parsing the retrieved page, and finally adding the unvisited URLs to the frontier. Before the URLs are added to the frontier, they may be assigned a score that represents the estimated benefit of visiting the page corresponding to the URL.
Tag tree representation of an HTML snippet.
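The crawling loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names Frontier, CrawlerSketch, and fetchLinks are hypothetical, and fetchLinks stands in for the "fetch through HTTP and parse" steps so that the sketch runs without network access. The frontier here is a plain first-in, first-out queue; the scoring step is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical frontier: a synchronized queue of unvisited URLs shared by all threads.
class Frontier {
    private final Queue<String> unvisited = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private int pending = 0; // URLs that are queued or currently being processed

    synchronized void add(String url) {
        if (seen.add(url)) { // ignore URLs that were already queued once
            unvisited.add(url);
            pending++;
        }
    }

    synchronized String next() {
        return unvisited.poll(); // null when nothing is waiting
    }

    synchronized void done() {
        pending--;
    }

    synchronized boolean finished() {
        return pending == 0;
    }
}

class CrawlerSketch {
    // fetchLinks is an assumed stand-in for "fetch the page over HTTP and parse out its links".
    static Set<String> crawl(List<String> seeds,
                             Function<String, List<String>> fetchLinks,
                             int threadCount) {
        Frontier frontier = new Frontier();
        for (String seed : seeds) {
            frontier.add(seed);
        }
        Set<String> visited = Collections.synchronizedSet(new HashSet<String>());

        Runnable crawlingLoop = () -> {
            while (true) {
                String url = frontier.next();
                if (url == null) {
                    if (frontier.finished()) {
                        return; // nothing queued and nothing in flight
                    }
                    Thread.yield(); // another thread may still add URLs
                    continue;
                }
                try {
                    for (String link : fetchLinks.apply(url)) {
                        frontier.add(link);
                    }
                    visited.add(url);
                } finally {
                    frontier.done();
                }
            }
        };

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread worker = new Thread(crawlingLoop);
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return visited;
    }
}
```

To honour the score mentioned above, the queue inside the hypothetical Frontier could be replaced by a priority queue ordered by the estimated benefit of each URL.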
PROBLEM IDENTIFICATION Searching for URLs of related business entities is a type of business intelligence problem. The entities could be related through the area of competence, research thrust, comparable nature (like start-ups) or a combination of such features. We start by assuming that a short list of URLs of related business entities is already available. However, the list needs to be further expanded. The short list may have been generated manually with the help of search engines, business portals or Web directories. An analyst may face some hurdles in expanding the list of relevant URLs. Such hurdles could be due to lack of appropriate content in relevant pages, inadequate user queries, staleness of search engines' collections, or bias in search engines' ranking. Similar problems plague information discovery using Web directories or portals. The staleness of a search engine's collection is highlighted by the dynamic nature of the Web. Hence, it is reasonable to complement traditional techniques with topical crawlers to discover up-to-date business information.
METHODOLOGY With the ubiquity of the Internet and the Web, search engines have been sprouting like mushrooms after rainfall. However, innovative search engines and guided search capabilities have started appearing only in recent years. For instance, Google, one of the popular search engines, supports Web Services that allow external applications to issue Web search queries, which are processed on Google's commodity cluster computer made up of 15,000 PC nodes. The goal of these applications is to ease and guide the searching efforts of novice Web users towards their desired objectives.
SYSTEM FEATURES A search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called Page Rank. Second, it uses link information to improve search results.
PAGE RANK CALCULATION Counting citations or backlinks to a given page gives some approximation of a page's importance or quality. Page Rank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. Page Rank can be defined as follows: Let us assume page A has pages T1 to Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A. The Page Rank of a page A is given as follows: PR(A) = (1-d) + d ( PR(T1) / C(T1) + ... + PR(Tn) / C(Tn) )
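One common way to evaluate this formula is to start every page at the same rank and apply the equation repeatedly until the values settle. The sketch below does that for a small link graph. The class name PageRankSketch, the starting rank of 1.0, and the fixed iteration count are assumptions made for illustration; the slides do not say how the project computes the ranks.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PageRankSketch {
    // outLinks maps each page to the pages it links to.
    // Assumption: every page that is linked to also appears as a key.
    static Map<String, Double> compute(Map<String, List<String>> outLinks,
                                       double d, int iterations) {
        Map<String, Double> rank = new HashMap<>();
        for (String page : outLinks.keySet()) {
            rank.put(page, 1.0); // assumed starting value
        }

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : outLinks.keySet()) {
                next.put(page, 1.0 - d); // the (1-d) term
            }
            for (Map.Entry<String, List<String>> entry : outLinks.entrySet()) {
                List<String> targets = entry.getValue();
                if (targets.isEmpty()) {
                    continue; // a page with no out-links passes nothing on
                }
                // d * PR(T) / C(T): the share page T gives to each page it links to
                double share = d * rank.get(entry.getKey()) / targets.size();
                for (String target : targets) {
                    next.put(target, next.get(target) + share);
                }
            }
            rank = next;
        }
        return rank;
    }
}
```

With d = 0.85, calling compute on a graph where A links to B and C, B links to C, and C links to A gives C a higher rank than B, because C receives B's whole share in addition to half of A's.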
SYSTEM ARCHITECTURE
REPOSITORY The repository contains the full HTML of every web page. In the repository, the documents are stored one after the other, and each is prefixed by its docID, length, and URL. The repository requires no other data structures in order to be accessed. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
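A record layout of this kind can be sketched as below. The exact byte format is an assumption made for illustration (a 4-byte docID, a 4-byte body length, the URL written with writeUTF, then the HTML bytes); the slides only state which fields prefix each document. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RepositorySketch {
    static class Document {
        final int docId;
        final String url;
        final String html;

        Document(int docId, String url, String html) {
            this.docId = docId;
            this.url = url;
            this.html = html;
        }
    }

    // Stores documents one after the other, each prefixed by docID, length, and URL.
    static byte[] write(List<Document> docs) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Document doc : docs) {
                byte[] body = doc.html.getBytes(StandardCharsets.UTF_8);
                out.writeInt(doc.docId);
                out.writeInt(body.length);
                out.writeUTF(doc.url);
                out.write(body);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the records back sequentially; no index or other structure is needed.
    static List<Document> readAll(byte[] repository) {
        List<Document> docs = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(repository))) {
            while (in.available() > 0) {
                int docId = in.readInt();
                int length = in.readInt();
                String url = in.readUTF();
                byte[] body = new byte[length];
                in.readFully(body);
                docs.add(new Document(docId, url, new String(body, StandardCharsets.UTF_8)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

Because every record carries its own length, a reader can walk the whole file from the start, which is what makes rebuilding the other data structures from the repository alone possible.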
INDEXER Parsing -- Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. Developing a parser that runs at a reasonable speed and is very robust involved a fair amount of work. Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we parallelize the sorting phase to use as many machines as we have simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory based on wordID and docID.
SEARCHING The goal of searching is to provide quality search results efficiently. Steps involved are: 1. Parse the query. 2. Convert words into wordIDs. 3. Seek to the start of the doclist in the short barrel for every word. 4. Scan through the doclists until there is a document that matches all the search terms.
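The sorting step can be illustrated on a single in-memory barrel. In this sketch a forward barrel is simply a list of (docID, wordID) hits, and sorting it by wordID groups the documents for each word into a doclist. The names SorterSketch, Hit, and invert are hypothetical, and the separate title/anchor and full-text barrels, the parallel sorters, and the baskets are all left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SorterSketch {
    // One occurrence of a word in a document.
    static class Hit {
        final int docId;
        final int wordId;

        Hit(int docId, int wordId) {
            this.docId = docId;
            this.wordId = wordId;
        }
    }

    // Sorts a forward barrel by wordID (then docID) and groups it into wordID -> doclist.
    static Map<Integer, List<Integer>> invert(List<Hit> forwardBarrel) {
        List<Hit> sorted = new ArrayList<>(forwardBarrel);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.wordId)
                              .thenComparingInt(h -> h.docId));

        Map<Integer, List<Integer>> inverted = new LinkedHashMap<>();
        for (Hit hit : sorted) {
            List<Integer> docList =
                inverted.computeIfAbsent(hit.wordId, k -> new ArrayList<Integer>());
            // Repeated hits of the same word in the same document are adjacent after sorting.
            if (docList.isEmpty() || docList.get(docList.size() - 1) != hit.docId) {
                docList.add(hit.docId);
            }
        }
        return inverted;
    }
}
```

Each doclist comes out ordered by docID, which is what allows a searcher to scan several doclists in step.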
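The four steps can be sketched as follows, assuming an in-memory lexicon (word to wordID) and an index whose doclists are sorted by docID. These structures and the names SearcherSketch, search, and intersect are hypothetical. Unlike step 4, which stops at the first matching document, this sketch collects every document that contains all the terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SearcherSketch {
    static List<Integer> search(String query,
                                Map<String, Integer> lexicon,
                                Map<Integer, List<Integer>> index) {
        List<Integer> result = null;
        // Step 1: parse the query into lower-case words.
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Step 2: convert the word into its wordID.
            Integer wordId = lexicon.get(word);
            // Step 3: find the doclist for that wordID.
            List<Integer> docList = (wordId == null) ? null : index.get(wordId);
            if (docList == null) {
                return new ArrayList<Integer>(); // unknown term: nothing matches all terms
            }
            // Step 4: keep only documents present in every doclist seen so far.
            if (result == null) {
                result = new ArrayList<Integer>(docList);
            } else {
                result = intersect(result, docList);
            }
        }
        return (result == null) ? new ArrayList<Integer>() : result;
    }

    // Scans two docID-sorted lists in step and keeps the common docIDs.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> common = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                common.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return common;
    }
}
```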
DATABASE STRUCTURE
CRAWLTABLE It has three fields: Serial, which is just a serial number; URLAddress, which is the address of a crawled URL that is available on the server; and Iscrawled, which records whether the URL address has been crawled or not.
INDEXTABLE It contains three fields: Keyword, which is the meta text; URL address, which is the address of a crawled URL that is available on the server; and Frequency, which shows the number of hits for the particular URL.
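The two tables can be pictured as plain row types, as in the sketch below. The field types (int, String, boolean) and the two helper methods are assumptions made for illustration; in particular, ordering results by descending Frequency is one plausible use of that field, not something the slides state.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TablesSketch {
    // One row of CRAWLTABLE.
    static class CrawlRow {
        final int serial;
        final String urlAddress;
        final boolean isCrawled;

        CrawlRow(int serial, String urlAddress, boolean isCrawled) {
            this.serial = serial;
            this.urlAddress = urlAddress;
            this.isCrawled = isCrawled;
        }
    }

    // One row of INDEXTABLE.
    static class IndexRow {
        final String keyword;
        final String urlAddress;
        final int frequency;

        IndexRow(String keyword, String urlAddress, int frequency) {
            this.keyword = keyword;
            this.urlAddress = urlAddress;
            this.frequency = frequency;
        }
    }

    // URLs that still have to be crawled.
    static List<String> pendingUrls(List<CrawlRow> crawlTable) {
        List<String> pending = new ArrayList<>();
        for (CrawlRow row : crawlTable) {
            if (!row.isCrawled) {
                pending.add(row.urlAddress);
            }
        }
        return pending;
    }

    // URLs indexed under a keyword, highest frequency first (assumed ordering).
    static List<String> lookup(List<IndexRow> indexTable, String keyword) {
        List<IndexRow> matches = new ArrayList<>();
        for (IndexRow row : indexTable) {
            if (row.keyword.equalsIgnoreCase(keyword)) {
                matches.add(row);
            }
        }
        matches.sort(Comparator.comparingInt((IndexRow r) -> r.frequency).reversed());
        List<String> urls = new ArrayList<>();
        for (IndexRow row : matches) {
            urls.add(row.urlAddress);
        }
        return urls;
    }
}
```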
RESULT AND DISCUSSION 1. Starting the Search Engine
LOGIN FORM
MAIN WINDOW
CRAWLER WINDOW
INDEXER
USER INTERFACE
OUTPUT WINDOW (SEARCH RESULT FOR STRING “A”)
CONCLUSION AND SCOPE OF FUTURE WORK We would like to implement phrase search with spelling suggestions. For example, if the user enters the string "search broser", the search engine will respond: Did You Mean: "Search Browser". Future development can also add filters to the search engine, like those in the Google search engine. Future work also includes presenting graphical results for the searched string.
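One simple way such a "Did You Mean" feature could work is to compare each query word against a list of known words using edit distance and to suggest the closest one. The sketch below illustrates that idea only; the class name DidYouMeanSketch, the vocabulary list, and the cut-off of two edits are all assumptions, and the slides do not commit to any particular technique.

```java
import java.util.List;

class DidYouMeanSketch {
    // Levenshtein distance: minimum number of single-character
    // insertions, deletions, or substitutions turning a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Replaces each query word with the closest vocabulary word within two edits (assumed cut-off).
    static String suggest(String query, List<String> vocabulary) {
        StringBuilder suggestion = new StringBuilder();
        for (String word : query.trim().split("\\s+")) {
            String best = word;
            int bestDistance = 3; // anything at distance 3 or more is left unchanged
            for (String candidate : vocabulary) {
                int distance = editDistance(word.toLowerCase(), candidate.toLowerCase());
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = candidate;
                }
            }
            if (suggestion.length() > 0) {
                suggestion.append(' ');
            }
            suggestion.append(best);
        }
        return suggestion.toString();
    }
}
```

In this sketch the vocabulary would most naturally come from the Keyword column of the index table, so that suggestions are always words the engine can actually find.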
