DOM Tree based Approach for Extraction of Content from Web News Page


Category

Articles

Authors

Snehal Khade & Jyothi Rao

Publisher

Aes Journals In Engineering, Technology

Publishing Date

01-Oct-2013

Volume

Issue

Pages

292-297

Abstract

With the huge amount of data on the World Wide Web, information extraction has become an increasingly important technology for helping users locate the Web information they want. Because Web content is heterogeneous, it is difficult to design a general Web information extraction approach that fits all application domains. When building a system for searching or mining Web content, the first task is to extract the main content and remove extraneous data such as navigation menus, functional and design elements, and commercial advertisements. Filtering out such noise is especially important for pages such as Web news pages. Content extraction is also valuable when showing Web news pages on small screens such as mobile phones, or when sending text to screen readers that convert it to a more appropriate format, such as text-to-speech for visually impaired people. Content extraction is defined as the process of determining those parts of an HTML document that represent the main textual content. The problem, however, is to find a solution that is generic (portable to many types of Web news pages), accurate (finding all important content precisely), and efficient (since a large number of Web pages are often processed). The approach designed here searches for relevant Web pages using a Web crawler, then uses a DOM tree based approach to extract the content from each Web news page by filtering out noise; finally, an information retrieval agent extracts the key paragraph from the extracted content.

Keywords: information extraction; Web crawler; Document Object Model; Web content extraction; information retrieval agent
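
The abstract does not spell out the filtering rules, so the following is only a rough sketch of what DOM-based noise filtering on a news page might look like, not the authors' method. It assumes a simple link-density heuristic: block elements whose text is mostly anchor text (menus, related-link lists) are treated as noise, and subtrees under tags such as nav or footer are skipped outright. The names extract_main_text, SKIP_TAGS, BLOCK_TAGS, min_chars, and max_link_density are hypothetical and chosen for illustration; the sketch also assumes reasonably well-formed HTML with closed block tags.

```python
from html.parser import HTMLParser

# Assumption for this sketch: subtrees under these tags are always noise.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Walks the tag stream and attributes text to the nearest open block."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS subtree
        self.link_depth = 0   # > 0 while inside an <a> element
        self.stack = []       # currently open blocks
        self.blocks = []      # closed blocks, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "parts": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        block = self.stack[-1]
        block["parts"].append(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_main_text(html, min_chars=40, max_link_density=0.3):
    """Return the text of blocks that are long enough and not link-heavy."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for block in collector.blocks:
        total = sum(len(part) for part in block["parts"])
        if total < min_chars:
            continue  # too short to be body text (or an empty wrapper)
        if block["link_chars"] / total > max_link_density:
            continue  # mostly anchor text: likely a menu or link list
        kept.append(" ".join(block["parts"]))
    return kept
```

Calling extract_main_text on a page's HTML string returns the surviving blocks as a list of strings. The thresholds are arbitrary starting points and would need tuning per site; a crawler stage and a key-paragraph selection stage, as described in the abstract, would sit before and after this step.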

Related Items

JYOTHI RAO. (2014).

Semantic search based on ontology alignment for information retrieval. International Journal of Computer Applications, 107(10): 25-33. doi: http://doi.org/10.5120/18789-0125

JYOTHI RAO. (2013).

DOM Tree based Approach for Extraction of Content from Web News Page. JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING, 2(2): 292-297.
