DOM Tree based Approach for Extraction of Content from Web News Page
Abstract
With the huge amount of data on the World Wide Web, information extraction has become an increasingly important technology for helping users locate the Web information they want. Because Web content is heterogeneous, it is difficult to design a general Web information extraction approach that fits all application domains. When building a system for searching or mining Web content, a first task is to extract the main content and remove extraneous data such as navigation menus, functional and design elements, and commercial advertisements. Filtering out such noise is especially important for Web news pages. Content extraction is also valuable when showing Web news pages on small screens such as mobile phones, or when sending text to screen readers that convert it to a more suitable format, such as speech, for visually impaired users. Content extraction is defined as the process of determining those parts of an HTML document that represent the main textual content. The problem, however, is to find a solution that is generic (portable to many types of Web news pages), accurate (finds all important content precisely), and efficient (able to process the large numbers of Web pages often involved). The approach designed here searches for relevant Web pages using a Web crawler, then uses a DOM tree based approach to extract the content from each Web news page by filtering out noise; finally, an information retrieval agent extracts the key paragraph from the extracted content.
Keywords: information extraction; Web crawler; Document Object Model; Web content extraction; information retrieval agent
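
The DOM tree based filtering step described in the abstract could look roughly like the sketch below. It is an illustration only, not the authors' method: the abstract does not state the actual filtering rules, so the heuristic here (choose the element whose direct <p> children hold the most non-link text, and skip subtrees under tags such as nav or footer) is an assumption, as are all names (NOISE_TAGS, DomBuilder, extract_main_content) and the sample page. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Assumed noise containers; the paper's abstract does not list specific tags.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Elements that never have a closing tag, so they must not open a new tree level.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects and text strings, in document order


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree (tags and text only, no attributes)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.current.children.append(text)


def measure(node, in_link=False):
    """Return (text_chars, link_chars) for a subtree, ignoring noise tags."""
    if node.tag in NOISE_TAGS:
        return 0, 0
    in_link = in_link or node.tag == "a"
    total = link = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if in_link else 0
        else:
            sub_total, sub_link = measure(child, in_link)
            total += sub_total
            link += sub_link
    return total, link


def paragraph_score(node):
    """Non-link characters held by the node's direct <p> children."""
    score = 0
    for child in node.children:
        if isinstance(child, Node) and child.tag == "p":
            total, link = measure(child)
            score += total - link
    return score


def walk(node):
    """Yield every node in the tree except those inside noise subtrees."""
    yield node
    for child in node.children:
        if isinstance(child, Node) and child.tag not in NOISE_TAGS:
            yield from walk(child)


def subtree_text(node):
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in NOISE_TAGS:
            parts.append(subtree_text(child))
    return " ".join(p for p in parts if p)


def extract_main_content(html):
    """Return the paragraphs of the best-scoring element, one per line."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    best = max(walk(builder.root), key=paragraph_score)
    paragraphs = [subtree_text(c) for c in best.children
                  if isinstance(c, Node) and c.tag == "p"]
    return "\n".join(p for p in paragraphs if p)


# Hypothetical news page used only to exercise the sketch.
SAMPLE = """
<html><body>
<nav><p><a href="/">Home</a> <a href="/world">World</a></p></nav>
<div id="story">
<h1>Council approves new bridge</h1>
<p>The city council voted on Tuesday to fund a new river bridge.</p>
<p>Construction is expected to begin <a href="/plan">next year</a> if the budget allows.</p>
</div>
<div class="ads"><p><a href="/buy">Buy now</a></p><p><a href="/deal">Great deals</a></p></div>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_content(SAMPLE))
```

On the sample page, the story block wins because its paragraphs carry the most text outside links, while the advertisement block scores zero (all of its text is link text) and the navigation and footer subtrees are never visited. The sketch returns only <p> text, so the headline is dropped; a fuller system would also need the crawler and the key paragraph selection step that the abstract mentions.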