As the World Wide Web is growing rapidly, it is getting increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively one has to resort to other techniques like sampling to determine the properties of the ...
Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content, but a URL-only classifier is preferable, (i) when speed is crucial, (ii) to enable conte ...
Given only the URL of a web page, can we identify its language? This is the question that we examine in this paper. Such a language classifier is, for example, useful for crawlers of web search engines, which frequently try to satisfy certain language quot ...
The World Wide Web is one of the most widely used information resources. Understanding the web better will enable us to benefit more of it. In this thesis we develop techniques to learn the properties of the web pages like language and topic using only the ...