Intro to the Web: Friday, Jan 9

Publishing on the Web
  • Almost anyone can publish on the World Wide Web.
  • The Web is unregulated, unlike TV and radio, which have a limited number of channels.
  • There is no central administration to organize the Web, as there is for online services such as America Online.

Finding Sites on the Web
  • Know or guess the URL (Web address)
  • Use a hierarchical index (e.g. Yahoo)
  • Use a search engine (e.g. AltaVista)

Hierarchical Indexing Services
  • Yahoo

Search Engines
  • Use a "crawler" to visit sites on the Web.
  • Build a database at the search engine's own site; user queries search this database, not the original pages.
  • Return links to the original pages as a result of user searches.
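
The three steps above can be sketched with a tiny inverted index. This is only an illustration under stated assumptions: the page URLs and text are invented, and the helper names build_index and search are hypothetical, not taken from any real search engine.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (invented for illustration).
PAGES = {
    "http://example.edu/a": "film noir classics",
    "http://example.edu/b": "pinot noir wine tasting",
    "http://example.edu/c": "classic film reviews",
}

def build_index(pages):
    """Map each lower-cased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, word):
    """Answer a one-word query from the index alone; return sorted URLs."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(PAGES)
print(search(INDEX, "noir"))
```

The query never touches the original pages: it reads the index and returns links, which is why results can go stale between crawls.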

Crawler (a.k.a. "Robot" or "Spider")
  • Program that visits sites on the Web by following hyperlinks.
  • Since the Web is "well-connected", a crawler can start almost anywhere and reach almost all of the Web (search engines also permit submission of URLs).
  • Crawlers perform "breadth-first" search to avoid overloading individual sites.
  • Should obey the robot exclusion protocol (robots.txt).
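
A breadth-first crawl that respects robot exclusion might look roughly like the sketch below. Everything here is an assumption made for illustration: the link graph, the robots.txt rules, and the agent name ExampleBot are invented, and nothing is fetched over the network. Delays between requests and multiple hosts are not modelled.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical single-site link graph: URL -> URLs it links to.
LINKS = {
    "http://site.example/": ["http://site.example/a.html",
                             "http://site.example/private/x.html"],
    "http://site.example/a.html": ["http://site.example/b.html",
                                   "http://site.example/"],
    "http://site.example/b.html": [],
    "http://site.example/private/x.html": ["http://site.example/secret.html"],
}

# Hypothetical robots.txt for that site, parsed from memory instead of fetched.
ROBOTS = RobotFileParser()
ROBOTS.parse(["User-agent: *", "Disallow: /private/"])

def crawl(start, links, robots, agent="ExampleBot"):
    """Visit pages breadth-first from start, skipping excluded URLs."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # excluded by robots.txt: neither indexed nor followed
        order.append(url)
        for target in links.get(url, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order

print(crawl("http://site.example/", LINKS, ROBOTS))
```

The queue gives breadth-first order, and the page under /private/ is skipped, so the page it links to is never discovered.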

Distinguishing Characteristics
  • Coverage
    • how many Web pages are indexed.
    • how much of each page is indexed.
  • Freshness: what fraction of responses are no longer accurate.
  • Power and ease of use of the query language.
  • Speed of response.

Recommended Search Engines
  • AltaVista (http://altavista.digital.com/)
    • Indexes every word on most pages of the Web.
    • Has a powerful query language.
  • HotBot (http://www.hotbot.com/)
    • Has the most complete and up-to-date index (crawls 10 million documents per day).
    • Query language is almost as powerful; easy-to-use interface.

Evaluation
  • Searching for the phrase "Kristian J. Hammond"
  • Number of results: Excite 11, AltaVista 81, HotBot 91
  • AltaVista can search for words "near" each other, and within many HTML elements, e.g. links.
  • Excite permits "similarity searches".

Excite similarity search
  • Results of search for Kristian J. Hammond University of Chicago
  • Results of clicking [More Like This] on
    • Infolab: Kristian Hammond
  • Can you find an example where the similarity search gives relevant results?

Using AltaVista (Standard Search)
  • Default search is "or" of search terms.
  • Can require words to be present or absent using a + or - prefix.
  • Ranks the results based on a scoring algorithm.
  • Phrases are delimited by quotes
    • e.g. +noir +film -"pinot noir"
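
The +/- and phrase rules can be mimicked with a small matcher. This is a rough sketch, not AltaVista's implementation: the function names are hypothetical, case and accent rules are ignored, ranking is left out, and the treatment of unsigned terms (at least one must be present when any are given) is a simplifying assumption.

```python
import re

def parse_query(query):
    """Split a query into (sign, words) pairs; sign is '+', '-' or ''."""
    terms = []
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        terms.append((sign, (phrase or word).lower().split()))
    return terms

def contains(doc_words, words):
    """True if words occur consecutively in doc_words."""
    n = len(words)
    return any(doc_words[i:i + n] == words
               for i in range(len(doc_words) - n + 1))

def matches(query, text):
    """Every + term present, no - term present, unsigned terms OR-ed."""
    doc_words = re.findall(r"\w+", text.lower())
    has_optional, optional_hit = False, False
    for sign, words in parse_query(query):
        found = contains(doc_words, words)
        if sign == "+" and not found:
            return False
        if sign == "-" and found:
            return False
        if sign == "":
            has_optional = True
            optional_hit = optional_hit or found
    return optional_hit or not has_optional

QUERY = '+noir +film -"pinot noir"'
print(matches(QUERY, "A classic film noir from 1944"))
print(matches(QUERY, "Film night with pinot noir"))
```

The first page satisfies the query; the second is rejected because it contains the excluded phrase.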

Standard Search (cont.)
  • A lower-case query matches both upper and lower case; an upper-case query matches only upper case
    • e.g. turkey -Turkey
  • Accents are permitted; a non-accented query letter matches both accented and non-accented forms
    • e.g. café
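
The case and accent rules can be stated per character, as in the sketch below. The function names are hypothetical, and one rule is an assumption inferred by analogy with the case rule rather than stated on the slide: an accented query letter matches only the accented form.

```python
import unicodedata

def strip_accents(ch):
    """Return ch without combining marks, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def char_matches(q, d):
    """Does query character q match document character d?"""
    if q == strip_accents(q):   # unaccented query letter: ignore accents in d
        d = strip_accents(d)
    if q == q.lower():          # lower-case query letter: ignore case in d
        d = d.lower()
    return q == d

def word_matches(query_word, doc_word):
    """Apply char_matches position by position to two words."""
    return len(query_word) == len(doc_word) and all(
        char_matches(q, d) for q, d in zip(query_word, doc_word))

print(word_matches("turkey", "Turkey"))    # lower case matches either case
print(word_matches("Turkey", "turkey"))    # upper case matches only upper
print(word_matches("cafe", "caf\u00e9"))   # unaccented matches accented
print(word_matches("caf\u00e9", "cafe"))   # accented requires the accent
```

Under these rules the query turkey -Turkey keeps pages that mention the bird in lower case and drops pages that contain the capitalized word.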

Structural Elements
  • title: within page title
    • e.g. title:"University of Chicago"
  • host: within the host name of the system
    • e.g. host:cs.uchicago
  • link: within the URL of a hyperlink
    • e.g. link:cs.uchicago/info

Structural Elements, cont.
  • anchor: within the text of a hyperlink to another document
  • url: pages with text as part of their URL
  • image: within URL of an image
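
To make the structural keywords concrete, the sketch below pulls the corresponding fields out of a page with Python's standard html.parser. It is only an illustration: the class name FieldExtractor, the sample page, and the example.edu addresses are invented, and a real engine would index these fields rather than extract them per query.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class FieldExtractor(HTMLParser):
    """Collect the text that each structural keyword would search."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "link": [], "anchor": [], "image": []}
        self._inside = None  # "title" or "anchor" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._inside = "title"
        elif tag == "a" and attrs.get("href"):
            self.fields["link"].append(attrs["href"])   # link: hyperlink URL
            self.fields["anchor"].append("")            # anchor: hyperlink text
            self._inside = "anchor"
        elif tag == "img" and attrs.get("src"):
            self.fields["image"].append(attrs["src"])   # image: image URL

    def handle_endtag(self, tag):
        if tag in ("title", "a"):
            self._inside = None

    def handle_data(self, data):
        if self._inside == "title":
            self.fields["title"] += data
        elif self._inside == "anchor":
            self.fields["anchor"][-1] += data

def extract_fields(url, html):
    """Return the title, link, anchor, image, url and host fields of a page."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return dict(parser.fields, url=url, host=urlparse(url).hostname)

PAGE = ('<html><head><title>University of Chicago</title></head><body>'
        '<a href="http://cs.example.edu/info/">CS info</a>'
        '<img src="logo.gif"></body></html>')
FIELDS = extract_fields("http://www.example.edu/index.html", PAGE)
print(FIELDS)
```

A query such as title:"University of Chicago" would then be tested against the title field only, and link:, anchor:, and image: against their own fields.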

AltaVista Scoring Algorithm
  • A document has a higher score if:
    • the query words or phrases are found in the first few words of the document.
    • the query words or phrases are found close to one another in the document.
    • the document contains more than one instance of the query word or phrase.
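
The three criteria above could be combined into a toy score like the one below. AltaVista's actual formula is not given here, so the bonuses, their weights, and the function name score are all assumptions chosen only to show how position, proximity, and frequency might each raise a score.

```python
import re

def score(query_words, text):
    """Toy relevance score; query_words are expected in lower case."""
    words = re.findall(r"\w+", text.lower())
    positions = {q: [i for i, w in enumerate(words) if w == q]
                 for q in query_words}
    hits = [p for plist in positions.values() for p in plist]
    if not hits:
        return 0.0
    early = 1.0 / (1 + min(hits))        # bonus: a query word appears early
    frequency = len(hits)                # bonus: more occurrences
    present = [plist for plist in positions.values() if plist]
    if len(present) > 1:                 # bonus: query words close together
        firsts = [plist[0] for plist in present]
        proximity = 1.0 / (max(firsts) - min(firsts))
    else:
        proximity = 0.0
    return early + proximity + 0.1 * frequency

A = score(["film", "noir"], "film noir is a genre")
B = score(["film", "noir"], "this genre of film is sometimes called noir")
print(A > B)
```

The first document scores higher because the query words appear at the start and next to each other.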

AltaVista: Advanced Query
  • Uses operators and, or, not, near, and parentheses:
    • Simple query:
      • +noir +film -"pinot noir"
    • Equivalent advanced query:
      • noir and film and not "pinot noir"
  • Note: sort terms are specified separately in Advanced Search; by default, results are not sorted
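
A Boolean query of this kind can be evaluated with a small recursive-descent parser, sketched below. This is not AltaVista's parser: the precedence (not binds tightest, then and, then or) is an assumption, near is omitted, and malformed queries are not handled.

```python
import re

def matches(query, text):
    """Evaluate an and/or/not query with parentheses against one document."""
    doc = re.findall(r"\w+", text.lower())
    tokens = re.findall(r'"[^"]*"|[()]|[^\s()"]+', query)
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def has_phrase(words):
        n = len(words)
        return any(doc[i:i + n] == words for i in range(len(doc) - n + 1))

    def parse_or():
        value = parse_and()
        while peek() == "or":
            take()
            right = parse_and()      # always parse the right-hand side
            value = value or right
        return value

    def parse_and():
        value = parse_unary()
        while peek() == "and":
            take()
            right = parse_unary()
            value = value and right
        return value

    def parse_unary():
        if peek() == "not":
            take()
            return not parse_unary()
        if peek() == "(":
            take()
            value = parse_or()
            take()                   # consume the closing parenthesis
            return value
        return has_phrase(take().strip('"').lower().split())

    return parse_or()

QUERY = 'noir and film and not "pinot noir"'
print(matches(QUERY, "A classic film noir"))
print(matches(QUERY, "Film night with pinot noir"))
```

The advanced query accepts and rejects the same two sample pages as the simple query +noir +film -"pinot noir".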

Advanced Search, cont.
  • The near operator is useful for matching first and last names (with or without a middle name or initial)
  • Can sort on words other than the query word(s):
    • e.g. search term: Chicago and restaurant, sort: romantic
  • Can specify range of dates
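
The near operator can be sketched as a distance check on word positions. The function name near and the default window of 10 words are assumptions for illustration; the slides do not state how close the words must be.

```python
import re

def near(word_a, word_b, text, window=10):
    """True if word_a and word_b occur within `window` words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

print(near("Kristian", "Hammond", "Kristian J. Hammond teaches here"))
print(near("Kristian", "Hammond", "Hammond, Kristian"))
```

Both name orders match, with or without the middle initial, which a quoted phrase search would miss.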

What searching isn't meant to be
  • Organizing the entire Web isn't really feasible; organizing your own pages is.
  • Search engines are not a fix for bad organization.
  • Infostructure of your pages is critical.
  • Think hard about the flow of information before you start designing.