Intro to the Web:
Friday, Jan 9
Publishing on the Web
- Almost anyone can publish on the World Wide Web.
- The Web is unregulated, unlike TV and radio, which have
a limited number of channels.
- There is no central administration to organize the Web,
as there is for online services such as America Online.
Finding Sites on the Web
- Know or guess the URL (Web address)
- Use a hierarchical index (e.g. Yahoo)
- Use a search engine (e.g. AltaVista)
Hierarchical Indexing Services
Yahoo
- For example, looking for Powell's Technical Bookstore
in Portland, Oregon
Search Engines
- Use a "crawler" to visit sites on the Web.
- Build a database at the search engine's own site; user queries
search this database, not the original pages.
- Return links to the original pages as a result of user
searches.
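The three steps above can be sketched with a tiny inverted index. This is only an illustration under stated assumptions: the page URLs and text are made up, and the names build_index and search are hypothetical, not taken from any real search engine.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index

def search(index, word):
    """Return sorted links to the original pages that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical pages that a crawler might have fetched.
pages = {
    "http://example.com/a": "Film noir classics",
    "http://example.com/b": "Pinot noir tasting notes",
}
index = build_index(pages)
print(search(index, "noir"))   # both example URLs
print(search(index, "film"))   # only http://example.com/a
```

Note that a query only touches the index; the original pages are read once, when the index is built.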
Crawler (a.k.a. "Robot" or "Spider")
- Program that visits sites on the Web by following hyperlinks.
- Since the Web is "well-connected", a crawler can start
almost anywhere and reach almost all of the Web (engines also permit submission of URLs).
- Crawlers perform "breadth-first" search to
avoid overloading individual sites.
- Should obey the robot exclusion standard (a site's robots.txt file).
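A minimal sketch of a breadth-first crawl that respects robot exclusion rules. To keep it runnable without a network, the "Web" here is a made-up in-memory link graph and the robots.txt lines are hypothetical. The crawl function and the "demo-bot" user agent are illustrative names only. The exclusion check uses Python's standard urllib.robotparser module.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

def crawl(start, links, allowed):
    """Breadth-first crawl over an in-memory link graph.

    links maps each URL to the URLs it links to; allowed(url) says
    whether the crawler may fetch that URL.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()          # oldest URL first: breadth-first
        if not allowed(url):
            continue                   # excluded pages are never fetched
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical robots.txt content for example.com.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

# Hypothetical link graph standing in for fetched pages.
links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/private/x": ["http://example.com/secret"],
}

order = crawl("http://example.com/", links,
              lambda url: robots.can_fetch("demo-bot", url))
print(order)
```

Because the excluded page is never fetched, the link it contains is never discovered either.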
Distinguishing Characteristics
- Coverage
- How many Web pages are indexed.
- How much of each page is indexed.
- What fraction of responses are no longer accurate.
- Power and ease-of-use of query language.
- Speed of response.
Recommended Search Engines
- AltaVista (http://altavista.digital.com/)
- Indexes every word on most pages of the Web.
- Has a powerful query language.
- HotBot (http://www.hotbot.com/)
- Has the most complete and up-to-date index (crawls 10
million documents per day).
- Query language is almost as powerful; easy-to-use interface.
Evaluation
- Looking for the phrase "Kristian J. Hammond"
- Excite: 11, AltaVista: 81, HotBot: 91
- AltaVista can search for words "near" each
other, and within many HTML elements, e.g. links
- Excite permits "similarity searches".
Excite similarity search
- Results of search for Kristian J. Hammond University of Chicago
- Results of clicking [More Like This] on Infolab: Kristian Hammond
- Can you find an example where the similarity search gives
relevant results?
Using AltaVista (Standard Search)
- Default search is "or" of search terms.
- Can require words to be present or absent using a + or - prefix
- Ranks the results based on a scoring algorithm.
- Phrases are delimited by quotes
- e.g. +noir +film -"pinot noir"
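The prefix rules above can be sketched as a small matcher. This is an approximation for illustration, not AltaVista's actual implementation: parse_query, contains and matches are hypothetical names, and the sketch assumes that unprefixed terms only decide a match when no + terms are given (otherwise they would affect ranking only).

```python
import re

def parse_query(query):
    """Split a query into (required, excluded, optional) lists of terms.

    A quoted phrase counts as one term; + marks required, - marks excluded.
    """
    required, excluded, optional = [], [], []
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        term = (phrase or word).lower()
        if sign == "+":
            required.append(term)
        elif sign == "-":
            excluded.append(term)
        else:
            optional.append(term)
    return required, excluded, optional

def contains(text, term):
    """True if the word or phrase occurs in the text on word boundaries."""
    return re.search(r"\b" + re.escape(term) + r"\b", text.lower()) is not None

def matches(text, query):
    """Apply the +/- rules; with no + terms, any unprefixed term suffices ("or")."""
    required, excluded, optional = parse_query(query)
    if any(contains(text, t) for t in excluded):
        return False
    if required:
        return all(contains(text, t) for t in required)
    return any(contains(text, t) for t in optional)

query = '+noir +film -"pinot noir"'
print(matches("Classic film noir of the 1940s", query))     # True
print(matches("A pinot noir featured in the film", query))  # False
```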
Standard Search (cont.)
- Lower-case query letters match both upper and lower case; upper-case
letters match only upper case
- Accents are permitted; non-accented letters match both accented
and unaccented forms
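A rough sketch of the case and accent rules. It simplifies by deciding once per query word rather than letter by letter: a word with no upper-case letters ignores case, and a word with no accents ignores accents. The function names are hypothetical.

```python
import unicodedata

def strip_accents(s):
    """Remove combining marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def word_matches(query, word):
    """Compare one query word with one page word under the rules above."""
    if query == query.lower():            # all lower case: ignore case
        word = word.lower()
    if query == strip_accents(query):     # no accents: ignore accents
        word = strip_accents(word)
    return unicodedata.normalize("NFC", query) == unicodedata.normalize("NFC", word)

print(word_matches("chicago", "Chicago"))  # True: lower case matches both
print(word_matches("Chicago", "chicago"))  # False: upper case must match exactly
print(word_matches("cafe", "Café"))        # True: unaccented matches accented
print(word_matches("café", "cafe"))        # False: accented matches only accented
```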
Structural Elements
- title: within page title
- e.g. title:"University of Chicago"
- host: within the host name of the system
- link: within the URL of a hyperlink
- e.g. link:cs.uchicago/info
Structural Elements, cont.
- anchor: within the text of a hyperlink to another document
- url: within the URL of the page itself
- image: within URL of an image
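To make the field prefixes concrete, here is a sketch of how an indexer might pull these fields out of a page, using Python's standard html.parser module. The page, its URL, and the names FieldExtractor, page_fields and field_match are all made up for illustration. How AltaVista actually extracted fields is not described in these notes.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class FieldExtractor(HTMLParser):
    """Collect the text that each field-restricted query would search."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "anchor": [], "link": [], "image": []}
        self._open = []   # currently open title/a elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.fields["link"].append(attrs["href"])    # link: hyperlink URL
        if tag == "img" and attrs.get("src"):
            self.fields["image"].append(attrs["src"])    # image: image URL
        if tag in ("title", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._open:
            self.fields["title"].append(text)            # title: page title
        if "a" in self._open:
            self.fields["anchor"].append(text)           # anchor: hyperlink text

def page_fields(page_url, html):
    """Extract all searchable fields for one page."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    fields["url"] = [page_url]                           # url: the page's own URL
    fields["host"] = [urlparse(page_url).hostname or ""] # host: host name
    return fields

def field_match(fields, field, text):
    """True if text occurs (ignoring case) in any value of the given field."""
    return any(text.lower() in value.lower() for value in fields[field])

# Hypothetical page.
html = """<html><head><title>University of Chicago</title></head>
<body><a href="http://www.example.edu/info">CS department info</a>
<img src="/images/logo.gif"></body></html>"""
fields = page_fields("http://www.example.edu/index.html", html)
print(field_match(fields, "title", "University of Chicago"))
print(field_match(fields, "link", "example.edu/info"))
```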
AltaVista Scoring Algorithm
- A document has a higher score if:
- the query words or phrases are found in the first few
words of the document.
- the query words or phrases are found close to one another
in the document.
- the document contains more than one instance of the query
word or phrase.
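A toy scoring function built from the three criteria above. The weights, the "first few words" cutoff (head) and the closeness window are arbitrary assumptions chosen for illustration. The real AltaVista formula is not given in these notes.

```python
import re

def score(text, terms, head=10, window=5):
    """Toy relevance score: higher is better."""
    words = re.findall(r"\w+", text.lower())
    terms = [t.lower() for t in terms]
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    total = 0.0
    for t in terms:
        hits = positions[t]
        total += len(hits)              # more instances of the query word
        if hits and hits[0] < head:
            total += 2                  # found in the first few words
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if any(abs(x - y) <= window
                   for x in positions[a] for y in positions[b]):
                total += 3              # query words close to one another
    return total

doc_a = "Film noir is a film style; noir films are dark"
doc_b = ("This long page about wine and grapes and many other things "
         "eventually mentions noir once")
print(score(doc_a, ["film", "noir"]))
print(score(doc_b, ["film", "noir"]))
```

The first document scores higher on all three counts: both words appear early, close together, and more than once.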
AltaVista: Advanced Query
- Uses operators and, or, not, near, and
parentheses:
- Simple query:
- +noir +film -"pinot noir"
- Equivalent advanced query:
- noir and film and not "pinot noir"
- Note: sorting is denoted separately in Advanced Search;
default is not to sort
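The translation from a simple query to an advanced one can be sketched mechanically. This hypothetical helper, to_advanced, handles only terms carrying a + or - prefix; unprefixed terms are skipped because in a simple query they mainly influence ranking, which Advanced Search specifies separately.

```python
import re

def to_advanced(simple_query):
    """Rewrite +term / -term prefixes as 'and' / 'and not' operators."""
    parts = []
    pattern = r'(?<!\S)([+-])(?:"([^"]*)"|(\S+))'
    for sign, phrase, word in re.findall(pattern, simple_query):
        term = f'"{phrase}"' if phrase else word   # keep phrases quoted
        if not term:
            continue
        parts.append(("not " if sign == "-" else "") + term)
    return " and ".join(parts)

print(to_advanced('+noir +film -"pinot noir"'))
```

Keeping the quotes around the phrase matters: without them, the two words would be treated as separate terms.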
Advanced Search, cont.
- Operator near useful for first and last names
(with or without middle name or initial)
- Can sort on words other than the query word(s):
- e.g. search term: Chicago and restaurant, sort: romantic
- Can specify range of dates
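A sketch of how a near operator can handle names with or without a middle initial. The function name and the 10-word window are assumptions made for this example. The notes do not state the exact distance AltaVista uses.

```python
import re

def near(text, a, b, window=10):
    """True if words a and b occur within `window` words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

# All three forms of the name satisfy: Kristian near Hammond
for name in ["Kristian J. Hammond", "Kristian Hammond", "Hammond, Kristian"]:
    print(name, near(name, "Kristian", "Hammond"))
```

A phrase search for "Kristian Hammond" would miss the first and third forms, which is why near is the better fit for names.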
What searching isn't meant to be
- Organizing the whole Web isn't really feasible, but organizing your own pages is.
- Search engines are not a fix for bad organization
- Infostructure of your pages is critical.
- Think hard about the flow of information before you start designing.