How to Design a Search Engine to Actually Search

Several simple techniques can produce a search-engine that actually finds information, rather than just returning millions of matching pages. Searches should default to looking within paragraphs/sections, not across entire pages. Today's popular search-engines are mostly primitive word-scanners: they rarely search within sections of information, and they rarely hunt for the exact names requested, such as ID names containing dashes & slashes.

Steps

  1. Paradigm shift. Searching for actual information requires a new way of thinking about searching text and new ways to group text within a page. Hunting for information by looking at all the text on a page is too primitive; advanced techniques are needed to split a page into related sections. Just as grouping letters into words is a major step above searching for strings of characters, grouping words into sections or paragraphs is the next major breakthrough in search-engine technology (as trivial as it might seem). This is the paradigm shift: first, instead of treating a page as a stream of characters, view it as words; then, instead of treating it as a stream of words, search it as sections/paragraphs of words.
  2. Forget today's search engines. Forget the way popular search-engines work today (2006): they are primitive compared to techniques developed 25 years ago at NASA. Almost no search-engine today can pinpoint information the easy way; instead, they obsess over matching millions of related pages. It appears to be search-engine envy: "my results are bigger than yours." Search performance should not be judged by how many millions of pages were matched, but by two questions: Was the information pinpointed? How quickly were the questions answered? In any technological age, current techniques can be viewed as primitive compared to the better ideas of the future; however, to consider today's search-engines as stone-age dinosaurs is the beginning of wisdom. Many search-engines display ads on every page, so it is in the interest of the ad business to prolong a search across many pages of ads rather than to pinpoint information.
  3. Search within sections/paragraphs. Plan the search-engine to look within paragraphs or sections of text, rather than searching entire pages for matching words. Related words that pinpoint a topic usually occur within a sentence of each other. If paragraphs are too difficult to determine, allow a search-bracket of n words (such as 30 words) to confine the search to logically related text. In practice, searching across entire pages to find so-called related words is one of the most ignorant techniques ever adopted on a whim: searching entire pages might be trivial to implement, but being trivial is no excuse for continuing to use ignorant search-techniques, and more is needed than today's primitive, low-tech search-engines with their mindless search-all mentality. (The problem is rampant: even some book-searches hunt words across entire pages & cannot find words within just a paragraph/phrase.) The advanced techniques are not that much more difficult to implement.
  4. Search literal words. As information accumulates, differentiating between a/an/the becomes critical for pinpointing it. Ignoring some words by assumption builds in a bias that insults the intelligence of potential users: perhaps offer a rarely used option to ignore a list of words such as a/an/the/of/in but, by default, search for every word specified. Let users learn to omit restrictive words themselves; implicit omission of words is as limiting as implicit declaration of misspelled variable-names in computer software: don't do it. (If someone misspells "off" as "of", then what happens? The mistyped word is silently dropped from the search, and the user never learns why the results are wrong. That is the danger of implicitly ignoring words.)
  5. Search literal characters. If technically feasible, plan to search for literal strings such as Project XRAY-10/NOVA, where the dash ('-') and slash ('/') are critical to the search. In practice, those characters in the searched text can be converted to spaces when they are not in the search-phrase requested by the user; however, if the searched words are pre-stored, both forms can be indexed (both XRAY 10 NOVA and the literal name XRAY-10/NOVA).
  6. Expect intelligence. Don't be cruel or critical of today's primitive search-engines & book-search programs; many people are intelligent enough to progress beyond the low-tech search-ideas of today, both as developers & users of the new wave of advanced search-techniques. Computer technology, as a vast array of ideas, is complicated enough that almost anyone, even thousands of computer professionals, can overlook obvious advances & get stuck in yesteryear's technology. The field is a mix of smart + dense: pages can be stored in complex, sophisticated databases yet cannot be searched by paragraph, only by low-tech page-wide scans.
  7. Simple prototype. To test the ideas above, a simple prototype search-program could search for phrases as follows: convert each line of text into words separated by single spaces, with a trailing space added after the last word on the line; then pad each word in the search-phrase with a trailing space and scan for it among the blank-terminated words of the text string. Search each text-string in the file or web page the same way. Keep a counter of matched search-words; the text matches once every search-word has been counted.
  8. Piggyback search. Since many of today's search-engines match too many pages, software could be written to scan those matched pages and pinpoint information. A program could retrieve each matched page and, by searching within paragraphs while checking for dashes/slashes, pinpoint the information within the hundreds of web pages matched by the low-tech search. Such a program could also hunt for a/an/the within each matching web page, and pinpoint the results without the user wading through many pages of ads.
  9. Proven techniques. Don't say, "That's too advanced; they'll never understand searching by paragraphs & literal names." People have an amazing capacity to move beyond the limiting ideas of old search-techniques. Many of the above techniques were proven, in actual end-user applications, 25 years ago at NASA. That was an entire generation ago. It's just another case of "back to the future" in technology. The Renaissance overcame the Dark Ages, so better search-technology can, in fact, be achieved again.
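
Steps 3, 4 and 7 can be sketched together in a few lines of Python. This is only a rough illustration, not the NASA technique itself: the function names are made up for this example, paragraphs are assumed to be separated by blank lines, and a leading space is added alongside the trailing space from step 7 so that a search for "art" cannot match inside "cart". No words are ignored, so a/an/the count like any other word.

```python
import re


def split_paragraphs(page_text):
    """Split a page into paragraphs on blank lines (an assumed convention)."""
    return [p for p in re.split(r"\n\s*\n", page_text) if p.strip()]


def pad_words(text):
    """Rewrite text as ' word word word ' so every word is blank-delimited."""
    return " " + " ".join(text.split()) + " "


def paragraph_matches(paragraph, search_phrase):
    """True when every search word occurs as a whole word in the paragraph."""
    words = search_phrase.split()
    if not words:
        return False
    padded = pad_words(paragraph)
    matched = 0  # counter of search words found, as described in step 7
    for word in words:
        if " " + word + " " in padded:
            matched += 1
    return matched == len(words)


def search_page(page_text, search_phrase):
    """Return only the paragraphs that contain all of the search words."""
    return [
        p for p in split_paragraphs(page_text)
        if paragraph_matches(p, search_phrase)
    ]
```

With a page whose first paragraph mentions "red engine" and whose second mentions "blue car", search_page(page, "red car") returns nothing, even though both words appear somewhere on the page. The sketch is case-sensitive and treats punctuation as part of a word ("engine." is not "engine"); a real program would need to decide how to handle both.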
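
The dual indexing suggested in step 5 might look something like the following sketch. The names index_forms, build_index and lookup are hypothetical, and the index is just an in-memory dictionary from each stored form to the paragraph numbers containing it. A token with a dash or slash is stored both literally and as its separate parts.

```python
import re
from collections import defaultdict


def index_forms(token):
    """Return the forms to store: the literal token, plus its dash/slash parts."""
    forms = {token}
    if "-" in token or "/" in token:
        forms.update(part for part in re.split(r"[-/]", token) if part)
    return forms


def build_index(paragraphs):
    """Map every stored form to the set of paragraph numbers containing it."""
    index = defaultdict(set)
    for number, paragraph in enumerate(paragraphs):
        for token in paragraph.split():
            for form in index_forms(token):
                index[form].add(number)
    return index


def lookup(index, search_phrase):
    """Return the paragraph numbers that contain every search word."""
    words = search_phrase.split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits
```

Searching for the literal name XRAY-10/NOVA then finds only paragraphs where that exact name was written, while searching for XRAY 10 NOVA also finds paragraphs where the three words merely occur together. The sketch does not record word order or adjacency, and it does not strip trailing punctuation; both would need attention in practice.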

Tips

  • Keyword near. Some search-engines can already limit searches to partial paragraphs/sections by using the keyword NEAR in the search phrase. Experiments using the NEAR keyword can help demonstrate the advantages of limiting searches to paragraphs or sections, rather than searching across entire pages.
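
The idea behind NEAR, and behind the n-word search-bracket from step 3, can be tried out with a small sketch like this one. It is not how any particular search-engine implements NEAR; the function name within_bracket and the default of 30 words are assumptions for illustration. It slides a bracket of n words along the text and reports a match only when all search words fall inside one bracket.

```python
def within_bracket(text, search_words, bracket=30):
    """True if all search words occur inside some run of `bracket` words."""
    words = text.split()
    needed = set(search_words)
    if not needed:
        return False
    # Try a bracket starting at every word position in the text.
    for start in range(len(words)):
        window = set(words[start:start + bracket])
        if needed <= window:
            return True
    return False
```

Rebuilding the window at every position is wasteful but keeps the sketch short; a faster version would track word counts as the bracket slides.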

Warnings

  • Spam is killing search engines. The foolish, low-tech approach of matching words across the entire page, rather than within paragraphs/sections, has helped promote spam-pages that include 10,000 unrelated words in the hope of spam-matching into the search-results. As a result, today's search-engines are becoming flooded with spam-pages and could become practically useless unless they deter further spam-pages. When writing a program to scan within a particular search-engine's results, be prepared to switch to another search-engine that isn't being killed by spam as fast.
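
A program that scans a search-engine's results, as in step 8, could drop such spam-pages with a filter along these lines. This is a sketch under several assumptions: the pages have already been retrieved and converted to plain text (fetching is left out so the code does not depend on any one search-engine), paragraphs are separated by blank lines, and the name pinpoint and the page names in the usage note are invented for the example.

```python
import re


def pinpoint(pages, search_phrase):
    """Map each page name to its paragraphs that contain every search word.

    `pages` is a dict of page name -> plain text, already retrieved.
    Pages where the words only match page-wide are left out entirely.
    """
    needed = set(search_phrase.split())
    results = {}
    if not needed:
        return results
    for name, text in pages.items():
        hits = [
            " ".join(paragraph.split())  # normalise whitespace for display
            for paragraph in re.split(r"\n\s*\n", text)
            if needed <= set(paragraph.split())
        ]
        if hits:
            results[name] = hits
    return results
```

Given one page with a paragraph that really discusses the XRAY-10/NOVA project, and a spam-page where "XRAY-10/NOVA" and "project" sit in unrelated paragraphs, pinpoint(pages, "XRAY-10/NOVA project") keeps only the first page and returns the matching paragraph itself.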

Article provided by wikiHow, a wiki how-to manual. Please edit this article and find author credits at the original wikiHow article on How to Design a Search Engine to Actually Search. All content on wikiHow can be shared under a Creative Commons license.

Vinayak Mishra

Vinayak Mishra is a cricket enthusiast with a keen interest in web and mobile applications. He hails from Mithila, Nepal, is married to Rani, and is dad to his bundle of joy, Bunu. He lives in New Delhi, India. You can follow him on Twitter or check out his website, vnykmshr.com.