Zen of Information: Indexing files

A naïve approach to document retrieval was used in the early days of computer-based text processing: simple word matching. The computer would read every word of every document, one word and one document at a time, and compare each word with the sample word supplied by the user. The goal was to match sequences of bits: what the user provided against what was in the documents. This comparison took place not at the level of meaning but at the level of symbols. Meaning belongs to the level of information; the symbol belongs to the level of the package. The symbol, whether we call it a word or a sequence of bits and bytes, is the capsule of information.
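As a rough illustration of this word-by-word scan, here is a minimal Python sketch. The function name `sequential_search`, the sample file names, and the choice to split text on whitespace are all assumptions made for the example; they are not a description of any particular historical system.

```python
def sequential_search(documents, query):
    """Scan every word of every document and compare it to the query.

    `documents` maps a (hypothetical) file name to its text.
    Returns a list of (file name, word position) pairs.
    """
    matches = []
    for name, text in documents.items():
        # Read the document one word at a time.
        for position, word in enumerate(text.split()):
            # Pure symbol comparison: no meaning is involved.
            if word == query:
                matches.append((name, position))
    return matches


# Made-up sample collection, for illustration only.
documents = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog chased the cat",
}

print(sequential_search(documents, "cat"))
# → [('a.txt', 1), ('b.txt', 4)]
```

Every query rereads the whole collection, which is where the cost of this method comes from.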

The complexity of the operation increased when two or more words were supplied. The semantic relationship among the supplied words also became an issue: how should the words be combined? Overall, processing text by sequential scanning was terribly expensive.

This obstacle called for a new approach, one that used computers in a way that took advantage of their capabilities. The same approach is at the foundation of many novel computer applications. Special intermediate files were created, organized to encode the position of words in the stored documents. These are the index files, or more specifically the inverted index files. They store a list of all the words in all the documents in a collection, together with the name of each document (its file name) and the position of the word within it. This organization allows for all types of automatic, or computer-based, operations, expanding the capabilities of human processors.
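The organization of an inverted index can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index is held in memory rather than in a file, words are lowercased and split on whitespace, and the names `build_inverted_index`, `lookup`, and `documents_with_all` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to a list of (file name, position) entries."""
    index = defaultdict(list)
    for name, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((name, position))
    return index


def lookup(index, word):
    """Return every (file name, position) where the word occurs."""
    return index.get(word.lower(), [])


def documents_with_all(index, words):
    """Return the names of documents that contain every supplied word."""
    name_sets = [{name for name, _ in lookup(index, w)} for w in words]
    return set.intersection(*name_sets) if name_sets else set()


# Made-up sample collection, for illustration only.
collection = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog chased the cat",
}

index = build_inverted_index(collection)
print(lookup(index, "cat"))                        # → [('a.txt', 1), ('b.txt', 4)]
print(documents_with_all(index, ["cat", "dog"]))   # → {'b.txt'}
```

The collection is read once, when the index is built. After that, a single-word query is one lookup, and a multi-word query reduces to combining the stored lists, here by intersecting the sets of document names.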


Monday, July 26, 2010
