Thursday, August 19, 2010

Search interfaces

Information systems can be said to have three major components: technology, information, and the user base. The last century brought several generational cycles of technology, particularly on the hardware side. Miniaturization lowered prices and increased transactional speeds, making inexpensive solutions widely available to businesses and individuals. The growth in the number and diversity of software applications resulted in an explosion of possibilities for the creation and processing of information.

There is software to write, to read, to store, to analyze, to modify, to publish, and so on. Creations can be in text, images, moving images, sound, and other media. Increased connectivity has empowered users, who can now communicate and obtain information in ways only imagined before. But the transition to a fully automated environment has not been smooth across the information profession landscape.

The most obvious example is the old card catalog, now computer based and present in most libraries, where it is usually known as the online catalog. Information that used to be available on 3x5 cards is now available through computer terminals. The replication is so faithful that it reveals an issue: with a few exceptions, online catalogs are for the most part simplistic replicas of the old manual catalogs.

And it is this reality that points to an even bigger issue: the design of those systems does not exploit the capabilities of the technology to go beyond what is known, to expand human capabilities, and to accomplish new tasks or find new uses for the information.

Perhaps the online catalogs and their complex options are empowering to some users, but one could ask whether they also explain the popularity of simple interfaces, such as those offered by Google, Yahoo, and other Internet search engines. The multiple access points offered by many library interfaces are presented as separate, individualized operations. In the services mentioned above, most or all of these capabilities, such as advanced search, are consolidated into internal processes and functionality. While this adds complexity to the internal operations, the main goal seems to be an improved user experience. And users come back to use them again and again.

The question is, why do libraries settle for interfaces that fail to fully exploit the power of computer systems?

Tuesday, August 10, 2010

Vocabularies to represent information

Documents may include narratives, arguments, sections, and multiple other semantic components that make up an interrelated network of concepts. To represent all of that information, at least two types of descriptors are used: one generated from free language, and another generated from a list of terms, dubbed a controlled vocabulary.

Free-language descriptors are also called a folksonomy, and for the most part any term is acceptable. A controlled vocabulary is a list of terms (a thesaurus) or a lexical organization (an ontology) that accounts for the most salient, or significant, concepts in the collection. A third type is often left alone as if it were invisible, unnecessary, or useless, and perhaps taken for granted because of its immediacy: the actual vocabulary in the text of the documents. In all cases, one important issue stems from the limitations and characteristics of language: deciding which information elements to include and which to exclude.

These three tools are used in the creation of representative surrogates of the documents. All of them have positive and negative characteristics, and all of them involve tradeoffs. Their evaluation as sources must consider standardization, scope of coverage, specificity, exhaustivity, number of entries, and so on.

For example, free-text representation may be very flexible but lacks standardization across systems. Controlled vocabularies lack flexibility, and properly capturing the specificity of significant concepts may be difficult. And the vocabulary actually used in the document set is only one of many possible ways to express a concept, because synonyms and other variants abound.
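
A minimal sketch in Python, assuming a toy synonym table and hypothetical document ids, makes the synonym tradeoff concrete: a free-text index retrieves only documents that used the exact word, while a controlled vocabulary normalizes every synonym to one preferred term at both indexing and query time.

```python
# Hypothetical terms and document ids; a sketch of the tradeoff, not any
# particular system's implementation.
synonym_to_preferred = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
}

# Free-text index: the word each author actually used -> document ids.
free_text_index = {"car": {1}, "auto": {2}, "automobile": {3}}

# Free-text search retrieves only documents that used the exact word.
print(free_text_index.get("car", set()))  # {1}; docs 2 and 3 are missed

# Controlled-vocabulary index: normalize every term to its preferred form.
controlled_index = {}
for term, doc_ids in free_text_index.items():
    preferred = synonym_to_preferred[term]
    controlled_index.setdefault(preferred, set()).update(doc_ids)

# The same query, normalized first, now retrieves all three documents.
print(controlled_index[synonym_to_preferred["car"]])  # {1, 2, 3}
```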

Representation of information is a difficult chore.

Monday, August 9, 2010

Document description and categories

Categorization of information, and the use of categories as a form of surrogate for objects, appeared early in history. Documents of one type, such as bills of sale, would be kept in generally the same location, separate from other documents, such as cargo inventories. This principle can still be seen in libraries all over the world. Organizing documents into categories facilitates access to the documents.

As the number and variety of documents grew, the number of categories also grew in size and complexity. The creation of additional categories and subcategories made sense, which in turn increased the descriptive specificity of each scheme. Categories had always been considered access points to the documents, but the increased specificity of the new schemes affected the perception of what their role could be. In time, subcategories were promoted from access or entry points that described what the documents were about, their aboutness, to representations of the information in the documents themselves.

If documents are analogous to capsules of information, then information is inside the documents. Categories and subcategories were useful to describe the capsules, or documents. Today, their usefulness has expanded to describing the capsules' contents. The problem is that these representational constructions cannot capture all of the information in the document set, only a limited subset.

Users of systems never know what information is not included in the set of surrogate representations.

This problem is a fundamental failure of document retrieval systems that use this type of representation alone. If some information is not represented, it will not be found.
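
A minimal sketch, with hypothetical documents and descriptors, shows the failure mode: retrieval runs only against the surrogate descriptors, so a term present in the full text but absent from the surrogate retrieves nothing.

```python
# Hypothetical document: the surrogate carries only two of the concepts
# actually present in the full text.
documents = {
    1: {
        "surrogate": {"taxation", "trade"},
        "full_text": "A bill of sale recording taxation, trade, and grain prices.",
    },
}

def search_surrogates(term):
    """Return ids of documents whose surrogate contains the term."""
    return [doc_id for doc_id, rec in documents.items()
            if term in rec["surrogate"]]

print(search_surrogates("taxation"))  # [1]: represented, so found
print(search_surrogates("grain"))     # []: in the document, but not in its surrogate
```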

Sunday, August 8, 2010

Controlled Vocabularies and Representation

An information space is defined by a set or collection of documents. A list of terms is built to represent the information elements in the information space, so that terms, either alone or in combination, can be used to represent all of the documents in the set. This list of terms is known as the controlled vocabulary, and a short list of its terms can be used as a surrogate of each document. The list of terms is assumed to convey an accurate representation of the information space, and also of the particular properties of the document set, implying that a finite list of descriptors can represent all the information in the documents, or at least the most relevant information. In addition, the representation must accurately capture the purpose, scope, audience, and level of expertise found in the documents.

The parallel with the use of letters is clear. After all, words and their myriad meanings are the result of combining a finite number of letters. Likewise, a finite number of carefully selected terms can represent the universe of ideas and concepts in the document collection. This powerful argument is behind the creation of lists of keywords as representational building blocks of complex information concepts.
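
A back-of-the-envelope calculation, with purely illustrative numbers, shows why the argument is appealing: even a modest controlled vocabulary yields an astronomical number of distinct term combinations when terms are assigned in groups.

```python
import math

# Illustrative numbers only: a controlled vocabulary of 1,000 terms,
# with 5 terms assigned to each document as its surrogate.
vocabulary_size = 1000
terms_per_surrogate = 5

# Number of distinct 5-term surrogates that can be formed.
print(math.comb(vocabulary_size, terms_per_surrogate))  # 8,250,291,250,200
```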

But, is it true that words can represent everything?

Thursday, August 5, 2010

Controlled vocabularies

Adherence to a standardized scheme of categorization is widely recognized and accepted as valid and useful for facilitating information management, particularly storage and access. The categories in such a scheme are meant to capture all the conceptual categories that make up an information space, the particular conceptual universe under consideration. This is a tall order, and information professionals have developed a variety of schemes to address a multitude of such environments. The Dewey Decimal Classification and the Library of Congress Classification are two of the most popular schemes whose categories can reference the conceptual contents of resources, works, or objects, physical or electronic.

The pragmatic application of this principle has proven useful and sound, but it places strong demands on those who implement, maintain, and use these schemes. Categorization systems must account for the rapid growth of resources, the innovation and inventiveness of authors and creators, and the evolution of language. All of these are inherent properties of human nature and represent a moving target for the information profession.

These demands are not new but have become more visible in recent years due to the explosion in the volume, quality, and variety of information generated. To keep up with this pressure, the information profession developed tools such as dictionaries, indexes, thesauri, lists, pathfinders, suggested materials, and the like.

Some of those tools are referred to as controlled vocabularies, and they expand the classes in the corresponding categorization scheme. They are formed by either keywords that represent categories or semantic constructions that relate two or more concepts. It is important to emphasize that controlled vocabularies are used as surrogates to represent the content of information objects, and that they are used for storage and access, as entry points to the documents.

There are document-like surrogates, such as abstracts and summaries, also used as entry points in various systems, but they are not considered controlled vocabularies. At the lowest level in the hierarchical taxonomy of a categorization scheme are the specific keywords to be used for objects in that category. Those keywords are used alone or in combination and are assumed to be sufficient to describe all of the significant information in the particular domain.

Tuesday, August 3, 2010

Processing text, or word processing?

Text processing and word processing can be confusing terms because text and words seem to be the same thing. However, there is an important distinction when they are used in reference to computer-based processing. Text normally refers to the plain encoding of alphanumeric characters. Files encoded as text are also known as text files. They are the most basic and common type of computer encoding.

Word processing adds formatting and other information related to the text. These files are normally encoded differently from text files and are referred to as binary files. Users can access a text file directly from the operating system, but binary files require a dedicated program to decode the data.

The astute reader can see that a document so defined may consist not only of words but also of other types of elements. This is, in fact, true. A binary file can contain anything, and a program is required to decode the document, identify its elements, and process them, whether for display or any other type of action. In particular, accessing a word processing file requires a program known as a word processor, such as the popular Microsoft Word for Windows.

Summarizing: a binary file can encode all types of documents, including still images or audio. A word processing file is a type of binary file, which may also include formatting, location, and other such information about the file's content. A text file is formed by the alphanumeric characters of the text alone. Text files are the common encoding that different computers can exchange. For this reason, text is the de facto communication encoding on the web and in computer-to-computer protocols across the Internet.
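
A minimal sketch in Python, with hypothetical file names and a made-up binary layout, illustrates the distinction: the text file's bytes are readable as-is, while the binary file's header is opaque unless the reading program knows the format.

```python
import struct

# Plain text file: the stored bytes map directly to readable characters.
with open("note.txt", "w", encoding="ascii") as f:
    f.write("Plain text is readable by any program.")
with open("note.txt", "rb") as f:
    print(f.read())  # b'Plain text is readable by any program.'

# A made-up "word processing" format (purely illustrative): a 7-byte
# header packing a style flag, a font size, and the text length,
# followed by the text itself.
body = b"Formatted text."
with open("note.bin", "wb") as f:
    f.write(struct.pack(">BHI", 1, 12, len(body)) + body)

with open("note.bin", "rb") as f:
    raw = f.read()
print(raw[:7])  # header bytes: opaque without the decoder

# Only a program that knows the layout can decode the file.
flag, size, length = struct.unpack(">BHI", raw[:7])
print(raw[7:7 + length].decode("ascii"))  # 'Formatted text.'
```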

Sunday, August 1, 2010

Information Retrieval Systems

Applying computers to the number-crunching needs of the day was one of the drivers in the implementation and maturation of computer technologies, but text processing was not far behind. After all, mathematics and text converged in areas related to encoding, cryptology, and compression.

Engineers were aware that the new technologies of the 1950s might make multiple applications possible, including those to create, edit, and otherwise process text documents. By the 1960s the transistor had enabled a revolution in miniaturization. At about the same time, advances in database technology were providing ideas about storage and access. By the 1970s, network technology took computers to a larger scale.

During all this time, a small cadre of scientists and practitioners in the little-known art of information retrieval had been researching and implementing systems to process text. This included creation and editing, as well as other supporting tasks such as the storage and retrieval of documents, tasks that were, in their own right, complete, elaborate, and complex.

Terms like text processing, word processing, and information retrieval were still confusing but were slowly starting to convey distinct types of activities. The idea of a system that processes information was no longer far-fetched, and the dream of Vannevar Bush was now possible. Information systems consolidated networks, databases, and all the newly implemented ideas related to the processing of information. The information retrieval (IR) system would be only one component.

Search engines

The IR system was the component that located a document to satisfy a given query created by a user. The paradigm considered that the user query presented to the system took the form of a template that started with “I want a document that contains…”, after which the user would type a list of words and perhaps some parameters that provided a semantic relationship among the words. These parameters could be Boolean (such as AND, OR, NOT), could specify distance (how physically close two words should appear), could indicate that synonyms or other semantic expansions should be used, and so on. The system would then identify a document, or several documents, that matched the query and possibly present them in some ranked order.
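
A minimal sketch, with hypothetical documents, of how such Boolean queries can be evaluated: an inverted index maps each word to the set of documents containing it, and AND, OR, and NOT become set operations on those posting sets.

```python
# Hypothetical documents; a sketch of the Boolean paradigm, not any
# particular system's implementation.
docs = {
    1: "the card catalog lists the library holdings",
    2: "online catalogs replaced the card catalog",
    3: "search engines rank documents for a query",
}

# Inverted index: word -> set of ids of documents containing it.
index = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        index.setdefault(word, set()).add(doc_id)

def posting(word):
    """Posting set: ids of all documents containing the word."""
    return index.get(word, set())

# "card AND catalog" -> intersection of posting sets.
print(posting("card") & posting("catalog"))    # {1, 2}
# "catalog OR query" -> union.
print(posting("catalog") | posting("query"))   # {1, 2, 3}
# "catalog AND NOT online" -> set difference.
print(posting("catalog") - posting("online"))  # {1}
```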

The only difference between an IR system and a search engine is the name, and perhaps some of the document-processing functionality.