Zen of Information

Saturday, September 18, 2010

Information and Entropy

Shannon and Weaver (S&W) presented an engineering solution to an information problem with something that would eventually be known as information theory. Though highly mathematical, its philosophical roots are intrinsically linked to a physical property known as entropy, which in thermodynamics is related to the portion of the thermal energy in a closed system that is not available to do useful work. In another definition, entropy measures the degree of disorder in the system.

S&W equate uncertainty, or information that is not known, with entropy, or the degree of disorder in a closed system. They do so by assuming that the probabilistic amount of order with respect to information, that is, the natural state of organized information in a closed system, is constant and equal to 100%. Thus the goal of their method is to reduce and ultimately eliminate uncertainty.

Consider two examples, keeping in mind that S&W addressed the problem of noise reduction in electronic transmissions. In the first case, a message is correctly transmitted and received: there is no loss of information, and uncertainty is equal to 0. In the second case, the message is garbled and the information is totally lost, so that uncertainty is equal to 1. These are the two extremes, since probability is bounded by 0 and 1.

The usefulness of S&W becomes most apparent when the message is partially garbled. It is in those cases that the importance of the assumptions used to calculate probabilities emerges. Faster computer processing and inexpensive data storage enabled better corrective solutions: S&W provided the theoretical approach, and computer technologies allowed its implementation.
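The three cases above (clean, garbled, and partially garbled) can be made concrete with a small sketch. It assumes a simplified model that is not spelled out in this post: each transmitted bit is flipped by noise with a fixed probability, and uncertainty is measured with Shannon's binary entropy formula. The function names and the sample probabilities are illustrative choices only.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of an event that happens with probability p."""
    if p in (0.0, 1.0):
        # A certain outcome carries no uncertainty.
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def usable_information(flip_probability: float) -> float:
    """Bits of information that survive per transmitted bit in this toy model."""
    return 1.0 - binary_entropy(flip_probability)


# Assumed noise levels: no noise, partial garbling, total garbling.
for flip in (0.0, 0.1, 0.5):
    print(
        f"flip probability {flip}: "
        f"uncertainty {binary_entropy(flip):.3f}, "
        f"usable information {usable_information(flip):.3f}"
    )
```

With no noise the uncertainty is 0, and when each bit is as likely to be flipped as not the uncertainty reaches 1 and nothing usable survives. The partially garbled case falls in between, and the number obtained there depends entirely on the probability assumed for the noise.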

But in general, with respect to computer-based information processing, and in particular with respect to information retrieval, there is great temptation to use the theory without realizing that the environment for which it was developed is a closed system. Language used for communication is an open system.

In other words, information theory has limitations. It behooves the researcher and student of information processing to understand that S&W bounded this framework and its environment in a very particular way. The enticing opportunity of information theory should be balanced against the reality of the results.

For these reasons, one should ask why systems work at all when using this and other unsuitable theories, rather than trying to gain efficiency in processes that are theoretically flawed.

Sunday, September 12, 2010

The Information Science Space

Living organisms are consumers of information, which makes it important to ask basic questions about human consumption of information. Even basic human activities, such as breathing, require the processing of information.

Processing of information is a cognitive activity that is manifested through thoughts and actions. Cognition is, therefore, an important field of study for Information Science and thus part of the Information Science Space.

The expansion of computer technologies in the 1950s boosted Cognitive Science. Cognitive theories, models, and systems could now be implemented and tested quickly and easily. As computer information systems gained territory in daily life, their role in society became undeniable.

The field of study known as human-computer interaction (HCI), like several others at the interface of computer technology with humans and with information, has established the bona fide inclusion of computer technology in the Information Science Space.

NOTE: The study of computer technology takes place in a family of fields under different disciplines such as computer science and information systems.

Because human life does not take place in a vacuum but is part of an interactive community, several sociological aspects also emerge. Various components of these rich human-to-human and human-to-information interactions are of substantive interest to multiple disciplines, from the humanities to the social sciences. This is why the study of Information Science is multidisciplinary.

Several fundamental issues:

Is it true that living organisms are consumers of information?
If so, one could ask if there are any types of human processes and behavior not immediately related to information.
One can define information processing very broadly, so as to include breathing, or very narrowly, so as to include only activities such as memorization.
Information science is not only concerned with information use but also with all processes related to information.
Understanding the use of information and other information processes remains difficult, in part due to the lack of agreement about how specifically information needs to be defined and operated upon.

Thursday, August 19, 2010

Search interfaces

Information systems can be said to have three major components: technology, information, and the user base. The last century brought several generational cycles of technology, particularly on the hardware side. Miniaturization lowered prices and increased transactional speeds, which increased the availability of inexpensive solutions for businesses and individuals. The growth in the number and diversity of software applications resulted in an explosion of possibilities for the creation and processing of information.

There is software to facilitate writing, reading, storing, analyzing, modifying, publishing, etc. Creations can be in text, images, moving images, sound, etc. Increased connectivity has empowered users, who can now communicate and get information in ways only imagined before. But the transition to a totally automated environment has not been smooth across the information profession landscape.

The most obvious example is the old card catalog, which is now computer based and present in most libraries. It is usually known as the online catalog. Information that used to be available on 3x5 cards is now available through computer terminals. The replication is so perfect that it reveals an issue: with a few exceptions, online catalogs are simplistic replicas of the old manual catalogs.

And it is this reality that speaks loudly about an even bigger issue: the design of those systems does not exploit the capabilities of the technology to go beyond what is known, to expand human capabilities, and to accomplish new tasks or find new uses for the information.

Perhaps the online catalogs and their complex options are empowering to some users, but one could ask whether that complexity also explains the popularity of simple interfaces, such as those offered by Google, Yahoo, and other Internet search engines. The multiple access points offered by many library interfaces are presented as separate, individual operations. In the search engines mentioned above, most or all of these capabilities, such as advanced search, are consolidated into internal processes and functionality. Although this adds complexity to the internal operations, the main goal seems to be the improvement of the user experience. And the users come back to use them again and again.

The question is, why do libraries settle for interfaces that fail to fully exploit the power of computer systems?

Tuesday, August 10, 2010

Vocabularies to represent information

Documents may include narratives, arguments, sections, and multiple other semantic components that make up an interrelated network of concepts. To represent all of that information, at least two types of descriptors are used. One is generated from free language and the other is drawn from a list of terms, known as a controlled vocabulary.

Free language descriptors are also called a folksonomy, and for the most part any term is acceptable. A controlled vocabulary is a list of terms (a thesaurus) or a lexical organization (an ontology) that accounts for the most salient, or significant, concepts in the collection. A third type is often overlooked as if it were invisible, unnecessary, or useless, and is perhaps taken for granted because of its immediacy: the actual vocabulary in the text of the documents. In all cases, one important issue stems from the limitations and characteristics of language: the decisions about which information elements to include or exclude.

These three tools are used in the creation of representative surrogates of the documents. All of them have positive and negative characteristics. Their evaluation as sources must consider standardization, scope of coverage, specificity, exhaustivity, length in number of entries, etc. All of them have tradeoffs.

For example, free text representation may be very flexible but lacks standardization across systems. Controlled vocabularies lack flexibility, and it may be difficult for them to capture the specificity of significant concepts. The language actually used in the document set offers only one of many possible ways to express a concept, since alternatives such as synonyms always exist.
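The tradeoff between free language and a controlled vocabulary can be sketched in a few lines. Everything in the sketch is hypothetical: the thesaurus entries, the free text tags, and the function name were made up for illustration, and a real thesaurus would be far larger and richer.

```python
from typing import Optional

# Hypothetical thesaurus: every entry term points to one preferred term.
THESAURUS = {
    "automobile": "automobile",
    "car": "automobile",
    "auto": "automobile",
    "motorcar": "automobile",
}


def to_preferred_term(term: str) -> Optional[str]:
    """Return the preferred term for a descriptor, or None if the vocabulary lacks it."""
    return THESAURUS.get(term.strip().lower())


# Hypothetical free text tags assigned by different people to similar documents.
free_text_tags = ["Car", "auto", "motorcar", "hatchback"]

for tag in free_text_tags:
    print(tag, "->", to_preferred_term(tag))
```

The synonyms collapse into a single standardized descriptor, which is the strength of the controlled list. The more specific tag has no entry and is lost unless someone extends the vocabulary, which is the lack of flexibility described above.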

Representation of information is a difficult chore.

Monday, August 9, 2010

Document description and categories

Categorization of information, and the use of categories as surrogates for objects, appeared early in history. Documents of one type, such as bills of sale, would be kept in generally the same location, separate from other documents, such as cargo inventories. This principle can still be seen in libraries all over the world. Organizing documents into categories facilitates access to the documents.

As the number and variety of documents grew, the categories also grew in number and complexity. The creation of additional categories and subcategories made sense and, in turn, increased the descriptive specificity of each scheme. Categories had always been considered access points to the documents, but the increased specificity of new schemes affected the perception of what their role could be. In time, from being access or entry points that described what the documents were about, or their aboutness, the subcategories were promoted to representations of the information in the documents.

If documents are analogous to capsules of information, information is inside of documents. Categories and subcategories were useful to describe the capsules, or documents. Today, their use has expanded to describing the capsule’s contents. The problem is that these representational constructs cannot capture all of the information in the document set, only a limited part of it.

Users of systems never know what information is not being included in the set of surrogate representations.

This problem is a fundamental failure of document retrieval systems that use this type of representation alone. If some information is not represented, it will not be found.
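A minimal sketch can illustrate the failure. The two documents, their descriptors, and the function names below are invented for this example, and exact matching against descriptors is an assumption made to keep the sketch short.

```python
# Invented documents: each has its full text and a short surrogate of descriptors.
documents = {
    "doc1": {
        "text": "Bill of sale for two oxen, with a note on a drought that year.",
        "descriptors": {"bills of sale", "livestock"},
    },
    "doc2": {
        "text": "Cargo inventory listing grain and oil shipped by river.",
        "descriptors": {"cargo inventories", "grain"},
    },
}


def search_surrogates(query: str) -> list:
    """Return ids of documents whose descriptors include the query exactly."""
    return sorted(
        doc_id for doc_id, doc in documents.items()
        if query.lower() in doc["descriptors"]
    )


def search_full_text(query: str) -> list:
    """Return ids of documents whose text contains the query as a substring."""
    return sorted(
        doc_id for doc_id, doc in documents.items()
        if query.lower() in doc["text"].lower()
    )


print(search_surrogates("livestock"))  # represented in a surrogate
print(search_surrogates("drought"))    # present in the text, absent from surrogates
print(search_full_text("drought"))     # found only by looking inside the document
```

The search for "drought" against the surrogates returns nothing, and the user has no way to tell whether the collection lacks that information or whether the surrogates simply left it out.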

Sunday, August 8, 2010

Controlled Vocabularies and Representation

An information space is defined by a set or collection of documents. A list of terms is built to represent the information elements in the information space so that terms, either alone or in combination, can be used to represent all of the documents in the set. The list of terms is known as the controlled vocabulary. A short list of the terms can be used as a surrogate of each document. The list of terms is assumed to convey an accurate representation not only of the information space but also of the particular properties of the document set. This implies that a finite list of descriptors can be used to represent all the information in the documents, or at least the most relevant information. In addition, the representation will have to accurately capture the purpose, scope, audience, and level of expertise found in the documents.

The parallel with the use of letters is clear. After all, words and their myriad meanings are the result of combining a finite number of letters. Likewise, a finite number of carefully selected terms can represent the universe of ideas and concepts in the document collection. This powerful argument is behind the creation of lists of keywords as representational building blocks of complex information concepts.

But is it true that words can represent everything?

Thursday, August 5, 2010

Controlled vocabularies

Adherence to a standardized scheme of categorization is widely recognized and accepted as valid and useful to facilitate information management, particularly its storage and access. The categories in one of these schemes are meant to capture all the conceptual categories that make up an information space, the particular conceptual universe that is being considered. This is a tall order, and information professionals have developed a variety of schemes to address a multitude of such environments. The Dewey Decimal and Library of Congress classifications are two of the most popular schemes whose categorization can reference the conceptual contents of the resources, works, or objects, physical or electronic.

The pragmatic application of this principle has proven to be useful and sound, but it places a strong demand on those who implement, maintain, and use these schemes. Categorization systems must account for the rapid growth of resources, the innovation and inventiveness of authors and creators, and the evolution of language. All of these are inherent properties of human nature and represent a moving target for the information profession.

These demands are not new but have become more visible in recent years due to the explosion in the volume, quality, and variety of information generated. To keep up with this pressure, the information profession developed tools such as dictionaries, indexes, thesauri, lists, pathfinders, suggested materials, etc.

Some of those tools are referred to as controlled vocabularies, and they expand the classes in the corresponding categorization scheme. They consist of either keywords that represent categories or semantic constructions that relate two or more concepts. It is important to emphasize that controlled vocabularies are used as surrogates to represent content in information objects, and that they are used for storage and access, as entry points to the documents.

There are document-like surrogates, such as abstracts and summaries, that are also used as entry points in various systems, but they are not considered controlled vocabularies. At the lowest level in the hierarchical taxonomy of the categorization schemes are the specific keywords to be used for objects in that category. Those keywords are used alone or in combination and are assumed to be sufficient to describe all of the significant information in the particular domain.