Zen of Information: July 2010

Friday, July 30, 2010

Retrieving information, or documents?

The answer depends on whether one document in its totality satisfies the totality of the information need; in other words, whether a single document completely satisfies the user’s need. Ideally, the document will perfectly fit the information need without adding anything to it.

But information needs come in all shapes and forms whereas documents are pretty much fixed. This is true even for dynamic documents once they are created. Dynamic documents are ad hoc creatures that nevertheless encapsulate static content from that point forward.

Most documents are, in fact, compositions that authors put together in response to some urge. However, documents are also composites of integrated elements. In text, these would be chapters, sections, paragraphs, etc. Going to an even deeper level of granularity, one can count words and even letters as atomic elements. Even phonemes may enter the scene as yet another form of information. But in the end, these elements are parts of the document. Current computer-based information systems deliver documents because documents are a convenient unit of analysis.

However, much activity has been taking place in areas of computer-based processing of information. Text mining is one of them; it refers to the identification of segments in a narrative that have a particular meaning and can be followed up or built upon to construct and support bold statements.

Tuesday, July 27, 2010

Information representation

The search process has been greatly simplified throughout the years. A user enters one or several words. The system finds the names of files where the words appear. If the user supplies a combination of words, the system can find them individually or in some type of semantic relationship as in Boolean, or logic, operations. It is possible to ask the equivalent of “I want documents that discuss giraffes in Costa Rica”.
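As a minimal sketch of the Boolean combination described above, the following Python fragment returns the names of the files that contain all (AND) or any (OR) of the supplied words. The collection, the file names, and the function names are hypothetical and serve only as illustration; real systems do considerably more than this.

```python
import re

# Hypothetical toy collection: file name -> text (not taken from any real system).
DOCUMENTS = {
    "giraffes.txt": "Giraffes live in the savannas of Africa.",
    "costa_rica.txt": "Costa Rica is known for its rain forests.",
    "zoo.txt": "A zoo in Costa Rica cares for two giraffes.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def boolean_and(terms, documents):
    """Return the sorted names of the files that contain every term."""
    wanted = {term.lower() for term in terms}
    return sorted(name for name, text in documents.items()
                  if wanted <= tokenize(text))


def boolean_or(terms, documents):
    """Return the sorted names of the files that contain at least one term."""
    wanted = {term.lower() for term in terms}
    return sorted(name for name, text in documents.items()
                  if wanted & tokenize(text))


# "I want documents that discuss giraffes in Costa Rica"
print(boolean_and(["giraffes", "Costa", "Rica"], DOCUMENTS))
```

Note that the match is on the words themselves: a file that mentions giraffes and Costa Rica in unrelated sentences would be returned just the same.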

Indexing files form a representation system, a system representing the information in a collection of documents, a computer-based bibliographic representation. But there are other bibliographic representations. Document titles and other types of summaries are good examples of representation systems as well.

The reader will notice that each type of representation provides its own set of mutual semantic relationships, degree of specificity, and other attributes that correspond to its particular interpretation of the original information.

Monday, July 26, 2010

Indexing files

A naïve approach to document retrieval was implemented in the early days of computer-based text processing: simple word match. For this, the computer would read each word of each document, one word and one document at a time. The process included a comparison of each word with the sample word initially supplied by the user. The goal was to match sequences of bits: what the user provided against what was in the documents. The reader understands that this comparison was not at the level of meaning but at the level of symbols. Meaning is at the level of information and symbol is at the level of package. The symbol, or word, or sequence of bits and bytes, is the capsule of information.
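The simple word match can be sketched in a few lines of Python. This is only an illustration under assumed names (`sequential_scan`, `docs`, and the two sample texts are hypothetical), not a reconstruction of any historical system.

```python
def sequential_scan(query_word, documents):
    """Compare query_word against every word of every document, in order.

    Returns (file name, word position) pairs, one for each exact match.
    """
    matches = []
    for name, text in documents.items():                 # one document at a time
        for position, word in enumerate(text.split()):   # one word at a time
            if word == query_word:                       # symbols are compared, not meanings
                matches.append((name, position))
    return matches


# Hypothetical two-document collection used only for illustration.
docs = {
    "a.txt": "the bank of the river",
    "b.txt": "the bank approved the loan",
}
print(sequential_scan("bank", docs))
```

Both files match "bank" even though the word means something different in each, while a query for "Bank" matches nothing because the symbols differ. Every query also rereads the whole collection, which is where the cost of this approach comes from.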

The complexity of the operation would increase if two or more words were supplied. Moreover, the semantic relationship of the supplied words was also an issue, or rather, how the words would be combined. In all, processing text by sequential scanning was terribly expensive.

This obstacle required a new approach to using computers, one that took advantage of their capabilities. This is the same approach that is at the foundation of many novel computer applications. Special intermediate files were created with a particular organization that encoded the relative position of words in the stored documents. These are the index files, or more specifically, the inverted index files. These files store a list of all the words in all the documents in a collection, including the name of the document (in terms of file name) and the position of the word in the document. This organization allows for all types of automatic, or computer-based, operations, expanding the capabilities of human processors.

Organizing and representing information

There are many types of documents, and they come in many formats. They range from small leaflets to volumes of books. In general, a document is a package of information created to preserve some particular focal information and items related to that information. The components are normally organized in some order that may be sequential, hierarchical, or a combination of both.

Packages are convenient containers but also serve for storage purposes. Documents with related content, or the information they carry, are organized in collections. Semantically speaking, in an abstract space of information, related documents would be placed closer to each other than to unrelated documents. The semantic distance is a measure of how similar documents are to each other.
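The post does not say how semantic distance would be computed. One common stand-in, used here purely as an assumption, is the cosine distance between word-count vectors. The function names and sample texts below are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(text.lower().split())


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two word-count vectors.

    0.0 means the same word proportions; 1.0 means no words in common.
    """
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty text shares nothing with any other text
    return 1.0 - dot / (norm_a * norm_b)


related = cosine_distance("giraffes eat leaves", "giraffes eat acacia leaves")
unrelated = cosine_distance("giraffes eat leaves", "banks approve loans")
print(related < unrelated)
```

This measure compares the words that two texts share, not what the texts mean, so it is a distance between packages rather than between the information they carry.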

Electronic documents have particular characteristics that make them amenable to automatic processing, such as encryption and compression. Although it is important to differentiate information from the package wherein it exists, most processing of information, particularly computer-based information processing, treats the symbol or package as equal to the information it carries. In other words, to the computer, “paper” is only a word; it is not really paper.

Likewise, computer-based information processing is really processing of words, symbols, bits and bytes, but not meanings and concepts.

Thursday, July 22, 2010

Packages and capsules of information

There is so much information that even with computers we find it difficult to keep track of everything. For this reason alone we usually don’t directly examine the actual information objects but their representation. We go to an electronic bookstore and look at images, reviews and summaries of books. A search engine gives us titles and brief sentences from the web pages, or snapshots of images. To have contact with the actual information objects is expensive and time consuming. Just 100 years ago we needed to wait months to go from one part of the world to another. There are Internet applications that can take us around the globe for a virtual visit in seconds. If we want to get specialized pictures of exotic life forms we can probably find them in a book or in a video. We interact with representations most of the time. Even words are representations.

Representation of information is related to its use, to its storage for later use, to its retrieval by command. It is not only in reference to certain obvious types of media such as print and electronics but also to other more abstract forms as in the discussions of concept representation, organizational learning and knowledge, or mental structures.

Void of a definition of information, to look at representation one must at least define a unit of information and examine how it is packaged. Examining information as packages may be useful. After all, we are familiar with books, documents, and other packages of information and not with information itself. Packages of information are concrete and information is an abstract construct or phenomenon.

Based on the idea that information answers some information need, and that it exists in some package of information, identifiers of these questions are categorized according to type of question. The set of issues related to this categorization is known as information organization; the materials that are organized are the information packages and the capsules of information at varying degrees of specificity in each package.

Examples of questions answered by different information capsules that may exist in one or several packages of information: Who wrote Tom Sawyer? What books did Mark Twain write? Who is Samuel Langhorne Clemens? All of these questions are related but separate. Each represents something, a capsule of information. Each question is a representation in and of itself. The answers also represent the same thing in an illustration of how a capsule of information may have multiple representations, and that capsules may be part of multiple packages.

A short linguistic construction that combines all of these related pointers would address all questions, and may be considered a representation. This construction could take the form of a sentence such as: Samuel Langhorne Clemens, also known as Mark Twain, wrote the novel Tom Sawyer. A larger paragraph would expand the idea, convey the same topic, and include new pieces of information that were not asked for. Likewise, there are multiple books on the life and work of Mark Twain, which would span the topic even more.

Wednesday, July 21, 2010

Does information have parts, or is it a whole?

George Miller wrote “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information” (The Psychological Review, 1956, vol. 63, pp. 81-97), advancing the idea of limits in the capacity of short-term memory. In general, there seem to be limiting thresholds of the amount of information one can practically receive, decode, and integrate.

Specificity may be one way to encapsulate information and help overcome this limitation. Telephone numbers can be used to illustrate this point. A telephone number has 10 digits separated into 3 parts: area code, exchange, and 4-digit number. It is easier to remember the number in terms of chunks rather than by its individual digits. This example also illustrates a second point, the encapsulation of information.

Observing a surface, there are at least two measures of importance about it: perspective and distance. Perspective is the relative position from where the observation was done and determines what can be seen. Distance determines the detail, or resolution, of what is seen. These two are important when considering ambiguity.

These two measures appear with the same meaning and role in reference to an information package. Perspective is a function of personal inclination and preference adding bias to the interpretation of the package. Distance affects the semantic size (how narrow or how broad is a particular information unit) of the component in the package. The semantic size is the degree of specificity; what can be discerned as a unit of semantic value in the package.

How a unit of information is coded or stored will determine its processing because it points directly to the degree of specificity of that unit.

NOTE: Documents can be seen as units of information. Libraries and search engines treat documents as units of information. Books can be made up of self-contained chapters that are units in their own right.

Monday, July 19, 2010

Information as whole and part

Educational psychologists recognize different types of learning but in general they are understood as one of the products of getting exposed to information. Other posts have explored some of the characteristics of information. In most cases these effects are proposed as inferences because of the elusive nature of information and the difficulties of directly seeing its behavior. Observable phenomena are due to human reaction to a given unit of information, which in many cases is difficult to compose. A good analogy was presented earlier in which, without defining information, it is treated as the substance encapsulated by a container.

In an attempt to understand the characteristics of information, one can see it as a substance of a certain degree of complexity that can be incorporated into a body of knowledge. From the opposite perspective, information can then be a portion of a body of knowledge, thus a body of knowledge can be seen as a whole, or network, of integrated components that could be defined as information.

In this description, information is a part of the whole and can have any size, shape, or form that is part of that whole. An independent unit in its own right, this component may have subcomponents which may be similar in nature, that is, they may also be information units themselves. The idea of self-similarity can be borrowed from the physical sciences to describe this relationship at some level.

Saturday, July 17, 2010

Graduate studies

Education at the graduate and the undergraduate levels differs in a fundamental way. Undergraduate studies emphasize basic skills and memorization. Students at the Master’s level get deep into their subject matter, adhering to sets of best practices. At the Ph.D. level graduate students go into theory and models that are fundamental to their educational interest. The main difference between the two levels of graduate studies is not just the final objective or level of knowledge but the degree of abstraction of the content of study.

A graduate with a Master’s degree is expected to complete practical tasks whereas a graduate with a Ph.D. is expected to work at a theoretical level wherein different tasks can be developed, studied, and evaluated.

Their main difference in training is that a Ph.D. will mostly work at an abstract level whereas a Master’s has the expertise to implement and complete practical tasks within prescribed rules, guidelines or paradigms, with limited regard for the foundational theoretical constructs upon which those are built. Master's graduates apply theories while Ph.D.s develop them.

Disclaimer: This brief discussion only applies to differences in the studies that Master's and Ph.D. students undertake, and not to the actual differences in intellectual capabilities or activities the same students end up doing in their professional life, which may clearly contradict what is presented here.

Friday, July 16, 2010

Physics, information theory

The previous posts describe some of the components and characteristics of an information space. It is clear from the discussions that a definition of information is elusive. In one case the problem of definition is equated to that of light. We can only see light as it reveals an object while remaining invisible itself. Is it possible that information is likewise recognized only as it reveals something else, or when it is seen as something else acquires shape? Just as light is, could information be considered a form of energy?

This idea might not be too far-fetched. After all, physicists discuss information as something that is lost, or something that is preserved. If permanence can be expressed as preservation or loss in quantifiable terms, characterizing variation in the amount of information may be possible without having to define information. In other words, measurement is by proxy. Information gradients are problematic but their existence is in principle an accepted fact of physics.

Forcing the topic, one could also argue that information also carries a qualitative dimension, which the measure ignores. After all, an observable consequence may be measured but unknowns in the stimuli or in the interaction between object and component stimulus would raise questions about other associated factors. The result is an unknown environment with respect to the qualitative values of information, even as information gradients might be measured.

This terminology is related to a specific disciplinary endeavor known as information theory, which borrows heavily from the physical sciences, in particular from the second law of thermodynamics. This law, important for classical physics, establishes that the entropy of an isolated system never decreases (it is the first law that establishes that an isolated system neither gains nor loses energy). Claude Shannon would present his idea of information theory by referring to the gain or loss of information by a system. His ideas, mathematical in nature, have been fundamental in the development of digital networks and signal transmission.
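To make the idea of a quantity of information concrete, here is a minimal sketch of Shannon's entropy measure, H = sum of p * log2(1/p) over the symbols of a message, where p is the relative frequency of each symbol. The function name and the sample strings are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(message):
    """Average information per symbol, in bits: sum of p * log2(1/p)."""
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical sample messages: more variety in symbols gives higher entropy.
for sample in ("aaaa", "abab", "abcd"):
    print(sample, shannon_entropy(sample))
```

The measure depends only on how often each symbol occurs. Two messages with the same symbol frequencies get the same value whatever they mean, so it quantifies an amount without defining or interpreting the information itself.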

But as modern physics advanced in the second half of the XX century, and into the new millennium, the second law of thermodynamics has been questioned, in particular with respect to subatomic particles. Will this have any theoretical value as a possible new model for information? Maybe it is too early to tell, or maybe it is an unrealistic wish. The fact is that, as with many forms of energy that can be stored, channeled and used, information, its nature, and a plausible and encompassing definition remain elusive.

Thursday, July 15, 2010

Meaning construction

When we attempt to make sense of our surroundings, by necessity, we notice only a fraction of phenomena. Otherwise the amount of signals would overwhelm our senses and overtax our attention. This is done by our mental mechanisms, which include filtering processes to direct our attention to what they deem important. This happens at every instant.

We are only aware of a minimum set of prioritized events. Out of that small set we attend to only a few, maybe only a couple, while our senses keep receiving multiple signals. Those signals may be pre-coordinated, as in a face-to-face conversation when we hear words but also see gestures, or body language. For post-coordination of signals, the brain couples separate and unsynchronized partial channels. In both cases, the objective is to fit all the stimuli we receive within a coherent framework that explains them as a whole. This whole process is known as sense making.

Sense making and meaning construction are similar cognitive processes. Sense making is semi-internal because it works with external stimuli and organizes them into a coherent entity. Construction of meaning integrates the coherent entity into a structure in memory that already exists, expanding it, or creates a new structure. Any one of these structures may be novel or an ontological replica of another.

Meaning construction results in a greater structural entity than the interpretation of the initial collection of signals and stimuli. It is a purely internal cognitive process of integration or creation that either expands an existing structure or creates new ones.

As a first step, our senses capture signals; then the signals are cognitively joined to make a temporary whole or conceptual structure in memory that is interpreted and placed into a greater whole conceptual structure, also in memory. The need to consider these as two separate processes arises from the intermediate need to combine and interpret the separate signals first and to place the result into another structure at a second stage.

Question: Are these structures, or structural entities, information, knowledge?
Question: Where is information?

Wednesday, July 14, 2010

Sense making to construct meaning

As part of our humanity, we attempt to make sense of our surroundings. By necessity, what we notice is only a minimal amount of phenomena to prevent overwhelming our senses and overtaxing our attention. As a result, our mental mechanisms include filtering processes to direct our attention to what they deem important at every instant. We are only aware of a minimum set that prioritizes events to give attention to only some of the signals some of our senses are receiving. The signals may be pre-coordinated, as in a face-to-face conversation where we hear words and calibrate them with the gestures, or body language. Post-coordination of signals is when the brain must couple separate and unsynchronized partial channels. In both cases, the objective is to fit all the stimuli we receive within a coherent framework that explains them as a whole. This whole process is known as sense making.

Sense making and meaning construction are similar cognitive processes. Sense making is semi-internal in that it works with external stimuli. It organizes the external stimuli into a coherent entity. Construction of meaning integrates that coherent entity into a structure in memory that already exists or that is an ad hoc creation. The structure may be novel or an ontological replica of another. Meaning construction results in a greater structural entity than the interpretation of the initial collection of signals and stimuli. It is a purely internal cognitive process of integration or creation that either expands an existing structure or creates a new structure of knowledge.

Question: This sounds like data being transformed into knowledge; where is information?

Tuesday, July 13, 2010

Scientific Research?

Invalidating research by questioning the immediate applicability of research results is an unsustainable position. To begin with, results from research are primary information. The value of the results is not always explicit, clear or immediate.

Immediate value of research is relative to the disciplinary body of knowledge within which the particular research was placed. It is within that space that the work would need to be examined first, but there is more. Natural phenomena do not stand alone. They are part of an intricate and most often invisible network of influencing factors. Boundaries among disciplines are artificial and were forced by organizational needs of institutions and other such enterprises. Everything is connected to everything else in nature.

In this context, a particular research experiment is normally part of a line of research, or inquiry, with unknown effects on other external disciplines. This status remains so until the time when it is discovered by researchers in those other disciplines. Interdisciplinary crossover is not only possible but desirable.

On the other hand, regardless of applicability, there may be research and lines of inquiry with weak theoretical assumptions. This would be a problem. Unequivocal acceptance of basic assumptions within a theoretical framework leads to weak science, waste of resources and mediocrity. Some researchers would complain about this statement, and it may not be their experience or obvious within their own areas of expertise. After all, most researchers receive strong training on research methodologies to critically evaluate possible pitfalls in their own work. This includes a thorough understanding of the assumptions on which those methodologies rest.

The study of properties, critical components, factors or their interrelations in theoretical frameworks is a type of research that routinely receives funding awards because its results either affirm or question the fundamental assumptions of the framework. But readers of reports on such research should beware of the assumptions hidden in the methodology. Readers should judge the methodology and the conclusions.

It is important that researchers are not only trained about methodologies and their fundamental assumptions, but that they are also trained to question those fundamental assumptions. This is particularly important with respect to their specific expertise and the assumptions in their own line of research.

Any student who is not being trained to ask questions, to think critically even about their education, is missing the point. Conversations with young researchers make me think that the value of strong theories is being minimized, dismissed, or -- worse yet -- is unknown to them in their research activities. Trained researchers should be expected to answer for the state of the fundamental assumptions in their theoretical paradigms.

Monday, July 12, 2010

Context and Specificity

Context refers to the relationship between part and whole, such that the part acquires a greater degree of specificity in its interpretation, or meaning, one that is different than when the part is examined as an isolated entity.

Specificity is a property of the scope of meaning, of what was understood by the reader, of what was read, of what is carried by the symbols, the letters, the words, etc. For example, the expression “brown fox” is less specific than “the small brown fox jumped over the tree”. One expression is more specific than another when it captures more details.

These properties are useful when examining human understanding of texts and documents. A simplistic examination may indicate that meaning and understanding are some objective characteristics of text and documents. After all, grammar provides a solid foundation to the combination of words to accommodate a variety of meanings.

The understanding of those meanings would have to occur at some standardized level that by necessity would be the lowest level in a universe of multiple levels of understanding and meanings.

Understanding or meaning is derived from the text itself and from how much the reader is able to relate to the content. These relationships form a space, which builds context for the text. Placing a text in a context is contextualization.

Text and context, once together, form information. This is what is extracted from the expression, or text. Ideally, the writer, or constructor of the expression, has captured some meaning in an expression. If the reader captures the same meaning, and at the same level of specificity intended by the creator, it is said that there has been an accurate interpretation, thus accurate communication has taken place. Accurate contextualization and the placement of the text at the original level of specificity must occur for the correct interpretation and understanding of the text to take place.

Variability in the application of contextualization and specificity levels by individuals explains the differences in reading understanding and in meaning construction among different individuals.

Sunday, July 11, 2010

Many Levels of Reading Understanding

Reading provides an excellent field laboratory to explore understanding, particularly when it is pleasure reading. Experiments on reading understanding seem to support the idea that, at least with respect to this type of material, readers agree about the content they have read.

But while readers seem to agree, it is clear that readers understand the content at various levels. Children, young adults, and college graduates are just some of the categories of people that on average exhibit varying degrees and levels of understanding. In general, the acceptance that there are multiple levels of reading understanding seems to be universal. This raises questions about those levels. For example:

Are the various reading understanding levels completely distinct, do they partially overlap, or completely overlap as in being organized as layers of understanding?
Can one say that one level of understanding is greater or higher than another level?

The answers seem obvious but, in light of the individual differences among readers and the issues discussed in other posts about information, there may be at least one level wherein readers would have the same understanding: a level of common understanding, the lowest level of understanding. If so, which is this level?

This is not a novel idea but rather one that has been and is constantly examined to gain more understanding about the different factors affecting the reading experience. I am sure that there are many people interested and knowledgeable on these issues, such as reading specialists, teachers, researchers on communication, and members of other similar groups.

Saturday, July 10, 2010

And then, there is human nature

It seems that some aspects of nature can be observed and the observations recorded. One way to record is in human memory, but it is known to be plagued by bias. Memory seems to accommodate preferences and perspectives of each individual. Moreover, the senses have limitations and our personal background supplies context that could distort acquisition and interpretation of nature’s signals.

The important matter of attention has something to say about the resulting record of an event. Attention is related to individual awareness and interest, and it is directed towards specific areas of the environment in response to some internal reaction to external triggers.

With all of these weak links in the chain of interpretation, it is no wonder that witnesses of the same event come to remember different things about it, including nonexistent segments or components. One may be inclined to argue that anxiety, distractions and unexpected occurrences explain the differences, thus more time and a relaxed atmosphere will enable different people to record exactly the same memories about a given experience, event or stimulus.

Documents, information and meaning

It is possible to envision documents as information that has been packaged. A document is an instance of various components interacting with each other, and that together have a certain meaning. Meaning is, therefore, an inherent property of information as well as of the document.

Meaning emerges from the interpretation of phenomena, including documents as human creations. There are multiple versions of arguments that deal with the nature of meaning, of documents and of information, and of how they relate to each other. However, one would be hard pressed to argue that meaning, documents and information are not related.

A document, thus defined, captures an instantiation of information that is understood as a unit. However, it is clear that it is also a bundle of parts, which are themselves units of information in their own right. Therefore, a document is a network of multiple parts packaged as one unit.

This idea becomes clear when we see a multi-letter word, or a multi-word sentence, etc. Each letter, each word, is a unit of information in its own right, yet together they form a whole with its own meaning. A similar example is an image made out of strokes, or of pixels, or of visual elements in a coherent collage. The final meaning is related to the relationship of the component elements to each other, as well as to how each part is related to the whole.

Meaning and documents

As it relates to documents as expressions of meaning, meaning can either be constructed when creating the document or derived when one interacts with it.

Meaning can be defined as a subjective interpretation of an objective instantiation, namely a document.

Document is further defined as an instantiation that has been captured in some form or medium, which could be void of explicit a-priori meaning, underlining the reality of possible a-posteriori assignment of meaning to the creation of a document. This is exemplified in the multitude of cases when original documents have served purposes different to those for which they had been originally created.

Friday, July 9, 2010

On meaning

Meaning is captured by some form of entity, such as words, individually or combined.

What type of operation takes place when words are combined? Is meaning additive? Some words are declarative (nouns, verbs) and others are modifiers (adjectives, adverbs). The effect of some words over other words is not additive.

What about meaning? What is meaning in relationship to words?
Can meaning be considered absolute?
Can meanings be combined?

If meaning is transmitted, is the sent meaning the same as the received meaning?

Thursday, July 8, 2010

Information is invisible

We often don’t recognize the value and the role of information until such a time when we need it.

What is information?
When is information information and not something else?
What is information before it is information?

Does something carry information?
Is information tangible? How tangible?
Or, is only the carrier of information tangible?
What is the carrier that carries information?

Look up: Semiotics.
In this light, how can information be managed, organized?

And, what makes information valuable?
What is its role?

Look up: Representation and Information Representation.

In the discussion of the previous questions, it is clear that pinpointing a definition of information is slippery at best. It is more like a moving target, or perhaps rather like a moving bullet that can be seen when it is at rest but not when it is being used as it was naturally conceived. Like light, information can only be perceived indirectly, as it interacts with the environment. Light remains invisible otherwise.