Zen of Information: Vocabularies to represent information

Documents may include narratives, arguments, sections, and many other semantic components that together make up an interrelated network of concepts. To represent all of that information, at least two types of descriptors are used. One is drawn from free language; the other is drawn from a predefined list of terms, known as a controlled vocabulary.

Free language descriptors accept almost any term; when users assign them collectively, the result is often called a folksonomy. A controlled vocabulary is a list of terms (a thesaurus) or a lexical organization (an ontology) intended to account for the most salient or significant concepts in the collection. A third type is often overlooked, as if it were invisible, unnecessary, or useless, and is perhaps taken for granted because of its immediacy: the actual vocabulary in the text of the documents. In all cases, one important issue stems from the limitations and characteristics of language: deciding which information elements to include and which to exclude.

These three tools are used to create representative surrogates of the documents. Each has strengths and weaknesses, and each involves tradeoffs. Evaluating them as sources means considering standardization, scope of coverage, specificity, exhaustivity, size in number of entries, and similar criteria.

For example, free text representation may be very flexible, but it lacks standardization across systems. Controlled vocabularies lack flexibility, and it may be difficult for them to capture significant concepts at the right level of specificity. The language in the document set offers only one of several possible ways to express a concept, because alternatives such as synonyms always exist.
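To make the tradeoff concrete, here is a minimal sketch of how free language tags might be reconciled with a controlled vocabulary through a synonym list. The vocabulary, the tags, and the function names are all hypothetical and chosen only for illustration; a real thesaurus would be far larger and would carry richer relationships than simple variants.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
CONTROLLED_VOCABULARY = {
    "automobiles": {"car", "cars", "auto", "automobile"},
    "information retrieval": {"ir", "search", "document retrieval"},
}


def build_lookup(vocabulary):
    """Invert the vocabulary so every variant points to its preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred] = preferred
        for variant in variants:
            lookup[variant] = preferred
    return lookup


def index_document(free_tags, vocabulary):
    """Split free tags into controlled descriptors and unmatched free terms."""
    lookup = build_lookup(vocabulary)
    controlled, uncontrolled = set(), set()
    for tag in free_tags:
        key = tag.strip().lower()  # normalize case and surrounding whitespace
        if key in lookup:
            controlled.add(lookup[key])  # synonyms collapse to one descriptor
        else:
            uncontrolled.add(key)  # no match: kept as a free language term
    return sorted(controlled), sorted(uncontrolled)


controlled, uncontrolled = index_document(
    ["Cars", "auto", "Search", "hybrid engines"], CONTROLLED_VOCABULARY
)
print(controlled)
print(uncontrolled)
```

In this sketch, "Cars" and "auto" collapse into a single standardized descriptor, which is the gain from control, while "hybrid engines" finds no match and survives only as a free term, which is the specificity the controlled list fails to cover.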

Representation of information is a difficult chore.

Tuesday, August 10, 2010
