Last Updated 6-27-2010
Normalized Information Distance

    The Normalized Information Distance is a universal distance measure applicable to many different kinds of objects. Because it is based on Kolmogorov complexity, the actual distance value cannot be computed; nevertheless, there are still many ways to take advantage of the concept. First, compression algorithms can be used to approximate Kolmogorov complexity, as long as the objects being compared have a string representation.

    Second, page-count statistics obtained from the World Wide Web can be used for names and even abstract concepts. These practical approximations of the Normalized Information Distance can then be applied to machine learning tasks such as clustering and data mining. We will explore both the theoretical and practical implications of the Normalized Information Distance. Real-world applications include music clustering, bioinformatics, and machine translation.

    Kolmogorov complexity measures the absolute information content of an individual object. For tasks such as clustering and data mining, we also need the absolute information distance between two objects x and y. This distance is defined as the minimum quantity of information required to translate x into y, or vice versa. Unfortunately, because the definition rests on Kolmogorov complexity, the distance is uncomputable. To get around this dilemma, several formulas and approaches have been developed that approximate it in practice.


    Viewing Kolmogorov complexity as the length of the shortest program that outputs x on input y has some interesting implications. One is that if x and y are concatenated, we still have two distinct objects, just concatenated. If we run the concatenation through a compressor and compare the size of the result with the sizes of the individually compressed components, we obtain a metric for the similarity between x and y: as x and y become more similar, the compressed concatenation of x and y tends toward the size of either one compressed alone.
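    This effect can be checked directly with an off-the-shelf compressor. A minimal sketch using Python's built-in zlib; the sample strings below are made up for the demonstration:

```python
# Compare compressed sizes: a string alone, concatenated with an
# identical copy, and concatenated with an unrelated string.
import zlib

def csize(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

x = b"peter piper picked a peck of pickled peppers " * 30
y = x                                                 # identical to x
z = b"how much wood would a woodchuck chuck " * 30    # unrelated to x

print(csize(x))        # baseline
print(csize(x + y))    # barely larger than x alone: y adds no new information
print(csize(x + z))    # much larger: z shares little structure with x
```

    The second copy of x contributes almost nothing to the compressed size, while the unrelated string z costs roughly its own compressed size again.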

    Normalized Compression Distance

    The Normalized Information Distance is impractical since it cannot be computed. To overcome this limitation, a practical form of the Normalized Information Distance was developed, known as the Normalized Compression Distance. The Normalized Compression Distance uses real data compressors (such as gzip, bzip2, and PPMZ) to replace Kolmogorov complexity with a computable value. Making the proper substitutions gives us the following formula:

    NCD(x, y) = ( C(xy) - min(C(x), C(y)) ) / max(C(x), C(y))


    • C(x) = size of the compressed file x.
    • C(y) = size of the compressed file y.
    • C(xy) = size of the compressed concatenation of files x and y.
    • min(C(x), C(y)) = the smaller of the two compressed sizes.
    • max(C(x), C(y)) = the larger of the two compressed sizes.
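    A minimal sketch of this formula, with Python's zlib standing in for the compressor C; the sample strings are invented for the demonstration:

```python
# Illustrative implementation of the Normalized Compression Distance,
# using zlib as the compressor C.
import zlib

def C(data: bytes) -> int:
    """Compressed size of data: a practical stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox jumps over the lazy dog " * 20
s3 = b"colorless green ideas sleep furiously " * 20

print(ncd(s1, s2))   # near 0: identical strings compress well together
print(ncd(s1, s3))   # noticeably larger: these strings share little structure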

    Help Support this site - Click this ad
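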


    DNA sequences seem ideally suited for the Normalized Compression Distance technique. A DNA sequence is a finite string with a 4-Letter alphabet: {A,C,G,T}. A hotly debated topic in biology is which 2 of the 3 placental mammalian groups: Rodents, Ferungulates, and Primates are most closely related. The test involves placing the mitochondrial gene sequences of several mammals in a standard data compressor such as PPMZ, and creating a distance matrix. The values in the distance matrix are then used to reconstruct a phylogeny tree giving an NCD classification tree. The results of this experiment are illustrated in the following diagram:

    Figure NCD-1: NCD Generated Mammalian Classification Tree

    Help Support this site - Click this ad

    Our agent has now adopted the goal of driving to Bucharest, and is considering where to go from Arad. There are three roads out of Arad, one toward Sibu, one to Timisoara, and one to Zerind. None of these achieves the goal, so unless the agent is very familiar with the geography of Romania, it will not know which road to follow. In other words, the agent will not know which of its possible actions is best, because it does not yet know enough about the state that results from taking each action. If the agent has no additional information - i.e., if the environment is unknown in the sense defined in Section 2.3 - then it has no choice but to try one of the actions at random.

    But suppose the agent has a map of Romania. The point of a map is to provide the agent with information about the states it might get itself into, and the actions it can take. The agent can use this information to consider subsequent stages of a hypothetical journey via each of the three towns, trying to find a journey that eventually gets to Bucharest. Once it has found a path on the map from Arad to Bucharest, it can achieve its goal by carrying out the driving actions that correspond to the legs of the journey. In general, an agent with several immediate options of unknown value can decide what to do first by examining future actions that eventually lead to states of known value.

    To be even more specific about what we mean by "examining future actions," we have to be more specific about properties of the environment. For now, we will assume that the environment is observable, so that the agent always knows the current state. For the agent driving in Romania, it's reasonable to suppose that each city on the map has a sign indicating its presence to arriving drivers. We will also assume the environment is discrete, so that at any given state there are only finitely many actions to choose from. This is true for navigating in Romania because each city is connected to a small number of other cities. We will assume the environment is known, so that the agent knows which states are reached by each action. Next, we assume our map is accurate. And finally, we assume that the environment is deterministic, so that each action has exactly one outcome. Under ideal conditions, this is true for the agent in Romania - it means that if it chooses to drive from Arad to Sibiu, it does end up in Sibiu.

    Under these assumptions, the solution to any problem is a fixed sequence of actions. The process of looking for a sequence of actions that reaches the goal is called a search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a solution is found, the actions it recommends can be carried out. This is called the execution phase. Figure Agent-1 gives us a simple formulate, search, execute design for the agent:

    Help Support this site - Click this ad

    function Simple-Problem-Solving-Agent (precept) returns an action
        seq, an action sequence, initially empty
        state, some description of the current world state
        goal, a goal, initially null
        problem, a problem formulation

      state ← Update-State (state,percept)
      if seq is empty then
        if seq = failure then return a null action
      action ← First(seq)
      seq ← Rest(seq)
      return action
    Figure Agent-1. A simple problem-solving agent. It first formulates a goal and a problem, searches for a sequence of actions that would solve the problem, and then executes the actions one at a time. When this is complete, it formulates another goal and starts over.

    Help Support this site - Click this ad

    After formulating a goal and a problem to solve, the agent calls a search procedure to solve it. It then uses the solution to guide its actions, doing whatever the solution recommends as the next thing to do - typically the first action of the sequence - and then removing that step from the sequence. Once the solution has been executed, the agent will formulate a new goal.

    Notice that while the agent is executing the solution sequence it ignores its percepts when choosing an action because it knows in advance what they will be. An agent that carries out its plans with its eyes closed, so to speak, must be quite certain of what is going on. Control theorists call this an open-loop system, because ignoring the percepts breaks the loop between agent and environment.

    We first describe the process of problem formulation, and then present various algorithms for the Search function.

    In order to continue with this guide, and continue the discussion on Well-Defined Problems and Solutions: Press the Button below:

    Help Support this site - Click this ad


    Artificial Intelligence

    Big O Notation

    NP Completeness

    Intelligent Agents

    Problem Definition

    Formulating Problems

    Toy Problems

    Vacuum World


    8 Queens

    The Knuth Sequence

    Real World Problems

    Route Finding Problems

    Touring Problems

    Traveling Salesperson Problem

    VLSI Layout

    Robot Navigation

    Automatic Assembly Sequencing

    Searching For Solutions

    Frontier Node Expansion

    Search Algorithm Infrastructure

    Child Node Creation

    Measuring Algorithm Performance

    Uninformed Search Strategies

    Breadth-First Search

    Breadth-First Characteristics

    C++ Breadth-First Search Code

    Breadth-First Data Structure

    Breadth-First Main Program

    Breadth-First Make Map

    Breadth-First Make Neighbor

    Breadth-First Point of Origin

    Breadth-First Print Cities

    Breadth-First Initial Listing

    Breadth-First Path Record

    Breadth-First Get Child City

    C++ Breadth-First Search Code

    Breadth-First Print Goal Path

    Uniform-Cost Search

    Depth-First Search

    Depth-First C++ Solution

    Depth-First Data Structure

    Depth-First MakeMap()

    Depth-First Display Data

    Depth-First Initial Listing

    Depth-First GetChildCity()

    Depth-First Path Record

    Depth-First Search Function

    Depth-First PrintGoalPath()

    Depth-First Main()

    Depth-Limited Search

    Iterative Depth Search

    Bidirectional Search

    Comparing Strategies

    Informed Search Strategies

    Greedy Best-First

    A* Search

    Conditions For Optimality

    Optimality of A*

    C++ A* Search Code

    A* Search Data Structure

    A* Search Data Entry

    A* Search GetChildCity()

    C++ Min Priority Queue

    C++ A* Search Code Function

    C++ A* Headers & Prototypes

    C++ A* Search Main()

    Normalized Information Distance

    Can't open VisitorHistory.txt for reading