|
Hot-Link Menu in Right-Side Column |
Last Updated 6-27-2010 |
||||||||||||||||||||||||||||||||||||||||||||
|
Normalized Information DistanceThe Normalized Information Distance is a universal distance measurement for many different objects. Since the Normalized Information Distance is based on Kolmogorov complexity, the actual distance value cannot be computer, nevertheless, there are still many ways to take advantage of the concept. First of all, compression algorithms can be used to approximate Kolmogorov complexity, as long as the objects being compressed can have a string representation. Second, page count statistics obtained from the World Wide Web can be used for names and even abstract concepts. These practical applications of the normalized information distance are then applied to machine learning tasks to assist in tasks such as data mining. We will explore both the theoretical and practical implications of Normalized Information Distance. Real world applications include music clustering, bioinformatics and machine translation. Kolmogorov complexity measures the absolute information content of individual objects. In the tasks of clustering and data mining, it is also necessary to measure the absolute information distance between individual objects. Measuring absolute information distance is a universal requirement for determining a distance between two objects x and y. The distance between two string is defined as the minimum quantity of information required to translate x into y or vice versa. Unfortunately, because of the universal requirement that this information be based on Kolmogorov complexity, the distance is uncomputable. To get around this dilemma, several formulas and approaches have been developed that come close to approximating Kolmogorov complexity.
Continuing along the line that Kolmogorov Complexity is the length of the shortest program that outputs x on input y, has some interesting implications. One of the implications is that if x and y are concatenated, we still have two distinct objects, just concatenated. When we concatenate these programs, then, run them through a compressor, and compare the output of the concatenated result with the individual components, we now have a metric for the similarity between x and y. As x and y become more similary, the output of the concatenation of x and y, will tend to be the same as x and y compressed and unconcatenated. Normalized Compression DistanceThe Normalized Information Distance is impractical since it cannot be computed. To overcome this limitation, a practical form of the Normalized Information Distance was developed. This practical form is known as the Normalized Compression Distance. The Normalized Compression Distance utilizes the properties of data compressors (gzip, bzip, & PPMZ) to replace Kolmogorov complexity with a practical value. Making the proper substitutions gives us the following formula:
where:
PhylogeniesDNA sequences seem ideally suited for the Normalized Compression Distance technique. A DNA sequence is a finite string with a 4-Letter alphabet: {A,C,G,T}. A hotly debated topic in biology is which 2 of the 3 placental mammalian groups: Rodents, Ferungulates, and Primates are most closely related. The test involves placing the mitochondrial gene sequences of several mammals in a standard data compressor such as PPMZ, and creating a distance matrix. The values in the distance matrix are then used to reconstruct a phylogeny tree giving an NCD classification tree. The results of this experiment are illustrated in the following diagram:
Our agent has now adopted the goal of driving to Bucharest, and is considering where to go from Arad. There are three roads out of Arad, one toward Sibu, one to Timisoara, and one to Zerind. None of these achieves the goal, so unless the agent is very familiar with the geography of Romania, it will not know which road to follow. In other words, the agent will not know which of its possible actions is best, because it does not yet know enough about the state that results from taking each action. If the agent has no additional information - i.e., if the environment is unknown in the sense defined in Section 2.3 - then it has no choice but to try one of the actions at random. But suppose the agent has a map of Romania. The point of a map is to provide the agent with information about the states it might get itself into, and the actions it can take. The agent can use this information to consider subsequent stages of a hypothetical journey via each of the three towns, trying to find a journey that eventually gets to Bucharest. Once it has found a path on the map from Arad to Bucharest, it can achieve its goal by carrying out the driving actions that correspond to the legs of the journey. In general, an agent with several immediate options of unknown value can decide what to do first by examining future actions that eventually lead to states of known value. To be even more specific about what we mean by "examining future actions," we have to be more specific about properties of the environment. For now, we will assume that the environment is observable, so that the agent always knows the current state. For the agent driving in Romania, it's reasonable to suppose that each city on the map has a sign indicating its presence to arriving drivers. We will also assume the environment is discrete, so that at any given state there are only finitely many actions to choose from. This is true for navigating in Romania because each city is connected to a small number of other cities. We will assume the environment is known, so that the agent knows which states are reached by each action. Next, we assume our map is accurate. And finally, we assume that the environment is deterministic, so that each action has exactly one outcome. Under ideal conditions, this is true for the agent in Romania - it means that if it chooses to drive from Arad to Sibiu, it does end up in Sibiu. Under these assumptions, the solution to any problem is a fixed sequence of actions. The process of looking for a sequence of actions that reaches the goal is called a search. A search algorithm takes a problem as input and returns a solution in the form of an action sequence. Once a solution is found, the actions it recommends can be carried out. This is called the execution phase. Figure Agent-1 gives us a simple formulate, search, execute design for the agent:
After formulating a goal and a problem to solve, the agent calls a search procedure to solve it. It then uses the solution to guide its actions, doing whatever the solution recommends as the next thing to do - typically the first action of the sequence - and then removing that step from the sequence. Once the solution has been executed, the agent will formulate a new goal. Notice that while the agent is executing the solution sequence it ignores its percepts when choosing an action because it knows in advance what they will be. An agent that carries out its plans with its eyes closed, so to speak, must be quite certain of what is going on. Control theorists call this an open-loop system, because ignoring the percepts breaks the loop between agent and environment. We first describe the process of problem formulation, and then present various algorithms for the Search function. In order to continue with this guide, and continue the discussion on Well-Defined Problems and Solutions: Press the Button below:
|
Home Artificial Intelligence Big O NotationNP Completeness Intelligent Agents Problem Definition Formulating ProblemsToy Problems Vacuum World 8-Puzzle 8 QueensThe Knuth Sequence Real World Problems Route Finding Problems Touring Problems Traveling Salesperson Problem VLSI Layout Robot Navigation Automatic Assembly Sequencing Searching For Solutions Frontier Node Expansion Search Algorithm InfrastructureChild Node Creation Measuring Algorithm Performance Uninformed Search Strategies Breadth-First SearchBreadth-First Characteristics C++ Breadth-First Search Code Breadth-First Data Structure Breadth-First Main Program Breadth-First Make MapBreadth-First Make Neighbor Breadth-First Point of Origin Breadth-First Print Cities Breadth-First Initial ListingBreadth-First Path Record Breadth-First Get Child CityC++ Breadth-First Search Code Breadth-First Print Goal Path Uniform-Cost SearchDepth-First Search Depth-First C++ SolutionDepth-First Data Structure Depth-First MakeMap() Depth-First Display DataDepth-First Initial Listing Depth-First GetChildCity() Depth-First Path RecordDepth-First Search Function Depth-First PrintGoalPath() Depth-First Main()Depth-Limited Search Iterative Depth Search Bidirectional SearchComparing Strategies Informed Search Strategies Greedy Best-FirstA* Search Conditions For Optimality Optimality of A*C++ A* Search Code A* Search Data StructureA* Search Data Entry A* Search GetChildCity() C++ Min Priority QueueC++ A* Search Code Function C++ A* Headers & PrototypesC++ A* Search Main() Normalized Information Distance |