Mind The Gap

Donnerstag, 9. August 2007

Face Recognition with Neural Networks

This paper provides a brief introduction to neural networks and the task of face recognition with ANNs. It is based on “Machine Learning” by Tom M. Mitchell and some resources from the World Wide Web.

See the chapter “References” for more information.

The aim of this paper is not to give a detailed introduction to neural networks, because there are many good books and online resources that already do a great job of this. Just use Google to find them.

After a very brief introduction to gradient descent, backpropagation and neural networks, I will show how to design a neural network to recognize faces. The data set (the images) can be found here: http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html, the source
code here: http://code.google.com/p/mindthegap/

1. Backpropagation and Gradient Descent
In the paper “Introduction” we developed a learner for the task of playing Tic Tac Toe. We used a linear combination of weights and features to represent the learning task and the LMS update rule to adjust the weights. With the LMS update rule we minimized the error between the output and the target output; the LMS update rule is a stochastic gradient descent. This linear representation of a learning task is only suitable for problems which are linearly separable. In many cases this is not possible, so different (more complex) representations have to be found. One such representation is neural networks. Depending
on the structure of the network (number of different layers), neural networks can represent very complex nonlinear hypothesis spaces. But despite some differences, our simple Tic Tac Toe learner shares some basic concepts with neural networks: the basis of both are features and weights.

Backpropagation is a (supervised) learning algorithm for feed-forward networks. Like many other algorithms it uses gradient descent to find weights which minimize the error. Gradient descent involves computing the error gradient and adjusting the weights in the direction of the negative gradient.
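To make this concrete, here is a minimal gradient descent sketch in Python (purely illustrative; the function, data and values are made up and are not part of the source code in [2]):

```python
# Minimal gradient descent sketch: minimize the squared error
# E(w) = (target - w * x)^2 for a single weight w.

def gradient_descent(x, target, w=0.0, learning_rate=0.1, steps=100):
    for _ in range(steps):
        output = w * x
        error = target - output
        # dE/dw = -2 * error * x; we step AGAINST the gradient
        w += learning_rate * 2 * error * x
    return w

w = gradient_descent(x=2.0, target=6.0)
print(w)  # converges toward 3.0, since 3.0 * 2.0 = 6.0
```

Each step moves the weight a small amount in the direction that reduces the squared error, which is exactly what backpropagation does for every weight in the network.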

I've implemented some simple neural networks to learn some simple tasks. The source code is available here [2]. There are neural networks which learn simple identity functions, simple inverse functions etc. There is also a delta rule implementation, which can be used to compare different types and parameters of gradient descent.
You can compare
• plain gradient descent
• stochastic gradient descent (approximation)
• different learning rates and
• decaying learning rates
For an overview of neural networks see [3].
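The difference between the first two variants can be sketched in a few lines of Python (an illustrative toy, not the delta rule implementation from [2]; the data and function names are made up):

```python
# Delta rule on a single linear unit, contrasting plain (batch) and
# stochastic gradient descent. Toy data: t = 2 * x.

def batch_delta_rule(examples, w, eta=0.05, epochs=200):
    # plain gradient descent: accumulate the gradient over ALL examples,
    # then apply one combined update per epoch
    for _ in range(epochs):
        delta = 0.0
        for x, t in examples:
            delta += eta * (t - w * x) * x
        w += delta
    return w

def stochastic_delta_rule(examples, w, eta=0.05, epochs=200):
    # stochastic approximation: update immediately after EACH example
    for _ in range(epochs):
        for x, t in examples:
            w += eta * (t - w * x) * x
    return w

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(batch_delta_rule(examples, w=0.0))       # close to 2.0
print(stochastic_delta_rule(examples, w=0.0))  # close to 2.0
```

Both converge here; the stochastic variant trades exact gradients for cheaper, more frequent updates, which is why it is called an approximation above.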

2. Designing the ANN
Besides the implementation of the backpropagation algorithm, our neural network for face recognition requires some important design decisions. First, we have to design the overall structure of the network. We decide to use two sigmoid layers (one hidden layer and one output layer) consisting of some (currently) unknown number of units. Second, we have to consider which features our network should learn. The images are a 32x30 grid of pixels. Each pixel has an integer value from 0 to 255 representing a color value. An obvious solution for our problem domain is to use 960 input units representing the pixel values. Third, we have to decide how
many hidden units we want to use. As suggested in [1] we use 20 units (but we can easily change this and experiment with more or fewer units). Fourth, we have to design the output layer. For example, we could use only one output unit to represent the final classification. But we again choose an obvious solution: for every person there should be one output unit (because we expect that the network can learn this output encoding more easily). The last decision concerns the target encoding. The target encoding determines the error between the current output of the network and the desired output. Because we have one output unit for every person, our target encoding for e.g. the first person (of seven) looks like this: 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1. So a high value
stands for the target classification.
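The dimensions and the target encoding just described can be written down in a few lines (a Python sketch for illustration; the actual project code [2] is organized differently):

```python
# Network dimensions and target encoding as described above.

N_INPUTS = 32 * 30   # one input unit per pixel, pixel values 0..255
N_HIDDEN = 20        # as suggested in [1]
N_PERSONS = 7        # one output unit per person

def target_encoding(person_index, n_persons=N_PERSONS):
    # 0.9 for the target person, 0.1 for everyone else; sigmoid outputs
    # can never reach 0 or 1 exactly, so 0.1/0.9 are easier targets
    return [0.9 if i == person_index else 0.1 for i in range(n_persons)]

print(N_INPUTS)            # 960
print(target_encoding(0))  # [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
```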

The final structure looks like this:

[Figure: the network structure — 960 input units, one hidden layer of 20 units and 7 output units]

3. The Implementation
The implementation of the network is straightforward. The class Unit represents a single sigmoid unit. It has methods for calculating the current error of the unit and for adjusting the weights. ImageNeuralNetwork represents the whole network. It defines the structure of the network and has methods for propagating the different layers and for training and testing the network. TargetEncoding defines the target encoding specified above. XVal is a simple cross-validation implementation. ImageReader reads each single character from the image files and stores them as integers in an array. And ImageLearnerNN puts the pieces together: it uses the cross-validation implementation to train networks and looks for the 'best' network. The accuracy
of the 'best' network is calculated, and this process is repeated ten times to get a mean result.
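A sigmoid unit in the spirit of the Unit class could look like this (a Python sketch with hypothetical names; the project's actual Unit class [2] has a different interface):

```python
import math

# A minimal sigmoid unit: weighted sum of the inputs squashed through
# the logistic function, plus a backpropagation-style weight update.

class SigmoidUnit:
    def __init__(self, n_inputs, init=0.05):
        # small initial weights; weights[0] acts as the bias weight
        self.weights = [init] * (n_inputs + 1)

    def output(self, inputs):
        net = self.weights[0] + sum(
            w * x for w, x in zip(self.weights[1:], inputs))
        return 1.0 / (1.0 + math.exp(-net))

    def adjust(self, inputs, delta, eta=0.3):
        # weight update: w_i <- w_i + eta * delta * x_i
        self.weights[0] += eta * delta          # bias input is always 1
        for i, x in enumerate(inputs):
            self.weights[i + 1] += eta * delta * x

unit = SigmoidUnit(n_inputs=3)
print(unit.output([1.0, 0.0, 1.0]))  # a value strictly between 0 and 1
```

The error term delta is computed differently for output and hidden units; that distinction lives in the network class, not in the unit itself.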

References
[1] Machine Learning, Mitchell, Tom M. (1997), ISBN 0-07-115467-1
[2] Source Code, http://code.google.com/p/mindthegap/
[3] Artificial Neural Network, http://en.wikipedia.org/wiki/Artificial_neural_network
[4] Backpropagation, http://en.wikipedia.org/wiki/Backpropagation
[5] Mitchell Textbook, http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html

Mittwoch, 11. Juli 2007

Introduction to Machine Learning

This is a brief introduction to Machine Learning. It is based on “Machine Learning” by Tom M. Mitchell and some resources from the World Wide Web. See the chapter “References” for more information (I strongly recommend reading “Machine Learning”).
The source code can be found here: http://code.google.com/p/mindthegap/

I will characterize the nature of machine learning and provide a general overview of the concepts of machine learning tasks in a very practical manner.
Throughout, we will work on one concrete example and develop one possible solution for this given problem. After reading this, the reader should have an idea of machine learning problems and how to solve them.

  1. What is (Machine) Learning?
    One of the major abilities of humans is the process of learning. Much scientific effort has gone into examining what lies behind this ability. But until now there is no single definition of learning, and we are just beginning to understand how learning works.
    We will define learning in a very simplified way:

    Definition: Learning consists of remembering, of the ability to combine well-known facts, and of the recognition of patterns.

    The human learning process is much more complex than described here, but it is a good starting point to think about this issue. If we could write a program which has the same ability to learn as a human, we would come close to the dream of some famous science fiction writers. Nowadays we are far away from this, but algorithms for special problems have been invented. And with the research and development of such algorithms, the understanding of the human ability to learn might improve. In the context of computer programs we define learning as in [1]:

    Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

    We will call a program (respectively the component of a program) which learns a learner.

  2. Example: Tic Tac Toe
    Nearly everybody knows Tic Tac Toe. It is a very simple, easy-to-learn game. As suggested in [1] we will focus on this game. In short, two players play against each other, each one has a mark, and the aim of the game is to place three of one's own marks in a horizontal, vertical or diagonal row in a 3x3 grid. More information can be found here: [2].

    We will now define Tic Tac Toe using the above definition of learning:

    Definition:
    Task T: playing Tic Tac Toe
    Performance P: percent of games won against opponents
    Training experience E: playing practice games against itself

    In the context of machine learning the last point is the most important. To develop a program to play Tic Tac Toe and to measure the performance of the games, no special knowledge is necessary; this is a plain programming task. The design of the training experience, however, can have a huge impact on the success or failure of a learner. We have several possibilities to design the training experience. We can provide the learner with training examples consisting of game states and the correct move for each (direct feedback), or we can only provide move sequences and the final outcome (indirect feedback).

    The latter case is mostly the more complex one. There the learner has to infer the quality of a move from the result of the game (won/lost). This is called credit assignment. It is a very difficult task to determine whether a move in the sequence is positive or negative for the final outcome, because a game can be lost even when the early moves were optimal.

    But this is not the only design decision we have to make. It is possible that a teacher selects board states and provides the correct moves. Or the learner can select board states and ask the teacher for the correct moves. Or the learner can play against itself with no teacher present. In this case the learner selects board states autonomously and evaluates the moves with regard to the final outcome.

    Finally, the training experience should be similar to the real-world experience (performance P must be measured over the real-world examples). That means the training experience should represent the real world. We say "the distribution of training examples should follow a distribution similar to the test (real-world) examples". Tic Tac Toe is a very simple example, so in our scenario this should not be a difficult problem. But if we have a more complicated learning task (like the checkers task described in chapter 1 of [1]), this could be a serious issue. If a checkers learner just plays against itself (in the training phase), it might never encounter some important board states which it would need in the real world. In such a case we say "the distribution of training examples is not fully representative of the distribution of the real-world examples". In practice, it is often necessary to learn from a distribution of examples that is not fully representative. It is important to understand that mastering one distribution does not necessarily lead to good performance over some other distribution. And it is also important to know that most of modern machine learning theory is based on the assumption that the distribution of the training examples is similar to the distribution of the test examples. This assumption is necessary for the learner's ability to learn, but we have to keep in mind that in practice this assumption is often violated.

    For our learner we have decided that our system will train by playing games against itself. So now we have to define what type of knowledge will be learned. Our Tic Tac Toe system can generate every legal move from any board state, so it has to learn how to choose the best move from among the legal moves. These legal moves represent a search space, and we need the best search strategy. We call a function which chooses the best move from a given board state the target function. For Tic Tac Toe we define the target function this way: V : B -> R, where B is a legal board state and V maps a numeric score R to B. Better board states get a higher score, worse board states a lower score. So in our scenario, our learner has to learn this target function. To select the best move from a board state, the learner has to generate all possible successor board states and use V to choose the best board state (and so the best move).
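    This move selection can be sketched in a few lines (an illustrative Python fragment; the board representation and the function `legal_successors` are hypothetical, not from the project's source code):

```python
# Choosing the best move: generate all successor board states and pick
# the one the learned evaluation function scores highest.

def choose_best_move(board, legal_successors, v):
    # v is the (approximated) target function V' : board state -> score
    return max(legal_successors(board), key=v)

# toy usage: boards are plain lists, v just sums the entries
successors = lambda board: [board + [1], board + [0]]
v = lambda board: sum(board)
print(choose_best_move([], successors, v))  # [1]
```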
    Most real-world examples are too complex to learn V exactly. In general we are looking for an approximation of the target function. We call this function V'. There are many options for V'. For the Tic Tac Toe system we choose a linear combination for V'.

    V'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6 (1)

    where w0 through w6 are weights and the x's are so-called features:

    x1: number of blue marks
    x2: number of red marks
    x3: number of two in one row/column/diagonal (blue)
    x4: number of two in one row/column/diagonal (red)
    x5: number of blue marks in a winning position
    x6: number of red marks in a winning position
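    Evaluating equation (1) is then a one-liner (a Python sketch; the weight and feature values below are made-up example numbers, and the board-specific feature extraction is not shown):

```python
# Computing V'(b) as the linear combination (1):
# V'(b) = w0 + w1*x1 + ... + w6*x6

def v_approx(weights, features):
    # weights = [w0, w1, ..., w6]; features = [x1, ..., x6]
    score = weights[0]
    for w, x in zip(weights[1:], features):
        score += w * x
    return score

weights = [0.5, 1.0, -1.0, 3.0, -3.0, 50.0, -50.0]  # illustrative values
features = [3, 2, 1, 0, 0, 0]                       # x1..x6 for some board
print(v_approx(weights, features))  # 0.5 + 3 - 2 + 3 = 4.5
```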

    With this target function our learner just has to adjust the weights. This is our whole learning task. The weights determine the importance of each feature. So we complete our definition of the Tic Tac Toe learning task:

    Definition:
    Task T: playing Tic Tac Toe
    Performance P: percent of games won against opponents
    Training experience E: playing practice games against itself
    Target function: V:B -> R
    Target function representation: V'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6 (2)

    Note: With this definition we have reduced the learning of a Tic Tac Toe strategy to the adjustment of the weights (w0 through w6) in the target function representation.

    We have decided that our learner will train by playing against itself. So the only information the learner is able to access is whether the game was won or lost. We have said that the learner has to learn to choose the best move. Therefore the learner has to store every single board state and assign a score to each board state. The board with the best score represents the best move. It is very simple to assign a score to the last board (the board before the end of the game): if the game was won, we assign +100; if it was lost, we assign -100. So our next challenge is to assign a score to the intermediate board states. If a game was lost, it does not mean that every intermediate board state is bad. It is, for example, possible that a game was perfect and just the last move was fatally bad. In [1] a very surprising solution for this problem is presented:

    Vtrain(b) <- Vc(Successor(b))

    Vc is the current approximation of V, Successor denotes the next board state following b, and Vtrain is the training value of the board b. So, to summarize, we use the successor board state of b to calculate the score of the training board state b. The last thing we need is the learning algorithm to adjust the weights. We decide to use least mean squares, the LMS training rule. The LMS training rule helps us to minimize the error between the training values and the values of the current approximation. The algorithm adjusts the weights a small amount in the direction that reduces the error. Note that there are other algorithms, but for our problem this one is sufficient.
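    The bootstrapping of training values from successor states can be sketched as follows (illustrative Python; the toy board sequence and the stand-in approximation `v_current` are made up for this example):

```python
# Assigning training values to the boards of one finished game:
# the final board gets the game outcome (+100 / -100 / 0), every
# earlier board gets the CURRENT approximation's score of its
# successor: Vtrain(b) <- Vc(Successor(b)).

def training_values(boards, final_score, v_current):
    values = []
    for i in range(len(boards)):
        if i == len(boards) - 1:
            values.append(final_score)               # last board: outcome
        else:
            values.append(v_current(boards[i + 1]))  # score of successor
    return values

boards = [[0], [0, 1], [0, 1, 1]]     # toy board sequence of a lost game
v_current = lambda b: 10.0 * len(b)   # toy stand-in for Vc
print(training_values(boards, -100, v_current))  # [20.0, 30.0, -100]
```

This is the surprising part: the training signal for a board comes from the learner's own (still imperfect) estimate of the next board, anchored only by the true outcome at the end of the game.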

    Now we can design the Tic Tac Toe system:
    • learner plays against itself
    – calculates the features of every board state (xi )
    – calculates the score of every board state using the features
    – uses the current weights to choose the current best move
    • calculates the training scores for the boards (using the successor board state)
    – if game was won, set last training score to +100
    – if game was lost, set last training score to -100
    – if game was a tie, set last training score to 0
    • for each training score
    – adjust weights using:
    wi <- wi + n(Vtrain(b) - Vc(b)) * xi

    n is a small constant (e.g. 0.1). Vtrain(b) - Vc(b) is the error; we can see that we change the weights to reduce the error on the training examples.
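    A single LMS update step then looks like this (a Python sketch with made-up numbers; x0 = 1 is the constant feature belonging to w0):

```python
# One LMS weight update: w_i <- w_i + n * (Vtrain(b) - Vc(b)) * x_i

def lms_update(weights, features, v_train, eta=0.1):
    # Vc(b): current estimate as the linear combination of the features
    v_current = sum(w * x for w, x in zip(weights, features))
    error = v_train - v_current
    return [w + eta * error * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]    # w0, w1
features = [1.0, 2.0]   # x0 = 1 (constant feature), x1 = 2
updated = lms_update(weights, features, v_train=10.0)
print(updated)  # [1.0, 2.0]: error = 10, w0 += 0.1*10*1, w1 += 0.1*10*2
```

Repeating this step for every (board, training value) pair of every practice game is the entire training procedure of the learner.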

    The implementation of the TicTacToe-system contains three classes:
    1. Game: represents one game
    2. TicTacToeSimpleOpponent: A TicTacToe player, who randomly makes his moves
    3. TicTacToeLearner: uses the LMS training rule to learn playing TicTacToe (the training partner is TicTacToeSimpleOpponent)

    After some experiments with the parameters (number of training loops, ...) I got the following results: if two TicTacToeSimpleOpponents play against each other, the TicTacToeSimpleOpponent who starts will win around 59% of the games and lose 29%.

    A TicTacToeLearner which uses 70 rounds to train with a TicTacToeSimpleOpponent wins about 70% of the following games against TicTacToeSimpleOpponents.

    If we increase the training rounds to 500, it will win approximately 86% of the games and lose only 6% of them. We can increase the training rounds to 1500, but the result is just a marginal increase in won games.

  3. References
    [1] Machine Learning, Mitchell, Tom M. (1997), ISBN 0-07-115467-1
    [2] Tic Tac Toe, http://en.wikipedia.org/wiki/Tic_