Archive-name: ai-faq/neural-nets/part1
Last-modified: 1999-06-21
URL: ftp://ftp.sas.com/pub/neural/FAQ.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)
Copyright 1997, 1998, 1999 by Warren S. Sarle, Cary, NC, USA.
  ---------------------------------------------------------------
    Additions, corrections, or improvements are always welcome.
    Anybody who is willing to contribute any information,
    please email me; if it is relevant, I will incorporate it.

    The monthly posting departs on the 28th of every month.
  ---------------------------------------------------------------
This is the first of seven parts of a monthly posting to the Usenet newsgroup comp.ai.neural-nets (as well as comp.answers and news.answers, where it should be findable at any time). Its purpose is to provide basic information for individuals who are new to the field of neural networks or who are just beginning to read this group. It will help to avoid lengthy discussion of questions that often arise for beginners.
   SO, PLEASE, SEARCH THIS POSTING FIRST IF YOU HAVE A QUESTION
                           and
   DON'T POST ANSWERS TO FAQs: POINT THE ASKER TO THIS POSTING
The latest version of the FAQ is available as a hypertext document, readable by any WWW (World Wide Web) browser such as Mosaic, under the URL: "ftp://ftp.sas.com/pub/neural/FAQ.html".

 These postings are archived in the periodic posting archive on host rtfm.mit.edu (and on some other hosts as well). Look in the anonymous ftp directory "/pub/usenet/news.answers/ai-faq/neural-nets" under the file names "part1", "part2", ... "part7". If you do not have anonymous ftp access, you can access the archives by mail server as well. Send an E-mail message to mail-server@rtfm.mit.edu with "help" and "index" in the body on separate lines for more information. You can also go to http://www.deja.com/ and look for posts containing "Neural Network FAQ" in comp.ai.neural-nets.

For those of you who read this FAQ anywhere other than in Usenet: To read comp.ai.neural-nets (or post articles to it) you need Usenet News access. Try the commands, 'xrn', 'rn', 'nn', or 'trn' on your Unix machine, 'news' on your VMS machine, or ask a local guru. WWW browsers are often set up for Usenet access, too--try the URL news:comp.ai.neural-nets.

The FAQ posting departs to comp.ai.neural-nets on the 28th of every month. It is also sent to the groups comp.answers and news.answers where it should be available at any time (ask your news manager). The FAQ posting, like any other posting, may take a few days to find its way over Usenet to your site. Such delays are especially common outside of North America.

All changes to the FAQ from the previous month are shown in another monthly posting having the subject `changes to "comp.ai.neural-nets FAQ" -- monthly posting', which immediately follows the FAQ posting. The `changes' post contains the full text of all changes. There is also a weekly post with the subject "comp.ai.neural-nets FAQ: weekly reminder" that briefly describes any major changes to the FAQ.

This FAQ is not meant to discuss any topic exhaustively.

 Disclaimer:

This posting is provided 'as is'. No warranty whatsoever is expressed or implied, in particular, no warranty that the information contained herein is correct or useful in any way, although both are intended.
To find the answer to question "x", search for the string "Subject: x"
 
 

========== Questions ==========

Part 1: Introduction
What is this newsgroup for? How shall it be used?
What if my question is not answered in the FAQ?
Where is comp.ai.neural-nets archived?
May I copy this FAQ?
What is a neural network (NN)?
Where can I find a simple introduction to NNs?
What can you do with an NN and what not?
Who is concerned with NNs?
How many kinds of NNs exist?
How many kinds of Kohonen networks exist? (And what is k-means?)
How are layers counted?
What are cases and variables?
What are the population, sample, training set, design set, validation set, and test set?
How are NNs related to statistical methods?
Part 2: Learning
What is backprop?
What learning rate should be used for backprop?
What are conjugate gradients, Levenberg-Marquardt, etc.?
How should categories be coded?
Why use a bias/threshold?
Why use activation functions?
How to avoid overflow in the logistic function?
What is a softmax activation function?
What is the curse of dimensionality?
How do MLPs compare with RBFs?
What are OLS and subset/stepwise regression?
Should I normalize/standardize/rescale the data?
Should I nonlinearly transform the data?
How to measure importance of inputs?
What is ART?
What is PNN?
What is GRNN?
What does unsupervised learning learn?
Part 3: Generalization
How is generalization possible?
How does noise affect generalization?
What is overfitting and how can I avoid it?
What is jitter? (Training with noise)
What is early stopping?
What is weight decay?
What is Bayesian learning?
How many hidden layers should I use?
How many hidden units should I use?
How can generalization error be estimated?
What are cross-validation and bootstrapping?
How to compute prediction and confidence intervals (error bars)?
Part 4: Books, data, etc.
Books and articles about Neural Networks?
Journals and magazines about Neural Networks?
Conferences and Workshops on Neural Networks?
Neural Network Associations?
On-line and machine-readable information about NNs?
How to benchmark learning methods?
Databases for experimentation with NNs?
Part 5: Free software
Freeware and shareware packages for NN simulation?
Part 6: Commercial software
Commercial software packages for NN simulation?
Part 7: Hardware and miscellaneous
Neural Network hardware?
What are some applications of NNs?
General
Agriculture
Chemistry
Finance and economics
Games
Industry
Materials science
Medicine
Music
Robotics
Weather forecasting
Weird
What to do with missing/incomplete data?
How to forecast time series (temporal sequences)?
How to learn an inverse of a function?
How to get invariant recognition of images under translation, rotation, etc.?
How to recognize handwritten characters?
What about Genetic Algorithms and Evolutionary Computation?
What about Fuzzy Logic?
Unanswered FAQs
Other NN links?
------------------------------------------------------------------------

Subject: What is this newsgroup for? How shall it be used?

The newsgroup comp.ai.neural-nets is intended as a forum for people who want to use or explore the capabilities of Artificial Neural Networks or Neural-Network-like structures.

 Posts should be in plain-text format, not postscript, html, rtf, TEX, MIME, or any word-processor format.

Do not use vcards or other excessively long signatures.

Please do not post homework or take-home exam questions. Please do not post a long source-code listing and ask readers to debug it. And note that chain letters and other get-rich-quick pyramid schemes are illegal in the USA; for example, see http://www.usps.gov/websites/depart/inspect/chainlet.htm

 There should be the following types of articles in this newsgroup:
 
 

Requests

Requests are articles of the form "I am looking for X", where X is something public such as a book, an article, or a piece of software. The most important thing about such a request is to be as specific as possible!

 If multiple different answers can be expected, the person making the request should be prepared to summarize the answers he/she receives and should announce this intention with a phrase like "Please reply by email, I'll summarize to the group" at the end of the posting.

 The Subject line of the posting should then be something like "Request: X"

Questions

As opposed to requests, questions ask for a larger piece of information or a more or less detailed explanation of something. To avoid lots of redundant traffic, it is important that the poster provide, along with the question, all information s/he already has about the subject and state the actual question as precisely and narrowly as possible. The poster should be prepared to summarize the answers s/he receives and should announce this intention with a phrase like "Please reply by email, I'll summarize to the group" at the end of the posting.

 The Subject line of the posting should be something like "Question: this-and-that" or have the form of a question (i.e., end with a question mark)

Students: please do not ask comp.ai.neural-net readers to do your homework or take-home exams for you.

Answers

These are reactions to questions or requests. If an answer is too specific to be of general interest, or if a summary was announced with the question or request, the answer should be e-mailed to the poster, not posted to the newsgroup.

Most news-reader software automatically provides a subject line beginning with "Re:" followed by the subject of the article which is being followed-up. Note that sometimes longer threads of discussion evolve from an answer to a question or request. In this case posters should change the subject line suitably as soon as the topic goes too far away from the one announced in the original subject line. You can still carry along the old subject in parentheses in the form "Re: new subject (was: old subject)"

Summaries

Whenever the answers to a request or question can be assumed to be of some general interest, the poster of the request or question should summarize the answers he/she received. Such a summary should be announced in the original posting of the question or request with a phrase like "Please answer by email, I'll summarize"

 In such a case, people who answer a question should NOT post their answers to the newsgroup but instead mail them to the poster of the question, who collects and reviews them. About 5 to 20 days after the original posting, its poster should compile the summary of answers and post it to the newsgroup.

 Some care should be invested in a summary:

  • simple concatenation of all the answers is not enough: instead, redundancies, irrelevancies, verbosities, and errors should be filtered out (as well as possible)
  • the answers should be separated clearly
  • the contributors of the individual answers should be identifiable (unless they requested to remain anonymous [yes, that happens])
  • the summary should start with the "quintessence" of the answers, as seen by the original poster
  • A summary should, when posted, clearly be indicated to be one by giving it a Subject line starting with "SUMMARY:"
  • Note that a good summary is pure gold for the rest of the newsgroup community, so summary work will be most appreciated by all of us. Good summaries are more valuable than any moderator! :-)

    Announcements

    Some articles never need any public reaction. These are called announcements (for instance, of a workshop, a conference, or the availability of some technical report or software system).

     Announcements should be clearly indicated to be such by giving them a subject line of the form "Announcement: this-and-that"

    Reports

    Sometimes people spontaneously want to report something to the newsgroup. This might be special experiences with some software, results of one's own experiments or conceptual work, or especially interesting information from somewhere else.

     Reports should be clearly indicated to be such by giving them a subject line of the form "Report: this-and-that"

    Discussions

    An especially valuable feature of Usenet is, of course, the possibility of discussing a certain topic with hundreds of potential participants. All traffic in the newsgroup that cannot be subsumed under one of the above categories should belong to a discussion.

     If somebody explicitly wants to start a discussion, he/she can do so by giving the posting a subject line of the form "Discussion: this-and-that"

     It is quite difficult to keep a discussion from drifting into chaos, but, unfortunately, as many other newsgroups show, there seems to be no secure way to avoid this. On the other hand, comp.ai.neural-nets has not had many problems with this effect in the past, so let's just go and hope...

    Job Ads

    Advertisements for jobs requiring expertise in artificial neural networks are appropriate in comp.ai.neural-nets. Job ads should be clearly indicated to be such by giving them a subject line of the form "Job: this-and-that". It is also useful to include the country-state-city abbreviations that are conventional in misc.jobs.offered, such as: "Job: US-NY-NYC Neural network engineer". If an employer has more than one job opening, all such openings should be listed in a single post, not multiple posts. Job ads should not be reposted more than once per month.
     
     
    ------------------------------------------------------------------------

    Subject: What if my question is not answered in the FAQ?

    If your question is not answered in the FAQ, you can try a web search. The following search engines are especially useful:
    http://www.google.com/
    http://search.yahoo.com/
    http://www.altavista.com/
    http://www.deja.com/

     Another excellent web site on NNs is Donald Tveter's Backpropagator's Review at http://www.dontveter.com/bpr/bpr.html or http://gannoo.uce.ac.uk/bpr/bpr.html.

    For feedforward NNs, the best reference book is:

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
    If the answer isn't in Bishop, then for more theoretical questions try:
    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
    For more practical questions about MLP training, try:
    Masters, T. (1993). Practical Neural Network Recipes in C++, San Diego: Academic Press.

    Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press.

    There are many more excellent books and web sites listed in the Neural Network FAQ, Part 4: Books, data, etc.
    ------------------------------------------------------------------------

    Subject: Where is comp.ai.neural-nets archived?

    The following archives are available for comp.ai.neural-nets:
  • Deja News at http://www.deja.com/
  • ftp://ftp.cs.cmu.edu/user/ai/pubs/news/comp.ai.neural-nets
  • http://asknpac.npac.syr.edu

    According to Gang Cheng, gcheng@npac.syr.edu, the Northeast Parallel Architecture Center (NPAC), Syracuse University, maintains an archive system for searching/reading USENET newsgroups and mailing lists. Two search/navigation interfaces accessible by any WWW browser are provided: one is an advanced search interface allowing queries with various options such as query by mail header, by date, by subject (keywords), by sender. The other is a Hypermail-like navigation interface for users familiar with Hypermail.

    For more information on newsgroup archives, see http://starbase.neosoft.com/~claird/news.lists/newsgroup_archives.html
    ------------------------------------------------------------------------

    Subject: May I copy this FAQ?

    The intent in providing a FAQ is to make the information freely available to whoever needs it. You may copy all or part of the FAQ, but please be sure to include a reference to the URL of the master copy, ftp://ftp.sas.com/pub/neural/FAQ.html, and do not sell copies of the FAQ. If you want to include information from the FAQ in your own web site, it is better to include links to the master copy rather than to copy text from the FAQ to your web pages, because various answers in the FAQ are updated at unpredictable times. To cite the FAQ in an academic-style bibliography, use something along the lines of:
    Sarle, W.S., ed. (1997), Neural Network FAQ, part 1 of 7: Introduction, periodic posting to the Usenet newsgroup comp.ai.neural-nets, URL: ftp://ftp.sas.com/pub/neural/FAQ.html
    ------------------------------------------------------------------------

    Subject: What is a neural network (NN)?

    First of all, when we are talking about a neural network, we should more properly say "artificial neural network" (ANN), because that is what we mean most of the time in comp.ai.neural-nets. Biological neural networks are much more complicated than the mathematical models we use for ANNs. But it is customary to be lazy and drop the "A" or the "artificial".

    There is no universally accepted definition of an NN. But perhaps most people in the field would agree that an NN is a network of many simple processors ("units"), each possibly having a small amount of local memory. The units are connected by communication channels ("connections") which usually carry numeric (as opposed to symbolic) data, encoded by any of various means. The units operate only on their local data and on the inputs they receive via the connections. The restriction to local operations is often relaxed during training.

    Some NNs are models of biological neural networks and some are not, but historically, much of the inspiration for the field of NNs came from the desire to produce artificial systems capable of sophisticated, perhaps "intelligent", computations similar to those that the human brain routinely performs, and thereby possibly to enhance our understanding of the human brain.

    Most NNs have some sort of "training" rule whereby the weights of connections are adjusted on the basis of data. In other words, NNs "learn" from examples (as children learn to recognize dogs from examples of dogs) and exhibit some capability for generalization beyond the training data.

    NNs normally have great potential for parallelism, since the computations of the components are largely independent of each other. Some people regard massive parallelism and high connectivity to be defining characteristics of NNs, but such requirements rule out various simple models, such as simple linear regression (a minimal feedforward net with only two units plus bias), which are usefully regarded as special cases of NNs.

    Here is a sampling of definitions from the books on the FAQ maintainer's shelf. None will please everyone. Perhaps for that reason many NN textbooks do not explicitly define neural networks.

    According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60):

    ... a neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes.
    According to Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, NY: Macmillan, p. 2:
      A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
      1. Knowledge is acquired by the network through a learning process.
      2. Interneuron connection strengths known as synaptic weights are used to store the knowledge.
    According to Nigrin, A. (1993), Neural Networks for Pattern Recognition, Cambridge, MA: The MIT Press, p. 11:
    A neural network is a circuit composed of a very large number of simple processing elements that are neurally based. Each element operates only on local information. Furthermore each element operates asynchronously; thus there is no overall system clock.
    According to Zurada, J.M. (1992), Introduction To Artificial Neural Systems, Boston: PWS Publishing Company, p. xv:
    Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experiential knowledge.
    ------------------------------------------------------------------------

    Subject: Where can I find a simple introduction to NNs?

    Several excellent introductory books on NNs are listed in part 4 of the FAQ under "Books and articles about Neural Networks?" If you want a book with minimal math, see "The best introductory book for business executives."

    Dr. Leslie Smith has an on-line introduction to NNs, with examples and diagrams, at http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html.

    Another excellent introduction to NNs is Donald Tveter's Backpropagator's Review at http://www.dontveter.com/bpr/bpr.html or http://gannoo.uce.ac.uk/bpr/bpr.html, which contains both answers to additional FAQs and an annotated neural net bibliography emphasizing on-line articles.

    More introductory material is available on line from the "DACS Technical Report Summary: Artificial Neural Networks Technology" at http://www.utica.kaman.com/techs/neural/neural.html
    and "ARTIFICIAL INTELLIGENCE FOR THE BEGINNER" at http://home.clara.net/mll/ai/

    StatSoft Inc. has an on-line Electronic Statistics Textbook, at http://www.statsoft.com/textbook/stathome.html that includes a chapter on neural nets.

    ------------------------------------------------------------------------

    Subject: What can you do with an NN and what not?

    In principle, NNs can compute any computable function, i.e. they can do everything a normal digital computer can do.

    In practice, NNs are especially useful for classification and function approximation/mapping problems which are tolerant of some imprecision, which have lots of training data available, but to which hard and fast rules (such as those that might be used in an expert system) cannot easily be applied. Almost any vector function on a compact set can be approximated to arbitrary precision by feedforward NNs (which are the type most often used in practical applications) if you have enough data and enough computing resources.

    To be somewhat more precise, feedforward networks with a single hidden layer are statistically consistent estimators of arbitrary measurable, square-integrable regression functions under certain practically-satisfiable assumptions regarding sampling, target noise, number of hidden units, size of weights, and form of hidden-unit activation function (White, 1990). Such networks can also be trained as statistically consistent estimators of derivatives of regression functions (White and Gallant, 1992) and quantiles of the conditional noise distribution (White, 1992a). Feedforward networks with a single hidden layer using threshold or sigmoid activation functions are universally consistent estimators of binary classifications (Farag\'o and Lugosi, 1993; Lugosi and Zeger 1995; Devroye, Gy\"orfi, and Lugosi, 1996) under similar assumptions.

    Unfortunately, the above consistency results depend on one impractical assumption: that the networks are trained by an error (L_p error or misclassification rate) minimization technique that comes arbitrarily close to the global minimum. Such minimization is computationally intractable except in small or simple problems (Judd, 1990). In practice, however, you can usually get good results without doing a full-blown global optimization; e.g., using multiple (say, 10 to 1000) random weight initializations is usually sufficient.

    One example of a function that a typical neural net cannot learn is Y=1/X on the open interval (0,1), since an open interval is not a compact set. With any bounded output activation function, the error will get arbitrarily large as the input approaches zero. Of course, you could make the output activation function a reciprocal function and easily get a perfect fit, but neural networks are most often used in situations where you do not have enough prior knowledge to set the activation function in such a clever way. There are also many other important problems that are so difficult that a neural network will be unable to learn them without memorizing the entire training set, such as:

  • Predicting random or pseudo-random numbers.
  • Factoring large integers.
  • Determining whether a large integer is prime or composite.
  • Decrypting anything encrypted by a good algorithm.
    NNs are, at least today, difficult to apply successfully to problems that concern manipulation of symbols and memory. And there are no methods for training NNs that can magically create information that is not contained in the training data.

    As for simulating human consciousness and emotion, that's still in the realm of science fiction. Consciousness is still one of the world's great mysteries. Artificial NNs may be useful for modeling some aspects of or prerequisites for consciousness, such as perception and cognition, but ANNs provide no insight so far into what Chalmers (1996, p. xi) calls the "hard problem":

    Many books and articles on consciousness have appeared in the past few years, and one might think we are making progress. But on a closer look, most of this work leaves the hardest problems about consciousness untouched. Often, such work addresses what might be called the "easy problems" of consciousness: How does the brain process environmental stimulation? How does it integrate information? How do we produce reports on internal states? These are important questions, but to answer them is not to solve the hard problem: Why is all this processing accompanied by an experienced inner life?
    For more information on consciousness, see the on-line journal Psyche at http://psyche.cs.monash.edu.au/index.html.

    For examples of specific applications of NNs, see "What are some applications of NNs?"

    References:

    Chalmers, D.J. (1996), The Conscious Mind: In Search of a Fundamental Theory, NY: Oxford University Press.

    Devroye, L., Gy\"orfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer.

    Farag\'o, A. and Lugosi, G. (1993), "Strong Universal Consistency of Neural Network Classifiers," IEEE Transactions on Information Theory, 39, 1146-1151.

    Judd, J.S. (1990), Neural Network Design and the Complexity of Learning, Cambridge, MA: The MIT Press.

    Lugosi, G., and Zeger, K. (1995), "Nonparametric Estimation via Empirical Risk Minimization," IEEE Transactions on Information Theory, 41, 677-678.

    White, H. (1990), "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3, 535-550. Reprinted in White (1992b).

    White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics, Alexandria, VA: American Statistical Association, pp. 190-199. Reprinted in White (1992b).

    White, H. (1992b), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.

    White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks, 5, 129-138. Reprinted in White (1992b).

    ------------------------------------------------------------------------

    Subject: Who is concerned with NNs?

    Neural networks are of interest to quite a lot of very different people:
  • Computer scientists want to find out about the properties of non-symbolic information processing with neural nets and about learning systems in general.
  • Statisticians use neural nets as flexible, nonlinear regression and classification models.
  • Engineers of many kinds exploit the capabilities of neural networks in many areas, such as signal processing and automatic control.
  • Cognitive scientists view neural networks as a possible apparatus to describe models of thinking and consciousness (High-level brain function).
  • Neuro-physiologists use neural networks to describe and explore medium-level brain function (e.g. memory, sensory system, motorics).
  • Physicists use neural networks to model phenomena in statistical mechanics and for a lot of other tasks.
  • Biologists use Neural Networks to interpret nucleotide sequences.
  • Philosophers and some other people may also be interested in Neural Networks for various reasons.
    For world-wide lists of groups doing research on NNs, see the Foundation for Neural Networks' (SNN) page at http://www.mbfys.kun.nl/snn/pointers/groups.html and see Neural Networks Research on the IEEE Neural Network Council's homepage http://www.ieee.org/nnc.
    ------------------------------------------------------------------------

    Subject: How many kinds of NNs exist?

    There are many many kinds of NNs by now. Nobody knows exactly how many. New ones (or at least variations of existing ones) are invented every week. Below is a collection of some of the most well known methods, not claiming to be complete.

     The main categorization of these methods is the distinction between supervised and unsupervised learning:

  • In supervised learning, there is a "teacher" who in the learning phase "tells" the net how well it performs ("reinforcement learning") or what the correct behavior would have been ("fully supervised learning").
  • In unsupervised learning the net is autonomous: it just looks at the data it is presented with, finds out about some of the properties of the data set and learns to reflect these properties in its output. What exactly these properties are, that the network can learn to recognise, depends on the particular network model and learning method. Usually, the net learns some compressed representation of the data.
     Many of these learning methods are closely connected with a certain (class of) network topology.

     Now here is the list, just giving some names:
     
     

    1. UNSUPERVISED LEARNING (i.e. without a "teacher"):
         1). Feedback Nets:
            a). Additive Grossberg (AG)
            b). Shunting Grossberg (SG)
            c). Binary Adaptive Resonance Theory (ART1)
            d). Analog Adaptive Resonance Theory (ART2, ART2a)
            e). Discrete Hopfield (DH)
            f). Continuous Hopfield (CH)
            g). Discrete Bidirectional Associative Memory (BAM)
            h). Temporal Associative Memory (TAM)
            i). Adaptive Bidirectional Associative Memory (ABAM)
            j). Kohonen Self-organizing Map/Topology-preserving map (SOM/TPM)
            k). Competitive learning
         2). Feedforward-only Nets:
            a). Learning Matrix (LM)
            b). Driver-Reinforcement Learning (DR)
            c). Linear Associative Memory (LAM)
            d). Optimal Linear Associative Memory (OLAM)
            e). Sparse Distributed Associative Memory (SDM)
            f). Fuzzy Associative Memory (FAM)
            g). Counterpropagation (CPN)
    
    2. SUPERVISED LEARNING (i.e. with a "teacher"):
         1). Feedback Nets:
            a). Brain-State-in-a-Box (BSB)
            b). Fuzzy Cognitive Map (FCM)
            c). Boltzmann Machine (BM)
            d). Mean Field Annealing (MFT)
            e). Recurrent Cascade Correlation (RCC)
            f). Backpropagation through time (BPTT)
            g). Real-time recurrent learning (RTRL)
            h). Recurrent Extended Kalman Filter (EKF)
         2). Feedforward-only Nets:
            a). Perceptron
            b). Adaline, Madaline
            c). Backpropagation (BP)
            d). Cauchy Machine (CM)
            e). Adaptive Heuristic Critic (AHC)
            f). Time Delay Neural Network (TDNN)
            g). Associative Reward Penalty (ARP)
            h). Avalanche Matched Filter (AMF)
            i). Backpercolation (Perc)
            j). Artmap
            k). Adaptive Logic Network (ALN)
            l). Cascade Correlation (CasCor)
            m). Extended Kalman Filter (EKF)
            n). Learning Vector Quantization (LVQ)
            o). Probabilistic Neural Network (PNN)
            p). General Regression Neural Network (GRNN)
    ------------------------------------------------------------------------

    Subject: How many kinds of Kohonen networks exist? (And what is k-means?)

    Teuvo Kohonen is one of the most famous and prolific researchers in neurocomputing, and he has invented a variety of networks. But many people refer to "Kohonen networks" without specifying which kind of Kohonen network, and this lack of precision can lead to confusion. The phrase "Kohonen network" most often refers to one of the following three types of networks:
  • VQ: Vector Quantization--competitive networks that can be viewed as unsupervised density estimators or autoassociators (Kohonen, 1995/1997; Hecht-Nielsen 1990), closely related to k-means cluster analysis (MacQueen, 1967; Anderberg, 1973). Each competitive unit corresponds to a cluster, the center of which is called a "codebook vector". Kohonen's learning law is an on-line algorithm that finds the codebook vector closest to each training case and moves the "winning" codebook vector closer to the training case. The codebook vector is moved a certain proportion of the distance between it and the training case, the proportion being specified by the learning rate, that is:
       new_codebook = old_codebook * (1 - learning_rate)
                    + data * learning_rate
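
    As a concrete illustration, here is a minimal sketch of this update in Python with NumPy (this code is not part of the FAQ or of any particular package; the array shapes and the fixed learning rate are assumptions made for the example):

      import numpy as np

      def kohonen_vq_step(codebooks, case, learning_rate):
          # Find the codebook vector closest to the training case
          # (squared Euclidean distance).
          winner = np.argmin(np.sum((codebooks - case) ** 2, axis=1))
          # new_codebook = old_codebook*(1 - learning_rate) + data*learning_rate
          codebooks[winner] = (codebooks[winner] * (1 - learning_rate)
                               + case * learning_rate)
          return winner

    Here codebooks is a (number of clusters) by (number of inputs) array and case is the vector of input values for one training case.
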
    Numerous similar algorithms have been developed in the neural net and machine learning literature; see Hecht-Nielsen (1990) for a brief historical overview, and Kosko (1992) for a more technical overview of competitive learning.

    MacQueen's on-line k-means algorithm is essentially the same as Kohonen's learning law except that the learning rate is the reciprocal of the number of cases that have been assigned to the winning cluster. Suppose that when processing a given training case, N cases have been previously assigned to the winning codebook vector. Then the codebook vector is updated as:

       new_codebook = old_codebook * N/(N+1)
                    + data * 1/(N+1)
    This reduction of the learning rate makes each codebook vector the mean of all cases assigned to its cluster and guarantees convergence of the algorithm to an optimum value of the error function (the sum of squared Euclidean distances between cases and codebook vectors) as the number of training cases goes to infinity. Kohonen's learning law with a fixed learning rate does not converge. As is well known from stochastic approximation theory, convergence requires the sum of the infinite sequence of learning rates to be infinite, while the sum of squared learning rates must be finite (Kohonen, 1995, p. 34). These requirements are satisfied by MacQueen's k-means algorithm.
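
    For comparison, a corresponding sketch of MacQueen's on-line update (again illustrative only; keeping the cluster counts in a separate array is an implementation choice made for the example):

      import numpy as np

      def macqueen_kmeans_step(codebooks, counts, case):
          # Find the winning codebook vector.
          winner = np.argmin(np.sum((codebooks - case) ** 2, axis=1))
          n = counts[winner]       # cases previously assigned to the winner
          # new_codebook = old_codebook*N/(N+1) + data*1/(N+1)
          codebooks[winner] = codebooks[winner] * n / (n + 1.0) + case / (n + 1.0)
          counts[winner] = n + 1
          return winner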

    Kohonen VQ is often used for off-line learning, in which case the training data are stored and Kohonen's learning law is applied to each case in turn, cycling over the data set many times (incremental training). Convergence to a local optimum can be obtained as the training time goes to infinity if the learning rate is reduced in a suitable manner as described above. However, there are off-line k-means algorithms, both batch and incremental, that converge in a finite number of iterations (Anderberg, 1973; Hartigan, 1975; Hartigan and Wong, 1979). The batch algorithms such as Forgy's (1965; Anderberg, 1973) have the advantage for large data sets, since the incremental methods require you either to store the cluster membership of each case or to do two nearest-cluster computations as each case is processed. Forgy's algorithm is a simple alternating least-squares algorithm consisting of the following steps:
     
     

  • Initialize the codebook vectors.
  • Repeat the following two steps until convergence:
    A. Read the data, assigning each case to the nearest (using Euclidean distance) codebook vector.
    B. Replace each codebook vector with the mean of the cases that were assigned to it.
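
    A minimal sketch of Forgy's batch algorithm as just described (not taken from any particular package; stopping when the cluster assignments no longer change is one reasonable convergence test, and empty clusters are simply left unchanged):

      import numpy as np

      def forgy_kmeans(data, codebooks, max_iter=100):
          assignments = None
          for _ in range(max_iter):
              # A. Assign each case to the nearest codebook vector.
              dists = ((data[:, None, :] - codebooks[None, :, :]) ** 2).sum(axis=2)
              new_assignments = dists.argmin(axis=1)
              if assignments is not None and np.array_equal(assignments, new_assignments):
                  break                     # assignments stable: converged
              assignments = new_assignments
              # B. Replace each codebook vector with the mean of its cases.
              for j in range(len(codebooks)):
                  members = data[assignments == j]
                  if len(members) > 0:
                      codebooks[j] = members.mean(axis=0)
          return codebooks, assignments
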
    Fastest training is usually obtained if MacQueen's on-line algorithm is used for the first pass and off-line k-means algorithms are applied on subsequent passes. However, these training methods do not necessarily converge to a global optimum of the error function. The chance of finding a global optimum can be improved by using rational initialization (SAS Institute, 1989, pp. 824-825), multiple random initializations, or various time-consuming training methods intended for global optimization (Ismail and Kamel, 1989; Zeger, Vaisey, and Gersho, 1992).

    VQ has been a popular topic in the signal processing literature, which has been largely separate from the literature on Kohonen networks and from the cluster analysis literature in statistics and taxonomy. In signal processing, on-line methods such as Kohonen's and MacQueen's are called "adaptive vector quantization" (AVQ), while off-line k-means methods go by the names of "Lloyd-Max" (Lloyd, 1982; Max, 1960) and "LBG" (Linde, Buzo, and Gray, 1980). There is a recent textbook on VQ by Gersho and Gray (1992) that summarizes these algorithms as information compression methods.

    Kohonen's work emphasized VQ as density estimation and hence the desirability of equiprobable clusters (Kohonen 1984; Hecht-Nielsen 1990). However, Kohonen's learning law does not produce equiprobable clusters--that is, the proportions of training cases assigned to each cluster are not usually equal. If there are I inputs and the number of clusters is large, the density of the codebook vectors approximates the I/(I+2) power of the density of the training data (Kohonen, 1995, p. 35; Ripley, 1996, p. 202; Zador, 1982), so the clusters are approximately equiprobable only if the data density is uniform or the number of inputs is large. The most popular method for obtaining equiprobability is Desieno's (1988) algorithm which adds a "conscience" value to each distance prior to the competition. The conscience value for each cluster is adjusted during training so that clusters that win more often have larger conscience values and are thus handicapped to even out the probabilities of winning in later iterations.

    Kohonen's learning law is an approximation to the k-means model, which is an approximation to normal mixture estimation by maximum likelihood assuming that the mixture components (clusters) all have spherical covariance matrices and equal sampling probabilities. Hence if the population contains clusters that are not equiprobable, k-means will tend to produce sample clusters that are more nearly equiprobable than the population clusters. Corrections for this bias can be obtained by maximizing the likelihood without the assumption of equal sampling probabilities (Symons, 1981). Such corrections are similar to conscience but have the opposite effect.

    In cluster analysis, the purpose is not to compress information but to recover the true cluster memberships. K-means differs from mixture models in that, for k-means, the cluster membership for each case is considered a separate parameter to be estimated, while mixture models estimate a posterior probability for each case based on the means, covariances, and sampling probabilities of each cluster. Balakrishnan, Cooper, Jacob, and Lewis (1994) found that k-means algorithms recovered cluster membership more accurately than Kohonen VQ.

  • SOM: Self-Organizing Map--competitive networks that provide a "topological" mapping from the input space to the clusters (Kohonen, 1995). The SOM was inspired by the way in which various human sensory impressions are neurologically mapped into the brain such that spatial or other relations among stimuli correspond to spatial relations among the neurons. In a SOM, the neurons (clusters) are organized into a grid--usually two-dimensional, but sometimes one-dimensional or (rarely) three- or more-dimensional. The grid exists in a space that is separate from the input space; any number of inputs may be used as long as the number of inputs is greater than the dimensionality of the grid space. A SOM tries to find clusters such that any two clusters that are close to each other in the grid space have codebook vectors close to each other in the input space. But the converse does not hold: codebook vectors that are close to each other in the input space do not necessarily correspond to clusters that are close to each other in the grid. Another way to look at this is that a SOM tries to embed the grid in the input space such that every training case is close to some codebook vector, but the grid is bent or stretched as little as possible. Yet another way to look at it is that a SOM is a (discretely) smooth mapping between regions in the input space and points in the grid space. The best way to understand this is to look at the pictures in Kohonen (1995) or various other NN textbooks.

    The Kohonen algorithm for SOMs is very similar to the Kohonen algorithm for AVQ. Let the codebook vectors be indexed by a subscript j, and let the index of the codebook vector nearest to the current training case be n. The Kohonen SOM algorithm requires a kernel function K(j,n), where K(j,j)=1 and K(j,n) is usually a non-increasing function of the distance (in any metric) between codebook vectors j and n in the grid space (not the input space). Usually, K(j,n) is zero for codebook vectors that are far apart in the grid space. As each training case is processed, all the codebook vectors are updated as:

       new_codebook_j = old_codebook_j * [1 - K(j,n) * learning_rate]
                      + data * K(j,n) * learning_rate
    The kernel function does not necessarily remain constant during training. The neighborhood of a given codebook vector is the set of codebook vectors for which K(j,n)>0. To avoid poor results (akin to local minima), it is usually advisable to start with a large neighborhood, and let the neighborhood gradually shrink during training. If K(j,n)=0 for j not equal to n, then the SOM update formula reduces to the formula for Kohonen vector quantization. In other words, if the neighborhood size (for example, the radius of the support of the kernel function K(j,n)) is zero, the SOM algorithm degenerates into simple VQ. Hence it is important not to let the neighborhood size shrink all the way to zero during training. Indeed, the choice of the final neighborhood size is the most important tuning parameter for SOM training, as we will see shortly.
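
    Here is a minimal sketch of one on-line SOM update along these lines (illustrative only; the Gaussian form of the kernel K(j,n) in the grid space is an assumption made for the example, since, as noted below, the exact shape of the kernel is not crucial):

      import numpy as np

      def som_step(codebooks, grid_coords, case, learning_rate, radius):
          # Index n of the codebook vector nearest the case in the INPUT space.
          n = np.argmin(np.sum((codebooks - case) ** 2, axis=1))
          # Kernel K(j,n) computed from distances in the GRID space; K(n,n) = 1.
          grid_dist2 = np.sum((grid_coords - grid_coords[n]) ** 2, axis=1)
          K = np.exp(-grid_dist2 / (2.0 * radius ** 2))
          # new_codebook_j = old_codebook_j*[1 - K(j,n)*lr] + data*K(j,n)*lr
          codebooks += (case - codebooks) * (K * learning_rate)[:, None]
          return n

    Here grid_coords[j] is the fixed position of codebook vector j in the grid space, while codebooks[j] is its position in the input space. Making the radius so small that K(j,n) is effectively zero for j not equal to n reduces this update to the VQ update above.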

    A SOM works by smoothing the codebook vectors in a manner similar to kernel estimation methods, but the smoothing is done in neighborhoods in the grid space rather than in the input space (Mulier and Cherkassky 1995). This is easier to see in a batch algorithm for SOMs, which is similar to Forgy's algorithm for batch k-means, but incorporates an extra smoothing process:

  • Initialize the codebook vectors.
  • Repeat the following two steps until convergence or boredom:
    A. Read the data, assigning each case to the nearest (using Euclidean distance) codebook vector. While you are doing this, keep track of the mean and the number of cases for each cluster.
    B. Do a nonparametric regression using K(j,n) as a kernel function, with the grid points as inputs, the cluster means as target values, and the number of cases in each cluster as a case weight. Replace each codebook vector with the output of the nonparametric regression function evaluated at its grid point.

    If the nonparametric regression method is Nadaraya-Watson kernel regression (see What is GRNN?), the batch SOM algorithm produces essentially the same results as Kohonen's algorithm, barring local minima. The main difference is that the batch algorithm often converges. Mulier and Cherkassky (1995) note that other nonparametric regression methods can be used to provide superior SOM algorithms. In particular, local-linear smoothing eliminates the notorious "border effect", whereby the codebook vectors near the border of the grid are compressed in the input space. The border effect is especially problematic when you try to use a high degree of smoothing in a Kohonen SOM, since all the codebook vectors will collapse into the center of the input space. The SOM border effect has the same mathematical cause as the "boundary effect" in kernel regression, which causes the estimated regression function to flatten out near the edges of the regression input space. There are various cures for the edge effect in nonparametric regression, of which local-linear smoothing is the simplest (Wand and Jones, 1995). Hence, local-linear smoothing also cures the border effect in SOMs. Furthermore, local-linear smoothing is much more general and reliable than the heuristic weighting rule proposed by Kohonen (1995, p. 129).
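
    The following is a minimal sketch of one pass of such a batch SOM, using Nadaraya-Watson kernel regression over the grid with cluster counts as case weights (illustrative only; the Gaussian grid-space kernel and the small constant added to avoid division by zero are assumptions of the example, and local-linear smoothing would replace the weighted-average step at the end):

      import numpy as np

      def batch_som_pass(data, codebooks, grid_coords, radius):
          k = len(codebooks)
          # A. Assign cases to nearest codebook vectors; record cluster
          #    means and counts (empty clusters keep their old vectors).
          dists = ((data[:, None, :] - codebooks[None, :, :]) ** 2).sum(axis=2)
          assign = dists.argmin(axis=1)
          counts = np.bincount(assign, minlength=k).astype(float)
          means = codebooks.copy()
          for j in range(k):
              if counts[j] > 0:
                  means[j] = data[assign == j].mean(axis=0)
          # B. Nadaraya-Watson smoothing in the GRID space, with the
          #    cluster counts as case weights.
          grid_d2 = ((grid_coords[:, None, :] - grid_coords[None, :, :]) ** 2).sum(axis=2)
          K = np.exp(-grid_d2 / (2.0 * radius ** 2))   # kernel K(j,n)
          w = K * counts[None, :]
          codebooks[:] = (w @ means) / (w.sum(axis=1, keepdims=True) + 1e-12)
          return codebooks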

    Since nonparametric regression is used in the batch SOM algorithm, various properties of nonparametric regression extend to SOMs. In particular, it is well known that the shape of the kernel function is not a crucial matter in nonparametric regression, hence it is not crucial in SOMs. On the other hand, the amount of smoothing used for nonparametric regression is crucial, hence the choice of the final neighborhood size in a SOM is crucial. Unfortunately, I am not aware of any systematic studies of methods for choosing the final neighborhood size.

    The batch SOM algorithm is very similar to the principal curve and surface algorithm proposed by Hastie and Stuetzle (1989), as pointed out by Ritter, Martinetz, and Schulten (1992) and Mulier and Cherkassky (1995). A principal curve is a nonlinear generalization of a principal component. Given the probability distribution of a population, a principal curve is defined by the following self-consistency condition:

  • If you choose any point on a principal curve,
  • then find the set of all the points in the input space that are closer to the chosen point than any other point on the curve,
  • and compute the expected value (mean) of that set of points with respect to the probability distribution, then
  • you end up with the same point on the curve that you chose originally.
    In a multivariate normal distribution, the principal curves are the same as the principal components. A principal surface is the obvious generalization from a curve to a surface. In a multivariate normal distribution, the principal surfaces are the subspaces spanned by any two principal components.

     A one-dimensional local-linear batch SOM can be used to estimate a principal curve by letting the number of codebook vectors approach infinity while choosing a kernel function of appropriate smoothness. A two-dimensional local-linear batch SOM can be used to estimate a principal surface by letting the number of both rows and columns in the grid approach infinity. This connection between SOMs and principal curves and surfaces shows that the choice of the number of codebook vectors is not crucial, provided the number is fairly large.

    Principal component analysis is a method of data compression, not a statistical model. However, there is a related method called "common factor analysis" that is often confused with principal component analysis but is indeed a statistical model. Common factor analysis posits that the relations among the observed variables can be explained by a smaller number of unobserved, "latent" variables. Tibshirani (1992) has proposed a latent-variable variant of principal curves, and latent-variable modifications of SOMs have been introduced by Utsugi (1996, 1997) and Bishop, Svens\'en, and Williams (1997).

    In a Kohonen SOM, as in VQ, it is necessary to reduce the learning rate during training to obtain convergence. Greg Heath has commented in this regard:

    I favor separate learning rates for each winning SOM node (or k-means cluster) in the form 1/(N_0i + N_i + 1), where N_i is the count of vectors that have caused node i to be a winner and N_0i is an initializing count that indicates the confidence in the initial weight vector assignment. The winning node expression is based on stochastic estimation convergence constraints and pseudo-Bayesian estimation of mean vectors. Kohonen derived a heuristic recursion relation for the "optimal" rate. To my surprise, when I solved the recursion relation I obtained the same above expression that I've been using for years.

    In addition, I have had success using the similar form (1/n)/(N_0j + N_j + (1/n)) for the n nodes in the shrinking updating-neighborhood. Before the final "winners-only" stage when neighbors are no longer updated, the number of updating neighbors eventually shrinks to n = 6 or 8 for hexagonal or rectangular neighborhoods, respectively.

    Kohonen's neighbor-update formula is more precise, replacing my constant fraction (1/n) with a node-pair-specific h_ij (h_ij < 1). However, as long as the initial neighborhood is sufficiently large, the shrinking rate is sufficiently slow, and the final winner-only stage is sufficiently long, the results should be relatively insensitive to the exact form of h_ij.
     
     

    Another advantage of batch SOMs is that there is no learning rate, so these complications evaporate.

    Kohonen (1995, p. VII) says that SOMs are not intended for pattern recognition but for clustering, visualization, and abstraction. Kohonen has used a "supervised SOM" (1995, pp. 160-161) that is similar to counterpropagation (Hecht-Nielsen 1990), but he seems to prefer LVQ (see below) for supervised classification. Many people continue to use SOMs for classification tasks, sometimes with surprisingly (I am tempted to say "inexplicably") good results (Cho, 1997).

  • LVQ: Learning Vector Quantization--competitive networks for supervised classification (Kohonen, 1988, 1995; Ripley, 1996). Each codebook vector is assigned to one of the target classes. Each class may have one or more codebook vectors. A case is classified by finding the nearest codebook vector and assigning the case to the class corresponding to the codebook vector. Hence LVQ is a kind of nearest-neighbor rule.

    Ordinary VQ methods, such as Kohonen's VQ or k-means, can easily be used for supervised classification. Simply count the number of training cases from each class assigned to each cluster, and divide by the total number of cases in the cluster to get the posterior probability. For a given case, output the class with the greatest posterior probability--i.e. the class that forms a majority in the nearest cluster. Such methods can provide universally consistent classifiers (Devroye et al., 1996) even when the codebook vectors are obtained by unsupervised algorithms. LVQ tries to improve on this approach by adapting the codebook vectors in a supervised way. There are several variants of LVQ--called LVQ1, OLVQ1, LVQ2, and LVQ3--based on heuristics. However, a smoothed version of LVQ can be trained as a feedforward network using a NRBFEQ architecture (see "How do MLPs compare with RBFs?") and optimizing any of the usual error functions; as the width of the RBFs goes to zero, the NRBFEQ network approaches an optimized LVQ network.
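
    As an illustration of the simple cluster-counting approach described above (this is not LVQ itself, and not code from any package; integer-coded class labels are assumed):

      import numpy as np

      def cluster_posteriors(data, labels, codebooks, n_classes):
          # Assign each training case to its nearest codebook vector.
          dists = ((data[:, None, :] - codebooks[None, :, :]) ** 2).sum(axis=2)
          assign = dists.argmin(axis=1)
          # Count the cases of each class assigned to each cluster.
          counts = np.zeros((len(codebooks), n_classes))
          for cluster, label in zip(assign, labels):
              counts[cluster, label] += 1
          totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)
          return counts / totals        # estimated posteriors per cluster

      def classify(case, codebooks, posteriors):
          # Output the majority class of the nearest cluster.
          nearest = np.argmin(np.sum((codebooks - case) ** 2, axis=1))
          return posteriors[nearest].argmax()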
     
     

    There are several other kinds of Kohonen networks described in Kohonen (1995), including:
  • DEC--Dynamically Expanding Context
  • LSM--Learning Subspace Method
  • ASSOM--Adaptive Subspace SOM
  • FASSOM--Feedback-controlled Adaptive Subspace SOM
  • Supervised SOM
  • LVQ-SOM
    More information on the error functions (or absence thereof) used by Kohonen VQ and SOM is provided under "What does unsupervised learning learn?"

    For more on-line information on Kohonen networks and other varieties of SOMs, see:

  • The web page of The Neural Networks Research Centre, Helsinki University of Technology, at http://nucleus.hut.fi/nnrc/
  • The SOM of articles from comp.ai.neural-nets at http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html
  • Akio Utsugi's web page on Bayesian SOMs at the National Institute of Bioscience and Human-Technology, Agency of Industrial Science and Technology, M.I.T.I., 1-1, Higashi, Tsukuba, Ibaraki, 305 Japan, at http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html
  • The GTM (generative topographic mapping) home page at the Neural Computing Research Group, Aston University, Birmingham, UK, at http://www.ncrg.aston.ac.uk/GTM/
  • Nenet SOM software at http://www.mbnet.fi/~phodju/nenet/nenet.html

    References:
    Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press, Inc.

    Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with k-means clustering", Psychometrika, 59, 509-525.

    Bishop, C.M., Svens\'en, M., and Williams, C.K.I (1997), "GTM: A principled alternative to the self-organizing map," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 354-360. Also see http://www.ncrg.aston.ac.uk/GTM/

    Cho, S.-B. (1997), "Self-organizing map with dynamical node-splitting: Application to handwritten digit recognition," Neural Computation, 9, 1345-1355.

    Desieno, D. (1988), "Adding a conscience to competitive learning," Proc. Int. Conf. on Neural Networks, I, 117-124, IEEE Press.

    Devroye, L., Gy\"orfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer.

    Forgy, E.W. (1965), "Cluster analysis of multivariate data: Efficiency versus interpretability," Biometric Society Meetings, Riverside, CA. Abstract in Biometrics, 21, 768.

    Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers.

    Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley.

    Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A k-means clustering algorithm," Applied Statistics, 28, 100-108.

    Hastie, T., and Stuetzle, W. (1989), "Principal curves," Journal of the American Statistical Association, 84, 502-516.

    Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.

    Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering utilizing hybrid search strategies," Pattern Recognition, 22, 75-89.

    Kohonen, T. (1984), Self-Organization and Associative Memory, Berlin: Springer-Verlag.

    Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1 (suppl 1), 303.

    Kohonen, T. (1995/1997), Self-Organizing Maps, Berlin: Springer-Verlag. First edition was 1995, second edition 1997. See http://www.cis.hut.fi/nnrc/new_book.html for information on the second edition.

    Kosko, B. (1992), Neural Networks and Fuzzy Systems, Englewood Cliffs, N.J.: Prentice-Hall.

    Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector quantizer design," IEEE Transactions on Communications, 28, 84-95.

    Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions on Information Theory, 28, 129-137.

    MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.

    Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on Information Theory, 6, 7-12.

    Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an iterative kernel smoothing process," Neural Computation, 7, 1165-1177.

    Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Ritter, H., Martinetz, T., and Schulten, K. (1992), Neural Computation and Self-Organizing Maps: An Introduction, Reading, MA: Addison-Wesley.

    SAS Institute (1989), SAS/STAT User's Guide, Version 6, 4th edition, Cary, NC: SAS Institute.

    Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, 37, 35-43.

    Tibshirani, R. (1992), "Principal curves revisited," Statistics and Computing, 2, 183-190.

    Utsugi, A. (1996), "Topology selection for self-organizing maps," Network: Computation in Neural Systems, 7, 727-740, available on-line at http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html

    Utsugi, A. (1997), "Hyperparameter selection for self-organizing maps," Neural Computation, 9, 623-635, available on-line at http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html

    Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman & Hall.

    Zador, P.L. (1982), "Asymptotic quantization error of continuous signals and the quantization dimension," IEEE Transactions on Information Theory, 28, 139-149.

    Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector quantizer design by stochastic relaxation," IEEE Transactions on Signal Processing, 40, 310-322.

    ------------------------------------------------------------------------

    Subject: How are layers counted?

    How to count layers is a matter of considerable dispute.
  • Some people count layers of units. But of these people, some count the input layer and some don't.
  • Some people count layers of weights. But I have no idea how they count skip-layer connections.

    To avoid ambiguity, you should speak of a 2-hidden-layer network, not a 4-layer network (as some would call it) or 3-layer network (as others would call it). And if the connections follow any pattern other than fully connecting each layer to the next and to no others, you should carefully specify the connections.
    ------------------------------------------------------------------------

    Subject: What are cases and variables?

    A vector of values presented at one time to all the input units of a neural network is called a "case", "example", "pattern", "sample", etc. The term "case" will be used in this FAQ because it is widely recognized, unambiguous, and requires less typing than the other terms. A case may include not only input values, but also target values and possibly other information.

    A vector of values presented at different times to a single input unit is often called an "input variable" or "feature". To a statistician, it is a "predictor", "regressor", "covariate", "independent variable", "explanatory variable", etc. A vector of target values associated with a given output unit of the network during training will be called a "target variable" in this FAQ. To a statistician, it is usually a "response" or "dependent variable".

    A "data set" is a matrix containing one or (usually) more cases. In this FAQ, it will be assumed that cases are rows of the matrix, while variables are columns.

    Note that the often-used term "input vector" is ambiguous; it can mean either an input case or an input variable.
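
    For illustration only (this tiny example is not part of any standard), a data set with four cases and three variables could be stored in C as a 4-by-3 array, with each row holding one case and each column one variable:

       /* A hypothetical data set: 4 cases (rows) by 3 variables (columns).
          The columns might be two input variables and one target variable. */
       double data[4][3] = {
          { 0.2, 1.7, 0.0 },   /* case 1 */
          { 1.5, 0.3, 1.0 },   /* case 2 */
          { 0.9, 2.2, 0.0 },   /* case 3 */
          { 2.1, 0.8, 1.0 }    /* case 4 */
       };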

    ------------------------------------------------------------------------

    Subject: What are the population, sample, training set, design set, validation set, and test set?

    There seems to be no term in the NN literature for the set of all cases that you want to be able to generalize to. Statisticians call this set the "population". Neither is there a consistent term in the NN literature for the set of cases that are available for training and evaluating an NN. Statisticians call this set the "sample". The sample is usually a subset of the population.

    (Neurobiologists mean something entirely different by "population," apparently some collection of neurons, but I have never found out the exact meaning. I am going to continue to use "population" in the statistical sense until NN researchers reach a consensus on some other terms for "population" and "sample"; I suspect this will never happen.)

    In NN methodology, the sample is often subdivided into "training", "validation", and "test" sets. The distinctions among these subsets are crucial, but the terms "validation" and "test" sets are often confused. There is no book in the NN literature more authoritative than Ripley (1996), from which the following definitions are taken (p.354):

    Training set:
    A set of examples used for learning, that is to fit the parameters [weights] of the classifier.
    Validation set:
    A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.
    Test set:
    A set of examples used only to assess the performance [generalization] of a fully-specified classifier.
    Bishop (1995), another indispensable reference on neural networks, provides the following explanation (p. 372):
    Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.
    The crucial point is that a test set, by definition, is never used to choose among two or more networks, so that the error on the test set provides an unbiased estimate of the generalization error (assuming that the test set is representative of the population, etc.). Any data set that is used to choose the best of two or more networks is, by definition, a validation set, and the error of the chosen network on the validation set is optimistically biased.

    There is a problem with the usual distinction between training and validation sets. Some training approaches, such as early stopping, require a validation set, so in a sense, the validation set is used for training. Other approaches, such as maximum likelihood, do not inherently require a validation set. So the "training" set for maximum likelihood might encompass both the "training" and "validation" sets for early stopping. Greg Heath has suggested the term "design" set be used for cases that are used solely to adjust the weights in a network, while "training" set be used to encompass both design and validation sets. There is considerable merit to this suggestion, but it has not yet been widely adopted.

    But things can get more complicated. Suppose you want to train nets with 5, 10, and 20 hidden units using maximum likelihood, and you want to train nets with 20 and 50 hidden units using early stopping. You also want to use a validation set to choose the best of these various networks. Should you use the same validation set for early stopping that you use for the final network choice, or should you use two separate validation sets? That is, you could divide the sample into 3 subsets, say A, B, C and proceed as follows:

  • Do maximum likelihood using A.
  • Do early stopping with A to adjust the weights and B to decide when to stop (this makes B a validation set).
  • Choose among all 3 nets trained by maximum likelihood and the 2 nets trained by early stopping based on the error computed on B (the validation set).
  • Estimate the generalization error of the chosen network using C (the test set).

    Or you could divide the sample into 4 subsets, say A, B, C, and D and proceed as follows:

  • Do maximum likelihood using A and B combined.
  • Do early stopping with A to adjust the weights and B to decide when to stop (this makes B a validation set with respect to early stopping).
  • Choose among all 3 nets trained by maximum likelihood and the 2 nets trained by early stopping based on the error computed on C (this makes C a second validation set).
  • Estimate the generalization error of the chosen network using D (the test set).

    Or, with the same 4 subsets, you could take a third approach:

  • Do maximum likelihood using A.
  • Choose among the 3 nets trained by maximum likelihood based on the error computed on B (the first validation set)
  • Do early stopping with A to adjust the weights and B (the first validation set) to decide when to stop.
  • Choose among the best net trained by maximum likelihood and the 2 nets trained by early stopping based on the error computed on C (the second validation set).
  • Estimate the generalization error of the chosen network using D (the test set).

    You could argue that the first approach is biased towards choosing a net trained by early stopping. Early stopping involves a choice among a potentially large number of networks, and therefore provides more opportunity for overfitting the validation set than does the choice among only 3 networks trained by maximum likelihood. Hence if you make the final choice of networks using the same validation set (B) that was used for early stopping, you give an unfair advantage to early stopping. If you are writing an article to compare various training methods, this bias could be a serious flaw. But if you are using NNs for some practical application, this bias might not matter at all, since you obtain an honest estimate of generalization error using C.

    You could also argue that the second and third approaches are too wasteful in their use of data. This objection could be important if your sample contains 100 cases, but will probably be of little concern if your sample contains 100,000,000 cases. For small samples, there are other methods that make more efficient use of data; see "What are cross-validation and bootstrapping?"
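
    To make the four-subset approach concrete, here is a minimal C sketch (not from any particular package; the 50/20/15/15 proportions, the fixed random seed, and the rand()-based shuffle are arbitrary choices for illustration) that randomly partitions the case indices of a sample into subsets A, B, C, and D:

       #include <stdio.h>
       #include <stdlib.h>

       #define N 1000   /* number of cases in the sample (hypothetical) */

       int main(void)
       {
          static int idx[N];
          int i, j, tmp;
          int nA = N/2, nB = N/5, nC = (N - N/2 - N/5)/2;

          for (i = 0; i < N; i++) idx[i] = i;

          srand(12345);                    /* Fisher-Yates shuffle so the */
          for (i = N-1; i > 0; i--) {      /* subsets are random samples  */
             j = rand() % (i+1);
             tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
          }

          /* idx[0..nA-1]            -> A (used to adjust the weights)
             idx[nA..nA+nB-1]        -> B (validation set for early stopping)
             idx[nA+nB..nA+nB+nC-1]  -> C (validation set for choosing a net)
             remaining cases         -> D (test set for generalization error) */
          printf("A: %d  B: %d  C: %d  D: %d cases\n",
                 nA, nB, nC, N - nA - nB - nC);
          return 0;
       }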

    References:

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    ------------------------------------------------------------------------

    Subject: How are NNs related to statistical methods?

    There is considerable overlap between the fields of neural networks and statistics. Statistics is concerned with data analysis. In neural network terminology, statistical inference means learning to generalize from noisy data. Some neural networks are not concerned with data analysis (e.g., those intended to model biological systems) and therefore have little to do with statistics. Some neural networks do not learn (e.g., Hopfield nets) and therefore have little to do with statistics. Some neural networks can learn successfully only from noise-free data (e.g., ART or the perceptron rule) and therefore would not be considered statistical methods. But most neural networks that can learn to generalize effectively from noisy data are similar or identical to statistical methods. For example:
  • Feedforward nets with no hidden layer (including functional-link neural nets and higher-order neural nets) are basically generalized linear models.
  • Feedforward nets with one hidden layer are closely related to projection pursuit regression.
  • Probabilistic neural nets are identical to kernel discriminant analysis.
  • Kohonen nets for adaptive vector quantization are very similar to k-means cluster analysis.
  • Kohonen self-organizing maps are discretized principal curves and surfaces.
  • Hebbian learning is closely related to principal component analysis.

    Some neural network areas that appear to have no close relatives in the existing statistical literature are:

  • Reinforcement learning (although this is treated in the operations research literature on Markov decision processes).
  • Stopped training (the purpose and effect of stopped training are similar to shrinkage estimation, but the method is quite different).

    Feedforward nets are a subset of the class of nonlinear regression and discrimination models. Statisticians have studied the properties of this general class but had not considered the specific case of feedforward neural nets before such networks were popularized in the neural network field. Still, many results from the statistical theory of nonlinear models apply directly to feedforward nets, and the methods that are commonly used for fitting nonlinear models, such as various Levenberg-Marquardt and conjugate gradient algorithms, can be used to train feedforward nets. The application of statistical theory to neural networks is explored in detail by Bishop (1995) and Ripley (1996). Several summary articles have also been published relating statistical models to neural networks, including Cheng and Titterington (1994), Kuan and White (1994), Ripley (1993, 1994), Sarle (1994), and several articles in Cherkassky, Friedman, and Wechsler (1994). Among the many statistical concepts important to neural nets is the bias/variance trade-off in nonparametric estimation, discussed by Geman, Bienenstock, and Doursat (1992). Some more advanced results of statistical theory applied to neural networks are given by White (1989a, 1989b, 1990, 1992a) and White and Gallant (1992), reprinted in White (1992b).

    While neural nets are often defined in terms of their algorithms or implementations, statistical methods are usually defined in terms of their results. The arithmetic mean, for example, can be computed by a (very simple) backprop net, by applying the usual formula SUM(x_i)/n, or by various other methods. What you get is still an arithmetic mean regardless of how you compute it. So a statistician would consider standard backprop, Quickprop, and Levenberg-Marquardt as different algorithms for implementing the same statistical model such as a feedforward net. On the other hand, different training criteria, such as least squares and cross entropy, are viewed by statisticians as fundamentally different estimation methods with different statistical properties.

    It is sometimes claimed that neural networks, unlike statistical models, require no distributional assumptions. In fact, neural networks involve exactly the same sort of distributional assumptions as statistical models (Bishop, 1995), but statisticians study the consequences and importance of these assumptions while many neural networkers ignore them. For example, least-squares training methods are widely used by statisticians and neural networkers. Statisticians realize that least-squares training involves implicit distributional assumptions in that least-squares estimates have certain optimality properties for noise that is normally distributed with equal variance for all training cases and that is independent between different cases. These optimality properties are consequences of the fact that least-squares estimation is maximum likelihood under those conditions. Similarly, cross-entropy is maximum likelihood for noise with a Bernoulli distribution. If you study the distributional assumptions, then you can recognize and deal with violations of the assumptions. For example, if you have normally distributed noise but some training cases have greater noise variance than others, then you may be able to use weighted least squares instead of ordinary least squares to obtain more efficient estimates.
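
    As a small numerical illustration of that last point (a sketch only; the data and noise variances below are made up, and the variances are assumed known), the following C fragment estimates a mean both by ordinary least squares and by weighted least squares with weights equal to the reciprocals of the noise variances, so that the noisier case has less influence:

       #include <stdio.h>

       int main(void)
       {
          /* hypothetical targets and their (assumed known) noise variances */
          double y[5]   = { 1.0, 1.2, 0.9, 5.0, 1.1 };
          double var[5] = { 1.0, 1.0, 1.0, 25.0, 1.0 };  /* 4th case is noisy */
          double sum = 0.0, wsum = 0.0, wysum = 0.0;
          int i;

          for (i = 0; i < 5; i++) {
             sum   += y[i];          /* ordinary least squares: equal weights */
             wsum  += 1.0/var[i];    /* weighted least squares: weight=1/var  */
             wysum += y[i]/var[i];
          }
          printf("OLS mean: %g   WLS mean: %g\n", sum/5.0, wysum/wsum);
          return 0;
       }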

    Hundreds, perhaps thousands of people have run comparisons of neural nets with "traditional statistics" (whatever that means). Most such studies involve one or two data sets, and are of little use to anyone else unless they happen to be analyzing the same kind of data. But there is an impressive comparative study of supervised classification by Michie, Spiegelhalter, and Taylor (1994), which not only compares many classification methods on many data sets, but also provides unusually extensive analyses of the results. Another useful study on supervised classification by Lim, Loh, and Shih (1999) is available on-line. There is an excellent comparison of unsupervised Kohonen networks and k-means clustering by Balakrishnan, Cooper, Jacob, and Lewis (1994).

    There are many methods in the statistical literature that can be used for flexible nonlinear modeling. These methods include:

  • Polynomial regression (Eubank, 1999)
  • Fourier series regression (Eubank, 1999; Haerdle, 1990)
  • Wavelet smoothing (Donoho and Johnstone, 1995; Donoho, Johnstone, Kerkyacharian, and Picard, 1995)
  • K-nearest neighbor regression and discriminant analysis (Haerdle, 1990; Hand, 1981, 1997; Ripley, 1996)
  • Kernel regression and discriminant analysis (Eubank, 1999; Haerdle, 1990; Hand, 1981, 1982, 1997; Ripley, 1996)
  • Local polynomial smoothing (Eubank, 1999; Wand and Jones, 1995; Fan and Gijbels, 1995)
  • LOESS (Cleveland and Grosse, 1991)
  • Smoothing splines (such as thin-plate splines) (Eubank, 1999; Wahba, 1990; Green and Silverman, 1994; Haerdle, 1990)
  • B-splines (Eubank, 1999)
  • Tree-based models (CART, AID, etc.) (Haerdle, 1990; Lim, Loh, and Shih, 1999; Hand, 1997; Ripley, 1996)
  • Multivariate adaptive regression splines (MARS) (Friedman, 1991)
  • Projection pursuit (Friedman and Stuetzle, 1981; Haerdle, 1990; Ripley, 1996)
  • Various Bayesian methods (Dey, 1998)
  • GMDH (Farlow, 1984)

    Communication between statisticians and neural net researchers is often hindered by the different terminology used in the two fields. There is a comparison of neural net and statistical jargon in ftp://ftp.sas.com/pub/neural/jargon

    For free statistical software, see the StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon University.

    There are zillions of introductory textbooks on statistics. One of the better ones is Moore and McCabe (1989). At an intermediate level, the books on linear regression by Weisberg (1985) and Myers (1986), on logistic regression by Hosmer and Lemeshow (1989), and on discriminant analysis by Hand (1981) can be recommended. At a more advanced level, the book on generalized linear models by McCullagh and Nelder (1989) is an essential reference, and the book on nonlinear regression by Gallant (1987) has much material relevant to neural nets.

    Several introductory statistics texts are available on the web:

  • David Lane, HyperStat, at http://www.ruf.rice.edu/~lane/hyperstat/contents.html
  • Jan de Leeuw (ed.), Statistics: The Study of Stability in Variation, at http://www.stat.ucla.edu/textbook/
  • StatSoft, Inc., Electronic Statistics Textbook, at http://www.statsoft.com/textbook/stathome.html
  • David Stockburger, Introductory Statistics: Concepts, Models, and Applications, at http://www.psychstat.smsu.edu/sbk00.htm
  • University of Newcastle (Australia) Statistics Department, SurfStat Australia, at http://surfstat.newcastle.edu.au/surfstat/

    References:

    Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with k-means clustering", Psychometrika, 59, 509-525.

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review from a Statistical Perspective", Statistical Science, 9, 2-54.

    Cherkassky, V., Friedman, J.H., and Wechsler, H., eds. (1994), From Statistics to Neural Networks: Theory and Pattern Recognition Applications, Berlin: Springer-Verlag.

    Cleveland, W.S., and Grosse, E. (1991), "Computational Methods for Local Regression," Statistics and Computing, 1, 47-62.

    Dey, D., ed. (1998) Practical Nonparametric and Semiparametric Bayesian Statistics, Springer Verlag.

    Donoho, D.L., and Johnstone, I.M. (1995), "Adapting to unknown smoothness via wavelet shrinkage," J. of the American Statistical Association, 90, 1200-1224.

    Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1995), "Wavelet shrinkage: asymptopia (with discussion)?" J. of the Royal Statistical Society, Series B, 57, 301-369.

    Eubank, R.L. (1999), Nonparametric Regression and Spline Smoothing, 2nd ed., Marcel Dekker, ISBN 0-8247-9337-4.

    Fan, J., and Gijbels, I. (1995), "Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation," J. of the Royal Statistical Society, Series B, 57, 371-394.

    Farlow, S.J. (1984), Self-organizing Methods in Modeling: GMDH Type Algorithms, NY: Marcel Dekker. (GMDH)

    Friedman, J.H. (1991), "Multivariate adaptive regression splines", Annals of Statistics, 19, 1-141. (MARS)

    Friedman, J.H. and Stuetzle, W. (1981) "Projection pursuit regression," J. of the American Statistical Association, 76, 817-823.

    Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.

    Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.

    Green, P.J., and Silverman, B.W. (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, London: Chapman & Hall.

    Haerdle, W. (1990), Applied Nonparametric Regression, Cambridge Univ. Press.

    Hand, D.J. (1981) Discrimination and Classification, NY: Wiley.

    Hand, D.J. (1982) Kernel Discriminant Analysis, Research Studies Press.

    Hand, D.J. (1997) Construction and Assessment of Classification Rules, NY: Wiley.

    Hill, T., Marquez, L., O'Connor, M., and Remus, W. (1994), "Artificial neural network models for forecasting and decision making," International J. of Forecasting, 10, 5-15.

    Kuan, C.-M. and White, H. (1994), "Artificial Neural Networks: An Econometric Perspective", Econometric Reviews, 13, 1-91.

    Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag.

    Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. (1999?), "A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms," Machine Learning, forthcoming, http://www.stat.wisc.edu/~limt/mach1317.pdf or http://www.stat.wisc.edu/~limt/mach1317.ps; Appendix containing complete tables of error rates, ranks, and training times, http://www.stat.wisc.edu/~limt/appendix.pdf or http://www.stat.wisc.edu/~limt/appendix.ps

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.

    Michie, D., Spiegelhalter, D.J. and Taylor, C.C., eds. (1994), Machine Learning, Neural and Statistical Classification, NY: Ellis Horwood; this book is out of print but available online at http://www.amsta.leeds.ac.uk/~charles/statlog/

    Moore, D.S., and McCabe, G.P. (1989), Introduction to the Practice of Statistics, NY: W.H. Freeman.

    Myers, R.H. (1986), Classical and Modern Regression with Applications, Boston: Duxbury Press.

    Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., Networks and Chaos: Statistical and Probabilistic Aspects, Chapman & Hall. ISBN 0 412 46530 2.

    Ripley, B.D. (1994), "Neural Networks and Related Methods for Classification," Journal of the Royal Statistical Society, Series B, 56, 409-456.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Sarle, W.S. (1994), "Neural Networks and Statistical Models," Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute, pp 1538-1550. ( ftp://ftp.sas.com/pub/neural/neural1.ps)

    Wahba, G. (1990), Spline Models for Observational Data, SIAM.

    Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman & Hall.

    Weisberg, S. (1985), Applied Linear Regression, NY: Wiley

    White, H. (1989a), "Learning in Artificial Neural Networks: A Statistical Perspective," Neural Computation, 1, 425-464.

    White, H. (1989b), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013.

    White, H. (1990), "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3, 535-550.

    White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," in Page, C. and Le Page, R. (eds.), Computing Science and Statistics.

    White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks, 5, 129-138.

    White, H. (1992b), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.

    ------------------------------------------------------------------------
    Next part is part 2 (of 7).
    
    
    

    Subject: What is backprop?

    "Backprop" is short for "backpropagation of error". The term backpropagation causes much confusion. Strictly speaking, backpropagation refers to the method for computing the error gradient for a feedforward network, a straightforward but elegant application of the chain rule of elementary calculus (Werbos 1974/1994). By extension, backpropagation or backprop refers to a training method that uses backpropagation to compute the gradient. By further extension, a backprop network is a feedforward network trained by backpropagation.

    "Standard backprop" is a euphemism for the generalized delta rule, the training algorithm that was popularized by Rumelhart, Hinton, and Williams in chapter 8 of Rumelhart and McClelland (1986), which remains the most widely used supervised training method for neural nets. The generalized delta rule (including momentum) is called the "heavy ball method" in the numerical analysis literature (Polyak 1964, 1987; Bertsekas 1995, 78-79).

    Standard backprop can be used for incremental (on-line) training (in which the weights are updated after processing each case) but it does not converge to a stationary point of the error surface. To obtain convergence, the learning rate must be slowly reduced. This methodology is called "stochastic approximation."

    The convergence properties of standard backprop, stochastic approximation, and related methods, including both batch and incremental algorithms, are discussed clearly and thoroughly by Bertsekas and Tsitsiklis (1996). For a practical discussion of backprop training in MLPs, Reed and Marks (1999) is the best reference I've seen.

    For batch processing, there is no reason to suffer through the slow convergence and the tedious tuning of learning rates and momenta of standard backprop. Much of the NN research literature is devoted to attempts to speed up backprop. Most of these methods are inconsequential; two that are effective are Quickprop (Fahlman 1989) and RPROP (Riedmiller and Braun 1993). Concise descriptions of these algorithms are given by Schiffmann, Joost, and Werner (1994) and Reed and Marks (1999). But conventional methods for nonlinear optimization are usually faster and more reliable than any of the "props". See "What are conjugate gradients, Levenberg-Marquardt, etc.?".

    For more on-line info on backprop see Donald Tveter's Backpropagator's Review at http://www.dontveter.com/bpr/bpr.html or http://gannoo.uce.ac.uk/bpr/bpr.html.

    References on backprop:

    Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-14-0.

    Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.

    Polyak, B.T. (1964), "Some methods of speeding up the convergence of iteration methods," Z. Vycisl. Mat. i Mat. Fiz., 4, 1-17. 

    Polyak, B.T. (1987), Introduction to Optimization, NY: Optimization Software, Inc.

    Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.

    Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986), "Learning internal representations by error propagation", in Rumelhart, D.E. and McClelland, J. L., eds. (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1, 318-362, Cambridge, MA: The MIT Press.

    Werbos, P.J. (1974/1994), The Roots of Backpropagation, NY: John Wiley & Sons. Includes Werbos's 1974 Harvard Ph.D. thesis, Beyond Regression.

    References on stochastic approximation:
    Robbins, H. & Monro, S. (1951), "A Stochastic Approximation Method", Annals of Mathematical Statistics, 22, 400-407.

    Kiefer, J. & Wolfowitz, J. (1952), "Stochastic Estimation of the Maximum of a Regression Function," Annals of Mathematical Statistics, 23, 462-466.

    Kushner, H.J., and Yin, G. (1997), Stochastic Approximation Algorithms and Applications, NY: Springer-Verlag.

    Kushner, H.J., and Clark, D. (1978), Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag.

    White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013.

    References on better props:
    Fahlman, S.E. (1989), "Faster-Learning Variations on Back-Propagation: An Empirical Study", in Touretzky, D., Hinton, G, and Sejnowski, T., eds., Proceedings of the 1988 Connectionist Models Summer School, Morgan Kaufmann, 38-51.

    Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.

    Riedmiller, M. and Braun, H. (1993), "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm", Proceedings of the IEEE International Conference on Neural Networks 1993, San Francisco: IEEE.

    Schiffmann, W., Joost, M., and Werner, R. (1994), "Optimization of the Backpropagation Algorithm for Training Multilayer Perceptrons," ftp://archive.cis.ohio-state.edu/pub/neuroprose/schiff.bp_speedup.ps.Z

    ------------------------------------------------------------------------

    Subject: What learning rate should be used for backprop?

    In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate makes the weights and error function diverge, so there is no learning at all. If the error function is quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and Tsitsiklis, 1996). If the error function has many local and global optima, as in typical feedforward NNs with hidden units, the optimal learning rate often changes dramatically during the training process, since the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a tedious process requiring much trial and error.

    With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist (see Quickprop and RPROP under "What is backprop?" and the numerous training algorithms mentioned under "What are conjugate gradients, Levenberg-Marquardt, etc.?").

    Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a large step size; this happens when you initialize a network with small random weights. In other regions of the weight space, the gradient is small and you need a small step size; this happens when you are close to a local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a good step size.
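
    To illustrate why using only the sign of the gradient avoids this problem, here is a simplified sign-based update in C. It is written in the spirit of RPROP but is not the published algorithm (see Riedmiller and Braun 1993 for the real thing); the step-size factors 1.2 and 0.5 and the variable names are arbitrary choices for this sketch:

       /* Simplified sign-based step-size adaptation (RPROP-like sketch).
          Only the sign of the gradient is used, so the size of the weight
          change does not depend on the magnitude of the gradient. */
       void sign_based_update(double w[], double grad[], double oldgrad[],
                              double step[], int nweights)
       {
          int i;
          for (i = 0; i < nweights; i++) {
             if (grad[i] * oldgrad[i] > 0.0)        /* same sign: grow step   */
                step[i] *= 1.2;
             else if (grad[i] * oldgrad[i] < 0.0)   /* sign flip: shrink step */
                step[i] *= 0.5;
             if (grad[i] > 0.0)      w[i] -= step[i];
             else if (grad[i] < 0.0) w[i] += step[i];
             oldgrad[i] = grad[i];
          }
       }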

    With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the learning rate during training. Various proposals have appeared in the NN literature, but most of them don't work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who unfortunately do not offer a solution. Some promising results are provided by LeCun, Simard, and Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate. There is also a variant of stochastic approximation called "iterate averaging" or "Polyak averaging" (Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running average of the weight values. I have no personal experience with these methods; if you have any solid evidence that these or other methods of automatically setting the learning rate and/or momentum in incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer (saswss@unx.sas.com).

    References:

    Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.

    Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient search," in Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural Information Processing Systems 4, pp. 1009-1016.

    Kushner, H.J., and Yin, G. (1997), Stochastic Approximation Algorithms and Applications, NY: Springer-Verlag.

    LeCun, Y., Simard, P.Y., and Pearlmutter, B. (1993), "Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors," in Hanson, S.J., Cowan, J.D., and Giles, C.L. (eds.), Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann, pp. 156-163.

    Orr, G.B. and Leen, T.K. (1997), "Using curvature information for fast stochastic search," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 606-612.

    ------------------------------------------------------------------------

    Subject: What are conjugate gradients, Levenberg-Marquardt, etc.?

    Training a neural network is, in most cases, an exercise in numerical optimization of a usually nonlinear objective function ("objective function" means whatever function you are trying to optimize and is a slightly more general term than "error function" in that it may include other quantities such as penalties for weight decay). Methods of nonlinear optimization have been studied for hundreds of years, and there is a huge literature on the subject in fields such as numerical analysis, operations research, and statistical computing, e.g., Bertsekas (1995), Bertsekas and Tsitsiklis (1996), Gill, Murray, and Wright (1981). Masters (1995) has a good elementary discussion of conjugate gradient and Levenberg-Marquardt algorithms in the context of NNs.

    There is no single best method for nonlinear optimization. You need to choose a method based on the characteristics of the problem to be solved. For objective functions with continuous second derivatives (which would include feedforward nets with the most popular differentiable activation functions and error functions), three general types of algorithms have been found to be effective for most practical purposes:

  • For a small number of weights, stabilized Newton and Gauss-Newton algorithms, including various Levenberg-Marquardt and trust-region algorithms, are efficient.
  • For a moderate number of weights, various quasi-Newton algorithms are efficient.
  • For a large number of weights, various conjugate-gradient algorithms are efficient.

    Additional variations on the above methods, such as limited-memory quasi-Newton and double dogleg, can be found in textbooks such as Bertsekas (1995). Objective functions that are not continuously differentiable are more difficult to optimize. For continuous objective functions that lack derivatives on certain manifolds, such as ramp activation functions (which lack derivatives at the top and bottom of the ramp) and the least-absolute-value error function (which lacks derivatives for cases with zero error), subgradient methods can be used. For objective functions with discontinuities, such as threshold activation functions and the misclassification-count error function, the Nelder-Mead simplex algorithm and various secant methods can be used. However, these methods may be very slow for large networks, and it is better to use continuously differentiable objective functions when possible.

    All of the above methods find local optima--they are not guaranteed to find a global optimum. In practice, Levenberg-Marquardt often finds better optima for a variety of problems than do the other usual methods. I know of no theoretical explanation for this empirical finding.

    For global optimization, there are also a variety of approaches. You can simply run any of the local optimization methods from numerous random starting points. Or you can try more complicated methods designed for global optimization such as simulated annealing or genetic algorithms (see Reeves 1993 and "What about Genetic Algorithms and Evolutionary Computation?"). Global optimization for neural nets is especially difficult because the number of distinct local optima can be astronomical.

    Another important consideration in the choice of optimization algorithms is that neural nets are often ill-conditioned (Saarinen, Bramley, and Cybenko 1993), especially when there are many hidden units. Algorithms that use only first-order information, such as steepest descent and standard backprop, are notoriously slow for ill-conditioned problems. Generally speaking, the more use an algorithm makes of second-order information, the better it will behave under ill-conditioning. The following methods are listed in order of increasing use of second-order information: steepest descent, conjugate gradients, quasi-Newton, Gauss-Newton, Newton-Raphson. Unfortunately, the methods that are better for severe ill-conditioning are the methods that are preferable for a small number of weights, and the methods that are preferable for a large number of weights are not as good at handling severe ill-conditioning. Therefore for networks with many hidden units, it is advisable to try to alleviate ill-conditioning by standardizing input and target variables, choosing initial values from a reasonable range, and using weight decay or Bayesian regularization methods.

    Although ill-conditioning is an important consideration, its effect is exaggerated by Saarinen, Bramley, and Cybenko (1993) for several reasons: they do not center the input variables, they use no bias for the output unit, and for some examples they choose initial values from a ridiculously wide range of (-100,100). Their conclusion that neural nets "can cause undue strain on any numerical scheme which uses Jacobians" is unduly pessimistic because the numerical analysis results they cite are concerned with the accuracy of computing the optimal weights, which is inherently difficult for ill-conditioned problems. But in neural nets, what matters is the accuracy of the outputs (and ultimately of generalization), not accuracy of the weights. It is often possible to obtain accurate outputs without accurate weights, since the basic meaning of "ill-conditioned" is that large changes in the weights may have little effect on the objective function. In fact, neural net researchers often consider ill-conditioning (with respect to the unregularized error function) to be a virtue when it comes in the form of "fault tolerance".

    For a survey of optimization software, see More\' and Wright (1993). For more on-line information on numerical optimization see:

  • The kangaroos, a nontechnical description of various optimization methods, at ftp://ftp.sas.com/pub/neural/kangaroos.
  • The Netlib repository, http://www.netlib.org/, containing freely available software, documents, and databases of interest to the numerical and scientific computing community.
  • The linear and nonlinear programming FAQs at http://www.mcs.anl.gov/home/otc/Guide/faq/
  • Arnold Neumaier's page on global optimization at http://solon.cma.univie.ac.at/~neum/glopt.html
  • Simon Streltsov's page on global optimization at http://cad.bu.edu/go
  • Lester Ingber's page on Adaptive Simulated Annealing (ASA), karate, etc. at http://www.ingber.com/ or http://www.alumni.caltech.edu/~ingber/ 
  • Jonathan Shewchuk's paper on conjugate gradients, "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain," at http://www.cs.cmu.edu/~jrs/jrspapers.html 

    References:

    Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-14-0.

    Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.

    Fletcher, R. (1987) Practical Methods of Optimization, NY: Wiley.

    Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization, Academic Press: London.

    Levenberg, K. (1944) "A method for the solution of certain problems in least squares," Quart. Appl. Math., 2, 164-168.

    Marquardt, D. (1963) "An algorithm for least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., 11, 431-441. This is the third most frequently cited paper in all the mathematical sciences.

    Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0

    More\', J.J. (1977) "The Levenberg-Marquardt algorithm: implementation and theory," in Watson, G.A., ed., Numerical Analysis, Lecture Notes in Mathematics 630, Springer-Verlag, Heidelberg, 105-116.

    More\', J.J. and Wright, S.J. (1993), Optimization Software Guide, Philadelphia: SIAM, ISBN 0-89871-322-6.

    Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.

    Reeves, C.R., ed. (1993) Modern Heuristic Techniques for Combinatorial Problems, NY: Wiley.

    Rinnooy Kan, A.H.G., and Timmer, G.T., (1989) Global Optimization: A Survey, International Series of Numerical Mathematics, vol. 87, Basel: Birkhauser Verlag.

    Saarinen, S., Bramley, R., and Cybenko, G. (1993), "Ill-conditioning in neural network training problems," SIAM J. of Scientific Computing, 14, 693-714.

    ------------------------------------------------------------------------

    Subject: How should categories be coded?

    First, consider unordered categories. If you want to classify cases into one of C categories (i.e. you have a categorical target variable), use 1-of-C coding. That means that you code C binary (0/1) target variables corresponding to the C categories. Statisticians call these "dummy" variables. Each dummy variable is given the value zero except for the one corresponding to the correct category, which is given the value one. Then use a softmax output activation function (see "What is a softmax activation function?") so that the net, if properly trained, will produce valid posterior probability estimates (McCullagh and Nelder, 1989; Finke and M\"uller, 1994). If the categories are Red, Green, and Blue, then the data would look like this:
       Category  Dummy variables
       --------  ---------------
        Red        1   0   0
        Green      0   1   0
        Blue       0   0   1
    When there are only two categories, it is simpler to use just one dummy variable with a logistic output activation function; this is equivalent to using softmax with two dummy variables.
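
    Generating the dummy variables in code is trivial. Here is a minimal C sketch (the function name and the numbering of the categories from 0 to C-1 are arbitrary conventions for this example):

       /* Fill dummy[0..c-1] with 1-of-C coding for a category numbered 0..c-1.
          For the Red/Green/Blue example, c=3 with Red=0, Green=1, Blue=2. */
       void one_of_c(int category, int c, double dummy[])
       {
          int i;
          for (i = 0; i < c; i++)
             dummy[i] = (i == category) ? 1.0 : 0.0;
       }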

    The common practice of using target values of .1 and .9 instead of 0 and 1 prevents the outputs of the network from being directly interpretable as posterior probabilities, although it is easy to rescale the outputs to produce probabilities (Hampshire and Pearlmutter, 1991, figure 3). This practice has also been advocated on the grounds that infinite weights are required to obtain outputs of 0 or 1 from a logistic function, but in fact, weights of about 10 to 30 will produce outputs close enough to 0 and 1 for all practical purposes, assuming standardized inputs. Large weights will not cause overflow if the activation functions are coded properly; see How to avoid overflow in the logistic function?

    Another common practice is to use a logistic activation function for each output. Thus, the outputs are not constrained to sum to one, so they are not valid posterior probability estimates. The usual justification advanced for this procedure is that if a test case is not similar to any of the training cases, all of the outputs will be small, indicating that the case cannot be classified reliably. This claim is incorrect, since a test case that is not similar to any of the training cases will require the net to extrapolate, and extrapolation is thoroughly unreliable; such a test case may produce all small outputs, all large outputs, or any combination of large and small outputs. If you want a classification method that detects novel cases for which the classification may not be reliable, you need a method based on probability density estimation. For example, see "What is PNN?".

    It is very important not to use a single variable for an unordered categorical target. Suppose you used a single variable with values 1, 2, and 3 for red, green, and blue, and the training data with two inputs looked like this:

          |    1    1
          |   1   1
          |       1   1
          |     1   1
          | 
          |      X
          | 
          |    3   3           2   2
          |     3     3      2
          |  3   3            2    2
          |     3   3       2    2
          | 
          +----------------------------
    Consider a test point located at the X. The correct output would be that X has about a 50-50 chance of being a 1 or a 3. But if you train with a single target variable with values of 1, 2, and 3, the output for X will be the average of 1 and 3, so the net will say that X is definitely a 2!

    If you are willing to forego the simple posterior-probability interpretation of outputs, you can try more elaborate coding schemes, such as the error-correcting output codes suggested by Dietterich and Bakiri (1995).

    For an input with categorical values, you can use 1-of-(C-1) coding if the network has a bias unit. This is just like 1-of-C coding, except that you omit one of the dummy variables (doesn't much matter which one). Using all C of the dummy variables creates a linear dependency on the bias unit, which is not advisable unless you are using weight decay or Bayesian learning or some such thing that requires all C weights to be treated on an equal basis. 1-of-(C-1) coding looks like this:

       Category  Dummy variables
       --------  ---------------
        Red        1   0
        Green      0   1
        Blue       0   0
    Another possible coding is called "effects" coding or "deviations from means" coding in statistics. It is like 1-of-(C-1) coding, except that when a case belongs to the category for the omitted dummy variable, all of the dummy variables are set to -1, like this:
       Category  Dummy variables
       --------  ---------------
        Red        1   0
        Green      0   1
        Blue      -1  -1
    As long as a bias unit is used, any network with effects coding can be transformed into an equivalent network with 1-of-(C-1) coding by a linear transformation of the weights, so if you train to a global optimum, there will be no difference in the outputs for these two types of coding. One advantage of effects coding is that the dummy variables require no standardizing (see "Should I normalize/standardize/rescale the data?").

    If you are using weight decay, you want to make sure that shrinking the weights toward zero biases ('bias' in the statistical sense) the net in a sensible, usually smooth, way. If you use 1-of-(C-1) coding for an input, weight decay biases the output for the C-1 categories towards the output for the 1 omitted category, which is probably not what you want, although there might be special cases where it would make sense. If you use 1-of-C coding for an input, weight decay biases the output for all C categories roughly towards the mean output for all the categories, which is smoother and usually a reasonable thing to do.

    Now consider ordered categories. For inputs, some people recommend a "thermometer code" (Smith, 1993; Masters, 1993) like this:

       Category  Dummy variables
       --------  ---------------
        Red        1   1   1
        Green      0   1   1
        Blue       0   0   1
    However, thermometer coding is equivalent to 1-of-C coding, in that for any network using 1-of-C coding, there exists a network with thermometer coding that produces identical outputs; the weights in the thermometer-coded network are just the differences of successive weights in the 1-of-C-coded network. To get a genuinely ordinal representation, you must constrain the weights connecting the dummy variables to the hidden units to be nonnegative (except for the first dummy variable). Another approach that makes some use of the order information is to use weight decay or Bayesian learning to encourage the weights for all but the first dummy variable to be small.

    It is often effective to represent an ordinal input as a single variable like this:

       Category  Input
       --------  -----
        Red        1
        Green      2
        Blue       3
    Although this representation involves only a single quantitative input, given enough hidden units, the net is capable of computing nonlinear transformations of that input that will produce results equivalent to any of the dummy coding schemes. But using a single quantitative input makes it easier for the net to use the order of the categories to generalize when that is appropriate.

    B-splines provide a way of coding ordinal inputs into fewer than C variables while retaining information about the order of the categories. See Brown and Harris (1994) or Gifi (1990, 365-370).

    Target variables with ordered categories require thermometer coding. The outputs are thus cumulative probabilities, so to obtain the posterior probability of any category except the first, you must take the difference between successive outputs. It is often useful to use a proportional-odds model, which ensures that these differences are positive. For more details on ordered categorical targets, see McCullagh and Nelder (1989, chapter 5).
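
    As a sketch of that differencing step (not from any particular package; out[j] is assumed to estimate the cumulative probability that the category is at most j+1 under the thermometer coding shown above):

       /* Convert thermometer-coded (cumulative) outputs to category
          probabilities by differencing successive outputs. */
       void cumulative_to_probs(double out[], double prob[], int c)
       {
          int j;
          prob[0] = out[0];
          for (j = 1; j < c; j++)
             prob[j] = out[j] - out[j-1];  /* a proportional-odds model helps
                                              ensure these are nonnegative */
       }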

    References:

    Brown, M., and Harris, C. (1994), Neurofuzzy Adaptive Modelling and Control, NY: Prentice Hall.

    Dietterich, T.G. and Bakiri, G. (1995), "Error-correcting output codes: A general method for improving multiclass inductive learning programs," in Wolpert, D.H. (ed.), The Mathematics of Generalization: The Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, Santa Fe Institute Studies in the Sciences of Complexity, Volume XX, Reading, MA: Addison-Wesley, pp. 395-407.

    Finke, M. and M\"uller, K.-R. (1994), "Estimating a-posteriori probabilities using stochastic network models," in Mozer, M., Smolensky, P., Touretzky, D., Elman, J., and Weigend, A. (eds.), Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 324-331.

    Gifi, A. (1990), Nonlinear Multivariate Analysis, NY: John Wiley & Sons, ISBN 0-471-92620-5.

    Hampshire II, J.B., and Pearlmutter, B. (1991), "Equivalence proofs for multi-layer perceptron classifiers and the Bayesian discriminant function," in Touretzky, D.S., Elman, J.L., Sejnowski, T.J., and Hinton, G.E. (eds.), Connectionist Models: Proceedings of the 1990 Summer School, San Mateo, CA: Morgan Kaufmann, pp.159-172.

    Masters, T. (1993). Practical Neural Network Recipes in C++, San Diego: Academic Press.

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.

    Smith, M. (1993). Neural Networks for Statistical Modeling, NY: Van Nostrand Reinhold.

    ------------------------------------------------------------------------

    Subject: Why use a bias/threshold?

    Sigmoid hidden and output units usually use a "bias" or "threshold" value in computing the net input to the unit. For a linear output unit, a bias value is equivalent to an intercept in a regression model.

    A bias value can be treated as a connection weight from an input called a "bias unit" with a constant value of one. The single bias unit is connected to every hidden or output unit that needs a bias value. Hence the bias values can be learned just like other weights.
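
    In code, this just means that the bias weight is added into the same weighted sum as everything else, as in this small sketch (the function and variable names are hypothetical):

       /* Net input to a unit: weighted sum of its n inputs plus the bias,
          where the bias acts as the weight from a "bias unit" fixed at 1. */
       double net_input(double x[], double w[], int n, double bias_weight)
       {
          double net = bias_weight;   /* bias unit has constant activation 1 */
          int i;
          for (i = 0; i < n; i++)
             net += w[i] * x[i];
          return net;
       }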

    Consider a multilayer perceptron with any of the usual sigmoid activation functions. Choose any hidden unit or output unit. Let's say there are N inputs to that unit, which define an N-dimensional space. The given unit draws a hyperplane through that space, producing an "on" output on one side and an "off" output on the other. (With sigmoid units the plane will not be sharp -- there will be some gray area of intermediate values near the separating plane -- but ignore this for now.)

    The weights determine where this hyperplane lies in the input space. Without a bias input, this separating hyperplane is constrained to pass through the origin of the space defined by the inputs. For some problems that's OK, but in many problems the hyperplane would be much more useful somewhere else. If you have many units in a layer, they share the same input space and without bias would ALL be constrained to pass through the origin.

    The "universal approximation" property of multilayer perceptrons with most commonly-used hidden-layer activation functions does not hold if you omit the bias units. But Hornik (1993) shows that a sufficient condition for the universal approximation property without biases is that no derivative of the activation function vanishes at the origin, which implies that with the usual sigmoid activation functions, a fixed nonzero bias can be used instead of a trainable bias.

    Typically, every hidden and output unit has its own bias value. The main exception to this is when the activations of two or more units in one layer always sum to a nonzero constant. For example, you might scale the inputs to sum to one (see Should I standardize the input cases?), or you might use a normalized RBF function in the hidden layer (see How do MLPs compare with RBFs?). If there do exist units in one layer whose activations sum to a nonzero constant, then any subsequent layer does not need bias values if it receives connections from the units that sum to a constant, since using bias values in the subsequent layer would create linear dependencies.

    If you have a large number of hidden units, it may happen that one or more hidden units "saturate" as a result of having large incoming weights, producing a constant activation. If this happens, then the saturated hidden units act like bias units, and the output bias values are redundant. However, you should not rely on this phenomenon to avoid using output biases, since networks without output biases are usually ill-conditioned and harder to train than networks that use output biases.

    Regarding bias-like values in RBF networks, see "How do MLPs compare with RBFs?"

    Reference:

    Hornik, K. (1993), "Some new results on neural network approximation," Neural Networks, 6, 1069-1072.
    ------------------------------------------------------------------------

    Subject: Why use activation functions?

    Activation functions for the hidden units are needed to introduce nonlinearity into the network. Without nonlinearity, hidden units would not make nets more powerful than just plain perceptrons (which do not have any hidden units, just input and output units). The reason is that a composition of linear functions is again a linear function. However, it is the nonlinearity (i.e., the capability to represent nonlinear functions) that makes multilayer networks so powerful. Almost any nonlinear function does the job, although for backpropagation learning it must be differentiable and it helps if the function is bounded; the sigmoidal functions such as logistic and tanh and the Gaussian function are the most common choices. Functions such as tanh that produce both positive and negative values tend to yield faster training than functions that produce only positive values such as logistic, because of better numerical conditioning.
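
    To see concretely why a purely linear net collapses, write a net whose hidden and output layers are both linear as y = W_2 (W_1 x + b_1) + b_2, where the W's are weight matrices and the b's are bias vectors. This simplifies to

       y = (W_2 W_1) x + (W_2 b_1 + b_2)

    which is just another linear (affine) function of x, so the linear hidden layer adds no representational power.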

    For the output units, you should choose an activation function suited to the distribution of the target values. For binary (0/1) outputs, the logistic function is an excellent choice (Jordan, 1995). For targets using 1-of-C coding, the softmax activation function is the logical extension of the logistic function. For continuous-valued targets with a bounded range, the logistic and tanh functions are again useful, provided you either scale the outputs to the range of the targets or scale the targets to the range of the output activation function ("scaling" means multiplying by and adding appropriate constants). But if the target values have no known bounded range, it is better to use an unbounded activation function, most often the identity function (which amounts to no activation function). If the target values are positive but have no known upper bound, you can use an exponential output activation function (but beware of overflow if you are writing your own code).

    There are certain natural associations between output activation functions and various noise distributions which have been studied by statisticians in the context of generalized linear models. The output activation function is the inverse of what statisticians call the "link function". See:

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.

    Jordan, M.I. (1995), "Why the logistic function? A tutorial discussion on probabilities and neural networks", ftp://psyche.mit.edu/pub/jordan/uai.ps.Z.

    For more information on activation functions, see Donald Tveter's Backpropagator's Review.
    ------------------------------------------------------------------------

    Subject: How to avoid overflow in the logistic function?

       /* For netinput > 45 the logistic equals 1 to double precision, and
          for netinput < -45 it is negligibly different from 0; clamping
          also keeps exp() from overflowing for very negative net inputs. */
       if (netinput < -45.0) netoutput = 0.0;
       else if (netinput > 45.0) netoutput = 1.0;
       else netoutput = 1.0/(1.0+exp(-netinput));
    The constant 45 will work for double precision on all machines that I know of, but there may be some bizarre machines where it will require some adjustment. Other activation functions can be handled similarly.
    ------------------------------------------------------------------------

    Subject: What is a softmax activation function?

    The purpose of the softmax activation function is to make the sum of the outputs equal to one, so that the outputs are interpretable as posterior probabilities. Let the net input to each output unit be q_i, i=1,...,c where c is the number of categories. Then the softmax output p_i is:
               exp(q_i)
       p_i = ------------
              c
             sum exp(q_j)
             j=1
    Unless you are using weight decay or Bayesian estimation or some such thing that requires the weights to be treated on an equal basis, you can choose any one of the output units and leave it completely unconnected--just set the net input to 0. Connecting all of the output units will just give you redundant weights and will slow down training. To see this, add an arbitrary constant z to each net input and you get:
               exp(q_i+z)       exp(q_i) exp(z)       exp(q_i)    
       p_i = ------------   = ------------------- = ------------   
              c                c                     c            
             sum exp(q_j+z)   sum exp(q_j) exp(z)   sum exp(q_j)  
             j=1              j=1                   j=1
    so nothing changes. Hence you can always pick one of the output units, and add an appropriate constant to each net input to produce any desired net input for the selected output unit, which you can choose to be zero or whatever is convenient. You can use the same trick to make sure that none of the exponentials overflows.
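
    In code, the trick amounts to subtracting the largest net input from all of them before exponentiating. A minimal C sketch (not from any particular package) is:

       #include <math.h>

       /* Softmax of the net inputs q[0..c-1] into p[0..c-1].  Subtracting
          the maximum net input leaves the outputs unchanged, as shown
          above, but guarantees that no exponential overflows. */
       void softmax(const double *q, double *p, int c)
       {
           int i;
           double qmax = q[0], sum = 0.0;
           for (i = 1; i < c; i++)
               if (q[i] > qmax) qmax = q[i];
           for (i = 0; i < c; i++) {
               p[i] = exp(q[i] - qmax);
               sum += p[i];
           }
           for (i = 0; i < c; i++)
               p[i] /= sum;
       }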

    Statisticians usually call softmax a "multiple logistic" function. It reduces to the simple logistic function when there are only two categories. Suppose you choose to set q_2 to 0. Then

               exp(q_1)         exp(q_1)              1
       p_1 = ------------ = ----------------- = -------------
              c             exp(q_1) + exp(0)   1 + exp(-q_1)
             sum exp(q_j)
             j=1
    and p_2, of course, is 1-p_1.

    The softmax function derives naturally from log-linear models and leads to convenient interpretations of the weights in terms of odds ratios. You could, however, use a variety of other nonnegative functions on the real line in place of the exp function. Or you could constrain the net inputs to the output units to be nonnegative, and just divide by the sum--that's called the Bradley-Terry-Luce model.

    References:

    Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp. 227-236.

    Bridle, J.S. (1990b). Training Stochastic Model Recognition Algorithms as Networks can lead to Maximum Mutual Information Estimation of Parameters. In: D.S.Touretzky (ed.), Advances in Neural Information Processing Systems 2, San Mateo: Morgan Kaufmann, pp. 211-217.

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall. See Chapter 5.

    ------------------------------------------------------------------------

    Subject: What is the curse of dimensionality?

    Answer by Janne Sinkkonen.

    The curse of dimensionality (Bellman 1961) refers to the exponential growth of hypervolume as a function of dimensionality. In the field of NNs, the curse of dimensionality expresses itself in two related problems:

  • Many NNs can be thought of as mappings from an input space to an output space. Thus, loosely speaking, an NN needs to somehow "monitor", cover or represent every part of its input space in order to know how that part of the space should be mapped. Covering the input space takes resources, and, in the most general case, the amount of resources needed is proportional to the hypervolume of the input space. The exact formulation of "resources" and "part of the input space" depends on the type of the network and should probably be based on the concepts of information theory and differential geometry.

  • As an example, think of vector quantization (VQ). In VQ, a set of units competitively learns to represent an input space (this is like Kohonen's Self-Organizing Map but without topography for the units). Imagine a VQ trying to share its units (resources) more or less equally over a hyperspherical input space. One could argue that the average distance from a random point of the space to the nearest network unit measures the goodness of the representation: the shorter the distance, the better the representation of the data in the sphere. It is intuitively clear (and can be experimentally verified) that the total number of units required to keep the average distance constant increases exponentially with the dimensionality of the sphere (if the radius of the sphere is fixed). A rough Monte Carlo illustration of this appears after this list.

    The curse of dimensionality causes networks with lots of irrelevant inputs to behave relatively badly: the dimension of the input space is high, and the network uses almost all its resources to represent irrelevant portions of the space.

    Unsupervised learning algorithms, as well as conventional RBF networks, are typically prone to this problem. A partial remedy is to preprocess the inputs in the right way, for example by scaling the components according to their "importance".

  • Even if we have a network algorithm which is able to focus on important portions of the input space, the higher the dimensionality of the input space, the more data may be needed to find out what is important and what is not.
  • A priori information can help with the curse of dimensionality. Careful feature selection and scaling of the inputs, as well as the choice of neural network model, fundamentally affect the severity of the problem. For classification purposes, only the borders of the classes need to be represented accurately.
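
    The following C program is a rough Monte Carlo illustration of the vector quantization example above (only a sketch; the numbers of centers and probe points are arbitrary). It estimates the average distance from a random point in the unit hypercube to the nearest of 100 random centers, and the estimate grows steadily with the dimensionality:

       #include <stdio.h>
       #include <stdlib.h>
       #include <math.h>

       #define MAXDIM   10
       #define NCENTERS 100

       /* squared Euclidean distance between two points in dim dimensions */
       static double sqdist(const double *a, const double *b, int dim)
       {
           double d = 0.0;
           int i;
           for (i = 0; i < dim; i++)
               d += (a[i] - b[i]) * (a[i] - b[i]);
           return d;
       }

       int main(void)
       {
           const int nprobes = 10000;
           double centers[NCENTERS][MAXDIM], probe[MAXDIM];
           int dim, i, j, k;

           srand(12345);
           for (dim = 1; dim <= MAXDIM; dim++) {
               double total = 0.0;
               for (i = 0; i < NCENTERS; i++)      /* random centers */
                   for (j = 0; j < dim; j++)
                       centers[i][j] = (double)rand() / RAND_MAX;
               for (k = 0; k < nprobes; k++) {     /* random probe points */
                   double best;
                   for (j = 0; j < dim; j++)
                       probe[j] = (double)rand() / RAND_MAX;
                   best = sqdist(probe, centers[0], dim);
                   for (i = 1; i < NCENTERS; i++) {
                       double d = sqdist(probe, centers[i], dim);
                       if (d < best) best = d;
                   }
                   total += sqrt(best);
               }
               printf("dim %2d: mean distance to nearest center = %.3f\n",
                      dim, total / nprobes);
           }
           return 0;
       }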

    References:

    Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press, section 1.4.

    Scott, D.W. (1992), Multivariate Density Estimation, NY: Wiley.

    ------------------------------------------------------------------------

    Subject: How do MLPs compare with RBFs?

    Multilayer perceptrons (MLPs) and radial basis function (RBF) networks are the two most commonly-used types of feedforward network. They have much more in common than most of the NN literature would suggest. The only fundamental difference is the way in which hidden units combine values coming from preceding layers in the network--MLPs use inner products, while RBFs use Euclidean distance. There are also differences in the customary methods for training MLPs and RBF networks, although most methods for training MLPs can also be applied to RBF networks. Furthermore, there are crucial differences between two broad types of RBF network--ordinary RBF networks and normalized RBF networks--that are ignored in most of the NN literature. These differences have important consequences for the generalization ability of the networks, especially when the number of inputs is large.

    Notation:

          a_j     is the altitude or height of the jth hidden unit
          b_j     is the bias of the jth hidden unit
          f       is the fan-in of the jth hidden unit 
          h_j     is the activation of the jth hidden unit 
          s       is a common width shared by all hidden units in the layer
          s_j     is the width of the jth hidden unit
          w_ij    is the weight connecting the ith input to
                    the jth hidden unit
          w_i     is the common weight for the ith input shared by
                    all hidden units in the layer
          x_i     is the ith input
    The inputs to each hidden or output unit must be combined with the weights to yield a single value called the "net input" to which the activation function is applied. There does not seem to be a standard term for the function that combines the inputs and weights; I will use the term "combination function". Thus, each hidden or output unit in a feedforward network first computes a combination function to produce the net input, and then applies an activation function to the net input yielding the activation of the unit.

    A multilayer perceptron has one or more hidden layers for which the combination function is the inner product of the inputs and weights, plus a bias. The activation function is usually a logistic or tanh function. Hence the formula for the activation is typically:

    h_j = tanh( b_j + sum[w_ij*x_i] )
    The MLP architecture is the most popular one in practical applications. Each layer uses a linear combination function. The inputs are fully connected to the first hidden layer, each hidden layer is fully connected to the next, and the last hidden layer is fully connected to the outputs. You can also have "skip-layer" connections; direct connections from inputs to outputs are especially useful.

    Consider the multidimensional space of inputs to a given hidden unit. Since an MLP uses linear combination functions, the set of all points in the space having a given value of the activation function is a hyperplane. The hyperplanes corresponding to different activation levels are parallel to each other (the hyperplanes for different units are not parallel in general). These parallel hyperplanes are the isoactivation contours of the hidden unit.

    Radial basis function (RBF) networks usually have only one hidden layer for which the combination function is based on the Euclidean distance between the input vector and the weight vector. RBF networks do not have anything that's exactly the same as the bias term in an MLP. But some types of RBFs have a "width" associated with each hidden unit or with the entire hidden layer; instead of adding it in the combination function like a bias, you divide the Euclidean distance by the width.

    To see the similarity between RBF networks and MLPs, it is convenient to treat the combination function as the square of distance/width. Then the familiar exp or softmax activation functions produce members of the popular class of Gaussian RBF networks. It can also be useful to add another term to the combination function that determines what I will call the "altitude" of the unit. The altitude is the maximum height of the Gaussian curve above the horizontal axis. I have not seen altitudes used in the NN literature; if you know of a reference, please tell me (saswss@unx.sas.com).

    The output activation function in RBF networks is usually the identity. The identity output activation function is a computational convenience in training (see Hybrid training and the curse of dimensionality) but it is possible and often desirable to use other output activation functions just as you would in an MLP.

    There are many types of radial basis functions. Gaussian RBFs seem to be the most popular by far in the NN literature. In the statistical literature, thin plate splines are also used (Green and Silverman 1994). This FAQ will concentrate on Gaussian RBFs.

    There are two distinct types of Gaussian RBF architectures. The first type uses the exp activation function, so the activation of the unit is a Gaussian "bump" as a function of the inputs. There seems to be no specific term for this type of Gaussian RBF network; I will use the term "ordinary RBF", or ORBF, network.

    The second type of Gaussian RBF architecture uses the softmax activation function, so the activations of all the hidden units are normalized to sum to one. This type of network is often called a "normalized RBF", or NRBF, network. In an NRBF network, the output units should not have a bias, since the constant bias term would be linearly dependent on the constant sum of the hidden units.

    While the distinction between these two types of Gaussian RBF architectures is sometimes mentioned in the NN literature, its importance has rarely been appreciated except by Tao (1993) and Werntges (1993). Shorten and Murray-Smith (1996) also compare ordinary and normalized Gaussian RBF networks.

    There are several subtypes of both ORBF and NRBF architectures. Descriptions and formulas are as follows:

    ORBFUN
    Ordinary radial basis function (RBF) network with unequal widths

    h_j = exp( - s_j^-2 * sum[(w_ij-x_i)^2] )
    ORBFEQ
    Ordinary radial basis function (RBF) network with equal widths

    h_j = exp( - s^-2 * sum[(w_ij-x_i)^2] )
    NRBFUN
    Normalized RBF network with unequal widths and heights

    h_j = softmax(f*log(a_j) - s_j^-2 * sum[(w_ij-x_i)^2] )
    NRBFEV
    Normalized RBF network with equal volumes

    h_j = softmax( f*log(s_j) - s_j^-2 * sum[(w_ij-x_i)^2] )
    NRBFEH
    Normalized RBF network with equal heights (and unequal widths)

    h_j = softmax( - s_j^-2 * sum[(w_ij-x_i)^2] )
    NRBFEW
    Normalized RBF network with equal widths (and unequal heights)

    h_j = softmax( f*log(a_j) - s^-2 * sum[(w_ij-x_i)^2] )
    NRBFEQ
    Normalized RBF network with equal widths and heights

    h_j = softmax( - s^-2 * sum[(w_ij-x_i)^2] )
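
    As a concrete illustration (this code is not from the FAQ or any particular package, and the names are invented), here are C sketches of two of the formulas above, the ORBFUN and NRBFEQ hidden-layer activations, using the notation of the table above and the numerically safe softmax routine sketched earlier:

       #include <math.h>

       void softmax(const double *q, double *p, int c);  /* sketched earlier */

       /* ORBFUN: h_j = exp( -s_j^-2 * sum_i (w_ij - x_i)^2 ).
          The weights (centers) are stored row-wise: w[j*nin + i] connects
          input i to hidden unit j. */
       void orbfun(const double *x, int nin, const double *w, const double *s,
                   double *h, int nhid)
       {
           int i, j;
           for (j = 0; j < nhid; j++) {
               double d2 = 0.0;
               for (i = 0; i < nin; i++) {
                   double diff = w[j*nin + i] - x[i];
                   d2 += diff * diff;
               }
               h[j] = exp(-d2 / (s[j] * s[j]));
           }
       }

       /* NRBFEQ: h_j = softmax( -s^-2 * sum_i (w_ij - x_i)^2 ), i.e. the
          same squared distances with a common width s, normalized so that
          the hidden activations sum to one. */
       void nrbfeq(const double *x, int nin, const double *w, double s,
                   double *h, int nhid)
       {
           int i, j;
           for (j = 0; j < nhid; j++) {
               double d2 = 0.0;
               for (i = 0; i < nin; i++) {
                   double diff = w[j*nin + i] - x[i];
                   d2 += diff * diff;
               }
               h[j] = -d2 / (s * s);     /* net input, stored in place */
           }
           softmax(h, h, nhid);  /* calling in place is safe for the sketch given earlier */
       }
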
    To illustrate various architectures, an example with two inputs and one output will be used so that the results can be shown graphically. The function being learned resembles a landscape with a Gaussian hill and a logistic plateau as shown in ftp://ftp.sas.com/pub/neural/hillplat.gif. There are 441 training cases on a regular 21-by-21 grid. The table below shows the root mean square error (RMSE) for a test data set. The test set has 1681 cases on a regular 41-by-41 grid over the same domain as the training set. If you are reading the HTML version of this document via a web browser, click on any number in the table to see a surface plot of the corresponding network output (each plot is a gif file, approximately 9K).

    The MLP networks in the table have one hidden layer with a tanh activation function. All of the networks use an identity activation function for the outputs.

              Hill and Plateau Data: RMSE for the Test Set
    
    HUs  MLP   ORBFEQ  ORBFUN  NRBFEQ  NRBFEW  NRBFEV  NRBFEH  NRBFUN
                                                               
     2  0.218   0.247   0.247   0.230   0.230   0.230   0.230   0.230  
     3  0.192   0.244   0.143   0.218   0.218   0.036   0.012   0.001 
     4  0.174   0.216   0.096   0.193   0.193   0.036   0.007
     5  0.160   0.188   0.083   0.086   0.051   0.003
     6  0.123   0.142   0.058   0.053   0.030
     7  0.107   0.123   0.051   0.025   0.019
     8  0.093   0.105   0.043   0.020   0.008
     9  0.084   0.085   0.038   0.017
    10  0.077   0.082   0.033   0.016
    12  0.059   0.074   0.024   0.005
    15  0.042   0.060   0.019
    20  0.023   0.046   0.010
    30  0.019   0.024
    40  0.016   0.022
    50  0.010   0.014
    The ORBF architectures use radial combination functions and the exp activation function. Only two of the radial combination functions are useful with ORBF architectures. For radial combination functions including an altitude, the altitude would be redundant with the hidden-to-output weights.

    Radial combination functions are based on the Euclidean distance between the vector of inputs to the unit and the vector of corresponding weights. Thus, the isoactivation contours for ORBF networks are concentric hyperspheres. A variety of activation functions can be used with the radial combination function, but the exp activation function, yielding a Gaussian surface, is the most useful. Radial networks typically have only one hidden layer, but it can be useful to include a linear layer for dimensionality reduction or oblique rotation before the RBF layer.

    The output of an ORBF network consists of a number of superimposed bumps, hence the output is quite bumpy unless many hidden units are used. Thus an ORBF network with only a few hidden units is incapable of fitting a wide variety of simple, smooth functions, and should rarely be used.

    The NRBF architectures also use radial combination functions but the activation function is softmax, which forces the sum of the activations for the hidden layer to equal one. Thus, each output unit computes a weighted average of the hidden-to-output weights, and the output values must lie within the range of the hidden-to-output weights. Therefore, if the hidden-to-output weights are within a reasonable range (such as the range of the target values), you can be sure that the outputs will be within that same range for all possible inputs, even when the net is extrapolating. No comparably useful bound exists for the output of an ORBF network.

    If you extrapolate far enough in a Gaussian ORBF network with an identity output activation function, the activation of every hidden unit will approach zero, hence the extrapolated output of the network will equal the output bias. If you extrapolate far enough in an NRBF network, one hidden unit will come to dominate the output. Hence if you want the network to extrapolate different values in different directions, an NRBF should be used instead of an ORBF.

    Radial combination functions incorporating altitudes are useful with NRBF architectures. The NRBF architectures combine some of the virtues of both the RBF and MLP architectures, as explained below. However, the isoactivation contours are considerably more complicated than for ORBF architectures.

    Consider the case of an NRBF network with only two hidden units. If the hidden units have equal widths, the isoactivation contours are parallel hyperplanes; in fact, this network is equivalent to an MLP with one logistic hidden unit. If the hidden units have unequal widths, the isoactivation contours are concentric hyperspheres; such a network is almost equivalent to an ORBF network with one Gaussian hidden unit.

    If there are more than two hidden units in an NRBF network, the isoactivation contours have no such simple characterization. If the RBF widths are very small, the isoactivation contours are approximately piecewise linear for RBF units with equal widths, and approximately piecewise spherical for RBF units with unequal widths. The larger the widths, the smoother the isoactivation contours where the pieces join. As Shorten and Murray-Smith (1996) point out, the activation is not necessarily a monotone function of distance from the center when unequal widths are used.

    The NRBFEQ architecture is a smoothed variant of the learning vector quantization (Kohonen 1988, Ripley 1996) and counterpropagation (Hecht-Nielsen 1990) architectures. In LVQ and counterprop, the hidden units are often called "codebook vectors". LVQ amounts to nearest-neighbor classification on the codebook vectors, while counterprop is nearest-neighbor regression on the codebook vectors. The NRBFEQ architecture uses not just the single nearest neighbor, but a weighted average of near neighbors. As the width of the NRBFEQ functions approaches zero, the weights approach one for the nearest neighbor and zero for all other codebook vectors. LVQ and counterprop use ad hoc algorithms of uncertain reliability, but standard numerical optimization algorithms (not to mention backprop) can be applied with the NRBFEQ architecture.

    In an NRBFEQ architecture, if each observation is taken as an RBF center, and if the weights are taken to be the target values, the outputs are simply weighted averages of the target values, and the network is identical to the well-known Nadaraya-Watson kernel regression estimator, which has been reinvented at least twice in the neural net literature (see "What is GRNN?"). A similar NRBFEQ network used for classification is equivalent to kernel discriminant analysis (see "What is PNN?").
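
    To make the connection concrete, the Nadaraya-Watson estimator can be written in a few lines of C (a sketch only, with invented names: Gaussian kernels, a single common width s, and one output; xtrain[k*nin + i] is input i of training case k and t[k] is its target):

       #include <math.h>

       /* Output = weighted average of the training targets, with weights
          proportional to exp(-||x - xtrain_k||^2 / s^2). */
       double nadaraya_watson(const double *x, int nin,
                              const double *xtrain, const double *t,
                              int ntrain, double s)
       {
           int i, k;
           double num = 0.0, den = 0.0;
           for (k = 0; k < ntrain; k++) {
               double d2 = 0.0, wk;
               for (i = 0; i < nin; i++) {
                   double diff = x[i] - xtrain[k*nin + i];
                   d2 += diff * diff;
               }
               wk = exp(-d2 / (s * s));
               num += wk * t[k];
               den += wk;
           }
           return den > 0.0 ? num / den : 0.0;   /* guard against underflow */
       }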

    Kernels with variable widths are also used for regression in the statistical literature. Such kernel estimators correspond to the NRBFEV architecture, in which the kernel functions have equal volumes but different altitudes. In the neural net literature, variable-width kernels appear always to be of the NRBFEH variety, with equal altitudes but unequal volumes. The analogy with kernel regression would make the NRBFEV architecture the obvious choice, but which of the two architectures works better in practice is an open question.

    Hybrid training and the curse of dimensionality

    A comparison of the various architectures must separate training issues from architectural issues to avoid common sources of confusion. RBF networks are often trained by "hybrid" methods, in which the hidden weights (centers) are first obtained by unsupervised learning, after which the output weights are obtained by supervised learning. Unsupervised methods for choosing the centers include:
  1. Distribute the centers in a regular grid over the input space.
  2. Choose a random subset of the training cases to serve as centers.
  3. Cluster the training cases based on the input variables, and use the mean of each cluster as a center.

    Various heuristic methods are also available for choosing the RBF widths (e.g., Moody and Darken 1989; Sarle 1994b). Once the centers and widths are fixed, the output weights can be learned very efficiently, since the computation reduces to a linear or generalized linear model. The hybrid training approach can thus be much faster than the nonlinear optimization that would be required for supervised training of all of the weights in the network.
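
    A minimal C sketch of the hybrid approach follows (not from the FAQ or any package: it assumes the centers have already been chosen by one of methods 1-3 above, a common width, Gaussian hidden units, and a single linear output with a bias, and it solves the normal equations without pivoting, so it is an illustration rather than robust code):

       #include <math.h>

       #define MAXH 65             /* maximum hidden units + 1 for the bias */

       /* Gaussian RBF activation for one hidden unit */
       static double gauss_rbf(const double *x, const double *center,
                               int nin, double s)
       {
           double d2 = 0.0;
           int i;
           for (i = 0; i < nin; i++) {
               double diff = x[i] - center[i];
               d2 += diff * diff;
           }
           return exp(-d2 / (s * s));
       }

       /* Solve the n-by-n system A b = c by Gaussian elimination. */
       static void solve(double A[MAXH][MAXH], double *c, double *b, int n)
       {
           int i, j, k;
           for (k = 0; k < n; k++)
               for (i = k + 1; i < n; i++) {
                   double m = A[i][k] / A[k][k];
                   for (j = k; j < n; j++) A[i][j] -= m * A[k][j];
                   c[i] -= m * c[k];
               }
           for (i = n - 1; i >= 0; i--) {
               double sum = c[i];
               for (j = i + 1; j < n; j++) sum -= A[i][j] * b[j];
               b[i] = sum / A[i][i];
           }
       }

       /* Given fixed centers and width s, fit the hidden-to-output weights
          b[0..nh-1] and the output bias b[nh] by linear least squares.
          x and centers are stored row-wise (x[k*nin+i], centers[j*nin+i]);
          nh must be less than MAXH. */
       void fit_output_weights(const double *x, const double *t, int ncases,
                               int nin, const double *centers, int nh,
                               double s, double *b)
       {
           double A[MAXH][MAXH] = {{0.0}}, c[MAXH] = {0.0}, phi[MAXH];
           int n = nh + 1, i, j, k;
           for (k = 0; k < ncases; k++) {
               for (j = 0; j < nh; j++)
                   phi[j] = gauss_rbf(&x[k*nin], &centers[j*nin], nin, s);
               phi[nh] = 1.0;                     /* bias column */
               for (i = 0; i < n; i++) {          /* accumulate Phi'Phi, Phi't */
                   for (j = 0; j < n; j++) A[i][j] += phi[i] * phi[j];
                   c[i] += phi[i] * t[k];
               }
           }
           solve(A, c, b, n);
       }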

    Hybrid training is not often applied to MLPs because no effective methods are known for unsupervised training of the hidden units (except when there is only one input).

    Hybrid training will usually require more hidden units than supervised training. Since supervised training optimizes the locations of the centers, while hybrid training does not, supervised training will provide a better approximation to the function to be learned for a given number of hidden units. Thus, the better fit provided by supervised training will often let you use fewer hidden units for a given accuracy of approximation than you would need with hybrid training. And if the hidden-to-output weights are learned by linear least-squares, the fact that hybrid training requires more hidden units implies that hybrid training will also require more training cases for the same accuracy of generalization (Tarassenko and Roberts 1994).

    The number of hidden units required by hybrid methods becomes an increasingly serious problem as the number of inputs increases. In fact, the required number of hidden units tends to increase exponentially with the number of inputs. This drawback of hybrid methods is discussed by Minsky and Papert (1969). For example, with method (1) for RBF networks, you would need at least five elements in the grid along each dimension to detect a moderate degree of nonlinearity; so if you have Nx inputs, you would need at least 5^Nx hidden units. For methods (2) and (3), the number of hidden units increases exponentially with the effective dimensionality of the input distribution. If the inputs are linearly related, the effective dimensionality is the number of nonnegligible (a deliberately vague term) eigenvalues of the covariance matrix, so the inputs must be highly correlated if the effective dimensionality is to be much less than the number of inputs.

    The exponential increase in the number of hidden units required for hybrid learning is one aspect of the curse of dimensionality. The number of training cases required also increases exponentially in general. No neural network architecture--in fact no method of learning or statistical estimation--can escape the curse of dimensionality in general, hence there is no practical method of learning general functions in more than a few dimensions.

    Fortunately, in many practical applications of neural networks with a large number of inputs, most of those inputs are additive, redundant, or irrelevant, and some architectures can take advantage of these properties to yield useful results. But escape from the curse of dimensionality requires fully supervised training as well as special types of data. Supervised training for RBF networks can be done by "backprop" (see "What is backprop?") or other optimization methods (see "What are conjugate gradients, Levenberg-Marquardt, etc.?"), or by subset regression (see "What are OLS and subset/stepwise regression?").

    Additive inputs

    An additive model is one in which the output is a sum of linear or nonlinear transformations of the inputs. If an additive model is appropriate, the number of weights increases linearly with the number of inputs, so high dimensionality is not a curse. Various methods of training additive models are available in the statistical literature (e.g. Hastie and Tibshirani 1990). You can also create a feedforward neural network, called a "generalized additive network" (GAN), to fit additive models (Sarle 1994a). Additive models have been proposed in the neural net literature under the name "topologically distributed encoding" (Geiger 1990).

    Projection pursuit regression (PPR) provides both universal approximation and the ability to avoid the curse of dimensionality for certain common types of target functions (Friedman and Stuetzle 1981). Like MLPs, PPR computes the output as a sum of nonlinear transformations of linear combinations of the inputs. Each term in the sum is analogous to a hidden unit in an MLP. But unlike MLPs, PPR allows general, smooth nonlinear transformations rather than a specific nonlinear activation function, and allows a different transformation for each term. The nonlinear transformations in PPR are usually estimated by nonparametric regression, but you can set up a projection pursuit network (PPN), in which each nonlinear transformation is performed by a subnetwork. If a PPN provides an adequate fit with few terms, then the curse of dimensionality can be avoided, and the results may even be interpretable.

    If the target function can be accurately approximated by projection pursuit, then it can also be accurately approximated by an MLP with a single hidden layer. The disadvantage of the MLP is that there is little hope of interpretability. An MLP with two or more hidden layers can provide a parsimonious fit to a wider variety of target functions than can projection pursuit, but no simple characterization of these functions is known.

    Redundant inputs

    With proper training, all of the RBF architectures listed above, as well as MLPs, can process redundant inputs effectively. When there are redundant inputs, the training cases lie close to some (possibly nonlinear) subspace. If the same degree of redundancy applies to the test cases, the network need produce accurate outputs only near the subspace occupied by the data. Adding redundant inputs has little effect on the effective dimensionality of the data; hence the curse of dimensionality does not apply, and even hybrid methods (2) and (3) can be used. However, if the test cases do not follow the same pattern of redundancy as the training cases, generalization will require extrapolation and will rarely work well.

    Irrelevant inputs

    MLP architectures are good at ignoring irrelevant inputs. MLPs can also select linear subspaces of reduced dimensionality. Since the first hidden layer forms linear combinations of the inputs, it confines the network's attention to the linear subspace spanned by the weight vectors. Hence, adding irrelevant inputs to the training data does not increase the number of hidden units required, although it increases the amount of training data required.

    ORBF architectures are not good at ignoring irrelevant inputs. The number of hidden units required grows exponentially with the number of inputs, regardless of how many inputs are relevant. This exponential growth is related to the fact that ORBFs have local receptive fields, meaning that changing the hidden-to-output weights of a given unit will affect the output of the network only in a neighborhood of the center of the hidden unit, where the size of the neighborhood is determined by the width of the hidden unit. (Of course, if the width of the unit is learned, the receptive field could grow to cover the entire training set.)

    Local receptive fields are often an advantage compared to the distributed architecture of MLPs, since local units can adapt to local patterns in the data without having unwanted side effects in other regions. In a distributed architecture such as an MLP, adapting the network to fit a local pattern in the data can cause spurious side effects in other parts of the input space.

    However, ORBF architectures often must be used with relatively small neighborhoods, so that several hidden units are required to cover the range of an input. When there are many nonredundant inputs, the hidden units must cover the entire input space, and the number of units required is essentially the same as in the hybrid case (1) where the centers are in a regular grid; hence the exponential growth in the number of hidden units with the number of inputs, regardless of whether the inputs are relevant.

    You can enable an ORBF architecture to ignore irrelevant inputs by using an extra, linear hidden layer before the radial hidden layer. This type of network is sometimes called an "elliptical basis function" network. If the number of units in the linear hidden layer equals the number of inputs, the linear hidden layer performs an oblique rotation of the input space that can suppress irrelevant directions and differentially weight relevant directions according to their importance. If you think that the presence of irrelevant inputs is highly likely, you can force a reduction of dimensionality by using fewer units in the linear hidden layer than the number of inputs.

    Note that the linear and radial hidden layers must be connected in series, not in parallel, to ignore irrelevant inputs. In some applications it is useful to have linear and radial hidden layers connected in parallel, but in such cases the radial hidden layer will be sensitive to all inputs.

    For even greater flexibility (at the cost of more weights to be learned), you can have a separate linear hidden layer for each RBF unit, allowing a different oblique rotation for each RBF unit.

    NRBF architectures with equal widths (NRBFEW and NRBFEQ) combine the advantage of local receptive fields with the ability to ignore irrelevant inputs. The receptive field of one hidden unit extends from the center in all directions until it encounters the receptive field of another hidden unit. It is convenient to think of a "boundary" between the two receptive fields, defined as the hyperplane where the two units have equal activations, even though the effect of each unit will extend somewhat beyond the boundary. The location of the boundary depends on the heights of the hidden units. If the two units have equal heights, the boundary lies midway between the two centers. If the units have unequal heights, the boundary is farther from the higher unit.

    If a hidden unit is surrounded by other hidden units, its receptive field is indeed local, curtailed by the field boundaries with other units. But if a hidden unit is not completely surrounded, its receptive field can extend infinitely in certain directions. If there are irrelevant inputs, or more generally, irrelevant directions that are linear combinations of the inputs, the centers need only be distributed in a subspace orthogonal to the irrelevant directions. In this case, the hidden units can have local receptive fields in relevant directions but infinite receptive fields in irrelevant directions.

    For NRBF architectures allowing unequal widths (NRBFUN, NRBFEV, and NRBFEH), the boundaries between receptive fields are generally hyperspheres rather than hyperplanes. In order to ignore irrelevant inputs, such networks must be trained to have equal widths. Hence, if you think there is a strong possibility that some of the inputs are irrelevant, it is usually better to use an architecture with equal widths.

    References:
    There are few good references on RBF networks. Bishop (1995) gives one of the better surveys, but also see Tao (1993) and Werntges (1993) for the importance of normalization.

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Friedman, J.H. and Stuetzle, W. (1981), "Projection pursuit regression," J. of the American Statistical Association, 76, 817-823.

    Geiger, H. (1990), "Storing and Processing Information in Connectionist Systems," in Eckmiller, R., ed., Advanced Neural Computers, 271-277, Amsterdam: North-Holland.

    Green, P.J. and Silverman, B.W. (1994), Nonparametric Regression and Generalized Linear Models: A roughness penalty approach, London: Chapman & Hall.

    Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models, London: Chapman & Hall.

    Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.

    Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1 (suppl 1), 303.

    Minsky, M.L. and Papert, S.A. (1969), Perceptrons, Cambridge, MA: MIT Press.

    Moody, J. and Darken, C.J. (1989), "Fast learning in networks of locally-tuned processing units," Neural Computation, 1, 281-294.

    Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Sarle, W.S. (1994a), "Neural Networks and Statistical Models," in SAS Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550, ftp://ftp.sas.com/pub/neural/neural1.ps.

    Sarle, W.S. (1994b), "Neural Network Implementation in SAS Software," in SAS Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc., pp 1551-1573, ftp://ftp.sas.com/pub/neural/neural2.ps.

    Shorten, R., and Murray-Smith, R. (1996), "Side effects of normalising radial basis function networks," International Journal of Neural Systems, 7, 167-179.

    Tao, K.M. (1993), "A closer look at the radial basis function (RBF) networks," Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers (Singh, A., ed.), vol 1, 401-405, Los Alamitos, CA: IEEE Comput. Soc. Press.

    Tarassenko, L. and Roberts, S. (1994), "Supervised and unsupervised learning in radial basis function classifiers," IEE Proceedings-- Vis. Image Signal Processing, 141, 210-216.

    Werntges, H.W. (1993), "Partitions of unity improve neural function approximation," Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, vol 2, 914-918.

    ------------------------------------------------------------------------

    Subject: What are OLS and subset/stepwise regression?

    If you are a statistician, "OLS" means "ordinary least squares" (as opposed to weighted or generalized least squares), which is what the NN literature often calls "LMS" (least mean squares).

    If you are a neural networker, "OLS" means "orthogonal least squares", which is an algorithm for forward stepwise regression proposed by Chen et al. (1991) for training RBF networks.

    OLS is a variety of supervised training. But whereas backprop and other commonly-used supervised methods are forms of continuous optimization, OLS is a form of combinatorial optimization. Rather than treating the RBF centers as continuous values to be adjusted to reduce the training error, OLS starts with a large set of candidate centers and selects a subset that usually provides good training error. For small training sets, the candidates can include all of the training cases. For large training sets, it is more efficient to use a random subset of the training cases or to do a cluster analysis and use the cluster means as candidates.

    Each center corresponds to a predictor variable in a linear regression model. The values of these predictor variables are computed from the RBF applied to each center. There are numerous methods for selecting a subset of predictor variables in regression (Myers 1986; Miller 1990). The ones most often used are:

  • Forward selection begins with no centers in the network. At each step the center is added that most decreases the error function.
  • Backward elimination begins with all candidate centers in the network. At each step the center is removed that least increases the error function.
  • Stepwise selection begins like forward selection with no centers in the network. At each step, a center is added or removed. If there are any centers in the network, the one that contributes least to reducing the error criterion is subjected to a statistical test (usually based on the F statistic) to see if it is worth retaining in the network; if the center fails the test, it is removed. If no centers are removed, then the centers that are not currently in the network are examined; the one that would contribute most to reducing the error criterion is subjected to a statistical test to see if it is worth adding to the network; if the center passes the test, it is added. When all centers in the network pass the test for staying in the network, and all other centers fail the test for being added to the network, the stepwise method terminates.
  • Leaps and bounds (Furnival and Wilson 1974) is an algorithm for determining the subset of centers that minimizes the error function; this optimal subset can be found without examining all possible subsets, but the algorithm is practical only up to 30 to 50 candidate centers.
    OLS is a particular algorithm for forward selection using modified Gram-Schmidt (MGS) orthogonalization. While MGS is not a bad algorithm, it is not the best algorithm for linear least-squares (Lawson and Hanson 1974). For ill-conditioned data, Householder and Givens methods are generally preferred, while for large, well-conditioned data sets, methods based on the normal equations require about one-third as many floating point operations and much less disk I/O than OLS. Normal equation methods based on sweeping (Goodnight 1979) or Gaussian elimination (Furnival and Wilson 1974) are especially simple to program.
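
    To show the flavor of forward selection, here is a C sketch of a greedy forward-selection loop based on Gram-Schmidt orthogonalization, in the spirit of OLS (an illustration only, not the algorithm of Chen et al.: there is no stopping rule, no statistical testing, no regularization, and a single output; the caller supplies the candidate columns and workspace):

       static double dot(const double *a, const double *b, int n)
       {
           double s = 0.0;
           int i;
           for (i = 0; i < n; i++) s += a[i] * b[i];
           return s;
       }

       /* Copy candidate column col into u and orthogonalize it against the
          nbasis columns already stored in basis (each of length ncases). */
       static void orthogonalize(const double *col, const double *basis,
                                 int nbasis, int ncases, double *u)
       {
           int m, k;
           for (k = 0; k < ncases; k++) u[k] = col[k];
           for (m = 0; m < nbasis; m++) {
               const double *bm = &basis[m * ncases];
               double proj = dot(u, bm, ncases) / dot(bm, bm, ncases);
               for (k = 0; k < ncases; k++) u[k] -= proj * bm[k];
           }
       }

       /* Greedily select nselect of the ncand candidate columns phi (stored
          column-wise: phi[j*ncases + k] is candidate j on training case k,
          e.g. an RBF centered at candidate j evaluated on case k), choosing
          at each step the candidate whose orthogonalized column most
          reduces the residual sum of squares against the targets t.
          chosen[] receives the selected indices; basis (nselect*ncases
          doubles) and r (ncases doubles) are caller-supplied workspace. */
       void forward_select(const double *phi, int ncand, int ncases,
                           const double *t, int nselect, int *chosen,
                           double *basis, double *r)
       {
           int step, j, m, k;
           for (k = 0; k < ncases; k++) r[k] = t[k];   /* residual = target */
           for (step = 0; step < nselect; step++) {
               double *u = &basis[step * ncases];
               double bestred = -1.0, ru, uu;
               int best = -1;
               for (j = 0; j < ncand; j++) {
                   int used = 0;
                   for (m = 0; m < step; m++)
                       if (chosen[m] == j) used = 1;
                   if (used) continue;
                   orthogonalize(&phi[j * ncases], basis, step, ncases, u);
                   uu = dot(u, u, ncases);
                   if (uu <= 0.0) continue;            /* linearly dependent */
                   ru = dot(r, u, ncases);
                   if (ru * ru / uu > bestred) {
                       bestred = ru * ru / uu;
                       best = j;
                   }
               }
               if (best < 0) break;           /* no independent candidate left */
               orthogonalize(&phi[best * ncases], basis, step, ncases, u);
               chosen[step] = best;
               ru = dot(r, u, ncases);
               uu = dot(u, u, ncases);
               for (k = 0; k < ncases; k++)   /* remove the fitted component */
                   r[k] -= (ru / uu) * u[k];
           }
       }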

    While the theory of linear models is the most thoroughly developed area of statistical inference, subset selection invalidates most of the standard theory (Miller 1990; Roecker 1991; Derksen and Keselman 1992; Freedman, Pee, and Midthune 1992).

    Subset selection methods usually do not generalize as well as regularization methods in linear models (Frank and Friedman 1993). Orr (1995) has proposed combining regularization with subset selection for RBF training (see also Orr 1996).

    References:

    Chen, S., Cowan, C.F.N., and Grant, P.M. (1991), "Orthogonal least squares learning for radial basis function networks," IEEE Transactions on Neural Networks, 2, 302-309.

    Derksen, S. and Keselman, H. J. (1992) "Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables," British Journal of Mathematical and Statistical Psychology, 45, 265-282.

    Frank, I.E. and Friedman, J.H. (1993) "A statistical view of some chemometrics regression tools," Technometrics, 35, 109-148.

    Freedman, L.S., Pee, D. and Midthune, D.N. (1992) "The problem of underestimating the residual error variance in forward stepwise regression", The Statistician, 41, 405-412.

    Furnival, G.M. and Wilson, R.W. (1974), "Regression by Leaps and Bounds," Technometrics, 16, 499-511.

    Goodnight, J.H. (1979), "A Tutorial on the SWEEP Operator," The American Statistician, 33, 149-158.

    Lawson, C. L. and Hanson, R. J. (1974), Solving Least Squares Problems, Englewood Cliffs, NJ: Prentice-Hall, Inc. (2nd edition: 1995, Philadelphia: SIAM)

    Miller, A.J. (1990), Subset Selection in Regression, Chapman & Hall.

    Myers, R.H. (1986), Classical and Modern Regression with Applications, Boston: Duxbury Press.

    Orr, M.J.L. (1995), "Regularisation in the selection of radial basis function centres," Neural Computation, 7, 606-623.

    Orr, M.J.L. (1996), "Introduction to radial basis function networks," http://www.cns.ed.ac.uk/people/mark/intro.ps or http://www.cns.ed.ac.uk/people/mark/intro/intro.html .

    Roecker, E.B. (1991) "Prediction error and its estimation for subset-selected models," Technometrics, 33, 459-468.

    ------------------------------------------------------------------------

    Subject: Should I normalize/standardize/rescale the data?

    First, some definitions. "Rescaling" a vector means to add or subtract a constant and then multiply or divide by a constant, as you would do to change the units of measurement of the data, for example, to convert a temperature from Celsius to Fahrenheit.

    "Normalizing" a vector most often means dividing by a norm of the vector, for example, to make the Euclidean length of the vector equal to one. In the NN literature, "normalizing" also often refers to rescaling by the minimum and range of the vector, to make all the elements lie between 0 and 1.

    "Standardizing" a vector most often means subtracting a measure of location and dividing by a measure of scale. For example, if the vector contains random values with a Gaussian distribution, you might subtract the mean and divide by the standard deviation, thereby obtaining a "standard normal" random variable with mean 0 and standard deviation 1.

    However, all of the above terms are used more or less interchangeably depending on the customs within various fields. Since the FAQ maintainer is a statistician, he is going to use the term "standardize" because that is what he is accustomed to.

    Now the question is, should you do any of these things to your data? The answer is, it depends.

    There is a common misconception that the inputs to a multilayer perceptron must be in the interval [0,1]. There is in fact no such requirement, although there often are benefits to standardizing the inputs as discussed below. But it is better to have the input values centered around zero, so scaling the inputs to the interval [0,1] is usually a bad choice.

    If your output activation function has a range of [0,1], then obviously you must ensure that the target values lie within that range. But it is generally better to choose an output activation function suited to the distribution of the targets than to force your data to conform to the output activation function. See "Why use activation functions?"

    When using an output activation function with a range of [0,1], some people prefer to rescale the targets to a range of [.1,.9]. I suspect that the popularity of this gimmick is due to the slowness of standard backprop. But using a target range of [.1,.9] for a classification task gives you incorrect posterior probability estimates; it is also unnecessary if you use an efficient training algorithm (see "What are conjugate gradients, Levenberg-Marquardt, etc.?"), and it is unnecessary for avoiding overflow (see How to avoid overflow in the logistic function?).

    Now for some of the gory details: note that the training data form a matrix. Let's set up this matrix so that each case forms a row, and the inputs and target variables form columns. You could conceivably standardize the rows or the columns or both or various other things, and these different ways of choosing vectors to standardize will have quite different effects on training.

    Standardizing either input or target variables tends to make the training process better behaved by improving the numerical condition of the optimization problem and ensuring that various default values involved in initialization and termination are appropriate. Standardizing targets can also affect the objective function.

    Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous.

    Subquestion: Should I standardize the input variables (column vectors)?

    That depends primarily on how the network combines input variables to compute the net input to the next (hidden or output) layer. If the input variables are combined via a distance function (such as Euclidean distance) in an RBF network, standardizing inputs can be crucial. The contribution of an input will depend heavily on its variability relative to other inputs. If one input has a range of 0 to 1, while another input has a range of 0 to 1,000,000, then the contribution of the first input to the distance will be swamped by the second input. So it is essential to rescale the inputs so that their variability reflects their importance, or at least is not in inverse relation to their importance. For lack of better prior information, it is common to standardize each input to the same range or the same standard deviation. If you know that some inputs are more important than others, it may help to scale the inputs such that the more important ones have larger variances and/or ranges.

    If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.

    The main emphasis in the NN literature on initial values has been on the avoidance of saturation, hence the desire to use small random values. How small these random values should be depends on the scale of the inputs as well as the number of inputs and their correlations. Standardizing inputs removes the problem of scale dependence of the initial weights.

    But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation. Assume we have an MLP with one hidden layer applied to a classification problem and are therefore interested in the hyperplanes defined by each hidden unit. Each hyperplane is the locus of points where the net-input to the hidden unit is zero and is thus the classification boundary generated by that hidden unit considered in isolation. The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to [-1,1] will work better than [0,1], although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good or better.

    Standardizing input variables also has different effects on different training algorithms for MLPs. For example:

  • Steepest descent is very sensitive to scaling. The more ill-conditioned the Hessian is, the slower the convergence. Hence, scaling is an important consideration for gradient descent methods such as standard backprop.
  • Quasi-Newton and conjugate gradient methods begin with a steepest descent step and therefore are scale sensitive. However, they accumulate second-order information as training proceeds and hence are less scale sensitive than pure gradient descent.
  • Newton-Raphson and Gauss-Newton, if implemented correctly, are theoretically invariant under scale changes as long as none of the scaling is so extreme as to produce underflow or overflow.
  • Levenberg-Marquardt is scale invariant as long as no ridging is required. There are several different ways to implement ridging; some are scale invariant and some are not. Performance under bad scaling will depend on details of the implementation.
    Two of the most useful ways to standardize inputs are:

  • Mean 0 and standard deviation 1
  • Midrange 0 and range 2 (i.e., minimum -1 and maximum 1)

    Formulas are as follows:
    Notation:
    
       X = value of the raw input variable X for the ith training case
        i
       
       S = standardized value corresponding to X
        i                                       i
       
       N = number of training cases
    
                               
    Standardize X  to mean 0 and standard deviation 1:
                 i   
    
              sum X
               i   i   
       mean = ------
                 N
       
                                  2
                    sum( X - mean)
                     i    i
       std  = sqrt( --------------- )
                         N - 1
                               
    
           X  - mean
            i
       S = ---------
        i     std
    
                               
    Standardize X  to midrange 0 and range 2:
                 i   
    
                  max X  +  min X
                   i   i     i   i
       midrange = ----------------
                         2
    
    
       range = max X  -  min X
                i   i     i   i
    
    
           X  - midrange
            i
       S = -------------
        i     range / 2
    
    Various other pairs of location and scale estimators can be used besides the mean and standard deviation, or midrange and range. Robust estimates of location and scale are desirable if the inputs contain outliers. For example, see:
    Iglewicz, B. (1983), "Robust scale estimators and confidence intervals for location", in Hoaglin, D.C., Mosteller, F. and Tukey, J.W., eds., Understanding Robust and Exploratory Data Analysis, NY: Wiley.
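
    In code, the two standardizations above amount to the following (a C sketch with invented names; it assumes the column is not constant, since a zero standard deviation or range would cause a division by zero):

       #include <math.h>

       /* Standardize the column x[0..n-1] to mean 0 and standard deviation 1,
          writing the result into s[0..n-1]. */
       void std_meansd(const double *x, double *s, int n)
       {
           double mean = 0.0, var = 0.0;
           int i;
           for (i = 0; i < n; i++) mean += x[i];
           mean /= n;
           for (i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
           var /= (n - 1);
           for (i = 0; i < n; i++) s[i] = (x[i] - mean) / sqrt(var);
       }

       /* Standardize the column x[0..n-1] to midrange 0 and range 2
          (i.e., minimum -1 and maximum 1). */
       void std_midrange(const double *x, double *s, int n)
       {
           double lo = x[0], hi = x[0];
           int i;
           for (i = 1; i < n; i++) {
               if (x[i] < lo) lo = x[i];
               if (x[i] > hi) hi = x[i];
           }
           for (i = 0; i < n; i++)
               s[i] = (x[i] - (hi + lo) / 2.0) / ((hi - lo) / 2.0);
       }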

    Subquestion: Should I standardize the target variables (column vectors)?

    Standardizing target variables is typically more a convenience for getting good initial weights than a necessity. However, if you have two or more target variables and your error function is scale-sensitive like the usual least (mean) squares error function, then the variability of each target relative to the others can affect how well the net learns that target. If one target has a range of 0 to 1, while another target has a range of 0 to 1,000,000, the net will expend most of its effort learning the second target to the possible exclusion of the first. So it is essential to rescale the targets so that their variability reflects their importance, or at least is not in inverse relation to their importance. If the targets are of equal importance, they should typically be standardized to the same range or the same standard deviation.

    The scaling of the targets does not affect their importance in training if you use maximum likelihood estimation and estimate a separate scale parameter (such as a standard deviation) for each target variable. In this case, the importance of each target is inversely related to its estimated scale parameter. In other words, noisier targets will be given less importance.

    For weight decay and Bayesian estimation, the scaling of the targets affects the decay values and prior distributions. Hence it is usually most convenient to work with standardized targets.

    If you are standardizing targets to equalize their importance, then you should probably standardize to mean 0 and standard deviation 1, or use related robust estimators, as discussed under Should I standardize the input variables (column vectors)? If you are standardizing targets to force the values into the range of the output activation function, it is important to use lower and upper bounds for the values, rather than the minimum and maximum values in the training set. For example, if the output activation function has range [-1,1], you can use the following formulas:

       Y = value of the raw target variable Y for the ith training case
        i
       
       Z = standardized value corresponding to Y
        i                                       i
       
                  upper bound of Y  +  lower bound of Y
       midrange = -------------------------------------
                                    2
    
    
       range = upper bound of Y  -  lower bound of Y
    
    
           Y  - midrange
            i
       Z = -------------
        i    range / 2
    For a range of [0,1], you can use the following formula:
                   Y  - lower bound of Y
                    i
       Z = -------------------------------------
        i  upper bound of Y  -  lower bound of Y
    And of course, you apply the inverse of the standardization formula to the network outputs to restore them to the scale of the original target values.
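
    In code (a C sketch with invented names; lo and hi are the known lower and upper bounds of the target), the [-1,1] scaling and its inverse are:

       /* Scale a target y with known bounds [lo,hi] into [-1,1], matching
          the formulas above, and map a network output z back to the
          original target scale.  For a [0,1] output range use
          (y - lo) / (hi - lo) and z * (hi - lo) + lo instead. */
       double target_to_pm1(double y, double lo, double hi)
       {
           return (y - (hi + lo) / 2.0) / ((hi - lo) / 2.0);
       }

       double pm1_to_target(double z, double lo, double hi)
       {
           return z * ((hi - lo) / 2.0) + (hi + lo) / 2.0;
       }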

    If the target variable does not have known upper and lower bounds, it is not advisable to use an output activation function with a bounded range. You can use an identity output activation function or other unbounded output activation function instead; see Why use activation functions?

    Subquestion: Should I standardize the variables (column vectors) for unsupervised learning?

    The most commonly used methods of unsupervised learning, including various kinds of vector quantization, Kohonen networks, Hebbian learning, etc., depend on Euclidean distances or scalar-product similarity measures. The considerations are therefore the same as for standardizing inputs in RBF networks--see Should I standardize the input variables (column vectors)? above. In particular, if one input has a large variance and another a small variance, the latter will have little or no influence on the results.

    If you are using unsupervised competitive learning to try to discover natural clusters in the data, rather than for data compression, simply standardizing the variables may be inadequate. For more sophisticated methods of preprocessing, see:

    Art, D., Gnanadesikan, R., and Kettenring, R. (1982), "Data-based Metrics for Cluster Analysis," Utilitas Mathematica, 21A, 75-99.

    Jannsen, P., Marron, J.S., Veraverbeke, N., and Sarle, W.S. (1995), "Scale measures for bandwidth selection", J. of Nonparametric Statistics, 5, 359-380.

    Better yet for finding natural clusters, try mixture models or nonparametric density estimation. For example:
    Girman, C.J. (1994), "Cluster Analysis and Classification Tree Methodology as an Aid to Improve Understanding of Benign Prostatic Hyperplasia," Ph.D. thesis, Chapel Hill, NC: Department of Biostatistics, University of North Carolina.

    McLachlan, G.J. and Basford, K.E. (1988), Mixture Models, New York: Marcel Dekker, Inc.

    SAS Institute Inc. (1993), SAS/STAT Software: The MODECLUS Procedure, SAS Technical Report P-256, Cary, NC: SAS Institute Inc.

    Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985), Statistical Analysis of Finite Mixture Distributions, New York: John Wiley & Sons, Inc.

    Wong, M.A. and Lane, T. (1983), "A kth Nearest Neighbor Clustering Procedure," Journal of the Royal Statistical Society, Series B, 45, 362-368.

    Subquestion: Should I standardize the input cases (row vectors)?

    Whereas standardizing variables is usually beneficial, the effect of standardizing cases (row vectors) depends on the particular data. Cases are typically standardized only across the input variables, since including the target variable(s) in the standardization would make prediction impossible.

    There are some kinds of networks, such as simple Kohonen nets, where it is necessary to standardize the input cases to a common Euclidean length; this is a side effect of the use of the inner product as a similarity measure. If the network is modified to operate on Euclidean distances instead of inner products, it is no longer necessary to standardize the input cases.

    Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous. Issues regarding the standardization of cases must be carefully evaluated in every application. There are no rules of thumb that apply to all applications.

    You may want to standardize each case if there is extraneous variability between cases. Consider the common situation in which each input variable represents a pixel in an image. If the images vary in exposure, and exposure is irrelevant to the target values, then it would usually help to subtract the mean of each case to equate the exposures of different cases. If the images vary in contrast, and contrast is irrelevant to the target values, then it would usually help to divide each case by its standard deviation to equate the contrasts of different cases. Given sufficient data, a NN could learn to ignore exposure and contrast. However, training will be easier and generalization better if you can remove the extraneous exposure and contrast information before training the network.

    As another example, suppose you want to classify plant specimens according to species but the specimens are at different stages of growth. You have measurements such as stem length, leaf length, and leaf width. However, the over-all size of the specimen is determined by age or growing conditions, not by species. Given sufficient data, a NN could learn to ignore the size of the specimens and classify them by shape instead. However, training will be easier and generalization better if you can remove the extraneous size information before training the network. Size in the plant example corresponds to exposure in the image example.

    If the input data are measured on an interval scale (for information on scales of measurement, see "Measurement theory: Frequently asked questions", at ftp://ftp.sas.com/pub/neural/measurement.html) you can control for size by subtracting a measure of the over-all size of each case from each datum. For example, if no other direct measure of size is available, you could subtract the mean of each row of the input matrix, producing a row-centered input matrix.

    If the data are measured on a ratio scale, you can control for size by dividing each datum by a measure of over-all size. It is common to divide by the sum or by the arithmetic mean. For positive ratio data, however, the geometric mean is often a more natural measure of size than the arithmetic mean. It may also be more meaningful to analyze the logarithms of positive ratio-scaled data, in which case you can subtract the arithmetic mean after taking logarithms. You must also consider the dimensions of measurement. For example, if you have measures of both length and weight, you may need to cube the measures of length or take the cube root of the weights.

    In NN applications with ratio-level data, it is common to divide by the Euclidean length of each row. If the data are positive, dividing by the Euclidean length has properties similar to dividing by the sum or arithmetic mean, since the former projects the data points onto the surface of a hypersphere while the latter projects the points onto a hyperplane. If the dimensionality is not too high, the resulting configurations of points on the hypersphere and hyperplane are usually quite similar. If the data contain negative values, then the hypersphere and hyperplane can diverge widely.
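
    The three kinds of case standardization discussed above can be sketched as follows (assuming numpy and a hypothetical input matrix with one row per case):

       import numpy as np

       x = np.array([[ 2.0,  4.0,  6.0],
                     [20.0, 40.0, 60.0]])   # two cases, three inputs

       # Subtract the row mean (e.g. to equate "exposure")
       row_centered = x - x.mean(axis=1, keepdims=True)

       # Also divide by the row standard deviation (e.g. to equate "contrast")
       row_scaled = row_centered / x.std(axis=1, keepdims=True)

       # Divide by the Euclidean length of each row (projects cases onto a hypersphere)
       unit_length = x / np.linalg.norm(x, axis=1, keepdims=True)

       # Note that both rows of unit_length are identical: the over-all size
       # information distinguishing the two cases has been discarded.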

    ------------------------------------------------------------------------

    Subject: Should I nonlinearly transform the data?

    Most importantly, nonlinear transformations of the targets matter with noisy data because of their effect on the error function. Many commonly used error functions are functions solely of the difference abs(target-output). Nonlinear transformations (unlike linear transformations) change the relative sizes of these differences. With most error functions, the net will expend more effort, so to speak, trying to learn target values for which abs(target-output) is large.

    For example, suppose you are trying to predict the price of a stock. If the price of the stock is 10 (in whatever currency unit) and the output of the net is 5 or 15, yielding a difference of 5, that is a huge error. If the price of the stock is 1000 and the output of the net is 995 or 1005, yielding the same difference of 5, that is a tiny error. You don't want the net to treat those two differences as equally important. By taking logarithms, you are effectively measuring errors in terms of ratios rather than differences, since a difference between two logs corresponds to the ratio of the original values. This has approximately the same effect as looking at percentage differences, abs(target-output)/target or abs(target-output)/output, rather than simple differences.
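
    As a sketch of this idea (made-up numbers, not a recipe for any particular application), you can train on the logarithms of the targets and exponentiate the network outputs afterwards:

       import numpy as np

       prices = np.array([10.0, 1000.0])        # raw target values
       log_targets = np.log(prices)             # train the network on these instead

       net_outputs = np.array([2.30, 6.91])     # hypothetical network outputs on the log scale
       predicted_prices = np.exp(net_outputs)   # back-transform to the original price scale

       # On the log scale, an error of log(15)-log(10) is far larger than
       # log(1005)-log(1000), so training concentrates on relative (percentage) errors.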

    Less importantly, smooth functions are usually easier to learn than rough functions. Generalization is also usually better for smooth functions. So nonlinear transformations (of either inputs or targets) that make the input-output function smoother are usually beneficial. For classification problems, you want the class boundaries to be smooth. When there are only a few inputs, it is often possible to transform the data to a linear relationship, in which case you can use a linear model instead of a more complex neural net, and many things (such as estimating generalization error and error bars) will become much simpler. A variety of NN architectures (RBF networks, B-spline networks, etc.) amount to using many nonlinear transformations, possibly involving multiple variables simultaneously, to try to make the input-output function approximately linear (Ripley 1996, chapter 4). There are particular applications, such as signal and image processing, in which very elaborate transformations are useful (Masters 1994).

    It is usually advisable to choose an error function appropriate for the distribution of noise in your target variables (McCullagh and Nelder 1989). But if your software does not provide a sufficient variety of error functions, then you may need to transform the target so that the noise distribution conforms to whatever error function you are using. For example, if you have to use least-(mean-)squares training, you will get the best results if the noise distribution is approximately Gaussian with constant variance, since least-(mean-)squares is maximum likelihood in that case. Heavy-tailed distributions (those in which extreme values occur more often than in a Gaussian distribution, often as indicated by high kurtosis) are especially of concern, due to the loss of statistical efficiency of least-(mean-)square estimates (Huber 1981). Note that what is important is the distribution of the noise, not the distribution of the target values.

    The distribution of inputs may suggest transformations, but this is by far the least important consideration among those listed here. If an input is strongly skewed, a logarithmic, square root, or other power (between -1 and 1) transformation may be worth trying. If an input has high kurtosis but low skewness, an arctan transform can reduce the influence of extreme values:

       arctan( c * (input - mean) / stand. dev. )
    where c is a constant that controls how far the extreme values are brought in towards the mean. Arctan usually works better than tanh, which squashes the extreme values too much. Using robust estimates of location and scale (Iglewicz 1983) instead of the mean and standard deviation will work even better for pathological distributions.
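
    A sketch of this transformation (the constant c is a hypothetical, hand-picked value; robust estimates of location and scale could replace the mean and standard deviation):

       import numpy as np

       x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0, 50.0])   # an input with one extreme value
       c = 0.5                                            # controls how hard the tails are pulled in
       z = np.arctan(c * (x - x.mean()) / x.std())
       # The extreme value is now bounded by pi/2 instead of dominating the scale.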

    References:

    Atkinson, A.C. (1985) Plots, Transformations and Regression, Oxford: Clarendon Press.

    Carrol, R.J. and Ruppert, D. (1988) Transformation and Weighting in Regression, London: Chapman and Hall.

    Huber, P.J. (1981), Robust Statistics, NY: Wiley.

    Iglewicz, B. (1983), "Robust scale estimators and confidence intervals for location", in Hoaglin, D.C., Mosteller, M. and Tukey, J.W., eds., Understanding Robust and Exploratory Data Analysis, NY: Wiley.

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman and Hall.

    Masters, T. (1994), Signal and Image Processing with Neural Networks: A C++ Sourcebook, NY: Wiley.
     
     

    Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    ------------------------------------------------------------------------

    Subject: How to measure importance of inputs?

    The answer to this question is rather long and so is not included directly in the posted FAQ. See ftp://ftp.sas.com/pub/neural/importance.html.

    Also see Pierre van de Laar's bibliography at ftp://ftp.mbfys.kun.nl/snn/pub/pierre/connectionists.html, but don't believe everything you read in those papers.

    ------------------------------------------------------------------------

    Subject: What is ART?

    ART stands for "Adaptive Resonance Theory", invented by Stephen Grossberg in 1976. ART encompasses a wide variety of neural networks based explicitly on neurophysiology. ART networks are defined algorithmically in terms of detailed differential equations intended as plausible models of biological neurons. In practice, ART networks are implemented using analytical solutions or approximations to these differential equations.

    ART comes in several flavors, both supervised and unsupervised. As discussed by Moore (1988), the unsupervised ARTs are basically similar to many iterative clustering algorithms in which each case is processed by:

  • finding the "nearest" cluster seed (AKA prototype or template) to that case
  • updating that cluster seed to be "closer" to the case
    where "nearest" and "closer" can be defined in hundreds of different ways. In ART, the framework is modified slightly by introducing the concept of "resonance", so that each case is processed by:
  • finding the "nearest" cluster seed that "resonates" with the case
  • updating that cluster seed to be "closer" to the case
    "Resonance" is just a matter of being within a certain threshold of a second similarity measure. A crucial feature of ART is that if no seed resonates with the case, a new cluster is created, as in Hartigan's (1975) leader algorithm. This feature is said to solve the "stability/plasticity dilemma". A schematic sketch of this scheme is given below.

    ART has its own jargon. For example, data are called an "arbitrary sequence of input patterns". The current training case is stored in "short term memory" and cluster seeds are "long term memory". A cluster is a "maximally compressed pattern recognition code". The two stages of finding the nearest seed to the input are performed by an "Attentional Subsystem" and an "Orienting Subsystem", the latter of which performs "hypothesis testing", which simply refers to the comparison with the vigilance threshold, not to hypothesis testing in the statistical sense. "Stable learning" means that the algorithm converges. So the oft-repeated claim that ART algorithms are "capable of rapid stable learning of recognition codes in response to arbitrary sequences of input patterns" merely means that ART algorithms are clustering algorithms that converge; it does not mean, as one might naively assume, that the clusters are insensitive to the sequence in which the training patterns are presented--quite the opposite is true.

    There are various supervised ART algorithms that are named with the suffix "MAP", as in Fuzzy ARTMAP. These algorithms cluster both the inputs and targets and associate the two sets of clusters. The effect is somewhat similar to counterpropagation. The main disadvantage of the ARTMAP algorithms is that they have no mechanism to avoid overfitting and hence should not be used with noisy data.

    For more information, see the ART FAQ at http://www.wi.leidenuniv.nl/art/ and the "ART Headquarters" at Boston University, http://cns-web.bu.edu/. For a different view of ART, see Sarle, W.S. (1995), "Why Statisticians Should Not FART," ftp://ftp.sas.com/pub/neural/fart.txt.

    For C software, see the ART Gallery at http://cns-web.bu.edu/pub/laliden/WWW/nnet.frame.html

    References:

    Carpenter, G.A., Grossberg, S. (1996), "Learning, Categorization, Rule Formation, and Prediction by Fuzzy Neural Networks," in Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY: McGraw-Hill, pp. 1.3-1.45.

    Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley.

    Kasuba, T. (1993), "Simplified Fuzzy ARTMAP," AI Expert, 8, 18-25.

    Moore, B. (1988), "ART 1 and Pattern Clustering," in Touretzky, D., Hinton, G. and Sejnowski, T., eds., Proceedings of the 1988 Connectionist Models Summer School, 174-185, San Mateo, CA: Morgan Kaufmann.

    ------------------------------------------------------------------------

    Subject: What is PNN?

    PNN or "Probabilistic Neural Network" is Donald Specht's term for kernel discriminant analysis. (Kernels are also called "Parzen windows".) You can think of it as a normalized RBF network in which there is a hidden unit centered at every training case. These RBF units are called "kernels" and are usually probability density functions such as the Gaussian. The hidden-to-output weights are usually 1 or 0; for each hidden unit, a weight of 1 is used for the connection going to the output that the case belongs to, while all other connections are given weights of 0. Alternatively, you can adjust these weights for the prior probabilities of each class. So the only weights that need to be learned are the widths of the RBF units. These widths (often a single width is used) are called "smoothing parameters" or "bandwidths" and are usually chosen by cross-validation or by more esoteric methods that are not well-known in the neural net literature; gradient descent is not used.

    Specht's claim that a PNN trains 100,000 times faster than backprop is at best misleading. While they are not iterative in the same sense as backprop, kernel methods require that you estimate the kernel bandwidth, and this requires accessing the data many times. Furthermore, computing a single output value with kernel methods requires either accessing the entire training data or clever programming, and either way is much slower than computing an output with a feedforward net. And there are a variety of methods for training feedforward nets that are much faster than standard backprop. So depending on what you are doing and how you do it, PNN may be either faster or slower than a feedforward net.

    PNN is a universal approximator for smooth class-conditional densities, so it should be able to solve any smooth classification problem given enough data. The main drawback of PNN is that, like kernel methods in general, it suffers badly from the curse of dimensionality. PNN cannot ignore irrelevant inputs without major modifications to the basic algorithm. So PNN is not likely to be the top choice if you have more than 5 or 6 nonredundant inputs. For modified algorithms that deal with irrelevant inputs, see Masters (1995) and Lowe (1995).

    But if all your inputs are relevant, PNN has the very useful ability to tell you whether a test case is similar (i.e. has a high density) to any of the training data; if not, you are extrapolating and should view the output classification with skepticism. This ability is of limited use when you have irrelevant inputs, since the similarity is measured with respect to all of the inputs, not just the relevant ones.

    References:

    Hand, D.J. (1982) Kernel Discriminant Analysis, Research Studies Press.

    Lowe, D.G. (1995), "Similarity metric learning for a variable-kernel classifier," Neural Computation, 7, 72-85, http://www.cs.ubc.ca/spider/lowe/pubs.html

    McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition, Wiley.

    Masters, T. (1993). Practical Neural Network Recipes in C++, San Diego: Academic Press.

    Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0

    Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine Learning, Neural and Statistical Classification, Ellis Horwood; this book is out of print but available online at http://www.amsta.leeds.ac.uk/~charles/statlog/

    Scott, D.W. (1992) Multivariate Density Estimation, Wiley.

    Specht, D.F. (1990) "Probabilistic neural networks," Neural Networks, 3, 110-118.

    ------------------------------------------------------------------------

    Subject: What is GRNN?

    GRNN or "General Regression Neural Network" is Donald Specht's term for Nadaraya-Watson kernel regression, also reinvented in the NN literature by Schi\oler and Hartmann. (Kernels are also called "Parzen windows".) You can think of it as a normalized RBF network in which there is a hidden unit centered at every training case. These RBF units are called "kernels" and are usually probability density functions such as the Gaussian. The hidden-to-output weights are just the target values, so the output is simply a weighted average of the target values of training cases close to the given input case. The only weights that need to be learned are the widths of the RBF units. These widths (often a single width is used) are called "smoothing parameters" or "bandwidths" and are usually chosen by cross-validation or by more esoteric methods that are not well-known in the neural net literature; gradient descent is not used.

    GRNN is a universal approximator for smooth functions, so it should be able to solve any smooth function-approximation problem given enough data. The main drawback of GRNN is that, like kernel methods in general, it suffers badly from the curse of dimensionality. GRNN cannot ignore irrelevant inputs without major modifications to the basic algorithm. So GRNN is not likely to be the top choice if you have more than 5 or 6 nonredundant inputs.

    References:

    Caudill, M. (1993), "GRNN and Bear It," AI Expert, Vol. 8, No. 5 (May), 28-33.

    Haerdle, W. (1990), Applied Nonparametric Regression, Cambridge Univ. Press.

    Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0

    Nadaraya, E.A. (1964) "On estimating regression", Theory Probab. Applic. 10, 186-90.

    Schi\oler, H. and Hartmann, U. (1992) "Mapping Neural Network Derived from the Parzen Window Estimator", Neural Networks, 5, 903-909.

    Specht, D.F. (1968) "A practical technique for estimating general regression surfaces," Lockheed report LMSC 6-79-68-6, Defense Technical Information Center AD-672505.

    Specht, D.F. (1991) "A Generalized Regression Neural Network", IEEE Transactions on Neural Networks, 2, Nov. 1991, 568-576.

    Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman & Hall.

    Watson, G.S. (1964) "Smooth regression analysis", Sankhy{\=a}, Series A, 26, 359-72.

    ------------------------------------------------------------------------

    Subject: What does unsupervised learning learn?

    Unsupervised learning allegedly involves no target values. In fact, for most varieties of unsupervised learning, the targets are the same as the inputs (Sarle 1994). In other words, unsupervised learning usually performs the same task as an auto-associative network, compressing the information from the inputs (Deco and Obradovic 1996). Unsupervised learning is very useful for data visualization (Ripley 1996), although the NN literature generally ignores this application.

    Unsupervised competitive learning is used in a wide variety of fields under a wide variety of names, the most common of which is "cluster analysis" (see the Classification Society of North America's web site for more information on cluster analysis, including software, at http://www.pitt.edu/~csna/). The main form of competitive learning in the NN literature is vector quantization (VQ, also called a "Kohonen network", although Kohonen invented several other types of networks as well--see "How many kinds of Kohonen networks exist?", which provides more references on VQ). Kosko (1992) and Hecht-Nielsen (1990) review neural approaches to VQ, while the textbook by Gersho and Gray (1992) covers the area from the perspective of signal processing. In statistics, VQ has been called "principal point analysis" (Flury, 1990, 1993; Tarpey et al., 1994) but is more frequently encountered in the guise of k-means clustering.

    In VQ, each of the competitive units corresponds to a cluster center (also called a codebook vector), and the error function is the sum of squared Euclidean distances between each training case and the nearest center. Often, each training case is normalized to a Euclidean length of one, which allows distances to be simplified to inner products. The more general error function based on distances is the same error function used in k-means clustering, one of the most common types of cluster analysis (Max 1960; MacQueen 1967; Anderberg 1973; Hartigan 1975; Hartigan and Wong 1979; Linde, Buzo, and Gray 1980; Lloyd 1982). The k-means model is an approximation to the normal mixture model (McLachlan and Basford 1988) assuming that the mixture components (clusters) all have spherical covariance matrices and equal sampling probabilities. Normal mixtures have found a variety of uses in neural networks (e.g., Bishop 1995). Balakrishnan, Cooper, Jacob, and Lewis (1994) found that k-means algorithms used as normal-mixture approximations recover cluster membership more accurately than Kohonen algorithms.
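
    A minimal batch k-means sketch of the VQ error function just described (made-up data; in practice you would normally use a library routine):

       import numpy as np

       def kmeans(x, k, n_iter=50, seed=0):
           """Batch k-means: minimize the sum of squared distances to the nearest codebook vector."""
           rng = np.random.default_rng(seed)
           centers = x[rng.choice(len(x), k, replace=False)].astype(float)
           for _ in range(n_iter):
               d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
               labels = d.argmin(axis=1)                  # assign each case to its nearest center
               for j in range(k):
                   if np.any(labels == j):
                       centers[j] = x[labels == j].mean(axis=0)
           return centers, labels

       centers, labels = kmeans(np.random.default_rng(1).normal(size=(100, 2)), k=3)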

    Hebbian learning is the other most common variety of unsupervised learning (Hertz, Krogh, and Palmer 1991). Hebbian learning minimizes the same error function as an auto-associative network with a linear hidden layer, trained by least squares, and is therefore a form of dimensionality reduction. This error function is equivalent to the sum of squared distances between each training case and a linear subspace of the input space (with distances measured perpendicularly), and is minimized by the leading principal components (Pearson 1901; Hotelling 1933; Rao 1964; Jolliffe 1986; Jackson 1991; Diamantaras and Kung 1996). There are variations of Hebbian learning that explicitly produce the principal components (Hertz, Krogh, and Palmer 1991; Karhunen 1994; Deco and Obradovic 1996; Diamantaras and Kung 1996).
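
    For example (a sketch that uses the singular value decomposition rather than any particular Hebbian update rule), the subspace minimizing this error function is spanned by the leading principal components:

       import numpy as np

       x = np.random.default_rng(0).normal(size=(200, 5))
       xc = x - x.mean(axis=0)                        # center the inputs

       u, s, vt = np.linalg.svd(xc, full_matrices=False)
       components = vt[:2]                            # the two leading principal components
       projection = xc @ components.T @ components    # nearest points in the 2-D subspace
       error = np.sum((xc - projection) ** 2)         # what linear Hebbian learning implicitly minimizes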

    Perhaps the most novel form of unsupervised learning in the NN literature is Kohonen's self-organizing (feature) map (SOM, Kohonen 1995). SOMs combine competitive learning with dimensionality reduction by smoothing the clusters with respect to an a priori grid (see "How many kinds of Kohonen networks exist?" for more explanation). But Kohonen's original SOM algorithm does not optimize an "energy" function (Erwin et al., 1992; Kohonen 1995, pp. 126, 237). The SOM algorithm involves a trade-off between the accuracy of the quantization and the smoothness of the topological mapping, but there is no explicit combination of these two properties into an energy function. Hence Kohonen's SOM is not simply an information-compression method like most other unsupervised learning networks. Neither does Kohonen's SOM have a clear interpretation as a density estimation method. Convergence of Kohonen's SOM algorithm is allegedly demonstrated by Yin and Allinson (1995), but their "proof" assumes the neighborhood size becomes zero, in which case the algorithm reduces to VQ and no longer has topological ordering properties (Kohonen 1995, p. 111). The best explanation of what a Kohonen SOM learns seems to be provided by the connection between SOMs and principal curves and surfaces explained by Mulier and Cherkassky (1995) and Ritter, Martinetz, and Schulten (1992). For further explanation, see "How many kinds of Kohonen networks exist?"

    A variety of energy functions for SOMs have been proposed (e.g., Luttrell, 1994), some of which show a connection between SOMs and multidimensional scaling (Goodhill and Sejnowski 1997). There are also other approaches to SOMs that have clearer theoretical justification using mixture models with Bayesian priors or constraints (Utsugi, 1996, 1997; Bishop, Svens\'en, and Williams, 1997).

    For additional references on cluster analysis, see ftp://ftp.sas.com/pub/neural/clus_bib.txt.

    References:

    Anderberg, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press, Inc.

    Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with k-means clustering", Psychometrika, 59, 509-525.

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Bishop, C.M., Svens\'en, M., and Williams, C.K.I. (1997), "GTM: A principled alternative to the self-organizing map," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 354-360. Also see http://www.ncrg.aston.ac.uk/GTM/

    Deco, G. and Obradovic, D. (1996), An Information-Theoretic Approach to Neural Computing, NY: Springer-Verlag.

    Diamantaras, K.I., and Kung, S.Y. (1996) Principal Component Neural Networks: Theory and Applications, NY: Wiley.

    Erwin, E., Obermayer, K., and Schulten, K. (1992), "Self-organizing maps: Ordering, convergence properties and energy functions," Biological Cybernetics, 67, 47-55.

    Flury, B. (1990), "Principal points," Biometrika, 77, 33-41.

    Flury, B. (1993), "Estimation of principal points," Applied Statistics, 42, 139-151.

    Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers.

    Goodhill, G.J., and Sejnowski, T.J. (1997), "A unifying objective function for topographic mappings," Neural Computation, 9, 1291-1303.

    Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley.

    Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A k-means clustering algorithm," Applied Statistics, 28, 100-108.

    Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.

    Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley: Redwood City, California.

    Hotelling, H. (1933), "Analysis of a Complex of Statistical Variables into Principal Components," Journal of Educational Psychology, 24, 417-441, 498-520.

    Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering utilizing hybrid search strategies," Pattern Recognition, 22, 75-89.

    Jackson, J.E. (1991), A User's Guide to Principal Components, NY: Wiley.

    Jolliffe, I.T. (1986), Principal Component Analysis, Springer-Verlag.

    Karhunen, J. (1994), "Stability of Oja's PCA subspace rule," Neural Computation, 6, 739-747.

    Kohonen, T. (1995), Self-Organizing Maps, Berlin: Springer-Verlag.

    Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs, N.J.: Prentice-Hall.

    Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector quantizer design," IEEE Transactions on Communications, 28, 84-95.

    Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions on Information Theory, 28, 129-137.

    Luttrell, S.P. (1994), "A Bayesian analysis of self-organizing maps," Neural Computation, 6, 767-794.

    McLachlan, G.J. and Basford, K.E. (1988), Mixture Models, NY: Marcel Dekker, Inc.

    MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.

    Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on Information Theory, 6, 7-12.

    Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an Iterative Kernel Smoothing Process," Neural Computation, 7, 1165-1177.

    Pearson, K. (1901) "On Lines and Planes of Closest Fit to Systems of Points in Space," Phil. Mag., 2(6), 559-572.

    Rao, C.R. (1964), "The Use and Interpretation of Principal Component Analysis in Applied Research," Sankya A, 26, 329-358.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Ritter, H., Martinetz, T., and Schulten, K. (1992), Neural Computation and Self-Organizing Maps: An Introduction, Reading, MA: Addison-Wesley.

    Tarpey, T., Luning, L., and Flury, B. (1994), "Principal points and self-consistent points of elliptical distributions," Annals of Statistics, ?.

    Sarle, W.S. (1994), "Neural Networks and Statistical Models," in SAS Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550, ftp://ftp.sas.com/pub/neural/neural1.ps.

    Utsugi, A. (1996), "Topology selection for self-organizing maps," Network: Computation in Neural Systems, 7, 727-740, available on-line at http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html

    Utsugi, A. (1997), "Hyperparameter selection for self-organizing maps," Neural Computation, 9, 623-635, available on-line at http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html

    Yin, H. and Allinson, N.M. (1995), "On the Distribution and Convergence of Feature Space in Self-Organizing Maps," Neural Computation, 7, 1178-1187.

    Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector quantizer design by stochastic relaxation," IEEE Transactions on Signal Processing, 40, 310-322.

    ------------------------------------------------------------------------

    Subject: How is generalization possible?

    During learning, the outputs of a supervised neural net come to approximate the target values given the inputs in the training set. This ability may be useful in itself, but more often the purpose of using a neural net is to generalize--i.e., to have the outputs of the net approximate target values given inputs that are not in the training set. Generalization is not always possible, despite the blithe assertions of some authors. For example, Caudill and Butler, 1990, p. 8, claim that "A neural network is able to generalize", but they provide no justification for this claim, and they completely neglect the complex issues involved in getting good generalization. Anyone who reads comp.ai.neural-nets is well aware from the numerous posts pleading for help that artificial neural networks do not automatically generalize.

    There are three conditions that are typically necessary (although not sufficient) for good generalization.

    The first necessary condition is that the inputs to the network contain sufficient information pertaining to the target, so that there exists a mathematical function relating correct outputs to inputs with the desired degree of accuracy. You can't expect a network to learn a nonexistent function--neural nets are not clairvoyant! For example, if you want to forecast the price of a stock, a historical record of the stock's prices is rarely sufficient input; you need detailed information on the financial state of the company as well as general economic conditions, and to avoid nasty surprises, you should also include inputs that can accurately predict wars in the Middle East and earthquakes in Japan. Finding good inputs for a net and collecting enough training data often take far more time and effort than training the network.

    The second necessary condition is that the function you are trying to learn (that relates inputs to correct outputs) be, in some sense, smooth. In other words, a small change in the inputs should, most of the time, produce a small change in the outputs. For continuous inputs and targets, smoothness of the function implies continuity and restrictions on the first derivative over most of the input space. Some neural nets can learn discontinuities as long as the function consists of a finite number of continuous pieces. Very nonsmooth functions such as those produced by pseudo-random number generators and encryption algorithms cannot be generalized by neural nets. Often a nonlinear transformation of the input space can increase the smoothness of the function and improve generalization.

    For classification, if you do not need to estimate posterior probabilities, then smoothness is not theoretically necessary. In particular, feedforward networks with one hidden layer trained by minimizing the error rate (a very tedious training method) are universally consistent classifiers if the number of hidden units grows at a suitable rate relative to the number of training cases (Devroye, Gy\"orfi, and Lugosi, 1996). However, you are likely to get better generalization with realistic sample sizes if the classification boundaries are smoother.

    For Boolean functions, the concept of smoothness is more elusive. It seems intuitively clear that a Boolean network with a small number of hidden units and small weights will compute a "smoother" input-output function than a network with many hidden units and large weights. If you know a good reference characterizing Boolean functions for which good generalization is possible, please inform the FAQ maintainer (saswss@unx.sas.com).

    The third necessary condition for good generalization is that the training cases be a sufficiently large and representative subset ("sample" in statistical terminology) of the set of all cases that you want to generalize to (the "population" in statistical terminology). The importance of this condition is related to the fact that there are, loosely speaking, two different types of generalization: interpolation and extrapolation. Interpolation applies to cases that are more or less surrounded by nearby training cases; everything else is extrapolation. In particular, cases that are outside the range of the training data require extrapolation. Cases inside large "holes" in the training data may also effectively require extrapolation. Interpolation can often be done reliably, but extrapolation is notoriously unreliable. Hence it is important to have sufficient training data to avoid the need for extrapolation. Methods for selecting good training sets are discussed in numerous statistical textbooks on sample surveys and experimental design.

    Thus, for an input-output function that is smooth, if you have a test case that is close to some training cases, the correct output for the test case will be close to the correct outputs for those training cases. If you have an adequate sample for your training set, every case in the population will be close to a sufficient number of training cases. Hence, under these conditions and with proper training, a neural net will be able to generalize reliably to the population.

    If you have more information about the function, e.g. that the outputs should be linearly related to the inputs, you can often take advantage of this information by placing constraints on the network or by fitting a more specific model, such as a linear model, to improve generalization. Extrapolation is much more reliable in linear models than in flexible nonlinear models, although still not nearly as safe as interpolation. You can also use such information to choose the training cases more efficiently. For example, with a linear model, you should choose training cases at the outer limits of the input space instead of evenly distributing them throughout the input space.

    References:

    Caudill, M. and Butler, C. (1990). Naturally Intelligent Systems. MIT Press: Cambridge, Massachusetts.

    Devroye, L., Gy\"orfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer.

    Goodman, N. (1954/1983), Fact, Fiction, and Forecast, 1st/4th ed., Cambridge, MA: Harvard University Press.

    Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R. (1986), Induction: Processes of Inference, Learning, and Discovery, Cambridge, MA: The MIT Press.

    Howson, C. and Urbach, P. (1989), Scientific Reasoning: The Bayesian Approach, La Salle, IL: Open Court.

    Hume, D. (1739/1978), A Treatise of Human Nature, Selby-Bigge, L.A., and Nidditch, P.H. (eds.), Oxford: Oxford University Press.

    Plotkin, H. (1993), Darwin Machines and the Nature of Knowledge, Cambridge, MA: Harvard University Press.

    Russell, B. (1948), Human Knowledge: Its Scope and Limits, London: Routledge.

    Stone, C.J. (1977), "Consistent nonparametric regression," Annals of Statistics, 5, 595-645.

    Stone, C.J. (1982), "Optimal global rates of convergence for nonparametric regression," Annals of Statistics, 10, 1040-1053.

    Vapnik, V.N. (1995), The Nature of Statistical Learning Theory, NY: Springer.

    Wolpert, D.H. (1995a), "The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework," in Wolpert (1995b), 117-214.

    Wolpert, D.H. (ed.) (1995b), The Mathematics of Generalization: The Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning, Santa Fe Institute Studies in the Sciences of Complexity, Volume XX, Reading, MA: Addison-Wesley.

    Wolpert, D.H. (1996a), "The lack of a priori distinctions between learning algorithms," Neural Computation, 8, 1341-1390.

    Wolpert, D.H. (1996b), "The existence of a priori distinctions between learning algorithms," Neural Computation, 8, 1391-1420.

    ------------------------------------------------------------------------

    Subject: How does noise affect generalization?

    Noise in the actual data is never a good thing, since it limits the accuracy of generalization that can be achieved no matter how extensive the training set is. On the other hand, injecting artificial noise (jitter) into the inputs during training is one of several ways to improve generalization for smooth functions when you have a small training set.

    Certain assumptions about noise are necessary for theoretical results. Usually, the noise distribution is assumed to have zero mean and finite variance. The noise in different cases is usually assumed to be independent or to follow some known stochastic model, such as an autoregressive process. The more you know about the noise distribution, the more effectively you can train the network (e.g., McCullagh and Nelder 1989).

    If you have noise in the target values, the mean squared generalization error can never be less than the variance of the noise, no matter how much training data you have. But you can estimate the mean of the target values, conditional on a given set of input values, to any desired degree of accuracy by obtaining a sufficiently large and representative training set, assuming that the function you are trying to learn is one that can indeed be learned by the type of net you are using, and assuming that the complexity of the network is regulated appropriately (White 1990).

    Noise in the target values is exacerbated by overfitting (Moody 1992).

    Noise in the inputs also limits the accuracy of generalization, but in a more complicated way than does noise in the targets. In a region of the input space where the function being learned is fairly flat, input noise will have little effect. In regions where that function is steep, input noise can degrade generalization severely.

    Furthermore, if the target function is Y=f(X), but you observe noisy inputs X+D, you cannot obtain an arbitrarily accurate estimate of f(X) given X+D no matter how large a training set you use. The net will not learn f(X), but will instead learn a convolution of f(X) with the distribution of the noise D (see "What is jitter?").

    For more details, see one of the statistically-oriented references on neural nets such as:

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press, especially section 6.4.

    Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.

    Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", NIPS 4, 847-854.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    White, H. (1990), "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3, 535-550. Reprinted in White (1992b).

    White, H. (1992), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.

    ------------------------------------------------------------------------

    Subject: What is overfitting and how can I avoid it?

    The critical issue in developing a neural network is generalization: how well will the network make predictions for cases that are not in the training set? NNs, like other flexible nonlinear estimation methods such as kernel regression and smoothing splines, can suffer from either underfitting or overfitting. A network that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting. A network that is too complex may fit the noise, not just the signal, leading to overfitting. Overfitting is especially dangerous because it can easily lead to predictions that are far beyond the range of the training data with many of the common types of NNs. Overfitting can also produce wild predictions in multilayer perceptrons even with noise-free data.

    For an elementary discussion of overfitting, see Smith (1993). For a more rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992) on the bias/variance trade-off (it's not really a dilemma). We are talking statistical bias here: the difference between the average value of an estimator and the correct value. Underfitting produces excessive bias in the outputs, whereas overfitting produces excessive variance. There are graphical examples of overfitting and underfitting in Sarle (1995).

    The best way to avoid overfitting is to use lots of training data. If you have at least 30 times as many training cases as there are weights in the network, you are unlikely to suffer from much overfitting, although you may get some slight overfitting no matter how large the training set is. For noise-free data, 5 times as many training cases as weights may be sufficient. But you can't arbitrarily reduce the number of weights for fear of underfitting.

    Given a fixed amount of training data, there are at least five effective approaches to avoiding underfitting and overfitting, and hence getting good generalization:

  • Model selection
  • Jittering
  • Early stopping
  • Weight decay
  • Bayesian learning
    These approaches are discussed in more detail under subsequent questions.

    The complexity of a network is related to both the number of weights and the size of the weights. Model selection is concerned with the number of weights, and hence the number of hidden units and layers. The more weights there are, relative to the number of training cases, the more overfitting amplifies noise in the targets (Moody 1992). The other approaches listed above are concerned, directly or indirectly, with the size of the weights. Reducing the size of the weights reduces the "effective" number of weights--see Moody (1992) regarding weight decay and Weigend (1994) regarding early stopping. Bartlett (1997) obtained learning-theory results in which generalization error is related to the L_1 norm of the weights instead of the VC dimension.

    References:

    Bartlett, P.L. (1997), "For valid generalization, the size of the weights is more important than the size of the network," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.

    Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.

    Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", NIPS 4, 847-854.

    Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)

    Smith, M. (1993), Neural Networks for Statistical Modeling, NY: Van Nostrand Reinhold.

    Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342.

    ------------------------------------------------------------------------

    Subject: What is jitter? (Training with noise)

    Jitter is artificial noise deliberately added to the inputs during training. Training with jitter is a form of smoothing related to kernel regression (see "What is GRNN?"). It is also closely related to regularization methods such as weight decay and ridge regression.

    Training with jitter works because the functions that we want NNs to learn are mostly smooth. NNs can learn functions with discontinuities, but the functions must be piecewise continuous in a finite number of regions if our network is restricted to a finite number of hidden units.

    In other words, if we have two cases with similar inputs, the desired outputs will usually be similar. That means we can take any training case and generate new training cases by adding small amounts of jitter to the inputs. As long as the amount of jitter is sufficiently small, we can assume that the desired output will not change enough to be of any consequence, so we can just use the same target value. The more training cases, the merrier, so this looks like a convenient way to improve training. But too much jitter will obviously produce garbage, while too little jitter will have little effect (Koistinen and Holmstr\"om 1992).

    Consider any point in the input space, not necessarily one of the original training cases. That point could possibly arise as a jittered input as a result of jittering any of several of the original neighboring training cases. The average target value at the given input point will be a weighted average of the target values of the original training cases. For an infinite number of jittered cases, the weights will be proportional to the probability densities of the jitter distribution, located at the original training cases and evaluated at the given input point. Thus the average target values given an infinite number of jittered cases will, by definition, be the Nadaraya-Watson kernel regression estimator using the jitter density as the kernel. Hence, training with jitter is an approximation to training with the kernel regression estimator as target. Choosing the amount (variance) of jitter is equivalent to choosing the bandwidth of the kernel regression estimator (Scott 1992).

    When studying nonlinear models such as feedforward NNs, it is often helpful first to consider what happens in linear models, and then to see what difference the nonlinearity makes. So let's consider training with jitter in a linear model. Notation:

       x_ij is the value of the jth input (j=1, ..., p) for the
            ith training case (i=1, ..., n).
       X={x_ij} is an n by p matrix.
       y_i is the target value for the ith training case.
       Y={y_i} is a column vector.
    Without jitter, the least-squares weights are B = inv(X'X)X'Y, where "inv" indicates a matrix inverse and "'" indicates transposition. Note that if we replicate each training case c times, or equivalently stack c copies of the X and Y matrices on top of each other, the least-squares weights are inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before.

    With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ..., c, where z_ijk is produced by some random number generator, usually with a normal distribution with mean 0 and standard deviation s, and the z_ijk's are all independent. In place of the n by p matrix X, this gives us a big matrix, say Q, with cn rows and p columns. To compute the least-squares weights, we need Q'Q. Let's consider the jth diagonal element of Q'Q, which is

       sum_{i,k} (x_ij + z_ijk)^2 = sum_{i,k} (x_ij^2 + z_ijk^2 + 2 x_ij z_ijk)
    which is approximately, for c large,
       c (sum_i x_ij^2 + n s^2)
    which is c times the corresponding diagonal element of X'X plus ns^2. Now consider the u,vth off-diagonal element of Q'Q, which is
       sum_{i,k} (x_iu + z_iuk)(x_iv + z_ivk)
    which is approximately, for c large,
       c (sum_i x_iu x_iv)
    which is just c times the corresponding element of X'X. Thus, Q'Q equals c(X'X+ns^2I), where I is an identity matrix of appropriate size. Similar computations show that the crossproduct of Q with the target values is cX'Y. Hence the least-squares weights with jitter of variance s^2 are given by
       B(ns^2) = inv(c(X'X + ns^2 I)) cX'Y = inv(X'X + ns^2 I) X'Y
    In the statistics literature, B(ns^2) is called a ridge regression estimator with ridge value ns^2.

    If we were to add jitter to the target values Y, the cross-product X'Y would not be affected for large c for the same reason that the off-diagonal elements of X'X are not affected by jitter. Hence, adding jitter to the targets will not change the optimal weights; it will just slow down training (An 1996).

    The ordinary least squares training criterion is (Y-XB)'(Y-XB). Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B, where d is the decay rate. Weight decay can also be implemented by inventing artificial training cases. Augment the training data with p new training cases containing the matrix dI for the inputs and a zero vector for the targets. To put this in a formula, let's use A;B to indicate the matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D. Thus the augmented inputs are X;dI and the augmented targets are Y;0, where 0 indicates the zero vector of the appropriate size. The squared error for the augmented training data is:

       (Y;0-(X;dI)B)'(Y;0-(X;dI)B)
       = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
       = Y'Y - 2Y'XB + B'(X'X+d^2I)B
       = Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
       = (Y-XB)'(Y-XB)+d^2B'B
    which is the weight-decay training criterion. Thus the weight-decay estimator is:
        inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y
    which is the same as the jitter estimator B(d^2), i.e. jitter with variance d^2/n. The equivalence between the weight-decay estimator and the jitter estimator does not hold for nonlinear models unless the jitter variance is small relative to the curvature of the nonlinear function (An 1996). However, the equivalence of the two estimators for linear models suggests that they will often produce similar results even for nonlinear models. Details for nonlinear models, including classification problems, are given in An (1996).
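
    The algebra above can be checked numerically (a sketch with made-up data; the approximation improves as the number of jittered copies c grows):

       import numpy as np

       rng = np.random.default_rng(0)
       n, p, c, s = 50, 3, 2000, 0.3
       X = rng.normal(size=(n, p))
       Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

       # Ridge / weight-decay estimator with ridge value n*s^2
       B_ridge = np.linalg.solve(X.T @ X + n * s**2 * np.eye(p), X.T @ Y)

       # Least-squares weights computed from c jittered copies of the training set
       Q = np.tile(X, (c, 1)) + rng.normal(scale=s, size=(c * n, p))
       Yc = np.tile(Y, c)
       B_jitter = np.linalg.solve(Q.T @ Q, Q.T @ Yc)

       # B_jitter approximates B_ridge, and the agreement tightens as c increases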

    B(0) is obviously the ordinary least-squares estimator. It can be shown that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in other words, adding jitter causes the weights to shrink. It can also be shown that under the usual statistical assumptions, there always exists some value of ns^2 > 0 such that B(ns^2) provides better expected generalization than B(0). Unfortunately, there is no way to calculate a value of ns^2 from the training data that is guaranteed to improve generalization. There are other types of shrinkage estimators called Stein estimators that do guarantee better generalization than B(0), but I'm not aware of a nonlinear generalization of Stein estimators applicable to neural networks.

    The statistics literature describes numerous methods for choosing the ridge value. The most obvious way is to estimate the generalization error by cross-validation, generalized cross-validation, or bootstrapping, and to choose the ridge value that yields the smallest such estimate. There are also quicker methods based on empirical Bayes estimation, one of which yields the following formula, useful as a first guess:

       s_1^2 = [ p (Y - XB(0))'(Y - XB(0)) ] / [ n(n-p) B(0)'B(0) ]
    You can iterate this a few times:
       s_{l+1}^2 = [ p (Y - XB(0))'(Y - XB(0)) ] / [ n(n-p) B(s_l^2)'B(s_l^2) ]
    Note that the more training cases you have, the less noise you need.
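
    A sketch of this iteration (with made-up data; the formulas are exactly those above, so treat the result only as a first guess for the ridge value):

       import numpy as np

       rng = np.random.default_rng(1)
       n, p = 50, 3
       X = rng.normal(size=(n, p))
       Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

       def ridge(ridge_value):
           return np.linalg.solve(X.T @ X + ridge_value * np.eye(p), X.T @ Y)

       B0 = ridge(0.0)                                  # ordinary least squares
       rss = (Y - X @ B0) @ (Y - X @ B0)                # (Y-XB(0))'(Y-XB(0))
       s2 = p * rss / (n * (n - p) * (B0 @ B0))         # first guess s_1^2
       for _ in range(5):                               # iterate a few times
           B = ridge(n * s2)
           s2 = p * rss / (n * (n - p) * (B @ B))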

    References:

    An, G. (1996), "The effects of adding noise during backpropagation training on a generalization performance," Neural Computation, 8, 643-674.

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Holmstr\"om, L. and Koistinen, P. (1992) "Using additive noise in back-propagation training", IEEE Transactions on Neural Networks, 3, 24-38.

    Koistinen, P. and Holmstr\"om, L. (1992) "Kernel regression and backpropagation training with noise," NIPS4, 1033-1039.

    Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.

    Scott, D.W. (1992) Multivariate Density Estimation, Wiley.

    Vinod, H.D. and Ullah, A. (1981) Recent Advances in Regression Methods, NY: Marcel-Dekker.

    ------------------------------------------------------------------------

    Subject: What is early stopping?

    NN practitioners often use nets with many times as many parameters as training cases. E.g., Nelson and Illingworth (1991, p. 165) discuss training a network with 16,219 parameters with only 50 training cases! The method used is called "early stopping" or "stopped training" and proceeds as follows:
  • Divide the available data into training and validation sets.
  • Use a large number of hidden units.
  • Use very small random initial values.
  • Use a slow learning rate.
  • Compute the validation error rate periodically during training.
  • Stop training when the validation error rate "starts to go up".
    It is crucial to realize that the validation error is not a good estimate of the generalization error. One method for getting an unbiased estimate of the generalization error is to run the net on a third set of data, the test set, that is not used at all during the training process. For other methods, see "How can generalization error be estimated?" A minimal sketch of the stopped-training procedure above is given below.
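
    The stopped-training bookkeeping itself is simple. Here is a minimal sketch using a linear model trained by gradient descent (a toy stand-in for a network; only the validation tracking and the "go back to the best iteration" step are the point):

       import numpy as np

       rng = np.random.default_rng(0)
       X = rng.normal(size=(200, 10))
       Y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

       X_train, Y_train = X[:150], Y[:150]          # training set
       X_val, Y_val = X[150:], Y[150:]              # validation set

       w = rng.normal(scale=0.01, size=10)          # very small random initial weights
       rate = 0.001                                 # slow learning rate
       best_w, best_val = w.copy(), np.inf

       for epoch in range(2000):                    # train to convergence...
           grad = 2.0 * X_train.T @ (X_train @ w - Y_train) / len(Y_train)
           w -= rate * grad
           val_err = np.mean((X_val @ w - Y_val) ** 2)
           if val_err < best_val:                   # ...but remember the weights with the
               best_w, best_val = w.copy(), val_err #    lowest validation error
       w = best_w                                   # then go back to that iteration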

    Early stopping has several advantages:

  • It is fast.
  • It can be applied successfully to networks in which the number of weights far exceeds the sample size.
  • It requires only one major decision by the user: what proportion of validation cases to use.
    But there are several unresolved practical issues in early stopping:
  • How many cases do you assign to the training and validation sets? Rules of thumb abound, but appear to be no more than folklore. The only systematic results known to the FAQ maintainer are in Sarle (1995), which deals only with the case of a single input. Amari et al. (1995) attempts a theoretical approach but contains serious errors that completely invalidate the results, especially the incorrect assumption that the direction of approach to the optimum is distributed isotropically.
  • Do you split the data into training and validation sets randomly or by some systematic algorithm?
  • How do you tell when the validation error rate "starts to go up"? It may go up and down numerous times during training. The safest approach is to train to convergence, then go back and see which iteration had the lowest validation error. For more elaborate algorithms, see section 3.3 of Prechelt (1994).
    Statisticians tend to be skeptical of stopped training because it appears to be statistically inefficient due to the use of the split-sample technique (neither training nor validation makes use of the entire sample), and because the usual statistical theory does not apply. However, there has been recent progress addressing both of these concerns (Wang 1994).

    Early stopping is closely related to ridge regression. If the learning rate is sufficiently small, the sequence of weight vectors on each iteration will approximate the path of continuous steepest descent down the error function. Early stopping chooses a point along this path that optimizes an estimate of the generalization error computed from the validation set. Ridge regression also defines a path of weight vectors by varying the ridge value. The ridge value is often chosen by optimizing an estimate of the generalization error computed by cross-validation, generalized cross-validation, or bootstrapping (see "What are cross-validation and bootstrapping?"). There always exists a positive ridge value that will improve the expected generalization error in a linear model. A similar result has been obtained for early stopping in linear models (Wang, Venkatesh, and Judd 1994). In linear models, the ridge path lies close to, but does not coincide with, the path of continuous steepest descent; in nonlinear models, the two paths can diverge widely. The relationship is explored in more detail by Sj\"oberg and Ljung (1992).
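
    To make the recipe above concrete, here is a minimal sketch of stopped training in Python (using numpy). It trains a plain linear model by gradient descent purely for brevity; the bookkeeping is the same for any network and training algorithm: train with a small learning rate, compute the validation error periodically, and afterwards restore the weights from the iteration with the lowest validation error. The data, split proportions, learning rate, and number of iterations are arbitrary choices for illustration.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.standard_normal((100, 20))
      y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

      X_tr, y_tr = X[:60], y[:60]          # training set
      X_va, y_va = X[60:], y[60:]          # validation set

      w = 0.01 * rng.standard_normal(20)   # very small random initial weights
      lr = 0.001                           # slow learning rate
      best_err, best_w, best_iter = np.inf, w.copy(), 0

      for it in range(5000):
          grad = -2 * X_tr.T @ (y_tr - X_tr @ w) / len(y_tr)
          w -= lr * grad
          if it % 10 == 0:                 # compute validation error periodically
              va_err = np.mean((y_va - X_va @ w) ** 2)
              if va_err < best_err:
                  best_err, best_w, best_iter = va_err, w.copy(), it

      w = best_w    # "train to convergence, then go back" to the best iteration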

    References:

    Amari, S., Murata, N., Muller, K.-R., Finke, M., and Yang, H. (1995), Asymptotic Statistical Theory of Overtraining and Cross-Validation, METR 95-06, Department of Mathematical Engineering and Information Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.

    Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model selection by nonconvergent methods," Neural Networks, 6, 771-783.

    Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural Nets, Reading, MA: Addison-Wesley.

    Prechelt, L. (1994), "PROBEN1--A set of neural network benchmark problems and benchmarking rules," Technical Report 21/94, Universitat Karlsruhe, 76128 Karlsruhe, Germany, ftp://ftp.ira.uka.de/pub/papers/techreports/1994/1994-21.ps.gz.

    Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)

    Sj\"oberg, J. and Ljung, L. (1992), "Overtraining, Regularization, and Searching for Minimum in Neural Networks," Technical Report LiTH-ISY-I-1297, Department of Electrical Engineering, Linkoping University, S-581 83 Linkoping, Sweden, http://www.control.isy.liu.se .

    Wang, C. (1994), A Theory of Generalisation in Learning Machines with Neural Network Application, Ph.D. thesis, University of Pennsylvania.

    Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and Effective Machine Complexity in Learning," NIPS6, 303-310.

    Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342.

    ------------------------------------------------------------------------

    Subject: What is weight decay?

    Weight decay adds a penalty term to the error function. The usual penalty is the sum of squared weights times a decay constant. In a linear model, this form of weight decay is equivalent to ridge regression. See "What is jitter?" for more explanation of ridge regression.

    Weight decay is a subset of regularization methods. The penalty term in weight decay, by definition, penalizes large weights. Other regularization methods may involve not only the weights but various derivatives of the output function (Bishop 1995).

    The weight decay penalty term causes the weights to converge to smaller absolute values than they otherwise would. Large weights can hurt generalization in two different ways. Excessively large weights leading to hidden units can cause the output function to be too rough, possibly with near discontinuities. Excessively large weights leading to output units can cause wild outputs far beyond the range of the data if the output activation function is not bounded to the same range as the data. To put it another way, large weights can cause excessive variance of the output (Geman, Bienenstock, and Doursat 1992). According to Bartlett (1997), the size (L_1 norm) of the weights is more important than the number of weights in determining generalization.

    Other penalty terms besides the sum of squared weights are sometimes used. Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses:

              (w_i)^2
       sum -------------
        i  (w_i)^2 + c^2
    where w_i is the ith weight and c is a user-specified constant. Whereas decay using the sum of squared weights tends to shrink the large coefficients more than the small ones, weight elimination tends to shrink the small coefficients more, and is therefore more useful for suggesting subset models (pruning).
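
    The two penalty terms can be written down directly. Here is a small sketch in Python (using numpy) of each penalty and its gradient, which is simply added to the error gradient during training; the decay constant and c are user-chosen, and multiplying the weight-elimination term by a decay constant is an assumption of this sketch (the formula above shows only the sum itself).

      import numpy as np

      def weight_decay_penalty(w, decay):
          return decay * np.sum(w ** 2)            # sum of squared weights

      def weight_decay_gradient(w, decay):
          return 2 * decay * w                     # shrinks large weights most

      def weight_elimination_penalty(w, decay, c):
          return decay * np.sum(w ** 2 / (w ** 2 + c ** 2))

      def weight_elimination_gradient(w, decay, c):
          # d/dw [ w^2/(w^2+c^2) ] = 2*w*c^2 / (w^2+c^2)^2
          return decay * 2 * w * c ** 2 / (w ** 2 + c ** 2) ** 2   # shrinks small weights most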

    The generalization ability of the network can depend crucially on the decay constant, especially with small training sets. One approach to choosing the decay constant is to train several networks with different amounts of decay and estimate the generalization error for each; then choose the decay constant that minimizes the estimated generalization error. Weigend, Rumelhart, and Huberman (1991) iteratively update the decay constant during training.

    There are other important considerations for getting good results from weight decay. You must either standardize the inputs and targets, or adjust the penalty term for the standard deviations of all the inputs and targets. It is usually a good idea to omit the biases from the penalty term.

    A fundamental problem with weight decay is that different types of weights in the network will usually require different decay constants for good generalization. At the very least, you need three different decay constants for input-to-hidden, hidden-to-hidden, and hidden-to-output weights. Adjusting all these decay constants to produce the best estimated generalization error often requires vast amounts of computation.

    Fortunately, there is a superior alternative to weight decay: hierarchical Bayesian learning. Bayesian learning makes it possible to estimate many decay constants efficiently.

    References:

    Bartlett, P.L. (1997), "For valid generalization, the size of the weights is more important than the size of the network," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In: R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.), Advances in Neural Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann.

    ------------------------------------------------------------------------

    Subject: What is Bayesian Learning?

    By Radford Neal.

    Conventional training methods for multilayer perceptrons ("backprop" nets) can be interpreted in statistical terms as variations on maximum likelihood estimation. The idea is to find a single set of weights for the network that maximize the fit to the training data, perhaps modified by some sort of weight penalty to prevent overfitting.

    The Bayesian school of statistics is based on a different view of what it means to learn from data, in which probability is used to represent uncertainty about the relationship being learned (a use that is shunned in conventional--i.e., frequentist--statistics). Before we have seen any data, our prior opinions about what the true relationship might be can be expressed in a probability distribution over the network weights that define this relationship. After we look at the data (or after our program looks at the data), our revised opinions are captured by a posterior distribution over network weights. Network weights that seemed plausible before, but which don't match the data very well, will now be seen as being much less likely, while the probability for values of the weights that do fit the data well will have increased.

    Typically, the purpose of training is to make predictions for future cases in which only the inputs to the network are known. The result of conventional network training is a single set of weights that can be used to make such predictions. In contrast, the result of Bayesian training is a posterior distribution over network weights. If the inputs of the network are set to the values for some new case, the posterior distribution over network weights will give rise to a distribution over the outputs of the network, which is known as the predictive distribution for this new case. If a single-valued prediction is needed, one might use the mean of the predictive distribution, but the full predictive distribution also tells you how uncertain this prediction is.
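
    As a minimal sketch of how a posterior sample of weights is turned into predictions, suppose you already have a collection of weight vectors drawn from the posterior (how to obtain such a sample is discussed below) and a hypothetical function net(w, x) that computes the network output for weights w and inputs x. In Python (using numpy):

      import numpy as np

      def predictive(weight_samples, x_new, net):
          # run the network once for each sampled weight vector
          outputs = np.array([net(w, x_new) for w in weight_samples])
          # mean of the predictive distribution, and its spread as a
          # measure of how uncertain the prediction is
          return outputs.mean(axis=0), outputs.std(axis=0)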

    Why bother with all this? The hope is that Bayesian methods will provide solutions to such fundamental problems as:

  • How to judge the uncertainty of predictions. This can be solved by looking at the predictive distribution, as described above.
  • How to choose an appropriate network architecture (e.g., the number of hidden layers, the number of hidden units in each layer).
  • How to adapt to the characteristics of the data (e.g., the smoothness of the function, the degree to which different inputs are relevant).

    Good solutions to these problems, especially the last two, depend on using the right prior distribution, one that properly represents the uncertainty that you probably have about which inputs are relevant, how smooth the function is, how much noise there is in the observations, etc. Such carefully vague prior distributions are usually defined in a hierarchical fashion, using hyperparameters, some of which are analogous to the weight decay constants of more conventional training procedures. The use of hyperparameters is discussed by MacKay (1992a, 1992b, 1995) and Neal (1993a, 1996), who in particular use an "Automatic Relevance Determination" scheme that aims to allow many possibly-relevant inputs to be included without damaging effects.

    Selection of an appropriate network architecture is another place where prior knowledge plays a role. One approach is to use a very general architecture, with lots of hidden units, maybe in several layers or groups, controlled using hyperparameters. This approach is emphasized by Neal (1996), who argues that there is no statistical need to limit the complexity of the network architecture when using well-designed Bayesian methods. It is also possible to choose between architectures in a Bayesian fashion, using the "evidence" for an architecture, as discussed by Mackay (1992a, 1992b).

    Implementing all this is one of the biggest problems with Bayesian methods. Dealing with a distribution over weights (and perhaps hyperparameters) is not as simple as finding a single "best" value for the weights. Exact analytical methods for models as complex as neural networks are out of the question. Two approaches have been tried:

  • Find the weights/hyperparameters that are most probable, using methods similar to conventional training (with regularization), and then approximate the distribution over weights using information available at this maximum.
  • Use a Monte Carlo method to sample from the distribution over weights. The most efficient implementations of this use dynamical Monte Carlo methods whose operation resembles that of backprop with momentum.

    The first method comes in two flavours. Buntine and Weigend (1991) describe a procedure in which the hyperparameters are first integrated out analytically, and numerical methods are then used to find the most probable weights. MacKay (1992a, 1992b) instead finds the values for the hyperparameters that are most likely, integrating over the weights (using an approximation around the most probable weights, conditional on the hyperparameter values). There has been some controversy regarding the merits of these two procedures, with Wolpert (1993) claiming that analytically integrating over the hyperparameters is preferable because it is "exact". This criticism has been rebutted by MacKay (1993). It would be inappropriate to get into the details of this controversy here, but it is important to realize that the procedures based on analytical integration over the hyperparameters do not provide exact solutions to any of the problems of practical interest. The discussion of an analogous situation in a different statistical context by O'Hagan (1985) may be illuminating.

    Monte Carlo methods for Bayesian neural networks have been developed by Neal (1993a, 1996). In this approach, the posterior distribution is represented by a sample of perhaps a few dozen sets of network weights. The sample is obtained by simulating a Markov chain whose equilibrium distribution is the posterior distribution for weights and hyperparameters. This technique is known as "Markov chain Monte Carlo (MCMC)"; see Neal (1993b) for a review. The method is exact in the limit as the size of the sample and the length of time for which the Markov chain is run increase, but convergence can sometimes be slow in practice, as for any network training method.
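
    For concreteness, here is a toy MCMC sketch in Python (using numpy): random-walk Metropolis sampling of the weights of a tiny one-hidden-layer network, with the noise level and prior width held fixed rather than given hyperpriors. This is only an illustration of the idea; the dynamical (hybrid/Hamiltonian) Monte Carlo methods used by Neal are far more efficient, and all the numbers below are arbitrary.

      import numpy as np

      rng = np.random.default_rng(0)
      x = np.linspace(-3, 3, 50)[:, None]                 # toy data: y = sin(x) + noise
      y = np.sin(x).ravel() + 0.1 * rng.standard_normal(50)

      H = 5                                               # hidden units
      def net(w, x):                                      # w packs all weights and biases
          a, b, v, c = w[:H], w[H:2*H], w[2*H:3*H], w[3*H]
          return np.tanh(x * a + b) @ v + c

      def log_posterior(w, sigma=0.1, prior_sd=3.0):
          resid = y - net(w, x)
          return (-0.5 * np.sum(resid ** 2) / sigma ** 2  # Gaussian likelihood
                  - 0.5 * np.sum(w ** 2) / prior_sd ** 2) # Gaussian prior on the weights

      w = 0.1 * rng.standard_normal(3 * H + 1)
      lp = log_posterior(w)
      samples = []
      for it in range(20000):
          w_new = w + 0.05 * rng.standard_normal(w.size)  # random-walk proposal
          lp_new = log_posterior(w_new)
          if np.log(rng.uniform()) < lp_new - lp:         # Metropolis accept/reject
              w, lp = w_new, lp_new
          if it % 1000 == 999:
              samples.append(w.copy())                    # keep a small thinned sample

      x_new = np.linspace(-3, 3, 7)[:, None]
      pred_mean = np.mean([net(s, x_new) for s in samples], axis=0)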

    Work on Bayesian neural network learning has so far concentrated on multilayer perceptron networks, but Bayesian methods can in principle be applied to other network models, as long as they can be interpreted in statistical terms. For some models (e.g., RBF networks), this should be a fairly simple matter; for others (e.g., Boltzmann Machines), substantial computational problems would need to be solved.

    Software implementing Bayesian neural network models (intended for research use) is available from the home pages of David MacKay and Radford Neal.

    There are many books that discuss the general concepts of Bayesian inference, though they mostly deal with models that are simpler than neural networks. Here are some recent ones:

    Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory, New York: John Wiley.

    Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5.

    O'Hagan, A. (1994) Bayesian Inference (Volume 2B in Kendall's Advanced Theory of Statistics), ISBN 0-340-52922-9.

    Robert, C. P. (1995) The Bayesian Choice, New York: Springer-Verlag.

    The following books and papers have tutorial material on Bayesian learning as applied to neural network models:
    Bishop, C. M. (1995) Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Lee, H.K.H (1999), Model Selection and Model Averaging for Neural Networks, Doctoral dissertation, Carnegie Mellon University, Pittsburgh, USA, http://lib.stat.cmu.edu/~herbie/thesis.html

    MacKay, D. J. C. (1995) "Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks", available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/network.ps.gz.

    Mueller, P. and Insua, D.R. (1995) "Issues in Bayesian Analysis of Neural Network Models," Neural Computation, 10, 571-592, (also Institute of Statistics and Decision Sciences Working Paper 95-31), ftp://ftp.isds.duke.edu/pub/WorkingPapers/95-31.ps

    Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York: Springer-Verlag, ISBN 0-387-94724-8.

    Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Thodberg, H. H. (1996) "A review of Bayesian neural networks with an application to near infrared spectroscopy", IEEE Transactions on Neural Networks, 7, 56-72.

    Some other references:
    Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds., (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland).

    Buntine, W. L. and Weigend, A. S. (1991) "Bayesian back-propagation", Complex Systems, 5, 603-643.

    MacKay, D. J. C. (1992a) "Bayesian interpolation", Neural Computation, 4, 415-447.

    MacKay, D. J. C. (1992b) "A practical Bayesian framework for backpropagation networks," Neural Computation, 4, 448-472.

    MacKay, D. J. C. (1993) "Hyperparameters: Optimize or Integrate Out?", available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/alpha.ps.gz.

    Neal, R. M. (1993a) "Bayesian learning via stochastic dynamics", in C. L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural Information Processing Systems 5, San Mateo, California: Morgan Kaufmann, 475-482.

    Neal, R. M. (1993b) Probabilistic Inference Using Markov Chain Monte Carlo Methods, available at ftp://ftp.cs.utoronto.ca/pub/radford/review.ps.Z.

    O'Hagan, A. (1985) "Shoulders in hierarchical models", in J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (editors), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 697-710.

    Sarle, W. S. (1995) "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)

    Wolpert, D. H. (1993) "On the use of evidence in neural networks", in C. L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural Information Processing Systems 5, San Mateo, California: Morgan Kaufmann, 539-546.

    Finally, David MacKay maintains a FAQ about Bayesian methods for neural networks, at http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html .

    Comments on Bayesian learning

    By Warren Sarle.

    Bayesian purists may argue over the proper way to do a Bayesian analysis, but even the crudest Bayesian computation (maximizing over both parameters and hyperparameters) is shown by Sarle (1995) to generalize better than early stopping when learning nonlinear functions. This approach requires the use of slightly informative hyperpriors and at least twice as many training cases as weights in the network. A full Bayesian analysis by MCMC can be expected to work even better under even broader conditions. Bayesian learning works well by frequentist standards--what MacKay calls the "evidence framework" is used by frequentist statisticians under the name "empirical Bayes." Although considerable research remains to be done, Bayesian learning seems to be the most promising approach to training neural networks.

    Bayesian learning should not be confused with the "Bayes classifier." In the latter, the distribution of the inputs given the target class is assumed to be known exactly, and the prior probabilities of the classes are assumed known, so that the posterior probabilities can be computed by a (theoretically) simple application of Bayes' theorem. The Bayes classifier involves no learning--you must already know everything that needs to be known! The Bayes classifier is a gold standard that can almost never be used in real life but is useful in theoretical work and in simulation studies that compare classification methods. The term "Bayes rule" is also used to mean any classification rule that gives results identical to those of a Bayes classifier.

    Bayesian learning also should not be confused with the "naive" or "idiot's" Bayes classifier (Warner et al. 1961; Ripley, 1996), which assumes that the inputs are conditionally independent given the target class. The naive Bayes classifier is usually applied with categorical inputs, and the distribution of each input is estimated by the proportions in the training set; hence the naive Bayes classifier is a frequentist method.
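
    Since the naive Bayes classifier is so simple, a short sketch may make the description concrete. The version below, in Python, handles categorical inputs, estimates each class-conditional distribution by training-set proportions, and adds add-one smoothing so that input values unseen for a class do not get zero probability (with the plain proportions described above they would):

      import numpy as np
      from collections import Counter, defaultdict

      def fit_naive_bayes(X, y):
          """X: list of tuples of categorical input values; y: list of class labels."""
          n, n_inputs = len(y), len(X[0])
          class_count = Counter(y)
          priors = {c: class_count[c] / n for c in class_count}
          counts = defaultdict(Counter)   # counts[(c, j)][v] = class-c cases with value v for input j
          for xi, c in zip(X, y):
              for j, v in enumerate(xi):
                  counts[(c, j)][v] += 1
          values = [set(xi[j] for xi in X) for j in range(n_inputs)]   # observed values per input
          return priors, counts, class_count, values

      def predict_naive_bayes(x_new, priors, counts, class_count, values):
          best_c, best_score = None, -np.inf
          for c, prior in priors.items():
              score = np.log(prior)
              for j, v in enumerate(x_new):
                  # smoothed conditional proportion of value v for input j in class c
                  score += np.log((counts[(c, j)][v] + 1) / (class_count[c] + len(values[j])))
              if score > best_score:
                  best_c, best_score = c, score
          return best_c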

    The term "Bayesian network" often refers not to a neural network but to a belief network (also called a causal net, influence diagram, constraint network, qualitative Markov network, or gallery). Belief networks are more closely related to expert systems than to neural networks, and do not necessarily involve learning (Pearl, 1988; Ripley, 1996). Here are some URLs on Bayesian belief networks:

  • http://bayes.stat.washington.edu/almond/belief.html
  • http://www.cs.orst.edu/~dambrosi/bayesian/frame.html
  • http://www2.sis.pitt.edu/~genie
  • http://www.research.microsoft.com/dtg/msbn

    References for comments:
    Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.

    Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Warner, H.R., Toronto, A.F., Veasy, L.R., and Stephenson, R. (1961), "A mathematical model for medical diagnosis--application to congenital heart disease," J. of the American Medical Association, 177, 177-184.

    ------------------------------------------------------------------------

    Subject: How many hidden layers should I use?

    You may not need any hidden layers at all. Linear and generalized linear models are useful in a wide variety of applications (McCullagh and Nelder 1989). And even if the function you want to learn is mildly nonlinear, you may get better generalization with a simple linear model than with a complicated nonlinear model if there is too little data or too much noise to estimate the nonlinearities accurately.

    In MLPs with step/threshold/Heaviside activation functions, you need two hidden layers for full generality (Sontag 1992). For further discussion, see Bishop (1995, 121-126).

    In MLPs with any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the "universal approximation" property (e.g., Hornik, Stinchcombe and White 1989; Hornik 1993; for more references, see Bishop 1995, 130, and Ripley, 1996, 173-180). But there is no theory yet to tell you how many hidden units are needed to approximate any given function.

    If you have only one input, there seems to be no advantage to using more than one hidden layer. But things get much more complicated when there are two or more inputs. To illustrate, examples with two inputs and one output will be used so that the results can be shown graphically. In each example there are 441 training cases on a regular 21-by-21 grid. The test sets have 1681 cases on a regular 41-by-41 grid over the same domain as the training set. If you are reading the HTML version of this document via a web browser, you can see surface plots based on the test set by clicking on the file names mentioned in the following text. Each plot is a gif file, approximately 9K in size.

    Consider a target function of two inputs, consisting of a Gaussian hill in the middle of a plane (hill.gif). An MLP with an identity output activation function can easily fit the hill by surrounding it with a few sigmoid (logistic, tanh, arctan, etc.) hidden units, but there will be spurious ridges and valleys where the plane should be flat (h_mlp_6.gif). It takes dozens of hidden units to flatten out the plane accurately (h_mlp_30.gif).
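
    If you want to reproduce data of this kind, the grids are easy to generate. The sketch below (Python, using numpy) builds a 21-by-21 training grid and a 41-by-41 test grid over the same domain, with a Gaussian hill as the target; the domain [-4,4] by [-4,4] and the width of the hill are assumptions for illustration, since the FAQ does not state the exact values used for its plots.

      import numpy as np

      def grid_data(n_per_side, lo=-4.0, hi=4.0):
          g = np.linspace(lo, hi, n_per_side)
          x1, x2 = np.meshgrid(g, g)
          x = np.column_stack([x1.ravel(), x2.ravel()])
          y = np.exp(-(x[:, 0] ** 2 + x[:, 1] ** 2) / 2)   # Gaussian hill on a flat plane
          return x, y

      x_train, y_train = grid_data(21)   # 441 training cases
      x_test,  y_test  = grid_data(41)   # 1681 test cases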

    Now suppose you use a logistic output activation function. As the input to a logistic function goes to negative infinity, the output approaches zero. The plane in the Gaussian target function also has a value of zero. If the weights and bias for the output layer yield large negative values outside the base of the hill, the logistic function will flatten out any spurious ridges and valleys. So fitting the flat part of the target function is easy (h_mlpt_3_unsq.gif and h_mlpt_3.gif). But the logistic function also tends to lower the top of the hill.

    If instead of a rounded hill, the target function was a mesa with a large, flat top with a value of one, the logistic output activation function would be able to smooth out the top of the mesa just like it smooths out the plane below. Target functions like this, with large flat areas with values of either zero or one, are just what you have in many noise-free classification problems. In such cases, a single hidden layer is likely to work well.

    When using a logistic output activation function, it is common practice to scale the target values to a range of .1 to .9. Such scaling is bad in a noise-free classification problem, because it prevents the logistic function from smoothing out the flat areas (h_mlpt1-9_3.gif).

    For the Gaussian target function, [.1,.9] scaling would make it easier to fit the top of the hill, but would reintroduce undulations in the plane. It would be better for the Gaussian target function to scale the target values to a range of 0 to .9. But for a more realistic and complicated target function, how would you know the best way to scale the target values?

    By introducing a second hidden layer with one sigmoid activation function and returning to an identity output activation function, you can let the net figure out the best scaling (h_mlp1_3.gif). Actually, the bias and weight for the output layer scale the output rather than the target values, and you can use whatever range of target values is convenient.

    For more complicated target functions, especially those with several hills or valleys, it is useful to have several units in the second hidden layer. Each unit in the second hidden layer enables the net to fit a separate hill or valley. So an MLP with two hidden layers can often yield an accurate approximation with fewer weights than an MLP with one hidden layer (Chester 1990).
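
    The architecture being discussed is easy to write down. Here is a sketch of the forward pass in Python (using numpy): tanh units in both hidden layers and an identity output, so the unit or units in the second hidden layer rescale and combine the bumps formed by the first hidden layer. The weight shapes are the only assumptions.

      import numpy as np

      def mlp_two_hidden(x, W1, b1, W2, b2, w_out, b_out):
          """x: (n_cases, n_inputs); W1: (n_inputs, h1); W2: (h1, h2); w_out: (h2,)."""
          h1 = np.tanh(x @ W1 + b1)      # first hidden layer
          h2 = np.tanh(h1 @ W2 + b2)     # second hidden layer (possibly a single unit)
          return h2 @ w_out + b_out      # identity output activation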

    To illustrate the use of multiple units in the second hidden layer, the next example resembles a landscape with a Gaussian hill and a Gaussian valley, both elliptical (hillanvale.gif). The table below gives the RMSE (root mean squared error) for the test set with various architectures. If you are reading the HTML version of this document via a web browser, click on any number in the table to see a surface plot of the corresponding network output.

    The MLP networks in the table have one or two hidden layers with a tanh activation function. The output activation function is the identity. Using a squashing function on the output layer is of no benefit for this function, since the only flat area in the function has a target value near the middle of the target range.

              Hill and Valley Data: RMSE for the Test Set
                  (Number of weights in parentheses)
                             MLP Networks
    
    HUs in                  HUs in Second Layer
    First  ----------------------------------------------------------
    Layer    0           1           2           3           4
     1     0.204(  5)  0.204(  7)  0.189( 10)  0.187( 13)  0.185( 16)
     2     0.183(  9)  0.163( 11)  0.147( 15)  0.094( 19)  0.096( 23)
     3     0.159( 13)  0.095( 15)  0.054( 20)  0.033( 25)  0.045( 30)
     4     0.137( 17)  0.093( 19)  0.009( 25)  0.021( 31)  0.016( 37)
     5     0.121( 21)  0.092( 23)              0.010( 37)  0.011( 44)
     6     0.100( 25)  0.092( 27)              0.007( 43)  0.005( 51)
     7     0.086( 29)  0.077( 31)
     8     0.079( 33)  0.062( 35)
     9     0.072( 37)  0.055( 39)
    10     0.059( 41)  0.047( 43)
    12     0.047( 49)  0.042( 51)
    15     0.039( 61)  0.032( 63)
    20     0.025( 81)  0.018( 83)  
    25     0.021(101)  0.016(103)  
    30     0.018(121)  0.015(123)  
    40     0.012(161)  0.015(163)  
    50     0.008(201)  0.014(203)

    For an MLP with only one hidden layer (column 0), Gaussian hills and valleys require a large number of hidden units to approximate well. When there is one unit in the second hidden layer, the network can represent one hill or valley easily, which is what happens with three to six units in the first hidden layer. But having only one unit in the second hidden layer is of little benefit for learning two hills or valleys. Using two units in the second hidden layer enables the network to approximate two hills or valleys easily; in this example, only four units are required in the first hidden layer to get an excellent fit. Each additional unit in the second hidden layer enables the network to learn another hill or valley with a relatively small number of units in the first hidden layer, as explained by Chester (1990). In this example, having three or four units in the second hidden layer helps little, and actually produces a worse approximation when there are four units in the first hidden layer due to problems with local minima.

    Unfortunately, using two hidden layers exacerbates the problem of local minima, and it is important to use lots of random initializations or other methods for global optimization. Local minima with two hidden layers can have extreme spikes or blades even when the number of weights is much smaller than the number of training cases. One of the few advantages of standard backprop is that it is so slow that spikes and blades will not become very sharp for practical training times.

    More than two hidden layers can be useful in certain architectures such as cascade correlation (Fahlman and Lebiere 1990) and in special applications, such as the two-spirals problem (Lang and Witbrock 1988) and ZIP code recognition (Le Cun et al. 1989).

    RBF networks are most often used with a single hidden layer. But an extra, linear hidden layer before the radial hidden layer enables the network to ignore irrelevant inputs (see "How do MLPs compare with RBFs?"). The linear hidden layer allows the RBFs to take elliptical, rather than radial (circular), shapes in the space of the inputs. Hence the linear layer gives you an elliptical basis function (EBF) network. In the hill and valley example, an ORBFUN network requires nine hidden units (37 weights) to get the test RMSE below .01, but by adding a linear hidden layer, you can get an essentially perfect fit with three linear units followed by two radial units (20 weights).

    References:

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    Chester, D.L. (1990), "Why Two Hidden Layers are Better than One," IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268.

    Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation Learning Architecture," NIPS2, 524-532, ftp://archive.cis.ohio-state.edu/pub/neuroprose/fahlman.cascor-tr.ps.Z.

    Hornik, K., Stinchcombe, M. and White, H. (1989), "Multilayer feedforward networks are universal approximators," Neural Networks, 2, 359-366.

    Hornik, K. (1993), "Some new results on neural network approximation," Neural Networks, 6, 1069-1072.

    Lang, K.J. and Witbrock, M.J. (1988), "Learning to tell two spirals apart," in Touretzky, D., Hinton, G., and Sejnowski, T., eds., Proceedings of the 1988 Connectionist Models Summer School, San Mateo, CA: Morgan Kaufmann.

    Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to handwritten ZIP code recognition", Neural Computation, 1, 541-551.

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Sontag, E.D. (1992), "Feedback stabilization using two-hidden-layer nets", IEEE Transactions on Neural Networks, 3, 981-990.

    ------------------------------------------------------------------------

    Subject: How many hidden units should I use?

    The best number of hidden units depends in a complex way on:
  • the numbers of input and output units
  • the number of training cases
  • the amount of noise in the targets
  • the complexity of the function or classification to be learned
  • the architecture
  • the type of hidden unit activation function
  • the training algorithm
  • regularization

    In most situations, there is no way to determine the best number of hidden units without training several networks and estimating the generalization error of each. If you have too few hidden units, you will get high training error and high generalization error due to underfitting and high statistical bias. If you have too many hidden units, you may get low training error but still have high generalization error due to overfitting and high variance. Geman, Bienenstock, and Doursat (1992) discuss how the number of hidden units affects the bias/variance trade-off.

    Some books and articles offer "rules of thumb" for choosing an architecture; for example:

  • "A rule of thumb is for the size of this [hidden] layer to be somewhere between the input layer size ... and the output layer size ..." (Blum, 1992, p. 60).
  • "To calculate the number of hidden nodes we use a general rule of: (Number of inputs + outputs) * (2/3)" (from the FAQ for a commercial neural network software company). 
  • "you will never require more than twice the number of hidden units as you have inputs" in an MLP with one hidden layer (Swingler, 1996, p. 53). See the section in Part 4 of the FAQ on The Worst books for the source of this myth.)
  • "How large should the hidden layer be? One rule of thumb is that it should never be more than twice as large as the input layer." (Berry and Linoff, 1997, p. 323).
  • "Typically, we specify as many hidden nodes as dimensions [principal components] needed to capture 70-90% of the variance of the input data set." (Boger and Guterman, 1997)

    These rules of thumb are nonsense because they ignore the number of training cases, the amount of noise in the targets, and the complexity of the function. Even if you restrict consideration to minimizing training error on data with lots of training cases and no noise, it is easy to construct counterexamples that disprove these rules of thumb. For example:
  • There are 100 Boolean inputs and 100 Boolean targets. Each target is a conjunction of some subset of inputs. No hidden units are needed.
  • There are two continuous inputs X and Y which take values uniformly distributed on a square [0,8] by [0,8]. Think of the input space as a chessboard, and number the squares 1 to 64. The categorical target variable C is the square number, so there are 64 output units. For example, you could generate the data as follows (this is the SAS programming language, but it should be easy to translate into any other language):
    data chess;
       step = 1/4;
       do x = step/2 to 8-step/2 by step;
          do y = step/2 to 8-step/2 by step;
             c = 8*floor(x) + floor(y) + 1;
             output;
          end;
       end;
    run;
    No hidden units are needed.
  • The classic two-spirals problem has two continuous inputs and a Boolean classification target. The data can be generated as follows:
    data spirals;
       pi = arcos(-1);
       do i = 0 to 96;
          angle = i*pi/16.0;
          radius = 6.5*(104-i)/104;
          x = radius*cos(angle);
          y = radius*sin(angle);
          c = 1;
          output;
          x = -x;
          y = -y;
          c = 0;
          output;
       end;
    run;
    With one hidden layer, about 50 tanh hidden units are needed. Many NN programs may actually need closer to 100 hidden units to get zero training error.
  • There is one continuous input X that takes values on [0,100]. There is one continuous target Y = sin(X). Getting a good approximation to Y requires about 20 to 25 tanh hidden units. Of course, 1 sine hidden unit would do the job.

    Some rules of thumb relate the total number of trainable weights in the network to the number of training cases. A typical recommendation is that the number of weights should be no more than 1/30 of the number of training cases. Such rules are only concerned with overfitting and are at best crude approximations. Also, these rules do not apply when regularization is used. It is true that without regularization, if the number of training cases is much larger (but no one knows exactly how much larger) than the number of weights, you are unlikely to get overfitting, but you may suffer from underfitting. For a noise-free quantitative target variable, twice as many training cases as weights may be more than enough to avoid overfitting. For a very noisy categorical target variable, 30 times as many training cases as weights may not be enough to avoid overfitting.

    An intelligent choice of the number of hidden units depends on whether you are using early stopping or some other form of regularization. If not, you must simply try many networks with different numbers of hidden units, estimate the generalization error for each one, and choose the network with the minimum estimated generalization error.

    Using conventional optimization algorithms (see "What are conjugate gradients, Levenberg-Marquardt, etc.?"), there is little point in trying a network with more weights than training cases, since such a large network is likely to overfit. But Lawrence, Giles, and Tsoi (1996) have shown that standard online backprop can have considerable difficulty reducing training error to a level near the globally optimal value, hence using "oversize" networks can reduce both training error and generalization error.

    If you are using early stopping, it is essential to use lots of hidden units to avoid bad local optima (Sarle 1995). There seems to be no upper limit on the number of hidden units, other than that imposed by computer time and memory requirements. Weigend (1994) makes this assertion, but provides only one example as evidence. Tetko, Livingstone, and Luik (1995) provide simulation studies that are more convincing. Similar results were obtained in conjunction with the simulations in Sarle (1995), but those results are not reported in the paper for lack of space. On the other hand, there seems to be no advantage to using more hidden units than you have training cases, since bad local minima do not occur with so many hidden units.

    If you are using weight decay or Bayesian estimation, you can also use lots of hidden units (Neal 1995). However, it is not strictly necessary to do so, because other methods are available to avoid local minima, such as multiple random starts and simulated annealing (such methods are not safe to use with early stopping). You can use one network with lots of hidden units, or you can try different networks with different numbers of hidden units, and choose on the basis of estimated generalization error. With weight decay or MAP Bayesian estimation, it is prudent to keep the number of weights less than half the number of training cases.

    Bear in mind that with two or more inputs, an MLP with one hidden layer containing only a few units can fit only a limited variety of target functions. Even simple, smooth surfaces such as a Gaussian bump in two dimensions may require 20 to 50 hidden units for a close approximation. Networks with a smaller number of hidden units often produce spurious ridges and valleys in the output surface (see Chester 1990 and "How do MLPs compare with RBFs?"). Training a network with 20 hidden units will typically require anywhere from 150 to 2500 training cases if you do not use early stopping or regularization. Hence, if you have a smaller training set than that, it is usually advisable to use early stopping or regularization rather than to restrict the net to a small number of hidden units.

    Ordinary RBF networks containing only a few hidden units also produce peculiar, bumpy output functions. Normalized RBF networks are better at approximating simple smooth surfaces with a small number of hidden units (see "How do MLPs compare with RBFs?").

    There are various theoretical results on how fast approximation error decreases as the number of hidden units increases, but the conclusions are quite sensitive to the assumptions regarding the function you are trying to approximate. See p. 178 in Ripley (1996) for a summary. According to a well-known result by Barron (1993), in a network with I inputs and H units in a single hidden layer, the root integrated squared error (RISE) will decrease at least as fast as H^(-1/2) under some quite complicated smoothness assumptions. Ripley cites another paper by DeVore et al. (1989) that says if the function has only R bounded derivatives, RISE may decrease as slowly as H^(-R/I). For some examples with 1 to 4 hidden layers, see "How many hidden layers should I use?" and "How do MLPs compare with RBFs?"

    For learning a finite training set exactly, bounds for the number of hidden units required are provided by Elisseeff and Paugam-Moisy (1997).

    References:

    Barron, A.R. (1993), "Universal approximation bounds for superpositions of a sigmoid function," IEEE Transactions on Information Theory, 39, 930-945.

    Berry, M.J.A., and Linoff, G. (1997), Data Mining Techniques, NY: John Wiley & Sons.

    Blum, A. (1992), Neural Networks in C++, NY: Wiley.

    Boger, Z., and Guterman, H. (1997), "Knowledge extraction from artificial neural network models," IEEE Systems, Man, and Cybernetics Conference, Orlando, FL.

    Chester, D.L. (1990), "Why Two Hidden Layers are Better than One," IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268.

    DeVore, R.A., Howard, R., and Micchelli, C.A. (1989), "Optimal nonlinear approximation," Manuscripta Mathematica, 63, 469-478.

    Elisseeff, A., and Paugam-Moisy, H. (1997), "Size of multilayer networks for exact learning: analytic approach," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp.162-168.

    Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.

    Lawrence, S., Giles, C.L., and Tsoi, A.C. (1996), "What size neural network gives optimal generalization? Convergence properties of backpropagation," Technical Report UMIACS-TR-96-22 and CS-TR-3617, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, http://www.neci.nj.nec.com/homepages/lawrence/papers/minima-tr96/minima-tr96.html

    Neal, R.M. (1995), Bayesian Learning for Neural Networks, Ph.D. thesis, University of Toronto, ftp://ftp.cs.toronto.edu/pub/radford/thesis.ps.Z.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)

    Swingler, K. (1996), Applying Neural Networks: A Practical Guide, London: Academic Press.

    Tetko, I.V., Livingstone, D.J., and Luik, A.I. (1995), "Neural Network Studies. 1. Comparison of Overfitting and Overtraining," J. Chem. Info. Comp. Sci., 35, 826-833.

    Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342.

    ------------------------------------------------------------------------

    Subject: How can generalization error be estimated?

    There are many methods for estimating generalization error.
    Single-sample statistics: AIC, SBC, FPE, Mallows' C_p, etc.
    In linear models, statistical theory provides several simple estimators of the generalization error under various sampling assumptions (Darlington 1968; Efron and Tibshirani 1993; Miller 1990). These estimators adjust the training error for the number of weights being estimated, and in some cases for the noise variance if that is known.


    These statistics can also be used as crude estimates of the generalization error in nonlinear models when you have a "large" training set. Correcting these statistics for nonlinearity requires substantially more computation (Moody 1992), and the theory does not always hold for neural networks due to violations of the regularity conditions.

    Among the simple generalization estimators that do not require the noise variance to be known, Schwarz's Bayesian Criterion (known as both SBC and BIC; Schwarz 1978; Judge et al. 1980; Raftery 1995) often works well for NNs (Sarle 1995). AIC and FPE tend to overfit with NNs. Rissanen's Minimum Description Length principle (MDL; Rissanen 1978, 1987) is closely related to SBC. Several articles on SBC/BIC are available at the University of Washington's web site at http://www.stat.washington.edu/tech.reports
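
    For a model fit by least squares with Gaussian noise, common forms of these criteria are simple functions of the training sum of squared errors sse, the number of training cases n, and the number of estimated weights p (additive constants are dropped, and smaller values are better). This is only a sketch of the usual textbook formulas, in Python, not a recipe guaranteed to be appropriate for neural networks:

      import numpy as np

      def aic(sse, n, p):
          return n * np.log(sse / n) + 2 * p

      def sbc(sse, n, p):                  # Schwarz's Bayesian Criterion, also called BIC
          return n * np.log(sse / n) + p * np.log(n)

      def fpe(sse, n, p):                  # Akaike's final prediction error
          return (sse / n) * (n + p) / (n - p)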

    For classification problems, the formulas are not as simple as for regression with normal noise. See Efron (1986) regarding logistic regression.

    Split-sample or hold-out validation.
    The most commonly used method for estimating generalization error in neural networks is to reserve part of the data as a "test" set, which must not be used in any way during training. The test set must be a representative sample of the cases that you want to generalize to. After training, run the network on the test set, and the error on the test set provides an unbiased estimate of the generalization error, provided that the test set was chosen randomly. The disadvantage of split-sample validation is that it reduces the amount of data available for both training and validation. See Weiss and Kulikowski (1991).
    Cross-validation (e.g., leave one out).
    Cross-validation is an improvement on split-sample validation that allows you to use all of the data for training. The disadvantage of cross-validation is that you have to retrain the net many times. See "What are cross-validation and bootstrapping?".
    Bootstrapping.
    Bootstrapping is an improvement on cross-validation that often provides better estimates of generalization error at the cost of even more computing time. See "What are cross-validation and bootstrapping?".

    If you use any of the above methods to choose which of several different networks to use for prediction purposes, the estimate of the generalization error of the best network will be optimistic. For example, if you train several networks using one data set, and use a second (validation set) data set to decide which network is best, you must use a third (test set) data set to obtain an unbiased estimate of the generalization error of the chosen network. Hjorth (1994) explains how this principle extends to cross-validation and bootstrapping.

    References:

    Darlington, R.B. (1968), "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69, 161-182.

    Efron, B. (1986), "How biased is the apparent error rate of a prediction rule?" J. of the American Statistical Association, 81, 461-470.

    Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.

    Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.

    Miller, A.J. (1990), Subset Selection in Regression, London: Chapman & Hall.

    Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", NIPS 4, 847-854.

    Raftery, A.E. (1995), "Bayesian Model Selection in Social Research," in Marsden, P.V. (ed.), Sociological Methodology 1995, Cambridge, MA: Blackwell, ftp://ftp.stat.washington.edu/pub/tech.reports/bic.ps.z or http://www.stat.washington.edu/tech.reports/bic.ps

    Rissanen, J. (1978), "Modelling by shortest data description," Automatica, 14, 465-471.

    Rissanen, J. (1987), "Stochastic complexity" (with discussion), J. of the Royal Statistical Society, Series B, 49, 223-239.

    Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)

    Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.

    ------------------------------------------------------------------------

    Subject: What are cross-validation and bootstrapping?

    Cross-validation and bootstrapping are both methods for estimating generalization error based on "resampling" (Weiss and Kulikowski 1991; Efron and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shao and Tu 1995). The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.

    Cross-validation

    In k-fold cross-validation, you divide the data into k subsets of (approximately) equal size. You train the net k times, each time leaving out one of the subsets from training, but using only the omitted subset to compute whatever error criterion interests you. If k equals the sample size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases.
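
    A minimal k-fold cross-validation sketch in Python (using numpy) looks like this; fit and error are placeholders for whatever training routine and error criterion you are using, passed in as functions:

      import numpy as np

      def k_fold_cv(X, y, k, fit, error, seed=None):
          rng = np.random.default_rng(seed)
          folds = np.array_split(rng.permutation(len(y)), k)
          errs = []
          for i in range(k):
              test = folds[i]                                    # the omitted subset
              train = np.concatenate([folds[j] for j in range(k) if j != i])
              model = fit(X[train], y[train])                    # train on the rest
              errs.append(error(model, X[test], y[test]))        # error on the omitted subset
          return np.mean(errs)

      # e.g., with a least-squares linear model:
      #   fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
      #   error = lambda w, X, y: np.mean((y - X @ w) ** 2)
      #   cv_err = k_fold_cv(X, y, 10, fit, error)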

    Note that cross-validation is quite different from the "split-sample" or "hold-out" method that is commonly used for early stopping in NNs. In the split-sample method, only a single subset (the validation set) is used to estimate the error function, instead of k different subsets; i.e., there is no "crossing". While various people have suggested that cross-validation be applied to early stopping, the proper way of doing so is not obvious.

    The distinction between cross-validation and split-sample validation is extremely important because cross-validation is markedly superior for small data sets; this fact is demonstrated dramatically by Goutte (1997) in a reply to Zhu and Rohwer (1996). For an insightful discussion of the limitations of cross-validatory choice among several learning methods, see Stone (1977).

    Jackknifing

    Leave-one-out cross-validation is also easily confused with jackknifing. Both involve omitting each training case in turn and retraining the network on the remaining subset. But cross-validation is used to estimate generalization error, while the jackknife is used to estimate the bias of a statistic. In the jackknife, you compute some statistic of interest in each subset of the data. The average of these subset statistics is compared with the corresponding statistic computed from the entire sample in order to estimate the bias of the latter. You can also get a jackknife estimate of the standard error of a statistic. Jackknifing can be used to estimate the bias of the training error and hence to estimate the generalization error, but this process is more complicated than leave-one-out cross-validation (Efron, 1982; Ripley, 1996, p. 73).
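
    To make the contrast concrete, here is a small sketch of the jackknife bias and standard-error estimates for an arbitrary statistic (any function of a data sample), in Python using numpy:

      import numpy as np

      def jackknife(data, stat):
          n = len(data)
          full = stat(data)                       # statistic on the entire sample
          loo = np.array([stat(np.delete(data, i, axis=0)) for i in range(n)])
          bias = (n - 1) * (loo.mean() - full)    # jackknife estimate of the bias
          se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
          return full - bias, bias, se            # bias-corrected value, bias, standard error

      # e.g., jackknife(np.random.default_rng(0).standard_normal(30), np.var)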

    Choice of cross-validation method

    Cross-validation can be used simply to estimate the generalization error of a given model, or it can be used for model selection by choosing one of several models that has the smallest estimated generalization error. For example, you might use cross-validation to choose the number of hidden units, or you could use cross-validation to choose a subset of the inputs (subset selection). A subset that contains all relevant inputs will be called a "good" subset, while the subset that contains all relevant inputs but no others will be called the "best" subset. Note that subsets are "good" and "best" in an asymptotic sense (as the number of training cases goes to infinity). With a small training set, it is possible that a subset that is smaller than the "best" subset may provide better generalization error.

    Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as the number of misclassified cases. In the latter case, k-fold cross-validation is preferred. But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analyses. (For model-selection purposes, this bias can actually help; see the discussion below of Shao, 1993.) A value of 10 for k is popular for estimating generalization error.

    Leave-one-out cross-validation can also run into trouble with various model-selection methods. Again, one problem is lack of continuity--a small change in the data can cause a large change in the model selected (Breiman, 1996). For choosing subsets of inputs in linear regression, Breiman and Spector (1992) found 10-fold and 5-fold cross-validation to work better than leave-one-out. Kohavi (1995) also obtained good results for 10-fold cross-validation with empirical decision trees (C4.5). Values of k as small as 5 or even 2 may work even better if you analyze several different random k-way splits of the data to reduce the variability of the cross-validation estimate.

    Leave-one-out cross-validation also has more subtle deficiencies for model selection. Shao (1995) showed that in linear models, leave-one-out cross-validation is asymptotically equivalent to AIC (and Mallows' C_p), but leave-v-out cross-validation is asymptotically equivalent to Schwarz's Bayesian criterion (called SBC or BIC) when v = n[1-1/(log(n)-1)], where n is the number of training cases. SBC provides consistent subset-selection, while AIC does not. That is, SBC will choose the "best" subset with probability approaching one as the size of the training set goes to infinity. AIC has an asymptotic probability of one of choosing a "good" subset, but less than one of choosing the "best" subset (Stone, 1979). Many simulation studies have also found that AIC overfits badly in small samples, and that SBC works well (e.g., Hurvich and Tsai, 1989; Shao and Tu, 1995). Hence, these results suggest that leave-one-out cross-validation should overfit in small samples, but leave-v-out cross-validation with appropriate v should do better. However, when true models have an infinite number of parameters, SBC is not efficient, and other criteria that are asymptotically efficient but not consistent for model selection may produce better generalization (Hurvich and Tsai, 1989).

    Shao (1993) obtained the surprising result that for selecting subsets of inputs in a linear regression, the probability of selecting the "best" does not converge to 1 (as the sample size n goes to infinity) for leave-v-out cross-validation unless the proportion v/n approaches 1. At first glance, Shao's result seems inconsistent with the analysis by Kearns (1997) of split-sample validation, which shows that the best generalization is obtained with v/n strictly between 0 and 1, with little sensitivity to the precise value of v/n for large data sets. But the apparent conflict is due to the fundamentally different properties of cross-validation and split-sample validation.

    To obtain an intuitive understanding of Shao (1993), let's review some background material on generalization error. Generalization error can be broken down into three additive parts, noise variance + estimation variance + squared estimation bias. Noise variance is the same for all subsets of inputs. Bias is nonzero for subsets that are not "good", but it's zero for all "good" subsets, since we are assuming that the function to be learned is linear. Hence the generalization error of "good" subsets will differ only in the estimation variance. The estimation variance is (2p/t)s^2 where p is the number of inputs in the subset, t is the training set size, and s^2 is the noise variance. The "best" subset is better than other "good" subsets only because the "best" subset has (by definition) the smallest value of p. But the t in the denominator means that differences in generalization error among the "good" subsets will all go to zero as t goes to infinity. Therefore it is difficult to guess which subset is "best" based on the generalization error even when t is very large. It is well known that unbiased estimates of the generalization error, such as those based on AIC, FPE, and C_p, do not produce consistent estimates of the "best" subset (e.g., see Stone, 1979).

    In leave-v-out cross-validation, t=n-v. The differences of the cross-validation estimates of generalization error among the "good" subsets contain a factor 1/t, not 1/n. Therefore by making t small enough (and thereby making each regression based on t cases bad enough), we can make the differences of the cross-validation estimates large enough to detect. It turns out that to make t small enough to guess the "best" subset consistently, we have to have t/n go to 0 as n goes to infinity.

    The crucial distinction between cross-validation and split-sample validation is that with cross-validation, after guessing the "best" subset, we train the linear regression model for that subset using all n cases, but with split-sample validation, only t cases are ever used for training. If our main purpose were really to choose the "best" subset, I suspect we would still have to have t/n go to 0 even for split-sample validation. But choosing the "best" subset is not the same thing as getting the best generalization. If we are more interested in getting good generalization than in choosing the "best" subset, we do not want to make our regression estimate based on only t cases as bad as we do in cross-validation, because in split-sample validation that bad regression estimate is what we're stuck with. So there is no conflict between Shao and Kearns, but there is a conflict between the two goals of choosing the "best" subset and getting the best generalization in split-sample validation.

    Bootstrapping

    Bootstrapping seems to work better than cross-validation in many cases (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, you repeatedly analyze subsamples of the data. Each subsample is a random sample with replacement from the full sample. Depending on what you want to do, anywhere from 50 to 2000 subsamples might be used. There are many more sophisticated bootstrap methods that can be used not only for estimating generalization error but also for estimating confidence bounds for network outputs (Efron and Tibshirani 1993). For estimating generalization error in classification problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap) is one of the currently favored methods that has the advantage of performing well even when there is severe overfitting. Use of bootstrapping for NNs is described in Baxt and White (1995), Tibshirani (1996), and Masters (1995). However, the results obtained so far are not very thorough, and it is known that bootstrapping does not work well for some other methodologies such as empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984; Kohavi, 1995), for which it can be excessively optimistic.
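
    As a rough sketch of the simplest bootstrap idea (Efron's .632 estimator rather than the .632+ refinement mentioned above; the fit and predict arguments are placeholders for whatever training and prediction routines you use):

        import numpy as np

        def bootstrap_632(fit, predict, X, y, n_boot=200, seed=0):
            """Sketch of the .632 bootstrap estimate of generalization error
            (squared error).  fit(X, y) returns a trained model and
            predict(model, X) returns its outputs; both are placeholders."""
            rng = np.random.default_rng(seed)
            n = len(y)
            model_all = fit(X, y)
            app_err = np.mean((y - predict(model_all, X)) ** 2)  # apparent (resubstitution) error
            oob_errs = []
            for _ in range(n_boot):
                idx = rng.integers(0, n, size=n)         # a bootstrap subsample: n cases with replacement
                oob = np.setdiff1d(np.arange(n), idx)    # cases left out of this subsample
                if len(oob) == 0:
                    continue
                model = fit(X[idx], y[idx])
                oob_errs.append(np.mean((y[oob] - predict(model, X[oob])) ** 2))
            return 0.368 * app_err + 0.632 * np.mean(oob_errs)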

    For further information

    Cross-validation and bootstrapping become considerably more complicated for time series data; see Hjorth (1994) and Snijders (1988).

    More information on jackknife and bootstrap confidence intervals is available at ftp://ftp.sas.com/pub/neural/jackboot.sas (this is a plain-text file).

    References (see also http://www.statistics.com/books.html):

    Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction", Neural Computation, 7, 624-638.

    Breiman, L. (1996), "Heuristics of instability and stabilization in model selection," Annals of Statistics, 24, 2350-2383.

    Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth.

    Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation in regression: The X-random case," International Statistical Review, 60, 291-319.

    Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical Implications, Proceedings of a workshop held in Groningen, The Netherlands, September 25-26, 1986, Berlin: Springer-Verlag.

    Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: SIAM.

    Efron, B. (1983), "Estimating the error rate of a prediction rule: Improvement on cross-validation," J. of the American Statistical Association, 78, 316-331.

    Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.

    Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. of the American Statistical Association, 92, 548-560.

    Goutte, C. (1997), "Note on free lunches and cross-validation," Neural Computation, 9, 1211-1215, ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz.

    Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.

    Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model selection in small samples," Biometrika, 76, 297-307.

    Kearns, M. (1997), "A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split," Neural Computation, 9, 1143-1161.

    Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI), pp. ?, http://robotics.stanford.edu/users/ronnyk/

    Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0

    Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufman, pp. 391-398.

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

    Shao, J. (1993), "Linear model selection by cross-validation," J. of the American Statistical Association, 88, 486-494.

    Shao, J. (1995), "An asymptotic theory for linear model selection," Statistica Sinica ?.

    Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag.

    Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in time series," in Dijkstra (1988), pp. 56-69.

    Stone, M. (1977), "Asymptotics for and against cross-validation," Biometrika, 64, 29-35.

    Stone, M. (1979), "Comments on model selection criteria of Akaike and Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278.

    Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.

    Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.

    Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation," Neural Computation, 8, 1421-1426.

    ------------------------------------------------------------------------

    Subject: How to compute prediction and confidence intervals (error bars)?

    (This answer is only about half finished. I will get around to the other half eventually.)

    In addition to estimating over-all generalization error, it is often useful to be able to estimate the accuracy of the network's predictions for individual cases.

    Let:

       Y      = the target variable
       y_i    = the value of Y for the ith case
       X      = the vector of input variables
       x_i    = the value of X for the ith case
       N      = the noise in the target variable
       n_i    = the value of N for the ith case
       m(X)   = E(Y|X) = the conditional mean of Y given X
       w      = a vector of weights for a neural network
       w^     = the weights obtained by training the network
       p(X,w) = the output of a neural network given input X and weights w
       p_i    = p(x_i,w)
       L      = the number of training (learning) cases, (y_i,x_i), i=1, ..., L
       Q(w)   = the objective function
    Assume the data are generated by the model:
       Y = m(X) + N
       E(N|X) = 0
       N and X are independent
    The network is trained by attempting to minimize the objective function Q(w), which, for example, could be the sum of squared errors or the negative log likelihood based on an assumed family of noise distributions.
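
    For example, with the notation above, least-squares training uses

       Q(w) = sum over i=1,...,L of [y_i - p(x_i,w)]^2

    and if the noise N is assumed to be Gaussian with constant variance, the negative log likelihood is this sum of squared errors divided by twice the noise variance, plus a term that does not depend on w (when the noise variance is treated as fixed), so both criteria are minimized by the same weights w^.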

    Given a test input x_0, a 100c% prediction interval for y_0 is an interval [LPB_0,UPB_0] such that Pr(LPB_0 <= y_0 <= UPB_0) = c, where c is typically .95 or .99, and the probability is computed over repeated random selection of the training set and repeated observation of Y given the test input x_0. A 100c% confidence interval for p_0 is an interval [LCB_0,UCB_0] such that Pr(LCB_0 <= p_0 <= UCB_0) = c, where again the probability is computed over repeated random selection of the training set. Note that p_0 is a nonrandom quantity, since x_0 is given. A confidence interval is narrower than the corresponding prediction interval, since the prediction interval must include variation due to noise in y_0, while the confidence interval does not. Both intervals include variation due to sampling of the training set and possible variation in the training process due, for example, to random initial weights and local minima of the objective function.

    Traditional statistical methods for nonlinear models depend on several assumptions (Gallant, 1987):

   1. The inputs for the training cases are either fixed or obtained by simple random sampling or some similarly well-behaved process.
   2. Q(w) has continuous first and second partial derivatives with respect to w over some convex, bounded subset S_W of the weight space.
   3. Q(w) has a unique global minimum at w^, which is an interior point of S_W.
   4. The model is well-specified, which requires (a) that there exist weights w$ in the interior of S_W such that m(x) = p(x,w$), and (b) that the assumptions about the noise distribution are correct. (Sorry about the w$ notation, but I'm running out of plain text symbols.)

    These traditional methods are based on a linear approximation to p(x,w) in a neighborhood of w$, yielding a quadratic approximation to Q(w). Hence the Hessian of Q(w) (the square matrix of second-order partial derivatives with respect to w) frequently appears in these methods.

    Assumption (3) is not satisfied for neural nets, because networks with hidden units always have multiple global minima, and the global minima are often improper. Hence, confidence intervals for the weights cannot be obtained using standard Hessian-based methods. However, Hwang and Ding (1997) have shown that confidence intervals for predicted values can be obtained because the predicted values are statistically identified even though the weights are not.

    Cardell, Joerding, and Li (1994) describe a more serious violation of assumption (3), namely that for some m(x), no finite global minimum exists. In such situations, it may be possible to use regularization methods such as weight decay to obtain valid confidence intervals (De Veaux, Schumi, Schweinsberg, and Ungar, 1998), but more research is required on this subject, since the derivation in the cited paper assumes a finite global minimum.

    For large samples, the sampling variability in w^ can be approximated in various ways:

  • Fisher's information matrix, which is the expected value of the Hessian of Q(w) divided by L, can be used when Q(w) is the negative log likelihood (Spall, 1998).
  • The delta method, based on the Hessian of Q(w) or the Gauss-Newton approximation using the cross-product Jacobian of Q(w), can also be used when Q(w) is the negative log likelihood (Tibshirani, 1996; Hwang and Ding, 1997; De Veaux, Schumi, Schweinsberg, and Ungar, 1998).
  • The sandwich estimator, a more elaborate Hessian-based method, relaxes assumption (4) (Gallant, 1987; White, 1989; Tibshirani, 1996).
  • Bootstrapping can be used without knowing the form of the noise distribution and takes into account variability introduced by local minima in training, but requires training the network many times on different resamples of the training set (Tibshirani, 1996; Heskes, 1997); a rough sketch of this approach is given below.
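
    As a rough sketch of the bootstrap approach in the last item above (percentile intervals only; train_network and net_predict are placeholders for your own training and prediction code, and more refined schemes are given by Tibshirani, 1996, and Heskes, 1997):

        import numpy as np

        def bootstrap_confidence_interval(train_network, net_predict,
                                          X, y, x0, n_boot=200, c=0.95, seed=0):
            """Percentile bootstrap confidence interval for the network
            output p_0 at a test input x0.  Each retraining may also start
            from different random initial weights, so variability due to
            local minima is included."""
            rng = np.random.default_rng(seed)
            n = len(y)
            preds = []
            for _ in range(n_boot):
                idx = rng.integers(0, n, size=n)      # resample the training set with replacement
                w = train_network(X[idx], y[idx])     # retrain on the resample
                preds.append(net_predict(w, x0))
            lo, hi = np.percentile(preds, [100 * (1 - c) / 2, 100 * (1 + c) / 2])
            return lo, hi                             # [LCB_0, UCB_0]

    A prediction interval for y_0 would have to be wider, since it must also allow for the noise n_0; one crude possibility is to add a resampled residual to each bootstrap prediction before taking the percentiles, but see Heskes (1997) for a proper treatment.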

    References:

    Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why some feedforward networks cannot learn some polynomials," Neural Computation, 6, 761-766.

    De Veaux,R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998), "Prediction intervals for neural networks via nonlinear regression," Technometrics, 40, 273-282.

    Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.

    Heskes, T. (1997), "Practical confidence and prediction intervals," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 176-182.

    Hwang, J.T.G., and Ding, A.A. (1997), "Prediction intervals for artificial neural networks," J. of the American Statistical Association, 92, 748-757.

    Nix, D.A., and Weigend, A.S. (1995), "Learning local error bars for nonlinear regression," in Tesauro, G., Touretzky, D., and Leen, T., (eds.) Advances in Neural Information Processing Systems 7, Cambridge, MA: The MIT Press, pp. 489-496.

    Spall, J.C. (1998), "Resampling-based calculation of the information matrix in nonlinear statistical models," Proceedings of the 4th Joint Conference on Information Sciences, October 23-28, Research Triangle Park, NC, USA, Vol 4, pp. 35-39.

    Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.

    White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013.

    ------------------------------------------------------------------------
    Next part is part 4 (of 7). Previous part is part 2.

    Subject: Books and articles about Neural Networks?

    Most books in print can be ordered online from http://www.amazon.com/. Amazon's prices and search engine are good and their service is excellent.

     Bookpool at http://www.bookpool.com/ does not have as large a selection as Amazon but they often offer exceptional discounts.

    The neural networks reading group at the University of Illinois at Urbana-Champaign, the Artificial Neural Networks and Computational Brain Theory (ANNCBT) forum, has compiled a large number of book and paper reviews at http://anncbt.ai.uiuc.edu/, with an emphasis on cognitive science rather than on practical applications of NNs.

    The Best

    The best of the best

    Bishop (1995) is clearly the single best book on artificial NNs. This book excels in organization and choice of material, and is a close runner-up to Ripley (1996) for accuracy. If you are new to the field, read it from cover to cover. If you have lots of experience with NNs, it's an excellent reference. If you don't know calculus, take a class. I hope a second edition comes out soon! For more information, see The best intermediate textbooks on NNs below.

    If you have questions on feedforward nets that aren't answered by Bishop, try Reed and Marks (1999) for practical issues or Ripley (1996) for theoretical issues, both of which are reviewed below.

    The best popular introduction to NNs

    Hinton, G.E. (1992), "How Neural Networks Learn from Experience", Scientific American, 267 (September), 144-151.
    Author's Webpage: http://www.cs.utoronto.ca/DCS/People/Faculty/hinton.html (official)
    and http://www.cs.toronto.edu/~hinton (private)
    Journal Webpage: http://www.sciam.com/
    Additional Information: Unfortunately that article is not available there.

    The best introductory book for business executives

    Bigus, J.P. (1996), Data Mining with Neural Networks: Solving Business Problems--from Application Development to Decision Support, NY: McGraw-Hill, ISBN 0-07-005779-6, xvii+221 pages.
    The stereotypical business executive (SBE) does not want to know how or why NNs work--he (SBEs are usually male) just wants to make money. The SBE may know what an average or percentage is, but he is deathly afraid of "statistics". He understands profit and loss but does not want to waste his time learning things involving complicated math, such as high-school algebra. For further information on the SBE, see the "Dilbert" comic strip.

    Bigus has written an excellent introduction to NNs for the SBE. Bigus says (p. xv), "For business executives, managers, or computer professionals, this book provides a thorough introduction to neural network technology and the issues related to its application without getting bogged down in complex math or needless details. The reader will be able to identify common business problems that are amenable to the neural network approach and will be sensitized to the issues that can affect successful completion of such applications." Bigus succeeds in explaining NNs at a practical, intuitive, and necessarily shallow level without formulas--just what the SBE needs. This book is far better than Caudill and Butler (1990), a popular but disastrous attempt to explain NNs without formulas.

    Chapter 1 introduces data mining and data warehousing, and sketches some applications thereof. Chapter 2 is the semi-obligatory philosophico-historical discussion of AI and NNs and is well-written, although the SBE in a hurry may want to skip it. Chapter 3 is a very useful discussion of data preparation. Chapter 4 describes a variety of NNs and what they are good for. Chapter 5 goes into practical issues of training and testing NNs. Chapters 6 and 7 explain how to use the results from NNs. Chapter 8 discusses intelligent agents. Chapters 9 through 12 contain case histories of NN applications, including market segmentation, real-estate pricing, customer ranking, and sales forecasting.

    Bigus provides generally sound advice. He briefly discusses overfitting and overtraining without going into much detail, although I think his advice on p. 57 to have at least two training cases for each connection is somewhat lenient, even for noise-free data. I do not understand his claim on pp. 73 and 170 that RBF networks have advantages over backprop networks for nonstationary inputs--perhaps he is using the word "nonstationary" in a sense different from the statistical meaning of the term. There are other things in the book that I would quibble with, but I did not find any of the flagrant errors that are common in other books on NN applications such as Swingler (1996).

    The one serious drawback of this book is that it is more than one page long and may therefore tax the attention span of the SBE. But any SBE who succeeds in reading the entire book should learn enough to be able to hire a good NN expert to do the real work.

    The best elementary textbooks on practical use of NNs

    Smith, M. (1993). Neural Networks for Statistical Modeling, NY: Van Nostrand Reinhold.
    Book Webpage (Publisher): http://www.thompson.com/
    Additional Information: seems to be out of print.
    Smith is not a statistician, but he has made an impressive effort to convey statistical fundamentals applied to neural networks. The book has entire brief chapters on overfitting and validation (early stopping and split-sample validation, which he incorrectly calls cross-validation), putting it a rung above most other introductions to NNs. There are also brief chapters on data preparation and diagnostic plots, topics usually ignored in elementary NN books. Only feedforward nets are covered in any detail.

    Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann. ISBN 1 55860 065 5.
    Author's Webpage: Kulikowski: http://ruccs.rutgers.edu/faculty/kulikowski.html
    Book Webpage (Publisher): http://www.mkp.com/books_catalog/1-55860-065-5.asp
    Additional Information: Information on Weiss, S.M. is not available.
    Briefly covers, at a very elementary level, feedforward nets, linear and nearest-neighbor discriminant analysis, trees, and expert systems, emphasizing practical applications. For a book at this level, it has an unusually good chapter on estimating generalization error, including bootstrapping.

    Reed, R.D., and Marks, R.J. II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.
    Author's Webpage: Marks: http://cialab.ee.washington.edu/Marks.html
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=0262181908
    After you have read Smith (1993) or Weiss and Kulikowski (1991), consult Reed and Marks for practical details on training MLPs (other types of neural nets such as RBF networks are barely even mentioned). They provide extensive coverage of backprop and its variants, and they also survey conventional optimization algorithms. Their coverage of initialization methods, constructive networks, pruning, and regularization methods is unusually thorough. Unlike the vast majority of books on neural nets, this one has lots of really informative graphs. The chapter on generalization assessment is slightly weak, which is why you should read Smith (1993) or Weiss and Kulikowski (1991) first. Also, there is little information on data preparation, for which Smith (1993) and Masters (1993; see below) should be consulted. There is some elementary calculus, but not enough that it should scare off anybody. Many second-rate books treat neural nets as mysterious black boxes, but Reed and Marks open up the box and provide genuine insight into the way neural nets work.

    One problem with the book is that the terms "validation set" and "test set" are used inconsistently.

    The best elementary textbook on using and programming NNs

    Masters, Timothy (1993). Practical Neural Network Recipes in C++, Academic Press, ISBN 0-12-479040-2, US $45 incl. disks.
    Book Webpage (Publisher): http://www.apcatalog.com/cgi-bin/AP?ISBN=0124790402&LOCATION=US&FORM=FORM2
    Masters has written three exceptionally good books on NNs (the two others are listed below). He combines generally sound practical advice with some basic statistical knowledge to produce a programming text that is far superior to the competition (see "The Worst" below). Not everyone likes his C++ code (the usual complaint is that the code is not sufficiently OO) but, unlike the code in some other books, Masters's code has been successfully compiled and run by some readers of comp.ai.neural-nets. Masters's books are well worth reading even for people who have no interest in programming.

    Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0
    Book Webpage (Publisher): http://www.wiley.com/
    Additional Information: One has to search.
    Clear explanations of conjugate gradient and Levenberg-Marquardt optimization algorithms, simulated annealing, kernel regression (GRNN) and discriminant analysis (PNN), Gram-Charlier networks, dimensionality reduction, cross-validation, and bootstrapping.

    The best elementary textbooks on NN research

    Fausett, L. (1994), Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, Englewood Cliffs, NJ: Prentice Hall, ISBN 0-13-334186-0. Also published as a Prentice Hall International Edition, ISBN 0-13-042250-9. Sample software (source code listings in C and Fortran) is included in an Instructor's Manual.

    Book Webpage (Publisher): http://www.prenhall.com/books/esm_0133341860.html
    Additional Information: The programs and additional support mentioned are not available.

    Review by Ian Cresswell:

    What a relief! As a broad introductory text this is without any doubt the best currently available in its area. It doesn't include source code of any kind (normally this is badly written and compiler specific). The algorithms for many different kinds of simple neural nets are presented in a clear step by step manner in plain English.

    Equally, the mathematics is introduced in a relatively gentle manner. There are no unnecessary complications or diversions from the main theme.

    The examples that are used to demonstrate the various algorithms are detailed but (perhaps necessarily) simple.

    There are bad things that can be said about most books. There are only a small number of minor criticisms that can be made about this one. More space should have been given to backprop and its variants because of the practical importance of such methods. And while the author discusses early stopping in one paragraph, the treatment of generalization is skimpy compared to the books by Weiss and Kulikowski or Smith listed above.

    If you're new to neural nets and you don't want to be swamped by bogus ideas, huge amounts of intimidating looking mathematics, a programming language that you don't know etc. etc. then this is the book for you.

    In summary, this is the best starting point for the outsider and/or beginner... a truly excellent text.

    Anderson, J.A. (1995), An Introduction to Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-01144-1.
    Author's Webpage: http://www.cog.brown.edu/~anderson
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=0262510812 or
    http://mitpress.mit.edu/book-home.tcl?isbn=0262011441 (hardback)
    Additional Information: Programs and additional information can be found at: ftp://mitpress.mit.edu/pub/Intro-to-NeuralNets/
    Anderson provides an accessible introduction to the AI and neurophysiological sides of NN research, although the book is weak regarding practical aspects of using NNs. Recommended for classroom use if the instructor provides supplementary material on how to get good generalization.

    The best intermediate textbooks on NNs

    Bishop, C.M. (1995). Neural Networks for Pattern Recognition, Oxford: Oxford University Press. ISBN 0-19-853849-9 (hardback) or 0-19-853864-2 (paperback), xvii+482 pages.
    Author's Webpage: http://neural-server.aston.ac.uk/People/bishopc/Welcome.html
    Book Webpage (Publisher): http://www1.oup.co.uk/bin/readcat?Version=887069107&title=Neural+Networks+for+Pattern+Recognition&TOB=52305&H1=19808&H2=47489&H3=48287&H4=48306&count=1&style=full
    This is definitely the best book on neural nets for practical applications for readers comfortable with calculus. Geoffrey Hinton writes in the foreword:
    "Bishop is a leading researcher who has a deep understanding of the material and has gone to great lengths to organize it in a sequence that makes sense. He has wisely avoided the temptation to try to cover everything and has therefore omitted interesting topics like reinforcement learning, Hopfield networks, and Boltzmann machines in order to focus on the types of neural networks that are most widely used in practical applications. He assumes that the reader has the basic mathematical literacy required for an undergraduate science degree, and using these tools he explains everything from scratch. Before introducing the multilayer perceptron, for example, he lays a solid foundation of basic statistical concepts. So the crucial concept of overfitting is introduced using easily visualized examples of one-dimensional polynomials and only later applied to neural networks. An impressive aspect of this book is that it takes the reader all the way from the simplest linear models to the very latest Bayesian multilayer neural networks without ever requiring any great intellectual leaps."

    Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, ISBN 0-201-50395-6 (hardbound) and 0-201-51560-1 (paperbound)
    Book Webpage (Publisher): http://www2.awl.com/gb/abp/sfi/computer.html
    This is an excellent classic work on neural nets from the perspective of physics. Comments from readers of comp.ai.neural-nets: "My first impression is that this one is by far the best book on the topic. And it's below $30 for the paperback."; "Well written, theoretical (but not overwhelming)"; "It provides a good balance of model development, computational algorithms, and applications. The mathematical derivations are especially well done"; "Nice mathematical analysis on the mechanism of different learning algorithms"; "It is NOT for mathematical beginner. If you don't have a good grasp of higher level math, this book can be really tough to get through."
     
     

    The best advanced textbook covering NNs

    Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press, ISBN 0-521-46086-7 (hardback), xii+403 pages.
    Author's Webpage: http://www.stats.ox.ac.uk/~ripley/
    Book Webpage (Publisher): http://www.cup.cam.ac.uk/
    Additional Information: The Webpage includes errata and additional information for this book that was not available at the time of publication.
    Brian Ripley's new book is an excellent sequel to Bishop (1995). Ripley starts up where Bishop left off, with Bayesian inference and statistical decision theory, and then covers some of the same material on NNs as Bishop but at a higher mathematical level. Ripley also covers a variety of methods that are not discussed, or discussed only briefly, by Bishop, such as tree-based methods and belief networks. While Ripley is best appreciated by people with a background in mathematical statistics, the numerous realistic examples in his book will be of interest even to beginners in neural nets.

     Devroye, L., Gy\"orfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, NY: Springer, ISBN 0-387-94618-7, vii+636 pages.
    This book has relatively little material explicitly about neural nets, but what it has is very interesting and much of it is not found in other texts. The emphasis is on statistical proofs of universal consistency for a wide variety of methods, including histograms, (k) nearest neighbors, kernels (PNN), trees, generalized linear discriminants, MLPs, and RBF networks. There is also considerable material on validation and cross-validation. The authors say, "We did not scar the pages with backbreaking simulations or quick-and-dirty engineering solutions" (p. 7). The formula-to-text ratio is high, but the writing is quite clear, and anyone who has had a year or two of mathematical statistics should be able to follow the exposition.

    The best books on image and signal processing with NNs

    Masters, T. (1994), Signal and Image Processing with Neural Networks: A C++ Sourcebook, NY: Wiley.
    Book Webpage (Publisher): http://www.wiley.com/
    Additional Information: One has to search.

    Cichocki, A. and Unbehauen, R. (1993). Neural Networks for Optimization and Signal Processing. NY: John Wiley & Sons, ISBN 0-471-930105 (hardbound), 526 pages, $57.95.
    Book Webpage (Publisher): http://www.wiley.com/
    Additional Information: One has to search.
    Comments from readers of comp.ai.neural-nets:"Partly a textbook and partly a research monograph; introduces the basic concepts, techniques, and models related to neural networks and optimization, excluding rigorous mathematical details. Accessible to a wide readership with a differential calculus background. The main coverage of the book is on recurrent neural networks with continuous state variables. The book title would be more appropriate without mentioning signal processing. Well edited, good illustrations."
     
     

    The best book on time-series forecasting with NNs

    Weigend, A.S. and Gershenfeld, N.A., eds. (1994) Time Series Prediction: Forecasting the Future and Understanding the Past, Reading, MA: Addison-Wesley. Book Webpage (Publisher): http://www2.awl.com/gb/abp/sfi/complexity.html

    The best books on reinforcement learning

    Elementary:
    Sutton, R.S., and Barto, A.G. (1998), Reinforcement Learning: An Introduction, The MIT Press, ISBN 0-262-19398-1.
    Author's Webpage: http://envy.cs.umass.edu/~rich/sutton.html and http://www-anw.cs.umass.edu/People/barto/barto.html
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=0262193981
    Additional Information: http://www-anw.cs.umass.edu/~rich/book/the-book.html

    Advanced:
    Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.
    Author's Webpage: http://www.mit.edu:8001/people/dimitrib/home.html and http://web.mit.edu/jnt/www/home.html
    Book Webpage (Publisher): http://world.std.com/~athenasc/ndpbook.html

    The best books on neurofuzzy systems

    Brown, M., and Harris, C. (1994), Neurofuzzy Adaptive Modelling and Control, NY: Prentice Hall.
    Author's Webpage: http://www.isis.ecs.soton.ac.uk/people/m_brown.html
    and http://www.ecs.soton.ac.uk/~cjh/
    Book Webpage (Publisher): http://www.prenhall.com/books/esm_0131344536.html
    Additional Information: Additional page at: http://www.isis.ecs.soton.ac.uk/publications/neural/mqbcjh94e.html and an abstract can be found at: http://www.isis.ecs.soton.ac.uk/publications/neural/mqb93.html
    Brown and Harris rely on the fundamental insight that a fuzzy system is a nonlinear mapping from an input space to an output space that can be parameterized in various ways and therefore can be adapted to data using the usual neural training methods (see "What is backprop?") or conventional numerical optimization algorithms (see "What are conjugate gradients, Levenberg-Marquardt, etc.?"). Their approach makes clear the intimate connections between fuzzy systems, neural networks, and statistical methods such as B-spline regression.

    Kosko, B. (1997), Fuzzy Engineering, Upper Saddle River, NJ: Prentice Hall.
    Kosko's new book is a big improvement over his older neurofuzzy book and makes an excellent sequel to Brown and Harris (1994).

    The best comparison of NNs with other classification methods

    Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994), Machine Learning, Neural and Statistical Classification, Ellis Horwood.
    Author's Webpage: Donald Michie: http://www.aiai.ed.ac.uk/~dm/dm.html
    Additional Information: This book is out of print but available online at http://www.amsta.leeds.ac.uk/~charles/statlog/

    Books for the Beginner

    Aleksander, I. and Morton, H. (1990). An Introduction to Neural Computing. Chapman and Hall. (ISBN 0-412-37780-2).
    Book Webpage (Publisher): http://www.chaphall.com/
    Additional Information: Seems to be out of print.
    Comments from readers of comp.ai.neural-nets: "This book seems to be intended for the first year of university education."

     Beale, R. and Jackson, T. (1990). Neural Computing, an Introduction. Adam Hilger, IOP Publishing Ltd : Bristol. (ISBN 0-85274-262-2).
    Comments from readers of comp.ai.neural-nets: "It's clearly written. Lots of hints as to how to get the adaptive models covered to work (not always well explained in the original sources). Consistent mathematical terminology. Covers perceptrons, error-backpropagation, Kohonen self-org model, Hopfield type models, ART, and associative memories."

     Caudill, M. and Butler, C. (1990). Naturally Intelligent Systems. MIT Press: Cambridge, Massachusetts. (ISBN 0-262-03156-6).
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=0262531135
    The authors try to translate mathematical formulas into English. The results are likely to disturb people who appreciate either mathematics or English. Have the authors never heard that "a picture is worth a thousand words"? What few diagrams they have (such as the one on p. 74) tend to be confusing. Their jargon is peculiar even by NN standards; for example, they refer to target values as "mentor inputs" (p. 66). The authors do not understand elementary properties of error functions and optimization algorithms. For example, in their discussion of the delta rule, the authors seem oblivious to the differences between batch and on-line training, and they attribute magical properties to the algorithm (p. 71):

    [The on-line delta] rule always takes the most efficient route from the current position of the weight vector to the "ideal" position, based on the current input pattern. The delta rule not only minimizes the mean squared error, it does so in the most efficient fashion possible--quite an achievement for such a simple rule.
    While the authors realize that backpropagation networks can suffer from local minima, they mistakenly think that counterpropagation has some kind of global optimization ability (p. 202):
    Unlike the backpropagation network, a counterpropagation network cannot be fooled into finding a local minimum solution. This means that the network is guaranteed to find the correct response (or the nearest stored response) to an input, no matter what.
    But even though they acknowledge the problem of local minima, the authors are ignorant of the importance of initial weight values (p. 186):
    To teach our imaginary network something using backpropagation, we must start by setting all the adaptive weights on all the neurodes in it to random values. It won't matter what those values are, as long as they are not all the same and not equal to 1.
    Like most introductory books, this one neglects the difficulties of getting good generalization--the authors simply declare (p. 8) that "A neural network is able to generalize"!

    Chester, M. (1993). Neural Networks: A Tutorial, Englewood Cliffs, NJ: PTR Prentice Hall.
    Book Webpage (Publisher): http://www.prenhall.com/
    Additional Information: Seems to be out of print.
    Shallow, sometimes confused, especially with regard to Kohonen networks.

    Dayhoff, J. E. (1990). Neural Network Architectures: An Introduction. Van Nostrand Reinhold: New York.
    Comments from readers of comp.ai.neural-nets: "Like Wasserman's book, Dayhoff's book is also very easy to understand".

     Freeman, James (1994). Simulating Neural Networks with Mathematica, Addison-Wesley, ISBN: 0-201-56629-X. Book Webpage (Publisher): http://cseng.aw.com/bookdetail.qry?ISBN=0-201-56629-X&ptype=0
    Additional Information: Sourcecode available under: ftp://ftp.mathsource.com/pub/Publications/BookSupplements/Freeman-1993
    Helps the reader make his own NNs. The Mathematica code for the programs in the book is also available through the internet: Send mail to MathSource@wri.com or try http://www.wri.com/ on the World Wide Web.

     Freeman, J.A. and Skapura, D.M. (1991). Neural Networks: Algorithms, Applications, and Programming Techniques, Reading, MA: Addison-Wesley.
    Book Webpage (Publisher): http://www.awl.com/
    Additional Information: Seems to be out of print.
    A good book for beginning programmers who want to learn how to write NN programs while avoiding any understanding of what NNs do or why they do it.

    Gately, E. (1996). Neural Networks for Financial Forecasting. New York: John Wiley and Sons, Inc.
    Book Webpage (Publisher): http://www.wiley.com/
    Additional Information: One has to search.
    Franco Insana comments:

    * Decent book for the neural net beginner
    * Very little devoted to statistical framework, although there 
        is some formulation of backprop theory
    * Some food for thought
    * Nothing here for those with any neural net experience

    Hecht-Nielsen, R. (1990). Neurocomputing. Addison-Wesley.
    Book Webpage (Publisher): http://www.awl.com/
    Additional Information: Seems to be out of print.
    Comments from readers of comp.ai.neural-nets: "A good book", "comprises a nice historical overview and a chapter about NN hardware. Well structured prose. Makes important concepts clear."

     McClelland, J. L. and Rumelhart, D. E. (1988). Explorations in Parallel Distributed Processing: Computational Models of Cognition and Perception (software manual). The MIT Press.
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=026263113X (IBM version) and
    http://mitpress.mit.edu/book-home.tcl?isbn=0262631296 (Macintosh)
    Comments from readers of comp.ai.neural-nets: "Written in a tutorial style, and includes 2 diskettes of NN simulation programs that can be compiled on MS-DOS or Unix (and they do too !)"; "The programs are pretty reasonable as an introduction to some of the things that NNs can do."; "There are *two* editions of this book. One comes with disks for the IBM PC, the other comes with disks for the Macintosh".

     McCord Nelson, M. and Illingworth, W.T. (1990). A Practical Guide to Neural Nets. Addison-Wesley Publishing Company, Inc. (ISBN 0-201-52376-0).
    Book Webpage (Publisher): http://cseng.aw.com/bookdetail.qry?ISBN=0-201-63378-7&ptype=1174
    Lots of applications without technical details, lots of hype, lots of goofs, no formulas.

     Muller, B., Reinhardt, J., Strickland, M. T. (1995). Neural Networks: An Introduction (2nd ed.). Berlin, Heidelberg, New York: Springer-Verlag. ISBN 3-540-60207-0. (DOS 3.5" disk included.)
    Book Webpage (Publisher): http://www.springer.de/catalog/html-files/deutsch/phys/3540602070.html
    Comments from readers of comp.ai.neural-nets: "The book was developed out of a course on neural-network models with computer demonstrations that was taught by the authors to Physics students. The book comes together with a PC-diskette. The book is divided into three parts: (1) Models of Neural Networks; describing several architectures and learing rules, including the mathematics. (2) Statistical Physiscs of Neural Networks; "hard-core" physics section developing formal theories of stochastic neural networks. (3) Computer Codes; explanation about the demonstration programs. First part gives a nice introduction into neural networks together with the formulas. Together with the demonstration programs a 'feel' for neural networks can be developed."

    Orchard, G.A. & Phillips, W.A. (1991). Neural Computation: A Beginner's Guide. Lawrence Erlbaum Associates: London.
    Comments from readers of comp.ai.neural-nets: "Short user-friendly introduction to the area, with a non-technical flavour. Apparently accompanies a software package, but I haven't seen that yet".

     Rao, V.B & H.V. (1993). C++ Neural Networks and Fuzzy Logic. MIS:Press, ISBN 1-55828-298-x, US $45 incl. disks.
    Covers a wider variety of networks than Masters (1993), Practical Neural Network Recipes in C++, but lacks Masters's insight into practical issues of using NNs.

     Wasserman, P. D. (1989). Neural Computing: Theory & Practice. Van Nostrand Reinhold: New York. (ISBN 0-442-20743-3)
    Comments from readers of comp.ai.neural-nets: "Wasserman flatly enumerates some common architectures from an engineer's perspective ('how it works') without ever addressing the underlying fundamentals ('why it works') - important basic concepts such as clustering, principal components or gradient descent are not treated. It's also full of errors, and unhelpful diagrams drawn with what appears to be PCB board layout software from the '70s. For anyone who wants to do active research in the field I consider it quite inadequate"; "Okay, but too shallow"; "Quite easy to understand"; "The best bedtime reading for Neural Networks. I have given this book to numerous collegues who want to know NN basics, but who never plan to implement anything. An excellent book to give your manager."
     
     

    The Classics

    Kohonen, T. (1984). Self-organization and Associative Memory. Springer-Verlag: New York. (2nd Edition: 1988; 3rd edition: 1989).
    Author's Webpage: http://www.cis.hut.fi/nnrc/teuvo.html
    Book Webpage (Publisher): http://www.springer.de/
    Additional Information: Book is out of print.
    Comments from readers of comp.ai.neural-nets: "The section on Pattern mathematics is excellent."

     Rumelhart, D. E. and McClelland, J. L. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition (volumes 1 & 2). The MIT Press.
    Author's Webpage: http://www-med.stanford.edu/school/Neurosciences/faculty/rumelhart.html
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=0262631121
    Comments from readers of comp.ai.neural-nets: "As a computer scientist I found the two Rumelhart and McClelland books really heavy going and definitely not the sort of thing to read if you are a beginner."; "It's quite readable, and affordable (about $65 for both volumes)."; "THE Connectionist bible".
     
     

    Introductory Journal Articles

    Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, Vol. 40, pp. 185--234.
    Author's Webpage: http://www.cs.utoronto.ca/DCS/People/Faculty/hinton.html (official) and
    http://www.cs.toronto.edu/~hinton (private)
    Journal Webpage (Publisher): http://www.elsevier.nl/locate/artint
    Comments from readers of comp.ai.neural-nets: "One of the better neural networks overview papers, although the distinction between network topology and learning algorithm is not always very clear. Could very well be used as an introduction to neural networks."

     Knight, K. (1990). Connectionist Ideas and Algorithms. Communications of the ACM. November 1990. Vol. 33, no. 11, pp. 59-74.
    Comments from readers of comp.ai.neural-nets:"A good article, while it is for most people easy to find a copy of this journal."

     Kohonen, T. (1988). An Introduction to Neural Computing. Neural Networks, vol. 1, no. 1. pp. 3-16.
    Author's Webpage: http://www.cis.hut.fi/nnrc/teuvo.html
    Journal Webpage (Publisher): http://www.eeb.ele.tue.nl/neural/neural.html
    Additional Information: Article not available there.
    Comments from readers of comp.ai.neural-nets: "A general review".

     Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, vol 323 (9 October), pp. 533-536.
    Journal Webpage (Publisher): http://www.nature.com/
    Additional Information: Article not available there.
    Comments from readers of comp.ai.neural-nets: "Gives a very good potted explanation of backprop NN's. It gives sufficient detail to write your own NN simulation."
     
     

    Not-quite-so-introductory Literature

    Anderson, J. A. and Rosenfeld, E. (Eds). (1988). Neurocomputing: Foundations of Research. The MIT Press: Cambridge, MA.
    Author's Webpage: http://www.cog.brown.edu/~anderson
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=0262510480
    Comments from readers of comp.ai.neural-nets: "An expensive book, but excellent for reference. It is a collection of reprints of most of the major papers in the field."

    Anderson, J. A., Pellionisz, A. and Rosenfeld, E. (Eds). (1990). Neurocomputing 2: Directions for Research. The MIT Press: Cambridge, MA.
    Author's Webpage: http://www.cog.brown.edu/~anderson
    Book Webpage (Publisher): http://mitpress.mit.edu/book-home.tcl?isbn=0262510758
    Comments from readers of comp.ai.neural-nets: "The sequel to their well-known Neurocomputing book."

     Bourlard, H.A., and Morgan, N. (1994), Connectionist Speech Recognition: A Hybrid Approach, Boston: Kluwer Academic Publishers.

     Deco, G. and Obradovic, D. (1996), An Information-Theoretic Approach to Neural Computing, NY: Springer-Verlag.

    Haykin, S. (1994). Neural Networks, a Comprehensive Foundation. Macmillan, New York, NY.
    Comments from readers of comp.ai.neural-nets: "A very readable, well written intermediate text on NNs Perspective is primarily one of pattern recognition, estimation and signal processing. However, there are well-written chapters on neurodynamics and VLSI implementation. Though there is emphasis on formal mathematical models of NNs as universal approximators, statistical estimators, etc., there are also examples of NNs used in practical applications. The problem sets at the end of each chapter nicely complement the material. In the bibliography are over 1000 references."

     Khanna, T. (1990). Foundations of Neural Networks. Addison-Wesley: New York.
    Book Webpage (Publisher): http://www.awl.com/
    Comments from readers of comp.ai.neural-nets: "Not so bad (with a page of erroneous formulas (if I remember well), and #hidden layers isn't well described)."; "Khanna's intention in writing his book with math analysis should be commended but he made several mistakes in the math part".

     Kung, S.Y. (1993). Digital Neural Networks, Prentice Hall, Englewood Cliffs, NJ.
     Book Webpage (Publisher): http://www.prenhall.com/books/ptr_0136123260.html

    Levine, D. S. (1990). Introduction to Neural and Cognitive Modeling. Lawrence Erlbaum: Hillsdale, N.J.
    Comments from readers of comp.ai.neural-nets: "Highly recommended".

     Lippmann, R. P. (April 1987). An introduction to computing with neural nets. IEEE Acoustics, Speech, and Signal Processing (ASSP) Magazine, vol. 4, no. 2, pp. 4-22.
    Comments from readers of comp.ai.neural-nets: "Much acclaimed as an overview of neural networks, but rather inaccurate on several points. The categorization into binary and continuous- valued input neural networks is rather arbitrary, and may work confusing for the unexperienced reader. Not all networks discussed are of equal importance."

     Maren, A., Harston, C. and Pap, R., (1990). Handbook of Neural Computing Applications. Academic Press. ISBN: 0-12-471260-6. (451 pages)
    Comments from readers of comp.ai.neural-nets: "They cover a broad area"; "Introductory with suggested applications implementation".

     Pao, Y. H. (1989). Adaptive Pattern Recognition and Neural Networks. Addison-Wesley Publishing Company, Inc. (ISBN 0-201-12584-6)
    Book Webpage (Publisher): http://www.awl.com/
    Comments from readers of comp.ai.neural-nets: "An excellent book that ties together classical approaches to pattern recognition with Neural Nets. Most other NN books do not even mention conventional approaches."

     Refenes, A. (Ed.) (1995). Neural Networks in the Capital Markets. Chichester, England: John Wiley and Sons, Inc.
    Book Webpage (Publisher): http://www.wiley.com/
    Additional Information: One has to search.
    Franco Insana comments:

    * Not for the beginner
    * Excellent introductory material presented by editor in first 5 
      chapters, which could be a valuable reference source for any 
      practitioner
    * Very thought-provoking
    * Mostly backprop-related
    * Most contributors lay good statistical foundation
    * Overall, a wealth of information and ideas, but the reader has to 
      sift through it all to come away with anything useful

    Simpson, P. K. (1990). Artificial Neural Systems: Foundations, Paradigms, Applications and Implementations. Pergamon Press: New York.
    Comments from readers of comp.ai.neural-nets: "Contains a very useful 37 page bibliography. A large number of paradigms are presented. On the negative side the book is very shallow. Best used as a complement to other books".

     Wasserman, P.D. (1993). Advanced Methods in Neural Computing. Van Nostrand Reinhold: New York (ISBN: 0-442-00461-3).
    Comments from readers of comp.ai.neural-nets: "Several neural network topics are discussed e.g. Probalistic Neural Networks, Backpropagation and beyond, neural control, Radial Basis Function Networks, Neural Engineering. Furthermore, several subjects related to neural networks are mentioned e.g. genetic algorithms, fuzzy logic, chaos. Just the functionality of these subjects is described; enough to get you started. Lots of references are given to more elaborate descriptions. Easy to read, no extensive mathematical background necessary."

    Zeidenberg. M. (1990). Neural Networks in Artificial Intelligence. Ellis Horwood, Ltd., Chichester.
    Comments from readers of comp.ai.neural-nets: "Gives the AI point of view".

     Zornetzer, S. F., Davis, J. L. and Lau, C. (1990). An Introduction to Neural and Electronic Networks. Academic Press. (ISBN 0-12-781881-2)
    Comments from readers of comp.ai.neural-nets: "Covers quite a broad range of topics (collection of articles/papers )."; "Provides a primer-like introduction and overview for a broad audience, and employs a strong interdisciplinary emphasis".

     Zurada, Jacek M. (1992). Introduction To Artificial Neural Systems. Hardcover, 785 Pages, 317 Figures, ISBN 0-534-95460-X, 1992, PWS Publishing Company, Price: $56.75 (includes shipping, handling, and the ANS software diskette). Solutions Manual available.
    Comments from readers of comp.ai.neural-nets: "Cohesive and comprehensive book on neural nets; as an engineering-oriented introduction, but also as a research foundation. Thorough exposition of fundamentals, theory and applications. Training and recall algorithms appear in boxes showing steps of algorithms, thus making programming of learning paradigms easy. Many illustrations and intuitive examples. Winner among NN textbooks at a senior UG/first year graduate level-[175 problems]." Contents: Intro, Fundamentals of Learning, Single-Layer & Multilayer Perceptron NN, Assoc. Memories, Self-organizing and Matching Nets, Applications, Implementations, Appendix)

    Books with Source Code (C, C++)

    Blum, Adam (1992), Neural Networks in C++, Wiley.

    Review by Ian Cresswell. (For a review of the text, see "The Worst" below.)

    Mr Blum has not only contributed a masterpiece of NN inaccuracy but also seems to lack a fundamental understanding of Object Orientation.

    The excessive use of virtual methods (see page 32 for example), the inclusion of unnecessary 'friend' relationships (page 133) and a penchant for operator overloading (pick a page!) demonstrate inability in C++ and/or OO.

    The introduction to OO that is provided trivialises the area and demonstrates a distinct lack of direction and/or understanding.

    The public interfaces to classes are overspecified and the design relies upon the flawed neuron/layer/network model.

    There is a notable disregard for any notion of a robust class hierarchy which is demonstrated by an almost total lack of concern for inheritance and associated reuse strategies.

    The attempt to rationalise differing types of Neural Network into a single very shallow but wide class hierarchy is naive.

    The general use of the 'float' data type would cause serious hassle if this software could possibly be extended to use some of the more sensitive variants of backprop on more difficult problems. It is a matter of great fortune that such software is unlikely to be reusable and will therefore, like all good dinosaurs, disappear with the passage of time.

    The irony is that there is a card in the back of the book asking the unfortunate reader to part with a further $39.95 for a copy of the software (already included in print) on a 5.25" disk.

    The author claims that his work provides an 'Object Oriented Framework ...'. This can best be put in his own terms (Page 137):

    ... garble(float noise) ...

    Swingler, K. (1996), Applying Neural Networks: A Practical Guide, London: Academic Press.

    Review by Ian Cresswell. (For a review of the text, see "The Worst" below.)

    Before attempting to review the code associated with this book it should be clearly stated that it is supplied as an extra--almost as an afterthought. This may be a wise move.

    Although not as bad as other (even commercial) implementations, the code provided lacks proper OO structure and is typical of C++ written in a C style.

    Style criticisms include:

  • The use of public data fields within classes (loss of encapsulation).
  • Classes with no protected or private sections.
  • Little or no use of inheritance and/or run-time polymorphism.
  • Use of floats not doubles (a common mistake) to store values for connection weights.
  • Overuse of classes and public methods. The network class has 59 methods in its public section.
  • Lack of planning is evident for the construction of a class hierarchy.

    This code is without doubt written by a rushed C programmer. Whilst it would require a C++ compiler to be successfully used, it lacks the tight (optimised) nature of good C and the high level of abstraction of good C++.

    In a generous sense the code is free and the author doesn't claim any expertise in software engineering. It works in a limited sense but would be difficult to extend and/or reuse. It's fine for demonstration purposes in a stand-alone manner and for use with the book concerned.

    If you're serious about nets you'll end up rewriting the whole lot (or getting something better).

    The Worst

    How not to use neural nets in any programming language

    Blum, Adam (1992), Neural Networks in C++, NY: Wiley.

    Welstead, Stephen T. (1994), Neural Network and Fuzzy Logic Applications in C/C++, NY: Wiley.

    (For a review of Blum's source code, see "Books with Source Code" above.)

    Both Blum and Welstead contribute to the dangerous myth that any idiot can use a neural net by dumping in whatever data are handy and letting it train for a few days. They both have little or no discussion of generalization, validation, and overfitting. Neither provides any valid advice on choosing the number of hidden nodes. If you have ever wondered where these stupid "rules of thumb" that pop up frequently come from, here's a source for one of them:

    "A rule of thumb is for the size of this [hidden] layer to be somewhere between the input layer size ... and the output layer size ..." Blum, p. 60.
    (John Lazzaro tells me he recently "reviewed a paper that cited this rule of thumb--and referenced this book! Needless to say, the final version of that paper didn't include the reference!")

    Blum offers some profound advice on choosing inputs:

    "The next step is to pick as many input factors as possible that might be related to [the target]."
    Blum also shows a deep understanding of statistics:
    "A statistical model is simply a more indirect way of learning correlations. With a neural net approach, we model the problem directly." p. 8.
    Blum at least mentions some important issues, however simplistic his advice may be. Welstead just ignores them. What Welstead gives you is code--vast amounts of code. I have no idea how anyone could write that much code for a simple feedforward NN. Welstead's approach to validation, in his chapter on financial forecasting, is to reserve two cases for the validation set!

    My comments apply only to the text of the above books. I have not examined or attempted to compile the code.

    An impractical guide to neural nets

    Swingler, K. (1996), Applying Neural Networks: A Practical Guide, London: Academic Press.
    (For a review of the source code, see "Books with Source Code" above.)

    This book has lots of good advice liberally sprinkled with errors, incorrect formulas, some bad advice, and some very serious mistakes. Experts will learn nothing, while beginners will be unable to separate the useful information from the dangerous. For example, there is a chapter on "Data encoding and re-coding" that would be very useful to beginners if it were accurate, but the formula for the standard deviation is wrong, and the description of the softmax function is of something entirely different than softmax (see What is a softmax activation function?). Even more dangerous is the statement on p. 28 that "Any pair of variables with high covariance are dependent, and one may be chosen to be discarded." Although high correlations can be used to identify redundant inputs, it is incorrect to use high covariances for this purpose, since a covariance can be high simply because one of the inputs has a high standard deviation.
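
    For reference, the correct definitions are easy to state: the sample standard deviation is the square root of the sum of squared deviations from the mean divided by n-1, and softmax maps values q_i to exp(q_i) divided by the sum of exp(q_j) over all j, so the outputs are positive and sum to one. A small illustrative sketch (not code from the book):

        #include <cmath>
        #include <cstddef>
        #include <cstdio>
        #include <vector>

        // Sample standard deviation: sqrt( sum_i (x_i - mean)^2 / (n - 1) ).
        double sample_std_dev(const std::vector<double>& x) {
            const double n = static_cast<double>(x.size());
            double mean = 0.0;
            for (double v : x) mean += v;
            mean /= n;
            double ss = 0.0;
            for (double v : x) ss += (v - mean) * (v - mean);
            return std::sqrt(ss / (n - 1.0));
        }

        // Softmax: out_i = exp(q_i) / sum_j exp(q_j); the outputs are positive
        // and sum to one.  (The maximum is subtracted first for numerical safety.)
        std::vector<double> softmax(const std::vector<double>& q) {
            double qmax = q[0];
            for (double v : q) if (v > qmax) qmax = v;
            double z = 0.0;
            std::vector<double> out(q.size());
            for (std::size_t i = 0; i < q.size(); ++i) {
                out[i] = std::exp(q[i] - qmax);
                z += out[i];
            }
            for (double& v : out) v /= z;
            return out;
        }

        int main() {
            std::vector<double> x = {1.0, 2.0, 3.0, 4.0};
            std::printf("std dev = %.4f\n", sample_std_dev(x));   // about 1.2910
            std::vector<double> p = softmax({1.0, 2.0, 3.0});
            std::printf("softmax = %.3f %.3f %.3f\n", p[0], p[1], p[2]);
            return 0;
        }
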

    The most ludicrous thing I've found in the book is the claim that Hecht-Nielsen used Kolmogorov's theorem to show that "you will never require more than twice the number of hidden units as you have inputs" (p. 53) in an MLP with one hidden layer. Actually, Hecht-Nielsen says "the direct usefulness of this result is doubtful, because no constructive method for developing the [output activation] functions is known." Then Swingler implies that V. Kurkova (1991, "Kolmogorov's theorem is relevant," Neural Computation, 3, 617-622) confirmed this alleged upper bound on the number of hidden units, saying that "Kurkova was able to restate Kolmogorov's theorem in terms of a set of sigmoidal functions." If Kolmogorov's theorem, or Hecht-Nielsen's adaptation of it, could be restated in terms of known sigmoid activation functions in the (single) hidden and output layers, then Swingler's alleged upper bound would be correct, but in fact no such restatement of Kolmogorov's theorem is possible, and Kurkova did not claim to prove any such restatement. Swingler omits the crucial details that Kurkova used two hidden layers, staircase-like activation functions (not ordinary sigmoidal functions such as the logistic) in the first hidden layer, and a potentially large number of units in the second hidden layer. Kurkova later estimated the number of units required for uniform approximation within an error epsilon as nm(m+1) in the first hidden layer and m^2(m+1)^n in the second hidden layer, where n is the number of inputs and m "depends on epsilon/||f|| as well as on the rate with which f increases distances." In other words, Kurkova says nothing to support Swingler's advice (repeated on p. 55), "Never choose h to be more than twice the number of input units." Furthermore, constructing a counterexample to Swingler's advice is trivial: use one input and one output, where the output is the sine of the input, and the domain of the input extends over many cycles of the sine wave; it is obvious that many more than two hidden units are required. For some sound information on choosing the number of hidden units, see How many hidden units should I use?

    Choosing the number of hidden units is one important aspect of getting good generalization, which is the most crucial issue in neural network training. There are many other considerations involved in getting good generalization, and Swingler makes several more mistakes in this area:

  • There is dangerous misinformation on p. 55, where Swingler says, "If a data set contains no noise, then there is no risk of overfitting as there is nothing to overfit." It is true that overfitting is more common with noisy data, but severe overfitting can occur with noise-free data, even when there are more training cases than weights. There is an example of such overfitting under How many hidden layers should I use?
  • Regarding the use of added noise (jitter) in training, Swingler says on p. 60, "The more noise you add, the more general your model becomes." This statement makes no sense as it stands (it would make more sense if "general" were changed to "smooth"), but it could certainly encourage a beginner to use far too much jitter--see What is jitter? (Training with noise).
  • On p. 109, Swingler describes leave-one-out cross-validation, which he ascribes to Hecht-Nielsen. But Swingler concludes, "the method provides you with L minus 1 networks to choose from; none of which has been validated properly," completely missing the point that cross-validation provides an estimate of the generalization error of a network trained on the entire training set of L cases--see What are cross-validation and bootstrapping? Also, there are L leave-one-out networks, not L-1. (A small sketch of the procedure follows this list.)
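
    To make the procedure concrete, here is a small sketch of leave-one-out cross-validation: with L cases, L models are trained, each on L-1 cases; each is tested on the single case it did not see; and the L test errors are averaged to estimate the generalization error of a model trained on all L cases. The "training" step below is a deliberately trivial stand-in (predicting the mean of the training targets), used only to keep the sketch self-contained:

        #include <cstddef>
        #include <cstdio>
        #include <vector>

        // Trivial stand-in for "train a network": the fitted model is just the
        // mean of the training targets.  Any real training procedure could be
        // substituted without changing the cross-validation logic.
        double train_mean_model(const std::vector<double>& targets) {
            double sum = 0.0;
            for (double t : targets) sum += t;
            return sum / targets.size();
        }

        // Leave-one-out cross-validation: L models are trained (one per held-out
        // case), and the average squared error on the held-out cases estimates
        // the generalization error of a model trained on all L cases.
        double leave_one_out_mse(const std::vector<double>& targets) {
            const std::size_t L = targets.size();
            double total_sq_err = 0.0;
            for (std::size_t held_out = 0; held_out < L; ++held_out) {
                std::vector<double> training;
                for (std::size_t i = 0; i < L; ++i)
                    if (i != held_out) training.push_back(targets[i]);
                const double prediction = train_mean_model(training);
                const double err = prediction - targets[held_out];
                total_sq_err += err * err;
            }
            return total_sq_err / L;
        }

        int main() {
            std::vector<double> targets = {1.0, 2.0, 3.0, 4.0, 5.0};
            std::printf("leave-one-out MSE estimate: %.4f\n",
                        leave_one_out_mse(targets));
            return 0;
        }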

    While Swingler has some knowledge of statistics, his expertise is not sufficient for him to detect that certain articles on neural nets are statistical nonsense. For example, on pp. 139-140 he uncritically reports a method that allegedly obtains error bars by doing a simple linear regression on the target vs. output scores. To a trained statistician, this method is obviously wrong (and, as usual in this book, the formula for variance given for this method on p. 150 is wrong). On p. 110, Swingler reports an article that attempts to apply bootstrapping to neural nets, but this article is also obviously wrong to anyone familiar with bootstrapping. While Swingler cannot be blamed entirely for accepting these articles at face value, such misinformation provides yet more hazards for beginners.

    Swingler addresses many important practical issues, and often provides good practical advice. But the peculiar combination of much good advice with some extremely bad advice, a few examples of which are provided above, could easily seduce a beginner into thinking that the book as a whole is reliable. It is this danger that earns the book a place in "The Worst" list.

    Bad science writing

    Dewdney, A.K. (1997), Yes, We Have No Neutrons: An Eye-Opening Tour through the Twists and Turns of Bad Science, NY: Wiley.
    This book, allegedly an expose of bad science, contains only one chapter of 19 pages on "the neural net debacle" (p. 97). Yet this chapter is so egregiously misleading that the book has earned a place on "The Worst" list. A detailed criticism of this chapter, along with some other sections of the book, can be found at ftp://ftp.sas.com/pub/neural/badscience.html. Other chapters of the book are reviewed in the November, 1997, issue of Scientific American.
    ------------------------------------------------------------------------

    Subject: Journals and magazines about Neural Networks?

    [to be added: comments on speed of reviewing and publishing,
                  whether they accept TeX format or ASCII by e-mail, etc.]

    A. Dedicated Neural Network Journals:

    Title:   Neural Networks
    Publish: Pergamon Press
    Address: Pergamon Journals Inc., Fairview Park, Elmsford,
             New York 10523, USA and Pergamon Journals Ltd.
             Headington Hill Hall, Oxford OX3, 0BW, England
    Freq.:   10 issues/year (vol. 1 in 1988)
    Cost/Yr: Free with INNS or JNNS or ENNS membership ($45?),
             Individual $65, Institution $175
    ISSN #:  0893-6080
    WWW:     http://www.elsevier.nl/locate/inca/841
    Remark:  Official Journal of International Neural Network Society (INNS),
             European Neural Network Society (ENNS) and Japanese Neural
             Network Society (JNNS).
             Contains Original Contributions, Invited Review Articles, Letters
             to Editor, Book Reviews, Editorials, Announcements, Software Surveys.
    
    Title:   Neural Computation
    Publish: MIT Press
    Address: MIT Press Journals, 55 Hayward Street Cambridge,
             MA 02142-9949, USA, Phone: (617) 253-2889
    Freq.:   Quarterly (vol. 1 in 1989)
    Cost/Yr: Individual $45, Institution $90, Students $35; Add $9 Outside USA
    ISSN #:  0899-7667
    URL:     http://mitpress.mit.edu/journals-legacy.tcl
    Remark:  Combination of Reviews (10,000 words), Views (4,000 words)
             and Letters (2,000 words).  I have found this journal to be of
             outstanding quality.
             (Note: Remarks supplied by Mike Plonski "plonski@aero.org")
    
    Title:   IEEE Transactions on Neural Networks
    Publish: Institute of Electrical and Electronics Engineers (IEEE)
    Address: IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ,
             08855-1331 USA. Tel: (201) 981-0060
    Cost/Yr: $10 for Members belonging to participating IEEE societies
    Freq.:   Quarterly (vol. 1 in March 1990)
    URL:     http://www.ieee.org/nnc/pubs/transactions.html
    Remark:  Devoted to the science and technology of neural networks
             which disclose significant  technical knowledge, exploratory
             developments and applications of neural networks from biology to
             software to hardware.  Emphasis is on artificial neural networks.
             Specific aspects include self organizing systems, neurobiological
             connections, network dynamics and architecture, speech recognition,
             electronic and photonic implementation, robotics and controls.
             Includes Letters concerning new research results.
             (Note: Remarks are from journal announcement)
    
    Title:   IEEE Transactions on Evolutionary Computation
    Publish: Institute of Electrical and Electronics Engineers (IEEE)
    Address: IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ,
             08855-1331 USA. Tel: (201) 981-0060
    Cost/Yr: $10 for Members belonging to participating IEEE societies
    Freq.:   Quarterly (vol. 1 in May 1997)
    URL:     http://engine.ieee.org/nnc/pubs/transactions.html
    Remark:  The IEEE Transactions on Evolutionary Computation will publish archival
             journal quality original papers in evolutionary computation and related
             areas, with particular emphasis on the practical application of the
             techniques to solving real problems in industry, medicine, and other
             disciplines.  Specific techniques include but are not limited to
             evolution strategies, evolutionary programming, genetic algorithms, and
             associated methods of genetic programming and classifier systems.  Papers
             emphasizing mathematical results should ideally seek to put these results
             in the context of algorithm design, however purely theoretical papers will
             be considered.  Other papers in the areas of cultural algorithms, artificial
             life, molecular computing, evolvable hardware, and the use of simulated
             evolution to gain a better understanding of naturally evolved systems are
             also encouraged.
             (Note: Remarks are from journal CFP)
    
    Title:   International Journal of Neural Systems
    Publish: World Scientific Publishing
    Address: USA: World Scientific Publishing Co., 1060 Main Street, River Edge,
             NJ 07666. Tel: (201) 487 9655; Europe: World Scientific Publishing
             Co. Ltd., 57 Shelton Street, London WC2H 9HE, England.
             Tel: (0171) 836 0888; Asia: World Scientific Publishing Co. Pte. Ltd.,
             1022 Hougang Avenue 1 #05-3520, Singapore 1953, Rep. of Singapore
             Tel: 382 5663.
    Freq.:   Quarterly (Vol. 1 in 1990)
    Cost/Yr: Individual $122, Institution $255 (plus $15-$25 for postage)
    ISSN #:  0129-0657 (IJNS)
    Remark:  The International Journal of Neural Systems is a quarterly
             journal which covers information processing in natural
             and artificial neural systems. Contributions include research papers,
             reviews, and Letters to the Editor - communications under 3,000
             words in length, which are published within six months of receipt.
             Other contributions are typically published within nine months.
             The journal presents a fresh undogmatic attitude towards this
             multidisciplinary field and aims to be a forum for novel ideas and
             improved understanding of collective and cooperative phenomena with
             computational capabilities.
             Papers should be submitted to World Scientific's UK office. Once a
             paper is accepted for publication, authors are invited to e-mail
             the LaTeX source file of their paper in order to expedite publication.
    
    Title:   International Journal of Neurocomputing
    Publish: Elsevier Science Publishers, Journal Dept.; PO Box 211;
             1000 AE Amsterdam, The Netherlands
    Freq.:   Quarterly (vol. 1 in 1989)
    WWW:     http://www.elsevier.nl/locate/inca/505628
    
    Title:   Neural Processing Letters
    Publish: Kluwer Academic publishers
    Address: P.O. Box 322, 3300 AH Dordrecht, The Netherlands
    Freq:    6 issues/year (vol. 1 in 1994)
    Cost/Yr: Individuals $198, Institution $400 (including postage)
    ISSN #:  1370-4621
    URL:     http://www.wkap.nl/journalhome.htm/1370-4621
    Remark:  The aim of the journal is to rapidly publish new ideas, original
             developments and work in progress.  Neural Processing Letters
             covers all aspects of the Artificial Neural Networks field.
             Publication delay is about 3 months.
    
    Title:   Neural Network News
    Publish: AIWeek Inc.
    Address: Neural Network News, 2555 Cumberland Parkway, Suite 299,
             Atlanta, GA 30339 USA. Tel: (404) 434-2187
    Freq.:   Monthly (beginning September 1989)
    Cost/Yr: USA and Canada $249, Elsewhere $299
    Remark:  Commercial Newsletter
    
    Title:   Network: Computation in Neural Systems
    Publish: IOP Publishing Ltd
    Address: Europe: IOP Publishing Ltd, Techno House, Redcliffe Way, Bristol
             BS1 6NX, UK; IN USA: American Institute of Physics, Subscriber
             Services 500 Sunnyside Blvd., Woodbury, NY  11797-2999
    Freq.:   Quarterly (1st issue 1990)
    Cost/Yr: USA: $180,  Europe: 110 pounds
    Remark:  Description: "a forum for integrating theoretical and experimental
             findings across relevant interdisciplinary boundaries."  Contents:
             Submitted articles reviewed by two technical referees  paper's
             interdisciplinary format and accessibility."  Also Viewpoints and
             Reviews commissioned by the editors, abstracts (with reviews) of
             articles published in other journals, and book reviews.
             Comment: While the price discourages me (my comments are based
             upon a free sample copy), I think that the journal succeeds
             very well.  The highest density of interesting articles I
             have found in any journal.
             (Note: Remarks supplied by kehoe@csufres.CSUFresno.EDU)
    
    Title:   Connection Science: Journal of Neural Computing,
             Artificial Intelligence and Cognitive Research
    Publish: Carfax Publishing
    Address: Europe: Carfax Publishing Company, PO Box 25, Abingdon, Oxfordshire
             OX14 3UE, UK.
             USA: Carfax Publishing Company, PO Box 2025, Dunnellon, Florida
             34430-2025, USA
             Australia: Carfax Publishing Company, Locked Bag 25, Deakin,
             ACT 2600, Australia
    Freq.:   Quarterly (vol. 1 in 1989)
    Cost/Yr: Personal rate:
             48 pounds (EC) 66 pounds (outside EC) US$118 (USA and Canada)
             Institutional rate:
             176 pounds (EC) 198 pounds (outside EC) US$340 (USA and Canada)
    
    Title:   International Journal of Neural Networks
    Publish: Learned Information
    Freq.:   Quarterly (vol. 1 in 1989)
    Cost/Yr: 90 pounds
    ISSN #:  0954-9889
    Remark:  The journal contains articles, a conference report (at least the
             issue I have), news and a calendar.
             (Note: remark provided by J.R.M. Smits "anjos@sci.kun.nl")
    
    Title:   Sixth Generation Systems (formerly Neurocomputers)
    Publish: Gallifrey Publishing
    Address: Gallifrey Publishing, PO Box 155, Vicksburg, Michigan, 49097, USA
             Tel: (616) 649-3772, 649-3592 fax
    Freq.:   Monthly (1st issue January, 1987)
    ISSN #:  0893-1585
    Editor:  Derek F. Stubbs
    Cost/Yr: $79 (USA, Canada), US$95 (elsewhere)
    Remark:  Runs eight to 16 pages monthly. In 1995 will go to floppy disc-based
             publishing with databases +, "the equivalent to 50 pages per issue
             are planned." Often focuses on specific topics: e.g., August, 1994
             contains two articles: "Economics, Times Series and the Market,"
             and "Finite Particle Analysis - [part] II."  Stubbs also directs
             the company Advanced Forecasting Technologies.
             (Remark by Ed Rosenfeld: ier@aol.com)
    
    Title:   JNNS Newsletter (Newsletter of the Japan Neural Network Society)
    Publish: The Japan Neural Network Society
    Freq.:   Quarterly (vol. 1 in 1989)
    Remark:  (IN JAPANESE LANGUAGE) Official Newsletter of the Japan Neural
             Network Society(JNNS)
             (Note: remarks by Osamu Saito "saito@nttica.NTT.JP")
    
    Title:   Neural Networks Today
    Remark:  I found this title on a bulletin board in October of last year.
             It was a message of Tim Pattison, timpatt@augean.OZ
             (Note: remark provided by J.R.M. Smits "anjos@sci.kun.nl")
    
    Title:   Computer Simulations in Brain Science
    
    Title:   International Journal of Neuroscience
    
    Title:   Neural Network Computation
    Remark:  Possibly the same as "Neural Computation"
    
    Title:   Neural Computing and Applications
    Freq.:   Quarterly
    Publish: Springer Verlag
    Cost/yr: 120 Pounds
    Remark:  Is the journal of the Neural Computing Applications Forum.
             Publishes original research and other information
             in the field of practical applications of neural computing.

    B. NN Related Journals:

    Title:   Complex Systems
    Publish: Complex Systems Publications
    Address: Complex Systems Publications, Inc., P.O. Box 6149, Champaign,
             IL 61821-8149, USA
    Freq.:   6 times per year (1st volume is 1987)
    ISSN #:  0891-2513
    Cost/Yr: Individual $75, Institution $225
    Remark:  The journal COMPLEX SYSTEMS is devoted to the rapid publication of
             research on the science, mathematics, and engineering of systems
             with simple components but complex overall behavior. Send mail to
             "jcs@jaguar.ccsr.uiuc.edu" for additional info.
             (Remark is from announcement on Net)
    
    Title:   Biological Cybernetics (Kybernetik)
    Publish: Springer Verlag
    Remark:  Monthly (vol. 1 in 1961)
    
    Title:   Various IEEE Transactions and Magazines
    Publish: IEEE
    Remark:  Primarily see IEEE Trans. on System, Man and Cybernetics;
             Various Special Issues: April 1990 IEEE Control Systems
             Magazine.; May 1989 IEEE Trans. Circuits and Systems.;
             July 1988 IEEE Trans. Acoust. Speech Signal Process.
    
    Title:   The Journal of Experimental and Theoretical Artificial Intelligence
    Publish: Taylor & Francis, Ltd.
    Address: London, New York, Philadelphia
    Freq.:   ? (1st issue Jan 1989)
    Remark:  For submission information, please contact either of the editors:
             Eric Dietrich                        Chris Fields
             PACSS - Department of Philosophy     Box 30001/3CRL
             SUNY Binghamton                      New Mexico State University
             Binghamton, NY 13901                 Las Cruces, NM 88003-0001
             dietrich@bingvaxu.cc.binghamton.edu  cfields@nmsu.edu
    
    Title:   The Behavioral and Brain Sciences
    Publish: Cambridge University Press
    Remark:  (Expensive as hell, I'm sure.)
             This is a delightful journal that encourages discussion on a
             variety of controversial topics.  I have especially enjoyed
             reading some papers in there by Dana Ballard and Stephen
             Grossberg (separate papers, not collaborations) a few years
             back.  They have a really neat concept: they get a paper,
             then invite a number of noted scientists in the field to
             praise it or trash it.  They print these commentaries, and
             give the author(s) a chance to make a rebuttal or
             concurrence.  Sometimes, as I'm sure you can imagine, things
             get pretty lively.  I'm reasonably sure they are still at
             it--I think I saw them make a call for reviewers a few
             months ago.  Their reviewers are called something like
             Behavioral and Brain Associates, and I believe they have to
             be nominated by current associates, and should be fairly
             well established in the field.  That's probably more than I
             really know about it but maybe if you post it someone who
             knows more about it will correct any errors I have made.
             The main thing is that I liked the articles I read. (Note:
             remarks by Don Wunsch )
    
    Title:   International Journal of Applied Intelligence
    Publish: Kluwer Academic Publishers
    Remark:  first issue in 1990(?)
    
    Title:   Bulletin of Mathematical Biology
    
    Title:   Intelligence
    
    Title:   Journal of Mathematical Biology
    
    Title:   Journal of Complex System
    
    Title:   International Journal of Modern Physics C
    Publish: USA: World Scientific Publishing Co., 1060 Main Street, River Edge,
             NJ 07666. Tel: (201) 487 9655; Europe: World Scientific Publishing
             Co. Ltd., 57 Shelton Street, London WC2H 9HE, England.
             Tel: (0171) 836 0888; Asia: World Scientific Publishing Co. Pte. Ltd.,
             1022 Hougang Avenue 1 #05-3520, Singapore 1953, Rep. of Singapore
             Tel: 382 5663.
    Freq:    bi-monthly
    Eds:     H. Herrmann, R. Brower, G.C. Fox and S Nose
    
    Title:   Machine Learning
    Publish: Kluwer Academic Publishers
    Address: Kluwer Academic Publishers
             P.O. Box 358
             Accord Station
             Hingham, MA 02018-0358 USA
    Freq.:   Monthly (8 issues per year; increasing to 12 in 1993)
    Cost/Yr: Individual $140 (1992); Member of AAAI or CSCSI $88
    Remark:  Description: Machine Learning is an international forum for
             research on computational approaches to learning.  The journal
             publishes articles reporting substantive research results on a
             wide range of learning methods applied to a variety of task
             domains.  The ideal paper will make a theoretical contribution
             supported by a computer implementation.
             The journal has published many key papers in learning theory,
             reinforcement learning, and decision tree methods.  Recently
             it has published a special issue on connectionist approaches
             to symbolic reasoning.  The journal regularly publishes
             issues devoted to genetic algorithms as well.
    
    Title:   INTELLIGENCE - The Future of Computing
    Publish: Intelligence
    Address: INTELLIGENCE, P.O. Box 20008, New York, NY 10025-1510, USA,
    212-222-1123 voice & fax; email: ier@aol.com, CIS: 72400,1013
    Freq.:   Monthly plus four special reports each year (1st issue: May, 1984)
    ISSN #:  1042-4296
    Editor:  Edward Rosenfeld
    Cost/Yr: $395 (USA), US$450 (elsewhere)
    Remark:  Has absorbed several other newsletters, like Synapse/Connection
             and Critical Technology Trends (formerly AI Trends).
             Covers NN, genetic algorithms, fuzzy systems, wavelets, chaos
             and other advanced computing approaches, as well as molecular
             computing and nanotechnology.
    
    Title:   Journal of Physics A: Mathematical and General
    Publish: Inst. of Physics, Bristol
    Freq:    24 issues per year.
    Remark:  Statistical mechanics aspects of neural networks
             (mostly Hopfield models).
    
    Title:   Physical Review A: Atomic, Molecular and Optical Physics
    Publish: The American Physical Society (Am. Inst. of Physics)
    Freq:    Monthly
    Remark:  Statistical mechanics of neural networks.
    
    Title:   Information Sciences
    Publish: North Holland (Elsevier Science)
    Freq.:   Monthly
    ISSN:    0020-0255
    Editor:  Paul P. Wang; Department of Electrical Engineering; Duke University;
             Durham, NC 27706, USA

    C. Journals loosely related to NNs:

    Title:   JOURNAL OF COMPLEXITY
    Remark:  (Must rank alongside Wolfram's Complex Systems)
    
    Title:   IEEE ASSP Magazine
    Remark:  (April 1987 had the Lippmann intro. which everyone likes to cite)
    
    Title:   ARTIFICIAL INTELLIGENCE
    Remark:  (Vol 40, September 1989 had the survey paper by Hinton)
    
    Title:   COGNITIVE SCIENCE
    Remark:  (the Boltzmann machine paper by Ackley et al. appeared here
             in Vol 9, 1985)
    
    Title:   COGNITION
    Remark:  (Vol 28, March 1988 contained the Fodor and Pylyshyn
             critique of connectionism)
    
    Title:   COGNITIVE PSYCHOLOGY
    Remark:  (no comment!)
    
    Title:   JOURNAL OF MATHEMATICAL PSYCHOLOGY
    Remark:  (several good book reviews)
    ------------------------------------------------------------------------

    Subject: Conferences and Workshops on Neural Networks?

  • The journal "Neural Networks" has a list of conferences, workshops and meetings in each issue. It is also available on the WWW from http://www.ph.kcl.ac.uk/neuronet/bakker.html.
  • The IEEE Neural Network Council maintains a list of conferences at http://www.ieee.org/nnc.
  • Conferences, workshops, and other events concerned with neural networks, vision, speech, and related fields are listed at Georg Thimm's web page http://www.idiap.ch/~thimm.
    ------------------------------------------------------------------------

    Subject: Neural Network Associations?

    International Neural Network Society (INNS).

    INNS membership includes subscription to "Neural Networks", the official journal of the society. Membership is $55 for non-students and $45 for students per year. Address: INNS Membership, P.O. Box 491166, Ft. Washington, MD 20749.

    International Student Society for Neural Networks (ISSNNets).

    Membership is $5 per year. Address: ISSNNet, Inc., P.O. Box 15661, Boston, MA 02215 USA

    Women In Neural Network Research and Technology (WINNERS).

    Address: WINNERS, c/o Judith Dayhoff, 11141 Georgia Ave., Suite 206, Wheaton, MD 20902. Phone: 301-933-9000.

    European Neural Network Society (ENNS)

    ENNS membership includes subscription to "Neural Networks", the official journal of the society. Membership is currently (1994) 50 UK pounds (35 UK pounds for students) per year. Address: ENNS Membership, Centre for Neural Networks, King's College London, Strand, London WC2R 2LS, United Kingdom.

    Japanese Neural Network Society (JNNS)

    Address: Japanese Neural Network Society; Department of Engineering, Tamagawa University; 6-1-1, Tamagawa Gakuen, Machida City, Tokyo; 194 JAPAN; Phone: +81 427 28 3457, Fax: +81 427 28 3597

    Association des Connexionnistes en THese (ACTH)

    (the French Student Association for Neural Networks); Membership is 100 FF per year; Activities: newsletter, conference (every year), list of members, electronic forum; Journal 'Valgo' (ISSN 1243-4825); WWW page: http://www.supelec-rennes.fr/acth/welcome.html ; Contact: acth@loria.fr

    Neurosciences et Sciences de l'Ingenieur (NSI)

    Biology & Computer Science; Activity: conference (every year); Address: NSI - TIRF / INPG, 46 avenue Felix Viallet, 38031 Grenoble Cedex, FRANCE

    IEEE Neural Networks Council

    Web page at http://www.ieee.org/nnc

    SNN (Foundation for Neural Networks)

    The Foundation for Neural Networks (SNN) is a university-based non-profit organization that stimulates basic and applied research on neural networks in the Netherlands. Every year SNN organizes a symposium on Neural Networks. See http://www.mbfys.kun.nl/SNN/.
     
     
    You can find nice lists of NN societies in the WWW at http://www.emsl.pnl.gov:2080/proj/neuron/neural/societies.html and at http://www.ieee.org:80/nnc/research/othernnsoc.html.
    ------------------------------------------------------------------------

    Subject: On-line and machine-readable information about NNs?

    See also "Other NN links?"

    Neuron Digest

    Internet Mailing List. From the welcome blurb: "Neuron-Digest is a list (in digest form) dealing with all aspects of neural networks (and any type of network or neuromorphic system)". To subscribe, send email to neuron-request@psych.upenn.edu. The ftp archives (including back issues) are available from psych.upenn.edu in pub/Neuron-Digest or by sending email to "archive-server@psych.upenn.edu". comp.ai.neural-nets readers also find the messages in that newsgroup in the form of digests.

    Usenet groups comp.ai.neural-nets (Oha!) and comp.theory.self-org-sys.

    There is a periodic posting on comp.ai.neural-nets sent by srctran@world.std.com (Gregory Aharonian) about Neural Network patents.

    USENET newsgroup comp.org.issnnet

    Forum for discussion of academic/student-related issues in NNs, as well as information on ISSNNet (see question "associations") and its activities.

    Central Neural System Electronic Bulletin Board

    Modem: 409-737-5222; Sysop: Wesley R. Elsberry; 4160 Pirates' Beach, Galveston, TX 77554; welsberr@orca.tamu.edu. Many MS-DOS PD and shareware simulations, source code, benchmarks, demonstration packages, information files; some Unix, Macintosh, Amiga related files. Also available are files on AI, AI Expert listings 1986-1991, fuzzy logic, genetic algorithms, artificial life, evolutionary biology, and many Project Gutenberg and Wiretap etexts. No user fees have ever been charged. Home of the NEURAL_NET Echo, available through FidoNet, RBBS-Net, and other EchoMail compatible bulletin board systems.

    AI CD-ROM

    Network Cybernetics Corporation produces the "AI CD-ROM". It is an ISO-9660 format CD-ROM and contains a large assortment of software related to artificial intelligence, artificial life, virtual reality, and other topics. Programs for OS/2, MS-DOS, Macintosh, UNIX, and other operating systems are included. Research papers, tutorials, and other text files are included in ASCII, RTF, and other universal formats. The files have been collected from AI bulletin boards, Internet archive sites, university computer departments, and other government and civilian AI research organizations. Network Cybernetics Corporation intends to release annual revisions to the AI CD-ROM to keep it up to date with current developments in the field. The AI CD-ROM includes collections of files that address many specific AI/AL topics including Neural Networks (Source code and executables for many different platforms including Unix, DOS, and Macintosh. ANN development tools, example networks, sample data, tutorials. A complete collection of Neural Digest is included as well.) The AI CD-ROM may be ordered directly by check, money order, bank draft, or credit card from:
            Network Cybernetics Corporation;
            4201 Wingren Road Suite 202;
            Irving, TX 75062-2763;
            Tel 214/650-2002;
            Fax 214/650-1929;
    The cost is $129 per disc + shipping ($5/disc domestic or $10/disc foreign) (See the comp.ai FAQ for further details)

    INTCON mailing list

    INTCON (Intelligent Control) is a moderated mailing list set up to provide a forum for communication and exchange of ideas among researchers in neuro-control, fuzzy logic control, reinforcement learning and other related subjects grouped under the topic of intelligent control. Send your subscribe requests to intcon-request@phoenix.ee.unsw.edu.au
     
     
    ------------------------------------------------------------------------

    Subject: How to benchmark learning methods?

    The NN benchmarking resources page at http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html was created after a NIPS 1995 workshop on NN benchmarking. The page contains pointers to various papers on proper benchmarking methodology and to various sources of datasets.

    Benchmark studies require some familiarity with the statistical design and analysis of experiments. There are many textbooks on this subject, of which Cohen (1995) will probably be of particular interest to researchers in neural nets and machine learning (see also the review of Cohen's book by Ron Kohavi in the International Journal of Neural Systems, which can be found on-line at http://robotics.stanford.edu/users/ronnyk/ronnyk-bib.html).
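
    As a very small taste of the kind of analysis involved, a paired comparison of two methods tested on the same test sets works with the per-test-set differences in error rather than with the raw errors. The sketch below (with made-up error rates) computes the mean difference, its standard error, and the corresponding t statistic; it is an illustration only, not a substitute for a properly designed experiment:

        #include <cmath>
        #include <cstddef>
        #include <cstdio>
        #include <vector>

        // Paired comparison of two methods evaluated on the same test sets:
        // work with the per-test-set differences in error, and report the mean
        // difference with its standard error.  (A full analysis would go on to
        // a t test or a confidence interval with n-1 degrees of freedom.)
        int main() {
            // Hypothetical error rates of methods A and B on five test sets.
            std::vector<double> err_a = {0.12, 0.15, 0.11, 0.14, 0.13};
            std::vector<double> err_b = {0.10, 0.14, 0.12, 0.11, 0.12};

            const std::size_t n = err_a.size();
            std::vector<double> diff(n);
            double mean = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                diff[i] = err_a[i] - err_b[i];
                mean += diff[i];
            }
            mean /= n;

            double ss = 0.0;
            for (double d : diff) ss += (d - mean) * (d - mean);
            const double std_err = std::sqrt(ss / (n - 1.0)) / std::sqrt(double(n));

            std::printf("mean difference (A - B): %.4f\n", mean);
            std::printf("standard error:          %.4f\n", std_err);
            std::printf("t statistic:             %.2f\n", mean / std_err);
            return 0;
        }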

    Reference:

    Cohen, P.R. (1995), Empirical Methods for Artificial Intelligence, Cambridge, MA: The MIT Press.
    ------------------------------------------------------------------------

    Subject: Databases for experimentation with NNs?

    UCI machine learning database

    A large collection of data sets accessible via anonymous FTP at ftp.ics.uci.edu [128.195.1.1] in directory /pub/machine-learning-databases or via web browser at http://www.ics.uci.edu/~mlearn/MLRepository.html

    The neural-bench Benchmark collection

    Accessible WWW at http://www.boltz.cs.cmu.edu/ or via anonymous FTP at ftp://ftp.boltz.cs.cmu.edu/pub/neural-bench/. In case of problems or if you want to donate data, email contact is "neural-bench@cs.cmu.edu". The data sets in this repository include the 'nettalk' data, 'two spirals', protein structure prediction, vowel recognition, sonar signal classification, and a few others.

    Proben1

    Proben1 is a collection of 12 learning problems consisting of real data. The datafiles all share a single simple common format. Along with the data comes a technical report describing a set of rules and conventions for performing and reporting benchmark tests and their results. Accessible via anonymous FTP on ftp.cs.cmu.edu [128.2.206.173] as /afs/cs/project/connect/bench/contrib/prechelt/proben1.tar.gz and also on ftp.ira.uka.de as /pub/neuron/proben1.tar.gz. The file is about 1.8 MB and unpacks into about 20 MB.

    Delve: Data for Evaluating Learning in Valid Experiments

    Delve is a standardised, copyrighted environment designed to evaluate the performance of learning methods. Delve makes it possible for users to compare their learning methods with other methods on many datasets. The Delve learning methods and evaluation procedures are well documented, such that meaningful comparisons can be made. The data collection includes not only isolated data sets, but "families" of data sets in which properties of the data, such as number of inputs and degree of nonlinearity or noise, are systematically varied. The Delve web page is at http://www.cs.toronto.edu/~delve/

    NIST special databases of the National Institute of Standards and Technology:

    Several large databases, each delivered on a CD-ROM. Here is a quick list.
  • NIST Binary Images of Printed Digits, Alphas, and Text
  • NIST Structured Forms Reference Set of Binary Images
  • NIST Binary Images of Handwritten Segmented Characters
  • NIST 8-bit Gray Scale Images of Fingerprint Image Groups
  • NIST Structured Forms Reference Set 2 of Binary Images
  • NIST Test Data 1: Binary Images of Hand-Printed Segmented Characters
  • NIST Machine-Print Database of Gray Scale and Binary Images
  • NIST 8-Bit Gray Scale Images of Mated Fingerprint Card Pairs
  • NIST Supplemental Fingerprint Card Data (SFCD) for NIST Special Database 9
  • NIST Binary Image Databases of Census Miniforms (MFDB)
  • NIST Mated Fingerprint Card Pairs 2 (MFCP 2)
  • NIST Scoring Package Release 1.0
  • NIST FORM-BASED HANDPRINT RECOGNITION SYSTEM

    Here are example descriptions of two of these databases:

    NIST special database 2: Structured Forms Reference Set (SFRS)

    The NIST database of structured forms contains 5,590 full page images of simulated tax forms completed using machine print. THERE IS NO REAL TAX DATA IN THIS DATABASE. The structured forms used in this database are 12 different forms from the 1988 IRS 1040 Package X. These include Forms 1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F and SE. Eight of these forms contain two pages or form faces, making a total of 20 form faces represented in the database. Each image is stored in bi-level black and white raster format. The images in this database appear to be real forms prepared by individuals but the images have been automatically derived and synthesized using a computer and contain no "real" tax data. The entry field values on the forms have been automatically generated by a computer in order to make the data available without the danger of distributing privileged tax information. In addition to the images the database includes 5,590 answer files, one for each image. Each answer file contains an ASCII representation of the data found in the entry fields on the corresponding image. Image format documentation and example software are also provided. The uncompressed database totals approximately 5.9 gigabytes of data.

    NIST special database 3: Binary Images of Handwritten Segmented Characters (HWSC)

    Contains 313,389 isolated character images segmented from the 2,100 full-page images distributed with "NIST Special Database 1": 223,125 digits, 44,951 upper-case, and 45,313 lower-case character images. Each character image has been centered in a separate 128 by 128 pixel region; the error rate of the segmentation and assigned classification is less than 0.1%. The uncompressed database totals approximately 2.75 gigabytes of image data and includes image format documentation and example software.

     The system requirements for all databases are a 5.25" CD-ROM drive with software to read ISO-9660 format. Contact: Darrin L. Dimmick; dld@magi.ncsl.nist.gov; (301)975-4147

     The prices of the databases are between US$250 and US$1895. If you wish to order a database, please contact: Standard Reference Data; National Institute of Standards and Technology; 221/A323; Gaithersburg, MD 20899; Phone: (301)975-2208; FAX: (301)926-0416

     Samples of the data can be found by ftp on sequoyah.ncsl.nist.gov in directory /pub/data. A more complete description of the available databases can be obtained from the same host as /pub/databases/catalog.txt.

    CEDAR CD-ROM 1: Database of Handwritten Cities, States, ZIP Codes, Digits, and Alphabetic Characters

    The Center of Excellence for Document Analysis and Recognition (CEDAR) at the State University of New York at Buffalo announces the availability of CEDAR CDROM 1: USPS Office of Advanced Technology. The database contains handwritten words and ZIP Codes in high resolution grayscale (300 ppi 8-bit) as well as binary handwritten digits and alphabetic characters (300 ppi 1-bit). This database is intended to encourage research in off-line handwriting recognition by providing access to handwriting samples digitized from envelopes in a working post office.
         Specifications of the database include:
         +    300 ppi 8-bit grayscale handwritten words (cities,
              states, ZIP Codes)
              o    5632 city words
              o    4938 state words
              o    9454 ZIP Codes
         +    300 ppi binary handwritten characters and digits:
              o    27,837 mixed alphas and numerics segmented
                   from address blocks
              o    21,179 digits segmented from ZIP Codes
         +    every image supplied with a manually determined
              truth value
         +    extracted from live mail in a working U.S. Post Office
         +    word images in the test set supplied with dictionaries
              of postal words that simulate partial recognition of
              the corresponding ZIP Code
         +    digit images included in the test set that simulate
              automatic ZIP Code segmentation; results on these data
              can be projected to overall ZIP Code recognition
              performance
         +    image format documentation and software included
    System requirements are a 5.25" CD-ROM drive with software to read ISO-9660 format. For further information, see http://www.cedar.buffalo.edu/Databases/CDROM1/ or send email to Ajay Shekhawat at <ajay@cedar.Buffalo.EDU>

    There is also a CEDAR CDROM-2, a database of machine-printed Japanese character images.

    AI-CD-ROM (see question "Other sources of information")

    Time series archive

    Various datasets of time series (to be used for prediction learning problems) are available for anonymous ftp from ftp.santafe.edu [192.12.12.1] in /pub/Time-Series. Problems are, for example: fluctuations in a far-infrared laser; Physiological data of patients with sleep apnea; High frequency currency exchange rate data; Intensity of a white dwarf star; J.S. Bach's final (unfinished) fugue from "Die Kunst der Fuge"

     Some of the datasets were used in a prediction contest and are described in detail in the book "Time series prediction: Forecasting the future and understanding the past", edited by Weigend/Gershenfeld, Proceedings Volume XV in the Santa Fe Institute Studies in the Sciences of Complexity series of Addison Wesley (1994).

    USENIX Faces

    The USENIX faces archive is a public database, accessible by ftp, that can be of use to people working in the fields of human face recognition, classification and the like. It currently contains 5592 different faces (taken at USENIX conferences) and is updated twice each year. The images are mostly 96x128 greyscale frontal images and are stored in ascii files in a way that makes it easy to convert them to any usual graphic format (GIF, PCX, PBM etc.). Source code for viewers, filters, etc. is provided. Each image file takes approximately 25K.

    For further information, see ftp://src.doc.ic.ac.uk/pub/packages/faces/README. Do NOT do a directory listing in the top directory of the face archive, as it contains over 2500 entries!

    According to the archive administrator, Barbara L. Dijker (barb.dijker@labyrinth.com), there is no restriction on using them. However, the image files are stored in separate directories corresponding to the Internet site to which the person represented in the image belongs, with each directory containing a small number of images (two on average). This makes it difficult to retrieve by ftp even a small part of the database, as you have to get each one individually.
    A solution, as Barbara proposed to me, would be to compress the whole set of images (in separate files of, say, 100 images) and maintain them as a specific archive for research on face processing, similar to the ones that already exist for fingerprints and others. The whole compressed database would take some 30 megabytes of disk space. I encourage anyone willing to host this database on his/her site, available for anonymous ftp, to contact her for details (unfortunately I don't have the resources to set up such a site).

    Please consider that UUNET has graciously provided the ftp server for the FaceSaver archive and may discontinue that service if it becomes a burden. This means that people should not download more than maybe 10 faces at a time from uunet.

    A last remark: each file represents a different person (except for isolated cases). This makes the database quite unsuitable for training neural networks, since for proper generalisation several instances of the same subject are required. However, it is still useful as a test set for a trained network.

    Astronomical Time Series

    Prepared by Paul L. Hertz (Naval Research Laboratory) & Eric D. Feigelson (Pennsylvania State University):
  • Detection of variability in photon counting observations 1 (QSO1525+337)
  • Detection of variability in photon counting observations 2 (H0323+022)
  • Detection of variability in photon counting observations 3 (SN1987A)
  • Detecting orbital and pulsational periodicities in stars 1 (binaries)
  • Detecting orbital and pulsational periodicities in stars 2 (variables)
  • Cross-correlation of two time series 1 (Sun)
  • Cross-correlation of two time series 2 (OJ287)
  • Periodicity in a gamma ray burster (GRB790305)
  • Solar cycles in sunspot numbers (Sun)
  • Deconvolution of sources in a scanning operation (HEAO A-1)
  • Fractal time variability in a Seyfert galaxy (NGC5506)
  • Quasi-periodic oscillations in X-ray binaries (GX5-1)
  • Deterministic chaos in an X-ray pulsar? (Her X-1)
  • URL: http://xweb.nrl.navy.mil/www_hertz/timeseries/timeseries.html

    Miscellaneous Images

    The USC-SIPI Image Database: http://sipi.usc.edu/services/database/Database.html

     CityU Image Processing Lab: http://www.image.cityu.edu.hk/images/database.html

     Center for Image Processing Research: http://cipr.rpi.edu/

     Computer Vision Test Images: http://www.cs.cmu.edu:80/afs/cs/project/cil/ftp/html/v-images.html

     Lenna 97: A Complete Story of Lenna: http://www.image.cityu.edu.hk/images/lenna/Lenna97.html
     
     

    StatLib

    The StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon University has a large collection of data sets, many of which can be used with NNs.
     
     
    ------------------------------------------------------------------------
    Next part is part 5 (of 7). Previous part is part 3.

    Subject: Freeware and shareware packages for NN simulation?

    Since the FAQ maintainer works for a software company, he does not recommend or evaluate software in the FAQ. The descriptions below are provided by the developers or distributors of the software.

    Note for future submissions: Please restrict software descriptions to a maximum of 60 lines of 72 characters, in either plain-text format or, preferably, HTML format. If you include the standard header (name, company, address, etc.), you need not count the header in the 60 line maximum. Please confine your HTML to features that are supported by most browsers, especially NCSA Mosaic 2.0; avoid tables, for example--use <pre> instead. Try to make the descriptions objective, and avoid making implicit or explicit assertions about competing products, such as "Our product is the *only* one that does so-and-so" or "Our innovative product trains bigger nets faster." The FAQ maintainer reserves the right to remove excessive marketing hype and to edit submissions to conform to size requirements; if he is in a good mood, he may also correct spelling and punctuation.

    The following simulators are described below:

  • Rochester Connectionist Simulator
  • UCLA-SFINX
  • NeurDS
  • PlaNet (formerly known as SunNet)
  • GENESIS
  • Mactivation
  • Cascade Correlation Simulator
  • Quickprop
  • DartNet
  • SNNS
  • Aspirin/MIGRAINES
  • ALN Workbench
  • PDP++
  • Uts (Xerion, the sequel)
  • Neocognitron simulator
  • Multi-Module Neural Computing Environment (MUME)
  • LVQ_PAK, SOM_PAK
  • Nevada Backpropagation (NevProp)
  • Fuzzy ARTmap
  • PYGMALION
  • Basis-of-AI-NN Software
  • Matrix Backpropagation
  • BIOSIM
  • The Brain
  • FuNeGen
  • NeuDL -- Neural-Network Description Language
  • NeoC Explorer
  • AINET
  • DemoGNG
  • PMNEURO 1.0a
  • nn/xnn
  • NNDT
  • Trajan 2.1 Shareware
  • Neural Networks at your Fingertips
  • NNFit
  • Nenet v1.0
  • Machine Consciousness Toolbox
  • NICO Toolkit (speech recognition)
  • SOM Toolbox for Matlab 5
  • See also http://www.emsl.pnl.gov:2080/proj/neuron/neural/systems/shareware.html

    Rochester Connectionist Simulator

    A quite versatile simulator program for arbitrary types of neural nets. Comes with a backprop package and an X11/SunView interface. Available via anonymous FTP from ftp.cs.rochester.edu in directory pub/packages/simulator as the files README (8 KB) and rcs_v4.2.tar.Z (2.9 MB)

    UCLA-SFINX

       ftp retina.cs.ucla.edu [131.179.16.6];
       Login name: sfinxftp;  Password: joshua;
       directory: pub;
       files : README; sfinx_v2.0.tar.Z;
       Email info request : sfinx@retina.cs.ucla.edu

    NeurDS

    Neural Design and Simulation System. This is a general purpose tool for building, running and analysing Neural Network Models in an efficient manner. NeurDS will compile and run virtually any Neural Network Model using a consistent user interface that may be either window or "batch" oriented. HP-UX 8.07 source code is available from http://hpux.u-aizu.ac.jp/hppd/hpux/NeuralNets/NeurDS-3.1/ or http://askdonna.ask.uni-karlsruhe.de/hppd/hpux/NeuralNets/NeurDS-3.1/

    PlaNet5.7 (formerly known as SunNet)

    A popular connectionist simulator created by Yoshiro Miyata (Chukyo Univ., Japan), with versions that run under X Windows or on non-graphics terminals. 60-page User's Guide in PostScript. Send any questions to miyata@sccs.chukyo-u.ac.jp. Available for anonymous ftp from ftp.ira.uka.de as /pub/neuron/PlaNet5.7.tar.gz (800 kb)

    GENESIS

    GENESIS 2.0 (GEneral NEural SImulation System) is a general purpose simulation platform which was developed to support the simulation of neural systems ranging from complex models of single neurons to simulations of large networks made up of more abstract neuronal components. Most current GENESIS applications involve realistic simulations of biological neural systems. Although the software can also model more abstract networks, other simulators are more suitable for backpropagation and similar connectionist modeling. Runs on most Unix platforms. Graphical front end XODUS. Parallel version for networks of workstations, symmetric multiprocessors, and MPPs also available. Available by ftp from ftp://genesis.bbb.caltech.edu/pub/genesis. Further information via WWW at http://www.bbb.caltech.edu/GENESIS/.

    Mactivation

    A neural network simulator for the Apple Macintosh. Available for ftp from ftp.cs.colorado.edu [128.138.243.151] as /pub/cs/misc/Mactivation-3.3.sea.hqx

    Cascade Correlation Simulator

    A simulator for Scott Fahlman's Cascade Correlation algorithm. Available for ftp from ftp.cs.cmu.edu in directory /afs/cs/project/connect/code/supported as the file cascor-v1.2.shar (223 KB) There is also a version of recurrent cascade correlation in the same directory in file rcc1.c (108 KB).

    Quickprop

    A variation of the back-propagation algorithm developed by Scott Fahlman. A simulator is available in the same directory as the cascade correlation simulator above in file nevprop1.16.shar (137 KB)
    (There is also an obsolete simulator called quickprop1.c (21 KB) in the same directory, but it has been superseded by NevProp. See also the description of NevProp below.)
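
    The idea behind quickprop, as described in Fahlman's 1988 report, is to treat the error as a separate parabola in each weight, fit the parabola to the current and previous slopes, and jump to its minimum. A bare sketch of that single-weight step (omitting the safeguards a real implementation needs, such as the maximum growth factor, and not taken from any of the simulators listed here):

        #include <cstdio>

        // One quickprop step for a single weight (sketch only).  S_prev and
        // S_curr are the error derivatives dE/dw at the previous and current
        // step, and dw_prev is the previous weight change.  The parabola fitted
        // through the two slopes has its minimum at the returned step.  Real
        // implementations add safeguards (a maximum growth factor, a gradient
        // term when dw_prev is zero).
        double quickprop_step(double S_prev, double S_curr, double dw_prev) {
            return dw_prev * S_curr / (S_prev - S_curr);
        }

        int main() {
            // Example: a previous step of +0.10 moved the slope from -0.8 to
            // -0.2, so the parabola's minimum is another +0.0333 away.
            std::printf("next step: %.4f\n", quickprop_step(-0.8, -0.2, 0.10));
            return 0;
        }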

    DartNet

    DartNet is a Macintosh-based backpropagation simulator, developed at Dartmouth by Jamshed Bharucha and Sean Nolan as a pedagogical tool. It makes use of the Mac's graphical interface, and provides a number of tools for building, editing, training, testing and examining networks. This program is available by anonymous ftp from ftp.dartmouth.edu as /pub/mac/dartnet.sit.hqx (124 KB).

    SNNS 4.1

    "Stuttgarter Neural Network Simulator" from the University of Tuebingen, Germany (formerly from the University of Stuttgart): a simulator for many types of nets with X11 interface: Graphical 2D and 3D topology editor/visualizer, training visualisation, multiple pattern set handling etc.

    Currently supports backpropagation (vanilla, online, with momentum term and flat spot elimination, batch, time delay), counterpropagation, quickprop, backpercolation 1, generalized radial basis functions (RBF), RProp, ART1, ART2, ARTMAP, Cascade Correlation, Recurrent Cascade Correlation, Dynamic LVQ, Backpropagation through time (for recurrent networks), batch backpropagation through time (for recurrent networks), Quickpropagation through time (for recurrent networks), Hopfield networks, Jordan and Elman networks, autoassociative memory, self-organizing maps, time-delay networks (TDNN), RBF_DDA, simulated annealing, Monte Carlo, Pruned Cascade-Correlation, Optimal Brain Damage, Optimal Brain Surgeon, Skeletonization, and is user-extendable (user-defined activation functions, output functions, site functions, learning procedures). C code generator snns2c.

    Works on SunOS, Solaris, IRIX, Ultrix, OSF, AIX, HP/UX, NextStep, Linux, and Windows 95/NT. Distributed kernel can spread one learning run over a workstation cluster.

    SNNS web page: http://www-ra.informatik.uni-tuebingen.de/SNNS
    Ftp server: ftp://ftp.informatik.uni-tuebingen.de/pub/SNNS

  • SNNSv4.1.Readme
  • SNNSv4.1.tar.gz (1.4 MB, Source code)
  • SNNSv4.1.Manual.ps.gz (1 MB, Documentation)
  • Mailing list: http://www-ra.informatik.uni-tuebingen.de/SNNS/about-ml.html

    Aspirin/MIGRAINES

    Aspirin/MIGRAINES 6.0 consists of a code generator that builds neural network simulations by reading a network description (written in a language called "Aspirin") and generates a C simulation. An interface (called "MIGRAINES") is provided to export data from the neural network to visualization tools. The system has been ported to a large number of platforms. The goal of Aspirin is to provide a common extendible front-end language and parser for different network paradigms. The MIGRAINES interface is a terminal based interface that allows you to open Unix pipes to data in the neural network. Users can display the data using either public or commercial graphics/analysis tools. Example filters are included that convert data exported through MIGRAINES to formats readable by Gnuplot 3.0, Matlab, Mathematica, and xgobi.

    The software is available from two FTP sites: from CMU's simulator collection on pt.cs.cmu.edu [128.2.254.155] in /afs/cs/project/connect/code/unsupported/am6.tar.Z and from UCLA's cognitive science machine ftp.cognet.ucla.edu [128.97.50.19] in /pub/alexis/am6.tar.Z (2 MB).

    ALN Workbench (a spreadsheet for Windows)

    ALNBench is a free spreadsheet program for MS-Windows (NT, 95) that allows the user to import training and test sets and predict a chosen column of data from the others in the training set. It is an easy-to-use program for research, education and evaluation of ALN technology. Anyone who can use a spreadsheet can quickly understand how to use it. It facilitates interactive access to the power of the Dendronic Learning Engine (DLE), a product in commercial use.

    An ALN consists of linear functions with adaptable weights at the leaves of a tree of maximum and minimum operators. The tree grows automatically during training: a linear piece splits if its error is too high. The function computed by an ALN is piecewise linear and continuous. It can learn to approximate any continuous function to arbitrarily high accuracy.
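
    As a rough illustration of that structure (a hand-built toy example, not output of the Dendronic Learning Engine), an ALN-style function with four linear pieces combined by min and max operators can be evaluated like this:

        #include <algorithm>
        #include <cstdio>

        // A tiny ALN-style function of one variable: linear pieces at the
        // leaves, max and min operators at the internal nodes.  The weights
        // here are made up by hand purely to illustrate the structure; a real
        // ALN adapts the leaf weights and grows the tree during training.
        double aln_example(double x) {
            const double leaf1 = 0.5 * x + 1.0;    // linear pieces (leaves)
            const double leaf2 = -0.5 * x + 1.0;
            const double leaf3 = 2.0 * x - 3.0;
            const double leaf4 = -2.0 * x - 3.0;

            const double left  = std::min(leaf1, leaf2);   // internal min node
            const double right = std::max(leaf3, leaf4);   // internal max node
            return std::max(left, right);                  // root: max node
        }

        int main() {
            // The result is piecewise linear and continuous in x.
            for (double x = -3.0; x <= 3.0; x += 1.0)
                std::printf("f(%+.1f) = %+.2f\n", x, aln_example(x));
            return 0;
        }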

    Parameters allow the user to input knowledge about a function to promote good generalization. In particular, bounds on the weights of the linear functions can be directly enforced. Some parameters are chosen automatically in standard mode, and are under user control in expert mode.

    The program can be downloaded from http://www.dendronic.com/beta.htm

    For further information please contact:

    William W. Armstrong PhD, President
    Dendronic Decisions Limited
    3624 - 108 Street, NW
    Edmonton, Alberta,
    Canada T6J 1B4
    Email: arms@dendronic.com
    URL: http://www.dendronic.com/
    Tel. +1 403 421 0800
    (Note: The area code 403 changes to 780 after Jan. 25, 1999)

    PDP++

    The PDP++ software is a new neural-network simulation system written in C++. It represents the next generation of the PDP software released with the McClelland and Rumelhart "Explorations in Parallel Distributed Processing Handbook", MIT Press, 1987. It is easy enough for novice users, but very powerful and flexible for research use.
    The current version is 1.0, our first non-beta release. It has been extensively tested and should be completely usable. Works on Unix with X-Windows.

    Features: Full GUI (InterViews), realtime network viewer, data viewer, extendable object-oriented design, CSS scripting language with source-level debugger, GUI macro recording.

    Algorithms: Feedforward and several recurrent BP, Boltzmann machine, Hopfield, Mean-field, Interactive activation and competition, continuous stochastic networks.

    The software can be obtained by anonymous ftp from ftp://cnbc.cmu.edu/pub/pdp++/ and from ftp://unix.hensa.ac.uk/mirrors/pdp++/.

    For more information, see our WWW page at http://www.cnbc.cmu.edu/PDP++/PDP++.html.
    There is a 250 page (printed) manual and an HTML version available on-line at the above address.

    Uts (Xerion, the sequel)

    Uts is a portable artificial neural network simulator written on top of the Tool Control Language (Tcl) and the Tk UI toolkit. As a result, the user interface is readily modifiable, and the graphical user interface and visualization tools can be used simultaneously with scripts written in Tcl. Uts itself implements only the connectionist paradigm of linked units in Tcl and the basic elements of the graphical user interface. To make a ready-to-use package, there exist modules which use Uts to do back-propagation (tkbp) and Gaussian mixture optimization via EM (tkmxm). Uts is available from ftp.cs.toronto.edu in directory /pub/xerion.

    Neocognitron simulator

    The simulator is written in C and comes with a list of references that must be read to understand the specifics of the implementation. The unsupervised version is coded without (!) C-cell inhibition. Available for anonymous ftp from unix.hensa.ac.uk [129.12.21.7] in /pub/neocognitron.tar.Z (130 kB).

    Multi-Module Neural Computing Environment (MUME)

    MUME is a simulation environment for multi-module neural computing. It provides an object-oriented facility for the simulation and training of multiple nets with various architectures and learning algorithms. MUME includes a library of network architectures including feedforward, simple recurrent, and continuously running recurrent neural networks. Each architecture is supported by a variety of learning algorithms. MUME can be used for large-scale neural network simulations as it provides support for learning in multi-net environments. It also provides pre- and post-processing facilities.

    The modules are provided in a library. Several "front-ends" or clients are also available. X Window support is provided by the editor/visualization tool Xmume. MUME can be used to include non-neural computing modules (decision trees, ...) in applications. MUME is available to educational institutions by anonymous ftp on mickey.sedal.su.oz.au [129.78.24.170] after signing and sending a licence: /pub/license.ps (67 kb).

    Contact:
    Marwan Jabri, SEDAL, Sydney University Electrical Engineering,
    NSW 2006 Australia, marwan@sedal.su.oz.au

    LVQ_PAK, SOM_PAK

    These are packages for Learning Vector Quantization and Self-Organizing Maps, respectively. They have been built by the LVQ/SOM Programming Team of the Helsinki University of Technology, Laboratory of Computer and Information Science, Rakentajanaukio 2 C, SF-02150 Espoo, FINLAND. There are versions for Unix and MS-DOS available from http://nucleus.hut.fi/nnrc/nnrc-programs.html

    Nevada Backpropagation (NevProp)

    NevProp, version 3, is a relatively easy-to-use feedforward backpropagation multilayer perceptron simulator; that is, statistically speaking, a multivariate nonlinear regression program. NevProp3 is distributed for free under the terms of the GNU Public License and can be downloaded from http://www.scsr.nevada.edu/nevprop/. The program is distributed as C source code that should compile and run on most platforms. In addition, precompiled executables are available for Macintosh and DOS platforms. Limited support is available from Phil Goodman (goodman@unr.edu), University of Nevada Center for Biomedical Research.

    MAJOR FEATURES OF NevProp3 OPERATION (* indicates feature new in version 3)

  • Character-based interface common to the UNIX, DOS, and Macintosh platforms.
  • Command-line argument format to efficiently initiate NevProp3. For Generalized Nonlinear Modeling (GNLM) mode, beginners may opt to use an interactive interface.
  • Option to pre-standardize the training data (z-score or forced range*).
  • Option to pre-impute missing elements in training data (case-wise deletion, or imputation with mean, median, random selection, or k-nearest neighbor).*
  • Primary error (criterion) measures include mean square error, hyperbolic tangent error, and log likelihood (cross-entropy), as penalized and unpenalized values (generic definitions of the unpenalized criteria are sketched after this list).
  • Secondary measures include ROC-curve area (c-index), thresholded classification, R-squared and Nagelkerke R-squared. Also reported at intervals are the weight configuration, and the sum of square weights.
  • Allows simultaneous use of logistic (for dichotomous outputs) and linear output activation functions (automatically detected to assign activation and error function).*
  • 1-of-N (Softmax)* and M-of-N options for binary classification.
  • Optimization options: flexible learning rate (fixed global adaptive, weight-specific, quickprop), split learn rate (inversely proportional to number of incoming connections), stochastic (case-wise updating), sigmoidprime offset (to prevent locking at logistic tails).
  • Regularization options: fixed weight decay, optional decay on bias weights, Bayesian hyperpenalty* (partial and full Automatic Relevance Determination, also used to select important predictors), automated early stopping (full dataset stopping based on multiple subset cross-validations) by error criterion.
  • Validation options: upload held-out validation test set; select subset of outputs for joint summary statistics;* select automated bootstrapped modeling to correct optimistically biased summary statistics (with standard deviations) without use of hold-out.
  • Saving predictions: for training data and uploaded validation test set, save file with identifiers, true targets, predictions, and (if bootstrapped models selected) lower and upper 95% confidence limits* for each prediction.
  • Inference options: determination of the mean predictor effects and level effects (for multilevel predictor variables); confidence limits within main model or across bootstrapped models.*
  • ANN-kNN (k-nearest neighbor) emulation mode options: impute missing data elements and save to new data file; classify test data (with or without missing elements) using ANN-kNN model trained on data with or without missing elements (complete ANN-based expectation maximization).*
  • AGE (ANN-Gated Ensemble) options: adaptively weight predictions (any scale of scores) obtained from multiple (human or computational) "experts"; validate on new prediction sets; optional internal prior-probability expert.*
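
    As noted in the feature list above, the primary error criteria have standard textbook definitions; the following sketch in plain Python/NumPy gives generic (unpenalized) mean square error and log-likelihood (cross-entropy) measures, not NevProp3's actual C code, and the penalized variants would simply add a term proportional to the sum of squared weights:

      import numpy as np

      def mean_square_error(targets, outputs):
          # average squared residual over cases and outputs
          return np.mean((targets - outputs) ** 2)

      def cross_entropy(targets, outputs, eps=1e-12):
          # negative log likelihood for 0/1 targets and predicted probabilities
          p = np.clip(outputs, eps, 1 - eps)
          return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

      t = np.array([1.0, 0.0, 1.0])
      y = np.array([0.9, 0.2, 0.6])
      print(mean_square_error(t, y), cross_entropy(t, y))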

    Fuzzy ARTmap

    This is just a small example program. Available for anonymous ftp from park.bu.edu [128.176.121.56] ftp://cns-ftp.bu.edu/pub/fuzzy-artmap.tar.Z (44 kB).

    PYGMALION

    This is a prototype that stems from an ESPRIT project. It implements back-propagation, self-organising map, and Hopfield nets. Available for ftp from ftp.funet.fi [128.214.248.6] as /pub/sci/neural/sims/pygmalion.tar.Z (1534 kb). (Original site is imag.imag.fr: archive/pygmalion/pygmalion.tar.Z).

    Basis-of-AI-NN Software

    Non-GUI DOS and UNIX source code, DOS binaries, and examples are available in the following program sets; the backprop package also has a Windows 3.x binary and a Unix/Tcl/Tk version:
       [backprop, quickprop, delta-bar-delta, recurrent networks],
       [simple clustering, k-nearest neighbor, LVQ1, DSM],
       [Hopfield, Boltzman, interactive activation network],
       [interactive activation network],
       [feedforward counterpropagation],
       [ART I],
       [a simple BAM] and
       [the linear pattern classifier]
    
    For details see: http://www.dontveter.com/nnsoft/nnsoft.html

    An improved professional version of backprop is also available; see Part 6 of the FAQ.

    Questions to: Don Tveter, drt@christianliving.net

    Matrix Backpropagation

    MBP (Matrix Back Propagation) is a very efficient implementation of the back-propagation algorithm for current-generation workstations. The algorithm includes a per-epoch adaptive technique for gradient descent. All the computations are done through matrix multiplications and make use of highly optimized C code. The goal is to reach near-peak performance on RISC processors with superscalar capabilities and fast caches. On some machines (and with large networks) a 30-40x speed-up can be measured with respect to conventional implementations. The software is available by anonymous ftp from ftp.esng.dibe.unige.it as /neural/MBP/MBPv1.1.tar.Z (Unix version) or /neural/MBP/MBPv11.zip (PC version). For more information, contact Davide Anguita (anguita@dibe.unige.it).
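
    The point about casting everything as matrix multiplications can be illustrated with a minimal NumPy sketch (a generic illustration of the idea, not Anguita's MBP code; the network size and learning rate are invented): one batch backpropagation epoch for a single-hidden-layer network reduces to a handful of matrix products, which is the kind of workload that runs close to peak speed on processors with fast caches and optimized matrix routines.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 4))              # batch of inputs (rows = cases)
      T = rng.normal(size=(100, 2))              # targets
      W1 = rng.normal(scale=0.1, size=(4, 8))    # input -> hidden weights
      W2 = rng.normal(scale=0.1, size=(8, 2))    # hidden -> output weights
      lr = 0.01

      for epoch in range(200):
          H = np.tanh(X @ W1)                    # forward pass for the whole batch
          Y = H @ W2
          E = Y - T                              # output error
          dW2 = H.T @ E                          # gradients as matrix products
          dH = (E @ W2.T) * (1.0 - H ** 2)       # error backpropagated through tanh
          dW1 = X.T @ dH
          W1 -= lr * dW1 / len(X)
          W2 -= lr * dW2 / len(X)
      print(np.mean((np.tanh(X @ W1) @ W2 - T) ** 2))   # final batch error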

    BIOSIM

    BIOSIM is a biologically oriented neural network simulator. Public domain, runs on Unix (a less powerful PC version is available, too), easy to install, bilingual (German and English), has a GUI (Graphical User Interface), designed for research and teaching, provides online help facilities, offers controlling interfaces, batch version is available, a DEMO is provided.

    REQUIREMENTS (Unix version): X11 Rel. 3 and above, Motif Rel 1.0 and above, 12 MB of physical memory, recommended are 24 MB and more, 20 MB disc space. REQUIREMENTS (PC version): PC-compatible with MS Windows 3.0 and above, 4 MB of physical memory, recommended are 8 MB and more, 1 MB disc space.

    Four neuron models are implemented in BIOSIM: a simple model only switching ion channels on and off, the original Hodgkin-Huxley model, the SWIM model (a modified HH model) and the Golowasch-Buchholz model. Dendrites consist of a chain of segments without bifurcation. A neural network can be created by using the interactive network editor which is part of BIOSIM. Parameters can be changed via context-sensitive menus and the results of the simulation can be visualized in observation windows for neurons and synapses. Stochastic processes such as noise can be included. In addition, biologically oriented learning and forgetting processes are modeled, e.g. sensitization, habituation, conditioning, Hebbian learning and competitive learning. Three synaptic types are predefined (an excitatory synapse type, an inhibitory synapse type and an electrical synapse). Additional synaptic types can be created interactively as desired.
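
    For readers unfamiliar with such neuron models, the following sketch (plain Python/NumPy using standard textbook Hodgkin-Huxley equations and squid-axon parameters, not BIOSIM's code) shows what simulating one of them amounts to: forward-Euler integration of the membrane potential and the three gating variables under a constant injected current. BIOSIM additionally handles dendritic segments, synapses, and the learning processes listed above.

      import numpy as np

      C, gNa, gK, gL = 1.0, 120.0, 36.0, 0.3     # uF/cm^2 and mS/cm^2
      ENa, EK, EL = 50.0, -77.0, -54.4           # reversal potentials in mV

      def a_m(V): return 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
      def b_m(V): return 4.0 * np.exp(-(V + 65.0) / 18.0)
      def a_h(V): return 0.07 * np.exp(-(V + 65.0) / 20.0)
      def b_h(V): return 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
      def a_n(V): return 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
      def b_n(V): return 0.125 * np.exp(-(V + 65.0) / 80.0)

      V, m, h, n = -65.0, 0.05, 0.6, 0.32        # approximate resting state
      dt, I_inj = 0.01, 10.0                     # time step (ms), current (uA/cm^2)
      for step in range(int(50.0 / dt)):         # 50 ms of simulated time
          I_ion = (gNa * m**3 * h * (V - ENa)
                   + gK * n**4 * (V - EK)
                   + gL * (V - EL))
          V += dt * (I_inj - I_ion) / C          # forward-Euler membrane update
          m += dt * (a_m(V) * (1 - m) - b_m(V) * m)
          h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
          n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
      print(round(V, 2))                         # final membrane potential (mV)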

    Available for ftp from ftp.uni-kl.de in directory /pub/bio/neurobio: Get /pub/bio/neurobio/biosim.readme (2 kb) and /pub/bio/neurobio/biosim.tar.Z (2.6 MB) for the Unix version or /pub/bio/neurobio/biosimpc.readme (2 kb) and /pub/bio/neurobio/biosimpc.zip (150 kb) for the PC version.

    Contact:
    Stefan Bergdoll
    Department of Software Engineering (ZXA/US)
    BASF Inc.
    D-67056 Ludwigshafen; Germany
    bergdoll@zxa.basf-ag.de phone 0621-60-21372 fax 0621-60-43735

    The Brain

    The Brain is an advanced neural network simulator for PCs that is simple enough to be used by non-technical people, yet sophisticated enough for serious research work. It is based upon the backpropagation learning algorithm. Three sample networks are included. The documentation included provides you with an introduction and overview of the concepts and applications of neural networks as well as outlining the features and capabilities of The Brain.

    The Brain requires 512K memory and MS-DOS or PC-DOS version 3.20 or later (versions for other OS's and machines are available). A 386 (with maths coprocessor) or higher is recommended for serious use of The Brain. Shareware payment required.

    The demo version is restricted in the number of units the network can handle, due to memory constraints on PCs. The registered version allows use of extra memory.

    External documentation included: 39Kb, 20 Pages.
    Source included: No (Source comes with registration).
    Available via anonymous ftp from ftp.tu-clausthal.de as /pub/msdos/science/brain12.zip (78 kb) and from ftp.technion.ac.il as /pub/contrib/dos/brain12.zip (78 kb)

    Contact:
    David Perkovic
    DP Computing
    PO Box 712
    Noarlunga Center SA 5168
    Australia
    Email: dip@mod.dsto.gov.au (preferred) or dpc@mep.com or perkovic@cleese.apana.org.au

    FuNeGen 1.0

    FuNeGen is an MLP-based software program to generate fuzzy rule-based classifiers. A limited version (maximum of 7 inputs and 3 membership functions for each input) for PCs is available for anonymous ftp from obelix.microelectronic.e-technik.th-darmstadt.de in directory /pub/neurofuzzy. For further information see the file read.me. Contact: Saman K. Halgamuge

    NeuDL -- Neural-Network Description Language

    NeuDL is a description language for the design, training, and operation of neural networks. It is currently limited to the backpropagation neural-network model; however, it offers a great deal of flexibility. For example, the user can explicitly specify the connections between nodes and can create or destroy connections dynamically as training progresses. NeuDL is an interpreted language resembling C or C++. It also has instructions dealing with training/testing set manipulation as well as neural network operation. A NeuDL program can be run in interpreted mode or it can be automatically translated into C++ which can be compiled and then executed. The NeuDL interpreter is written in C++ and can be easily extended with new instructions.

    NeuDL is available from the anonymous ftp site at The University of Alabama: cs.ua.edu (130.160.44.1) in the file /pub/neudl/NeuDLver021.tar. The tarred file contains the interpreter source code (in C++), a user manual, a paper about NeuDL, and about 25 sample NeuDL programs. A document demonstrating NeuDL's capabilities is also available from the ftp site: /pub/neudl/demo.doc. For more information contact the author: Joey Rogers (jrogers@buster.eng.ua.edu).

    NeoC Explorer (Pattern Maker included)

    The NeoC software is an implementation of Fukushima's Neocognitron neural network. Its purpose is to test the model and to facilitate interactive experimentation. Some notable features: GUI, explorer and tester operation modes, recognition statistics, performance analysis, element display, easy net construction. In addition, a pattern-maker utility for testing ANNs is included: GUI, text file output, transformations. Available for anonymous FTP from OAK.Oakland.Edu (141.210.10.117) as /SimTel/msdos/neurlnet/neocog10.zip (193 kB, DOS version)

    AINET

    AINET is a probabilistic neural network application which runs on Windows 95/NT. It was designed specifically to facilitate the modeling task in all neural network problems. It is lightning fast and can be used in conjunction with many different programming languages. It does not require iterative learning, has no limit on the number of variables (input and output neurons) or on sample size, and is not sensitive to noise in the data. The database can be changed dynamically. It provides a way to estimate the rate of error in your prediction. It has a graphical spreadsheet-like user interface. The AINET manual (more than 100 pages) is divided into: "User's Guide", "Basics About Modeling with the AINET", "Examples", "The AINET DLL library" and "Appendix", where the theoretical background is revealed. You can get a full working copy from: http://www.ainet-sp.si/
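
    The claim that no iterative learning is required can be illustrated with a generic Specht-style probabilistic neural network sketch in plain Python/NumPy (this is not AINET's implementation; the smoothing width sigma and the toy data are invented): "training" amounts to storing the cases, and classification sums one Gaussian kernel per stored case for each class.

      import numpy as np

      def pnn_classify(X_train, y_train, x, sigma=0.5):
          # class with the largest average kernel response wins
          scores = {}
          for c in np.unique(y_train):
              D = X_train[y_train == c] - x
              scores[c] = np.mean(np.exp(-np.sum(D * D, axis=1) / (2.0 * sigma ** 2)))
          return max(scores, key=scores.get)

      X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
      y = np.array([0, 0, 1, 1])
      print(pnn_classify(X, y, np.array([0.8, 0.9])))   # prints 1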

    DemoGNG

    This simulator is written in Java and should therefore run without compilation on all platforms where a Java interpreter (or a browser with Java support) is available. It implements the following algorithms and neural network models:
  • Hard Competitive Learning (standard algorithm)
  • Neural Gas (Martinetz and Schulten 1991)
  • Competitive Hebbian Learning (Martinetz and Schulten 1991, Martinetz 1993)
  • Neural Gas with Competitive Hebbian Learning (Martinetz and Schulten 1991)
  • Growing Neural Gas (Fritzke 1995)

    DemoGNG is distributed under the GNU General Public License. It allows you to experiment with the different methods using various probability distributions. All model parameters can be set interactively on the graphical user interface. A teach mode is provided to observe the models in "slow-motion" if so desired. It is currently not possible to experiment with user-provided data, so the simulator is useful mainly for demonstration and teaching purposes and as a sample implementation of the above algorithms. (A minimal sketch of the simplest of these methods follows.)
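
    The sketch below (plain Python/NumPy, not the DemoGNG Java source; the unit count, learning rate, and data are invented) shows hard competitive learning, where each input moves only its closest reference vector. Neural Gas instead ranks all units by distance and updates every unit with a rank-dependent step, and Growing Neural Gas additionally inserts units and edges during learning.

      import numpy as np

      rng = np.random.default_rng(1)
      data = rng.random((1000, 2))               # samples from a 2-D distribution
      centers = rng.random((10, 2))              # 10 reference vectors
      eps = 0.05                                 # learning rate

      for x in data:                             # online winner-take-all updates
          winner = np.argmin(np.sum((centers - x) ** 2, axis=1))
          centers[winner] += eps * (x - centers[winner])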

    DemoGNG can be accessed most easily at http://www.neuroinformatik.ruhr-uni-bochum.de/ in the file /ini/VDM/research/gsn/DemoGNG/GNG.html where it is embedded as a Java applet in a Web page and is downloaded for immediate execution when you visit this page. An accompanying paper entitled "Some competitive learning methods" describes the implemented models in detail and is available in html at the same server in the directory ini/VDM/research/gsn/JavaPaper/.

    It is also possible to download the complete source code and a Postscript version of the paper via anonymous ftp from ftp.neuroinformatik.ruhr-uni-bochum.de [134.147.176.16] in directory /pub/software/NN/DemoGNG/. The software is in the file DemoGNG-1.00.tar.gz (193 KB) and the paper in the file sclm.ps.gz (89 KB). There is also a README file (9 KB). Please send any comments and questions to demogng@neuroinformatik.ruhr-uni-bochum.de, which will reach Hartmut Loos, who wrote DemoGNG, as well as Bernd Fritzke, the author of the accompanying paper.

    PMNEURO 1.0a

    PMNEURO 1.0a is available at:
    
    ftp://ftp.uni-stuttgart.de/pub/systems/os2/misc/pmneuro.zip
    
    PMNEURO 1.0a creates neural networks (backpropagation); propagation
    results can be used as new training input for creating new networks and
    for subsequent propagation trials.

    nn/xnn

       Name: nn/xnn
    Company: Neureka ANS
    Address: Klaus Hansens vei 31B
             5037 Solheimsviken
             NORWAY
      Phone: +47 55 20 15 48
      Email: neureka@bgif.no
        URL: http://www.bgif.no/neureka/ 
    Operating systems: 
         nn: UNIX or MS-DOS, 
        xnn: UNIX/X-windows, UNIX flavours: OSF1, Solaris, AIX, IRIX, Linux (1.2.13)
    System requirements: Min. 20 Mb HD + 4 Mb RAM available. If only the
                         nn/netpack part is used (i.e. not the GUI), much
                         less is needed.
    Approx. price: Free for 30 days after installation, fully functional
                   After 30 days: USD 250,-
                   35% educational discount.
    A comprehensive shareware system for developing and simulating artificial neural networks. You can download the software from the URL given above.

    nn is a high-level neural network specification language. The current version is best suited for feed-forward nets, but recurrent models can be and have been implemented as well. The nn compiler can generate C code or executable programs, with a powerful command line interface, but everything may also be controlled via the graphical interface (xnn). It is possible for the user to write C routines that can be called from inside the nn specification, and to use the nn specification as a function that is called from a C program. These features make nn well suited for application development. Please note that no programming is necessary in order to use the network models that come with the system (netpack).

    xnn is a graphical front end to networks generated by the nn compiler, and to the compiler itself. The xnn graphical interface is intuitive and easy to use for beginners, yet powerful, with many possibilities for visualizing network data. Data may be visualized during training, testing or 'off-line'.

    netpack: A number of networks have already been implemented in nn and can be used directly: MAdaline, ART1, Backpropagation, Counterpropagation, Elman, GRNN, Hopfield, Jordan, LVQ, Perceptron, RBFNN, SOFM (Kohonen). Several others are currently being developed.

    The pattern files used by the networks have a simple and flexible format and can easily be generated from other kinds of data. The data file generated by the network can be saved in ASCII or binary format. Functions for converting and pre-processing data are available.

    NNDT

                              NNDT
    
                  Neural Network Development Tool
                      Evaluation version 1.4
                           Björn Saxén
                              1995
    http://www.abo.fi/~bjsaxen/nndt.html ftp://ftp.abo.fi/pub/vt/bjs/

    The NNDT software is a tool for neural network training. The user interface is developed with MS Visual Basic 3.0 professional edition. DLL routines (written in C) are used for most of the mathematics. The program can be run on a personal computer with MS Windows, version 3.1.

    Evaluation version

    This evaluation version of NNDT may be used free of charge for personal and educational use. The software certainly contains limitations and bugs, but is still a working version which has been developed for over one year. Comments, bug reports and suggestions for improvements can be sent to:
            bjorn.saxen@abo.fi
    or
            Bjorn Saxen
            Heat Engineering Laboratory
            Abo Akademi University
            Biskopsgatan 8
            SF-20500 Abo
            Finland
    Remember, this program comes free but with no guarantee!

    A user's guide for NNDT is delivered in PostScript format. The document is split into three parts and compressed into a file called MANUAL.ZIP. Because many bitmap figures are included, the total size of the uncompressed files is large, approximately 1.5 MB.

    Features and methods

    The network algorithms implemented are of the so-called supervised type. So far, algorithms for multi-layer perceptron (MLP) networks of feed-forward and recurrent types are included. The MLP networks are trained with the Levenberg-Marquardt method.
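
    For orientation, the Levenberg-Marquardt method interpolates between Gauss-Newton and gradient-descent steps by damping the normal equations. The following generic sketch (plain Python/NumPy with a finite-difference Jacobian; this is not NNDT's DLL code, and the toy curve-fitting example is invented) shows the basic update; for an MLP the parameter vector would hold all weights and the residuals would be the network outputs minus the targets from the pattern file.

      import numpy as np

      def levenberg_marquardt(residuals, w, n_iter=50, mu=1e-2):
          # minimizes sum(residuals(w)**2); mu large -> gradient descent,
          # mu small -> Gauss-Newton
          for _ in range(n_iter):
              r = residuals(w)
              J = np.empty((r.size, w.size))
              for j in range(w.size):            # finite-difference Jacobian
                  dw = np.zeros_like(w); dw[j] = 1e-6
                  J[:, j] = (residuals(w + dw) - r) / 1e-6
              step = np.linalg.solve(J.T @ J + mu * np.eye(w.size), -J.T @ r)
              if np.sum(residuals(w + step) ** 2) < np.sum(r ** 2):
                  w, mu = w + step, mu * 0.7     # accept step, trust the model more
              else:
                  mu *= 2.0                      # reject step, damp harder
          return w

      x = np.linspace(0.0, 1.0, 20)
      y = 2.0 * np.exp(-1.5 * x)                 # data from y = a*exp(b*x)
      fit = levenberg_marquardt(lambda w: w[0] * np.exp(w[1] * x) - y,
                                np.array([1.0, 0.0]))
      print(fit)                                 # should approach [2.0, -1.5]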

    The training requires a set of input signals and corresponding output signals, stored in a file referred to as pattern file. This is the only file the user must provide. Optionally, parameters defining the pattern file columns, network size and network configuration may be stored in a file referred to as setup file.

    NNDT includes a routine for graphical presentation of output signals, node activations, residuals and weights during run. The interface also provides facilities for examination of node activations and weights as well as modification of weights.

    A Windows help file is included; help is available at any time during NNDT execution by pressing F1.

    Installation

    Unzip NNDTxx.ZIP to a separate disk or to a temporary directory, e.g. c:\tmp. The program is then installed by running SETUP.EXE. See INSTALL.TXT for more details.

    Trajan 2.1 Shareware

    Trajan 2.1 Shareware is a Windows-based Neural Network simulation package. It includes support for the two most popular forms of Neural Network: Multilayer Perceptrons with Back Propagation and Kohonen networks.

     Trajan 2.1 Shareware concentrates on ease-of-use and feedback. It includes Graphs, Bar Charts and Data Sheets presenting a range of Statistical feedback in a simple, intuitive form. It also features extensive on-line Help.

     The Registered version of the package can support very large networks (up to 128 layers with up to 8,192 units each, subject to memory limitations in the machine), and allows simple Cut and Paste transfer of data to/from other Windows-packages, such as spreadsheet programs. The Unregistered version features limited network size and no Clipboard Cut-and-Paste.

     There is also a Professional version of Trajan 2.1, which supports a wider range of network models, training algorithms and other features.

     See Trajan Software's Home Page at http://www.trajan-software.demon.co.uk for further details, and a free copy of the Shareware version.

     Alternatively, email andrew@trajan-software.demon.co.uk for more details.

    Neural Networks at your Fingertips

    "Neural Networks at your Fingertips" is a package of ready-to-reuse neural network simulation source code which was prepared for educational purposes by Karsten Kutza. The package consists of eight programs, each of which implements a particular network architecture together with an embedded example application from a typical application domain.
    Supported network architectures are
  • Adaline,
  • Backpropagation,
  • Hopfield Model,
  • Bidirectional Associative Memory,
  • Boltzmann Machine,
  • Counterpropagation,
  • Self-Organizing Map, and
  • Adaptive Resonance Theory.

    The applications demonstrate use of the networks in various domains such as pattern recognition, time-series forecasting, associative memory, optimization, vision, and control, and include, for example, sunspot prediction, the traveling salesman problem, and a pole balancer.
    The programs are coded in portable, self-contained ANSI C and can be obtained from the web pages at http://www.geocities.com/CapeCanaveral/1624.
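
    The package itself is written in ANSI C; as a language-neutral illustration of the simplest architecture in the list above, here is a minimal Python/NumPy sketch of an Adaline trained with the Widrow-Hoff (LMS) delta rule on invented data:

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 3))
      t = X @ np.array([0.5, -1.0, 2.0]) + 0.3   # targets from a linear rule
      w, b, lr = np.zeros(3), 0.0, 0.01

      for x, target in zip(X, t):
          y = w @ x + b                          # linear (Adaline) output
          err = target - y
          w += lr * err * x                      # delta-rule updates
          b += lr * err
      print(np.round(w, 2), round(b, 2))         # weights move toward the rule above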

    NNFit

    NNFit (Neural Network data Fitting) is a user-friendly program that allows the development of empirical correlations between input and output data. Multilayered neural models have been implemented using a quasi-Newton method as the learning algorithm. An early-stopping method is available, and various tables and figures are provided to evaluate the fitting performance of the neural models. The software is available for most Unix platforms with X-Windows (IBM-AIX, HP-UX, SUN, SGI, DEC, Linux). Information, a manual, and executable code (English and French versions) are available at http://www.gch.ulaval.ca/~nnfit
    Contact: Bernard P.A. Grandjean, department of chemical engineering,
    Laval University; Sainte-Foy (Québec) Canada G1K 7P4;
    grandjean@gch.ulaval.ca

    Nenet v1.0

    Nenet v1.0 is a 32-bit Windows 95 and Windows NT 4.0 application designed to facilitate the use of a Self-Organizing Map (SOM) algorithm.

    The major motivation for Nenet was to create a user-friendly SOM algorithm tool with good visualization capabilities and with a GUI allowing efficient control of the SOM parameters. The use scenarios have stemmed from the user's point of view, and a considerable amount of work has gone into ease of use and versatile visualization methods.

    With Nenet, all the basic steps in map control can be performed. In addition, Nenet also includes some more exotic and involved features especially in the area of visualization.

    Features in Nenet version 1.0:

  • Implements the standard Kohonen SOM algorithm
  • Supports 2 common data preprocessing methods
  • 5 different visualization methods with rectangular or hexagonal topology
  • Capability to animate both train and test sequences in all visualization methods
  • Labelling
  • Both neurons and parameter levels can be labelled
  • Autolabelling is also provided
  • Neuron values can be inspected easily
  • Arbitrary selection of parameter levels can be visualized with Umatrix simultaneously
  • Multiple views can be opened on the same map data
  • Maps can be printed
  • Extensive help system provides fast and accurate online help
  • SOM_PAK compatible file formats
  • Easy to install and uninstall
  • Conforms to the common Windows 95 application style - all functionality in one application

    The Nenet web site is at http://www.mbnet.fi/~phodju/nenet/nenet.html. The web site contains further information on Nenet and also the downloadable Nenet files (3 disks totalling about 3 MB).

    If you have any questions whatsoever, please contact: Nenet-Team@hut.fi or phassine@cc.hut.fi

    Machine Consciousness Toolbox

    See listing for Machine Consciousness Toolbox in part 6 of the FAQ.

    NICO Toolkit (speech recognition)

          Name: NICO Artificial Neural Network Toolkit
        Author: Nikko Strom
       Address: Speech, Music and Hearing, KTH, S-100 44, Stockholm, Sweden
         Email: nikko@speech.kth.se
           URL: http://www.speech.kth.se/NICO/index.html
     Platforms: UNIX, ANSI C; Source code tested on: HPUX, SUN Solaris, Linux
         Price: Free
    The NICO Toolkit is an artificial neural network toolkit designed and optimized for automatic speech recognition applications. Networks with both recurrent connections and time-delay windows are easily constructed. The network topology is very flexible -- any number of layers is allowed and layers can be arbitrarily connected. Sparse connectivity between layers can be specified. Tools for extracting input-features from the speech signal are included as well as tools for computing target values from several standard phonetic label-file formats.

    Algorithms:

  • Back-propagation through time,
  • Speech feature extraction (Mel cepstrum coefficients, filter-bank)

    SOM Toolbox for Matlab 5

    SOM Toolbox, a shareware Matlab 5 toolbox for data analysis with self-organizing maps, is available at the URL http://www.cis.hut.fi/projects/somtoolbox/. If you are interested in practical data analysis and/or self-organizing maps and have Matlab 5 on your computer, be sure to check this out!

    Highlights of the SOM Toolbox include the following:

  • Tools for all the stages of data analysis: besides the basic SOM training and visualization tools, the package includes also tools for data preprocessing and model validation and interpretation.
  • Graphical user interface (GUI): the GUI first guides the user through the initialization and training procedures, and then offers a variety of different methods to visualize the data on the trained map.
  • Modular programming style: the Toolbox code utilizes Matlab structures, and the functions are constructed in a modular manner, which makes it convenient to tailor the code for each user's specific needs.
  • Advanced graphics: building on Matlab's strong graphics capabilities, attractive figures can be easily produced.
  • Compatibility with SOM_PAK: import/export functions for SOM_PAK codebook and data files are included in the package.
  • Component weights and names: the input vector components may be given different weights according to their relative importance, and the components can be given names to make the figures easier to read.
  • Batch or sequential training: in data analysis applications, the speed of training may be considerably improved by using the batch version (see the sketch after this list).
  • Map dimension: maps may be N-dimensional (but visualization is not supported when N > 2 ).
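
    As a rough illustration of why the batch version can be faster, the following sketch (plain Python/NumPy, not the SOM Toolbox's Matlab code; the one-dimensional map and all parameter choices are assumptions) replaces per-sample sequential updates with one best-matching-unit search and one neighborhood-weighted averaging step per iteration:

      import numpy as np

      def batch_som(data, n_units=10, n_iter=20, sigma0=3.0):
          # 1-D map for brevity; units sit on a line, neighborhood is Gaussian
          rng = np.random.default_rng(0)
          W = data[rng.choice(len(data), n_units, replace=False)].astype(float)
          grid = np.arange(n_units)
          for it in range(n_iter):
              sigma = sigma0 * (0.5 / sigma0) ** (it / max(1, n_iter - 1))
              d = ((data[:, None, :] - W[None, :, :]) ** 2).sum(-1)
              bmu = d.argmin(axis=1)             # best-matching unit per sample
              h = np.exp(-((grid[:, None] - grid[bmu][None, :]) ** 2)
                         / (2.0 * sigma ** 2))
              W = (h @ data) / h.sum(axis=1, keepdims=True)   # weighted means
          return W

      codebook = batch_som(np.random.default_rng(1).random((500, 2)))
      print(codebook.shape)                      # (10, 2)
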
    ------------------------------------------------------------------------
    For some of these simulators there are user mailing lists. Get the packages and look into their documentation for further info.

     If you are using a small computer (PC, Mac, etc.) you may want to have a look at the Central Neural System Electronic Bulletin Board (see question "Other sources of information"). Modem: 409-737-5222; Sysop: Wesley R. Elsberry; 4160 Pirates' Beach, Galveston, TX, USA; welsberr@orca.tamu.edu. There are lots of small simulator packages in the CNS ANNSIM file set. There is an ftp mirror site for the CNS ANNSIM file set at me.uta.edu [129.107.2.20] in the /pub/neural directory. Most ANN offerings are in /pub/neural/annsim.

    ------------------------------------------------------------------------
    Next part is part 6 (of 7). Previous part is part 4.

    Subject: Commercial software packages for NN simulation?

    Since the FAQ maintainer works for a software company, he does not recommend or evaluate software in the FAQ. The descriptions below are provided by the developers or distributors of the software.

    Note for future submissions: Please restrict product descriptions to a maximum of 60 lines of 72 characters, in either plain-text format or, preferably, HTML format. If you include the standard header (name, company, address, etc.), you need not count the header in the 60 line maximum. Please confine your HTML to features that are supported by primitive browsers, especially NCSA Mosaic 2.0; avoid tables, for example--use <pre> instead. Try to make the descriptions objective, and avoid making implicit or explicit assertions about competing products, such as "Our product is the *only* one that does so-and-so." The FAQ maintainer reserves the right to remove excessive marketing hype and to edit submissions to conform to size requirements; if he is in a good mood, he may also correct your spelling and punctuation.

    The following simulators are described below:

  • BrainMaker
  • SAS Enterprise Miner Software
  • NeuralWorks
  • MATLAB Neural Network Toolbox
  • Propagator
  • NeuroForecaster
  • Products of NESTOR, Inc.
  • Ward Systems Group (NeuroShell, etc.)
  • NuTank
  • Neuralyst
  • NeuFuz4
  • Cortex-Pro
  • Partek
  • NeuroSolutions v3.0
  • Qnet For Windows Version 2.0
  • NeuroLab, A Neural Network Library
  • hav.Software: havBpNet++, havFmNet++, havBpNet:J
  • IBM Neural Network Utility
  • NeuroGenetic Optimizer (NGO) Version 2.0
  • WAND
  • The Dendronic Learning Engine
  • TDL v. 1.1 (Trans-Dimensional Learning)
  • NeurOn-Line
  • NeuFrame, NeuroFuzzy, NeuDesk and NeuRun
  • OWL Neural Network Library (TM)
  • Neural Connection
  • Pattern Recognition Workbench Expo/PRO/PRO+
  • PREVia
  • Neural Bench
  • Trajan 2.1 Neural Network Simulator
  • DataEngine
  • Machine Consciousness Toolbox
  • Professional Basis of AI Backprop
  • STATISTICA: Neural Networks
  • Braincel (Excel add-in)
  • DESIRE/NEUNET
  • BioNet Simulator
  • Viscovery SOMine
  • NeuNet Pro
  • See also http://www.emsl.pnl.gov:2080/proj/neuron/neural/systems/software.html

    BrainMaker

             Name: BrainMaker, BrainMaker Pro
          Company: California Scientific Software
          Address: 10024 Newtown rd, Nevada City, CA, 95959 USA
            Phone: 800 284-8112, 530 478 9040
              Fax: 530 478 9041
              URL: http://www.calsci.com/
       Basic capabilities:  train backprop neural nets
       Operating system:   Windows, Mac
       System requirements:
      
       Approx. price:  $195, $795
    
       BrainMaker Pro 3.7 (DOS/Windows)     $795
           Genetic Training add-on          $250
       BrainMaker 3.7 (DOS/Windows/Mac)     $195
           Network Toolkit add-on           $150
       BrainMaker 3.7 Student version       (quantity sales only, about $38 each)
    
       BrainMaker Pro CNAPS Accelerator Board $8145
    
       Introduction To Neural Networks book $30
    
       30 day money back guarantee, and unlimited free technical support.
       BrainMaker package includes:
        The book Introduction to Neural Networks
        BrainMaker Users Guide and reference manual
            300 pages, fully indexed, with tutorials, and sample networks
        Netmaker
            Netmaker makes building and training Neural Networks easy, by
            importing and automatically creating BrainMaker's Neural Network
            files.  Netmaker imports Lotus, Excel, dBase, and ASCII files.
        BrainMaker
            Full menu and dialog box interface, runs Backprop at 3,000,000 cps
            on a 300 MHz Pentium II; 570,000,000 cps on the CNAPS accelerator.
    
       ---Features ("P" means is available in professional version only):
       MMX instruction set support for increased computation speed,
       Pull-down Menus, Dialog Boxes, Programmable Output Files,
       Editing in BrainMaker,  Network Progress Display (P),
       Fact Annotation,  supports many printers,  NetPlotter,
       Graphics Built In (P),  Dynamic Data Exchange (P),
       Binary Data Mode, Batch Use Mode (P), EMS and XMS Memory (P),
       Save Network Periodically,  Fastest Algorithms,
       512 Neurons per Layer (P: 32,000), up to 8 layers,
       Specify Parameters by Layer (P), Recurrence Networks (P),
       Prune Connections and Neurons (P),  Add Hidden Neurons In Training,
       Custom Neuron Functions,  Testing While Training,
       Stop training when...-function (P),  Heavy Weights (P),
       Hypersonic Training,  Sensitivity Analysis (P),  Neuron Sensitivity (P),
       Global Network Analysis (P),  Contour Analysis (P),
       Data Correlator (P),  Error Statistics Report,
       Print or Edit Weight Matrices,  Competitor (P), Run Time System (P),
       Chip Support for Intel, American Neurologics, Micro Devices,
       Genetic Training Option (P),  NetMaker,  NetChecker,
       Shuffle,  Data Import from Lotus, dBASE, Excel, ASCII, binary,
       Financial Data (P),  Data Manipulation,  Cyclic Analysis (P),
       User's Guide quick start booklet,
       Introduction to Neural Networks 324 pp book

    SAS Enterprise Miner Software

        Name: SAS Enterprise Miner Software
    
              In USA:                 In Europe:
     Company: SAS Institute, Inc.     SAS Institute, European Office 
     Address: SAS Campus Drive        Neuenheimer Landstrasse 28-30 
              Cary, NC 27513          P.O.Box 10 53 40 
              USA                     D-69043 Heidelberg 
                                      Germany
       Phone: (919) 677-8000          (49) 6221 4160
         Fax: (919) 677-4444          (49) 6221 474 850
    
         URL: http://www.sas.com/software/components/miner.html
       Email: software@sas.sas.com
    
    Operating systems: Windows NT
    To find the addresses and telephone numbers of other SAS Institute offices, including those outside the USA and Europe, connect your web browser to http://www.sas.com/offices/intro.html.

    Enterprise Miner is an integrated software product that provides an end-to-end business solution for data mining based on SEMMA (Sample, Explore, Modify, Model, Assess) methodology. Statistical tools include clustering, decision trees, linear and logistic regression, and neural networks. Data preparation tools include outlier detection, variable transformations, random sampling, and the partitioning of data sets (into training, test, and validation data sets). Advanced visualization tools enable you to quickly and easily examine large amounts of data in multidimensional histograms, and to graphically compare modeling results.

    The neural network tool includes multilayer perceptrons, radial basis functions, statistical versions of counterpropagation and learning vector quantization, a variety of built-in activation and error functions, multiple hidden layers, direct input-output connections, categorical variables, standardization of inputs and targets, and multiple preliminary optimizations from random initial values to avoid local minima. Training is done by state-of-the-art numerical optimization algorithms instead of tedious backprop.
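
    As a generic illustration of training network weights with a numerical optimizer rather than plain gradient-descent backprop (this is not SAS code; it assumes NumPy and SciPy are available, and the tiny XOR problem and network size are invented), the following sketch hands the cross-entropy error of a 2-2-1 network to a quasi-Newton (BFGS) routine:

      import numpy as np
      from scipy.optimize import minimize

      X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
      t = np.array([0.0, 1.0, 1.0, 0.0])         # XOR targets

      def loss(w):
          W1 = w[:4].reshape(2, 2); b1 = w[4:6]  # unpack the 9 weights
          W2 = w[6:8];              b2 = w[8]
          H = np.tanh(X @ W1 + b1)
          y = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))          # logistic output
          return -np.mean(t * np.log(y + 1e-12)
                          + (1 - t) * np.log(1 - y + 1e-12))

      w0 = np.random.default_rng(0).normal(scale=0.5, size=9)
      res = minimize(loss, w0, method="BFGS")    # quasi-Newton optimization
      print(res.fun)                             # final error (depends on the start)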

    NeuralWorks

         Name: NeuralWorks Professional II Plus (from NeuralWare)
      Company: NeuralWare Inc.
      Address: RIDC Park West
               202 Park West Drive
               Pittsburgh, PA 15275
        Phone: (412) 787-8222
          FAX: (412) 787-8220
        Email: sales@neuralware.com
          URL: http://www.neuralware.com/
    
     Distributor for Europe:
       Scientific Computers GmbH.
       Franzstr. 107, 52064 Aachen
       Germany
       Tel.   (49) +241-26041
       Fax.   (49) +241-44983
       Email. info@scientific.de
    
     Basic capabilities:
       supports over 30 different nets: backprop, ART-1, Kohonen,
       modular neural network, general regression, fuzzy ARTMAP,
       probabilistic nets, self-organizing map, LVQ, Boltzmann,
       BSB, SPR, etc.
       Extendable with optional packages.
       ExplainNet, Flashcode (compiles net into .c code for runtime),
       user-defined I/O in C possible. ExplainNet (to eliminate
       extra inputs), pruning, savebest, graphing instruments like
       correlation, Hinton diagrams, RMS error graphs, etc.
     Operating system   : PC,Sun,IBM RS6000,Apple Macintosh,SGI,Dec,HP.
     System requirements: varies. PC:2MB extended memory+6MB Harddisk space.
                          Uses windows compatible memory driver (extended).
                          Uses extended memory.
     Approx. price      : call (depends on platform)
     Comments           : award winning documentation, one of the market
                          leaders in NN software.

    MATLAB Neural Network Toolbox

       Contact: The MathWorks, Inc.     Phone: 508-653-1415
                24 Prime Park Way       FAX: 508-653-2997
                Natick, MA 01760 email: info@mathworks.com
    The Neural Network Toolbox is a powerful collection of MATLAB functions for the design, training, and simulation of neural networks. It supports a wide range of network architectures with an unlimited number of processing elements and interconnections (up to operating system constraints). Supported architectures and training methods include: supervised training of feedforward networks using the perceptron learning rule, Widrow-Hoff rule, several variations on backpropagation (including the fast Levenberg-Marquardt algorithm), and radial basis networks; supervised training of recurrent Elman networks; unsupervised training of associative networks including competitive and feature map layers; Kohonen networks, self-organizing maps, and learning vector quantization. The Neural Network Toolbox contains a textbook-quality Users' Guide and uses tutorials, reference materials, and sample applications with code examples to explain the design and use of each network architecture and paradigm. The Toolbox is delivered as MATLAB M-files, enabling users to see the algorithms and implementations, as well as to make changes or create new functions to address a specific application.

     (Comment from Nigel Dodd, nd@neural.win-uk.net): there is also a free Neural Network Based System Identification Toolbox available from http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html that contains many of the supervised training algorithms, some of which are duplicated in C code which should run faster. This free toolbox does regularisation and pruning which the costly one doesn't attempt (as of Nov 1995).

    (Message from Eric A. Wan, ericwan@eeap.ogi.edu) FIR/TDNN Toolbox for MATLAB: Beta version of a toolbox for FIR (Finite Impulse Response) and TD (Time Delay) Neural Networks. For efficient stochastic implementation, algorithms are written as MEX compatible c-code which can be called as primitive functions from within MATLAB. Both source and compiled functions are available. URL: http://www.eeap.ogi.edu/~ericwan/fir.html
     
     

    Propagator

      Contact: ARD Corporation,
               9151 Rumsey Road, Columbia, MD  21045, USA
               propagator@ard.com
      Easy to use neural network training package.  A GUI implementation of
      backpropagation networks with five layers (32,000 nodes per layer).
      Features dynamic performance graphs, training with a validation set,
      and C/C++ source code generation.
      For Sun (Solaris 1.x & 2.x, $499),
          PC  (Windows 3.x, $199)
          Mac (System 7.x, $199)
      Floating point coprocessor required, Educational Discount,
      Money Back Guarantee, Multi User Discount
      See http://www.cs.umbc.edu/~zwa/Gator/Description.html
      Windows Demo on:
        nic.funet.fi        /pub/msdos/windows/demo
        oak.oakland.edu     /pub/msdos/neural_nets
            gatordem.zip    pkzip 2.04g archive file
            gatordem.txt    readme text file

    NeuroForecaster & VisuaData

    Name: NeuroForecaster(TM)/Genetica 4.1a
    Contact: Accel Infotech (S) Pte Ltd; 648 Geylang Road; Republic of Singapore 1438;
    Phone: +65-7446863, 3366997; Fax: +65-3362833, Internet: accel@technet.sg, accel@singapore.com

     Neuroforecaster 4.1a for Windows is priced at US$1199 per single-user license. Please email us (accel@technet.sg) for an order form.

    For more information and evaluation copy please visit http://www.singapore.com/products/nfga.
    NeuroForecaster is a user-friendly MS-Windows neural network program specifically designed for building sophisticated and powerful forecasting and decision-support systems (Time-Series Forecasting, Cross-Sectional Classification, Indicator Analysis).
    Features:

  • GENETICA Net Builder Option for automatic network optimization
  • 12 Neuro-Fuzzy Network Models
  • Multitasking & Background Training Mode
  • Unlimited Network Capacity
  • Rescaled Range Analysis & Hurst Exponent to Unveil Hidden Market Cycles & Check for Predictability
  • Correlation Analysis to Compute Correlation Factors to Analyze the Significance of Indicators
  • Weight Histogram to Monitor the Progress of Learning
  • Accumulated Error Analysis to Analyze the Strength of Input Indicators
    ------------------------------------------------------------------------
    Next part is part 7 (of 7). Previous part is part 5.

    Subject: Neural Network hardware?

    Thomas Lindblad notes on 96-12-30:
    The reactive tabu search algorithm has been implemented by the Italians, in Trento. ISA and VME and soon PCI boards are available. We tested the system with the IRIS and SATIMAGE data and it did better than most other chips.

    The Neuroclassifier is still available from Holland and is also the fastest NN chip, with a transient time of less than 100 ns.

    JPL is making another chip, ARL in WDC is making another, so there are a few things going on ...
     
     

    Overview articles:
  • Ienne, Paolo and Kuhn, Gary (1995), "Digital Systems for Neural Networks", in Papamichalis, P. and Kerwin, R., eds., Digital Signal Processing Technology, Critical Reviews Series CR57 Orlando, FL: SPIE Optical Engineering, pp 314-45, ftp://mantraftp.epfl.ch/mantra/ienne.spie95.A4.ps.gz or ftp://mantraftp.epfl.ch/mantra/ienne.spie95.US.ps.gz
  • ftp://ftp.mrc-apu.cam.ac.uk/pub/nn/murre/neurhard.ps (1995)
  • ftp://ftp.urc.tue.nl/pub/neural/hardware_general.ps.gz (1993)
  • Various NN hardware information can be found on the Web site http://www1.cern.ch/NeuralNets/nnwInHepHard.html (from people who really use such stuff!). Several applications are described at http://www1.cern.ch/NeuralNets/nnwInHepExpt.html

    More information on NN chips can be obtained from the Electronic Engineers Toolbox web page. Go to http://www.eg3.com/ebox.htm, type "neural" in the quick search box, click on "chip co's" and then on "search".

    Further WWW pointers to NN Hardware:

  • http://msia02.msi.se/~lindsey/nnwAtm.html
  • http://www.genobyte.com/article.html

    Here is a short list of companies:

    HNC, INC.

      HNC Inc.
      5930 Cornerstone Court West
      San Diego, CA 92121-3728
    
      619-546-8877  Phone
      619-452-6524  Fax
      HNC markets:
       Database Mining Workstation (DMW), a PC based system that
       builds models of relationships and patterns in data. AND
       The SIMD Numerical Array Processor (SNAP). It is an attached
       parallel array processor in a VME chassis with between 16 and 64 parallel
       floating point processors. It provides between 640 MFLOPS and 2.56 GFLOPS
       for neural network and signal processing applications.  A Sun SPARCstation
       serves as the host.  The SNAP won the IEEE 1993 Gordon Bell Prize for best
       price/performance for supercomputer class systems.

    SAIC (Science Applications International Corporation)

       10260 Campus Point Drive
       MS 71, San Diego
       CA 92121
       (619) 546 6148
       Fax: (619) 546 6736

    Micro Devices

       30 Skyline Drive
       Lake Mary
       FL 32746-6201
       (407) 333-4379
       MicroDevices makes   MD1220 - 'Neural Bit Slice'
       Each of the products mentioned so far has a very different usage.
       Although this sounds similar to Intel's product, the
       architectures are not.

    Intel Corp

       2250 Mission College Blvd
       Santa Clara, Ca 95052-8125
       Attn ETANN, Mail Stop SC9-40
       (408) 765-9235
       Intel was making an experimental chip (which is no longer produced):
       80170NW - Electrically trainable Analog Neural Network (ETANN)
       It has 64 'neurons' on it - almost fully internally connected,
       and the chip can be put in a hierarchical architecture to do 2 billion
       interconnects per second.
       Support software by
         California Scientific Software
         10141 Evening Star Dr #6
         Grass Valley, CA 95945-9051
         (916) 477-7481
       Their product is called 'BrainMaker'.

    NeuralWare, Inc

       Penn Center West
       Bldg IV Suite 227
       Pittsburgh
       PA 15276
       They only sell software/simulator but for many platforms.

    Tubb Research Limited

       7a Lavant Street
       Peterfield
       Hampshire
       GU32 2EL
       United Kingdom
       Tel: +44 730 60256

    Adaptive Solutions Inc

       1400 NW Compton Drive
       Suite 340
       Beaverton, OR 97006
       U. S. A.
       Tel: 503-690-1236;   FAX: 503-690-1249

    NeuroDynamX, Inc.

       P.O. Box 14
       Marion, OH  43301-0014
       Voice (740) 387-5074
       Fax: (740) 382-4533
       Internet:  jwrogers@on-ramp.net
       http://www.neurodynamx.com
    
       InfoTech Software Engineering purchased the software and
       trademarks from NeuroDynamX, Inc. and, using the NeuroDynamX tradename,
       continues to publish the DynaMind, DynaMind Developer Pro and iDynaMind
       software packages.
     
    And here is an incomplete overview of known neural computers, each with its most recent known reference.
     
     
    Digital -- Special Computers
    
    AAP-2
    Takumi Watanabe, Yoshi Sugiyama, Toshio Kondo, and Yoshihiro Kitamura.
    Neural network simulation on a massively parallel cellular array
    processor: AAP-2.
    In International Joint Conference on Neural Networks, 1989.
    
    ANNA
    B. E. Boser, E. Sackinger, J. Bromley, Y. LeCun, and L. D. Jackel.
    Hardware Requirements for Neural Network Pattern Classifiers.
    In IEEE Micro, 12(1), pages 32-40, February 1992.
    
    Analog Neural Computer
    Paul Mueller et al.
    Design and performance of a prototype analog neural computer.
    In Neurocomputing, 4(6):311-323, 1992.
    
    APx -- Array Processor Accelerator
    F. Pazienti.
    Neural networks simulation with array processors.
    In Advanced Computer Technology, Reliable Systems and Applications;
    Proceedings of the 5th Annual Computer Conference, pages 547-551.
    IEEE Comput. Soc. Press, May 1991. ISBN: 0-8186-2141-9.
    
    ASP -- Associative String Processor
    A. Krikelis.
    A novel massively associative processing architecture for the
    implementation of artificial neural networks.
    In 1991 International Conference on Acoustics, Speech and
    Signal Processing, volume 2, pages 1057-1060. IEEE Comput. Soc. Press,
    May 1991.
    
    BSP400
    Jan N.H. Heemskerk, Jacob M.J. Murre, Jaap Hoekstra, Leon H.J.G.
    Kemna, and Patrick T.W. Hudson.
    The BSP400: A modular neurocomputer assembled from 400 low-cost
    microprocessors.
    In International Conference on Artificial Neural Networks. Elsevier
    Science, 1991.
    
    BLAST
    J. G. Elias, M. D. Fisher, and C. M. Monemi.
    A multiprocessor machine for large-scale neural network simulation.
    In IJCNN91-Seattle: International Joint Conference on Neural
    Networks, volume 1, pages 469-474. IEEE Comput. Soc. Press, July 1991.
    ISBN: 0-7883-0164-1.
    
    CNAPS Neurocomputer
    H. McCartor.
    Back Propagation Implementation on the Adaptive Solutions CNAPS
    Neurocomputer.
    In Advances in Neural Information Processing Systems, 3, 1991.
    
    GENES IV and MANTRA I
    Paolo Ienne and Marc A. Viredaz.
    GENES IV: A Bit-Serial Processing Element for a Multi-Model
    Neural-Network Accelerator.
    Journal of VLSI Signal Processing, volume 9, no. 3, pages 257-273, 1995.
    
    MA16 -- Neural Signal Processor
    U. Ramacher, J. Beichter, and N. Bruls.
    Architecture of a general-purpose neural signal processor.
    In IJCNN91-Seattle: International Joint Conference on Neural
    Networks, volume 1, pages 443-446. IEEE Comput. Soc. Press, July 1991.
    ISBN: 0-7083-0164-1.
    
    Mindshape
    Jan N.H. Heemskerk, Jacob M.J. Murre, Arend Melissant, Mirko Pelgrom,
    and Patrick T.W. Hudson.
    Mindshape: a neurocomputer concept based on a fractal architecture.
    In International Conference on Artificial Neural Networks. Elsevier
    Science, 1992.
    
    mod 2
    Michael L. Mumford, David K. Andes, and Lynn R. Kern.
    The mod 2 neurocomputer system design.
    In IEEE Transactions on Neural Networks, 3(3):423-433, 1992.
    
    NERV
    R. Hauser, H. Horner, R. Maenner, and M. Makhaniok.
    Architectural Considerations for NERV - a General Purpose Neural
    Network Simulation System.
    In Workshop on Parallel Processing: Logic, Organization and
    Technology -- WOPPLOT 89, pages 183-195. Springer Verlag, March 1989.
    ISBN: 3-5405-5027-5.
    
    NP -- Neural Processor
    D. A. Orrey, D. J. Myers, and J. M. Vincent.
    A high performance digital processor for implementing large artificial
    neural networks.
    In Proceedings of the IEEE 1991 Custom Integrated Circuits
    Conference, pages 16.3/1-4. IEEE Comput. Soc. Press, May 1991.
    ISBN: 0-7883-0015-7.
    
    RAP -- Ring Array Processor
    N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, and J. Beer.
    The ring array processor: A multiprocessing peripheral for connectionist
    applications.
    In Journal of Parallel and Distributed Computing, pages
    248-259, April 1992.
    
    RENNS -- REconfigurable Neural Networks Server
    O. Landsverk, J. Greipsland, J. A. Mathisen, J. G. Solheim, and L. Utne.
    RENNS - a Reconfigurable Computer System for Simulating Artificial
    Neural Network Algorithms.
    In Parallel and Distributed Computing Systems, Proceedings of the
    ISMM 5th International Conference, pages 251-256. The International
    Society for Mini and Microcomputers - ISMM, October 1992.
    ISBN: 1-8808-4302-1.
    
    SMART -- Sparse Matrix Adaptive and Recursive Transforms
    P. Bessière, A. Chams, A. Guerin, J. Herault, C. Jutten, and J. C. Lawson.
    From Hardware to Software: Designing a "Neurostation".
    In VLSI design of Neural Networks, pages 311-335, June 1990.
    
    SNAP -- Scalable Neurocomputer Array Processor
    E. Wojciechowski.
    SNAP: A parallel processor for implementing real time neural networks.
    In Proceedings of the IEEE 1991 National Aerospace and Electronics
    Conference; NAECON-91, volume 2, pages 736-742. IEEE Comput. Soc. Press,
    May 1991.
    
    Toroidal Neural Network Processor
    S. Jones, K. Sammut, C. Nielsen, and J. Staunstrup.
    Toroidal Neural Network: Architecture and Processor Granularity
    Issues.
    In VLSI design of Neural Networks, pages 229-254, June 1990.
    
    SMART and SuperNode
    P. Bessière, A. Chams, and P. Chol.
    MENTAL: A virtual machine approach to artificial neural networks
    programming. In NERVES, ESPRIT B.R.A. project no 3049, 1991.
    
    
    Digital -- Standard Computers
    
    EMMA-2
    R. Battiti, L. M. Briano, R. Cecinati, A. M. Colla, and P. Guido.
    An application oriented development environment for Neural Net models on
    multiprocessor Emma-2.
    In Silicon Architectures for Neural Nets; Proceedings for the IFIP
    WG.10.5 Workshop, pages 31-43. North Holland, November 1991.
    ISBN: 0-4448-9113-7.
    
    iPSC/860 Hypercube
    D. Jackson and D. Hammerstrom.
    Distributing Back Propagation Networks Over the Intel iPSC/860
    Hypercube.
    In IJCNN91-Seattle: International Joint Conference on Neural
    Networks, volume 1, pages 569-574. IEEE Comput. Soc. Press, July 1991.
    ISBN: 0-7083-0164-1.
    
    SCAP -- Systolic/Cellular Array Processor
    Wei-Ling L., V. K. Prasanna, and K. W. Przytula.
    Algorithmic Mapping of Neural Network Models onto Parallel SIMD
    Machines.
    In IEEE Transactions on Computers, 40(12), pages 1390-1401,
    December 1991. ISSN: 0018-9340.
    ------------------------------------------------------------------------

    Subject: What are some applications of NNs?

    There are vast numbers of published neural network applications. If you don't find something from your field of interest below, try a web search. Here are some useful search engines:
    http://www.google.com/
    http://search.yahoo.com/
    http://www.altavista.com/
    http://www.deja.com/
     
     

    General

  • The Pacific Northwest National Laboratory: http://www.emsl.pnl.gov:2080/proj/neuron/neural/ including a list of commercial applications at http://www.emsl.pnl.gov:2080/proj/neuron/neural/products/
  • The Stimulation Initiative for European Neural Applications: http://www.mbfys.kun.nl/snn/siena/cases/
  • The DTI NeuroComputing Web's Applications Portfolio: http://www.globalweb.co.uk/nctt/portfolo/
  • The Applications Corner, NeuroDimension, Inc.: http://www.nd.com/appcornr/purpose.htm
  • The BioComp Systems, Inc. Solutions page: http://www.bio-comp.com
  • Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY: McGraw-Hill, ISBN 0-07-011189-8.
  • The series Advances in Neural Information Processing Systems, containing the proceedings of the conference of the same name, published yearly by Morgan Kaufmann starting in 1989 and by The MIT Press starting in 1995.
    Agriculture

  • P.H. Heinemann, Automated Grading of Produce: http://server.age.psu.edu/dept/fac/Heinemann/phhdocs/visionres.html
  • Deck, S., C.T. Morrow, P.H. Heinemann, and H.J. Sommer, III. 1995. Comparison of a neural network and traditional classifier for machine vision inspection. Applied Engineering in Agriculture. 11(2):319-326.
  • Tao, Y., P.H. Heinemann, Z. Varghese, C.T. Morrow, and H.J. Sommer III. 1995. Machine vision for color inspection of potatoes and apples. Transactions of the American Society of Agricultural Engineers. 38(5):1555-1561.
    Chemistry

  • PNNL, General Applications of Neural Networks in Chemistry and Chemical Engineering: http://www.emsl.pnl.gov:2080/proj/neuron/neural/bib/chemistry.html.
  • Prof. Dr. Johann Gasteiger, Neural Networks and Genetic Algorithms in Chemistry: http://www2.ccc.uni-erlangen.de/publications/publ_topics/publ_topics-12.html
  • Roy Goodacre, pyrolysis mass spectrometry: http://gepasi.dbs.aber.ac.uk/roy/pymshome.htm and Fourier transform infrared (FT-IR) spectroscopy: http://gepasi.dbs.aber.ac.uk/roy/ftir/ftirhome.htm contain applications of a variety of NNs as well as PLS (partial least squares) and other statistical methods.
  • Situs, a program package for the docking of protein crystal structures to single-molecule, low-resolution maps from electron microscopy or small angle X-ray scattering: http://chemcca10.ucsd.edu/~situs/
  • An on-line application of a Kohonen network with a 2-dimensional output layer for prediction of protein secondary structure percentages from UV circular dichroism spectra: http://kal-el.ugr.es/k2d/spectra.html.
    Finance and economics

  • Athanasios Episcopos, References on Neural Net Applications to Finance and Economics: http://www.compulink.gr/users/episcopo/neurofin.html 
  • Trippi, R.R. & Turban, E. (1993), Neural Networks in Finance and Investing, Chicago: Probus.
    Games

  • General:

  • Jay Scott, Machine Learning in Games: http://forum.swarthmore.edu/~jay/learn-game/index.html
    METAGAME Game-Playing Workbench: ftp://ftp.cl.cam.ac.uk/users/bdp/METAGAME
    R.S. Sutton, "Learning to predict by the methods of temporal differences", Machine Learning 3, p. 9-44 (1988).
    David E. Moriarty and Risto Miikkulainen (1994). "Evolving Neural Networks to Focus Minimax Search," In Proceedings of Twelfth National Conference on Artificial Intelligence (AAAI-94, Seattle, WA), 1371-1377. Cambridge, MA: MIT Press, http://www.cs.utexas.edu/users/nn/pages/publications/neuro-evolution.html
  • Backgammon:

  • G. Tesauro and T.J. Sejnowski (1989), "A Parallel Network that learns to play Backgammon," Artificial Intelligence, vol 39, pp. 357-390.
    G. Tesauro and T.J. Sejnowski (1990), "Neurogammon: A Neural Network Backgammon Program," IJCNN Proceedings, vol 3, pp. 33-39, 1990.
  • Bridge:

  • METAGAME: ftp://ftp.cl.cam.ac.uk/users/bdp/bridge.ps.Z
    He Yo, Zhen Xianjun, Ye Yizheng, Li Zhongrong (19??), "Knowledge acquisition and reasoning based on neural networks - the research of a bridge bidding system," INNC '90, Paris, vol 1, pp. 416-423.
    M. Kohle and F. Schonbauer (19??), "Experience gained with a neural network that learns to play bridge," Proc. of the 5th Austrian Artificial Intelligence meeting, pp. 224-229.
  • Checkers (not NNs, but classic papers):

  • A.L. Samuel (1959), "Some studies in machine learning using the game of checkers," IBM journal of Research and Development, vol 3, nr. 3, pp. 210-229.
    A.L. Samuel (1967), "Some studies in machine learning using the game of checkers 2 - recent progress," IBM journal of Research and Development, vol 11, nr. 6, pp. 601-616.
  • Chess:

  • Sebastian Thrun, NeuroChess: http://forum.swarthmore.edu/~jay/learn-game/systems/neurochess.html
    Luke Pellen, Octavius: http://home.seol.net.au/luke/octavius/
  • Go:

  • David Stoutamire (19??), "Machine Learning, Game Play, and Go," Center for Automation and Intelligent Systems Research TR 91-128, Case Western Reserve University. http://www.stoutamire.com/david/publications.html
    David Stoutamire (1991), Machine Learning Applied to Go, M.S. thesis, Case Western Reserve University, ftp://ftp.cl.cam.ac.uk/users/bdp/go.ps.Z
    Norman Richards, David Moriarty, and Risto Miikkulainen (1998), "Evolving Neural Networks to Play Go," Applied Intelligence, 8, 85-96, http://www.cs.utexas.edu/users/nn/pages/publications/neuro-evolution.html
    Markus Enzenberger (1996), "The Integration of A Priori Knowledge into a Go Playing Neural Network," http://www.cgl.ucsf.edu/go/Programs/neurogo-html/neurogo.html
    Schraudolph, N., Dayan, P., Sejnowski, T. (1994), "Temporal Difference Learning of Position Evaluation in the Game of Go," In: Neural Information Processing Systems 6, Morgan Kaufmann 1994, ftp://bsdserver.ucsf.edu/Go/comp/td-go.ps.Z
  • Go-Moku:

  • Freisleben, B., "Teaching a Neural Network to Play GO-MOKU," in I. Aleksander and J. Taylor, eds, Artificial Neural Networks 2, Proc. of ICANN-92, Brighton UK, vol. 2, pp. 1659-1662, Elsevier Science Publishers, 1992
    Katz, W.T. and Pham, S.P. "Experience-Based Learning Experiments using Go-moku", Proc. of the 1991 IEEE International Conference on Systems, Man, and Cybernetics, 2: 1405-1410, October 1991.
  • Reversi/Othello:

  • David E. Moriarty and Risto Miikkulainen (1995). Discovering Complex Othello Strategies through Evolutionary Neural Networks. Connection Science, 7, 195-209, http://www.cs.utexas.edu/users/nn/pages/publications/neuro-evolution.html
  • Tic-Tac-Toe:

  • Richard S. Sutton and Andrew G. Barto (1998), Reinforcement Learning: An Introduction, The MIT Press, ISBN: 0262193981
     
     

    Industry

  • PNNL, Neural Network Applications in Manufacturing: http://www.emsl.pnl.gov:2080/proj/neuron/neural/bib/manufacturing.html.
  • PNNL, Applications in the Electric Power Industry: http://www.emsl.pnl.gov:2080/proj/neuron/neural/bib/power.html.
  • PNNL, Process Control: http://www.emsl.pnl.gov:2080/proj/neuron/neural/bib/process.html.
  • Raoul Tawel, Ken Marko, and Lee Feldkamp (1998), "Custom VLSI ASIC for Automotive Applications with Recurrent Networks", http://www.jpl.nasa.gov/releases/98/ijcnn98.pdf
  • Otsuka, Y. et al. "Neural Networks and Pattern Recognition of Blast Furnace Operation Data" Kobelco Technology Review, Oct. 1992, 12
  • Otsuka, Y. et al. "Applications of Neural Network to Iron and Steel Making Processes" 2. International Conference on Fuzzy Logic and Neural Networks, Iizuka, 1992
  • Staib, W.E. "Neural Network Control System for Electric Arc Furnaces" M.P.T. International, 2/1995, 58-61
  • Portmann, N. et al. "Application of Neural Networks in Rolling Automation" Iron and Steel Engineer, Feb. 1995, 33-36
  • Murat, M. E., and Rudman, A. J., 1992, Automated first arrival picking: A neural network approach: Geophysical Prospecting, 40, 587-604.
    Materials science

  • Phase Transformations Research Group (search for "neural"): http://www.msm.cam.ac.uk/phase-trans/pubs/ptpuball.html
    Medicine

  • PNNL, Applications in Medicine and Health: http://www.emsl.pnl.gov:2080/proj/neuron/neural/bib/medicine.html.
    Music

  • Mozer, M. C. (1994), "Neural network music composition by prediction: Exploring the benefits of psychophysical constraints and multiscale processing," Connection Science, 6, 247-280, http://www.cs.colorado.edu/~mozer/papers/music.html.
    Robotics

  • Institute of Robotics and System Dynamics: http://www.robotic.dlr.de/LEARNING/
  • UC Berkeley Robotics and Intelligent Machines Lab: http://robotics.eecs.berkeley.edu/
  • Perth Robotics and Automation Laboratory: http://telerobot.mech.uwa.edu.au/
  • University of New Hampshire Robot Lab: http://www.ece.unh.edu/robots/rbt_home.htm
    Weather forecasting and atmospheric science

  • UBC Climate Prediction Group: http://www.ocgy.ubc.ca/projects/clim.pred/index.html
  • Artificial Intelligence Research In Environmental Science: http://www.salinas.net/~jpeak/airies/airies.html
  • MET-AI, a mailing list for meteorologists and AI researchers: http://www.comp.vuw.ac.nz/Research/met-ai
  • Caren Marzban, Ph.D., Research Scientist, National Severe Storms Laboratory: http://www.nhn.ou.edu/~marzban/
  • David Myers's references on NNs in atmospheric science: http://terra.msrc.sunysb.edu/~dmyers/ai_refs
    Weird

    Zaknich, Anthony and Baker, Sue K. (1998), "A real-time system for the characterisation of sheep feeding phases from acoustic signals of jaw sounds," Australian Journal of Intelligent Information Processing Systems (AJIIPS), Vol. 5, No. 2, Winter 1998.

    Abstract
    This paper describes a four-channel real-time system for the detection and measurement of sheep rumination and mastication time periods by the analysis of jaw sounds transmitted through the skull. The system is implemented using an 80486 personal computer, a proprietary data acquisition card (PC-126) and a custom made variable gain preamplifier and bandpass filter module. Chewing sounds are transduced and transmitted to the system using radio microphones attached to the top of the sheep heads. The system's main functions are to detect and estimate rumination and mastication time periods, to estimate the number of chews during the rumination and mastication periods, and to provide estimates of the number of boli in the rumination sequences and the number of chews per bolus. The individual chews are identified using a special energy threshold detector. The rumination and mastication time periods are determined by neural network classifier using a combination of time and frequency domain features extracted from successive 10 second acoustic signal blocks.

    ------------------------------------------------------------------------

    Subject: What to do with missing/incomplete data?

    The problem of missing data is very complex.

    For unsupervised learning, conventional statistical methods for missing data are often appropriate (Little and Rubin, 1987; Schafer, 1997). There is a concise introduction to these methods in the University of Texas statistics FAQ at http://www.utexas.edu/cc/faqs/stat/general/gen25.html.

    For supervised learning, the considerations are somewhat different, as discussed by Sarle (1998). The statistical literature on missing data deals almost exclusively with training rather than prediction (e.g., Little, 1992). For example, if you have only a small proportion of cases with missing data, you can simply throw those cases out for purposes of training; if you want to make predictions for cases with missing inputs, you don't have the option of throwing those cases out! In theory, Bayesian methods take care of everything, but a full Bayesian analysis is practical only with special models (such as multivariate normal distributions) or small sample sizes. The neural net literature contains a few good papers that cover prediction with missing inputs (e.g., Ghahramani and Jordan, 1997; Tresp, Neuneier, and Ahmad 1995), but much research remains to be done.
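
    As a crude illustration (not the method of any of the papers cited below), the following sketch drops incomplete cases for training and fills missing inputs with training-set means when a prediction is needed; the function net_predict is a hypothetical stand-in for whatever trained network you use, and missing values are assumed to be coded as NaN.

      import numpy as np

      # Crude sketch only: listwise deletion for training, mean imputation
      # for prediction.  Missing values are coded as NaN.

      def complete_cases(X, y):
          """Keep only the rows of (X, y) with no missing inputs."""
          keep = ~np.isnan(X).any(axis=1)
          return X[keep], y[keep]

      def impute_with_means(x_new, column_means):
          """Replace NaNs in one input vector by the training-set means."""
          return np.where(np.isnan(x_new), column_means, x_new)

      # Usage sketch (net_predict is hypothetical):
      #   X_cc, y_cc = complete_cases(X_train, y_train)
      #   ... train the network on (X_cc, y_cc) ...
      #   column_means = np.nanmean(X_train, axis=0)
      #   y_hat = net_predict(impute_with_means(x_new, column_means))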

    References:

    Donner, A. (1982), "The relative effectiveness of procedures commonly used in multiple regression analysis for dealing with missing values," American Statistician, 36, 378-381.

    Ghahramani, Z. and Jordan, M.I. (1994), "Supervised learning from incomplete data via an EM approach," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufman, pp. 120-127.

    Ghahramani, Z. and Jordan, M.I. (1997), "Mixture models for Learning from incomplete data," in Greiner, R., Petsche, T., and Hanson, S.J. (eds.) Computational Learning Theory and Natural Learning Systems, Volume IV: Making Learning Systems Practical, Cambridge, MA: The MIT Press, pp. 67-85.

    Jones, M.P. (1996), "Indicator and stratification methods for missing explanatory variables in multiple linear regression," J. of the American Statistical Association, 91, 222-230.

    Little, R.J.A. (1992), "Regression with missing X's: A review," J. of the American Statistical Association, 87, 1227-1237.

    Little, R.J.A. and Rubin, D.B. (1987), Statistical Analysis with Missing Data, NY: Wiley.

    McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition, Wiley.

    Sarle, W.S. (1998), "Prediction with Missing Inputs," in Wang, P.P. (ed.), JCIS '98 Proceedings, Vol II, Research Triangle Park, NC, 399-402, ftp://ftp.sas.com/pub/neural/JCIS98.ps.

    Schafer, J.L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman & Hall, ISBN 0 412 04061 1.

    Tresp, V., Ahmad, S. and Neuneier, R., (1994), "Training neural networks with deficient data", in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufman, pp. 128-135.

    Tresp, V., Neuneier, R., and Ahmad, S. (1995), "Efficient methods for dealing with missing data in supervised learning", in Tesauro, G., Touretzky, D.S., and Leen, T.K. (eds.) Advances in Neural Information Processing Systems 7, Cambridge, MA: The MIT Press, pp. 689-696.

    ------------------------------------------------------------------------

    Subject: How to forecast time series (temporal sequences)?

    In most of this FAQ, it is assumed that the training cases are statistically independent. That is, the training cases consist of pairs of input and target vectors, (X_i,Y_i), i=1,...,N, such that the conditional distribution of Y_i given all the other training data, (X_j, j=1,...,N, and Y_j, j=1,...i-1,i+1,...N) is equal to the conditional distribution of Y_i given X_i regardless of the values in the other training cases. Independence of cases is often achieved by random sampling.

    The most common violation of the independence assumption occurs when cases are observed in a certain order relating to time or space. That is, case (X_i,Y_i) corresponds to time T_i, with T_1 < T_2 < ... < T_N. It is assumed that the current target Y_i may depend not only on X_i but also on recent past values of X and Y. If the T_i are equally spaced, the simplest way to deal with this dependence is to include additional inputs (called lagged variables, shift registers, or a tapped delay line) in the network. Thus, for target Y_i, the inputs may include X_i, Y_{i-1}, X_{i-1}, Y_{i-2}, X_{i-2}, etc. (In some situations, X_i would not be known at the time you are trying to forecast Y_i and would therefore be excluded from the inputs.) Then you can train an ordinary feedforward network with these targets and lagged variables. The use of lagged variables has been extensively studied in the statistical and econometric literature (Judge, Griffiths, Hill, L\"utkepohl and Lee, 1985). A network in which the only inputs are lagged target values is called an "autoregressive model." The input space that includes all of the lagged variables is called the "embedding space."
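
    Here is a minimal sketch of building such lagged inputs, assuming an equally spaced, purely autoregressive setup (only lagged target values as inputs); the function name and the choice of numpy are illustrative only.

      import numpy as np

      # Build a "tapped delay line": row j of `inputs` holds the p values
      # [y[j+p-1], ..., y[j]] and targets[j] = y[j+p], so the p most recent
      # values predict the next one with an ordinary feedforward net.

      def make_lagged_data(y, p):
          y = np.asarray(y, dtype=float)
          inputs = np.column_stack(
              [y[p - k - 1 : len(y) - k - 1] for k in range(p)])
          targets = y[p:]
          return inputs, targets

      # Example with p = 3 lags:
      #   inputs, targets = make_lagged_data(np.arange(10.0), 3)
      #   inputs[0] is [2., 1., 0.] and targets[0] is 3.0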

    If the T_i are not equally spaced, everything gets much more complicated. One approach is to use a smoothing technique to interpolate points at equally spaced intervals, and then use the interpolated values for training instead of the original data.
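
    For instance, one might resample the series onto an equally spaced grid before building lagged inputs; this hypothetical sketch uses plain linear interpolation purely for illustration (a smoother interpolant, such as splines or kernel smoothing, may be preferable for noisy data).

      import numpy as np

      # Interpolate an unequally spaced series (t, y) onto n_points
      # equally spaced times; t must be strictly increasing.

      def resample_equally_spaced(t, y, n_points):
          t = np.asarray(t, dtype=float)
          y = np.asarray(y, dtype=float)
          t_grid = np.linspace(t[0], t[-1], n_points)
          return t_grid, np.interp(t_grid, t, y)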

    Use of lagged variables increases the number of decisions that must be made during training, since you must consider which lags to include in the network, as well as which input variables, how many hidden units, etc. Neural network researchers have therefore attempted to use partially recurrent networks instead of feedforward networks with lags (Weigend and Gershenfeld, 1994). Recurrent networks store information about past values in the network itself. There are many different kinds of recurrent architectures (Hertz, Krogh, and Palmer 1991; Mozer, 1994; Horne and Giles, 1995; Kremer, 199?). For example, in time-delay neural networks (Lang, Waibel, and Hinton 1990), the outputs for predicting target Y_{i-1} are used as inputs when processing target Y_i. Jordan networks (Jordan, 1986) are similar to time-delay neural networks except that the feedback is an exponential smooth of the sequence of output values. In Elman networks (Elman, 1990), the hidden unit activations that occur when processing target Y_{i-1} are used as inputs when processing target Y_i.

    However, there are some problems that cannot be dealt with via recurrent networks alone. For example, many time series exhibit trend, meaning that the target values tend to go up over time, or that the target values tend to go down over time. For example, stock prices and many other financial variables usually go up. If today's price is higher than all previous prices, and you try to forecast tomorrow's price using today's price as a lagged input, you are extrapolating, and extrapolating is unreliable. The simplest methods for handling trend are:

  • First fit a linear regression predicting the target values from the time, Y_i = a + b T_i + noise, where a and b are regression weights. Compute residuals R_i = Y_i - (a + b T_i). Then train the network using R_i for the target and lagged values. This method is rather crude but may work for deterministic linear trends. Of course, for nonlinear trends, you would need to fit a nonlinear regression.
  • Instead of using Y_i as a target, use D_i = Y_i - Y_{i-1} for the target and lagged values. This is called differencing and is the standard statistical method for handling nondeterministic (stochastic) trends. Sometimes it is necessary to compute differences of differences.
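
    A minimal sketch of both methods follows; the function names and parameter choices are illustrative only.

      import numpy as np

      def linear_detrend(t, y):
          """Fit Y_i = a + b*T_i by least squares and return residuals R_i."""
          t = np.asarray(t, dtype=float)
          y = np.asarray(y, dtype=float)
          b, a = np.polyfit(t, y, deg=1)   # slope first, then intercept
          return y - (a + b * t)

      def difference(y, d=1):
          """Apply first differences D_i = Y_i - Y_{i-1}, d times."""
          y = np.asarray(y, dtype=float)
          for _ in range(d):
              y = np.diff(y)
          return y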


    For an elementary discussion of trend and various other practical problems in forecasting time series with NNs, such as seasonality, see Masters (1993). For a more advanced discussion of NN forecasting of economic series, see Moody (1998).

    There are several different ways to compute forecasts. For simplicity, let's assume you have a simple time series, Y_1, ..., Y_99, you want to forecast future values Y_f for f > 99, and you decide to use three lagged values as inputs. The possibilities include:

    Single-step, one-step-ahead, or open-loop forecasting:
    Train a network with target Y_i and inputs Y_{i-1}, Y_{i-2}, and Y_{i-3}. Let the scalar function computed by the network be designated as Net(.,.,.) taking the three input values as arguments and returning the output (predicted) value. Then:

    forecast Y_100 as Net(Y_99,Y_98,Y_97)
    forecast Y_101 as Net(Y_100,Y_99,Y_98)
    forecast Y_102 as Net(Y_101,Y_100,Y_99)
    forecast Y_103 as Net(Y_102,Y_101,Y_100)
    forecast Y_104 as Net(Y_103,Y_102,Y_101)
    and so on.
    Multi-step or closed-loop forecasting:
    Train the network as above, but:

    forecast Y_100 as P_100 = Net(Y_99,Y_98,Y_97)
    forecast Y_101 as P_101 = Net(P_100,Y_99,Y_98)
    forecast Y_102 as P_102 = Net(P_101,P_100,Y_99)
    forecast Y_103 as P_103 = Net(P_102,P_101,P_100)
    forecast Y_104 as P_104 = Net(P_103,P_102,P_101)
    and so on.
    N-step-ahead forecasting:
    For, say, N=3, train the network as above, but:

    compute P_100 = Net(Y_99,Y_98,Y_97)
    compute P_101 = Net(P_100,Y_99,Y_98)
    forecast Y_102 as P_102 = Net(P_101,P_100,Y_99)
    forecast Y_103 as P_103 = Net(P_102,P_101,Y_100)
    forecast Y_104 as P_104 = Net(P_103,P_102,Y_101)
    and so on.
    Direct simultaneous long-term forecasting:
    Train a network with multiple targets Y_i, Y_{i+1}, and Y_{i+2} and inputs Y_{i-1}, Y_{i-2}, and Y_{i-3}. Let the vector function computed by the network be designated as Net3(.,.,.), taking the three input values as arguments and returning the output (predicted) vector. Then:

    forecast (Y_100,Y_101,Y_102) as Net3(Y_99,Y_98,Y_97)
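
    As a rough sketch, the single-step (open-loop) and multi-step (closed-loop) schemes above might be coded as follows, where net() stands in for the trained network Net(.,.,.).

      def open_loop_forecasts(net, y):
          """One-step-ahead forecasts over an observed series y (len(y) > 3)."""
          return [net(y[i - 1], y[i - 2], y[i - 3]) for i in range(3, len(y))]

      def closed_loop_forecasts(net, y, horizon):
          """Forecast `horizon` steps past the end of y, feeding predictions back."""
          history = list(y)
          forecasts = []
          for _ in range(horizon):
              p = net(history[-1], history[-2], history[-3])
              forecasts.append(p)
              history.append(p)
          return forecasts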
     
     
    Which method you choose for computing forecasts will obviously depend in part on the requirements of your application. If you have yearly sales figures through 1999 and you need to forecast sales in 2003, you clearly can't use single-step forecasting. If you need to compute forecasts at a thousand different future times, using direct simultaneous long-term forecasting would require an extremely large network.

    If a time series is a random walk, a well-trained network will predict Y_i by simply outputting Y_{i-1}. If you make a plot showing both the target values and the outputs, the two curves will almost coincide, except for being offset by one time step. People often mistakenly interpret such a plot as indicating good forecasting accuracy, whereas in fact the network is virtually useless. In such situations, it is more enlightening to plot multi-step forecasts or N-step-ahead forecasts.
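
    A small simulated illustration of this trap (the series and the numbers it prints are artificial, not real data):

      import numpy as np

      # For a random walk, predicting Y_i by Y_{i-1} already tracks the
      # series almost perfectly, yet carries no real forecasting skill.

      rng = np.random.default_rng(0)
      y = np.cumsum(rng.normal(size=1000))    # simulated random walk
      naive = y[:-1]                          # "forecast" Y_i by Y_{i-1}
      print(np.corrcoef(naive, y[1:])[0, 1])  # very close to 1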

    References:

    Elman, J.L. (1990), "Finding structure in time," Cognitive Science, 14, 179-211.

    Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley: Redwood City, California.

    Horne, B. G. and Giles, C. L. (1995), "An experimental comparison of recurrent neural networks," In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7, pp. 697-704. The MIT Press.

    Jordan, M. I. (1986b). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual conference of the Cognitive Science Society, pages 531-546. Lawrence Erlbaum.

    Judge, G.G., Griffiths, W.E., Hill, R.C., L\"utkepohl, H., and Lee, T.-C. (1985), The Theory and Practice of Econometrics, NY: John Wiley & Sons.

    Kremer, S.C. (199?), "Spatio-temporal Connectionist Networks: A Taxonomy and Review," http://hebb.cis.uoguelph.ca/~skremer/Teaching/27642/dynamic2/review.html.

    Lang, K. J., Waibel, A. H., and Hinton, G. (1990), "A time-delay neural network architecture for isolated word recognition," Neural Networks, 3, 23-44.

    Masters, T. (1993). Practical Neural Network Recipes in C++, San Diego: Academic Press.

    Moody, J. (1998), "Forecasting the economy with neural nets: A survey of challenges and solutions," in Orr, G.B., and Mueller, K.-R., eds., Neural Networks: Tricks of the Trade, Berlin: Springer.

    Mozer, M.C. (1994), "Neural net architectures for temporal sequence processing," in Weigend, A.S. and Gershenfeld, N.A., eds. (1994) Time Series Prediction: Forecasting the Future and Understanding the Past, Reading, MA: Addison-Wesley, 243-264, http://www.cs.colorado.edu/~mozer/papers/timeseries.html.

    Weigend, A.S. and Gershenfeld, N.A., eds. (1994) Time Series Prediction: Forecasting the Future and Understanding the Past, Reading, MA: Addison-Wesley.
     
     

    ------------------------------------------------------------------------

    Subject: How to learn an inverse of a function?

    Ordinarily, NNs learn a function Y = f(X), where Y is a vector of outputs, X is a vector of inputs, and f() is the function to be learned. Sometimes, however, you may want to learn an inverse of a function f(), that is, given Y, you want to be able to find an X such that Y = f(X). In general, there may be many different Xs that satisfy the equation Y = f(X).

    For example, in robotics (DeMers and Kreutz-Delgado, 1996, 1997), X might describe the positions of the joints in a robot's arm, while Y would describe the location of the robot's hand. There are simple formulas to compute the location of the hand given the positions of the joints, called the "forward kinematics" problem. But there is no simple formula for the "inverse kinematics" problem to compute positions of the joints that yield a given location for the hand. Furthermore, if the arm has several joints, there will usually be many different positions of the joints that yield the same location of the hand, so the forward kinematics function is many-to-one and has no unique inverse. Picking any X such that Y = f(X) is OK if the only aim is to position the hand at Y. However, if the aim is to generate a series of points to move the hand through an arc, this may be insufficient. In this case, the series of Xs needs to be in the same "branch" of the function space. Care must be taken to avoid solutions that yield inefficient or impossible movements of the arm.

    As another example, consider an industrial process in which X represents settings of control variables imposed by an operator, and Y represents measurements of the product of the industrial process. The function Y = f(X) can be learned by a NN using conventional training methods. But the goal of the analysis may be to find control settings X that yield a product with specified measurements Y, in which case an inverse of f(X) is required. In industrial applications, financial considerations are important, so not just any setting X that yields the desired result Y may be acceptable. Perhaps a function can be specified that gives the cost of X resulting from energy consumption, raw materials, etc., in which case you would want to find the X that minimizes the cost function while satisfying the equation Y = f(X).

    The obvious way to try to learn an inverse function is to generate a set of training data from a given forward function, but designate Y as the input and X as the output when training the network. Using a least-squares error function, this approach will fail if f() is many-to-one. The problem is that for an input Y, the net will not learn any single X such that Y = f(X), but will instead learn the arithmetic mean of all the Xs in the training set that satisfy the equation (Bishop, 1995, pp. 207-208). One solution to this difficulty is to construct a network that learns a mixture approximation to the conditional distribution of X given Y (Bishop, 1995, pp. 212-221). However, the mixture method will not work well in general for an X vector that is more than one-dimensional, such as Y = X_1^2 + X_2^2, since the number of mixture components required may increase exponentially with the dimensionality of X. And you are still left with the problem of extracting a single output vector from the mixture distribution, which is nontrivial if the mixture components overlap considerably. Another solution is to use a highly robust error function, such as a redescending M-estimator, that learns a single mode of the conditional distribution instead of learning the mean (Huber, 1981; Rohwer and van der Rest 1996). Additional regularization terms or constraints may be required to persuade the network to choose appropriately among several modes, and there may be severe problems with local optima.
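
    The difficulty is easy to demonstrate numerically. In this hypothetical example with f(x) = x^2, the least-squares-optimal prediction of X from Y is the conditional mean of the matching X values, which is near zero and therefore useless as an inverse:

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(-1.0, 1.0, size=10000)
      y = x ** 2                   # many-to-one: x and -x give the same y

      # Any sufficiently flexible least-squares model approaches E[X | Y=y].
      # Estimate that conditional mean directly for y near 0.25:
      near = np.abs(y - 0.25) < 0.01
      print(x[near].mean())        # close to 0, not to +0.5 or -0.5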

    Another approach is to train a network to learn the forward mapping f() and then numerically invert the function. Finding X such that Y = f(X) is simply a matter of solving a nonlinear system of equations, for which many algorithms can be found in the numerical analysis literature (Dennis and Schnabel 1983). One way to solve nonlinear equations is to turn the problem into an optimization problem by minimizing the squared error ||Y - f(X)||^2 with respect to X. This method fits in nicely with the usual gradient-descent methods for training NNs (Kindermann and Linden 1990). Since the nonlinear equations will generally have multiple solutions, there may be severe problems with local optima, especially if some solutions are considered more desirable than others. You can deal with multiple solutions by inventing some objective function that measures the goodness of different solutions, and optimizing this objective function under the nonlinear constraint Y = f(X) using any of numerous algorithms for nonlinear programming (NLP; see Bertsekas, 1995, and other references under "What are conjugate gradients, Levenberg-Marquardt, etc.?"). The power and flexibility of the nonlinear programming approach are offset by possibly high computational demands.
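
    A minimal sketch of this idea, with a known forward map and its Jacobian standing in for a trained network (for a real network, the Jacobian with respect to the inputs can be obtained by backpropagating to the inputs); the function name and parameter values are illustrative only:

      import numpy as np

      def invert_by_gradient_descent(f, jacobian, y_target, x0,
                                     lr=0.05, n_steps=2000):
          """Minimize ||f(x) - y_target||^2 over the inputs x."""
          x = np.asarray(x0, dtype=float).copy()
          for _ in range(n_steps):
              residual = f(x) - y_target              # shape (n_outputs,)
              grad = 2.0 * jacobian(x).T @ residual   # gradient w.r.t. x
              x -= lr * grad
          return x

      # Example: f(x) = [x1^2 + x2^2]; any point on the unit circle solves
      # f(x) = [1], and the starting point determines which one is found.
      #   f   = lambda x: np.array([x[0]**2 + x[1]**2])
      #   jac = lambda x: np.array([[2*x[0], 2*x[1]]])
      #   invert_by_gradient_descent(f, jac, np.array([1.0]), np.array([0.3, 0.1]))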

    If the forward mapping f() is obtained by training a network, there will generally be some error in the network's outputs. The magnitude of this error can be difficult to estimate. The process of inverting a network can propagate this error, so the results should be checked carefully for validity and numerical stability. Some training methods can produce not just a point output but also a prediction interval (Bishop, 1995; White, 1992). You can take advantage of prediction intervals when inverting a network by using NLP methods. For example, you could try to find an X that minimizes the width of the prediction interval under the constraint that the equation Y = f(X) is satisfied. Or instead of requiring Y = f(X) be satisfied exactly, you could try to find an X such that the prediction interval is contained within some specified interval while minimizing some cost function.

    For more mathematics concerning the inverse-function problem, as well as some interesting methods involving self-organizing maps, see DeMers and Kreutz-Delgado (1996, 1997). For NNs that are relatively easy to invert, see the Adaptive Logic Networks described in the software sections of the FAQ.

    References:

    Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena Scientific.

    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

    DeMers, D., and Kreutz-Delgado, K. (1996), "Canonical Parameterization of Excess motor degrees of freedom with self organizing maps", IEEE Trans Neural Networks, 7, 43-55.

    DeMers, D., and Kreutz-Delgado, K. (1997), "Inverse kinematics of dextrous manipulators," in Omidvar, O., and van der Smagt, P., (eds.) Neural Systems for Robotics, San Diego: Academic Press, pp. 75-116.

    Dennis, J.E. and Schnabel, R.B. (1983) Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall

    Huber, P.J. (1981), Robust Statistics, NY: Wiley.

    Kindermann, J., and Linden, A. (1990), "Inversion of Neural Networks by Gradient Descent," Parallel Computing, 14, 277-286, ftp://icsi.Berkeley.EDU/pub/ai/linden/KindermannLinden.IEEE92.ps.Z

    Rohwer, R., and van der Rest, J.C. (1996), "Minimum description length, regularization, and multimodal data," Neural Computation, 8, 595-609.

    White, H. (1992), "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings of the 23rd Symposium on the Interface: Computing Science and Statistics, Alexandria, VA: American Statistical Association, pp. 190-199.

    ------------------------------------------------------------------------

    Subject: How to get invariant recognition of images under translation, rotation, etc.?

    See:
    Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press, section 8.7.

    Masters, T. (1994), Signal and Image Processing with Neural Networks: A C++ Sourcebook, NY: Wiley.

    Soucek, B., and The IRIS Group (1992), Fast Learning and Invariant Object Recognition, NY: Wiley.

    Squire, D. (1997), Model-Based Neural Networks for Invariant Pattern Recognition, http://cuiwww.unige.ch/~squire/publications.html

    ------------------------------------------------------------------------

    Subject: How to recognize handwritten characters?

    URLs:
     
  • Andras Kornai's homepage at http://www.cs.rice.edu/~andras/
  • Yann LeCun's homepage at http://www.research.att.com/~yann/
    Other references:
    Hastie, T., and Simard, P.Y. (1998), "Metrics and models for handwritten character recognition," Statistical Science, 13, 54-65.

    Jackel, L.D. et al., (1994) "Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition", 1994 International Conference on Pattern Recognition, Jerusalem

    LeCun, Y., Jackel, L.D., Bottou, L., Brunot, A., Cortes, C., Denker, J.S., Drucker, H., Guyon, I., Muller, U.A., Sackinger, E., Simard, P., and Vapnik, V. (1995), "Comparison of learning algorithms for handwritten digit recognition," in F. Fogelman and P. Gallinari, eds., International Conference on Artificial Neural Networks, pages 53-60, Paris.

    Orr, G.B., and Mueller, K.-R., eds. (1998), Neural Networks: Tricks of the Trade, Berlin: Springer, ISBN 3-540-65311-2.

    ------------------------------------------------------------------------

    Subject: What about Genetic Algorithms?

    There are a number of definitions of GA (Genetic Algorithm). A possible one is
     
     
      A GA is an optimization program
      that starts with
      a population of encoded procedures,       (Creation of Life :-> )
      mutates them stochastically,              (Get cancer or so :-> )
      and uses a selection process              (Darwinism)
      to prefer the mutants with high fitness
      and perhaps a recombination process       (Make babies :-> )
      to combine properties of (preferably) the successful mutants.
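
    In code, the same skeleton might look like this hypothetical sketch; the bit-string representation, fitness function, and parameter values are all illustrative.

      import random

      def genetic_algorithm(fitness, n_bits=20, pop_size=30,
                            n_generations=100, p_mutation=0.02):
          # population of encoded procedures (here: bit strings)
          pop = [[random.randint(0, 1) for _ in range(n_bits)]
                 for _ in range(pop_size)]
          for _ in range(n_generations):
              ranked = sorted(pop, key=fitness, reverse=True)
              parents = ranked[: pop_size // 2]           # selection
              children = []
              while len(children) < pop_size:
                  a, b = random.sample(parents, 2)
                  cut = random.randrange(1, n_bits)        # recombination
                  child = a[:cut] + b[cut:]
                  child = [1 - bit if random.random() < p_mutation else bit
                           for bit in child]               # mutation
                  children.append(child)
              pop = children
          return max(pop, key=fitness)

      # Example: maximize the number of 1-bits ("one-max").
      #   best = genetic_algorithm(fitness=sum)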
    Genetic algorithms are just a special case of the more general idea of "evolutionary computation". There is a newsgroup that is dedicated to the field of evolutionary computation called comp.ai.genetic. It has a detailed FAQ posting which, for instance, explains the terms "Genetic Algorithm", "Evolutionary Programming", "Evolution Strategy", "Classifier System", and "Genetic Programming". That FAQ also contains lots of pointers to relevant literature, software, other sources of information, et cetera et cetera. Please see the comp.ai.genetic FAQ for further information.

    There is a web page on "Neural Network Using Genetic Algorithms" by Omri Weisman and Ziv Pollack at http://www.cs.bgu.ac.il/~omri/NNUGA/

    Andrew Gray's Hybrid Systems FAQ at the University of Otago at http://divcom.otago.ac.nz:800/COM/INFOSCI/SMRL/people/andrew/publications/faq/hybrid/hybrid.htm also has links to information on neuro-genetic methods.

    For general information on GAs, try the links at http://www.shef.ac.uk/~gaipp/galinks.html and http://www.cs.unibo.it/~gaioni

    ------------------------------------------------------------------------

    Subject: What about Fuzzy Logic?

    Fuzzy logic is an area of research based on the work of L.A. Zadeh. It is a departure from classical two-valued sets and logic: it uses "soft" linguistic system variables (e.g. large, hot, tall) and a continuous range of truth values in the interval [0,1], rather than strict binary (True or False) decisions and assignments.

     Fuzzy logic is used where a system is difficult to model exactly (but an inexact model is available), is controlled by a human operator or expert, or where ambiguity or vagueness is common. A typical fuzzy system consists of a rule base, membership functions, and an inference procedure.

     Most fuzzy logic discussion takes place in the newsgroup comp.ai.fuzzy (where there is a fuzzy logic FAQ) but there is also some work (and discussion) about combining fuzzy logic with neural network approaches in comp.ai.neural-nets.

     Early work combining neural nets and fuzzy methods used competitive networks to generate rules for fuzzy systems (Kosko 1992). This approach is sort of a crude version of bidirectional counterpropagation (Hecht-Nielsen 1990) and suffers from the same deficiencies. More recent work (Brown and Harris 1994; Kosko 1997) has been based on the realization that a fuzzy system is a nonlinear mapping from an input space to an output space that can be parameterized in various ways and therefore can be adapted to data using the usual neural training methods (see "What is backprop?") or conventional numerical optimization algorithms (see "What are conjugate gradients, Levenberg-Marquardt, etc.?").

    A neural net can incorporate fuzziness in various ways:

  • The inputs can be fuzzy. Any garden-variety backprop net is fuzzy in this sense, and it seems rather silly to call a net "fuzzy" solely on this basis, although Fuzzy ART (Carpenter and Grossberg 1996) has no other fuzzy characteristics.
  • The outputs can be fuzzy. Again, any garden-variety backprop net is fuzzy in this sense. But competitive learning nets ordinarily produce crisp outputs, so for competitive learning methods, having fuzzy output is a meaningful distinction. For example, fuzzy c-means clustering (Bezdek 1981) is meaningfully different from (crisp) k-means; a minimal sketch of the fuzzy c-means updates follows this list. Fuzzy ART does not have fuzzy outputs.
  • The net can be interpretable as an adaptive fuzzy system. For example, Gaussian RBF nets and B-spline regression models (Dierckx 1995, van Rijckevorsal 1988) are fuzzy systems with adaptive weights (Brown and Harris 1994) and can legitimately be called neurofuzzy systems.
  • The net can be a conventional NN architecture that operates on fuzzy numbers instead of real numbers (Lippe, Feuring and Mischke 1995).
  • Fuzzy constraints can provide external knowledge (Lampinen and Selonen 1996).
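
    For instance, the fuzzy c-means updates mentioned above can be sketched as follows; this is a bare-bones illustration of Bezdek's (1981) algorithm with arbitrary parameter choices, not production code.

      import numpy as np

      def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
          """Each point gets a membership in every cluster (rows of U sum to 1)."""
          X = np.asarray(X, dtype=float)
          rng = np.random.default_rng(seed)
          U = rng.dirichlet(np.ones(n_clusters), size=len(X))
          for _ in range(n_iter):
              W = U ** m
              centers = (W.T @ X) / W.sum(axis=0)[:, None]      # weighted centres
              d = np.linalg.norm(X[:, None, :] - centers[None, :, :],
                                 axis=2) + 1e-12                # point-centre distances
              U = 1.0 / d ** (2.0 / (m - 1))
              U /= U.sum(axis=1, keepdims=True)                 # renormalize memberships
          return centers, U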
    More information on neurofuzzy systems is available online:
  • The Fuzzy Logic and Neurofuzzy Resources page of the Image, Speech and Intelligent Systems (ISIS) research group at the University of Southampton, Southampton, Hampshire, UK: http://www-isis.ecs.soton.ac.uk/research/nfinfo/fuzzy.html.
  • The Neuro-Fuzzy Systems Research Group's web page at Tampere University of Technology, Tampere, Finland: http://www.cs.tut.fi/~tpo/group.html and http://dmiwww.cs.tut.fi/nfs/Welcome_uk.html
  • Marcello Chiaberge's Neuro-Fuzzy page at http://polimage.polito.it/~marcello.
  • Jyh-Shing Roger Jang's home page at http://www.cs.nthu.edu.tw/~jang/ with information on ANFIS (Adaptive Neuro-Fuzzy Inference Systems), articles on neuro-fuzzy systems, and more links.
  • Andrew Gray's Hybrid Systems FAQ at the University of Otago at http://divcom.otago.ac.nz:800/COM/INFOSCI/SMRL/people/andrew/publications/faq/hybrid/hybrid.htm
    References:
    Bezdek, J.C. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press.

    Bezdek, J.C. & Pal, S.K., eds. (1992), Fuzzy Models for Pattern Recognition, New York: IEEE Press.

    Brown, M., and Harris, C. (1994), Neurofuzzy Adaptive Modelling and Control, NY: Prentice Hall.

    Carpenter, G.A. and Grossberg, S. (1996), "Learning, Categorization, Rule Formation, and Prediction by Fuzzy Neural Networks," in Chen, C.H. (1996), pp. 1.3-1.45.

    Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY: McGraw-Hill, ISBN 0-07-011189-8.

    Dierckx, P. (1995), Curve and Surface Fitting with Splines, Oxford: Clarendon Press.

    Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley.

    Klir, G.J. and Folger, T.A.(1988), Fuzzy Sets, Uncertainty, and Information, Englewood Cliffs, N.J.: Prentice-Hall.

    Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs, N.J.: Prentice-Hall.

    Kosko, B. (1997), Fuzzy Engineering, NY: Prentice Hall.

    Lampinen, J and Selonen, A. (1996), "Using Background Knowledge for Regularization of Multilayer Perceptron Learning", Submitted to International Conference on Artificial Neural Networks, ICANN'96, Bochum, Germany.

    Lippe, W.-M., Feuring, Th. and Mischke, L. (1995), "Supervised learning in fuzzy neural networks," Institutsbericht Angewandte Mathematik und Informatik, WWU Muenster, I-12, http://wwwmath.uni-muenster.de/~feuring/WWW_literatur/bericht12_95.ps.gz

    van Rijckevorsal, J.L.A. (1988), "Fuzzy coding and B-splines," in van Rijckevorsal, J.L.A., and de Leeuw, J., eds., Component and Correspondence Analysis, Chichester: John Wiley & Sons, pp. 33-54.

    ------------------------------------------------------------------------

    Subject: Unanswered FAQs

  • How many training cases do I need?
  • How should I split the data into training and validation sets?
  • What error functions can be used?
  • What are some good constructive training algorithms?
  • How can I select important input variables?
  • Should NNs be used in safety-critical applications?
  • My net won't learn! What should I do???
  • My net won't generalize! What should I do???
    ------------------------------------------------------------------------

    Subject: Other NN links?

    ------------------------------------------------------------------------
    That's all folks (End of the Neural Network FAQ).
     
     
    Acknowledgements: Thanks to all the people who helped to get the stuff
                      above into the posting. I cannot name them all, because
                      I would make far too many errors then. :->
    
                      No?  Not good?  You want individual credit?
                      OK, OK. I'll try to name them all. But: no guarantee....
    
      THANKS FOR HELP TO:
    (in alphabetical order of email addresses, I hope)
  • Steve Ward <71561.2370@CompuServe.COM>
  • Allen Bonde <ab04@harvey.gte.com>
  • Accel Infotech Spore Pte Ltd <accel@solomon.technet.sg>
  • Ales Krajnc <akrajnc@fagg.uni-lj.si>
  • Alexander Linden <al@jargon.gmd.de>
  • Matthew David Aldous <aldous@mundil.cs.mu.OZ.AU>
  • S.Taimi Ames <ames@reed.edu>
  • Axel Mulder <amulder@move.kines.sfu.ca>
  • anderson@atc.boeing.com
  • Andy Gillanders <andy@grace.demon.co.uk>
  • Davide Anguita <anguita@ICSI.Berkeley.EDU>
  • Avraam Pouliakis <apou@leon.nrcps.ariadne-t.gr>
  • Kim L. Blackwell <avrama@helix.nih.gov>
  • Mohammad Bahrami <bahrami@cse.unsw.edu.au>
  • Paul Bakker <bakker@cs.uq.oz.au>
  • Stefan Bergdoll <bergdoll@zxd.basf-ag.de>
  • Jamshed Bharucha <bharucha@casbs.Stanford.EDU>
  • Carl M. Cook <biocomp@biocomp.seanet.com>
  • Yijun Cai <caiy@mercury.cs.uregina.ca>
  • L. Leon Campbell <campbell@brahms.udel.edu>
  • Cindy Hitchcock <cindyh@vnet.ibm.com>
  • Clare G. Gallagher <clare@mikuni2.mikuni.com>
  • Craig Watson <craig@magi.ncsl.nist.gov>
  • Yaron Danon <danony@goya.its.rpi.edu>
  • David Ewing <dave@ndx.com>
  • David DeMers <demers@cs.ucsd.edu>
  • Denni Rognvaldsson <denni@thep.lu.se>
  • Duane Highley <dhighley@ozarks.sgcl.lib.mo.us>
  • Dick.Keene@Central.Sun.COM
  • DJ Meyer <djm@partek.com>
  • Donald Tveter <drt@christianliving.net>
  • Daniel Tauritz <dtauritz@wi.leidenuniv.nl>
  • Wlodzislaw Duch <duch@phys.uni.torun.pl>
  • E. Robert Tisdale <edwin@flamingo.cs.ucla.edu>
  • Athanasios Episcopos <episcopo@fire.camp.clarkson.edu>
  • Frank Schnorrenberg <fs0997@easttexas.tamu.edu>
  • Gary Lawrence Murphy <garym@maya.isis.org>
  • gaudiano@park.bu.edu
  • Lee Giles <giles@research.nj.nec.com>
  • Glen Clark <opto!glen@gatech.edu>
  • Phil Goodman <goodman@unr.edu>
  • guy@minster.york.ac.uk
  • Horace A. Vallas, Jr. <hav@neosoft.com>
  • Gregory E. Heath <heath@ll.mit.edu>
  • Joerg Heitkoetter <heitkoet@lusty.informatik.uni-dortmund.de>
  • Ralf Hohenstein <hohenst@math.uni-muenster.de>
  • Ian Cresswell <icressw@leopold.win-uk.net>
  • Gamze Erten <ictech@mcimail.com>
  • Ed Rosenfeld <IER@aol.com>
  • Franco Insana <INSANA@asri.edu>
  • Janne Sinkkonen <janne@iki.fi>
  • Javier Blasco-Alberto <jblasco@ideafix.cps.unizar.es>
  • Jean-Denis Muller <jdmuller@vnet.ibm.com>
  • Jeff Harpster <uu0979!jeff@uu9.psi.com>
  • Jonathan Kamens <jik@MIT.Edu>
  • J.J. Merelo <jmerelo@kal-el.ugr.es>
  • Dr. Jacek Zurada <jmzura02@starbase.spd.louisville.edu>
  • Jon Gunnar Solheim <jon@kongle.idt.unit.no>
  • Josef Nelissen <jonas@beor.informatik.rwth-aachen.de>
  • Joey Rogers <jrogers@buster.eng.ua.edu>
  • Subhash Kak <kak@gate.ee.lsu.edu>
  • Ken Karnofsky <karnofsky@mathworks.com>
  • Kjetil.Noervaag@idt.unit.no
  • Luke Koops <koops@gaul.csd.uwo.ca>
  • Kurt Hornik <Kurt.Hornik@tuwien.ac.at>
  • Thomas Lindblad <lindblad@kth.se>
  • Clark Lindsey <lindsey@particle.kth.se>
  • Lloyd Lubet <llubet@rt66.com>
  • William Mackeown <mackeown@compsci.bristol.ac.uk>
  • Maria Dolores Soriano Lopez <maria@vaire.imib.rwth-aachen.de>
  • Mark Plumbley <mark@dcs.kcl.ac.uk>
  • Peter Marvit <marvit@cattell.psych.upenn.edu>
  • masud@worldbank.org
  • Miguel A. Carreira-Perpinan<mcarreir@moises.ls.fi.upm.es>
  • Yoshiro Miyata <miyata@sccs.chukyo-u.ac.jp>
  • Madhav Moganti <mmogati@cs.umr.edu>
  • Jyrki Alakuijala <more@ee.oulu.fi>
  • Jean-Denis Muller <muller@bruyeres.cea.fr>
  • Michael Reiss <m.reiss@kcl.ac.uk>
  • mrs@kithrup.com
  • Maciek Sitnik <msitnik@plearn.edu.pl>
  • R. Steven Rainwater <ncc@ncc.jvnc.net>
  • Nigel Dodd <nd@neural.win-uk.net>
  • Barry Dunmall <neural@nts.sonnet.co.uk>
  • Paolo Ienne <Paolo.Ienne@di.epfl.ch>
  • Paul Keller <pe_keller@ccmail.pnl.gov>
  • Peter Hamer <P.G.Hamer@nortel.co.uk>
  • Pierre v.d. Laar <pierre@mbfys.kun.nl>
  • Michael Plonski <plonski@aero.org>
  • Lutz Prechelt <prechelt@ira.uka.de> [creator of FAQ]
  • Richard Andrew Miles Outerbridge <ramo@uvphys.phys.uvic.ca>
  • Rand Dixon <rdixon@passport.ca>
  • Robin L. Getz <rgetz@esd.nsc.com>
  • Richard Cornelius <richc@rsf.atd.ucar.edu>
  • Rob Cunningham <rkc@xn.ll.mit.edu>
  • Robert.Kocjancic@IJS.si
  • Randall C. O'Reilly <ro2m@crab.psy.cmu.edu>
  • Rutvik Desai <rudesai@cs.indiana.edu>
  • Robert W. Means <rwmeans@hnc.com>
  • Stefan Vogt <s_vogt@cis.umassd.edu>
  • Osamu Saito <saito@nttica.ntt.jp>
  • Scott Fahlman <sef+@cs.cmu.edu>
  • <seibert@ll.mit.edu>
  • Sheryl Cormicle <sherylc@umich.edu>
  • Ted Stockwell <ted@aps1.spa.umn.edu>
  • Stephanie Warrick <S.Warrick@cs.ucl.ac.uk>
  • Serge Waterschoot <swater@minf.vub.ac.be>
  • Thomas G. Dietterich <tgd@research.cs.orst.edu>
  • Thomas.Vogel@cl.cam.ac.uk
  • Ulrich Wendl <uli@unido.informatik.uni-dortmund.de>
  • M. Verleysen <verleysen@dice.ucl.ac.be>
  • VestaServ@aol.com
  • Sherif Hashem <vg197@neutrino.pnl.gov>
  • Matthew P Wiener <weemba@sagi.wistar.upenn.edu>
  • Wesley Elsberry <welsberr@orca.tamu.edu>
  • Dr. Steve G. Romaniuk <ZLXX69A@prodigy.com>
    The FAQ was created in June/July 1991 by Lutz Prechelt, who maintained it until November 1995. Warren Sarle has maintained the FAQ since December 1995.
    Bye
    
      Warren & Lutz
    Previous part is part 6.
    Neural network FAQ / Warren S. Sarle, saswss@unx.sas.com