The fit
command can fit a user-defined function to a set of data points
(x,y) or (x,y,z), using an implementation of the nonlinear least-squares
(NLLS) Marquardt-Levenberg algorithm. Any user-defined variable occurring in
the function body may serve as a fit parameter, but the return type of the
function must be real.
Syntax:
fit {[xrange] {[yrange]}} <function> '<datafile>' {datafile-modifiers} via '<parameter file>' | <var1>{,<var2>,...}
Ranges may be specified to temporarily limit the data which is to be fitted; any out-of-range data points are ignored. The syntax is
[{dummy_variable=}{<min>}{:<max>}],
analogous to plot
; see plot ranges
.
<function> is any valid gnuplot
expression, although it is usual to use a
previously user-defined function of the form f(x) or f(x,y).
<datafile> is treated as in the plot
command. All the plot datafile
modifiers (using
, every
,...) except smooth
are applicable to fit
.
See plot datafile
.
The default data formats for fitting functions with a single independent
variable, y=f(x), are {x:}y or x:y:s; those formats can be changed with
the datafile using
qualifier. The third item, (a column number or an
expression), if present, is interpreted as the standard deviation of the
corresponding y value and is used to compute a weight for the datum, 1/s**2.
Otherwise, all data points are weighted equally, with a weight of one.
To fit a function with two independent variables, z=f(x,y), the required
format is using
with four items, x:y:z:s. The complete format must be
given--no default columns are assumed for a missing token. Weights for
each data point are evaluated from 's' as above. If error estimates are
not available, a constant value can be specified as a constant expression
(see plot datafile using
), e.g., using 1:2:3:(1)
.
Multiple datasets may be simultaneously fit with functions of one
independent variable by making y a 'pseudo-variable', e.g., the dataline
number, and fitting as two independent variables. See fit multibranch
.
The via
qualifier specifies which parameters are to be adjusted, either
directly, or by referencing a parameter file.
Examples:
f(x) = a*x**2 + b*x + c g(x,y) = a*x**2 + b*y**2 + c*x*y FIT_LIMIT = 1e-6 fit f(x) 'measured.dat' via 'start.par' fit f(x) 'measured.dat' using 3:($7-5) via 'start.par' fit f(x) './data/trash.dat' using 1:2:3 via a, b, c fit g(x,y) 'surface.dat' using 1:2:3:(1) via a, b, c
After each iteration step, detailed information about the current state of the fit is written to the display. The same information about the initial and final states is written to a log file, "fit.log". This file is always appended to, so as to not lose any previous fit history; it should be deleted or renamed as desired.
The fit may be interrupted by pressing Ctrl-C (any key but Ctrl-C under
MSDOS and Atari Multitasking Systems). After the current iteration
completes, you have the option to (1) stop the fit and accept the current
parameter values, (2) continue the fit, (3) execute a gnuplot
command
as specified by the environment variable FIT_SCRIPT. The default for
FIT_SCRIPT is replot
, so if you had previously plotted both the data
and the fitting function in one graph, you can display the current state
of the fit.
Once fit
has finished, the update
command may be used to store final
values in a file for subsequent use as a parameter file. See update
for details.
There are two ways that via
can specify the parameters to be adjusted,
either directly on the command line or indirectly, by referencing a
parameter file. The two use different means to set initial values.
Adjustable parameters can be specified by a comma-separated list of variable
names after the via
keyword. Any variable that is not already defined is
is created with an initial value of 1.0. However, the fit is more likely
to converge rapidly if the variables have been previously declared with more
appropriate starting values.
In a parameter file, each parameter to be varied and a corresponding initial value are specified, one per line, in the form
varname = value
Comments, marked by '#', and blank lines are permissible. The special form
varname = value # FIXED
means that the variable is treated as a 'fixed parameter', initialized by the
parameter file, but not adjusted by fit
. For clarity, it may be useful to
designate variables as fixed parameters so that their values are reported by
fit
. The keyword # FIXED
has to appear in exactly this form.
fit
is used to find a set of parameters that 'best' fits your data to your
user-defined function. The fit is judged on the basis of the the sum of the
squared differences or 'residuals' (SSR) between the input data points and
the function values, evaluated at the same places. This quantity is often
called 'chisquare' (i.e., the Greek letter chi, to the power of 2). The
algorithm attempts to minimize SSR, or more precisely, WSSR, as the residuals
are 'weighted' by the input data errors (or 1.0) before being squared; see
fit error_estimates
for details.
That's why it is called 'least-squares fitting'. Let's look at an example
to see what is meant by 'non-linear', but first we had better go over some
terms. Here it is convenient to use z as the dependent variable for
user-defined functions of either one independent variable, z=f(x), or two
independent variables, z=f(x,y). A parameter is a user-defined variable
that fit
will adjust, i.e., an unknown quantity in the function
declaration. Linearity/non-linearity refers to the relationship of the
dependent variable, z, to the parameters which fit
is adjusting, not of
z to the independent variables, x and/or y. (To be technical, the
second {and higher} derivatives of the fitting function with respect to
the parameters are zero for a linear least-squares problem).
For linear least-squares (LLS), the user-defined function will be a sum of simple functions, not involving any parameters, each multiplied by one parameter. NLLS handles more complicated functions in which parameters can be used in a large number of ways. An example that illustrates the difference between linear and nonlinear least-squares is the Fourier series. One member may be written as
z=a*sin(c*x) + b*cos(c*x).
If a and b are the unknown parameters and c is constant, then estimating values of the parameters is a linear least-squares problem. However, if c is an unknown parameter, the problem is nonlinear.
In the linear case, parameter values can be determined by comparatively
simple linear algebra, in one direct step. However LLS is a special case
which is also solved along with more general NLLS problems by the iterative
procedure that gnuplot
uses. fit
attempts to find the minimum by doing
a search. Each step (iteration) calculates WSSR with a new set of parameter
values. The Marquardt-Levenberg algorithm selects the parameter values for
the next iteration. The process continues until a preset criterium is met,
either (1) the fit has "converged" (the relative change in WSSR is less than
FIT_LIMIT), or (2) it reaches a preset iteration count limit, FIT_MAXITER
(see fit control variables
). The fit may also be interrupted
and subsequently halted from the keyboard (see fit
).
Often the function to be fitted will be based on a model (or theory) that
attempts to describe or predict the behaviour of the data. Then fit
can
be used to find values for the free parameters of the model, to determine
how well the data fits the model, and to estimate an error range for each
parameter. See fit error_estimates
.
Alternatively, in curve-fitting, functions are selected independent of
a model (on the basis of experience as to which are likely to describe
the trend of the data with the desired resolution and a minimum number
of parameters*functions.) The fit
solution then provides an analytic
representation of the curve.
However, if all you really want is a smooth curve through your data points,
the smooth
option to plot
may be what you've been looking for rather
than fit
.
In fit
, the term "error" is used in two different contexts, data error
estimates and parameter error estimates.
Data error estimates are used to calculate the relative weight of each data
point when determining the weighted sum of squared residuals, WSSR or
chisquare. They can affect the parameter estimates, since they determine
how much influence the deviation of each data point from the fitted function
has on the final values. Some of the fit
output information, including
the parameter error estimates, is more meaningful if accurate data error
estimates have been provided.
The 'statistical overview' describes some of the fit
output and gives some
background for the 'practical guidelines'.
The theory of non-linear least-squares (NLLS) is generally described in terms of a normal distribution of errors, that is, the input data is assumed to be a sample from a population having a given mean and a Gaussian (normal) distribution about the mean with a given standard deviation. For a sample of sufficiently large size, and knowing the population standard deviation, one can use the statistics of the chisquare distribution to describe a "goodness of fit" by looking at the variable often called "chisquare". Here, it is sufficient to say that a reduced chisquare (chisquare/degrees of freedom, where degrees of freedom is the number of datapoints less the number of parameters being fitted) of 1.0 is an indication that the weighted sum of squared deviations between the fitted function and the data points is the same as that expected for a random sample from a population characterized by the function with the current value of the parameters and the given standard deviations.
If the standard deviation for the population is not constant, as in counting statistics where variance = counts, then each point should be individually weighted when comparing the observed sum of deviations and the expected sum of deviations.
At the conclusion fit
reports 'stdfit', the standard deviation of the fit,
which is the rms of the residuals, and the variance of the residuals, also
called 'reduced chisquare' when the data points are weighted. The number of
degrees of freedom (the number of data points minus the number of fitted
parameters) is used in these estimates because the parameters used in
calculating the residuals of the datapoints were obtained from the same data.
To estimate confidence levels for the parameters, one can use the minimum chisquare obtained from the fit and chisquare statistics to determine the value of chisquare corresponding to the desired confidence level, but considerably more calculation is required to determine the combinations of parameters which produce such values.
Rather than determine confidence intervals, fit
reports parameter error
estimates which are readily obtained from the variance-covariance matrix
after the final iteration. By convention, these estimates are called
"standard errors" or "asymptotic standard errors", since they are calculated
in the same way as the standard errors (standard deviation of each parameter)
of a linear least-squares problem, even though the statistical conditions for
designating the quantity calculated to be a standard deviation are not
generally valid for the NLLS problem. The asymptotic standard errors are
generally over-optimistic and should not be used for determining confidence
levels, but are useful for qualitative purposes.
The final solution also produces a correlation matrix, which gives an indication of the correlation of parameters in the region of the solution; if one parameter is changed, increasing chisquare, does changing another compensate? The main diagonal elements, autocorrelation, are all 1; if all parameters were independent, all other elements would be nearly 0. Two variables which completely compensate each other would have an off-diagonal element of unit magnitude, with a sign depending on whether the relation is proportional or inversely proportional. The smaller the magnitudes of the off-diagonal elements, the closer the estimates of the standard deviation of each parameter would be to the asymptotic standard error.
If you have a basis for assigning weights to each data point, doing so lets you make use of additional knowledge about your measurements, e.g., take into account that some points may be more reliable than others. That may affect the final values of the parameters.
Weighting the data provides a basis for interpreting the additional fit
output after the last iteration. Even if you weight each point equally,
estimating an average standard deviation rather than using a weight of 1
makes WSSR a dimensionless variable, as chisquare is by definition.
Each fit iteration will display information which can be used to evaluate
the progress of the fit. (An '*' indicates that it did not find a smaller
WSSR and is trying again.) The 'sum of squares of residuals', also called
'chisquare', is the WSSR between the data and your fitted function; fit
has minimized that. At this stage, with weighted data, chisquare is expected
to approach the number of degrees of freedom (data points minus parameters).
The WSSR can be used to calculate the reduced chisquare (WSSR/ndf) or stdfit,
the standard deviation of the fit, sqrt(WSSR/ndf). Both of these are
reported for the final WSSR.
If the data are unweighted, stdfit is the rms value of the deviation of the data from the fitted function, in user units.
If you supplied valid data errors, the number of data points is large enough, and the model is correct, the reduced chisquare should be about unity. (For details, look up the 'chi-squared distribution' in your favourite statistics reference.) If so, there are additional tests, beyond the scope of this overview, for determining how well the model fits the data.
A reduced chisquare much larger than 1.0 may be due to incorrect data error
estimates, data errors not normally distributed, systematic measurement
errors, 'outliers', or an incorrect model function. A plot of the residuals,
e.g., plot 'datafile' using 1:($2-f($1))
, may help to show any systematic
trends. Plotting both the data points and the function may help to suggest
another model.
Similarly, a reduced chisquare less than 1.0 indicates WSSR is less than that expected for a random sample from the function with normally distributed errors. The data error estimates may be too large, the statistical assumptions may not be justified, or the model function may be too general, fitting fluctuations in a particular sample in addition to the underlying trends. In the latter case, a simpler function may be more appropriate.
You'll have to get used to both fit
and the kind of problems you apply it
to before you can relate the standard errors to some more practical estimates
of parameter uncertainties or evaluate the significance of the correlation
matrix.
Note that fit
, in common with most NLLS implementations, minimizes the
weighted sum of squared distances (y-f(x))**2. It does not provide any means
to account for "errors" in the values of x, only in y. Also, any "outliers"
(data points outside the normal distribution of the model) will have an
exaggerated effect on the solution.
There are a number of gnuplot
variables that can be defined to affect
fit
. Those which can be defined once gnuplot
is running are listed
under 'control_variables' while those defined before starting gnuplot
are listed under 'environment_variables'.
The default epsilon limit (1e-5) may be changed by declaring a value for
FIT_LIMIT
When the sum of squared residuals changes between two iteration steps by a factor less than this number (epsilon), the fit is considered to have 'converged'.
The maximum number of iterations may be limited by declaring a value for
FIT_MAXITER
A value of 0 (or not defining it at all) means that there is no limit.
If you need even more control about the algorithm, and know the
Marquardt-Levenberg algorithm well, there are some more variables to
influence it. The startup value of lambda
is normally calculated
automatically from the ML-matrix, but if you want to, you may provide
your own one with
FIT_START_LAMBDA
Specifying FIT_START_LAMBDA as zero or less will re-enable the automatic selection. The variable
FIT_LAMBDA_FACTOR
gives the factor by which lambda
is increased or decreased whenever
the chi-squared target function increased or decreased significantly.
Setting FIT_LAMBDA_FACTOR to zero re-enables the default factor of
10.0.
Oher variables with the FIT_ prefix may be added to fit
, so it is safer
not to use that prefix for user-defined variables.
The variables FIT_SKIP and FIT_INDEX were used by earlier releases of
gnuplot
with a 'fit' patch called gnufit
and are no longer available.
The datafile every
modifier provides the functionality of FIT_SKIP.
FIT_INDEX was used for multi-branch fitting, but multi-branch fitting of
one independent variable is now done as a pseudo-3D fit in which the
second independent variable and using
are used to specify the branch.
See fit multi-branch
.
The environment variables must be defined before gnuplot
is executed; how
to do so depends on your operating system.
FIT_LOG
changes the name (and/or path) of the file to which the fit log will be written from the default of "fit.log" in the working directory.
FIT_SCRIPT
specifies a command that may be executed after an user interrupt. The default
is replot
, but a plot
or load
command may be useful to display a plot
customized to highlight the progress of the fit.
In multi-branch fitting, multiple data sets can be simultaneously fit with functions of one independent variable having common parameters by minimizing the total WSSR. The function and parameters (branch) for each data set are selected by using a 'pseudo-variable', e.g., either the dataline number (a 'column' index of -1) or the datafile index (-2), as the second independent variable.
Example: Given two exponential decays of the form, z=f(x), each describing a different data set but having a common decay time, estimate the values of the parameters. If the datafile has the format x:z:s, then
f(x,y) = (y==0) ? a*exp(-x/tau) : b*exp(-x/tau) fit f(x,y) 'datafile' using 1:-1:2:3 via a, b, tau
For a more complicated example, see the file "hexa.fnc" used by the "fit.dem" demo.
Appropriate weighting may be required since unit weights may cause one branch to predominate if there is a difference in the scale of the dependent variable. Fitting each branch separately, using the multi-branch solution as initial values, may give an indication as to the relative effect of each branch on the joint solution.
Nonlinear fitting is not guaranteed to converge to the global optimum (the solution with the smallest sum of squared residuals, SSR), and can get stuck at a local minimum. The routine has no way to determine that; it is up to you to judge whether this has happened.
fit
may, and often will get "lost" if started far from a solution, where
SSR is large and changing slowly as the parameters are varied, or it may
reach a numerically unstable region (e.g., too large a number causing a
floating point overflow) which results in an "undefined value" message
or gnuplot
halting.
To improve the chances of finding the global optimum, you should set the
starting values at least roughly in the vicinity of the solution, e.g.,
within an order of magnitude, if possible. The closer your starting values
are to the solution, the less chance of stopping at another minimum. One way
to find starting values is to plot data and the fitting function on the same
graph and change parameter values and replot
until reasonable similarity
is reached. The same plot is also useful to check whether the fit stopped at
a minimum with a poor fit.
Of course, a reasonably good fit is not proof there is not a "better" fit (in
either a statistical sense, characterized by an improved goodness-of-fit
criterion, or a physical sense, with a solution more consistent with the
model.) Depending on the problem, it may be desirable to fit
with various
sets of starting values, covering a reasonable range for each parameter.
Here are some tips to keep in mind to get the most out of fit
. They're not
very organized, so you'll have to read them several times until their essence
has sunk in.
The two forms of the via
argument to fit
serve two largely distinct
purposes. The via "file"
form is best used for (possibly unattended) batch
operation, where you just supply the startup values in a file and can later
use update
to copy the results back into another (or the same) parameter
file.
The via var1, var2, ...
form is best used interactively, where the command
history mechanism may be used to edit the list of parameters to be fitted or
to supply new startup values for the next try. This is particularly useful
for hard problems, where a direct fit to all parameters at once won't work
without good starting values. To find such, you can iterate several times,
fitting only some of the parameters, until the values are close enough to the
goal that the final fit to all parameters at once will work.
Make sure that there is no mutual dependency among parameters of the function you are fitting. For example, don't try to fit a*exp(x+b), because a*exp(x+b)=a*exp(b)*exp(x). Instead, fit either a*exp(x) or exp(x+b).
A technical issue: the parameters must not be too different in magnitude. The larger the ratio of the largest and the smallest absolute parameter values, the slower the fit will converge. If the ratio is close to or above the inverse of the machine floating point precision, it may take next to forever to converge, or refuse to converge at all. You will have to adapt your function to avoid this, e.g., replace 'parameter' by '1e9*parameter' in the function definition, and divide the starting value by 1e9.
If you can write your function as a linear combination of simple functions weighted by the parameters to be fitted, by all means do so. That helps a lot, because the problem is no longer nonlinear and should converge with only a small number of iterations, perhaps just one.
Some prescriptions for analysing data, given in practical experimentation
courses, may have you first fit some functions to your data, perhaps in a
multi-step process of accounting for several aspects of the underlying
theory one by one, and then extract the information you really wanted from
the fitting parameters of those functions. With fit
, this may often be
done in one step by writing the model function directly in terms of the
desired parameters. Transforming data can also quite often be avoided,
though sometimes at the cost of a more difficult fit problem. If you think
this contradicts the previous paragraph about simplifying the fit function,
you are correct.
A "singular matrix" message indicates that this implementation of the Marquardt-Levenberg algorithm can't calculate parameter values for the next iteration. Try different starting values, writing the function in another form, or a simpler function.
Finally, a nice quote from the manual of another fitting package (fudgit), that kind of summarizes all these issues: "Nonlinear fitting is an art!"