More accurate numerical gradients, by remembering where you came from.
Many optimization methods, including Newton's method, quasi-Newton (BFGS, L-BFGS), and trust-region, assume you can compute the gradient ∇f(x). When no analytical gradient is available, we approximate it with finite differences:
This works in theory, but in floating-point arithmetic it's a tightrope walk:
And it gets worse inside an iterative optimizer: every step uses a fresh, noisy gradient. Errors compound. The optimizer can stall, oscillate, or report convergence at a point that isn't quite a minimum.
The standard finite difference probes along the canonical axes e1, e2, …, en. But during optimization, the function isn't equally sensitive in every direction. It changes a lot along the directions you've been descending in, and much less along the rest.
Smart Gradient exploits this. At iteration k it builds an orthonormal basis from the recent descent directions dk−1, dk−2, …, dk−m (limited-memory: keep only the last m), completes it with axis directions, and finite-differences in that basis. Then it rotates the resulting gradient back to the original coordinates.
Finite-difference error has two parts: truncation (proportional to h) and round-off (proportional to εmach/h, where εmach is machine precision). What dominates the relative error is how much the function actually changes over the probe:
Along a recent descent direction, |Δf| is large by construction: the optimizer moved in that direction because the function was decreasing fast there. Along a generic axis, |Δf| can be tiny, so the round-off term blows up. Smart Gradient buys you bigger |Δf| for free: same h, same number of function evaluations, but the signal-to-noise ratio improves.
At each iteration k:
It is most useful when the analytical gradient is unavailable or hard to maintain (legacy code, complex simulators) but optimization is still being done seriously.
library("devtools")
install_github("esmail-abdulfattah/Smart-Gradient", subdir = "smartGrad")
Here we use the Extended Rosenbrock function, a classic ill-conditioned test case where the minimum sits at the bottom of a curved, narrow valley:
myfun <- function(x) {
res <- 0.0
for (i in 1:(length(x) - 1))
res <- res + 100 * (x[i+1] - x[i]^2)^2 + (1 - x[i])^2
return(res)
}
mygrad <- function(fun, x) {
h <- 1e-3
grad <- numeric(length(x))
for (i in 1:length(x)) {
e <- numeric(length(x)); e[i] <- 1
grad[i] <- (fun(x + h*e) - fun(x - h*e)) / (2*h)
}
return(grad)
}
makeSmart() and optimizelibrary("stats")
library("smartGrad")
x_initial <- rnorm(5)
result <- optim(
par = x_initial,
fn = myfun,
gr = makeSmart(fn = myfun, gr = mygrad),
method = "BFGS"
)
That's it. makeSmart() takes your existing numerical gradient and returns a drop-in replacement that maintains the rotated basis internally. The optimizer doesn't need to know anything has changed.