Smart Gradient logo

Smart Gradient

More accurate numerical gradients, by remembering where you came from.

TL;DR. When an optimizer estimates the gradient by finite differences, it usually probes along the coordinate axes. Smart Gradient instead probes along the directions the optimizer has actually been moving in. Those are the directions the function changes fastest along, so the numerical noise shrinks, and the optimizer converges with fewer, more reliable steps.

The problem: numerical gradients are noisy

Many optimization methods, including Newton's method, quasi-Newton (BFGS, L-BFGS), and trust-region, assume you can compute the gradient ∇f(x). When no analytical gradient is available, we approximate it with finite differences:

∂f / ∂xi ≈ ( f(x + h·ei) − f(x − h·ei) ) / (2h)

This works in theory, but in floating-point arithmetic it's a tightrope walk:

h too small

Catastrophic cancellation: the numerator subtracts two nearly-equal numbers and loses most of its significant digits.

h too large

Truncation error dominates: the finite difference no longer approximates the true derivative well.

And it gets worse inside an iterative optimizer: every step uses a fresh, noisy gradient. Errors compound. The optimizer can stall, oscillate, or report convergence at a point that isn't quite a minimum.

The idea: rotate your basis toward where the action is

The standard finite difference probes along the canonical axes e1, e2, …, en. But during optimization, the function isn't equally sensitive in every direction. It changes a lot along the directions you've been descending in, and much less along the rest.

Smart Gradient exploits this. At iteration k it builds an orthonormal basis from the recent descent directions dk−1, dk−2, …, dk−m (limited-memory: keep only the last m), completes it with axis directions, and finite-differences in that basis. Then it rotates the resulting gradient back to the original coordinates.

Standard finite differences e₁ e₂ Probes cross level sets at a steep angle.Small Δf, big relative noise. Smart Gradient xk−2 xk−1 dk−1 d Probes follow the valley.Larger Δf, much smaller relative noise.
The same finite-difference step size h gives a far cleaner signal when probed along directions the optimizer has already explored.

Why this reduces noise

Finite-difference error has two parts: truncation (proportional to h) and round-off (proportional to εmach/h, where εmach is machine precision). What dominates the relative error is how much the function actually changes over the probe:

relative error ≈ εmach · |f(x)| / |Δf|

Along a recent descent direction, |Δf| is large by construction: the optimizer moved in that direction because the function was decreasing fast there. Along a generic axis, |Δf| can be tiny, so the round-off term blows up. Smart Gradient buys you bigger |Δf| for free: same h, same number of function evaluations, but the signal-to-noise ratio improves.

Algorithm sketch

At each iteration k:

  1. 1Take the last m descent directions dk−1, …, dk−m; orthonormalize them (e.g., Gram–Schmidt) to get q1, …, qm.
  2. 2Complete to an orthonormal basis Q = [q1, …, qn] by adding axis directions.
  3. 3Compute finite differences of f along each column of Q; this gives the gradient in the rotated basis, ∇̃f.
  4. 4Rotate back: ∇f ≈ Q ∇̃f. Pass to the optimizer as usual.
Limited-memory. Only the most recent m directions are kept (typically m = 3–10). This keeps the cost the same as a standard finite-difference gradient: no extra function evaluations, no Hessian storage.

When does Smart Gradient help most?

It is most useful when the analytical gradient is unavailable or hard to maintain (legacy code, complex simulators) but optimization is still being done seriously.

Try it in R

Installation

library("devtools")
install_github("esmail-abdulfattah/Smart-Gradient", subdir = "smartGrad")

1. Define your objective

Here we use the Extended Rosenbrock function, a classic ill-conditioned test case where the minimum sits at the bottom of a curved, narrow valley:

myfun <- function(x) {
  res <- 0.0
  for (i in 1:(length(x) - 1))
    res <- res + 100 * (x[i+1] - x[i]^2)^2 + (1 - x[i])^2
  return(res)
}

2. Provide a standard finite-difference gradient

mygrad <- function(fun, x) {
  h <- 1e-3
  grad <- numeric(length(x))
  for (i in 1:length(x)) {
    e <- numeric(length(x)); e[i] <- 1
    grad[i] <- (fun(x + h*e) - fun(x - h*e)) / (2*h)
  }
  return(grad)
}

3. Wrap it with makeSmart() and optimize

library("stats")
library("smartGrad")

x_initial <- rnorm(5)
result <- optim(
  par    = x_initial,
  fn     = myfun,
  gr     = makeSmart(fn = myfun, gr = mygrad),
  method = "BFGS"
)

That's it. makeSmart() takes your existing numerical gradient and returns a drop-in replacement that maintains the rotated basis internally. The optimizer doesn't need to know anything has changed.

Links & references

← Back to main site