I recently saw that conda-forge released a Tensorflow-GPU package, see here. Many people, myself included, have been waiting for this since Anaconda changed the license for their conda package repositories, because a Tensorflow-GPU conda package trivializes the installation of GPU-enabled Tensorflow, which is notoriously a bit tricky. This prompted me to step back a bit and look at why efficient implementations of optimization algorithms, like gradient descent, are necessary, and why you should generally not write numerical algorithms in pure Python.

The plan is to write extremely simple implementations of gradient descent in pure Python and pure C++, while providing enough abstraction in both cases that each is a useful, real implementation of the algorithm. To get this useful level of abstraction, we will write the code so that the user only needs to provide a function to be minimized and starting parameters for that function. An analytical gradient function will be optional, meaning we will also include a method for numerically calculating the gradient.

There are many blog posts, websites, etc. that describe gradient descent. For our purposes, all we need to know is we are trying to find the minimum of a function, \(f(\mathbf{x})\). We do this by computing the gradient (analytically or numerically), \(\nabla f(\mathbf{x})\), and iteratively updating the parameters, \(\mathbf{x}_{n+1} = \mathbf{x}_{n} - \eta \nabla f(\mathbf{x}_{n})\), where \(\eta\) is an empirical learning rate that dictates how quickly we descend the gradient. Without going into the nuances, that’s really all there is to it.
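For a concrete one-dimensional example, take \(f(x) = x^{2}\) with \(\eta = 0.1\) and \(x_{0} = 1\): the gradient is \(f'(x_{0}) = 2\), so the first update gives \(x_{1} = 1 - 0.1 \cdot 2 = 0.8\), and repeating the update steadily pulls \(x\) toward the minimum at zero.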

Now, let’s implement a very simple Python class to perform gradient descent:

class gd():
    def __init__(self, fx, dfx, x0, n, lr=1e-4):
        self.fx = fx        # function to minimize
        self.dfx = dfx      # gradient function with signature dfx(x, dydx)
        self.x0 = x0        # starting parameters
        self.x = list(x0)   # current parameters (copied so x0 is left untouched)
        self.lr = lr        # learning rate
        self.n = n          # number of parameters
        self.dydx = [float('nan') for i in range(self.n)]

    def __call__(self, iterations = 1):
        # Run iterations of gradient descent
        for itr in range(iterations):
            self.dfx(self.x, self.dydx)
            for i in range(self.n):
                self.x[i] -= self.lr * self.dydx[i]

We can also provide a simple implementation of a forward differencing algorithm to numerically calculate the gradient of a scalar function:

def fordiff(fx, x, dydx, n, dx=1e-7):
    x_eval = x.copy()
    y = fx(x_eval)
    for i in range(n):
        x_eval[i] += dx
        dxb = x_eval[i] - x[i]   # step size as actually represented in floating point
        y_eval = fx(x_eval)
        dy = y_eval - y
        dydx[i] = dy / dxb
        x_eval[i] = x[i]         # restore the perturbed component

where \(dx\) is the small step taken in the finite difference formula, and \(dxb\) is calculated using the well-known trick to get the exact floating point representation of the step size actually used (since (x + dx) - x does not necessarily equal dx in finite precision arithmetic).
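To see this trick in action, here is a quick, self-contained check (not part of the implementation above) that you can run with standard IEEE-754 double precision floats:

x = 0.1
dx = 1e-7
dxb = (x + dx) - x    # the step that actually gets applied to x
print(dxb == dx)      # False: dxb is close to 1e-07 but not exactly equal to it
print(dxb)            # prints a value near, but not identical to, 1e-07

Dividing by dxb instead of dx in fordiff therefore matches the difference in the numerator to the step that was actually taken.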

Now, let’s say we want to minimize the function \(f(\mathbf{x}) = x_{1}^2+x_{2}^4\), which of course has a global minimum at \(x_{1}=0\) and \(x_{2}=0\). Let’s write some quick code to try to find this minimum via gradient descent.

def fx(x):
    y = x[0]**2 + x[1]**4
    return y

if __name__ == "__main__":
    n = int(2)
    x_init = [1.,1.]
    dfx_fordiff = lambda x, dydx: fordiff(fx, x, dydx, n, 1e-7)
    gd_test = gd(fx, dfx_fordiff, x_init, n)
    gd_test(int(1e7))
    print(gd_test.x)

Note the use of the lambda; gd is implemented to assume that dfx accepts two arguments: x, which contains the inputs actually being evaluated, and dydx, which will be modified to store the gradient. We use a lambda to construct a wrapper around fordiff using only these expected arguments. On my machine, the call to gd_test(int(1e7)) takes about 12 seconds, and yields the solution [-4.9999999992031376e-08, 0.011179587414291947].
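Since gd only assumes that dfx has this two-argument signature, plugging in an analytical gradient works the same way. As a minimal sketch (the name dfx_analytic is mine, for illustration), the gradient of \(f(\mathbf{x}) = x_{1}^2+x_{2}^4\) is \((2x_{1}, 4x_{2}^3)\):

def dfx_analytic(x, dydx):
    # Analytical gradient of fx(x) = x[0]**2 + x[1]**4
    dydx[0] = 2. * x[0]
    dydx[1] = 4. * x[1]**3

gd_analytic = gd(fx, dfx_analytic, [1., 1.], 2)
gd_analytic(1000)

An analytical gradient also avoids the extra function evaluations that fordiff needs, so it is cheaper per iteration.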

Now, let’s look at an implementation in C++.

class gd{
  public:
    int n;
    double *x0, *x, lr, *dydx;
    double (*fx) (double []);
    void (*dfx) (double [], double []);

    gd(double (*fx_) (double []), void (*dfx_) (double [], double []), 
        double x0_[], int n_, double lr_=1e-4) : n(n_), x0(x0_), 
        x(new double[n]), lr(lr_), dydx(new double[n]), fx(fx_), dfx(dfx_){
        for(int i=0; i<n; i++){
            x[i] = x0[i];
        }
    } 

    ~gd(){
        delete[] x;
        delete[] dydx;
    }

    void operator()(int iterations = 1){
        for(int itr=0; itr<iterations; itr++){
            dfx(x, dydx);
            for(int i=0; i<n; i++){
                x[i] = x[i] - lr*dydx[i];
            }
        }
    }
};

Obviously, this is slightly more involved. I’m not using any fancy C++; x, dydx, etc., will be passed in as C-style arrays and are thus just declared as pointer types as opposed to containers like std::vector. fx and dfx are passed in as function pointers. Modern C++ has features to make this a bit nicer, but I wanted to keep everything as simple as possible, similar to the Python implementation, which did not have to include any other libraries (or header files in the C++ case). One thing of interest to note is that, to keep the same level of abstraction as in the Python implementation, we have to dynamically allocate x and dydx on the heap, since the size of a C++ class must be known at compile time.

Now we can implement our simple forward differencing function:

void fordiff(double (*fx) (double []), double x[], double dydx[], int n,
    double dx = 1e-7){
    double dxb, y_eval, dy, y;
    double x_eval[n];  // variable-length array: a compiler extension, not standard C++
    for(int i=0; i<n; i++){
        x_eval[i] = x[i];
    }
    y = fx(x_eval);
    for(int i=0; i<n; i++){
        x_eval[i] = x_eval[i] + dx;
        dxb = x_eval[i] - x[i];
        y_eval = fx(x_eval);
        dy = y_eval - y;
        dydx[i] = dy / dxb;
        x_eval[i] = x[i];
    }
}

Note that, since I’m claiming this is a generalizable, useful gradient descent implementation, for large x you may need to heap allocate x_eval here. Finally, we can implement the function we are minimizing, and a driver main function:

double fx(double x[]){
    double y = x[0]*x[0] + x[1]*x[1]*x[1]*x[1];
    return y;
}

#include <iostream>

int main(){
    const int n = 2; /* n must be const: the lambda below uses it without
    capturing it, and a lambda with captures cannot convert to a plain
    function pointer */
    double x_init[2] = {1., 1.};
    auto dfx_fordiff = [](double x[], double dydx[]){
        return fordiff(fx, x, dydx, n, 1e-7);
    };
    gd gdi(fx, dfx_fordiff, x_init, n);
    gdi(1e7);
    std::cout << gdi.x[0] << ", " << gdi.x[1] << std::endl;
    return 0;
}

On my machine, the call to gdi(1e7) takes about 0.4 seconds when compiling with gcc and no optimization, so approximately a 30x speed increase. Over time this can really add up, especially if you are writing complex numerical algorithms in pure Python (like Monte Carlo simulations, etc.). Later on we could look at how our very simple gradient descent implementations compare to Tensorflow (running on a CPU or GPU) for toy problems and real problems, using numerical and analytical gradients. We can also look at how to call C++ code from Python, and how to package C++ code into a Python package.