On Knowing What You Do Not | My Machine Learning Journey

Recently, I have been reflecting on my journey learning the fundamentals of how machine learning works by building my own programs from the ground up, over and over. I would like both to share a bit of that journey over the years, in hopes that it points someone else in the right direction, and to comment on learning (sans machine) in programming as a whole. Machine learning has taught me that you do not know what you do not know, and it has taught me how to figure out what you do not know.

Four or five years ago at this point, I first found 3blue1brown’s excellent series on neural networks. Shortly after, I found the book that 3blue1brown based those videos on. I do not know how much exposure I had to any sort of machine learning before this, but these two resources made it seem surprisingly approachable. I did not yet fully understand the calculus underlying backpropagation, which caused a lot of difficulty, but I was able to program a simple multilayer perceptron by following the Python examples in the book. For the first time, I made a neural network that could recognize handwritten digits from the MNIST dataset.

To get to my point in a minute here, I will need to delve into a few of the details of how my machine learning programs worked. This first program only supported a sequence of feedforward layers (equivalent to alternating dense/linear layers and activation layers). The only initialization parameters of the network were the sizes of the layers. While these layer types are the bedrock of most neural networks, they cannot do all that much in isolation. More complicated tasks like image recognition, natural language processing, or larger classification problems were simply not feasible with this program. In other words, the design of my program was too narrow; upon learning about more advanced types of neural networks, I realized that this model just would not cut it.
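To make that concrete, here is a minimal sketch of roughly what that interface looked like (the names and details are illustrative, not my actual code): the entire network is described by a single list of layer sizes, and every layer is dense-plus-sigmoid, so nothing else can be expressed.

```python
import numpy as np

# Roughly the shape of that first program (illustrative, not my actual code):
# the whole architecture is one list of layer sizes, so anything that is not a
# plain stack of dense + sigmoid layers simply cannot be expressed.
class Network:
    def __init__(self, sizes):
        # e.g. sizes = [784, 30, 10] for an MNIST-style classifier
        self.weights = [np.random.randn(n, m) * 0.1
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros((n, 1)) for n in sizes[1:]]

    def feedforward(self, a):
        # every layer is dense followed by a sigmoid; there is no way to ask for
        # convolutions, skip connections, or anything else
        for w, b in zip(self.weights, self.biases):
            a = 1.0 / (1.0 + np.exp(-(w @ a + b)))
        return a
```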

Over the years, I have returned to neural networks a few different times, gaining experience. I got a lot better at programming, and I finally took a real calculus class and actually understood how gradient descent works. With this experience, I wanted to make an actual machine learning library that could do basic image recognition. I had seen a little more about how systems like Keras worked, and I knew that I would need convolutional layers for the image recognition stuff, so I designed this library around a fixed array of layers, where each layer would implement its own forward and backward pass, allowing for backpropagation. Honestly, I am still very proud of what I was able to do with this library. I built models that got respectable scores on MNIST, EMNIST, Fashion-MNIST, and CIFAR-10, and I squeezed a good amount of performance out of just the CPU. You can find the code here.

Despite this success, the design of the library was still too narrow; I still did not understand the full scope of neural networks, and the library was not going to work for more advanced models. There were two main false assumptions that I based the library on: 1) neural networks are always sequential, and 2) the gradient needs to be manually computed for each layer type. As it turns out, even relatively basic image recognition networks nowadays are non-sequential. The “Res” in ResNet refers to residual connections, which add the output of an earlier layer to a later one, skipping over the layers in between. These models were not possible with this library. Additionally, to my knowledge, larger machine learning libraries like PyTorch are built on top of automatic gradient solvers, which completely obviate the need to write the backward pass of each layer by hand. Also, as I alluded to before, the library only uses the CPU, severely limiting the speed at which I could build and run networks.
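For what it is worth, the layer contract looked roughly like the sketch below (illustrative, not the real code). Both false assumptions are visible in it: every layer hand-derives its own gradients, and the model is nothing more than a flat list of layers walked forwards and then backwards.

```python
import numpy as np

class Dense:
    """One hand-written layer: it owns its forward pass and its derived gradients."""
    def __init__(self, n_in, n_out):
        self.w = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return x @ self.w.T + self.b

    def backward(self, grad_out):
        # assumption 2: the gradient of every layer type is derived by hand
        self.grad_w = grad_out.T @ self.x
        self.grad_b = grad_out.sum(axis=0)
        return grad_out @ self.w        # gradient with respect to the input

class Sequential:
    """Assumption 1: a model is a flat list of layers, so no residual connections."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad
```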

With all of this knowledge, I have a pretty good idea what my next machine learning library is going to look like (whenever I finally get around to programming it). I am going to abstract the backend of the matrix math, allowing for computation on the CPU and/or the GPU. I will build the whole thing on top of a reverse-mode (backwards) automatic gradient solver, which will allow for much more complicated layers without my having to compute their gradients by hand. At the same time, there are some areas where I do not know the design. How do you best split work up amongst multiple GPUs? How do you deal with the limits of VRAM? Might I need to split a matrix multiplication across multiple GPUs for larger neural networks? I do not know the answers to these questions, but I do know how I can find out.
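To illustrate what I mean by a reverse-mode gradient solver, here is a toy scalar version of the idea (the real library would trace matrix operations on CPU or GPU buffers instead): each operation records how to push gradients back to its inputs, and calling backward() walks that recorded graph in reverse, so no layer ever needs a hand-written backward pass.

```python
class Value:
    """A toy reverse-mode autodiff node; illustrative, not the planned library."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # how to push gradients back to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def push():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = push
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def push():
            # chain rule for multiplication
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = push
        return out

    def backward(self):
        # topologically order the recorded graph, then apply each node's rule in reverse
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# The gradients of y = w * x + b fall out without any hand-written backward pass.
w, x, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b
y.backward()
print(w.grad, x.grad, b.grad)   # 3.0 2.0 1.0
```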

All of that finally brings me to my point: one of the hardest aspects of learning something in programming is knowing what you need to learn. With all of the machine learning stuff, there are abundant resources online, but it can be difficult to find them. It always frustrates me when I know that the information is out there but I do not know the specific series of keywords that will summon it in Google. I think that there are about two and a half ways to deal with this.

The half is an LLM. I am sure that they could help you find information in an area where you are a novice, but, in my experience, LLMs have a tendency to echo what you are already saying. In other words, they are not as good at expanding your horizons, which is exactly what you need for a new topic. If you have had better luck in this area, let me know.

As for the methods I would actually recommend, the first is some sort of formal education, or at least some sort of guided course. One of the problems I had in my machine learning journey was building a program on a foundation that did not fit the problem; the design was too narrow. Because I did not know the full scope of the problem, I did not build a foundation that could support it. If you are being guided by someone who does know the full scope, you can learn the basics while building a foundation that will scale. Alternatively, they can tell you where your current foundation falls short and what you would need to change for the full problem.

The final method, which I gave a lengthy example of above, is to just start programming. Use the information you already have and just start. Oftentimes, you only learn what is missing from your program once you build it. For example, I built my last machine learning library to work exclusively with sequential networks, and I only realized that the design did not work when I wanted to make more advanced neural networks. The experience of building the program helps you know, and really feel, what your program needs. It provides motivation for the design of a system.

Getting to a broader point, I feel that the most important part of the design process in programming is actually writing the program. It is better to get experience with something than to think about what you might need in the future. Perhaps that is a bit too big a takeaway from this example, but I really believe it after all of my years of programming. That is all I have for now. Have a good day.