
DATA, DATA, DATA.
Machine Learning is NOT magic.
During the early stages of my experience with Machine Learning, my problem-solving philosophy was to focus on modelling. Since my first entry into Machine Learning was through deep learning, applying complex architectures from state-of-the-art research papers showed exceptional results. Over time, however, I've come to shift my attention to a different component of the Machine Learning pipeline: data.
What humans "learn" comes from their environment. The situations they were exposed to, the things they saw, and the people they met shape them. The same applies to a machine learning model. The type and quality of input features and labels determine a model's performance. How is the data distributed? How can I augment the data so that the model becomes more robust? Are some features uninformative, contributing only noise? How can I extract the best features? Which subset of features is most informative together? How can I formulate the labels so that the model learns most efficiently? What type of loss should I use for optimization with the given features and labels? These are all important questions I ask myself while optimizing Machine Learning performance.
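As a concrete instance of the "are some features just noise?" question, here is a minimal sketch using synthetic data (the feature counts and data-generating process are invented for illustration) that scores each feature's mutual information with the labels via scikit-learn:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical setup: 3 informative features plus 2 pure-noise features
rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=(n, 3))
labels = (informative.sum(axis=1) > 0).astype(int)  # label depends only on the first 3
noise = rng.normal(size=(n, 2))
X = np.hstack([informative, noise])

# Estimate how much each feature tells us about the label
scores = mutual_info_classif(X, labels, random_state=0)
for i, s in enumerate(scores):
    print(f"feature {i}: MI = {s:.3f}")
```

Features whose scores hover near zero are candidates for removal before the next training iteration.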
It is also important to ask such questions iteratively. Formulate features and labels, put them through a model, optimize the weights, and observe the results. Which aspects is the model most confused about? Would fewer features reduce noise? During my image domain adaptation project using CNNs, I saw poor classification accuracy for one particular class. Although my first instinct was that the model architecture was not complex enough, I checked the distribution of the input images using t-SNE. It turned out that the class with the poor accuracy overlapped in distribution with another class, resulting in a sub-optimal decision boundary. I fixed the problem by scraping additional images that contained characteristics unique to that class.
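The t-SNE check described above can be sketched roughly as follows. The data here is synthetic (Gaussian blobs standing in for flattened images), and the overlap between two of the classes is engineered on purpose; in practice you would scatter-plot the 2-D embedding colored by class to spot it:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for flattened images: two overlapping classes, one well separated
rng = np.random.default_rng(0)
class_a = rng.normal(loc=0.0, size=(100, 64))
class_b = rng.normal(loc=0.3, size=(100, 64))  # overlaps heavily with class_a
class_c = rng.normal(loc=5.0, size=(100, 64))  # clearly separated
X = np.vstack([class_a, class_b, class_c])
y = np.repeat([0, 1, 2], 100)

# Project to 2-D; plotting emb colored by y would reveal the a/b overlap
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)
```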
The more I focused on the data rather than the model, the more transparent the "black box" of a Machine Learning model became. A machine learning model is a "function". It is not too different from y = f(x) in grade 11 math class. It's just that x, y, or both might be much higher-dimensional, possibly incurring a more complex relationship. I also cannot solve for the weights on a sheet of paper, as I did with a simple y = ax + b function. There might be hundreds of weights, if not millions. So I let the computer solve for the optimal weights. But those weights are just the numerical output of what I told it to do. The computer is trying to map the data (= x) I gave it to a particular form of labels (= y), which I also gave it. I'm responsible for setting up the environment and the props. The computer is only the computing hardware.
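To make the "I let the computer solve for the weights" point concrete, here is a minimal sketch that recovers a and b in y = ax + b by gradient descent on synthetic data. The true values a = 3, b = 1 and the learning rate are assumptions of this toy example:

```python
import numpy as np

# Synthetic data from y = 3x + 1 plus a little noise (toy assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.05, size=200)

# Gradient descent on mean squared error: the computer, not me, finds a and b
a, b = 0.0, 0.0
lr = 0.1
for _ in range(1000):
    err = a * x + b - y
    a -= lr * 2 * np.mean(err * x)  # dL/da
    b -= lr * 2 * np.mean(err)      # dL/db

print(a, b)  # should land near the true values 3 and 1
```

With millions of weights the same loop (in vectorized form) is all that changes; the setup of x and y remains my responsibility.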
Anyone can download an open-source dataset and fit a model with an open-source library, but a genuine curiosity to ask why the model behaves the way it does, and to analyze the data for answers, is what really builds an understanding of how a machine "learns". Failed models are not true failures, but stepping stones to the final robust model. My new ML philosophy, backed by months of personal experience through trial and error, is to analyze and optimize data iteratively.
- 𝕃 ☾₊˚.