Bayesian Learning - All of Machine Learning
If you want to know the basics of machine learning, this short read will clear up most of the "why" questions about machine learning.
Some Gyan
In every subject that involves some logical thinking, there is a fundamental law, and the rest is built on top of it. For example, physics starts from basic observations of the world and formalizes them into math: Newton's laws, Relativity, Quantum mechanics, etc. The remaining topics, like rigid body dynamics and fluid mechanics, are just combinations of these laws plus some mathematical manipulation. (This is my understanding. It can differ from person to person 🥹)
Here is my understanding of machine learning’s fundamental law, where machine learning meets probability theory.
We have data and want to find the relations between these data points (Unsupervised Learning), or we want to find the relation between a data point's attributes and its label and extend it to unseen data (Supervised Learning). Bayesian Learning gives a way to model this particular problem; it joins machine learning with probability theory.
Formal Definition
In Bayesian Learning, a learner tries to find the most probable hypothesis h from a set of hypotheses H, given the observed data. This maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis, and we use Bayes' theorem to compute it. This is the basic concept of Bayesian Learning; everything else is just examples and implementations.
\[h_{\text{map}} = \arg\max_{h \in H} P(h|D)\]
We define a hypothesis class H and assume that some hypothesis h belonging to this class generated the given data D. According to Bayesian Learning, the best hypothesis is the one that maximizes the posterior P(h|D), i.e., the MAP hypothesis. Here, the hypothesis h is also treated as a random variable.
Expanding P(h|D) with Bayes' theorem:
\[h_{\text{map}} = \arg\max_{h \in H} \left( \frac{P(D|h) \cdot P(h)}{P(D)} \right)\]
Since P(D) does not depend on h, it can be dropped:
\[h_{\text{map}} = \arg\max_{h \in H} \left( P(D|h) \cdot P(h) \right)\]
The prior P(h) can be modeled using a uniform distribution, a normal distribution, or anything you like. If we choose a uniform distribution, P(h) is the same for every h, and the equation reduces to
\[h_{\text{map}} = \arg\max_{h \in H} P(D|h)\]
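To make this concrete, here is a minimal Python sketch (a toy example of my own, with a hypothetical hypothesis set of three coin biases and a made-up prior) that computes both the MAP hypothesis and the uniform-prior special case directly from these equations:

```python
import numpy as np

# Hypothetical hypothesis class H: three candidate coin biases.
H = np.array([0.3, 0.5, 0.7])

# Assumed (made-up) prior P(h) over the hypotheses.
prior = np.array([0.2, 0.5, 0.3])

# Observed data D: 8 heads out of 10 flips (made-up numbers).
heads, flips = 8, 10

# Likelihood P(D|h) for each hypothesis (binomial, constant factor dropped).
likelihood = H**heads * (1 - H)**(flips - heads)

# Posterior P(h|D) is proportional to P(D|h) * P(h); P(D) is the same for
# every h, so it does not affect the argmax.
posterior = likelihood * prior
posterior /= posterior.sum()

h_map = H[np.argmax(posterior)]   # maximum a posteriori hypothesis
h_mle = H[np.argmax(likelihood)]  # with a uniform prior, MAP reduces to this
print(h_map, h_mle)
```

In this particular toy run the two argmaxes happen to agree; with a prior that leans strongly enough toward 0.5, they would not.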
Now suppose D = (X, Y) and the hypothesis h is parameterized by θ:
\[h_{\text{map}} = \arg\max_{h \in H} P(X,Y;θ)\]
This is the exact equation when we consider Generative Learning. Breaking it down further:
\[h_{\text{map}} = \arg\max_{h \in H} \left( P(Y|X;θ) \cdot P(X) \right)\]
If we assume that the data X is generated from a uniform distribution over some interval, then P(X) is constant and this becomes
\[h_{\text{map}} = \arg\max_{h \in H} P(Y|X;θ)\]
This is the exact equation when we consider Discriminative Learning.
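To make the two objectives concrete, here is a small numpy sketch (again a toy example of my own, with a hypothetical 1-D, two-class dataset and class-conditional unit-variance Gaussians) that evaluates the joint log-likelihood log P(X, Y; θ) a generative learner would maximize and the conditional log-likelihood log P(Y | X; θ) a discriminative learner would maximize:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D, two-class dataset (made up for illustration).
X = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
Y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

# One simple hypothesis h, parameterized by theta = (class priors, class means).
pi = np.array([0.5, 0.5])    # P(y)
mu = np.array([-1.0, 1.0])   # mean of a unit-variance Gaussian per class

def log_gauss(x, m):
    return -0.5 * (x - m) ** 2 - 0.5 * np.log(2 * np.pi)

# Generative objective: joint log-likelihood  sum_i log P(x_i, y_i; theta).
log_joint = np.log(pi[Y]) + log_gauss(X, mu[Y])

# Discriminative objective: conditional log-likelihood sum_i log P(y_i | x_i; theta),
# obtained by normalizing the joint over all classes.
log_joint_all = np.log(pi)[None, :] + log_gauss(X[:, None], mu[None, :])
log_conditional = log_joint - np.log(np.exp(log_joint_all).sum(axis=1))

print("joint log-likelihood:      ", log_joint.sum())
print("conditional log-likelihood:", log_conditional.sum())
```

The conditional term is obtained from the joint by normalizing over all classes, which mirrors how Discriminative Learning drops out of Generative Learning above.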
Here we can also state that Bayesian Learning is more general than Generative Learning, and Generative Learning is more general than Discriminative Learning:
\[\text{Bayesian} > \text{Generative} > \text{Discriminative}\]
Let us understand this with an example. When we consider P(h) as a normal distribution in Bayesian Learning, we get an extra regularization term in the loss function. This regularization term cannot be derived from Generative or Discriminative Learning. Similarly, in Generative Learning, when we consider P(X) to be uniform, we get Discriminative Learning, but we cannot derive the Generative Learning equation from Discriminative Learning.
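To see where that regularization term comes from, here is the standard derivation, sketched under the assumption of a zero-mean Gaussian prior on θ with variance σ²: taking negative logs in the MAP objective gives
\[-\log P(h) = \frac{\|\theta\|^2}{2\sigma^2} + \text{const}, \qquad h_{\text{map}} = \arg\min_{\theta} \left( -\log P(D|h) + \frac{\|\theta\|^2}{2\sigma^2} \right)\]
The second term is exactly the L2 (weight decay) penalty that is usually added to the loss by hand.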
All of Machine Learning
Any model in machine learning involves these steps, even if we usually do not look at them explicitly and keep them abstract. When we define a model architecture, we are essentially defining the hypothesis class H. Solving the equation is the learning process (the different gradient descent methods). At test time, we take a sample from the data distribution D and calculate the probability that it belongs to a particular class.
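As a minimal end-to-end sketch of these steps (a toy example of my own, not any particular library's API): the hypothesis class is a logistic model, learning is gradient descent on the negative log of P(Y | X; θ), and at test time we read off a class probability for an unseen point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: 1-D inputs, binary labels.
X = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
Y = np.concatenate([np.zeros(50), np.ones(50)])

# Hypothesis class H: logistic models P(y=1 | x; theta) = sigmoid(w*x + b).
w, b = 0.0, 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# "Solving the equation": gradient descent on the negative log-likelihood
# of P(Y | X; theta).
for _ in range(500):
    p = sigmoid(w * X + b)
    grad_w = np.mean((p - Y) * X)   # d(-log P(Y|X;theta)) / dw
    grad_b = np.mean(p - Y)         # d(-log P(Y|X;theta)) / db
    w -= lr * grad_w
    b -= lr * grad_b

# Test time: probability that an unseen point belongs to class 1.
x_test = 0.5
print("P(y=1 | x=0.5) ≈", sigmoid(w * x_test + b))
```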
For example, take an LLM: a stack of transformers is just modeling a function with millions of parameters, which is the hypothesis class H. We maximize the probability P(word | other words) by solving the Bayesian learning equation, which looks just like P(Y | X), and we solve it by backpropagation. And this simple thing could also generate this whole blog; maybe this post itself is just the result of probability maximization.
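An actual transformer is far beyond a blog snippet, but a bigram counting model is arguably the smallest instance of the same idea, maximizing P(word | previous word) by simple counting (again a toy sketch of my own, with a made-up corpus):

```python
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for the training data D.
corpus = "the cat sat on the mat and the cat slept".split()

# Count bigrams; normalized counts are the maximum-likelihood estimate of
# P(next word | previous word).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print("P(cat | the) =", p_next("the", "cat"))   # 2/3 in this toy corpus
print("P(mat | the) =", p_next("the", "mat"))   # 1/3
```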