From 181cd8323e9cfd56b143cafb615a6b0fd78bef36 Mon Sep 17 00:00:00 2001
From: lemisma
Date: Mon, 13 Apr 2020 19:30:01 -0700
Subject: [PATCH] The second set of lecture notes on neural networks contains
 an inaccuracy regarding hierarchical softmax. This pull request fixes it by
 clarifying the use cases of hierarchical softmax.

---
 neural-networks-2.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/neural-networks-2.md b/neural-networks-2.md
index ec8c0a8a..2826ba33 100644
--- a/neural-networks-2.md
+++ b/neural-networks-2.md
@@ -242,7 +242,7 @@ $$
 L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right)
 $$
 
-**Problem: Large number of classes**. When the set of labels is very large (e.g. words in English dictionary, or ImageNet which contains 22,000 categories), it may be helpful to use *Hierarchical Softmax* (see one explanation [here](http://arxiv.org/pdf/1310.4546.pdf) (pdf)). The hierarchical softmax decomposes labels into a tree. Each label is then represented as a path along the tree, and a Softmax classifier is trained at every node of the tree to disambiguate between the left and right branch. The structure of the tree strongly impacts the performance and is generally problem-dependent.
+**Problem: Large number of classes**. When the set of labels is very large (e.g. words in English dictionary, or ImageNet which contains 22,000 categories), computing the full softmax probabilities becomes expensive. For certain applications, approximate versions are popular. For instance, it may be helpful to use *Hierarchical Softmax* in natural language processing tasks (see one explanation [here](http://arxiv.org/pdf/1310.4546.pdf) (pdf)). The hierarchical softmax decomposes words (as labels) into a tree. Each label is then represented as a path along the tree, and a Softmax classifier is trained at every node of the tree to disambiguate between the left and right branch. The structure of the tree strongly impacts the performance and is generally problem-dependent.
 
 **Attribute classification**. Both losses above assume that there is a single correct answer \\(y_i\\). But what if \\(y_i\\) is a binary vector where every example may or may not have a certain attribute, and where the attributes are not exclusive? For example, images on Instagram can be thought of as labeled with a certain subset of hashtags from a large set of all hashtags, and an image may contain multiple. A sensible approach in this case is to build a binary classifier for every single attribute independently. For example, a binary classifier for each category independently would take the form:
 
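To make the tree decomposition in the revised paragraph concrete, here is a minimal sketch of hierarchical softmax, assuming an implicit balanced binary tree over class indices with heap-style node numbering and one sigmoid ("two-way softmax") weight vector per internal node; the tree layout, variable names, and weight initialization are illustrative assumptions, not part of the notes or of this patch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_logprob(x, y, node_weights, num_classes):
    """Log-probability of class y under a hierarchical softmax.

    Classes occupy the leaves of an implicit balanced binary tree; each
    internal node holds one weight vector, and sigmoid(w . x) is the
    probability of taking the right branch. Only the O(log C) nodes on the
    root-to-leaf path of y are evaluated, instead of all C classes.
    """
    logp = 0.0
    node = 0                    # root (heap-style indexing: children 2n+1, 2n+2)
    lo, hi = 0, num_classes     # class indices covered by the current subtree
    while hi - lo > 1:
        mid = (lo + hi) // 2
        p_right = sigmoid(node_weights[node].dot(x))
        if y >= mid:            # target lies in the right subtree
            logp += np.log(p_right)
            lo, node = mid, 2 * node + 2
        else:                   # target lies in the left subtree
            logp += np.log(1.0 - p_right)
            hi, node = mid, 2 * node + 1
    return logp

# usage: D-dimensional input, C classes (e.g. the 22,000 ImageNet categories)
D, C = 64, 22000
rng = np.random.RandomState(0)
x = rng.randn(D)
node_weights = 0.01 * rng.randn(2 * C, D)   # enough rows for all internal nodes
print(hierarchical_softmax_logprob(x, y=137, node_weights=node_weights, num_classes=C))
```

Because the branch probabilities at each node sum to one, the leaf probabilities form a valid distribution over all classes while only one root-to-leaf path is evaluated per example; in practice the tree is usually built from data (e.g. a Huffman tree over word frequencies, as in the linked word2vec paper) rather than from raw class indices.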