Softmax Activation Function: Everything You Need to Know

Bala Priya C · 2023-06-30 · via Pinecone

Softmax Activation

Have you ever trained a neural network to solve the problem of multiclass classification? If yes, you know that the raw outputs of the neural network are often very difficult to interpret. The softmax activation function simplifies this for you by making the neural network’s outputs easier to interpret!

The softmax activation function transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the input classes. Consider a multiclass classification problem with N classes. The softmax activation returns an output vector that is N entries long, with the entry at index i corresponding to the probability of a particular input belonging to the class i.

In this tutorial, you’ll learn all about the softmax activation function. You’ll start by reviewing the basics of multiclass classification, then proceed to understand why you cannot use the sigmoid or argmax activations in the output layer for multiclass classification problems.

Finally, you’ll learn the mathematical formulation of the softmax function and implement it in Python.

Let’s get started.

Multiclass Classification Revisited

Recall that in binary classification, there are only two possible classes. For example, a ConvNet trained to classify whether or not a given image is a panda is a binary classifier. In multiclass classification, by contrast, there are more than two possible classes.

Let’s consider the following example: You’re given a dataset containing images of pandas, seals, and ducks. You’d like to train a neural network to predict whether a previously unseen image is that of a seal, a panda, or a duck.

Notice how the input class labels below are one-hot encoded, and the classes are mutually exclusive. In this context, mutual exclusivity means that a given image can only be one of {seal, panda, duck} at a time.

Multiclass Classification Example (Image by the author)
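As a rough sketch of what one-hot encoded labels look like, the snippet below builds them with NumPy. The class order (seal, panda, duck) is inferred from the indices used later in this tutorial, and the helper name one_hot is only illustrative.

```python
import numpy as np

# Class order inferred from the tutorial: index 0 = seal, 1 = panda, 2 = duck.
classes = ["seal", "panda", "duck"]

def one_hot(label, classes):
    """Return a vector with 1 at the index of `label` and 0 elsewhere."""
    vec = np.zeros(len(classes), dtype=int)
    vec[classes.index(label)] = 1
    return vec

for name in classes:
    print(name, one_hot(name, classes))
```

Each label vector has exactly one entry equal to 1, which reflects the mutual exclusivity of the classes.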

Can You Use Sigmoid or Argmax Activations Instead?

In this section, you’ll learn why the sigmoid and argmax functions are not the optimal choices for the output layer in a multiclass classification problem.

Limitations of the Sigmoid Function

Mathematically, the sigmoid activation function is given by the following equation, and it squashes all inputs into the open interval (0, 1).

Sigmoid Function Equation (Image by the author)
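The equation image is not reproduced in this text version. The standard form of the sigmoid, which matches the code in this section, can be written as:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}
```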

The sigmoid function takes in any real number as the input and maps it to a number between 0 and 1. This is exactly why it’s well-suited for binary classification.

▶️ You may run the following code cell to plot the values of the sigmoid function over a range of numbers.

import numpy as np
import seaborn as sns

def sigmoid(x):
  '''Return the sigmoid of x, computed element-wise.'''
  exp_x = np.exp(x)
  return exp_x / (1 + exp_x)

# Evaluate the sigmoid at 200 evenly spaced points in [-10, 10]
x = np.linspace(-10, 10, num=200)
sigmoid_arr = sigmoid(x)

sns.set_theme()
sns.lineplot(x=x, y=sigmoid_arr).set(title='Sigmoid Function')

Plot of the Sigmoid Function

Let’s go back to our example of classifying whether an input image is that of a panda or not. In this case, let z be the raw output of the neural network. If σ(z) is the probability that the given image belongs to class 1 (is a panda), then 1 - σ(z) is the probability that the given image does not belong to class 1 and is not a panda. You can think of σ(z) as a probability score.

You can now fix a threshold, say T, and predict class 1 (panda) whenever the probability score σ(z) is greater than T, and “not a panda” otherwise.
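A minimal sketch of this thresholding rule is shown here. The function name predict_binary and the default threshold of 0.5 are assumptions made for illustration, not values from the tutorial.

```python
import numpy as np

def sigmoid(z):
    """Standard sigmoid, written in the 1 / (1 + e^(-z)) form."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_binary(z, threshold=0.5):
    """Return 1 (panda) if sigmoid(z) exceeds the threshold, else 0."""
    return int(sigmoid(z) > threshold)

print(predict_binary(1.23))  # sigmoid(1.23) is about 0.774, so class 1
print(predict_binary(-0.8))  # sigmoid(-0.8) is about 0.310, so class 0
```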

However, this won’t quite work when you have more than two classes. Softmax to the rescue!

In fact, you can think of the softmax function as a vector generalization of the sigmoid activation. We’ll revisit this later to confirm that for binary classification—when N = 2—the softmax and sigmoid activations are equivalent.
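To see one reason the sigmoid falls short with more than two mutually exclusive classes, the sketch below applies it independently to each entry of the example vector z = [0.25, 1.23, -0.8] used in this tutorial. The resulting scores do not sum to 1, so they cannot be read as a probability distribution over the three classes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.25, 1.23, -0.8])
scores = sigmoid(z)  # each entry is squashed independently of the others

print(np.round(scores, 3))
print(round(float(scores.sum()), 3))  # about 1.646, not 1
```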

Limitations of the Argmax Function

The argmax function returns the index of the maximum value in the input array.

Let’s suppose the neural network’s raw output vector is given by z = [0.25, 1.23, -0.8]. In this case, the maximum value is 1.23 and it occurs at index 1. In our image classification example, index 1 corresponds to the second class—and the image is predicted to be that of a panda.

In vector notation, you’ll have 1 at the index where the maximum occurs (at index 1 for the vector z). And you’ll have 0 at all other indices.

Argmax Output (Image by the author)
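The same result can be reproduced with np.argmax. This is a small sketch; the variable names are illustrative.

```python
import numpy as np

z = np.array([0.25, 1.23, -0.8])

idx = int(np.argmax(z))              # index of the maximum value
argmax_vec = np.zeros_like(z, dtype=int)
argmax_vec[idx] = 1                  # 1 at the maximum, 0 elsewhere

print(idx)         # 1
print(argmax_vec)  # [0 1 0]
```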

One limitation of the argmax function is that its gradient with respect to the raw outputs of the neural network is zero wherever it is defined: a small change in the raw outputs does not change which index holds the maximum. As you know, it’s the backpropagation of gradients that facilitates the learning process in neural networks.

As you’ll have to plug in the value 0 for all gradients of the argmax output during backpropagation, you cannot use the argmax function in training. Unless there’s backpropagation of gradients, the parameters of the neural network cannot be adjusted, and there’s effectively no learning!
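One way to check this numerically is a finite-difference estimate of the gradient. The sketch below nudges each raw output by a small amount eps (an arbitrary choice) and measures how the one-hot argmax output changes. The helper argmax_one_hot is illustrative.

```python
import numpy as np

def argmax_one_hot(z):
    """Return a vector with 1.0 at the index of the maximum and 0.0 elsewhere."""
    out = np.zeros_like(z)
    out[np.argmax(z)] = 1.0
    return out

z = np.array([0.25, 1.23, -0.8])
eps = 1e-4

# Central-difference estimate of d(output) / d(z[j]) for every input j.
jacobian = np.zeros((z.size, z.size))
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    jacobian[:, j] = (argmax_one_hot(z + step) - argmax_one_hot(z - step)) / (2 * eps)

print(jacobian)  # every entry is 0
```

Because the nudges are too small to change which entry is largest, the output never moves and every estimated gradient is zero.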

From a probabilistic viewpoint, notice how the argmax function puts all the probability mass on index 1, the predicted class, and 0 elsewhere. So it’s straightforward to infer the predicted class label from the argmax output. However, we would like to know how likely the image is to be that of a panda, a seal, or a duck, and the softmax scores help us with just that!

The Softmax Activation Function, Explained

It’s finally time to learn about softmax activation. The softmax activation function takes in a vector of raw outputs of the neural network and returns a vector of probability scores.

The equation of the softmax function is given as follows:

Softmax Function Equation (Image by the author)
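The equation image is not reproduced in this text version. Written out in the standard form, with indices 0 through N - 1 to match the indexing used in this tutorial, it reads:

```latex
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=0}^{N-1} e^{z_j}}, \qquad i = 0, 1, \dots, N-1
```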

Here,

z is the vector of raw outputs from the neural network
e ≈ 2.718 is Euler’s number, the base of the natural logarithm
The i-th entry in the softmax output vector softmax(z) can be thought of as the predicted probability of the test input belonging to class i.

From the plot of e^x, you can see that, regardless of whether the input x is positive, negative, or zero, e^x is always a positive number.

Plot of exp(x)

Recall that in our example, N = 3 as we have 3 classes: {seal, panda, duck}, and the valid indices are 0, 1, and 2. Suppose you’re given the vector z = [0.25, 1.23, -0.8] of raw outputs from the neural network.

Let’s apply the softmax formula on the vector z, using the steps below:

1. Calculate the exponential of each entry.
2. Divide each result of step 1 by the sum of the exponentials of all entries.

Computing softmax scores for the 3 classes (Image by the author)
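The two steps can also be traced in NumPy. This is a sketch with illustrative variable names, and the values in the comments are rounded.

```python
import numpy as np

z = np.array([0.25, 1.23, -0.8])

# Step 1: exponential of each entry.
exp_z = np.exp(z)          # about [1.284, 3.421, 0.449]

# Step 2: divide each exponential by the sum of all exponentials.
total = exp_z.sum()        # about 5.155
scores = exp_z / total     # about [0.249, 0.664, 0.087]

print(np.round(exp_z, 3))
print(round(float(total), 3))
print(np.round(scores, 3))
```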

▶️ Now that we’ve computed the softmax scores, let’s collect them into a vector for succinct representation, as shown below:

Softmax Output (Image by the author)

From the softmax output above, we can make the following observations:

In the vector z of raw outputs, the maximum value is 1.23, which on applying softmax activation maps to 0.664: the largest entry in the softmax output vector. Likewise, 0.25 and -0.8 map to 0.249 and 0.087: the second and the third largest entries in the softmax output respectively. Thus, applying softmax preserves the relative ordering of scores.
All entries in the softmax output vector are between 0 and 1.
In a multiclass classification problem, where the classes are mutually exclusive, notice how the entries of the softmax output sum up to 1: 0.664 + 0.249 + 0.087 = 1.
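These three observations can be checked programmatically. The sketch below uses unrounded softmax scores, and the variable names are illustrative.

```python
import numpy as np

z = np.array([0.25, 1.23, -0.8])
p = np.exp(z) / np.exp(z).sum()

# 1. Relative ordering is preserved: sorting z and sorting p give the same index order.
same_order = np.array_equal(np.argsort(z), np.argsort(p))

# 2. Every entry lies strictly between 0 and 1.
in_range = bool(np.all((p > 0) & (p < 1)))

# 3. The entries sum to 1 (up to floating-point error).
sums_to_one = bool(np.isclose(p.sum(), 1.0))

print(same_order, in_range, sums_to_one)  # True True True
```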

This is exactly why you can think of the softmax output as a probability distribution over the input classes, which makes it readily interpretable.

As a next step, let’s examine the softmax output for our example.

In the vector softmax(z) = [0.249, 0.664, 0.087], 0.664 at index 1 is the largest value. This means there’s a 66.4% chance that the given image belongs to class 1, which from our one-hot encoding is the panda class.

And the input image has a 24.9% chance of being a seal and around an 8.7% chance of being a duck.

Therefore, applying softmax gives instant interpretability, as you know how likely the test image is to belong to each of the 3 classes. In this particular example, it’s highly likely to be a panda and least likely to be a duck.

It now makes sense to call the argmax function on the softmax output to get the predicted class label. As the predicted class label is the one with the highest probability score, you can use argmax(softmax(z)) to obtain the predicted class label. In our example, the highest probability score of 0.664 occurs at index 1, corresponding to class 1 (panda).
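A short sketch of argmax(softmax(z)) on the example vector follows. The class-name list is assumed to be in the order seal, panda, duck, as in the rest of the tutorial.

```python
import numpy as np

classes = ["seal", "panda", "duck"]  # assumed order: index 0, 1, 2
z = np.array([0.25, 1.23, -0.8])

p = np.exp(z) / np.exp(z).sum()      # softmax scores
pred = int(np.argmax(p))             # index of the highest score

print(pred, classes[pred])  # 1 panda
```

Since softmax preserves the ordering of the raw outputs, np.argmax(z) returns the same index. The softmax step adds the probability scores rather than changing the prediction.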

How to Implement the Softmax Activation in Python

In the previous section, we did some simple math to compute the softmax scores for the output vector z.

Now let’s translate the math operations into equivalent operations on NumPy arrays. You may use the following code snippet to get the softmax activation for any vector z.

import numpy as np

def softmax(z):
  '''Return the softmax output of a vector.'''
  exp_z = np.exp(z)
  sum_exp_z = exp_z.sum()  # named sum_exp_z to avoid shadowing the built-in sum
  softmax_z = np.round(exp_z / sum_exp_z, 3)
  return softmax_z

We can parse the definition of the softmax function:

The function takes in one required parameter z, a vector, and returns the softmax output vector softmax_z.
We use np.exp(z) to compute the exponential of each entry in z; call the resultant array exp_z.
Next, we call sum on the array exp_z to compute the sum of exponents.
We then divide each entry in exp_z by the sum and round off the result to 3 decimal places, storing the result in a variable, say, softmax_z.
Finally, the function returns the array softmax_z.

You may now call the function with the output array z as the argument and verify that the scores are identical to what we had computed manually.

z = [0.25, 1.23, -0.8]
softmax(z)

# Output
array([ 0.249, 0.664, 0.087])
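One practical caveat, not covered in the original walkthrough: np.exp overflows for large inputs, so a direct implementation can return nan for raw outputs in the hundreds or thousands. A common workaround, sketched below under the name stable_softmax (an illustrative name), subtracts the maximum entry before exponentiating. The result is unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def stable_softmax(z):
    """Softmax that shifts inputs by their maximum to avoid overflow in np.exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # largest entry becomes 0, so np.exp stays <= 1
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

print(np.round(stable_softmax([0.25, 1.23, -0.8]), 3))  # same scores as before
print(stable_softmax([1000.0, 1001.0, 1002.0]))         # finite values, no overflow
```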

Are you wondering if normalizing each value by the sum of the entries will suffice to get relative scores? Let’s see why it’s not a suitable solution.

Why Won’t Normalization by the Sum Suffice

Why use something as math-heavy as the softmax activation? Can we not just divide each of the output values by the sum of all outputs?

Well, let’s try to answer this by taking a few examples.

Use the following function to return the array normalized by the sum.

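def div_by_sum(z):
  '''Return z divided by the sum of its entries.'''
  z = np.asarray(z, dtype=float)  # accept plain Python lists
  sum_z = np.sum(z)
  out_z = np.round(z / sum_z, 3)
  return out_z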

1️⃣ Consider z1 = [0.25, 1.23, -0.8], and call the function div_by_sum. In this case, though the entries in the returned array sum up to 1 (up to rounding), the array contains a negative value and a value greater than 1. So we still aren’t able to interpret the entries as probability scores.

z1 = [0.25,1.23,-0.8]
div_by_sum(z1)

# Output
array([ 0.368,  1.809, -1.176])

2️⃣ Let z2 = [-0.25, 1, -0.75]. In this case, the elements of the vector sum up to zero, so the denominator is 0. When you divide by the sum to normalize, you’ll get a runtime warning and infinite values, as division by zero is not defined.

z2 = [-0.25,1,-0.75]
div_by_sum(z2)

# Output
RuntimeWarning: divide by zero encountered in true_divide
array([-inf,  inf, -inf])

3️⃣ In this example, z3 = [0.1, 0.9, 0.2]. Let’s check both the softmax and normalized scores.

z3 = [0.1,0.9,0.2] # ratio: 1:9:2
print(div_by_sum(z3))
print(softmax(z3))

# Output
[0.083 0.75  0.167] # ratio: 1:9:2
[0.231 0.514 0.255]

As shown in the code cell above, when all the inputs are positive, you may interpret the normalized scores as probability scores, but the scores are in the same ratio as in the array z3. In this example, the predicted class is still that of a panda.

However, you can’t guarantee that the neural network’s raw output won’t sum up to 0 or have negative entries.

4️⃣ In this example, z4 = [0, 0.9, 0.1]. Let’s check both the softmax and normalized scores.

z4 = [0,0.9,0.1]
print(div_by_sum(z4))
print(softmax(z4))

# Output
[0.  0.9 0.1]
[0.219 0.539 0.242]

As you can see, when one of the entries is 0, upon calling the div_by_sum function, the entry is still 0 in the normalized array. However, in the softmax output, you can see that 0 has been mapped to a score of 0.219.

In some sense, you can think of the softmax activation function as a softer version of the argmax function: it assigns the highest probability score to the class with the largest raw output. At the same time, it’s soft because it also assigns some probability mass to the less likely classes, unlike the argmax function, which puts the entire probability mass of 1 on the maximum and 0 everywhere else.

In essence, the softmax activation can be perceived as a smooth approximation to the argmax function.
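One way to see this relationship is to multiply the raw outputs by a growing factor before applying softmax. In the sketch below, the factor scale is a hypothetical knob introduced only for illustration. As it grows, the softmax output approaches the one-hot argmax vector.

```python
import numpy as np

def softmax(z):
    """Unrounded softmax, shifted by the maximum for numerical safety."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

z = np.array([0.25, 1.23, -0.8])

for scale in (1, 5, 50):
    # Larger scale widens the gaps between entries, sharpening the output.
    print(scale, np.round(softmax(scale * z), 3))
```

At scale = 50, the output rounded to 3 decimal places is [0, 1, 0], the same vector argmax produces for z.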

Equivalence of the Sigmoid, Softmax Activations for N = 2

Now let’s revisit our earlier claim that the sigmoid and softmax activations are equivalent for binary classification when N = 2.

Recall that in binary classification, you apply the sigmoid function to the neural network’s output to get a value in the range (0, 1).

When you’re using the softmax function for multiclass classification, the number of nodes in the output layer = the number of classes N.

You can think of binary classification as a special case of multiclass classification. Assume that the output layer has two nodes: one outputting the score z and the other 0.

Effectively, there’s only one learned output, as the other node always outputs 0. The raw output vector now becomes [z, 0]. Next, we may go ahead and apply the softmax activation to this vector and check that it’s equivalent to the sigmoid function we looked at earlier.

Equivalence of Sigmoid & Softmax Activations (Image by the author)
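The derivation image is not reproduced in this text version. Applying the softmax formula to the two-entry vector [z, 0], and using the standard sigmoid σ(z) = e^z / (1 + e^z), the steps work out as follows:

```latex
\begin{aligned}
\mathrm{softmax}([z, 0])_0 &= \frac{e^{z}}{e^{z} + e^{0}} = \frac{e^{z}}{e^{z} + 1} = \sigma(z) \\
\mathrm{softmax}([z, 0])_1 &= \frac{e^{0}}{e^{z} + e^{0}} = \frac{1}{e^{z} + 1} = 1 - \sigma(z)
\end{aligned}
```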

Observe how the softmax activation scores in this case are the same as the sigmoid activation scores: σ(z) and 1 - σ(z).
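The equivalence can also be checked numerically for a few sample values of z. The sample values below are arbitrary, and both helper functions are unrounded versions written for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    exp_v = np.exp(v - np.max(v))
    return exp_v / exp_v.sum()

for z in (-2.0, 0.0, 1.23):
    two_class = softmax(np.array([z, 0.0]))        # softmax over [z, 0]
    expected = [sigmoid(z), 1.0 - sigmoid(z)]      # sigmoid scores for the two classes
    print(z, np.allclose(two_class, expected))     # True for each z
```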

And with this, we wrap up our discussion on the softmax activation function. Let’s quickly summarize all that we’ve learned.

Summing Up

In this tutorial, you’ve learned the following:

How to use the softmax function as output layer activation in a multiclass classification problem.
The working of the softmax function—how it transforms a vector of raw outputs into a vector of probabilities. And how you can interpret each entry in the softmax output as the probability of the corresponding class.
How to interpret the softmax activation as an extension of the sigmoid function to multiclass classification, and their equivalence for binary classification where the number of classes N = 2.

In the next tutorial, we’ll delve deep into cross-entropy loss, a widely used measure of how well your multiclass classification model performs.

Until then, check out other interesting NLP tutorials on vector search, algorithms, and more. Happy learning!
