What is an activation function in a neural network?

An activation function transforms a neuron's weighted input sum into an output signal. It introduces non-linearity so networks can learn complex patterns — without one, a deep network reduces to a single linear equation.

What is the best activation function for hidden layers?

ReLU is the standard choice for hidden layers in most neural networks. It trains fast, avoids vanishing gradients in shallow-to-medium networks, and works well across CNNs, MLPs, and feedforward architectures.

When should I use sigmoid vs softmax activation?

Use sigmoid for binary classification outputs — it returns a single probability between 0 and 1. Use softmax for multi-class outputs — it returns a probability distribution across all classes that sums to 1.

What is the dying ReLU problem?

Dying ReLU occurs when neurons permanently output zero after receiving large negative inputs, making the gradient zero and stopping learning. Fix it by switching to Leaky ReLU, which allows small non-zero outputs for negative inputs.

What activation function do LLMs use?

Most modern LLMs use GELU or SwiGLU. GPT and BERT use GELU; LLaMA and PaLM use SwiGLU. Both outperform ReLU in transformer architectures by providing smoother gradient flow through very deep networks.

Why do neural networks need activation functions?

Without activation functions, stacking layers provides no benefit — the network collapses to a single linear transformation. Activation functions add non-linearity, allowing networks to model complex real-world patterns.

What is GELU and why do transformers use it?

GELU (Gaussian Error Linear Unit) is a smooth activation that weights inputs by their Gaussian probability. Transformers use it because it provides better gradient flow and improved performance on language tasks versus ReLU.

Activation Functions in Neural Networks:

Activation functions are the mathematical operations that determine whether and how strongly a neuron fires in a neural network. Without them, even a hundred-layer network reduces to a single linear equation — unable to model curves, patterns, or complex relationships in data.

Choosing the right activation function is one of the highest-leverage architectural decisions in deep learning. A wrong choice at the output layer causes silent model failure. A wrong choice in hidden layers can cause gradients to vanish over dozens of layers — an effect rooted in how backpropagation propagates error signals through the chain rule. The activation derivative f’(net) is the same term that appears in the generalized delta rule, which extends single-layer error-based learning to non-linear networks. Getting it right accelerates training and often improves final accuracy without any other changes.

This guide covers every major activation function used in production AI systems — from the workhorse ReLU to the transformer-era GELU and SwiGLU — and gives you a decision framework for choosing the right one for your architecture.

What Are Activation Functions?

Activation functions transform the weighted sum of a neuron’s inputs into an output signal passed to the next layer. Without them, stacking neural network layers provides zero benefit — the entire network collapses mathematically to a single linear transformation, regardless of depth. Activation functions introduce the non-linearity that lets networks approximate any continuous function and learn from complex, high-dimensional data.

Think of a single neuron: it receives inputs, multiplies each by a learned weight, adds a bias term, and passes the result through an activation function. That output becomes the input to neurons in the next layer. The choice of activation function defines what patterns that neuron can detect and communicate forward.

Why Non-Linearity Is Essential

A neural network without activation functions is mathematically equivalent to logistic regression, no matter how many layers you add. The Universal Approximation Theorem, established by Cybenko (1989) and Hornik (1991), proves that a network with even a single hidden layer and a non-linear activation function can approximate any continuous function to arbitrary precision given enough neurons.

This is the theoretical foundation for why neural networks can recognize faces in photos, translate between languages, detect cancer in medical images, and predict equipment failures — all tasks that require modeling deeply non-linear relationships.

Where Activation Functions Appear

Every neuron in every hidden layer has an activation function. Output layer neurons also have activation functions, but the choice there is dictated by task type rather than architecture preference. When building a neural network from scratch, selecting activation functions at each layer is as critical as choosing the number of layers and neurons.

The same network with sigmoid vs. ReLU hidden-layer activations can see dramatically different training dynamics — one converging in hours, the other failing to learn at all.

The 5 Core Activation Functions

Five activation functions cover the vast majority of modern neural network use cases: ReLU for most hidden layers, sigmoid and softmax for output layers, tanh for recurrent networks, and GELU or SwiGLU for modern transformers. Understanding when each excels — and where it fails — is fundamental to understanding how deep learning works in practice.

ReLU (Rectified Linear Unit)

ReLU is defined as f(x) = max(0, x): it outputs zero for any negative input and passes positive inputs unchanged. Popularized for deep learning by Nair and Hinton (2010) in their work on Restricted Boltzmann Machines, ReLU quickly became the standard activation for convolutional and feedforward networks.

Why ReLU dominates hidden layers:

Computationally cheap — one comparison operation per neuron, no exponentials
Sparse activations — roughly 50% of neurons output zero at any time, improving efficiency
No vanishing gradient for positive inputs — gradient is exactly 1, enabling learning in very deep networks
Trains 3-6x faster than sigmoid-based networks in practice, per benchmarks reported by Krizhevsky et al. (2012) with AlexNet

According to TensorFlow’s official documentation, ReLU is the recommended default activation for fully connected and convolutional hidden layers.

Limitations: ReLU suffers from the “dying ReLU” problem (covered in Section 5) and outputs only non-negative values, which can be restrictive in some architectures.

Sigmoid

Sigmoid squashes any input into a range between 0 and 1 along an S-shaped curve: f(x) = 1 / (1 + e⁻ˣ). It was the standard activation through the 1990s and early 2000s before ReLU demonstrated superior performance in deep networks.

When sigmoid is still the right choice:

Binary classification output layers — it converts a raw score to a calibrated probability between 0 and 1
Cases where a single output neuron needs a probability interpretation

Why sigmoid fell out of favor for hidden layers:

Vanishing gradients — for inputs far from zero, the gradient approaches zero, causing early layers to receive near-zero updates and stop learning
Not zero-centered — outputs are always positive, creating zig-zagging gradient descent dynamics
Saturates quickly — neurons frequently end up in flat gradient regions

Modern practice: use sigmoid only in binary classification output layers. Never use it in hidden layers if you can avoid it. To see the vanishing-gradient problem live, swap ReLU for Sigmoid on the spiral dataset in TensorFlow Playground — loss creeps down for ~200 epochs, then stalls.

Tanh (Hyperbolic Tangent)

Tanh outputs values between -1 and 1: f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ). Its zero-centered output range makes it strictly better than sigmoid for hidden layers when you need a bounded symmetric activation.

When tanh works well:

Recurrent neural networks — tanh remains the standard for RNN hidden state updates and LSTM gating mechanisms
Shallow networks where vanishing gradients are less severe
When zero-centered outputs are important for optimization stability

Limitation: Like sigmoid, tanh suffers from vanishing gradients for extreme input values. For networks deeper than roughly 10 layers, ReLU variants are more practical.

Softmax

Softmax transforms a vector of K raw scores into a probability distribution over K classes where all outputs sum to 1. For each output zᵢ: f(zᵢ) = eᶻⁱ / Σeᶻʲ across all classes.

When to use softmax:

Multi-class classification output layers — always, without exception
Any case where you need mutually exclusive probabilities that sum to 1

Critical constraint: Softmax belongs only in the output layer. Using it in hidden layers creates forced competition between neurons that disrupts learning. When comparing different types of neural networks, you’ll find the output activation is almost always either softmax (multi-class) or sigmoid (binary).

One implementation detail: most frameworks including PyTorch apply softmax inside the loss function (CrossEntropyLoss expects raw logits). Applying softmax manually before passing to CrossEntropyLoss applies it twice, producing incorrect training loss.

Leaky ReLU, ELU, and Modern Variants

Several variants address ReLU’s dying neuron problem while preserving its speed advantages:

Leaky ReLU: f(x) = x if x > 0, else 0.01x — a small negative slope prevents neurons from permanently deactivating. The 0.01 slope (alpha) is a hyperparameter.

ELU (Exponential Linear Unit): For negative inputs, ELU uses an exponential: f(x) = α(eˣ - 1). ELU produces negative outputs, which pushes mean activations toward zero and can accelerate learning compared to ReLU.

PReLU (Parametric ReLU): Like Leaky ReLU, but the negative slope is learned during training. Introduced by He et al. (2015) at Microsoft Research, PReLU achieved state-of-the-art results on ImageNet that year.

Pro tip: If neurons in your network stop activating (output always zero during training), switching from ReLU to Leaky ReLU is a one-line change in any framework and resolves most dying ReLU issues without other tradeoffs.

How to Choose the Right Activation Function

Choosing the correct activation follows three rules: output layer choice is determined entirely by task type, hidden layer choice defaults to ReLU or its variants, and specialized architectures have established conventions. When training machine learning models, activation selection is one of the first architectural decisions that affects convergence speed and final accuracy.

Output Layer: Non-Negotiable Task-Based Rules

Output layer activation is not a hyperparameter to experiment with — it is determined by what you’re predicting:

Task	Output Activation	Reason
Binary classification (yes/no)	Sigmoid	Single probability output 0-1
Multi-class classification (pick one)	Softmax	Probability distribution summing to 1
Multi-label classification (pick many)	Sigmoid (per output)	Independent probabilities per label
Regression (single continuous value)	None (linear)	Unbounded output needed
Multi-output regression	None (linear)	Unbounded outputs across multiple targets

Getting this wrong is the most common architectural mistake. A regression model with sigmoid output silently caps all predictions between 0 and 1, failing on any target outside that range with no obvious error message.

Hidden Layers: A Decision Tree

For hidden layers, follow this decision sequence:

Start with ReLU — works in most feedforward, convolutional, and attention-based architectures
If you see dying neurons (output stuck at zero for many inputs) — switch to Leaky ReLU
If your network exceeds 50 layers — consider ELU or add batch normalization before adjusting activations
For RNNs and LSTMs — use tanh in gating mechanisms (this is the default in most framework implementations)
For transformers and attention-based models — use GELU (see Section 4)

Comprehensive Comparison Table

Activation	Output Range	Primary Use	Gradient Issue	Computation Cost
ReLU	[0, ∞)	Hidden layers (default)	Dying ReLU possible	Very fast
Leaky ReLU	(−∞, ∞)	Deep hidden layers	None	Fast
ELU	(−α, ∞)	Deep hidden layers	None	Moderate
Sigmoid	(0, 1)	Binary output only	Vanishing (deep)	Moderate
Tanh	(−1, 1)	RNN hidden states	Vanishing (very deep)	Moderate
Softmax	(0, 1) per class	Multi-class output only	None	Moderate
GELU	(−0.17, ∞)	Transformer hidden layers	None	Moderate
SwiGLU	(−∞, ∞)	Large LLM hidden layers	None	Higher

Ready to implement AI in your business? GrowthGear’s team has helped 50+ startups integrate AI solutions that drive real results. Book a Free Strategy Session to discuss your AI roadmap.

Activation Functions in Modern AI Architectures

Modern large language models and foundation models have moved well beyond ReLU toward smoother, more expressive activation functions. Understanding what transformers are and how they work explains why these newer activations were adopted — transformer architectures are far deeper than the CNNs that made ReLU famous, and training dynamics at that scale expose ReLU’s limitations.

GELU: The Transformer Standard

GELU (Gaussian Error Linear Unit) was introduced by Hendrycks and Gimpel (2016) and became the activation standard for transformer models. Unlike ReLU’s hard threshold at zero, GELU smoothly gates inputs based on their probability under a Gaussian distribution:

f(x) ≈ 0.5x × (1 + tanh(√(2/π) × (x + 0.044715x³)))

Why transformers use GELU over ReLU:

Smoother gradient flow — no hard zero boundary means better signal propagation in very deep networks
Stochastic regularization properties that reduce overfitting during training
Consistent performance improvement on language modeling tasks

BERT (Devlin et al., 2018) adopted GELU and reported it outperformed ReLU for transformer pre-training. GPT-2, GPT-3, and most subsequent large language models followed. Google’s T5 and Vision Transformers (ViT) use the same pattern.

SwiGLU: The Architecture for Large-Scale LLMs

SwiGLU combines Swish activation with a Gated Linear Unit mechanism. Introduced by Noam Shazeer at Google (2020), SwiGLU has been adopted by LLaMA, PaLM, Mistral, and other leading open-weight models:

SwiGLU(x) = Swish(Wx + b) ⊗ (Vx + c)

The gating mechanism allows the model to selectively pass or suppress information at each position. Empirical results across models with 7B+ parameters consistently show SwiGLU outperforms GELU for very large architectures. Meta’s LLaMA 2 paper (Touvron et al., 2023) reports SwiGLU as a key architectural component, replacing the FFN activation used in the original transformer.

CNNs: ReLU Still Dominates

Convolutional neural networks for image tasks — from AlexNet (2012) through modern EfficientNet and ConvNeXt — continue using ReLU or its variants in convolutional layers. The sparse activation properties of ReLU align naturally with local feature detection, where most of a feature map should be inactive at any position.

For AI tools used in visual applications and content marketing, the underlying vision models almost universally use ReLU in convolutional blocks, switching to softmax only at the final classification head.

RNNs and LSTMs: Tanh Persists

Recurrent networks maintain tanh as the standard activation for hidden state updates — this is baked into vanilla RNN, GRU, and LSTM implementations across all major frameworks. LSTM cells specifically use both tanh (cell state update) and sigmoid (input, forget, and output gates) together.

The bounded output range of tanh (−1 to 1) controls the magnitude of hidden states across time steps, which prevents the exploding gradients that would otherwise derail training on long sequences.

Common Mistakes and How to Avoid Them

Most activation function errors fall into four predictable patterns: using the wrong output activation, choosing sigmoid or tanh for deep hidden layers, mismatching activation type and weight initialization, and ignoring dying ReLU. Catching these mistakes before training saves hours of wasted compute and prevents models that fail silently without obvious error messages. These issues are especially costly when building AI-driven business systems at production scale.

The Dying ReLU Problem

When neurons receive consistently large negative inputs — caused by poor weight initialization or too-large learning rates — they output zero permanently. Because the gradient of ReLU for zero-output neurons is also zero, those neurons receive no weight update and never recover. A poorly initialized network can lose 20–40% of its neurons to this problem before training is even halfway through.

How to detect it: During training, monitor the fraction of zero-valued outputs per layer. If any hidden layer shows more than 30% zero activations consistently throughout training (not just early on), dying ReLU is likely the cause.

Fixes in priority order:

Switch to Leaky ReLU — one-line change, typically resolves the issue
Use He (Kaiming) initialization — the theoretically correct initialization for ReLU, available as torch.nn.init.kaiming_normal_() in PyTorch
Reduce initial learning rate — prevents large negative weight values during early training
Add batch normalization before activations — normalizes inputs to a sensible range, reducing the chance of extreme negative activations

Vanishing Gradients with Sigmoid and Tanh

For networks deeper than 10–15 layers, sigmoid and tanh in hidden layers cause gradients to shrink exponentially as they propagate backward through layers. Early layers receive near-zero gradient updates and fail to learn useful representations — the network appears to train (loss decreases) but only the final layers are actually learning.

This was the core barrier to deep learning until Hinton et al. (2006) showed that greedy layer-wise pretraining could initialize deep networks, and until the more general solution — ReLU combined with careful initialization — became standard practice.

Diagnostic: Plot the gradient magnitude per layer during training. With sigmoid/tanh, you’ll see gradients decreasing by orders of magnitude from the output layer toward the input layer.

Wrong Activation at the Output Layer

Silent failures from incorrect output activations are common even among experienced practitioners:

Sigmoid on a regression output — silently clamps all predictions to (0, 1), failing on any target above 1.0
Softmax on a binary output — creates unnecessary competition; use sigmoid instead for binary outputs
No activation on a classification output — produces raw logits instead of probabilities; valid if your loss function handles logits, but misleading if you interpret raw outputs as probabilities

The most insidious version: applying softmax explicitly before CrossEntropyLoss in PyTorch. Since CrossEntropyLoss already applies softmax internally, this double-applies it and produces consistently wrong gradients with no error message.

Mismatch Between Activation and Weight Initialization

Different activation functions have theoretically derived optimal initialization strategies. According to PyTorch’s official documentation:

ReLU and Leaky ReLU: He (Kaiming) initialization — accounts for the zero half of ReLU’s output range
Tanh and Sigmoid: Xavier (Glorot) initialization — assumes symmetric outputs centered at zero
GELU and SwiGLU: Modified Xavier or scaled initialization (varies by implementation)

Using Xavier initialization with ReLU leads to weights that are too small for effective learning. Using He initialization with sigmoid leads to weights that cause saturation immediately. This mismatch consistently produces slower convergence and often worse final performance — yet it’s invisible unless you know to look for it.

Take the Next Step

Understanding activation functions is foundational to building effective neural networks for business AI. Whether you’re designing a classification model for customer segmentation, a regression model for demand forecasting, or fine-tuning a transformer for document processing, the activation function at each layer determines whether your model converges — and how well it ultimately performs.

GrowthGear has guided 50+ startups through AI architecture decisions exactly like these. If you’re building a neural network and want expert input before investing months in training runs, we can help you avoid the most expensive mistakes upfront.

Book a Free Strategy Session →

Activation Functions Quick Reference

Activation	Layer Type	Task	Framework Default	Key Modern Usage
ReLU	Hidden	General	PyTorch, TensorFlow	CNNs, MLPs, ResNets
Leaky ReLU	Hidden	Deep nets, GANs	Optional	GANs, deep MLPs
ELU	Hidden	Deep nets	Optional	Research models
Sigmoid	Output	Binary classification	Optional	Logistic output layer
Tanh	Hidden/Output	RNN gates	LSTM default	RNNs, GRUs, LSTMs
Softmax	Output	Multi-class classification	Optional	All classification heads
GELU	Hidden	Transformers	BERT, GPT default	All transformer models
SwiGLU	Hidden	Large-scale LLMs	LLaMA, PaLM default	7B+ parameter LLMs

Sources & References

Nair, V. & Hinton, G. — “Rectified Linear Units Improve Restricted Boltzmann Machines” — Original ReLU deep learning paper showing improvement over sigmoid (2010)
Devlin, J. et al. — “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” — Introduced GELU as the standard transformer activation function (2018)
He, K. et al. — “Delving Deep into Rectifiers” (Microsoft Research) — PReLU and He initialization for ReLU networks, ImageNet 2015 winner (2015)
TensorFlow Keras Activation Functions Documentation — Official guidance on activation function selection and usage
Touvron, H. et al. — “LLaMA 2: Open Foundation and Fine-Tuned Chat Models” (Meta AI) — Documents SwiGLU adoption in LLaMA 2 architecture (2023)