Gradient Descent: How AI Learns

You are blindfolded in a hilly landscape.

Your only goal is to reach the lowest point in the valley.

You cannot see the whole landscape. You cannot see where the valley is. You cannot teleport there.

What do you do?

You feel the ground under your feet. You figure out which direction slopes downward right where you are standing. You take one step in that direction. Then you feel again. Step again. Feel again. Step again.

Eventually you reach the bottom.

That is gradient descent. Exactly that. Nothing more.

The Setup

A model has weights. The weights control its predictions. Bad weights produce bad predictions. Good weights produce good predictions.

There is a loss function that measures how bad the predictions are. High loss means the model is wrong. Zero loss means the model is perfect.

The loss landscape is that hilly terrain. Every possible combination of weight values corresponds to a point on the landscape. High points are bad. The valley floor is what you want.

Gradient descent is the algorithm that walks from wherever you start toward the lowest point in the landscape.
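To make "weights control predictions, the loss measures how bad they are" concrete, here is a minimal sketch. The three data points and the one-weight model prediction = w * x are made up for illustration, and mean squared error is just one common choice of loss.

```python
import numpy as np

# Hypothetical toy data: y is exactly 2 * x, so the best weight is 2.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def mse_loss(w):
    """Mean squared error of the one-weight model prediction = w * x."""
    predictions = w * x
    return np.mean((predictions - y) ** 2)

# The loss is zero at w = 2 and grows as w moves away in either direction.
for w in [0.0, 1.0, 2.0, 3.0]:
    print(f"w={w}: loss={mse_loss(w):.4f}")
```

The next examples use a hand-written loss curve instead of data, so the minimum is easy to see.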

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def loss(w):
    return (w - 4) ** 2 + 2

w_values = np.linspace(-2, 10, 200)
loss_values = [loss(w) for w in w_values]

plt.figure(figsize=(8, 4))
plt.plot(w_values, loss_values, 'b-', linewidth=2)
plt.xlabel('Weight (w)')
plt.ylabel('Loss')
plt.title('The loss landscape for one weight')
plt.grid(True, alpha=0.3)
plt.savefig('loss_landscape.png', dpi=100, bbox_inches='tight')
plt.close()
print("Loss landscape saved")

One weight. One loss curve. The minimum is at w=4 where loss equals 2. Your job is to find that minimum without being told where it is.

The Algorithm, Step by Step

def loss(w):
    return (w - 4) ** 2 + 2

def gradient(w):
    return 2 * (w - 4)

w = -1.0
learning_rate = 0.1

print(f"{'Step':<6} {'Weight':<10} {'Loss':<10} {'Gradient':<10}")
print("-" * 40)

for step in range(20):
    current_loss = loss(w)
    grad = gradient(w)

    # Report weight, loss, and gradient at the same point, before updating
    if step % 4 == 0:
        print(f"{step:<6} {w:<10.4f} {current_loss:<10.4f} {grad:<10.4f}")

    # Step against the gradient
    w = w - learning_rate * grad

print(f"\nFinal weight: {w:.4f}")
print(f"Final loss:   {loss(w):.4f}")

Output:

Step   Weight     Loss       Gradient
----------------------------------------
0      -1.0000    27.0000    -10.0000
4      1.9520     6.1943     -4.0960
8      3.1611     2.7037     -1.6777
12     3.6564     2.1181     -0.6872
16     3.8593     2.0198     -0.2815

Final weight: 3.9424
Final loss:   2.0033

Started at -1.0. Ended at 3.9424, closing in on 4.0, where the loss bottoms out at 2. Never told where the minimum was. Just followed the slope downward twenty times. The steps shrink along the way because the gradient shrinks as the curve flattens near the minimum.

The gradient at each step tells you the slope: its sign says which way is uphill and its size says how steep. You subtract the gradient, scaled by the learning rate, from the weight because you want to go downhill, opposite to the upward slope.
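One update can be checked by hand. The sketch below reuses the same loss, (w - 4) ** 2 + 2, and takes a single step from each side of the minimum. The starting points -1.0 and 9.0 and the helper name step are arbitrary choices for illustration.

```python
def gradient(w):
    # Derivative of (w - 4) ** 2 + 2 with respect to w
    return 2 * (w - 4)

def step(w, learning_rate=0.1):
    # Move against the gradient: a negative slope pushes w up,
    # a positive slope pushes w down
    return w - learning_rate * gradient(w)

left = step(-1.0)   # gradient is -10, so w increases
right = step(9.0)   # gradient is +10, so w decreases
print(left, right)
```

Both steps land closer to 4 than where they started. The sign of the gradient picks the direction; its size and the learning rate set the distance.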

Learning Rate: The Step Size

The learning rate controls how big each step is. Too small and you reach the minimum eventually but it takes forever. Too large and you bounce around past the minimum and never settle.

def loss(w):
    return (w - 4) ** 2 + 2

def gradient(w):
    return 2 * (w - 4)

learning_rates = [0.01, 0.1, 0.9]
starting_w = -1.0

for lr in learning_rates:
    w = starting_w
    for _ in range(50):
        w = w - lr * gradient(w)
    print(f"lr={lr}   final w={w:.4f}   final loss={loss(w):.6f}")

Output:

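lr=0.01   final w=2.1792   final loss=5.315489
lr=0.1   final w=3.9999   final loss=2.000000
lr=0.9   final w=3.9999   final loss=2.000000

lr=0.01: too slow. After 50 steps it is still far from the minimum.
lr=0.1: just right. It settles at the minimum, where the loss is 2.
lr=0.9: overshoots on every step, landing on alternating sides of the minimum. On this simple curve each jump is 0.8 times the size of the last, so it still gets there. Push the rate above 1.0 and each jump grows instead, and the weight runs away from the minimum.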

Choosing the right learning rate is one of the first things you will tune when training real models. Common starting values: 0.001, 0.01, 0.1. Start in that range and adjust based on what you see.
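For this particular loss, one update turns the distance w - 4 into (1 - 2 * lr) * (w - 4), so any rate above 1.0 makes every step land farther from the minimum than the last. A minimal sketch of that failure, with lr=1.1 chosen purely for illustration:

```python
def gradient(w):
    return 2 * (w - 4)

w = -1.0
lr = 1.1  # hypothetical rate, deliberately too large for this loss
distances = []

for _ in range(10):
    w = w - lr * gradient(w)
    distances.append(abs(w - 4))  # how far we are from the minimum at w=4

for i, d in enumerate(distances, start=1):
    print(f"step {i:2d}: distance from minimum = {d:.2f}")
```

The distance grows by a factor of 1.2 per step. On real models the threshold is not known in advance, which is why the rate has to be tuned by watching the loss.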

Multiple Weights

One weight is a 1D landscape. Two weights is a 2D surface, like a bowl. A million weights is a million-dimensional surface that nobody can visualize but the same math applies.

def loss(w1, w2):
    return (w1 - 3) ** 2 + (w2 + 1) ** 2

def gradient_w1(w1, w2):
    return 2 * (w1 - 3)

def gradient_w2(w1, w2):
    return 2 * (w2 + 1)

w1, w2 = 8.0, 6.0
lr = 0.1

print(f"Target: w1=3.0, w2=-1.0")
print(f"Start:  w1={w1}, w2={w2}, loss={loss(w1, w2):.2f}\n")

for step in range(30):
    g1 = gradient_w1(w1, w2)
    g2 = gradient_w2(w1, w2)
    w1 = w1 - lr * g1
    w2 = w2 - lr * g2

    if step % 9 == 0:
        print(f"Step {step+1:2d}: w1={w1:.4f}, w2={w2:.4f}, loss={loss(w1, w2):.4f}")

print(f"\nFinal: w1={w1:.4f}, w2={w2:.4f}")

Output:

Target: w1=3.0, w2=-1.0
Start:  w1=8.0, w2=6.0, loss=74.00

Step  1: w1=7.0000, w2=4.6000, loss=47.3600
Step 10: w1=3.5369, w2=-0.2484, loss=0.8532
Step 19: w1=3.0721, w2=-0.8991, loss=0.0154
Step 28: w1=3.0097, w2=-0.9865, loss=0.0003

Final: w1=3.0062, w2=-0.9913

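Both weights closed in on their targets at the same time. In this loss the two weights do not interact, so each one simply follows its own gradient. In a real network every gradient depends on the other weights as well, but the update rule is identical. Scale it to ten million weights and you have the core of how a neural network is trained.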
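Writing one gradient function per weight stops being practical very quickly. Here is a sketch of the same two-weight problem with the weights stored in a NumPy array and a single function returning the whole gradient vector. The names targets and gradient are illustrative.

```python
import numpy as np

targets = np.array([3.0, -1.0])   # location of the minimum, as in the example above

def loss(w):
    return np.sum((w - targets) ** 2)

def gradient(w):
    # One partial derivative per weight, returned together as a vector
    return 2 * (w - targets)

w = np.array([8.0, 6.0])
lr = 0.1

for _ in range(30):
    w = w - lr * gradient(w)   # updates every weight in one line

print(w, loss(w))
```

The loop body does not change whether w holds two values or two million, which is the sense in which "the same math applies".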

Three Variants You Will See

Batch gradient descent. Compute the gradient using your entire dataset. Accurate but slow for large datasets. Each step takes forever when you have a million samples.

Stochastic gradient descent (SGD). Compute the gradient using one random sample at a time. Fast but noisy. The path to the minimum is jagged and erratic.

Mini-batch gradient descent. The one everyone actually uses. Take a small random batch of samples, typically 32 to 256. Compute gradient on that batch. Update weights. Move to next batch. Fast enough to be practical. Stable enough to converge.

import numpy as np

def train_with_minibatch(X, y, learning_rate=0.01, batch_size=32, epochs=5):
    n_samples = len(X)
    w = 0.0

    for epoch in range(epochs):
        # Shuffle once per epoch so every batch is a fresh random subset
        indices = np.random.permutation(n_samples)
        epoch_loss = 0.0
        n_batches = 0

        for start in range(0, n_samples, batch_size):
            batch_idx = indices[start:start + batch_size]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]

            predictions = w * X_batch
            batch_loss = np.mean((predictions - y_batch) ** 2)
            # Derivative of the batch's mean squared error with respect to w
            gradient = np.mean(2 * X_batch * (predictions - y_batch))

            w = w - learning_rate * gradient
            epoch_loss += batch_loss
            n_batches += 1

        # Average over the batches actually processed (the last one may be smaller)
        print(f"Epoch {epoch+1}: w={w:.4f}, loss={epoch_loss / n_batches:.4f}")

    return w

np.random.seed(42)
X = np.random.randn(500)
y = 2.5 * X + np.random.randn(500) * 0.1

final_w = train_with_minibatch(X, y, learning_rate=0.1, batch_size=32, epochs=5)
print(f"\nLearned weight: {final_w:.4f}")
print("True weight: 2.5")

The exact digits depend on the random shuffle. With a learning rate of 0.1 you should see w climb most of the way to 2.5 during the first epoch and then hover close to it, while the average batch loss falls toward roughly 0.01, the variance of the noise added to y. At the function's default rate of 0.01, five epochs are not enough: w only gets to roughly 2.

The model recovered a weight close to the true 2.5 from 500 noisy data points using mini-batch gradient descent. It was never told the true value. It just followed the gradients.
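The three variants differ only in how many samples feed each gradient estimate. The sketch below measures how much that estimate jumps around at a fixed weight. The data mirrors the example above, while the helper gradient_estimate, the fixed weight w = 0.0, and the 1,000 repetitions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(500)
y = 2.5 * X + rng.standard_normal(500) * 0.1

def gradient_estimate(w, batch_size):
    # Gradient of mean squared error for prediction = w * x on a random batch
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return np.mean(2 * X[idx] * (w * X[idx] - y[idx]))

w = 0.0
spread = {}
for name, batch_size in [("SGD", 1), ("mini-batch", 32), ("batch", 500)]:
    estimates = [gradient_estimate(w, batch_size) for _ in range(1000)]
    spread[name] = np.std(estimates)
    print(f"{name:<10} batch_size={batch_size:<4} gradient std={spread[name]:.4f}")
```

Single-sample estimates scatter the most, the full batch gives the same answer every time, and the mini-batch sits in between. That spread is the "noise" in the descriptions above.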

What Can Go Wrong

Getting stuck in a local minimum. The landscape can have multiple valleys. Gradient descent settles into whichever valley lies downhill from its starting point, which might not be the deepest one. Modern neural networks cope with this through random initialization and the structure of their loss landscapes, which tend to have many local minima that are nearly as good as the global one.

Vanishing gradients. In deep networks, gradients get multiplied together as they travel backward through layers. Small numbers multiplied many times become incredibly small numbers. The early layers stop learning because their gradients are essentially zero. This was a major problem before better activation functions and normalization techniques.
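A rough sketch of why this happens with sigmoid activations: the derivative of a sigmoid is never larger than 0.25, so a gradient passing back through n sigmoid layers picks up a factor of at most 0.25 to the power n from the activations. This ignores the weight matrices, which also multiply in, so it illustrates the shrinking from the activations alone, not a full network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z == 0

best_case = sigmoid_derivative(0.0)
for n_layers in [1, 5, 10, 20]:
    factor = best_case ** n_layers
    print(f"{n_layers:2d} layers: activation factor <= {factor:.2e}")
```

Even in the best case the factor after twenty layers is below one in a trillion, which is why the earliest layers barely move.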

Exploding gradients. The opposite. Gradients grow exponentially and become massive. Weights update by enormous amounts. The model goes haywire. The usual fix is gradient clipping, which caps any gradient that exceeds a threshold.
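A minimal sketch of clipping by norm, one common form of gradient clipping: if the gradient vector is longer than a threshold, it is rescaled to that length and keeps its direction. The function name clip_by_norm and the threshold of 1.0 are illustrative and not taken from any particular library.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploding = np.array([300.0, -400.0])   # norm 500, far above the threshold
small = np.array([0.3, -0.4])           # norm 0.5, left untouched

print(clip_by_norm(exploding))
print(clip_by_norm(small))
```

The clipped gradient still points the same way, so the update goes in the right direction without the enormous jump.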

Wrong learning rate. Too high and it oscillates or diverges. Too low and it never gets there. Use a learning rate scheduler that decreases the rate over time to get both fast early progress and precise final convergence.
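One simple schedule is exponential decay, where the rate is multiplied by a fixed factor after every step. The sketch below applies it to the one-weight loss from earlier. The starting rate 0.4 and decay factor 0.95 are arbitrary illustrative values, and real frameworks offer many other schedules.

```python
def gradient(w):
    return 2 * (w - 4)

w = -1.0
lr = 0.4        # hypothetical starting rate: large steps early on
decay = 0.95    # hypothetical decay factor applied after each step

for step in range(50):
    w = w - lr * gradient(w)
    lr = lr * decay   # smaller steps as training goes on

print(f"final w={w:.4f}, final lr={lr:.4f}")
```

On a curve this simple a fixed rate works just as well. The schedule earns its keep on noisy mini-batch gradients, where smaller late steps stop the weight from jittering around the minimum.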

Try This

Create gradient_descent_practice.py.

Part one: implement gradient descent on this loss function. Find the weight that minimizes it.

def loss(w):
    return w**4 - 8*w**2 + w + 10

Start at w = 3.0. Use learning rate 0.01. Run 200 steps. Print weight and loss every 50 steps. Compute the numerical gradient with h = 0.0001 instead of the analytical one. Watch where it converges.

Part two: try three different learning rates: 0.001, 0.01, 0.1. Run each for 100 steps from the same starting point. Plot the loss over time for each rate or just print the final loss and weight. Which converges fastest? Which overshoots?

Part three: this is the one that matters. Implement linear regression from scratch using gradient descent. Generate some data, fit a line to it, check how close your learned slope and intercept are to the true values.

np.random.seed(0)
X = np.random.randn(200)
y = 3 * X + 7 + np.random.randn(200) * 0.5
# true slope = 3, true intercept = 7

Initialize w=0 (slope) and b=0 (intercept). Write the loss and gradients for both. Run gradient descent for 500 steps. Print learned values.
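For part one, the numerical gradient can be sketched with a central difference: nudge w by h in both directions and divide the change in loss by 2h. The helper name numerical_gradient is illustrative, and the check below uses the simple loss from the earlier sections, not the exercise's loss, so the exercise is left for you.

```python
def numerical_gradient(f, w, h=0.0001):
    """Central-difference estimate of the slope of f at w."""
    return (f(w + h) - f(w - h)) / (2 * h)

def loss(w):
    return (w - 4) ** 2 + 2

def analytic_gradient(w):
    return 2 * (w - 4)

# The two columns should agree to several decimal places
for w in [-1.0, 4.0, 7.5]:
    approx = numerical_gradient(loss, w)
    exact = analytic_gradient(w)
    print(f"w={w}: numerical={approx:.6f}, analytical={exact:.6f}")
```

Because numerical_gradient takes the loss as an argument, the same helper works unchanged on any one-weight loss function.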

What's Next

Gradient descent is how models learn. But before any learning can happen, you need to understand your data. What does it look like? How spread out is it? What is normal and what is an outlier?

That requires statistics. Not heavy statistics. Just mean, variance, and standard deviation. Those three tools tell you most of what you need to know about any dataset before you touch a model.
