Skip to main content

Gradient Descent Learning Rate and Iterations

· 3 min read
Ross Bulat
Full Stack Engineer

In this short exercise, I explored how gradient descent reduces a cost function by repeatedly updating model parameters.

Google Colab notebook: Open in Google Colab

The tutorial uses a simple linear regression example where the input values are:

x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]

This data follows the relationship:

y = 2x + 3

The goal of gradient descent is to learn values for the slope m and intercept b that minimise the cost. In this case, the cost function is mean squared error, which measures how far the model’s predictions are from the true values.

Experiment Setup

I changed two values in the tutorial:

  • iterations: how many times gradient descent updates the parameters
  • learning_rate: how large each update step is

The original tutorial uses:

iterations = 100
learning_rate = 0.08

I tested several combinations and recorded the final cost.

Results

ExperimentIterationsLearning RateFinal CostFinal mFinal b
Fewer iterations100.0812.04681.63091.0383
Default tutorial1000.080.00412.04052.8536
More iterations10000.08~0.00002.00003.0000
Small learning rate1000.010.48042.44851.3809
Moderate learning rate1000.050.03212.11442.5870
Too high learning rate1000.10DivergedVery large negative valueVery large negative value

Observations

Increasing the number of iterations helped the model get closer to the correct line. With only 10 iterations, the cost dropped from the initial value of 89 to about 12.05, but the model was still far from the ideal values of m = 2 and b = 3.

With 100 iterations and a learning rate of 0.08, the cost decreased significantly to around 0.0041. The learned parameters were close to the expected values, with m ≈ 2.04 and b ≈ 2.85.

When I increased the number of iterations to 1000 while keeping the learning rate at 0.08, gradient descent reached the expected result almost exactly. The cost became effectively zero, and the model learned m = 2 and b = 3.

The learning rate had a major effect on the result. A smaller learning rate such as 0.01 still reduced the cost, but convergence was slower. A moderate learning rate such as 0.05 performed better, but 0.08 gave the best result among the stable values I tested.

However, increasing the learning rate too much caused the algorithm to diverge. With learning_rate = 0.10, the cost became extremely large instead of decreasing. This shows that gradient descent can overshoot the minimum when the step size is too large.

Conclusion

This exercise shows that gradient descent depends heavily on both the number of iterations and the learning rate. More iterations give the algorithm more chances to reduce the cost, but only if the learning rate is appropriate.

A learning rate that is too small makes training slow, while a learning rate that is too large can make the model unstable and cause the cost to increase. In this example, learning_rate = 0.08 with enough iterations gave the best balance between speed and stability.