Stochastic Gradient Descent for Coin Flips

Evan Zamir
3 min read · Dec 12, 2019


Here I’m going to show you in explicit detail how to perform stochastic gradient descent (SGD) for a Bernoulli simulation (eg coin flips). This is arguably the simplest non-trivial problem which fully demonstrates SGD. The likelihood function for a single Bernoulli trial is defined as follows:

L(θ; x) = θ^x · (1 − θ)^(1 − x),  x ∈ {0, 1}

The likelihood function for a single Bernoulli trial (x = 1 for heads, x = 0 for tails)
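In code, the single-trial likelihood is a one-liner (a minimal sketch; the function name is my own, not from the post):

```python
def bernoulli_likelihood(theta, x):
    """Likelihood of one Bernoulli trial: theta if heads (x=1), 1-theta if tails (x=0)."""
    return theta**x * (1 - theta)**(1 - x)

print(bernoulli_likelihood(0.5, 1))   # 0.5  (fair coin, heads)
print(bernoulli_likelihood(0.75, 0))  # 0.25 (biased coin, tails)
```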

The first thing we’re going to do is simulate a coin-flipping experiment. Let’s do it in Python. Here’s the code to generate 1000 random coin flips:

import numpy as np
import matplotlib.pyplot as plt

theta = 0.5       # a "fair" coin
num_flips = 1000  # simulate 1000 coin flips

# Python list comprehension: each flip is True (heads) with probability theta
flips = [np.random.random() < theta for _ in range(num_flips)]
print(f'Total number of heads = {sum(flips)}')
> Total number of heads = 499

Unsurprisingly, even though the coin is perfectly fair, it comes up heads (very) slightly less than 50% of the time simply due to chance. We assume our coin flips are independent and identically distributed (i.i.d.), and derive the total log-likelihood:

ℓ(θ) = log ∏ᵢ θ^(xᵢ) (1 − θ)^(1 − xᵢ) = Σᵢ [xᵢ log θ + (1 − xᵢ) log(1 − θ)] = H log θ + T log(1 − θ)

Derivation of the total log-likelihood for a series of (independent) coin flips
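A quick numerical check that the per-trial sum collapses to the H and T form (a sketch; the fixed seed and the array-based flips are my own choices, not from the post):

```python
import numpy as np

np.random.seed(0)  # fixed seed, chosen only for reproducibility
theta = 0.5
flips = np.random.random(1000) < theta  # boolean array: True = heads

H = flips.sum()
T = len(flips) - H

# Per-trial sum: sum_i [x_i*log(theta) + (1 - x_i)*log(1 - theta)]
per_trial = np.sum(flips * np.log(theta) + (~flips) * np.log(1 - theta))

# Collapsed form: H*log(theta) + T*log(1 - theta)
collapsed = H * np.log(theta) + T * np.log(1 - theta)

print(np.isclose(per_trial, collapsed))  # True
```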

In the above equations, H and T are, respectively, the total numbers of heads and tails in the experiment. The final step in formulating the SGD update is to calculate the derivative of the log-likelihood with respect to the parameter θ:

(1/(H + T)) · dℓ/dθ = (1/(H + T)) · (H/θ − T/(1 − θ))

In the above equation, we have divided by the total number of coin flips so that we can work with average values that are independent of the mini-batch size in our optimization step. The SGD update is now simply:

θ ← θ − α · g,  where g = −(1/n) · (h/θ − t/(1 − θ))

is the gradient of the average negative log-likelihood for a mini-batch with h heads and t tails (n = h + t).

Here, 𝛼 is the learning rate. With all that setup, here is the Python code to implement the SGD step:

dtheta = lambda theta, h, t: -(1/(h+t)) * (h/theta - t/(1-theta))
theta = 0.4
alpha = 0.001
num_epochs = 500
batch_size = 100
thetas = [theta]
for epoch in range(num_epochs):
    rflips = np.random.permutation(flips)
    num_batches = len(flips) // batch_size
    for i in range(num_batches):
        batch = rflips[i*batch_size:(i+1)*batch_size]
        h = sum(batch)
        theta -= alpha * (1 - epoch/num_epochs) * dtheta(theta, h, batch_size - h)
        thetas.append(theta)
plt.plot(thetas)
print(f'Final theta = {np.average(thetas[-batch_size*10:])}')
> Final theta ≈ 0.499 (the exact value varies with the random seed)

Note that I added the factor (1 − epoch/num_epochs), which decays the learning rate over the course of training and seems to help with convergence. I suggest playing around with the other hyperparameters as well: alpha, batch_size, and num_epochs. At any rate (no pun), here’s what the evolution of 𝜃 looks like:

Evolution of theta to convergence.

Here’s another convergence plot, this time starting from θ = 0.9:

Starting at theta=0.9

That’s all for now. I hope this simple example of SGD helped you in your own learning process. Keep in mind this problem can be solved in closed form by deriving the maximum likelihood estimate: simply set the derivative of the log-likelihood to zero and solve for 𝜃. You should get H/(H + T)!
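As a sanity check on that closed-form result, here is a short sketch (re-simulating the flips with a seed of my own choosing) that computes θ̂ = H/(H + T) directly:

```python
import numpy as np

np.random.seed(42)  # seed chosen only for reproducibility
flips = [np.random.random() < 0.5 for _ in range(1000)]

H = sum(flips)
T = len(flips) - H
theta_mle = H / (H + T)  # the closed-form MLE: simply the fraction of heads
print(f'MLE estimate = {theta_mle}')
```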
