Counting Number of Parameters in GPT2

Attention
Transformers
GPT2
Math
Author

Haad Khan

Published

May 29, 2024

While going through Let’s reproduce GPT-2 (124M), I fell down the rabbit hole of how the 124M parameters are distributed across the decoder of the transformer.

!pip install transformers torch
from transformers import GPT2Model
model = GPT2Model.from_pretrained('openai-community/gpt2')

The model’s string representation contains the following information about the architecture.

print(model)
GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2SdpaAttention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

Let’s write a utility to sum all the parameters of the model. I am using the numerize library to print the counts in a more eye-friendly format.

from numerize import numerize as n
total_params = sum(p.numel() for p in model.parameters())
print("number of params in model: ", n.numerize(total_params))
number of params in model: 124.44M

This is the total number of parameters. To understand how these parameters are distributed, let’s lay out all the pieces of the transformer and work out the number of parameters for each piece. At the end, as a sanity check, we can add up the parameters across all components and make sure we arrive at the number above.

It’s helpful to define constants up front so that later we can just plug in the values (see the code snippet after this list).

Vocabulary size \((V)\): 50,257

Embedding dimension \((D)\): 768

Number of attention heads \((H)\): 12

Context Length \((C)\): 1,024

Number of layers \((L)\): 12
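
In code, using the same symbols as variable names (my own naming choice, not taken from the model config):

V = 50257  # vocabulary size
D = 768    # embedding dimension
H = 12     # number of attention heads
C = 1024   # context length
L = 12     # number of layers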

Parameter Breakdown

The first two layers are the embedding layers wte and wpe.

Embedding Layer (wte)

The model starts with an embedding layer that converts input tokens (words) into continuous representations.

If the vocabulary size is V and the embedding dimension is D, this layer has V×D parameters.

No. of param = V x D = 50,257 x 768 = 38,597,376

Embedding Layer (wpe)

Recall the notion of positional embeddings. Since the transformer by itself doesn’t encode position, this layer embeds the position of each input token.

If the context size is C and the embedding dimension is D, this layer has C×D parameters.

No. of param = C x D = 1,024 x 768 = 786,432
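
With the constants defined above, the same two numbers fall out directly:

print("wte params: ", n.numerize(V * D))
print("wpe params: ", n.numerize(C * D))
wte params:  38.6M
wpe params:  786.43K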

To verify the calculation against the model itself, the function used to count the entire model can be restricted to the embedding modules as follows.

total_params = sum(p.numel() for p in model._modules['wte'].parameters())
print("number of params in model embedding wte: ", n.numerize(total_params))
total_params = sum(p.numel() for p in model._modules['wpe'].parameters())
print("number of params in model embedding wpe: ", n.numerize(total_params))
number of params in model embedding wte:  38.6M
number of params in model embedding wpe:  786.43K

Transformer layers

We have achieved the conversion of text tokens into embeddings. The next step is to apply the self-attention operation. GPT2Model defines 12 transformer blocks, each starting with a layer norm.

Layer Norm (ln_1)

Each transformer block starts with a layer normalization step that has parameters for scaling and shifting the normalized output, adding 2D parameters per block.

No. of param = 2xD = 2x768 = 1,536
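
Following the same pattern as the embedding check, we can confirm this against the first block (ln_1 is the attribute name shown in the architecture printout above):

ln_1_params = sum(p.numel() for p in model._modules['h'][0].ln_1.parameters())
print("number of params in ln_1: ", n.numerize(ln_1_params))
number of params in ln_1:  1.54K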

Attention c_attn

The attention mechanism has \(H\) heads, each with its own set of weights for the query, key, and value transformations. While each head has dimension D/H, there are H heads, so the combined width simplifies back to D. With three matrices (query, key, value) plus a bias vector, the number of parameters in the c_attn block can be represented by the expression [Dx(3D)+3D].

c_proj

c_proj combines the outputs of the multiple heads. It’s a DxD matrix plus an additional bias vector of size D.

No. of param in c_attn = [Dx(3D)+(3D)] = 1,771,776
No. of param in c_proj = [DxD+D] = 590,592
params = c_attn + c_proj = [Dx(3D)+(3D)] + [DxD+D] = 2,362,368

c_attn_params = sum(p.numel() for p in model._modules['h'][0].attn.c_attn.parameters())
c_proj_params = sum(p.numel() for p in model._modules['h'][0].attn.c_proj.parameters())
print("number of params in c_attn: ", n.numerize(c_attn_params ))
print("number of params in c_proj: ", n.numerize(c_proj_params ))
print("sum of c_attn and c_proj", n.numerize(c_attn_params+c_proj_params))
number of params in c_attn:  1.77M
number of params in c_proj:  590.59K
sum of c_attn and c_proj 2.36M

Going through the definition of attn in the architecture, you will come across attn_dropout and resid_dropout. I skip their description since dropout layers do not add any parameters.

Layer Norm (ln_2)

It’s identical to the layer norm ln_1, adding another 2xD = 1,536 parameters per block.

Feed Forward Neural Network mlp

The feed-forward neural network has two linear transformations with a non-linear activation in between. c_fc is responsible for projecting the output of the attention layer into a hidden space of size 4D, so c_fc is a Dx4D matrix plus a bias vector of size 4D.
Next, c_proj is responsible for projecting the output of the first feed-forward layer back into the embedding space. Therefore c_proj is a 4DxD matrix plus a bias vector of size D.

c_fc = [Dx(4xD)+(4xD)] = 2,362,368
c_proj = [(4xD)xD+D] = 2,360,064
No. of param = [Dx(4D)+(4D)] + [(4D)xD+D] = 4,722,432

c_fc_params = sum(p.numel() for p in model._modules['h'][0].mlp.c_fc.parameters())
c_proj_params = sum(p.numel() for p in model._modules['h'][0].mlp.c_proj.parameters())
print("number of params in c_attn: ", n.numerize(c_fc_params ))
print("number of params in c_proj: ", n.numerize(c_proj_params ))
print("number of params in mlp: ", n.numerize(c_fc_params +c_proj_params))
number of params in c_fc:  2.36M
number of params in c_proj:  2.36M
number of params in mlp: 4.72M

Going through the definition of mlp in the architecture, you will come across act and dropout. I skip their description since they do not add any parameters.

Layer Norm (ln_f)

It’s similar to the layer norm ln_1, but it appears only once, after the final block, rather than once per layer.
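
A final spot check against the model, using the ln_f attribute from the printout:

ln_f_params = sum(p.numel() for p in model._modules['ln_f'].parameters())
print("number of params in ln_f: ", n.numerize(ln_f_params))
number of params in ln_f:  1.54K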

Total number of parameters in GPT2

To calculate the total number of params in GPT2, we add up all the parameters we have come across, multiplying the per-block components by L.

\[No. of param = w_{\text{te}} + w_{\text{pe}} + L \times (ln_1 + \text{attn} + ln_2 + \text{mlp}) + ln_f\] \[= 38,597,376 + 786,432 + 12 \times (1,536 + 2,362,368 + 1,536 + 4,722,432) + 1,536\] \[ = 124,439,808\]
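
The same sum can be scripted with the constants from earlier; note that H never appears, since the heads simply partition D:

ln = 2 * D                              # scale + shift of one layer norm
attn = (D * 3*D + 3*D) + (D * D + D)    # c_attn + c_proj
mlp = (D * 4*D + 4*D) + (4*D * D + D)   # c_fc + c_proj (hidden size 4D)
total = V*D + C*D + L * (ln + attn + ln + mlp) + ln
print("calculated total: ", n.numerize(total))
calculated total:  124.44M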

So the final calculation matches the number we got from the first operation, summing over model.parameters(). Using this process, we can get a ballpark estimate of the number of parameters for a model of any size.