BLT vs BPE FLOPs Comparison

The companion blog post can be found here.

Adjustable Parameters

  • BLT - Patch Size: 1–10
  • BLT Global - Model Dimension: 512–8192
  • BLT Local - Num Layers: 2–24

For inspiration, have a look at the paper's BLT architecture configurations.

A few things you'll notice:

  1. Increasing patch size reduces the global model's FLOPs but not the local models' (see the sketch after this list)
  2. Increasing patch size together with the global model dimension can leave total FLOPs roughly unchanged
  3. In smaller BLTs, the local models account for a larger share of total FLOPs
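
To make point 1 (and the growing local share) concrete, here is a rough back-of-the-envelope sketch in Python, not the tool's actual code. The per-layer FLOPs formula, the global model dimension of 2048, the two layers per local module, and full-context attention are illustrative assumptions; only the 8,192-byte context, the 26 global layers, and the local model dimension of 1024 come from the fixed settings listed further down.

```python
def transformer_flops(n_tokens, d_model, n_layers, n_ctx, d_ff_mult=4):
    """Rough forward-pass FLOPs for a dense transformer stack.

    Per token, per layer: Q/K/V and output projections (8 * d^2),
    attention scores and value mixing (4 * n_ctx * d), and the
    feed-forward block (4 * d * d_ff). Embeddings are ignored.
    """
    per_token_per_layer = (
        8 * d_model ** 2                        # Q, K, V, output projections
        + 4 * n_ctx * d_model                   # attention scores + value mixing
        + 4 * d_model * d_ff_mult * d_model     # feed-forward block
    )
    return n_tokens * n_layers * per_token_per_layer


n_bytes = 8192  # n_ctx_base: context length in bytes
for patch_size in (4, 6, 8):
    n_patches = n_bytes // patch_size
    # The global model sees one token per patch; the local encoder and
    # decoder see every byte regardless of patch size.
    global_flops = transformer_flops(n_patches, d_model=2048, n_layers=26, n_ctx=n_patches)
    local_flops = 2 * transformer_flops(n_bytes, d_model=1024, n_layers=2, n_ctx=n_bytes)
    share = local_flops / (global_flops + local_flops)
    print(f"patch_size={patch_size}: global={global_flops:.2e}  "
          f"local={local_flops:.2e}  local share={share:.0%}")
```

Shrinking the global model dimension in the same loop would illustrate point 3 directly: the local terms stay fixed while the global term collapses.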

Parameter counts are displayed below each bar.
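
The tool's exact accounting isn't reproduced here, but a minimal parameter-count sketch (ignoring norms, biases, and the cross-attention between local and global models) gives a feel for the numbers. The model dimension of 2048 and the two local layers are assumed slider values; the layer counts and vocabulary sizes come from the fixed settings listed further down.

```python
def transformer_params(d_model, n_layers, n_vocab, d_ff_mult=4):
    """Approximate parameter count: attention (4 * d^2) plus feed-forward
    (2 * d_ff * d) weights per layer, plus token embeddings."""
    per_layer = (4 + 2 * d_ff_mult) * d_model ** 2
    return n_layers * per_layer + n_vocab * d_model

print(f"BPE model:        {transformer_params(2048, 26, 128_000):.2e}")
print(f"BLT global model: {transformer_params(2048, 26, 0):.2e}")  # patch embeddings come from the local encoder
print(f"BLT local module: {transformer_params(1024, 2, 256):.2e}")
```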

A core hypothesis of the paper is "that larger models taking fewer steps on larger patches might perform better than smaller models taking more steps" (source).

The purpose of this tool is to show the relationship between patch size, global model dimension, and local model layers in terms of FLOPs and parameters. It implies nothing about the effectiveness of those FLOPs relative to loss (cf. the FLOPs/BPB plots from the paper) or downstream benchmarks. To fully compare BPE-based transformers and BLT, you'll need to investigate those claims in the paper itself.

  • BPE - Bytes per Token (bpe_ps): 4.4
  • BPE/BLT Global - Num Layers (n_layers): 26
  • BPE/BLT Global - Num Heads (n_heads): 20
  • BPE - Vocabulary Size (n_vocab): 128,000
  • BPE/BLT - Context Length (n_ctx_base): 8,192 bytes
  • BLT Local - Model Dimension (local_d_model): 1024
  • BLT Local - Num Heads (local_n_heads): 16
  • BLT Local - Vocabulary Size (local_n_vocab): 256
  • BLT Local - FF Multiplier (local_d_ff_multiplier): 4
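
Putting the fixed settings above together with assumed slider values (a model dimension of 2048 for both models, a patch size of 6, and two layers per local module), a crude FLOPs-per-byte comparison might look like the sketch below. It uses the common "forward FLOPs ≈ 2 × non-embedding parameters per token" rule of thumb and ignores attention's quadratic term and the local/global cross-attention, so treat the numbers as illustrative only.

```python
def dense_params(d_model, n_layers, d_ff_mult=4):
    """Non-embedding weights: attention + feed-forward matrices per layer."""
    return n_layers * (4 + 2 * d_ff_mult) * d_model ** 2

def flops_per_token(d_model, n_layers):
    """Forward FLOPs per token via the ~2 * params rule of thumb."""
    return 2 * dense_params(d_model, n_layers)

n_bytes, bpe_ps = 8192, 4.4        # fixed settings from the list above
patch_size, d_model = 6, 2048      # assumed slider values

bpe_flops = (n_bytes / bpe_ps) * flops_per_token(d_model, n_layers=26)
blt_flops = (
    (n_bytes / patch_size) * flops_per_token(d_model, n_layers=26)  # global model
    + n_bytes * 2 * flops_per_token(1024, n_layers=2)               # local encoder + decoder
)
print(f"BPE FLOPs/byte: {bpe_flops / n_bytes:.2e}")
print(f"BLT FLOPs/byte: {blt_flops / n_bytes:.2e}")
```

Changing patch_size or d_model shifts the balance between the global and local terms, which is exactly the trade-off the sliders expose.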