BLT vs BPE FLOPs Comparison

The companion blog post can be found here.

Adjustable Parameters

  • BLT - Patch Size: 1–10
  • BLT Global - Model Dimension: 512–8192
  • BLT Local - Num Layers: 2–24

For inspiration, have a look at the paper's BLT architecture configurations.

A few things you'll notice:

  1. Increasing patch size reduces the global model's FLOPs but not the local models' (see the sketch after this list)
  2. Increasing patch size together with the global model dimension can leave total FLOPs roughly unchanged
  3. In smaller BLTs, the local models account for a larger share of total FLOPs
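
To make point 1 (and the growing local share) concrete, here is a rough back-of-the-envelope sketch in Python, not the tool's actual code. The per-layer FLOPs formula, the global model dimension of 2048, the two layers per local module, and full-context attention are illustrative assumptions; only the 8,192-byte context, the 26 global layers, and the local model dimension of 1024 come from the fixed settings listed further down.

```python
def transformer_flops(n_tokens, d_model, n_layers, n_ctx, d_ff_mult=4):
    """Rough forward-pass FLOPs for a dense transformer stack.

    Per token, per layer: Q/K/V and output projections (8 * d^2),
    attention scores and value mixing (4 * n_ctx * d), and the
    feed-forward block (4 * d * d_ff). Embeddings are ignored.
    """
    per_token_per_layer = (
        8 * d_model ** 2                        # Q, K, V, output projections
        + 4 * n_ctx * d_model                   # attention scores + value mixing
        + 4 * d_model * d_ff_mult * d_model     # feed-forward block
    )
    return n_tokens * n_layers * per_token_per_layer


n_bytes = 8192  # n_ctx_base: context length in bytes
for patch_size in (4, 6, 8):
    n_patches = n_bytes // patch_size
    # The global model sees one token per patch; the local encoder and
    # decoder see every byte regardless of patch size.
    global_flops = transformer_flops(n_patches, d_model=2048, n_layers=26, n_ctx=n_patches)
    local_flops = 2 * transformer_flops(n_bytes, d_model=1024, n_layers=2, n_ctx=n_bytes)
    share = local_flops / (global_flops + local_flops)
    print(f"patch_size={patch_size}: global={global_flops:.2e}  "
          f"local={local_flops:.2e}  local share={share:.0%}")
```

Shrinking the global model dimension in the same loop would illustrate point 3 directly: the local terms stay fixed while the global term collapses.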

Parameter counts are displayed below each bar.
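
The tool's exact accounting isn't reproduced here, but a minimal parameter-count sketch (ignoring norms, biases, and the cross-attention between local and global models) gives a feel for the numbers. The model dimension of 2048 and the two local layers are assumed slider values; the layer counts and vocabulary sizes come from the fixed settings listed further down.

```python
def transformer_params(d_model, n_layers, n_vocab, d_ff_mult=4):
    """Approximate parameter count: attention (4 * d^2) plus feed-forward
    (2 * d_ff * d) weights per layer, plus token embeddings."""
    per_layer = (4 + 2 * d_ff_mult) * d_model ** 2
    return n_layers * per_layer + n_vocab * d_model

print(f"BPE model:        {transformer_params(2048, 26, 128_000):.2e}")
print(f"BLT global model: {transformer_params(2048, 26, 0):.2e}")  # patch embeddings come from the local encoder
print(f"BLT local module: {transformer_params(1024, 2, 256):.2e}")
```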

A core hypothesis of the paper is "that larger models taking fewer steps on larger patches might perform better than smaller models taking more steps" (source).

The purpose of this tool is to show the relationship between patch size, global model dimension, and local model layers in terms of FLOPs and parameters. It implies nothing about the effectiveness of those FLOPs relative to loss (cf. the FLOPs/BPB plots from the paper) or downstream benchmarks. To fully compare BPE-based transformers and BLT, you'll need to investigate those claims in the paper itself.

  • BPE - Bytes per Token (bpe_ps): 4.4
  • BPE/BLT Global - Num Layers (n_layers): 26
  • BPE/BLT Global - Num Heads (n_heads): 20
  • BPE - Vocabulary Size (n_vocab): 128,000
  • BPE/BLT - Context Length (n_ctx_base): 8,192 bytes
  • BLT Local - Model Dimension (local_d_model): 1024
  • BLT Local - Num Heads (local_n_heads): 16
  • BLT Local - Vocabulary Size (local_n_vocab): 256
  • BLT Local - FF Multiplier (local_d_ff_multiplier): 4
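
Putting the fixed settings above together with assumed slider values (a model dimension of 2048 for both models, a patch size of 6, and two layers per local module), a crude FLOPs-per-byte comparison might look like the sketch below. It uses the common "forward FLOPs ≈ 2 × non-embedding parameters per token" rule of thumb and ignores attention's quadratic term and the local/global cross-attention, so treat the numbers as illustrative only.

```python
def dense_params(d_model, n_layers, d_ff_mult=4):
    """Non-embedding weights: attention + feed-forward matrices per layer."""
    return n_layers * (4 + 2 * d_ff_mult) * d_model ** 2

def flops_per_token(d_model, n_layers):
    """Forward FLOPs per token via the ~2 * params rule of thumb."""
    return 2 * dense_params(d_model, n_layers)

n_bytes, bpe_ps = 8192, 4.4        # fixed settings from the list above
patch_size, d_model = 6, 2048      # assumed slider values

bpe_flops = (n_bytes / bpe_ps) * flops_per_token(d_model, n_layers=26)
blt_flops = (
    (n_bytes / patch_size) * flops_per_token(d_model, n_layers=26)  # global model
    + n_bytes * 2 * flops_per_token(1024, n_layers=2)               # local encoder + decoder
)
print(f"BPE FLOPs/byte: {bpe_flops / n_bytes:.2e}")
print(f"BLT FLOPs/byte: {blt_flops / n_bytes:.2e}")
```

Changing patch_size or d_model shifts the balance between the global and local terms, which is exactly the trade-off the sliders expose.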