A compact, high-performance neural architecture for multi-digit addition using a hybrid Transformer-LSTM model. This repository provides the model code, training pipeline, and evaluation metrics, along with findings from an extensive ablation and error analysis; it investigates how a Transformer-LSTM hybrid can learn to perform symbolic arithmetic (addition) with high accuracy. It includes:
- Curriculum-based training on digit lengths 1 to 7
- Evaluation of error types and model generalization
- Ablation studies on architecture and hidden dimensions
- Base: Transformer encoder + LSTM decoder
- Trainable Parameters: 102,991
- Curriculum Learning: Gradual increase in digit length (1→7)
- Evaluation Metrics: Exact match accuracy, character accuracy, perplexity
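The curriculum schedule above (digit lengths 1→7) can be sketched as a simple data generator that caps operand length at the current stage. This is a hypothetical helper for illustration, not code from this repository:

```python
import random

def make_batch(max_digits, batch_size=4, seed=0):
    """Generate addition problems capped at the current curriculum stage.

    `max_digits` is raised from 1 to 7 as training progresses; the exact
    sampling scheme used in the repository may differ.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        n_digits = rng.randint(1, max_digits)
        lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        batch.append((f"{a}+{b}", str(a + b)))  # (input string, target string)
    return batch

# Stage 3 of the curriculum: operands of up to 3 digits.
print(make_batch(3))
```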
Digit Length | Epochs Required | Initial Accuracy | Final Accuracy |
---|---|---|---|
1-digit | 1 | 100% | 100% |
2-digit | 1 | 98.83% | 98.83% |
3-digit | 2 | 89.25% | 97.86% |
4-digit | 3 | 85.36% | 96.84% |
5-digit | 10 | 8.32% | 93.57% |
6-digit | 1 | 82.52% | 82.52% |
7-digit | 2 | 1.03% | 97.23% |
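The two headline metrics in the table above, exact match accuracy and character accuracy, can be computed as follows. This is a minimal sketch; the repository's implementation may differ:

```python
def exact_match(preds, targets):
    """Fraction of output sequences predicted exactly right."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def char_accuracy(preds, targets):
    """Per-character accuracy, padding the shorter string so that
    length mismatches count as errors."""
    hits = total = 0
    for p, t in zip(preds, targets):
        n = max(len(p), len(t))
        hits += sum(a == b for a, b in zip(p.ljust(n), t.ljust(n)))
        total += n
    return hits / total

preds   = ["579", "1000", "42"]
targets = ["579", "1001", "42"]
print(exact_match(preds, targets))   # 2 of 3 sequences match exactly
print(char_accuracy(preds, targets)) # 8 of 9 characters match
```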
- Carry Propagation Failures (45%)
- Digit Substitution Errors (38%)
- Length Miscalculations (12%)
- Positional Errors (5%)
- 92% of errors involve carry operations
- Errors increase sharply with digit length
- Mid-positions (3rd–5th digits) are most error-prone
- Operands with repeated 8s and 9s (which trigger frequent carries) cause more confusion
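Since most errors involve carries, a helper that counts how many carry operations a problem requires is useful for bucketing errors by carry load. This is an illustrative sketch, not repository code:

```python
def count_carries(a, b):
    """Number of carry operations in the column-wise addition a + b."""
    carries, carry = 0, 0
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        carry = 1 if s >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries

print(count_carries(999, 1))  # 3: the carry ripples through every column
print(count_carries(12, 34))  # 0: no column sum reaches 10
```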
- Fast convergence on simpler sequences (1–4 digits)
- 5-digit tasks required 10 epochs for 93.57% accuracy
- Perplexity spikes when increasing digit length
- 6-digit accuracy anomalously low despite training
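Perplexity here is the exponential of the mean per-character cross-entropy, so values just above 1.0 correspond to near-zero loss, and a spike in loss at a digit-length transition shows up directly as a perplexity spike. A minimal sketch:

```python
import math

def perplexity(mean_ce_loss):
    """Character-level perplexity from the mean cross-entropy (in nats)."""
    return math.exp(mean_ce_loss)

# A reported perplexity of 1.0005 implies a mean loss of about 0.0005 nats:
print(perplexity(0.0005))  # ~1.0005
```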
1. Architecture Comparison
Metric | Transformer-LSTM | Pure Transformer |
---|---|---|
5-digit Accuracy | 93.57% | 89.42% |
7-digit Accuracy | 97.23% | 94.81% |
Epochs to Converge | 17 | 23 |
Carry Error Rate | 45% | 58% |
Takeaway: LSTM enhances sequence carry handling and convergence efficiency.
2. Hidden Dimension Size
Metric | Hidden Dim 64 | Hidden Dim 128 |
---|---|---|
5-digit Accuracy | 99.96% | 98.29% |
7-digit Accuracy | 99.95% | 82.50% |
Epochs to Converge | 4 | 8 |
Perplexity | 1.0005 | 1.0088 |
Takeaway: Surprisingly, the smaller hidden dimension (64) outperforms the larger one (128).
d_model | nhead | num_encoder_layers | num_decoder_layers | dim_feedforward | learning_rate | best_epoch | val_loss | val_char_accuracy | val_seq_accuracy | training_time (s) |
---|---|---|---|---|---|---|---|---|---|---|
64 | 2 | 2 | 2 | 256 | 0.0001 | 30 | 1.4036 | 0.4950 | 0.000 | 31.4 |
64 | 2 | 2 | 2 | 256 | 0.0005 | 30 | 1.0752 | 0.6151 | 0.000 | 31.0 |
64 | 2 | 2 | 2 | 256 | 0.001 | 30 | 1.0032 | 0.6377 | 0.000 | 29.9 |
64 | 2 | 2 | 2 | 512 | 0.0001 | 30 | 1.3603 | 0.5072 | 0.000 | 29.2 |
64 | 2 | 2 | 2 | 512 | 0.0005 | 30 | 1.0545 | 0.6170 | 0.000 | 29.4 |
64 | 2 | 2 | 2 | 512 | 0.001 | 30 | 0.9144 | 0.6602 | 0.000 | 30.8 |
64 | 2 | 2 | 2 | 1024 | 0.0001 | 30 | 1.3309 | 0.5238 | 0.000 | 28.8 |
64 | 2 | 2 | 2 | 1024 | 0.0005 | 30 | 1.0259 | 0.6233 | 0.000 | 25.8 |
64 | 2 | 2 | 2 | 1024 | 0.001 | 30 | 0.8331 | 0.6877 | 0.001 | 28.3 |
64 | 2 | 2 | 3 | 256 | 0.0001 | 29 | 1.3940 | 0.4903 | 0.001 | 26.4 |
64 | 2 | 2 | 3 | 256 | 0.0005 | 30 | 1.0783 | 0.6039 | 0.000 | 24.9 |
64 | 2 | 2 | 3 | 256 | 0.001 | 30 | 0.9740 | 0.6458 | 0.001 | 24.8 |
64 | 2 | 2 | 3 | 512 | 0.0001 | 30 | 1.3332 | 0.5069 | 0.000 | 24.8 |
64 | 2 | 2 | 3 | 512 | 0.0005 | 30 | 1.0625 | 0.6070 | 0.000 | 25.0 |
64 | 2 | 2 | 3 | 512 | 0.001 | 30 | 0.9154 | 0.6646 | 0.000 | 25.1 |
64 | 2 | 2 | 3 | 1024 | 0.0001 | 30 | 1.2751 | 0.5526 | 0.000 | 25.4 |
64 | 2 | 2 | 3 | 1024 | 0.0005 | 30 | 0.9939 | 0.6324 | 0.001 | 25.0 |
64 | 2 | 2 | 3 | 1024 | 0.001 | 30 | 0.4097 | 0.8473 | 0.014 | 25.0 |
64 | 2 | 2 | 4 | 256 | 0.0001 | 30 | 1.3721 | 0.5019 | 0.001 | 28.3 |
64 | 2 | 2 | 4 | 256 | 0.0005 | 30 | 1.0805 | 0.5973 | 0.000 | 29.5 |
64 | 2 | 2 | 4 | 256 | 0.001 | 30 | 0.9904 | 0.6414 | 0.000 | 27.3 |
64 | 2 | 2 | 4 | 512 | 0.0001 | 30 | 1.3286 | 0.5141 | 0.000 | 35.5 |
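The rows shown are consistent with a Cartesian hyperparameter sweep. Below is a hypothetical reconstruction of how that grid could be enumerated; the actual search script and full grid may differ:

```python
from itertools import product

# Grid inferred from the visible rows of the sweep table (hypothetical).
grid = {
    "d_model": [64],
    "nhead": [2],
    "num_encoder_layers": [2],
    "num_decoder_layers": [2, 3, 4],
    "dim_feedforward": [256, 512, 1024],
    "learning_rate": [1e-4, 5e-4, 1e-3],
}

# One config dict per point in the Cartesian product of the value lists.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 27 runs for this slice of the sweep
```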
Criteria | Human | Transformer Model |
---|---|---|
Generalizable Rules | ✅ | ❌ (Length-specific) |
Error Recovery | ✅ | ❌ |
Carry Propagation | ✅ | ❌ (Unstable) |
Systematic Thinking | ✅ | ❌ (Pattern-based) |
The model achieves near-perfect accuracy but lacks the procedural arithmetic reasoning that humans apply.
- Transformer-LSTM hybrid > Pure Transformer
- Smaller hidden dim (64) performs better than larger (128)
- Carry handling is the model’s main weakness
- Curriculum learning is essential for training success
- Performance is sensitive to digit length transitions