A compact, high-performance neural architecture for multi-digit addition using a hybrid Transformer-LSTM model. This repository provides the model code, training pipeline, and evaluation metrics, along with findings from an extensive ablation and error analysis; it investigates how a Transformer-LSTM hybrid can learn to perform symbolic arithmetic (addition) with high accuracy. It includes:
- Curriculum-based training on digit lengths 1 to 7
- Evaluation of error types and model generalization
- Ablation studies on architecture and hidden dimensions
- Base: Transformer encoder + LSTM decoder
- Trainable Parameters: 102,991
- Curriculum Learning: Gradual increase in digit length (1→7)
- Evaluation Metrics: Exact match accuracy, character accuracy, perplexity
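The curriculum schedule above (digit lengths 1→7) can be sketched as a simple data generator that caps operand length at the current stage. This is a hypothetical helper for illustration, not code from this repository:

```python
import random

def make_batch(max_digits, batch_size=4, seed=0):
    """Generate addition problems capped at the current curriculum stage.

    `max_digits` is raised from 1 to 7 as training progresses; the exact
    sampling scheme used in the repository may differ.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        n_digits = rng.randint(1, max_digits)
        lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        batch.append((f"{a}+{b}", str(a + b)))  # (input string, target string)
    return batch

# Stage 3 of the curriculum: operands of up to 3 digits.
print(make_batch(3))
```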
Digit Length | Epochs Required | Initial Accuracy | Final Accuracy |
---|---|---|---|
1-digit | 1 | 100% | 100% |
2-digit | 1 | 98.83% | 98.83% |
3-digit | 2 | 89.25% | 97.86% |
4-digit | 3 | 85.36% | 96.84% |
5-digit | 10 | 8.32% | 93.57% |
6-digit | 1 | 82.52% | 82.52% |
7-digit | 2 | 1.03% | 97.23% |
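The two headline metrics in the table above, exact match accuracy and character accuracy, can be computed as follows. This is a minimal sketch; the repository's implementation may differ:

```python
def exact_match(preds, targets):
    """Fraction of output sequences predicted exactly right."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def char_accuracy(preds, targets):
    """Per-character accuracy, padding the shorter string so that
    length mismatches count as errors."""
    hits = total = 0
    for p, t in zip(preds, targets):
        n = max(len(p), len(t))
        hits += sum(a == b for a, b in zip(p.ljust(n), t.ljust(n)))
        total += n
    return hits / total

preds   = ["579", "1000", "42"]
targets = ["579", "1001", "42"]
print(exact_match(preds, targets))   # 2 of 3 sequences match exactly
print(char_accuracy(preds, targets)) # 8 of 9 characters match
```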
- Carry Propagation Failures (45%)
- Digit Substitution Errors (38%)
- Length Miscalculations (12%)
- Positional Errors (5%)
- 92% of errors involve carry operations
- Errors increase sharply with digit length
- Mid-positions (3rd–5th digits) are most error-prone
- Operands with repeated 8s and 9s (which trigger frequent carries) cause more confusion
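Since most errors involve carries, a helper that counts how many carry operations a problem requires is useful for bucketing errors by carry load. This is an illustrative sketch, not repository code:

```python
def count_carries(a, b):
    """Number of carry operations in the column-wise addition a + b."""
    carries, carry = 0, 0
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        carry = 1 if s >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries

print(count_carries(999, 1))  # 3: the carry ripples through every column
print(count_carries(12, 34))  # 0: no column sum reaches 10
```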
- Fast convergence on simpler sequences (1–4 digits)
- 5-digit tasks required 10 epochs for 93.57% accuracy
- Perplexity spikes when increasing digit length
- 6-digit accuracy anomalously low despite training
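Perplexity here is the exponential of the mean per-character cross-entropy, so values just above 1.0 correspond to near-zero loss, and a spike in loss at a digit-length transition shows up directly as a perplexity spike. A minimal sketch:

```python
import math

def perplexity(mean_ce_loss):
    """Character-level perplexity from the mean cross-entropy (in nats)."""
    return math.exp(mean_ce_loss)

# A reported perplexity of 1.0005 implies a mean loss of about 0.0005 nats:
print(perplexity(0.0005))  # ~1.0005
```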
1. Architecture Comparison
Metric | Transformer-LSTM | Pure Transformer |
---|---|---|
5-digit Accuracy | 93.57% | 89.42% |
7-digit Accuracy | 97.23% | 94.81% |
Epochs to Converge | 17 | 23 |
Carry Error Rate | 45% | 58% |
Takeaway: LSTM enhances sequence carry handling and convergence efficiency.
2. Hidden Dimension Size
Metric | Hidden Dim 64 | Hidden Dim 128 |
---|---|---|
5-digit Accuracy | 99.96% | 98.29% |
7-digit Accuracy | 99.95% | 82.50% |
Epochs to Converge | 4 | 8 |
Perplexity | 1.0005 | 1.0088 |
Takeaway: Surprisingly, the smaller hidden dimension (64) outperforms the larger one (128).
d_model | nhead | num_encoder_layers | num_decoder_layers | dim_feedforward | learning_rate | best_epoch | val_loss | val_char_accuracy | val_seq_accuracy | training_time (s) |
---|---|---|---|---|---|---|---|---|---|---|
64 | 2 | 2 | 2 | 256 | 0.0001 | 30 | 1.4036 | 0.4950 | 0.000 | 31.4 |
64 | 2 | 2 | 2 | 256 | 0.0005 | 30 | 1.0752 | 0.6151 | 0.000 | 31.0 |
64 | 2 | 2 | 2 | 256 | 0.001 | 30 | 1.0032 | 0.6377 | 0.000 | 29.9 |
64 | 2 | 2 | 2 | 512 | 0.0001 | 30 | 1.3603 | 0.5072 | 0.000 | 29.2 |
64 | 2 | 2 | 2 | 512 | 0.0005 | 30 | 1.0545 | 0.6170 | 0.000 | 29.4 |
64 | 2 | 2 | 2 | 512 | 0.001 | 30 | 0.9144 | 0.6602 | 0.000 | 30.8 |
64 | 2 | 2 | 2 | 1024 | 0.0001 | 30 | 1.3309 | 0.5238 | 0.000 | 28.8 |
64 | 2 | 2 | 2 | 1024 | 0.0005 | 30 | 1.0259 | 0.6233 | 0.000 | 25.8 |
64 | 2 | 2 | 2 | 1024 | 0.001 | 30 | 0.8331 | 0.6877 | 0.001 | 28.3 |
64 | 2 | 2 | 3 | 256 | 0.0001 | 29 | 1.3940 | 0.4903 | 0.001 | 26.4 |
64 | 2 | 2 | 3 | 256 | 0.0005 | 30 | 1.0783 | 0.6039 | 0.000 | 24.9 |
64 | 2 | 2 | 3 | 256 | 0.001 | 30 | 0.9740 | 0.6458 | 0.001 | 24.8 |
64 | 2 | 2 | 3 | 512 | 0.0001 | 30 | 1.3332 | 0.5069 | 0.000 | 24.8 |
64 | 2 | 2 | 3 | 512 | 0.0005 | 30 | 1.0625 | 0.6070 | 0.000 | 25.0 |
64 | 2 | 2 | 3 | 512 | 0.001 | 30 | 0.9154 | 0.6646 | 0.000 | 25.1 |
64 | 2 | 2 | 3 | 1024 | 0.0001 | 30 | 1.2751 | 0.5526 | 0.000 | 25.4 |
64 | 2 | 2 | 3 | 1024 | 0.0005 | 30 | 0.9939 | 0.6324 | 0.001 | 25.0 |
64 | 2 | 2 | 3 | 1024 | 0.001 | 30 | 0.4097 | 0.8473 | 0.014 | 25.0 |
64 | 2 | 2 | 4 | 256 | 0.0001 | 30 | 1.3721 | 0.5019 | 0.001 | 28.3 |
64 | 2 | 2 | 4 | 256 | 0.0005 | 30 | 1.0805 | 0.5973 | 0.000 | 29.5 |
64 | 2 | 2 | 4 | 256 | 0.001 | 30 | 0.9904 | 0.6414 | 0.000 | 27.3 |
64 | 2 | 2 | 4 | 512 | 0.0001 | 30 | 1.3286 | 0.5141 | 0.000 | 35.5 |
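The rows shown are consistent with a Cartesian hyperparameter sweep. Below is a hypothetical reconstruction of how that grid could be enumerated; the actual search script and full grid may differ:

```python
from itertools import product

# Grid inferred from the visible rows of the sweep table (hypothetical).
grid = {
    "d_model": [64],
    "nhead": [2],
    "num_encoder_layers": [2],
    "num_decoder_layers": [2, 3, 4],
    "dim_feedforward": [256, 512, 1024],
    "learning_rate": [1e-4, 5e-4, 1e-3],
}

# One config dict per point in the Cartesian product of the value lists.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 27 runs for this slice of the sweep
```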
Criteria | Human | Transformer Model |
---|---|---|
Generalizable Rules | ✅ | ❌ (Length-specific) |
Error Recovery | ✅ | ❌ |
Carry Propagation | ✅ | ❌ (Unstable) |
Systematic Thinking | ✅ | ❌ (Pattern-based) |
The model achieves near-perfect accuracy but lacks the procedural arithmetic reasoning that humans apply.
- Transformer-LSTM hybrid > Pure Transformer
- Smaller hidden dim (64) performs better than larger (128)
- Carry handling is the model’s main weakness
- Curriculum learning is essential for training success
- Performance is sensitive to digit length transitions