This ALEA project contains the complete source code used to collect and preprocess the training data for the KL3M embedding and generative models.
Pending arXiv submission
TODO: Table
- us/fdlp: US Federal Depository Library Program (FDLP) via GPO
- us/govinfo: US Government Publishing Office (GPO) data via GovInfo API
- us/usc: US Code releases via Office of the Law Revision Counsel (OLRC)
- us/ecfr: Electronic Code of Federal Regulations (eCFR) via NARA/GPO API
- us/fr: Federal Register data via NARA/GPO API
- us/edgar: SEC EDGAR data via SEC feed
- us/recap: RECAP raw documents via S3
- us/recap-docs: RECAP attached docs (Word, WordPerfect, PDF, MP3) via S3
- us/pacer-dockets: PACER docket sheets via archive.org
- us/patent-grants: USPTO patent grants via USPTO bulk data
- us/dotgov: filtered .gov TLD domains via direct retrieval
- uk/legislation: All enacted UK legislation via legislation.gov.uk bulk download
- eu/eurlex_oj: EU Official Journal via Cellar/EU API
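Several of the sources above are retrieved through public REST APIs. As an illustration, the sketch below builds a request URL for the GovInfo collections endpoint used for GPO data; the endpoint shape follows the public GovInfo API documentation, but the helper function, its name, and its defaults are hypothetical and not part of this project's code.

```python
from urllib.parse import urlencode

# Public GovInfo API host (see api.govinfo.gov); an API key is required.
GOVINFO_BASE = "https://api.govinfo.gov"


def govinfo_collection_url(collection: str, start_date: str,
                           api_key: str, page_size: int = 100) -> str:
    """Build a GovInfo collections query URL (hypothetical helper).

    `collection` is a GovInfo collection code such as "FR" or "CFR";
    `start_date` is an ISO-8601 timestamp bounding lastModified.
    """
    # offset/pageSize/api_key are standard query parameters on this endpoint.
    query = urlencode({"offset": 0, "pageSize": page_size, "api_key": api_key})
    return f"{GOVINFO_BASE}/collections/{collection}/{start_date}?{query}"
```

A real collection step would pair this with an HTTP client and follow the pagination fields the API returns in each response page.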
TODO
TODO
The source code for this ALEA project is released under the MIT License. See the LICENSE file for details.
Top-level dependencies are all licensed under MIT, BSD-3, or Apache 2.0. See `poetry show --tree` for details.
If you encounter any issues or have questions about using this ALEA project, please open an issue on GitHub.
To learn more about ALEA and our KL3M models and data, visit the ALEA website.