replit/replit-code-v1-3b | Readme and Docs

Model Description

This model was developed by Replit.

replit-code-v1-3b is a 2.7B Causal Language Model focused on Code Completion. The model has been trained on a subset of the Stack Dedup v1.2 dataset.

The training mixture includes 20 different languages, listed here in descending order of number of tokens:
Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, Shell
In total, the training dataset contains 175B tokens, which were repeated over 3 epochs – in total, replit-code-v1-3b has been trained on 525B tokens (~195 tokens per parameter).

The model has been trained on the MosaicML platform with 256 x A100-40GB GPUs, leveraging their latest LLM examples repo.
replit-code-v1-3b is powered by state-of-the-art LLM techniques, such as: Flash Attention for fast training and inference, AliBi positional embeddings to support variable context length at inference time, LionW optimizer, etc.

Intended Use

Replit intends this model be used by anyone as a foundational model for application-specific fine-tuning without strict limitations on commercial use.

Limitations

The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing filters, and such content may be reflected in model generated text. We recommend that users exercise reasonable caution when using in production systems. Do not use for any applications that may cause harm or distress to individuals or groups.

License

The model checkpoint and vocabulary file are licensed under the Creative Commons license (CC BY-SA-4.0). Under the license, you must give credit to Replit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests that Replit endorses you or your use.

The source code files (*.py) are licensed under the Apache 2.0 license.

Usage

Post Processing

Note that as with all code generation models, minimizing opportunities for output degradation is important for quality control. In general, you should consider setting max_length to a reasonable value based on your completion use case. You may also want to experiment with the stop_sequence argument, which allows you to specify a token that will force generation to stop.