Qwen2-7B-Instruct on Replicate
This Replicate model provides access to the Qwen2-7B-Instruct model, part of the Qwen2 language model series. It offers three variants:
- `Qwen/Qwen2-7B-Instruct`: full precision model
- `Qwen/Qwen2-7B-Instruct-GPTQ-Int8`: 8-bit quantized model
- `Qwen/Qwen2-7B-Instruct-GPTQ-Int4`: 4-bit quantized model
Introduction
Qwen2 is the latest series of Qwen large language models, offering both pretrained and instruction-tuned models in five sizes: 0.5B, 1.5B, 7B, 57B-A14B, and 72B. This Replicate implementation focuses on the instruction-tuned 7B Qwen2 model.
Qwen2 demonstrates competitive performance against state-of-the-art open-source and proprietary models across various benchmarks, including language understanding, generation, multilingual capability, coding, mathematics, and reasoning.
Qwen2-7B-Instruct supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. Please refer to the Processing Long Texts section below for detailed instructions on how to deploy Qwen2 for handling long texts.
For more details about Qwen2, see the official Qwen repository and the Qwen2 release blog post.
Model Details
Qwen2 is based on the Transformer architecture and incorporates:
- SwiGLU activation
- Attention QKV bias
- Group query attention (illustrated in the sketch below)
- Improved tokenizer for multiple natural languages and code
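Group query attention lets many query heads share a smaller set of key/value heads, shrinking the KV cache. The following PyTorch sketch illustrates the idea only; the head counts and dimensions are illustrative placeholders, not Qwen2's actual configuration.

```python
# Minimal sketch of grouped-query attention (GQA).
# Head counts and dimensions below are illustrative, not Qwen2's real config.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 8, 64
num_q_heads, num_kv_heads = 28, 4          # many query heads, few KV heads
group_size = num_q_heads // num_kv_heads   # query heads per KV head

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand each KV head so every group of query heads attends to the same K/V.
k = k.repeat_interleave(group_size, dim=1)  # -> (batch, num_q_heads, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 28, 8, 64])
```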
Training Details
The model underwent pretraining with a large dataset, followed by post-training using both supervised fine-tuning and direct preference optimization.
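As a rough illustration of the preference-optimization step, the sketch below shows the standard DPO loss computed from per-sequence log-probabilities. It is a generic textbook formulation, not the Qwen team's training code, and the `beta` value is an arbitrary placeholder.

```python
# Generic sketch of the direct preference optimization (DPO) objective.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on per-sequence log-probs; beta is a placeholder."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and rejected completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```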
Quickstart
To use this Replicate implementation:

- Visit the Replicate model page.
- Use the web interface or API to run a prediction with your desired parameters (a minimal Python API sketch follows this list).
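If you prefer calling the hosted model from code, a minimal sketch using the Replicate Python client might look like the following. The model slug and input names are assumptions based on this repository; check the Replicate model page for the exact name and input schema.

```python
# Requires `pip install replicate` and the REPLICATE_API_TOKEN environment variable.
import replicate

output = replicate.run(
    "zsxkib/qwen2-7b-instruct",  # assumed slug; confirm on the Replicate model page
    input={
        "prompt": "Tell me a funny joke about cowboys in the style of Yoda from Star Wars",
        "system_prompt": "You are a funny and helpful assistant.",
        "model_type": "Qwen2-7B-Instruct",
        "max_new_tokens": 512,
        "temperature": 1,
    },
)
# The model streams text; join the pieces into one string.
print("".join(output))
```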
For local testing or development:

- Clone the repository:

  ```sh
  git clone -b Qwen2-7B-Instruct https://github.com/zsxkib/cog-qwen-2.git
  cd cog-qwen-2
  ```

- Run a prediction using Cog:

  ```sh
  cog predict \
    -i 'top_k=1' \
    -i 'top_p=1' \
    -i 'prompt="Tell me a funny joke about cowboys in the style of Yoda from Star Wars"' \
    -i 'model_type="Qwen2-7B-Instruct"' \
    -i 'temperature=1' \
    -i 'system_prompt="You are a funny and helpful assistant."' \
    -i 'max_new_tokens=512' \
    -i 'repetition_penalty=1'
  ```
Processing Long Texts
To handle extensive inputs exceeding 32,768 tokens, we utilize YARN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:
- Install vLLM: You can install vLLM by running the following command:

  ```sh
  pip install "vllm>=0.4.3"
  ```

  Or you can install vLLM from source.
- Configure Model Settings: After downloading the model weights, modify the `config.json` file by including the snippet below:

  ```json
  {
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    // ...
    "vocab_size": 152064,

    // adding the following snippet
    "rope_scaling": {
      "factor": 4.0,
      "original_max_position_embeddings": 32768,
      "type": "yarn"
    }
  }
  ```

  This snippet enables YARN to support longer contexts (a factor of 4.0 scales the original 32,768-token window to 131,072 tokens).
- Model Deployment: Use vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server with the command:

  ```bash
  python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights
  ```

  Then you can access the Chat API with:

  ```bash
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen2-7B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your Long Input Here."}
      ]
    }'
  ```
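The same request can also be made from Python with the OpenAI client pointed at the local vLLM server. The base URL and placeholder API key below assume vLLM's default OpenAI-compatible server settings; adjust them to match your deployment.

```python
# Requires `pip install openai` (v1.x client).
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default and
# does not check the API key, so any placeholder string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your Long Input Here."},
    ],
)
print(response.choices[0].message.content)
```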
For further vLLM usage instructions, please refer to the official Qwen GitHub repository.
Note: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
Evaluation
Performance comparison between Qwen2-7B-Instruct and similar-sized instruction-tuned LLMs:
| Dataset | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
| --- | --- | --- | --- | --- | --- |
| **English** | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| TheoremQA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| **Coding** | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| Evalplus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| **Mathematics** | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| **Chinese** | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |
Citation
If you find the Qwen2 model helpful in your work, please cite:
```bibtex
@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}
```
License
The Qwen2 model is licensed under the Apache 2.0 License.
Credits and Support
- The Qwen2 model was developed by the Qwen team.
- This Replicate implementation was created by @zsakib_.
- For issues related to the Replicate implementation, please use the GitHub issue tracker.
- For questions about the underlying Qwen2 model, refer to the official Qwen repository.