zsxkib / qwen2-7b-instruct

Qwen 2: A 7-billion-parameter language model from Alibaba Cloud, fine-tuned for chat completions

Run time and cost

This model costs approximately $0.0015 to run on Replicate, or 666 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
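
As a rough budgeting aid, the arithmetic behind the quoted figure can be sketched as follows. The per-run price is the approximate value stated above and will vary with your inputs; the function names are illustrative only.

```python
from decimal import Decimal

# Approximate per-run price quoted above; actual cost varies with inputs.
COST_PER_RUN_USD = Decimal("0.0015")


def runs_for_budget(budget_usd: str) -> int:
    """Return how many whole runs fit in a budget at the quoted price."""
    return int(Decimal(budget_usd) // COST_PER_RUN_USD)


def cost_of_runs(n_runs: int) -> Decimal:
    """Return the estimated cost in USD of n_runs predictions."""
    return COST_PER_RUN_USD * n_runs


print(runs_for_budget("1"))  # whole runs per dollar at the quoted price
print(cost_of_runs(1500))    # estimated cost of 1,500 runs
```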

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 3 seconds.

Qwen2-7B-Instruct on Replicate

This Replicate model provides access to the Qwen2-7B-Instruct model, part of the Qwen2 language model series. It offers three variants:

  • Qwen/Qwen2-7B-Instruct: Full-precision model
  • Qwen/Qwen2-7B-Instruct-GPTQ-Int8: 8-bit quantized model
  • Qwen/Qwen2-7B-Instruct-GPTQ-Int4: 4-bit quantized model

Introduction

Qwen2 is the latest series of Qwen large language models, offering both pretrained and instruction-tuned models in five sizes: 0.5B, 1.5B, 7B, 57B-A14B, and 72B. This Replicate implementation focuses on the instruction-tuned 7B Qwen2 model.

Qwen2 demonstrates competitive performance against state-of-the-art open-source and proprietary models across various benchmarks, including language understanding, generation, multilingual capability, coding, mathematics, and reasoning.

Qwen2-7B-Instruct supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. See the Processing Long Texts section below for detailed instructions on deploying Qwen2 for long inputs.

For more details about Qwen2, visit:

Model Details

Qwen2 is based on the Transformer architecture and incorporates:

  • SwiGLU activation
  • Attention QKV bias
  • Group query attention
  • Improved tokenizer for multiple natural languages and code
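
To make two of these components concrete, here is a minimal pure-Python sketch of a SwiGLU feed-forward block and of how group query attention maps query heads onto shared key/value heads. It is an illustration of the general techniques, not Qwen2's implementation: the weight matrices, head counts, and function names are all toy assumptions.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: W_down @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Group query attention: index of the KV head a query head shares."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# Toy example: 8 query heads sharing 2 KV heads (hypothetical sizes).
print([kv_head_for(h, 8, 2) for h in range(8)])
```

Sharing each key/value head across a group of query heads shrinks the KV cache, which matters most at long context lengths.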

Training Details

The model underwent pretraining with a large dataset, followed by post-training using both supervised fine-tuning and direct preference optimization.
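
For readers unfamiliar with direct preference optimization, the standard DPO objective for a single preference pair can be sketched as below. This is the textbook loss, not Qwen2's training code; the `beta` default and the argument names are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does.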

Quickstart

To use this Replicate implementation:

  1. Visit the Replicate model page.

  2. Use the web interface or API to run a prediction with your desired parameters.
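
For API access, a prediction can be created over HTTP. The sketch below builds the request with the Python standard library, using the input names from the `cog predict` example in this README. The version ID is not given here and must be copied from the model page, so it is passed in as a parameter; the function names are illustrative.

```python
import json
import urllib.request

API_URL = "https://api.replicate.com/v1/predictions"


def build_prediction_request(token: str, version: str,
                             prompt: str) -> urllib.request.Request:
    """Build (but do not send) a request that creates a prediction."""
    payload = {
        "version": version,  # assumption: version ID copied from the model page
        "input": {
            "prompt": prompt,
            "system_prompt": "You are a funny and helpful assistant.",
            "model_type": "Qwen2-7B-Instruct",
            "max_new_tokens": 512,
            "temperature": 1,
            "top_p": 1,
            "top_k": 1,
            "repetition_penalty": 1,
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send(build_prediction_request(token, version, prompt))` returns the prediction record, which can then be polled until it completes.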

For local testing or development:

  1. Clone the repository:

    git clone -b Qwen2-7B-Instruct https://github.com/zsxkib/cog-qwen-2.git
    cd cog-qwen-2

  2. Run a prediction using Cog:

    cog predict \
      -i 'top_k=1' \
      -i 'top_p=1' \
      -i 'prompt="Tell me a funny joke about cowboys in the style of Yoda from Star Wars"' \
      -i 'model_type="Qwen2-7B-Instruct"' \
      -i 'temperature=1' \
      -i 'system_prompt="You are a funny and helpful assistant."' \
      -i 'max_new_tokens=512' \
      -i 'repetition_penalty=1'

Processing Long Texts

To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

  1. Install vLLM: Run the following command.

    pip install "vllm>=0.4.3"

    Alternatively, install vLLM from source.

  2. Configure Model Settings: After downloading the model weights, modify the config.json file by adding the snippet below:

    ```json
    {
        "architectures": [
            "Qwen2ForCausalLM"
        ],
        // …
        "vocab_size": 152064,

        // add the following snippet
        "rope_scaling": {
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
            "type": "yarn"
        }
    }
    ```

    This snippet enables YaRN to support longer contexts.

  3. Model Deployment: Use vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server with the command:

    python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-7B-Instruct --model path/to/weights

    You can then access the Chat API with:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen2-7B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Your Long Input Here."}
        ]
      }'

    For further usage instructions of vLLM, please refer to our GitHub.
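
The same Chat API call can be made from Python. This is a minimal standard-library sketch that assumes the server from the deployment step is listening on localhost:8000 and serving the model under the name Qwen2-7B-Instruct; the function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(user_text: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "Qwen2-7B-Instruct") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Keeping request construction separate from sending makes the payload easy to inspect before a long input is submitted.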

Note: Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.
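
Because the rope_scaling entry is best added only when long contexts are needed, it can be convenient to toggle it with a small script. The sketch below is one possible helper, not part of the Qwen2 tooling: the function names are hypothetical, and it assumes the real config.json is plain JSON (the `// …` comments in the snippet above are elisions, not valid JSON).

```python
import json
from pathlib import Path

# Values taken from the rope_scaling snippet above.
YARN_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def set_long_context(config: dict, enabled: bool) -> dict:
    """Return a copy of config with rope_scaling added or removed."""
    updated = dict(config)
    if enabled:
        updated["rope_scaling"] = dict(YARN_SCALING)
    else:
        updated.pop("rope_scaling", None)
    return updated


def extended_context_length(scaling: dict) -> int:
    """Context length implied by factor * original_max_position_embeddings."""
    return int(scaling["factor"] * scaling["original_max_position_embeddings"])


def patch_config_file(path: str, enabled: bool) -> None:
    """Rewrite a config.json in place; the path is supplied by the caller."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    patched = set_long_context(config, enabled)
    config_path.write_text(json.dumps(patched, indent=2) + "\n", encoding="utf-8")


print(extended_context_length(YARN_SCALING))
```

A factor of 4.0 applied to the original 32,768 positions gives the 131,072-token context length quoted in the introduction.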

Evaluation

Performance comparison between Qwen2-7B-Instruct and similar-sized instruction-tuned LLMs:

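| Dataset | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
|---|---|---|---|---|---|
| English | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| TheoremQA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| Coding | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| Evalplus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| Mathematics | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| Chinese | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |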

Citation

If you find the Qwen2 model helpful in your work, please cite:

@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}

License

The Qwen2 model is licensed under the Apache 2.0 License.

Credits and Support