lucataco / phixtral-2x2_8

phixtral-2x2_8 is the first Mixure of Experts (MoE) made with two microsoft/phi-2 models, inspired by the mistralai/Mixtral-8x7B-v0.1 architecture

  • Public
  • 386 runs
  • GitHub
  • License

Run time and cost

This model costs approximately $0.015 to run on Replicate, or 66 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 65 seconds. The predict time for this model varies significantly based on the inputs.

Readme

phixtral-2x2_8

phixtral-2x2_8 is the first Mixure of Experts (MoE) made with two microsoft/phi-2 models, inspired by the mistralai/Mixtral-8x7B-v0.1 architecture. It performs better than each individual expert.

You can try it out using this Space.

🏆 Evaluation

The evaluation was performed using LLM AutoEval on Nous suite.

Model AGIEval GPT4All TruthfulQA Bigbench Average
phixtral-2x2_8 34.1 70.44 48.78 37.82 47.78
dolphin-2_6-phi-2 33.12 69.85 47.39 37.2 46.89
phi-2-dpo 30.39 71.68 50.75 34.9 46.93
phi-2 27.98 70.8 44.43 35.21 44.61

Check YALL - Yet Another LLM Leaderboard to compare it with other models.

🧩 Configuration

The model has been made with a custom version of the mergekit library (mixtral branch) and the following configuration:

base_model: cognitivecomputations/dolphin-2_6-phi-2
gate_mode: cheap_embed
experts:
  - source_model: cognitivecomputations/dolphin-2_6-phi-2
    positive_prompts: [""]
  - source_model: lxuechen/phi-2-dpo
    positive_prompts: [""]

💻 Usage

Here’s a Colab notebook to run Phixtral in 4-bit precision on a free T4 GPU.

!pip install -q --upgrade transformers einops accelerate bitsandbytes

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "phixtral-2x2_8"
instruction = '''
    def print_prime(n):
        """
        Print all primes between 1 and n
        """
'''

torch.set_default_device("cuda")

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    f"mlabonne/{model_name}", 
    torch_dtype="auto", 
    load_in_4bit=True, 
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    f"mlabonne/{model_name}", 
    trust_remote_code=True
)

# Tokenize the input string
inputs = tokenizer(
    instruction, 
    return_tensors="pt", 
    return_attention_mask=False
)

# Generate text using the model
outputs = model.generate(**inputs, max_length=200)

# Decode and print the output
text = tokenizer.batch_decode(outputs)[0]
print(text)

Inspired by mistralai/Mixtral-8x7B-v0.1, you can specify the num_experts_per_tok and num_local_experts in the config.json file (2 for both by default). This configuration is automatically loaded in configuration.py.

vince62s implemented the MoE inference code in the modeling_phi.py file. In particular, see the MoE class.

🤝 Acknowledgments

A special thanks to vince62s for the inference code and the dynamic configuration of the number of experts. He was very patient and helped me to debug everything.

Thanks to Charles Goddard for the mergekit library and the implementation of the MoE for clowns.

Thanks to ehartford and lxuechen for their fine-tuned phi-2 models.