
Microsoft Releases Phi-2, a Small LLM That Outperforms Llama 2 and Mistral 7B (venturebeat.com)

An anonymous reader quotes a report from VentureBeat: Microsoft Research, the blue-sky division of the software giant, [...] announced the release of its Phi-2 small language model (SLM), a text-to-text AI program that is "small enough to run on a laptop or mobile device," according to a post on X. At the same time, Phi-2, with its 2.7 billion parameters (connections between artificial neurons), boasts performance comparable to other, much larger models, including Meta's Llama 2-7B with its 7 billion parameters and even Mistral-7B, another 7-billion-parameter model.

Microsoft researchers also noted in their blog post on the Phi-2 release that it outperforms Google's brand-new Gemini Nano 2 model despite the Google model having half a billion more parameters, and that it delivers less "toxicity" and bias in its responses than Llama 2. Microsoft also couldn't resist taking a little dig at Google's now much-criticized, staged demo video for Gemini, in which Google showed its forthcoming largest and most powerful new AI model, Gemini Ultra, solving fairly complex physics problems and even correcting students' mistakes on them. As it turns out, even though Phi-2 is likely a fraction of the size of Gemini Ultra, it was also able to correctly answer the question and correct the student using the same prompts.

However, despite these encouraging findings, there is a big limitation with Phi-2, at least for the time being: it is licensed for "research purposes only," not commercial use, under a custom Microsoft Research License, which further states that Phi-2 may only be used for "non-commercial, non-revenue generating, research purposes." So, businesses looking to build products atop it are out of luck.

This discussion has been archived. No new comments can be posted.


  • Someone should just start using proper AI names like Skynet, HAL, Agent Smith, and Roy (Batty).

  • Outperform? (Score:5, Insightful)

    by RazorSharp ( 1418697 ) on Saturday December 16, 2023 @10:16AM (#64085705)

    I'm confused by what they mean by "outperform." Is there a standard test by which we evaluate an LLM? If one is better at certain tasks and another is better at other tasks, which is outperforming the other? Are they just talking about the hardware requirements? What about this whole "toxicity" claim? What metrics measure that?

    It's lazy that the "outperform" description is parroted without being challenged. Maybe this half-assed article that just repeats the press release talking points was written by the very LLM it talks about.

    • Sounds like the LLM running this bot doesn't use RAG, so it's unable to generate meaningful comments (since the article text isn't in its training corpus). For anyone who wants to know but is likewise unable to click on links, they do in fact use a variety of benchmarks designed to test various capabilities. Let me quote TFA:

      Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3 shot with CoT), [...]
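
      For the curious, here is a minimal sketch of what a few-shot benchmark run of that sort looks like in code. The checkpoint name "microsoft/phi-2" and the toy arithmetic items are placeholders for illustration; real suites like BBH or MMLU use far larger task sets and much more careful answer parsing.

      # Toy few-shot accuracy check against a causal LM (illustrative only).
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "microsoft/phi-2"  # assumed Hugging Face identifier
      tok = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name)

      few_shot = [("2 + 2 = ?", "4"), ("3 * 3 = ?", "9"), ("10 - 7 = ?", "3")]  # the "shots"
      eval_set = [("5 + 6 = ?", "11"), ("8 / 2 = ?", "4")]                      # items to score

      correct = 0
      for question, answer in eval_set:
          # Build a 3-shot prompt, then greedily decode a short continuation.
          prompt = "".join(f"Q: {q}\nA: {a}\n" for q, a in few_shot) + f"Q: {question}\nA:"
          inputs = tok(prompt, return_tensors="pt")
          output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
          completion = tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
          correct += answer in completion  # crude string match stands in for real answer parsing

      print(f"toy 3-shot accuracy: {correct / len(eval_set):.2f}")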

      • Actually, I read the article on VentureBeat but not the Microsoft press release, from which I assume you pulled your quote. Those tests mean nothing to me, not because I am some AI, as you jokingly point out, but because I am generally ignorant regarding the specifics of LLMs. My rhetorical questions were not suggesting that Microsoft lacked a metric to quantify their claims; I'm just skeptical of our ability to quantify the quality of an LLM.

        A good example is the Flesch-Kincaid readability tests. The idea is to f

    • Re:Outperform? (Score:4, Insightful)

      by gizmo2199 ( 458329 ) on Saturday December 16, 2023 @11:33AM (#64085759) Homepage

      > and delivers less "toxicity" and bias

      Ironically you have to bias the model so it doesn't deliver "toxicity". I mean, the concept of "toxicity" is already biased toward a Western liberal understanding of various -isms that isn't used by most of the world. I mean, try filing a racial bias lawsuit in Indonesia or Nigeria and see how far you'll get...
      Or try explaining to someone in Dubai how a picture of black people should never be referred to as resembling gorillas because it is racist according to left-wing Western ideologies, even though some humans can optically resemble animals.

      • Not that ironic. It requires tuned reverse bias to counteract systemic bias. That's just math.
        Whether or not that's right or wrong isn't a discussion I care to get dragged into.
      • Or try explaining to someone in Dubai how a picture of black people should never be referred to as resembling gorillas because it is racist according to left-wing Western ideologies, even though some humans can optically resemble animals.

        Yeah sure, that's totally the motivation behind why certain people refer to black people as gorillas*.

        * For the benefit of the completely oblivious I'll point out that I'm being sarcastic.

    • Isn't "Small LLM" an oxymoron?

      • by Rei ( 128717 )

        Well, all terms are relative. I often train with the even smaller "TinyLLaMA" (1.1B), so that's a "tiny large language model" ;)

        Phi-2 might be a nice basis for training (it should be small enough that it'd be no problem to finetune with a single 24GB consumer card), but unfortunately the license isn't open enough by my standards (TinyLLaMA is Apache licensed, as are Mistral-7B, Mixtral-8x7B, and all of their derivatives, alongside Falcon (up to 40B) and a good number of others).
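
        As a ballpark of what that single-card finetune setup might look like, here is a minimal LoRA sketch using the peft library. The checkpoint name and the target module names are assumptions that may vary with library versions, and the research-only license caveat from the summary still applies.

        import torch
        from peft import LoraConfig, get_peft_model
        from transformers import AutoModelForCausalLM

        # Load the base model in bf16 to keep memory use well under 24 GB.
        model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.bfloat16)

        # Attach low-rank adapters; the module names below are assumed, check your model's layer names.
        lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
        model = get_peft_model(model, lora)   # only the small adapter matrices get gradients
        model.print_trainable_parameters()    # typically a fraction of a percent of the 2.7B total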

    • Re:Outperform? (Score:4, Informative)

      by Rei ( 128717 ) on Saturday December 16, 2023 @01:21PM (#64085899) Homepage

      There are a LOT of different metrics for evaluating LLM performance. The concern, however, is that since most of them are open, there's the possibility of "training to the test," so it's questionable how much you can trust them.

      My bigger issue with this is that their comparison is LLaMA 2 and Mistral 7B. But these are "old" models by now. Mistral in particular - 7B is now known as "Mistral small", and was considered just a prototype. The most advanced one that's been fully released is Mixtral 8x7B, which is a MoE (Mixture of Experts), sort of a mini-GPT-4, which outperforms models far larger than it. Not yet released (but available via API) is Mistral Medium, which on tests seems to be practically GPT-4 level performance, but at 1/10th the API cost.

      Mixtral-8x7B has 8 experts, trained such that whichever expert is best at predicting the next token gets to learn from that token (but offset so that learning is spread evenly through each expert model). In inference, for each token and for each layer, whichever two experts are rated as best for processing it do so, with the layer's output being a weighted sum of the two experts' outputs. The training can be better spread across training compute nodes, while in inference you're effectively only running a 13B model at any point in time, so it's much faster (and again you have more potential for distribution across multiple inference nodes).
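
      As a rough illustration of that routing scheme, here is a toy top-2 MoE layer in PyTorch - not Mixtral's actual implementation, and the dimensions and expert count are arbitrary.

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class Top2MoELayer(nn.Module):
          """Toy mixture-of-experts feed-forward layer with top-2 token routing."""
          def __init__(self, d_model=64, d_ff=256, n_experts=8):
              super().__init__()
              self.gate = nn.Linear(d_model, n_experts)
              self.experts = nn.ModuleList(
                  nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                  for _ in range(n_experts)
              )

          def forward(self, x):                       # x: (tokens, d_model)
              scores = self.gate(x)                   # (tokens, n_experts)
              top_w, top_i = scores.topk(2, dim=-1)   # pick the two best experts per token
              top_w = F.softmax(top_w, dim=-1)        # renormalize their gate weights
              out = torch.zeros_like(x)
              for k in range(2):                      # weighted sum of the two experts' outputs
                  for e, expert in enumerate(self.experts):
                      mask = top_i[:, k] == e
                      if mask.any():
                          out[mask] += top_w[mask, k:k+1] * expert(x[mask])
              return out

      print(Top2MoELayer()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])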

      Honestly, I'm really excited about MoEs - this is definitely the way forward (and indeed, it's much closer to how biological systems work - we don't have every neuron in a "layer" (not that we use layers) talking to every other neuron; compute in our brains is mostly regionalized).

      I think there's even some potential in MoEs solving the hallucination issue. It's already known that via repeated runs you can suss out hallucination in LLMs, because LLMs confident in an answer tend to give similar results with each run, while hallucinating LLMs can give wildly different answers in the same environment. Well, picture that you have a MoE model with hundreds of experts, with a dozen or more running at each layer for each token. Now you have the ability to test: are these experts each giving relatively similar outputs, or are they wildly divergent? Dot-producting the outputs to get similarity metrics should give you - I would think - a measure of confidence. And in the case of low confidence, you can boost the odds of an "uncertainty" token. Uncertainty tokens can then be used in the finetune to teach the model how to react appropriately to uncertainty.
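
      A back-of-the-envelope version of that agreement check, assuming you can get at the per-expert outputs for a single token (the function and shapes here are made up for illustration):

      import torch
      import torch.nn.functional as F

      def expert_agreement(expert_outputs):
          """expert_outputs: (n_active_experts, d_model) for one token; returns mean pairwise cosine similarity."""
          normed = F.normalize(expert_outputs, dim=-1)
          sims = normed @ normed.T                                        # pairwise cosine similarities
          off_diag = sims[~torch.eye(sims.shape[0], dtype=torch.bool)]    # ignore self-similarity
          return off_diag.mean()                                          # high = experts agree, low = uncertain

      confident = torch.randn(1, 16).repeat(6, 1) + 0.01 * torch.randn(6, 16)  # near-identical expert outputs
      divergent = torch.randn(6, 16)                                           # unrelated expert outputs
      print(expert_agreement(confident).item(), expert_agreement(divergent).item())

      A threshold on a score like that could then drive the "uncertainty token" boosting described above.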

    • Not to mention that saying that something outperforms Llama 2 is roughly equivalent to saying that it's smarter than a banana.
  • by Ecuador ( 740021 ) on Saturday December 16, 2023 @10:30AM (#64085719) Homepage

    So.. It really whips the Llama's ass?

  • Reminds me of jumbo shrimp.

  • "Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but shows clear boost in Phi-2 benchmark scores."

    Seems present-day inference schemes are leaving a lot of value on the table. Would be interesting to see whether this kind of scaling could be emulated during inference rather than baked into the model. If it were possible you could do speculative inference batching to scale performance when needed, to minimize RAM requirements and possibly memory bandwidth.

    • by Rei ( 128717 )

      This sort of thing isn't complex, nor is it new. To change a model's size, you can just add or remove layers, then resume training for some number of tokens; the model continues to make use of the preexisting knowledge, just readjusting to use the new layers / compensate for the loss of layers.
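
      Roughly, the kind of depth change being described, on a generic stack of blocks - duplicating existing layers is just one common way to initialize the new ones, not necessarily what Microsoft did for Phi-2:

      import copy
      import torch.nn as nn

      def grow_depth(layers: nn.ModuleList, extra: int) -> nn.ModuleList:
          """Append `extra` blocks, each initialized as a copy of an existing block; training then resumes."""
          grown = list(layers)
          for i in range(extra):
              grown.append(copy.deepcopy(layers[i % len(layers)]))
          return nn.ModuleList(grown)

      # Example: grow an 8-block toy stack to 12 blocks, then continue training as usual.
      blocks = nn.ModuleList(nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(8))
      blocks = grow_depth(blocks, extra=4)
      print(len(blocks))  # 12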

      • Would be interesting to see whether this kind of scaling could be emulated during inference rather than baked into the model. If it were possible you could do speculative inference batching to scale performance when needed, to minimize RAM requirements and possibly memory bandwidth.

        This sort of thing isn't complex, nor is it new. To change a model's size, you can just add or remove layers, then resume training for some number of tokens; the model continues to make use of the preexisting knowledge, just readjusting to use the new layers / compensate for the loss of layers.

        You are talking about training while I'm explicitly talking about inference. If you do the above, you just end up with a bigger model that requires more RAM and bandwidth to execute.

        The trick would be tweaking inference in some way to achieve the same outcomes as larger models without increasing RAM requirements. Something more akin to moving slightly away from FNNs... iterating or reprocessing various layers in some way.
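
        One toy version of that "reprocess the layers" idea is to run a weight-tied block several times at inference, so effective depth grows without adding parameters or RAM - purely a speculative sketch, not something Phi-2 or Mistral actually do:

        import torch
        import torch.nn as nn

        class LoopedBlock(nn.Module):
            """Apply one transformer block repeatedly with shared weights."""
            def __init__(self, d_model=64, nhead=4, loops=3):
                super().__init__()
                self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                self.loops = loops  # extra "depth" at inference, same parameter count

            def forward(self, x):
                for _ in range(self.loops):
                    x = self.block(x)
                return x

        x = torch.randn(2, 10, 64)      # (batch, sequence, d_model)
        print(LoopedBlock()(x).shape)   # torch.Size([2, 10, 64])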

        • by Rei ( 128717 )

          I seem to have misunderstood you. Though I personally think MoE models are a better choice. If you're going to be selecting models, might as well select them for each layer for each token, and have them be as specialized to the task as possible.

  • It seems to be only available on Azure Cloud. Please tell me if I'm wrong.
