I haven’t gotten around to adding Klein to my GenAI Showdown site yet, but if it’s anything like Z-Image Turbo, it should perform extremely well.
For reference, Z-Image Turbo scored 4 out of 15 points on GenAI Showdown. I’m aware that doesn’t sound like much, but given that Flux.2 (32B), one of the largest and significantly heavier-weight models, only managed to outscore ZiT (a 6B model) by a single point, that’s still damn impressive.
I am amazed, though not entirely surprised, that these models keep getting smaller while quality and effectiveness increase. Z-Image Turbo is wild; I'm looking forward to trying this one out.
Quality is increasing, but these small models have very little knowledge compared to their big brothers (Qwen Image, full-size Flux 2): characters, artists, specific items, etc.
There are probably some more subtle tipping points that small models hit too. One of the challenges of a 100GB model is that there is non-trivial difficulty in downloading and running the thing that a 4GB model doesn't face. At 4GB I think it might be reasonable to assume that most devs can just try it and see what it does.
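To make the "4GB vs 100GB" point concrete, here's a rough back-of-the-envelope sketch of how parameter count translates to download/VRAM footprint. This is my own illustration, not anything from the model cards: it assumes dense weights only and ignores activations, text encoders, and format overhead.

```python
def approx_weight_size_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough on-disk / in-VRAM footprint of a dense model's weights.

    Ignores activations, auxiliary encoders, and file-format overhead,
    so treat the result as a lower bound on what you actually need.
    """
    return params_billions * 1e9 * bytes_per_param / 1e9

# A 4B model in fp16 (~2 bytes/param) is ~8 GB; quantized to 8-bit, ~4 GB.
# A 32B model in fp16 is ~64 GB -- a very different download-and-try proposition.
print(approx_weight_size_gb(4, 2))   # 8.0
print(approx_weight_size_gb(32, 2))  # 64.0
```

The gap between "fits on a consumer GPU after quantization" and "needs a multi-GPU rig or offloading" is exactly the tipping point described above.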
I was trying to get it to create an image of a tiger jumping on a pogo stick. That full scene is evidently beyond its capabilities, but it cannot even create an image of a pogo stick in isolation.
This is where smaller models are just going to be more constrained and will require additional prompting to coax out the physical description of a "pogo stick". I had similar issues when generating Alexander the Great leading a charge on a hippity-hop / space hopper.
> FLUX.2 [klein] 4B The fastest variant in the Klein family. Built for interactive applications, real-time previews, and latency-critical production use cases.
I wonder what kind of use cases could be "latency-critical production use cases"?
If we think of GenAI models as a form of compression: generally, text compresses extremely well; images and video do not. Yet state-of-the-art text-to-image and text-to-video models are often much smaller (in parameter count) than large language models like Llama-3. Maybe vision models are small because we’re not actually compressing very much of the visual world. The training data covers a narrow, human-biased manifold of common scenes, objects, and styles. The combinatorial space of visual reality remains largely unexplored. I’m curious what else is out there, beyond that human-biased manifold.
> Generally, text compresses extremely well. Images and video do not.
Is that actually true? I'm not sure it's fair to compare lossless compression ratios of text (abstract, noiseless) to images and video that innately have random sampling noise. If you look at humanly indistinguishable compression, I'd expect that you'd see far better compression ratios for lossy image and video compression than lossless text.
Images and video compress vastly better than text. You're lucky to get 4:1 to 6:1 compression of text (1), while the best perceptual codecs for static images are typically visually lossless at 10:1 and still look great at 20:1 or higher. Video compression is much better still due to temporal coherence.
1: Although it looks like the current Hutter competition leader is closer to 9:1, which I didn't realize. Pretty awesome by historical standards.
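To make the ratio comparison concrete, here's a toy measurement of a lossless text-compression ratio using Python's stdlib zlib. It's deliberately simple: highly repetitive text compresses far beyond the 4:1–6:1 typical of natural prose (which is why benchmarks like the Hutter Prize use a fixed real-world corpus, and why their entries reach much higher ratios than zlib does).

```python
import zlib

# Repetitive English-ish sample; real prose compresses much less,
# purpose-built compressors on large corpora compress more.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

compressed = zlib.compress(text, level=9)
ratio = len(text) / len(compressed)
print(f"{len(text)} -> {len(compressed)} bytes, ratio {ratio:.1f}:1")
```

The lossy-vs-lossless asymmetry in the comments above is the key caveat: a JPEG at 20:1 has thrown information away, whereas this round-trips exactly (`zlib.decompress(compressed) == text`).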
I appreciate that they released a smaller version that is actually open source. It creates a lot more opportunities when you do not need a massive budget just to run the software. The speed improvements look pretty significant as well.
Flux2 Klein isn’t some generational leap or anything. It’s good, but let’s be honest, this is an ad.
What will be really interesting to me is the release of Z-Image. If that goes the way it’s looking, it’ll be a natural-language SDXL 2.0, which seems to be what people really want.
Releasing the Turbo/Distilled/Finetune months ago was a genius move really. It hurt Flux and Qwen releases on a possible future implication alone.
If this was intentional, I can’t think of the last time I saw such shrewd marketing.
I’m a bit confused, both you and another commenter mention something called Z-Image, presumably another Flux model?
Your frame of it is speculative, i.e. it is forthcoming. Theirs is present tense. Could I trouble you to give us plebes some more context? :)
e.g. parsed as-is, and allowing for the general confusion if you’re unfamiliar, it is unclear how one can observe “the way it is looking”, especially if Turbo was released months ago and some other model remains unreleased. I chose to bother you because the other's comment was less focused on lab-on-lab strategy.
Z-Image is another open-weight image-generation model by Alibaba [1]. Z-Image Turbo was released around the same time as (non-Klein) FLUX.2 and received a generally warmer community response [2], since Z-Image Turbo was faster, also high-quality, and reportedly better at generating NSFW material. The base (non-Turbo) version of Z-Image is not yet released.
Z-Image is roughly as censored as Flux 2, from my very limited testing. It got popular because Flux 2 is just really big and slow. It is, however, great at editing, has an amazing breadth of built in knowledge, and has great prompt adherence.
Z Image got popular because the people stuck with 12GB video cards could still use it, and hell - probably train on it, at least once the base version comes out. I think most people disparaging Flux 2 never tried it as they wouldn't want to deal with how slowly it would work on their system, if they even realize that they could run it.
Ahh I see, and Klein is basically a response to Z-Image Turbo, i.e. another 4-8B sized model that fits comfortably on a consumer GPU.
It’ll be interesting to see how the NSFW catering plays out for the Chinese labs. I was joking a couple months ago to someone that Seedream 4’s talents at undressing was an attempt to sow discord and it was interesting it flew under the radar.
Post-Grok going full gooner pedo, I wonder if Grok will take the heat alone moving forward.
They are underselling Z-Image Turbo somewhat. It's arguably the best overall model for local image generation for several reasons including prompt adherence, overall output quality and realism, and freedom from censorship, even though it's also one of the smallest at 6B parameters.
ZIT is not far short of revolutionary. It is kind of surreal to contemplate how much high-quality imagery can be extracted from a model that fits on a single DVD and runs extremely quickly on consumer-grade GPUs.
Hold on now. Z-Image Turbo has gotten a lot of hype, but it's worse than Qwen Image and Flux 2 (the full-sized version) at all of those things, other than perhaps making images look like they were shot on a cell phone camera. Once you get away from photographic portraits of people, it quickly shows just how little it can do.
However, I’m already expecting the blowback when a Z-Image release doesn’t wow people like the Turbo finetune does. SDXL hasn’t been out two years yet; seems like a decade.
We’ll see. I’m hopeful that Z works as expected and sets a new high-water mark. I’m just not sure it does it right out of the gate.
Almost afraid to ask, but any time Grok or X or Musk comes up, I am never sure if there is some reality-based thing or some “I just need to hate this” thing. Sometimes they’re the same thing; other times they aren’t.
I can guess here that because Grok likely uses WAN that someone wrote some gross prompts and then pretended this is an issue unique to Grok for effect?
A few days ago people were replying to every image on Twitter saying "Grok, put him/her/it in a bikini" and Grok would just do it. It was minimum effort, maximum damage trolling and people loved it.
Ah. So, see, this is exactly why I need to check apparently.
Personally, I go between “I don’t care at all” and “well it’s not ideal” on AI generations. It’s already too late, but the barrier of entry is a lot lower than it was.
But I’m applying a good faith argument where GP does not seem to have intended one.
Reducing it to “some people put people in bikinis for a couple of days for the lulz” is... not quite what happened.
You may note I am no shrinking violet, nor do I lack perspective, as evidenced by my notes on Seedream. And fortuitously, I mentioned it before being dismissed as bad faith: I could not have foreseen needing to cite it as credentials until now.
I don't think it's kind to accuse others of bad faith, as evidenced by my not passing judgement on the description from the person you're replying to.
I do admit it made my stomach churn a little bit to see how quickly people will other. Not on you, I'm sure I've done this too. It's stark when you're on the other side of it.
Nah it's been happening for months and involved kids, over and over, albeit for the same reasoning, lulz & totally based. I am a bit surprised that you thought this was just a PG-rated stunt on X for a couple days, it's been in the news for weeks, including on HN.
Local model comparisons only:
https://genai-showdown.specr.net/?models=fd,hd,kd,qi,f2d,zt
An older thread on this has a lot of comments: https://news.ycombinator.com/item?id=46046916
Z-Image / Flux 2 / Hidream / Omnigen2 / Qwen Samples:
https://imgur.com/a/tB6YUSu
Tiger on pogo stick: https://i.imgur.com/lnGfbjy.jpeg
Dunno what this is, but it's not a pogo stick: https://i.imgur.com/OmMiLzQ.jpeg
Nano Banana Pro FTW: https://i.imgur.com/6B7VBR9.jpeg
[1] https://tongyi-mai.github.io/Z-Image-blog/
[2] https://www.reddit.com/r/StableDiffusion/comments/1p9uu69/no...
It is, however, small and quick.