Skyvern Browser Agent 2.0: How We Reached State of the Art in Evals

(blog.skyvern.com)

49 points | by suchintan 182 days ago

6 comments

happyopossum 182 days ago
Many of the examples given for agents such as this are things I just flat wouldn’t trust an LLM to do - buying something on Amazon for example: Will it pick new or ‘renewed’? Will it select an item that is from a janky looking vendor and may be counterfeit? Will it pick the cheapest option for me? What if multiple colors are offered?
This one example alone has so many branches that would require knowing what’s in my head.
On the flip side, it’s a ridiculously simple task for a human to do for themselves, so what am I truly saving?
Call me when I can ask it to check the professional reviews of X category on N websites (plus YouTube), summarize them for me, and find the cheapest source for the top 2 options in the category that will arrive in Y days or sooner.
That would be useful.
[-]
- suchintan 182 days ago
  This is a great point -- the example we chose was meant to be a consumer example that we could relate with.. however a similar example exists for the enterprise which may be more interesting
  Let's say that you are a parts procurement shop and want to order 10,000 of SKU1, and 20,000 of SKU2. If you go on parts websites like finditparts.com, you'll see that there is little ambiguity when it comes to ordering specific SKUs
  We've seen cases of companies that want to automate item ordering like this on tens of different websites, and have people (usually the CEO) spending a few hours a week doing it manually.
  Writing a script can take ~10-20hours to do it (if you know how to code).. but we can help you automate it in <30 minutes with Skyvern, even if you don't know how to code!
- abrichr 181 days ago
  This is exactly why in https://github.com/OpenAdaptAI/OpenAdapt:
  1. Our goal is to automate repetitive tasks (not simple one off tasks that need human input, but repetitive tasks that human shouldn't be doing in the first place), and
  2. The way we do this is by grounding in human demonstrations, so that the model has strict guardrails around what it can and can't do.
- Fnoord 182 days ago
  I got Amazon Prime. If it has Prime, it is a no-brainer. Free return for 30 days. No S&H costs. Only cost is my time.
  [-]
  - drdaeman 182 days ago
    Yea, but LLMs cannot reason - we've all seen them blurt out complete non-sequitur, or end up in death loops of pseudo-reasoning (e.g. https://news.ycombinator.com/item?id=42734681 has a few examples). I don't think one should trust an LLM to pick Prime products all the time even if that's very explicitly requested - I'm sure it's possible to minimize errors so it'll do the right thing most of the time, but having a guarantee that it won't pick non-Prime item sounds impossible. Same for any other tasks - if there is a way to make a mistake, a mistake will be eventually made.
    (Idk if we can trust a human either - brain farts are a thing after all, but at least humans are accountable. Machines are not - at least not at the moment.)
    [-]
    - lyime 182 days ago
      To your last point -- Humans make mistakes too. I asked my EA to order a few things for our office a few days ago, and she ended up ordering things that I did not want. In this case I could have wrote a better prompt. Even with a better prompt she could have ordered the unwanted item. This is a reversible decision.
      So my point is, that while you might get some false positives, it's worth automating as long as many of the decisions are reversible or correctable.
      You might not want to use this in all cases, but it's still worthwhile for many many cases. The use case worth automating depends on the acceptable rate of error for the given use case.
    - Fnoord 182 days ago
      You cannot trust a human to avoid buying crap on Amazon but like I said with Prime the only cost is time (and, to be fair: Co2 footprint).
      Dynamic CVV would mean you'd have to authorize the payment. If amount seems off, decline.
      To be clear, I don't think I'd use it but if it could save you time (a precious value, in our day and age) with good signal to noise ratio it is win-win for user, author, and Amazon.
      If you want to buy an Apple device from a trusted party, including trusted accessories, then there's apple.com. My point being: buying from there is much more secure. But even then, there is no '1 iPhone 16'; there's variants. Many of them.
      [-]
      - bravura 181 days ago
        If it goes on your skin or in your body, do you trust Amazon 3rd party sellers?
        [-]
        Fnoord 181 days ago
        My rule of thumb is: not more or less than AliExpress, though Prime is a bit reassuring. I got Chinese tech which gives me skin rashes, or smells really bad. The advantage of Amazon though, is that I can sent back items without much hassle. Doesn't work with AliExpress.
  - CryptoBanker 182 days ago
    If it fails enough times and you have to return enough items…well, Amazon has been known to ban people for that.
    If you have an AWS account created before 2017, am Amazon ban means an AWS ban
- sureglymop 181 days ago
  Well.. since llms are the new search engines here's something to ask yourself. Would google reorder the results to show you more personalized results? Would they inject unrelated news into the results? Hell, would they even go so far as do "sponsored" results/ads and put those on top?
  They wouldn't do such outrageous things would they. So, sure, you can trust these llms to do critical things for you just fine. Don't worry.
- mmooss 182 days ago
  You don't trust it yet, like a new human assistant you might hire - will they be able to handle all the variables? Eventually, they earn your trust and you start offloading everything to their inbox.
  [-]
  - paulryanrogers 182 days ago
    No, not like a human assistant. Competent humans will use logical reasoning, non-digital signals like body language and audible clues, and know the limits of their knowledge, so are more likely to ask for missing input. Humans will also be more predictable.
    [-]
    - mmooss 181 days ago
      You're missing the point. The point is, trust grows with familiarity and a track record.
      [-]
      - paulryanrogers 180 days ago
        Humans are easier to trust because (IME) their motivations and reasoning are easier to understand and evaluate.
        [-]
        mmooss 180 days ago
        You trust all sorts of technology and services: Your computer (and of course that includes an incredible, integrated collection of hardware and software), your car, the plane you flew on, your lightswitch, weekly garbage collection, the fire extinguisher, the chair you are sitting in (will it collapse?), your hammer and the nail you just pounded in. The list is effectively infinite.
        This technology is new, but soon it will be old and trusted too.
  - binarymax 181 days ago
    LLMs don’t learn. They’re static. You could try to fine tune, or continually add longer and longer context, but in the end you hit a wall.
    [-]
    - ErikBjare 180 days ago
      You can provide them a significant amount of guidance through prompting. The model itself won't "learn", but if given lessons in the prompt, which you can accumulate from mistakes, it can follow them. You will always hit a wall "in the end", but you can get pretty far!
    - mmooss 181 days ago
      But you can learn how to work with one.
mkagenius 181 days ago
Pre-planned steps by Planner will go wrong more often than not, as it will try to guess the UI layers from its memory/training data. Its better to just ask the "next step" by giving it current state of the UI.
I have built a similar project for mobile automation [1] and the validator phase is not separate rather it's inherently baked in each step since we only ask next step based on current UI and previous actions.
My Planner sometimes goes "Oh, we are still on home screen, let's find the Uber app icon". This sort of self-correcting behaviour was not programmed but the LLM does it on its own.
1. https://github.com/BandarLabs/ClickClickClick - A framework to automate mobile use via any LLM (local/remote)
lyime 182 days ago
This is an impressive tool. I especially like the observability around the workflow and the steps it takes to achieve the outcome. We are potentially interested in exploring this if we can get the cost down at scale.
[-]
- suchintan 182 days ago
  I'd love to chat to see how we can help! Here's my email: suchintan@skyvern.com
  We're working on 2 major improvements that will get cost down at scale: 1. We're building a code generation layer under the hood that will start to memorize actions Skyvern has taken on a website, so repeated runs will be nearly free 2. We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page. For example, if you're looking at the product page and want to add a product to cart, the likelihood you'll need to interact with the Reviews page will be 0. No need to send that context along to the LLM
  [-]
  - dataviz1000 182 days ago
    > We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page.
    Computer vision is useful and very quick, however, it has been my experience parsing stacking context is much more useful. The problem is creating a stacking context when a news site embeds a youtube or blusky post. It requires injecting script into each using playwright. (Not mine, but, prior art [0]).
    I've been quietly solving a problem I encountered creating browser agents that didn't have a solution 2 years ago in my free time. Most webpages are several independent global execution contexts and I'm developing a coherent way to get them all to speak with each other. [1]
    > "Go to Amazon.com and add an iPhone 16, a screen protector, and a case to cart"
    Are you familiar with Google Dialogflow? [2] It is a service which returns an object with intent and parameters which make it is to map to automation actions. I asked GhatGPT to help with an example of how Dialogflow might handle your request. [3]
    [0] https://github.com/andreadev-it/stacking-contexts-inspector
    [1] https://news.ycombinator.com/item?id=42576240
    [2] https://cloud.google.com/dialogflow/es/docs/intents-overview
    [3] https://chatgpt.com/share/678ae18d-5370-8004-97d4-f9949887b0...
    [-]
    - MarcelOlsz 181 days ago
      How can I reach out to you?
      [-]
      - dataviz1000 181 days ago
        I'm at [HN username]@gmail.com
wejick 181 days ago
UI is most common interface but not particularly AI friendly, i'll wait for more standardized interface that's both human and AI friendly. Hoping it will still br a browser based.
[-]
- MarcelOlsz 181 days ago
  I am actually building just this. An AI test builder that uses the browser like a human. 60fps realtime streaming and zero-latency interaction (for manual step overriding). Would love to chat!
skull8888888 182 days ago
isn't browser use sota on web voyager? At this point web voyager seems to be outdated, there's def a need for a new harder benchmark.
[-]
- MagMueller 182 days ago
  Yes, this is the report of browser-use with 89%: https://browser-use.com/posts/sota-technical-report
  We definitely need a new dataset with more complex tasks, like uploading files, handling multiple tabs, and handling many more steps.
- suchintan 182 days ago
  Definitely need a newer benchmark.
  I couldn't find where browser-use published their run results (expected to see it here https://github.com/browser-use/eval)
  We went ahead and published our full run at https://eval.skyvern.com so our run could be independently audited
govindsb 182 days ago
congrats Suchintan! huge achievement!