Cheap AI 'Video Scraping' Can Now Extract Data From Any Screen Recording (arstechnica.com) 25
An anonymous reader quotes a report from Ars Technica: Recently, AI researcher Simon Willison wanted to add up his charges from using a cloud service, but the payment values and dates he needed were scattered among a dozen separate emails. Inputting them manually would have been tedious, so he turned to a technique he calls "video scraping," which involves feeding a screen recording video into an AI model, similar to ChatGPT, for data extraction purposes. What he discovered seems simple on its surface, but the quality of the result has deeper implications for the future of AI assistants, which may soon be able to see and interact with what we're doing on our computer screens.
"The other day I found myself needing to add up some numeric values that were scattered across twelve different emails," Willison wrote in a detailed post on his blog. He recorded a 35-second video scrolling through the relevant emails, then fed that video into Google's AI Studio tool, which allows people to experiment with several versions of Google's Gemini 1.5 Pro and Gemini 1.5 Flash AI models. Willison then asked Gemini to pull the price data from the video and arrange it into a special data format called JSON (JavaScript Object Notation) that included dates and dollar amounts. The AI model successfully extracted the data, which Willison then formatted as CSV (comma-separated values) table for spreadsheet use. After double-checking for errors as part of his experiment, the accuracy of the results -- and what the video analysis cost to run -- surprised him.
"The cost [of running the video model] is so low that I had to re-run my calculations three times to make sure I hadn't made a mistake," he wrote. Willison says the entire video analysis process ostensibly cost less than one-tenth of a cent, using just 11,018 tokens on the Gemini 1.5 Flash 002 model. In the end, he actually paid nothing because Google AI Studio is currently free for some types of use.
"The other day I found myself needing to add up some numeric values that were scattered across twelve different emails," Willison wrote in a detailed post on his blog. He recorded a 35-second video scrolling through the relevant emails, then fed that video into Google's AI Studio tool, which allows people to experiment with several versions of Google's Gemini 1.5 Pro and Gemini 1.5 Flash AI models. Willison then asked Gemini to pull the price data from the video and arrange it into a special data format called JSON (JavaScript Object Notation) that included dates and dollar amounts. The AI model successfully extracted the data, which Willison then formatted as CSV (comma-separated values) table for spreadsheet use. After double-checking for errors as part of his experiment, the accuracy of the results -- and what the video analysis cost to run -- surprised him.
"The cost [of running the video model] is so low that I had to re-run my calculations three times to make sure I hadn't made a mistake," he wrote. Willison says the entire video analysis process ostensibly cost less than one-tenth of a cent, using just 11,018 tokens on the Gemini 1.5 Flash 002 model. In the end, he actually paid nothing because Google AI Studio is currently free for some types of use.
I do similar things (Score:5, Interesting)
I take screenshots of a bunch of web pages and then just describe to the LLM what it's looking at, and how I'd like it combined, arranged, formatted (in markdown, to boot). It's rather impressive how well it gets stuff like that right off the bat. Took a task I used to hate to do, and now it takes me 1/10th of the time, if that. It wouldn't surprise me if it works equally well with video, although maybe how cheap it is to do is notable.
Add self-training (Score:2)
While trying to learn a new programming language after years of deep experience developing in several other languages, I gave up on reading the documentation and tutorials and just started asking GPT questions like:
In the Z programming language, how do you define a variable?
What datatypes are built into the language?
How do you do a for loop in the language?
How do you define a function which takes X as an integer parameter, and returns an integer value -1 if X is less than 0, and returns X+1 if X is 0?
Looks like a tool for the incapable (Score:2)
Obviously, you sometimes simply will get a wrong result on top as a bonus. I mean, we are now using "AI" to add numbers?
Re: (Score:1)
Obviously, you sometimes simply will get a wrong result on top as a bonus. I mean, we are now using "AI" to add numbers?
Reminds me of the Google analytics chart showing how many people asked "What's the number for 911?" -- which apparently wasn't a joke.
Re: (Score:2)
That is "special" and not in any good way. Unfortunately, it is entirely credible.
Re: (Score:2)
Re: (Score:2)
Your closed mind has prevented you from realizing that some major LLMs, when faced with a data processing request, write a Python program to process the data, possibly including an OCR library to process text in images. The programs they have to write for these mostly simple requests are equally simple in calculation and data manipulation, and thus usually correct on the first try. The calculations run by the program are obviously 100% correct all the time.
Note that this is also how a human that knows h
Not news??? (Score:4, Funny)
Re:Not news??? (Score:4, Funny)
GoogleyMoogley AI has finished watching all 927 hours of pornographic content on your mobile device and suggests you.......take a seat over there.
Re: Not news??? (Score:3)
Re: (Score:2)
Coming attractions.
I assume they lose something in that use case. (Score:2)
Meaning the people 'selling' porn must not benefit from an AI tool that matches consumers with appropriate content.
Not sure if the websites themselves would lose out (or they don't see value in attempting it for the expected costs)... Or the content creators freak out and leave. Or what.
Or maybe the people who could fund something like that haven't decided to? Meaning even in 2024 we seem to have a lot of people who ignore stuff like violence, lack of food/water/housing, etc... but freak out abou
Re: I assume they lose something in that use case. (Score:2)
I know what you looked at last summer (Score:4, Interesting)
An AI distorting you for more energy and compute power, Microsoft Recall will deliver it in 2025!
one step further (Score:2)
I wonder if he could have taken a second video recording of the JSON result set and asked the AI model to then convert it for him as the desired CSV format...
Re: (Score:2)
That's almost what he did [simonwillison.net]:
I wanted to paste that into Numbers, so I followed up with:
turn that into copy-pastable csv
Which gave me back the same data formatted as CSV.
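For anyone who would rather do the JSON-to-CSV step (and the adding up) locally instead of asking the model again, a minimal sketch follows. The field names "date" and "amount" and the sample records are hypothetical; the actual JSON from the blog post isn't quoted in this thread.

```python
import csv
import io
import json
from decimal import Decimal

# HYPOTHETICAL sample data: made-up dates and amounts, standing in for the
# JSON the model returned. The key names are assumptions too.
SAMPLE_JSON = """
[
  {"date": "2024-01-05", "amount": 12.50},
  {"date": "2024-02-05", "amount": 7.25}
]
"""


def load_records(json_text):
    """Parse the JSON text, reading decimal amounts exactly rather than as floats."""
    return json.loads(json_text, parse_float=Decimal)


def json_to_csv(json_text):
    """Return the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["date", "amount"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(load_records(json_text))
    return buffer.getvalue()


def total_amount(json_text):
    """Add up the 'amount' field across all records."""
    return sum((record["amount"] for record in load_records(json_text)), Decimal("0"))


print(json_to_csv(SAMPLE_JSON))
print("Total:", total_amount(SAMPLE_JSON))
```

Doing the sum in code rather than in the model also means only the extraction needs checking against the emails, not the arithmetic.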
Re: (Score:2)
From the blog post:
Let’s consider the alternatives here.
* I could have clicked through the emails and copied out the data manually one at a time. This is error prone and kind of boring. For twelve emails it would have been OK, but for a hundred it would have been a real pain.
* Accessing my Gmail data programmatically. This seems to get harder every year—it’s still possible to access it via IMAP right now if you set up a dedicated app password but that’s a whole lot of work for a one-off scraping task. The official API is no fun at all.
I expect this took less time than writing a program to select the right emails and the right text within each email. And he got a blog post out of it. Probably doesn't count as a publication, but the publicity can't hurt him or Google.
Neat, but why did you need the video? (Score:2)
I guess it gave you the data you could 'step by step' look at and prove to yourself how well it worked. But I thought the point was to just explain what we wanted to know, and it'd try to do that for us. Especially when something is already in a text format (like emails), it feels b0rken to take recordings of them and feed that into a computer based tool.
Is this just to work around the 'I cannot prove/limit what you will use of a large data source, so I will artificially limit what data you can see instea
Re: (Score:2)
It's just faster and more flexible. Don't need to worry about exporting content to any specific format. Especially for interfaces that don't have any easy export function. Sometimes at work to document things I would just flip through configuration pages while recording.
Correctness? (Score:2, Insightful)
If it's tedious to enter manually, wouldn't it also be tedious to verify?
They did this in the movie Antitrust (Score:2)
Finally a real use case for Recall. (Score:2)