GPT and Other AI Models Can't Analyze an SEC Filing, Researchers Find (cnbc.com) 50

According to researchers from a startup called Patronus AI, ChatGPT and other chatbots that rely on large language models frequently fail to answer questions derived from Securities and Exchange Commission filings. CNBC reports: Even the best-performing artificial intelligence model configuration they tested, OpenAI's GPT-4-Turbo, when armed with the ability to read nearly an entire filing alongside the question, only got 79% of answers right on Patronus AI's new test, the company's founders told CNBC. Oftentimes, the so-called large language models would refuse to answer, or would "hallucinate" figures and facts that weren't in the SEC filings. "That type of performance rate is just absolutely unacceptable," Patronus AI co-founder Anand Kannappan said. "It has to be much much higher for it to really work in an automated and production-ready way." [...]

Patronus AI worked to write a set of more than 10,000 questions and answers drawn from SEC filings from major publicly traded companies, which it calls FinanceBench. The dataset includes the correct answers, and also where exactly in any given filing to find them. Not all of the answers can be pulled directly from the text, and some questions require light math or reasoning. Co-founders Rebecca Qian and Anand Kannappan say it's a test that gives a "minimum performance standard" for language AI in the financial sector. Patronus AI tested four language models: OpenAI's GPT-4 and GPT-4-Turbo, Anthropic's Claude 2 and Meta's Llama 2, using a subset of 150 of the questions it had produced. It also tested different configurations and prompts, such as one setting where the OpenAI models were given the exact relevant source text in the question, which it called "Oracle" mode. In other tests, the models were told where the underlying SEC documents would be stored, or given "long context," which meant including nearly an entire SEC filing alongside the question in the prompt.

GPT-4-Turbo failed at the startup's "closed book" test, where it wasn't given access to any SEC source document. It failed to answer 88% of the 150 questions it was asked, and only produced a correct answer 14 times. It was able to improve significantly when given access to the underlying filings. In "Oracle" mode, where it was pointed to the exact text for the answer, GPT-4-Turbo answered the question correctly 85% of the time, but still produced an incorrect answer 15% of the time. But that's an unrealistic test because it requires human input to find the exact pertinent place in the filing -- the exact task that many hope language models can address. Llama 2, an open-source AI model developed by Meta, had some of the worst "hallucinations," producing wrong answers as much as 70% of the time, and correct answers only 19% of the time, when given access to an array of underlying documents. Anthropic's Claude 2 performed well when given "long context," where nearly the entire relevant SEC filing was included along with the question. It could answer 75% of the questions it was posed, gave the wrong answer for 21%, and failed to answer only 3%. GPT-4-Turbo also did well with long context, answering 79% of the questions correctly, and giving the wrong answer for 17% of them.
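
To make the three configurations concrete, here is a rough sketch of what each one amounts to in a prompt (the wording and the ask_model() helper are illustrative assumptions, not Patronus AI's actual test harness):

    # Rough sketch of the three test configurations described above.
    # ask_model() stands in for any chat-completion API call; the prompt
    # wording is invented for illustration.

    def closed_book(question, ask_model):
        # No source document at all: the model must answer from memory.
        return ask_model("Answer this question about SEC filings: " + question)

    def oracle_mode(question, relevant_passage, ask_model):
        # The exact passage containing the answer is supplied with the question.
        return ask_model("Passage:\n" + relevant_passage + "\n\nQuestion: " + question)

    def long_context(question, full_filing_text, ask_model):
        # Nearly the entire filing is included alongside the question.
        return ask_model("Filing:\n" + full_filing_text + "\n\nQuestion: " + question)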

This discussion has been archived. No new comments can be posted.

  • Well... (Score:5, Insightful)

    by Mr. Dollar Ton ( 5495648 ) on Thursday December 21, 2023 @06:05AM (#64095503)

    "GPT and Other AI Models Can't Analyze" anything at all, they just build an array of plausible follow-up strings to some salt phrase.

    • Re:Well... (Score:4, Insightful)

      by mysidia ( 191772 ) on Thursday December 21, 2023 @10:24AM (#64095835)

      Right... LLMs are a language-modeling tool, not a general-purpose analysis tool.

      Interpreting an SEC filing requires more than the simple parsing and reformulation of English language and numbers.

      It is interesting that ChatGPT has been able to work with code and scripts. But then again, that doesn't necessarily require any complex analysis of data, and there has probably been manual work by the AIs' developers specifically on the problem of interpreting programming languages and common tasks. And there are a ton of internet forums where you can find sample code for generalized tasks.

      SEC filings, on the other hand, are very situation-specific artifacts. And I don't believe there exists a financial equivalent of Stack Overflow that breaks down SEC filings in detail. Research reports and analyst reports on public companies tend to be paywalled and never really available on the open internet for crawling by AIs. Because those reports are used by people to make money investing, they have a ton of value, and access to such analyses generally gets monetized through paid research or subscriptions that sell for top dollar.

      • Yes, it is called domain knowledge. If you don't have it, you can't even understand what the experts are talking about, never mind making a meaningful contribution.

        If I recall correctly, there was a time when it was enough to know a foreign language to qualify as a diplomat in most Western countries. But then the complexity of international relations became such that knowing a language was only one of the requirements, and far from the most important one.

        • Part of it could be a domain knowledge issue, but it's also laziness on the part of the developers.

          A general model with the full context not doing well enough for you? A few options (roughly sketched in code after the list below):

          * Specialize a custom model to the task

          * Break down the process into multiple steps, to minimize context by finding the right section in earlier passes

          * Use a summary or pruning model to trim unwanted info out of the context

          * Run multiple passes to have the model criticize and refine its own output.
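
          Roughly, in Python, assuming a generic ask_llm() wrapper and made-up prompts (a sketch of the pipeline, not production code):

          # Minimal sketch of the multi-pass idea above. ask_llm() is a
          # placeholder for whatever chat-completion call you actually use;
          # the prompts and chunk size are invented for illustration.

          def find_relevant_sections(filing_text, question, ask_llm, chunk_size=4000):
              # Earlier pass: scan the filing in chunks and keep only the parts
              # the model flags as relevant, shrinking the context for later passes.
              keep = []
              for i in range(0, len(filing_text), chunk_size):
                  chunk = filing_text[i:i + chunk_size]
                  verdict = ask_llm("Does this excerpt help answer the question '"
                                    + question + "'? Reply YES or NO.\n\n" + chunk)
                  if verdict.strip().upper().startswith("YES"):
                      keep.append(chunk)
              return "\n".join(keep)

          def answer_with_refinement(filing_text, question, ask_llm):
              context = find_relevant_sections(filing_text, question, ask_llm)
              draft = ask_llm("Context:\n" + context + "\n\nQuestion: " + question)
              # Final passes: the model criticizes and then revises its own draft.
              critique = ask_llm("List any claims in this answer that are not supported "
                                 "by the context.\n\nContext:\n" + context
                                 + "\n\nAnswer: " + draft)
              return ask_llm("Revise the answer using the critique.\n\nAnswer: " + draft
                             + "\n\nCritique: " + critique)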

          • In other words, split the task that "AI" can't handle into things people do well and things computers do well, dump the hard part to the people, let the computer handle the rest and call it "AI", right?

            • TFA says the LLM gave the right answer to 75% of questions.

              But TFA doesn't say that humans do any better. It doesn't mention the human error rate.

              Perhaps we should let LLMs analyse Slashdot articles.

              After it analysed this one, we could ask, "Are LLMs worse than humans at analysing SEC filings?"

              And it might correctly reply: "There isn't enough information in the article to answer that."

              • Bill, let's be serious here: the questions the "AI" appears to have answered correctly according to TFA were questions you don't need thinking for.

                Reading an SEC filing and being able to say that "XXX's filing says the company paid dividends in Q2 of year XXX" or "The margins reported in YYY's report were..." isn't "analysis"; it is the same as data extraction from the output of the SEC's EDGAR.

                "Analysis" and "thinking" presume that you produce something new in your answer that wasn't available in column XXX

      • It is interesting that ChatGPT has been able to work with code and scripts. But then again, that doesn't necessarily require any complex analysis of data,

        Most likely it means that ChatGPT has leetcode problems (and similar [onlinejudge.org]) in its training set. It can solve standard problems but nothing too far outside its training data.

    • by gweihir ( 88907 )

      Indeed. No idea why so many people refuse to see reality here.

      • Indeed. No idea why so many people refuse to see reality here.

        Probably because they've actually bothered to use the technology and have dismissed your rhetoric as inconsistent with reality.

        Every day that goes by for deniers is like the flat earther in a rocket ship watching the planet get rounder and rounder as capabilities of systems continue to improve.

        • by gweihir ( 88907 )

          Sorry, but that is denial. The way the tech works is completely clear and "analyze" is not something it can do. All it can do is fake it by matching analyses contained in the training data, but it does that with zero insight and no fact-checking ability, and it may mush together patterns that do not belong together while faking it, hence "hallucinations". Not-so-smart people who want to see something in this tech that it is not may get fooled by that. I am not.

          As to your insinuation that I have not had a look, fal

          • These days it is more of a vested interest than a denial. Looking at the shiny faces in the article picture, what's left of my feeble natural intelligence cannot help but think of another pair out for "investor money", Beth Holmes and Sunny Whatever.

          • Sorry, but that is denial. The way the tech works is completely clear and "analyze" is not something it can do.

            The ability to analyze was not even tested. Analysis is a process. Asking an LLM a question is like asking a person a question and going with the first thing that comes to mind. Neither of these things constitutes an analysis. If you expect a reliable, rigorous analysis (TFA indicates perfection is required), you need a process in place that enables such a result. It's no different in this regard than if a person were conducting the analysis.

            All it can do is fake it by matching analyses contained in the training data, but it does that with zero insight and no fact-checking ability, and it may mush together patterns that do not belong together while faking it, hence "hallucinations". Not-so-smart people who want to see something in this tech that it is not may get fooled by that. I am not.

            The AI does have insight and can check facts but its capabilitie

            • by gweihir ( 88907 )

              You really have no clue what you are talking about. That sort-of meshes with you being unable to analyze what LLMs can do and what they cannot do, because they also have no clue what they are talking about. So at least, there is symmetry.

              Sorry, not discussing this anymore with somebody that "wants to believe" without actually getting a solid grounding in the facts first.

              • You really have no clue what you are talking about. That sort-of meshes with you being unable to analyze what LLMs can do and what they cannot do, because they also have no clue what they are talking about. So at least, there is symmetry.

                Unlike your baseless commentary I have receipts to support my claims.

                Prompting strategies:
                https://arxiv.org/pdf/2201.119... [arxiv.org]

                Hallucinations:
                https://arxiv.org/pdf/2303.088... [arxiv.org]

                Sorry, not discussing this anymore with somebody that "wants to believe" without actually getting a solid grounding in the facts first.

                I believe what I've seen demonstrated with my own eyes. When they can write valid programs that do what they are instructed to do in a language they've never seen before, via ICL alone, no amount of excuses can explain that capability away.

        • Did you see the kind of questions the "technology" wasn't able to answer?

          Reading an SEC filing and being unable to say, over 20% of the time, whether "XXX's filing says the company paid dividends in Q2 of year XXX" or "The margins reported in YYY's report were..." means you're asking an idiot.

          Asserting that this kind of "AI" fail isn't "analysis" is not denialism, buddy; it is stating the patently obvious.

          • Reading an SEC filing and being unable to say, over 20% of the time, whether "XXX's filing says the company paid dividends in Q2 of year XXX" or "The margins reported in YYY's report were..." means you're asking an idiot.

            Impossible from TFA to tell what they did specifically or which questions were answered correctly, because TFA does not say and there is no further information cited. Your assumption that the example questions in TFA were answered incorrectly is not based on any stated facts.

            As near as I can tell, this entire article is an advertisement for the FinanceBench product of Patronus AI, which seems to have a vested interest in pointing out the failures of generative AI, of which there are many.

            Asserting that this kind of "AI" fail isn't "analysis" is not denialism, buddy; it is stating the patently obvious.

            No analysis was ever conducted.

            • Impossible from TFA to tell

              Only if you cannot read or pretend so.

              No analysis was ever conducted

              Even from first principles, if the "AI" fails to reliably answer factual queries about the data set it's been "trained" on, there is obviously no need to discuss its "analysis" capabilities - it has none.

              If you have used "AI", and know how it works (which you obviously don't so it is pointless to argue the finer points with you) you'd know the "analysis" this "AI" is capable of is about the same as "predicting weather" by saying that tomorrow it will be the same as today, yesterday and the day before weighted with some coefficient. That method gives you roughly 85% correct "predictions" and is totally useless for the remaining 15%, when you actually need a real prediction.

              • Only if you cannot read or pretend so.

                You are pretending; what you asserted was never stated in TFA. This is an irrefutable fact.

                Even from first principles, if the "AI" fails to reliably answer factual queries about the data set it's been "trained" on, there is obviously no need to discuss its "analysis" capabilities - it has none.

                By this same standard humans are incapable of "analysis" because they forget and jumble figures. You can't have it both ways.

                If you have used "AI", and know how it works (which you obviously don't so it is pointless to argue the finer points with you) you'd know the "analysis" this "AI" is capable of is about the same as "predicting weather" by saying that tomorrow it will be the same as today, yesterday and the day before weighted with some coefficient. That method gives you roughly 85% correct "predictions" and is totally useless for the remaining 15%, when you actually need a real prediction.

                My general high-level understanding is that logits from the model output are normalized via softmax to arrive at a probability distribution over tokens. The situation is more complex, and even worse than you describe, because the token that is selected may not even correspond to the best prediction.
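
                Roughly, as a toy sketch (made-up logits, nothing model-specific):

                import numpy as np

                # Toy example: invented logits for four candidate next tokens.
                logits = np.array([2.0, 1.0, 0.5, -1.0])

                def softmax(x, temperature=1.0):
                    z = x / temperature
                    z = z - z.max()          # subtract the max for numerical stability
                    e = np.exp(z)
                    return e / e.sum()

                probs = softmax(logits)       # roughly [0.61, 0.22, 0.14, 0.03]

                # With sampling, the next token is drawn from this distribution,
                # so the selected token is not always the single most probable one.
                rng = np.random.default_rng(0)
                next_token = rng.choice(len(probs), p=probs)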

                • Your general high level understanding is so shaky it is virtually non-existent. Hence the bullshit you spew with conviction. Have a nice day.

                  • Your general high level understanding is so shaky it is virtually non-existent. Hence the bullshit you spew with conviction. Have a nice day.

                    This is what is expected of those who don't have the facts on their side and don't even have a coherent argument. All they can do is pound the table.

  • So this means that AI (Asinine Idiocy) is just like most Americans. No average citizen can read and comprehend legal documents without proper training. Vocabulary is one thing, but the intent and meaning are just crazy with SEC documents.
    • No. Although legal texts and laws are a challenge to interpret (sometimes even with training and background knowledge), a filing is expected to be self-evidently not a waste of time (it's the start of a process; it is the chance to nip a waste of time in the bud), so it demands a level of simplicity and intelligibility in at least the broad strokes: who is involved and what the issue is over. This was not a brutally hard selection of reading material for analysis.

      And that it cannot analyse it throws th

      • I question their testing methods. I've fed ChatGPT lots of documents. Generally, unless you have the ability to upload source docs, you have to copy and paste TEXT into a web form. The problem comes with table data. If you just copy and paste table data from a Word or PDF doc, the tables get mangled unrecognizably. But if I paste them in XML or HTML structure, then I get great results asking questions about the data. All the tags create a lot of overhead cost against data token limits though, reducing the amount you can feed it.
        • All the tags create a lot of overhead cost against data token limits though, reducing the amount you can feed it.

          Does CSV work?

          • All the tags create a lot of overhead cost against data token limits though, reducing the amount you can feed it.

            Does CSV work?

            It might, and it might save some on token overhead. I haven't tried CSV. Truthfully, though, I'd hope GPT is smart enough to tokenize any length of XML/HTML tag as a single token, rather than the general-purpose four bytes of text per token, give or take. If so, then it may not make much difference vs. CSV, because I'd also expect any delimiter to be a token, even if it's only a one-character comma.

            The big problem with processing documents is the token limit per conversation. Even Azure AI's flavor
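
            A quick way to check the overhead yourself, using OpenAI's tiktoken tokenizer and an invented sample table (a sketch; exact counts vary by model):

            import tiktoken

            # The same tiny (made-up) table expressed two ways.
            html = ("<table><tr><th>Quarter</th><th>Revenue</th></tr>"
                    "<tr><td>Q1</td><td>1.2B</td></tr>"
                    "<tr><td>Q2</td><td>1.4B</td></tr></table>")
            csv = "Quarter,Revenue\nQ1,1.2B\nQ2,1.4B"

            enc = tiktoken.encoding_for_model("gpt-4")
            print("HTML tokens:", len(enc.encode(html)))
            print("CSV tokens: ", len(enc.encode(csv)))
            # The CSV form usually comes out to a fraction of the HTML token count,
            # leaving more of the context window for the document itself.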

        • So, why not paste the output of SEC's EDGAR into the form?

          Or, better yet, why not just search EDGAR for the answers you need?

          The only thing that comes to mind when I see the happy pair in the picture of TFA is Beth Holmes/Sunny Balwani out for your "investor" money.

          The value of the "AI" as described in TFA is negative.

    • by gweihir ( 88907 )

      Another good one. I like "Artificial Idiocy" or "Artificial Ignorance", but your expansion is pretty good too.

  • by ihadafivedigituid ( 8391795 ) on Thursday December 21, 2023 @06:22AM (#64095531)
    SEC filings are designed to obfuscate as much as possible, so it's impressive they got any information at all.

    I want to know how the researchers did on their own Voight-Kampff tests before forming an opinion about the AIs' results.
  • Surely this shows that the data should be put straight into an SEC database by the filing organisation. Why does it have to be put in so indirectly that it NEEDS AI to interpret it and find the data?

  • by Anonymous Coward
    Disappointed to hear that researchers are still labelling LLMs as Artificial Intelligence even though they're nothing more than statistically driven random number generators. If people in the industry can't get it right, how are we going to teach marketers and the general public to stop abusing the terminology?
    • by ranton ( 36917 )

      Disappointing there are still people who don't consider machine learning and other computer science techniques to be AI just because they don't fit their personal definition of AI, which is usually restricted to Skynet. AI as a field includes far more than what you think it does.

      • by sapgau ( 413511 )
        AI has rapidly become a generic, catch-all word like "the Cloud".
        So it will be very difficult to use "AI" to describe a particular topic.
  • The weakness of current AI is that it can't tell you anything about what you have given it unless many other people have already written and posted on the internet about the novel thing you are trying to file with the SEC.
    • The weakness of current AI is that it can't tell you anything about what you have given it unless many other people have already written and posted on the internet about the novel thing you are trying to file with the SEC.

      Some models, especially larger models tuned for instruction, have the capability to learn from and apply knowledge provided externally via context.

      For example, I used ICL (in-context learning) to provide a model with documentation describing portions of a proprietary language it had never seen before. It was then able to write programs in that language.

      For a general summary of capabilities and shortcomings:
      https://arxiv.org/pdf/2311.089... [arxiv.org]

      A specific example of ICL being used to provide a model with new capability:
      https://arxiv.org [arxiv.org]
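
      Schematically, ICL here just means the documentation rides along in the prompt rather than in the training data; something like this sketch (the mini-language and the ask_model() wrapper are invented placeholders, not the actual setup):

      # Minimal sketch of in-context learning: unfamiliar documentation is
      # supplied in the prompt itself. "FooLang" and ask_model() are invented.

      LANGUAGE_DOCS = """
      FooLang quick reference (hypothetical):
        let x = 5      -- bind a value to a name
        emit x + 1     -- print the value of an expression
      """

      def write_program(task, ask_model):
          prompt = ("Here is documentation for a language you have not seen "
                    "before:\n" + LANGUAGE_DOCS
                    + "\nUsing only the constructs documented above, write a "
                    "program that does the following: " + task)
          return ask_model(prompt)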

  • Yeah, Bob, says here all of these companies are overextended and floating false narratives.
    Oh, then ask it how long we can sustain that until we can cash out profitably.
    Later: OK, we retrained with quarterly reports.
  • SEC filings can't be analyzed by most fashion models, either.

  • by WaffleMonster ( 969671 ) on Thursday December 21, 2023 @03:06PM (#64096645)

    "There just is no margin for error that's acceptable because especially in regulated industries even if the model gets the answer wrong 1 out of 20 times, that's still not high enough accuracy"

    You don't use LLMs when there is no margin for error, except when it is cheaper to ask and independently check than to do it any other way.

    Specifically, expecting them to do math reliably, or to recall exact, token-hungry figures from hugely scaled context windows, is a fool's errand.
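
    One cheap way to do the "ask and independently check" part, sketched with an assumed ask_llm() helper and response format (an illustration, not a claim about any particular product):

    import json

    # Sketch: the model must cite the exact sentence it relied on, and that
    # citation is verified against the filing text before the figure is trusted.

    def extract_and_verify(filing_text, question, ask_llm):
        reply = ask_llm(
            "Question: " + question + "\n"
            'Respond as JSON: {"answer": ..., "quote": "<verbatim sentence>"}\n\n'
            "Filing:\n" + filing_text
        )
        data = json.loads(reply)
        # Independent check: the quoted sentence must really appear in the filing.
        if data["quote"] not in filing_text:
            raise ValueError("citation not found in source; answer rejected")
        return data["answer"]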

  • They need more training in recognizing conventions and specific contexts before being reliable.

    It is not surprising; using ChatGPT today requires very frequent verification, and picking the correct answer before asking another question, so it can try to respond with a valid context.

    This could improve with a dedicated Large Language Model (LLM) trained solely on these fiscal/financial environments.
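
    A rough sketch of what such domain-specific training could look like with Hugging Face transformers (the "filings.txt" file and the tiny base model are placeholders; a real effort would need far more data, compute and evaluation):

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "gpt2"  # stand-in for a serious base model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # One filing excerpt per line in a plain-text file (placeholder data).
    ds = load_dataset("text", data_files="filings.txt")["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finance-lm", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
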
  • C'mon, how can a model trained on human interaction convincingly talk with lawyers?
