Perplexity AI Faces Scrutiny Over Web Scraping and Chatbot Accuracy (wired.com) 20

Posted by msmash on Thursday June 20, 2024 @08:25AM from the closer-look dept.

Perplexity AI, a billion-dollar "AI" search startup, has come under scrutiny for its data collection practices and accuracy of its chatbot responses. Despite claiming to respect website operators' wishes, Perplexity appears to scrape content from sites that have blocked its crawler, using an undisclosed IP address, a Wired investigation found. The chatbot also generates summaries that closely paraphrase original reporting with minimal attribution. Furthermore, its AI often "hallucinates," inventing false information when unable to access articles directly. Perplexity's CEO, Aravind Srinivas, maintains the company is not acting unethically.

This discussion has been archived. No new comments can be posted.

"bullshit" , it’s surprisingly unclear what (Score:2)

by AleRunner ( 4556245 ) writes:

Why do we care? Perplexity Is a Bullshit Machine, / it’s surprisingly unclear what the AI search startup actually is [wired.com]
Do we finally have the beginning of the end of the AI hype cycle? Maybe it's just the end of the beginning?
- Re: (Score:1)
  
  by buck-yar ( 164658 ) writes:
  
  I too have noticed Perplexity straight up making up citations. But is it just Perplexity bs-ing?
  https://www.cbc.ca/news/canada... [www.cbc.ca]
  https://fortune.com/2023/06/23... [fortune.com]
  https://www.fd.org/news/colora... [fd.org]
  Maybe it's not a bad thing. If it's generally understood to be wrong some part of the time, people will have to check everything instead of just copy/pasting. It's no different from real life, where it's an important skill to discern fact from fiction.
  - Re: (Score:2)
    
    by AleRunner ( 4556245 ) writes:
    
    As far as the making things up goes, I think it's worse than people think. The AI gets trained off any lies that people spot. On the other hand, if a lie doesn't get spotted then on it goes. Furthermore, that lie may well get out into the world and into training documents. That means that it's learning to tell exactly those lies that are impossible to spot. People will check the (fake) AI by searching on the internet and will find validation in an answer that's already been dumped from the AI by some previo
  - Re: (Score:2)
    
    by Pf0tzenpfritz ( 1402005 ) writes:
    
    Problem is that AI will produce bullshit at a thousand times the rate of even the worst human bullshitters.
  - Re: (Score:2)
    
    by Seven Spirals ( 4924941 ) writes:
    
    I had this same problem with ChatGPT 4o today. I asked it for some crime stats and it linked to articles written by Pew Research, the FBI, and Reason magazine. At one point the response contained about eight URLs. Of those eight, only two were real. Not only were the Reason articles all coming up as "404 not found," but searching for the article titles got me nothing. That's because they never existed at all.
Anthropic - I'm looking at you too (Score:5, Interesting)

by TomGreenhaw ( 929233 ) writes: on Thursday June 20, 2024 @08:46AM (#64563429)

I thought I was dealing with a low grade denial of service attack until I saw all the traffic coming from a Claude Bot user agent slurping the same pages relentlessly from dozens of Amazon VMs. Robots.txt was completely ignored. As much as I dislike useless regulation, there ought to be a law...

- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:3)
    
    by TomGreenhaw ( 929233 ) writes:
    
    Not dynamic, I can only assume it's a bug of some kind. It's a shame really. I didn't want to block them as I would like my customer's site to be in AI LLM models.
    
    It may not even be Anthropic, but another actor posing as them.
Despite (Score:2)

by Barny ( 103770 ) writes:

Despite all the other BS Perplexity is doing, which may or may not be accurate, there is one thing here that needs to be addressed:
Robots.txt is not a gatekeeper
In theory, Perplexity’s chatbot shouldn’t be able to summarize WIRED articles, because our engineers have blocked its crawler via our robots.txt file since earlier this year. This file instructs web crawlers on which parts of the site to avoid, and Perplexity claims to respect the robots.txt standard. WIRED’s analysis found that
  - Re: (Score:2)
    
    by bleedingobvious ( 6265230 ) writes:
    
    This ^^
    Point it back to content it generated.. and let it choke to death
  - Re: (Score:2)
    
    by Barny ( 103770 ) writes:
    
    Right, but it only affects automated access (crawling/polling).
    If an end user sends your bot a link to get details from, you are not obliged to check the robots.txt file at all, since it is a user-triggered event and would be akin to the user themselves visiting the page.
    Ugh, I hate having to defend these scumbags. State what you wish about the rest of their shitfuckery, but if they are direct requests as a result of an end user requesting data about a specific page, then robots.txt is not how you stop them.
    - Re: (Score:2)
      
      by znrt ( 2424692 ) writes:
      
      Right, but it only affects automated access (crawling/polling).
      If an end user sends your bot a link to get details from, you are not obliged to check the robots.txt file at all
      that's convoluted, what "end user" would be sending "links" to a crawler? but why? there's no need, the links don't matter, robots.txt is a polite request not to scan the site that no one is obliged to honor.
    - Re: (Score:1)
      
      by daveron ( 2034640 ) writes:
      
      The web server is not obligated to return what was asked for to a crawler that ignores robots.txt; it could instead serve, for example, a bunch of made-up bullshit generated by the AI of the company doing the crawling.
  - Re: (Score:1)
    
    by buck-yar ( 164658 ) writes:
    
    From Google's website:
    A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.
    
    Even if you extend the coverage of this from search engines to AI, it still says it's not for keeping Google off your site. If anything, they are guilty of excess traffic, but robots.txt is apparently not a "keep out" sign.
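
For illustration, the advisory nature of robots.txt described in the quote above can be sketched with Python's standard `urllib.robotparser`. The rules, the `ExampleBot` name, and the `example.com` URLs are hypothetical and not taken from any site discussed here. A polite crawler would run a check like this before fetching; nothing on the server side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is blocked entirely,
# every other bot is only asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for url in ("https://example.com/article", "https://example.com/private/x"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check happens entirely on the crawler's side, which is why a bot that skips it, or sends a different user agent, gets the page anyway unless the server blocks it by other means.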
  - Re: (Score:2)
    
    by znrt ( 2424692 ) writes:
    
    robots.txt is irrelevant and it's the other way around: you can't be both visible and invisible. if you want to expose content publicly you can't really set limits as to who can access it. you can try to waste time and money on filters, blocks and screens, but crawlers will find ways to circumvent any of those. they can become less polite too.
    they are actually quite a pest that's screwing up much of the internet. on the bright side, crawling costs money too, specifically the subsequent training costs a lot,
- Re: (Score:2)
  
  by LordHighExecutioner ( 4245243 ) writes:
  
  If you do not want to have your text scraped by AI bots, just don't publish it on the internet. Unfortunately this makes the internet useless...
Spitting in your face (Score:2)

by RitchCraft ( 6454710 ) writes:

A forum I visit daily was really slow for a few days about a week ago. The forum was showing over 160 "guests" when on any normal day there would be about 20. The forum admin tracked the IP addresses of these "guests" back to Facebook. These guests were visiting every page in the forum in rapid succession for days on end. Robots.txt did nothing to stop them. It looks like a Llama was spitting all over the forum.
AI company is complete scum. What else is new? (Score:2)

by gweihir ( 88907 ) writes:

These people are greedy assholes and do not care who they steal from or how much damage they do.
I'm going to defend it... (Score:2)

by Budenny ( 888916 ) writes:

I've used it quite a bit over the last month and found it generally useful on a variety of subjects.
It's good if you are asking it how to do something very specific in Linux. Not always right, because the answer may be out of date, but it gives a better and quicker pointer to a viable source than doing the screening yourself.
It is sometimes quite wrong because it relies on one inaccurate source, e.g., an opinion piece on an obscure partisan web site. So you have to be wary
On literature and literary criticism
