
Perplexity AI Faces Scrutiny Over Web Scraping and Chatbot Accuracy (wired.com) 20
Perplexity AI, a billion-dollar "AI" search startup, has come under scrutiny for its data collection practices and accuracy of its chatbot responses. Despite claiming to respect website operators' wishes, Perplexity appears to scrape content from sites that have blocked its crawler, using an undisclosed IP address, a Wired investigation found. The chatbot also generates summaries that closely paraphrase original reporting with minimal attribution. Furthermore, its AI often "hallucinates," inventing false information when unable to access articles directly. Perplexity's CEO, Aravind Srinivas, maintains the company is not acting unethically.
"bullshit" , it’s surprisingly unclear what (Score:2)
Why do we care? Perplexity Is a Bullshit Machine, / it’s surprisingly unclear what the AI search startup actually is [wired.com]
Do we finally have the beginning of the end of the AI hype cycle? Maybe it's just the end of the beginning?
Re: (Score:1)
https://www.cbc.ca/news/canada... [www.cbc.ca]
https://fortune.com/2023/06/23... [fortune.com]
https://www.fd.org/news/colora... [fd.org]
Maybe it's not a bad thing. If it's generally understood to be wrong some part of the time, people will have to check everything instead of just copy/pasting. It's no different than real life, where it's an important skill to discern fact from fiction.
Re: (Score:2)
As far as making things up goes, I think it's worse than people think. The AI gets trained off any lies that people spot. On the other hand, if a lie doesn't get spotted then on it goes. Furthermore, that lie may well get out into the world and into training documents. That means that it's learning to tell exactly those lies that are impossible to spot. People will check the (fake) AI by searching on the internet and will find validation in an answer that's already been dumped from the AI by some previo
Re: (Score:2)
Re: (Score:2)
Anthropic - I'm looking at you too (Score:5, Interesting)
Re: (Score:2)
Re: (Score:3)
It may not even be Anthropic, but another actor posing as them.
Despite (Score:2)
Despite all the other BS Perplexity is doing, which may or may not be accurate, there is one thing here that needs to be addressed:
Robots.txt is not a gatekeeper
Re: (Score:2)
This ^^
Point it back to content it generated.. and let it choke to death
Re: (Score:2)
Right, but it only affects automated access (crawling/polling).
If an end user sends your bot a link to get details from, you are not obliged to check the robots.txt file at all, since it is a user-triggered event and would be akin to the user themselves visiting the page.
Ugh, I hate having to defend these scumbags. State what you wish about the rest of their shitfuckery, but if they are direct requests as a result of an end user requesting data about a specific page, then robots.txt is not how you stop them.
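To make the distinction concrete, here is a minimal sketch using Python's standard urllib.robotparser. The bot name "ExampleBot", the example.com URLs, the sample rules and the user_triggered flag are all made up for illustration, and skipping the check for user-triggered fetches is the position argued above, not a settled rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this
# from the site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str,
              user_triggered: bool = False) -> bool:
    """Decide whether a fetch should proceed.

    Automated crawling consults robots.txt. A user-triggered fetch
    skips the check (an assumption taken from the argument above).
    """
    if user_triggered:
        return True
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    url = "https://example.com/article"
    print("crawl as ExampleBot:", may_fetch(parser, "ExampleBot", url))
    print("crawl as OtherBot:", may_fetch(parser, "OtherBot", url))
    print("user-triggered:", may_fetch(parser, "ExampleBot", url, user_triggered=True))
```

Note that nothing here enforces anything: the parser only answers a question, and it is up to the client whether to ask it.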
Re: (Score:2)
Right, but it only affects automated access (crawling/polling).
If an end user sends your bot a link to get details from, you are not obliged to check the robots.txt file at all
that's convoluted, what "end user" would be sending "links" to a crawler? but why? there's no need, the links don't matter, robots.txt is a polite request not to scan the site that no one is obliged to honor.
Re: (Score:1)
Re: (Score:1)
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google.
Even if you extend the coverage of this from search engines to AI, it still says it's not for keeping Google off your site. If anything, they are guilty of excess traffic, but robots.txt is apparently not a "keep out" sign.
Re: (Score:2)
robots.txt is irrelevant and it's the other way around: you can't be both visible and invisible. if you want to expose content publicly you can't really set limits as to who can access it, you can try to waste time and money on filters, blocks and screens, but crawlers will find ways to circumvent any of those. they can become less polite too.
they are actually quite a pest that's screwing up much of the internet. on the bright side, crawling costs money too, specifically the subsequent training costs a lot,
Re: (Score:2)
Spitting in your face (Score:2)
A forum I visit daily was really slow for a few days about a week ago. The forum was showing over 160 "guests" when on any normal day there would be about 20. The forum admin tracked the IP addresses of these "guests" back to Facebook. These guests were visiting every page in the forum in rapid succession for days on end. Robots.txt did nothing to stop them. It looks like a Llama was spitting all over the forum.
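For anyone wanting to do the same kind of tracking, a rough sketch of counting requests per IP is below. The "IP path" log format, the sample lines, the threshold and the function names are all hypothetical (the addresses come from the reserved documentation ranges), and this is not what the forum admin actually ran.

```python
from collections import Counter

# Hypothetical log lines in a simplified "IP path" form. Real access
# logs carry more fields, but the client IP is usually the first one.
SAMPLE_LOG = [
    "192.0.2.10 /forum/page1",
    "192.0.2.10 /forum/page2",
    "198.51.100.7 /forum/page1",
    "192.0.2.10 /forum/page3",
    "192.0.2.10 /forum/page4",
]


def requests_per_ip(lines):
    """Count how many requests each client IP made."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            counts[parts[0]] += 1
    return counts


def heavy_hitters(counts, threshold):
    """Return IPs with at least `threshold` requests, busiest first."""
    return [ip for ip, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    counts = requests_per_ip(SAMPLE_LOG)
    print(heavy_hitters(counts, threshold=3))
```

Mapping the busiest addresses back to an owner is a separate step (a whois or reverse DNS lookup), which this sketch leaves out.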
AI company is complete scum. What else is new? (Score:2)
These people are greedy assholes and do not care who they steal from or how much damage they do.
I'm going to defend it... (Score:2)
I've used it quite a bit over the last month and found it generally useful on a variety of subjects.
It's good if you are asking it how to do something very specific in Linux. Not always right, because the answer may be out of date, but it gives a better and quicker pointer to a viable source than doing the screening yourself.
It is sometimes quite wrong because it relies on one inaccurate source, e.g. an opinion piece on an obscure partisan web site. So you have to be wary.
On literature and literary criticism