AI

'Copyright Traps' Could Tell Writers If an AI Has Scraped Their Work

An anonymous reader quotes a report from MIT Technology Review: Since the beginning of the generative AI boom, content creators have argued that their work has been scraped into AI models without their consent. But until now, it has been difficult to know whether specific text has actually been used in a training data set. Now they have a new way to prove it: "copyright traps" developed by a team at Imperial College London, pieces of hidden text that allow writers and publishers to subtly mark their work in order to later detect whether it has been used in AI models or not. The idea is similar to traps that have been used by copyright holders throughout history -- strategies like including fake locations on a map or fake words in a dictionary. [...] The code to generate and detect traps is currently available on GitHub, but the team also intends to build a tool that allows people to generate and insert copyright traps themselves. "There is a complete lack of transparency in terms of which content is used to train models, and we think this is preventing finding the right balance [between AI companies and content creators]," says Yves-Alexandre de Montjoye, an associate professor of applied mathematics and computer science at Imperial College London, who led the research.

The traps aren't foolproof and can be removed, but de Montjoye says that increasing the number of traps makes removing them all significantly more challenging and resource-intensive. "Whether they can remove all of them or not is an open question, and that's likely to be a bit of a cat-and-mouse game," he says.
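The mechanics are simple to sketch. Below is a minimal Python illustration, my own and not the Imperial College tool (their actual generator and detector are on GitHub), of the two halves of the idea: embed a unique, high-entropy trap sequence invisibly in published HTML, then later test whether a model scores that exact sequence as suspiciously familiar compared with fresh control sequences it has never seen. The model_logprob scorer at the end is hypothetical pseudocode standing in for whatever query access to the model is available.

```python
# Toy sketch of a copyright trap: NOT the Imperial College implementation.
import secrets

WORDS = ["lumen", "parsec", "orchid", "gable", "quorum", "tessera", "fathom"]

def make_trap(n_words=12):
    # A high-entropy word salad is vanishingly unlikely to occur by chance,
    # so a model that "recognizes" it almost certainly saw the trapped page.
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))

def embed_in_html(page_html, trap):
    # Hidden from human readers, but picked up by scrapers that ignore CSS.
    return page_html.replace(
        "</body>", '<p style="display:none">' + trap + "</p></body>")

trap = make_trap()
page = embed_in_html("<html><body><p>My article.</p></body></html>", trap)

# Detection idea (pseudocode): compare model_logprob(trap) against the
# scores of freshly generated, never-published controls; an outlier
# suggests the trap was in the training set.
# seen = model_logprob(trap)                        # hypothetical scorer
# controls = [model_logprob(make_trap()) for _ in range(100)]
```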
AI

It May Soon Be Legal To Jailbreak AI To Expose How It Works (404media.co)

An anonymous reader quotes a report from 404 Media: A group of researchers, academics, and hackers are trying to make it easier to break AI companies' terms of service to conduct "good faith research" that exposes biases, inaccuracies, and training data without fear of being sued. The U.S. government is currently considering an exemption to U.S. copyright law that would allow people to break technical protection measures and digital rights management (DRM) on AI systems to learn more about how they work, probe them for bias, discrimination, harmful and inaccurate outputs, and to learn more about the data they are trained on. The exemption would allow for "good faith" security and academic research and "red-teaming" of AI products even if the researcher had to circumvent systems designed to prevent that research. The proposed exemption has the support of the Department of Justice, which said "good faith research can help reveal unintended or undisclosed collection or exposure of sensitive personal data, or identify systems whose operations or outputs are unsafe, inaccurate, or ineffective for the uses for which they are intended or marketed by developers, or employed by end users. Such research can be especially significant when AI platforms are used for particularly important purposes, where unintended, inaccurate, or unpredictable AI output can result in serious harm to individuals."

Much of what we know about how closed-source AI tools like ChatGPT, Midjourney, and others work comes from researchers, journalists, and ordinary users purposefully trying to trick these systems into revealing something about the data they were trained on (which often includes copyrighted material indiscriminately and secretly scraped from the internet), their biases, and their weaknesses. Doing this type of research can often violate the terms of service users agree to when they sign up for a system. For example, OpenAI's terms of service state that users cannot "attempt to or assist anyone to reverse engineer, decompile or discover the source code or underlying components of our Services, including our models, algorithms, or systems (except to the extent this restriction is prohibited by applicable law)," and adds that users must not "circumvent any rate limits or restrictions or bypass any protective measures or safety mitigations we put on our Services."

Shayne Longpre, an MIT researcher who is part of the team pushing for the exemption, told me that "there is a lot of apprehensiveness about these models and their design, their biases, being used for discrimination, and, broadly, their trustworthiness." "But the ecosystem of researchers looking into this isn't super healthy. There are people doing the work but a lot of people are getting their accounts suspended for doing good-faith research, or they are worried about potential legal ramifications of violating terms of service," he added. "These terms of service have chilling effects on research, and companies aren't very transparent about their process for enforcing terms of service." The exemption would be to Section 1201 of the Digital Millennium Copyright Act, a sweeping copyright law. Other 1201 exemptions, which must be applied for and renewed every three years as part of a process through the Library of Congress, allow for the hacking of tractors and electronic devices for the purpose of repair, have carveouts that protect security researchers who are trying to find bugs and vulnerabilities, and in certain cases protect people who are trying to archive or preserve specific types of content.
Harley Geiger of the Hacking Policy Council said that an exemption is "crucial to identifying and fixing algorithmic flaws to prevent harm or disruption," and added that a "lack of clear legal protection under DMCA Section 1201 adversely affects such research."
Moon

Radar Images Suggest There's a Tunnel On the Moon (gizmodo.com)

Longtime Slashdot reader fahrbot-bot shares a report from Gizmodo: A team of researchers think they've discovered a cave on the Moon in radar images of the lunar surface, which they posit could be a future site for an established human presence on our rocky satellite. The tunnel is in the Mare Tranquillitatis (Sea of Tranquility) pit, the deepest known pit on the Moon. (If the name is familiar to you, the Sea of Tranquility is where the Apollo 11 mission landed in 1969.) The pit formed due to a lava tube's roof collapse or a collapse of a void structure created by tectonic processes. To look for potential cave structures within the pit, the researchers studied side-looking radar images taken by the Lunar Reconnaissance Orbiter's Mini-RF instrument between 2009 and 2011. The team then conducted 3D radar simulations of potential geometries of the pit and its cave, to determine that the brightness they saw in radar images could be due to subsurface features. Ultimately, the team determined there is a tunnel in the pit that is between 98 feet (30 meters) and 262 feet (80 meters) long. The tunnel is roughly 148 feet (45 meters) wide and is either flat or inclined with a maximum steepness of 45 degrees. "The exploration of lunar caves through future robotic missions could provide a fresh perspective on the lunar subsurface and yield new insights into the evolution of lunar volcanism," the team wrote in the paper. "Furthermore, direct exploration could confirm the presence of stable subsurface environments shielded from radiation and with optimal temperature conditions for future human utilization."

The findings have been published in the journal Nature Astronomy.
Mozilla

Thunderbird 128: Annual ESR Brings New Features and 'a Rust Revolution' (thunderbird.net)

Thunderbird's annual Extended Support Release was revealed Friday, promising "significant" improvements to the overall user experience and "the speed at which we can deliver new features to you," according to the Thunderbird blog: We've devoted significant development time integrating Rust — a modern programming language originally created by Mozilla Research — into Thunderbird. Even though this is a seemingly invisible change, it is a major leap forward because it enhances our code quality and performance. This overhaul will allow us to share features between the desktop and future mobile versions of Thunderbird, and speed up our development process. It's a win for our developers and a win for you.
More from the blog OMG! Ubuntu: I'm also stoked to see that Thunderbird 128 makes 'newest first' the default sort order for messages in the message list. While some prefer the old way, I always found it strange that the oldest mails were shown first — team reverse chronology, represent!
They also cite "a number of OpenPGP improvements," plus a new preference option for displaying full names and email addresses of all recipients in the message list. (Plus, threaded-message views now display a "New Message" count.)

Other new features in this release:
  • A new and more attractive layout for Cards View (with adjustable heights) that "makes it easier to scan your email threads and glean information."
  • The folder pane has better recall of message thread states.
  • Improved theme compatibility. "Your Thunderbird should blend seamlessly with your desktop environment, matching the system's accent colors perfectly." (Especially beneficial on Ubuntu and Mint.)
  • You can now customize the color of your account icon.

The Thunderbird blog also mentions that "We plan to launch the first phase of built-in support for Exchange, as well as Mozilla Sync, in a future Nebula point release (e.g. Thunderbird 128.X)."


Security

CISA Broke Into a US Federal Agency, No One Noticed For a Full 5 Months (theregister.com)

A 2023 red team exercise by the U.S. Cybersecurity and Infrastructure Security Agency (CISA) at an unnamed federal agency exposed critical security failings, including unpatched vulnerabilities, inadequate incident response, and weak credential management, leading to a full domain compromise. According to The Register's Connor Jones, the agency failed to detect or remediate malicious activity for five months. From the report: According to the agency's account of the exercise, the red team was able to gain initial access by exploiting an unpatched vulnerability (CVE-2022-21587 - 9.8) in the target agency's Oracle Solaris enclave, leading to what it said was a full compromise. It's worth noting that CVE-2022-21587, an unauthenticated remote code execution (RCE) bug carrying a near-maximum 9.8 CVSS rating, was added to CISA's known exploited vulnerability (KEV) catalog in February 2023. The initial intrusion by CISA's red team was made on January 25, 2023. "After gaining access, the team promptly informed the organization's trusted agents of the unpatched device, but the organization took over two weeks to apply the available patch," CISA's report reads. "Additionally, the organization did not perform a thorough investigation of the affected servers, which would have turned up IOCs and should have led to a full incident response. About two weeks after the team obtained access, exploit code was released publicly into a popular open source exploitation framework. CISA identified that the vulnerability was exploited by an unknown third party. CISA added this CVE to its Known Exploited Vulnerabilities Catalog on February 2, 2023." [...]

After gaining access to the Solaris enclave, the red team discovered they couldn't pivot into the Windows part of the network because missing credentials blocked their path, despite enjoying months of access to sensitive web apps and databases. Undeterred, CISA managed to make its way into the Windows network after carrying out phishing attacks on unidentified members of the target agency, one of which was successful. It said real adversaries may have instead used prolonged password-spraying attacks rather than phishing at this stage, given that several service accounts were identified as having weak passwords. After gaining that access, the red team injected a persistent RAT and later discovered unsecured admin credentials, which essentially meant it was game over for the agency being assessed. "None of the accessed servers had any noticeable additional protections or network access restrictions despite their sensitivity and critical functions in the network," CISA said.

CISA described this as a "full domain compromise" that gave the attackers access to tier zero assets -- the most highly privileged systems. "The team found a password file left from a previous employee on an open, administrative IT share, which contained plaintext usernames and passwords for several privileged service accounts," the report reads. "With the harvested Lightweight Directory Access Protocol (LDAP) information, the team identified one of the accounts had System Center Operations Manager (SCOM) administrator privileges and domain administrator privileges for the parent domain. They identified another account that also had administrative permissions for most servers in the domain. The passwords for both accounts had not been updated in over eight years and were not enrolled in the organization's identity management (IDM)." From here, the red team realized the victim organization had trust relationships with multiple external FCEB organizations, which CISA's team then pivoted into using the access they already had.

The team "kerberoasted" one partner organization. Kerberoasting is an attack on the Kerberos authentication protocol typically used in Windows networks to authenticate users and devices. However, it wasn't able to move laterally with the account due to low privileges, so it instead used those credentials to exploit a second trusted partner organization. Kerberoasting yielded a more privileged account at the second external org, the password for which was crackable. CISA said that due to network ownership, legal agreements, and/or vendor opacity, these kinds of cross-organizational attacks are rarely tested during assessments. However, SILENTSHIELD assessments are able to be carried out following new-ish powers afforded to CISA by the FY21 National Defense Authorization Act (NDAA), the same powers that also allow CISA's Federal Attack Surface Testing (FAST) pentesting program to operate. It's crucial that these avenues are able to be explored in such exercises because they're routes into systems adversaries will have no reservations about exploring in a real-world scenario. For the first five months of the assessment, the target FCEB agency failed to detect or remediate any of the SILENTSHIELD activity, raising concerns over its ability to spot genuine malicious activity.
CISA said the findings demonstrated the need for agencies to apply defense-in-depth principles. The cybersecurity agency recommended network segmentation and a Secure-by-Design commitment.
Security

Rabbit R1 AI Device Exposed by API Key Leak (404media.co)

Security researchers claim to have discovered exposed API keys in the code of Rabbit's R1 AI device, potentially allowing access to all user responses and company services. The group, known as Rabbitude, says they could send emails from internal Rabbit addresses to demonstrate the vulnerability. 404 Media adds: In a statement, Rabbit said, "Today we were made aware of an alleged data breach. Our security team immediately began investigating it. As of right now, we are not aware of any customer data being leaked or any compromise to our systems. If we learn of any other relevant information, we will provide an update once we have more details."
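Embedded credentials of this kind are typically found with the sort of source-tree scan sketched below. This is a toy illustration, not Rabbitude's method; production scanners such as gitleaks and trufflehog add per-provider rules and entropy checks.

```python
# Toy secret scanner: flags strings that look like hardcoded API keys.
import os, re

# Rough heuristic: a key-ish variable name assigned a long token literal.
PATTERN = re.compile(
    r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*"
    r"['\"]([A-Za-z0-9_\-]{20,})['\"]")

def scan(root="."):
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith((".py", ".js", ".ts", ".json", ".env")):
                continue
            path = os.path.join(dirpath, name)
            try:
                text = open(path, encoding="utf-8", errors="ignore").read()
            except OSError:
                continue
            for match in PATTERN.finditer(text):
                print(path + ": possible hardcoded " + match.group(1))

scan()
```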
AI

Anthropic Launches Claude 3.5 Sonnet, Says New Model Outperforms GPT-4 Omni (anthropic.com)

Anthropic launched Claude 3.5 Sonnet on Thursday, claiming it outperforms previous models and OpenAI's GPT-4 Omni. The AI startup also introduced Artifacts, a workspace for users to edit AI-generated projects. This release, part of the Claude 3.5 family, comes three months after Claude 3. Claude 3.5 Sonnet is available for free on Claude.ai and the Claude iOS app, while Claude Pro and Team plan subscribers can access it with significantly higher rate limits.

Anthropic plans to launch 3.5 versions of Haiku and Opus later this year, exploring features like web search and memory for future releases.

Anthropic also introduced Artifacts on Claude.ai, a new feature that expands how users can interact with Claude. When a user asks Claude to generate content like code snippets, text documents, or website designs, these Artifacts appear in a dedicated window alongside their conversation. This creates a dynamic workspace where they can see, edit, and build upon Claude's creations in real-time, seamlessly integrating AI-generated content into their projects and workflows, the startup said.
Python

Python 'Language Summit' 2024: Security Workflows, Calendar Versioning, Transforms and Lightning Talks (blogspot.com)

On Friday the Python Software Foundation published several blog posts about this year's "Python Language Summit," held May 15th (before PyCon US), which featured talks and discussions by core developers, triagers, and Python implementation maintainers.

There were several lightning talks. One talk came from the maintainer of the PyO3 project, offering Rust bindings for the Python C API (which requires mapping Rust concepts to Python — leaving a question as to how to map Rust's error-handling panic! macro). There was a talk on formalizing the PEP prototype process, and a talk on whether the Python team should have a more official presence in the Apple App Store (and maybe the Google Play Store). One talk suggested changing the formatting of error messages for assert statements, and one covered a "highly experimental" project to support structured data sharing between Python subinterpreters. One talk covered Python's "unsupported build" warning and how it should behave on platforms beyond Python's officially supported list.

Python Foundation blog posts also covered some of the longer talks, including one on the idea of using type annotations as a mechanism for transformers. One talk covered the new interactive REPL interpreter coming to Python 3.13.

And one talk focused on Python's security model after the xz-utils backdoor: Pablo Galindo Salgado, Steering Council member and the release manager for Python 3.10 and 3.11, brought this topic to the Language Summit to discuss what could be done to improve Python's security model... Pablo noted the similarities shared between CPython and xz-utils, referencing the previous Language Summit's talk on core developer burnout, the number of modules in the standard library that have one or zero maintainers, the low ratio of maintainers to source code, and the use of autotools for configuration. Autotools was used by [xz's] Jia Tan as part of the backdoor, specifically to obscure the changes to tainted release artifacts. Pablo confirmed along with many nods of agreement that indeed, CPython could be vulnerable to a contributor or core developer getting secretly malicious changes merged into the project.

For multiple reasons, such as the need to fix bugs quickly and the prevalence of single-maintainer modules, CPython doesn't require reviewers on the pull requests of core developers. This can lead to "unilateral action," meaning that a change is introduced into CPython without review by anyone besides the author. Other situations, like release managers backporting fixes to other branches without review, are common.

Much discussion ensued about the possibility of altering workflows (including pull request reviews), identity verification, and the importance of post-incident action plans. Guido van Rossum suggested a "higher bar" for granting write access, but in the end "Overall it was clear there is more discussion and work to be done in this rapidly changing area."

In another talk, Hugo van Kemenade, the newly announced Release Manager for Python 3.14 and 3.15, "started the Language Summit with a proposal to change Python's versioning scheme. The perception of Python using semantic versioning is a source of confusion for users who don't expect backwards incompatible changes when upgrading to new versions of Python. In reality almost all new feature releases of Python include backwards incompatible changes such as the removal of "dead batteries" where PEP 594 marked 19 modules for removal in Python 3.13. Calendar Versioning (CalVer) encompasses a wide array of different versioning schemes that have one property in common: using the release date as part of a release's version... Hugo offered multiple proposed versioning schemes, including:

- Using the release year as minor version (3.YY.micro, "3.26.0")
- Using the release year as major version (YY.0.micro, "26.0.0")
- Using the release year and month as major and minor version (YY.MM.micro, "26.10.0")

[...] Overall the proposal to use the current year as the minor version was well-received; Hugo mentioned that he'd be drafting a PEP for this change.
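To make the three schemes concrete, here is a toy rendering (mine, not from the summit) of how a hypothetical release in October 2026 would be numbered under each proposal:

```python
# Toy illustration of the three proposed CalVer schemes; the release date
# is a hypothetical placeholder.
from datetime import date

release = date(2026, 10, 1)
yy, mm = release.year % 100, release.month

print(f"3.{yy}.0")      # 3.YY.micro  -> 3.26.0
print(f"{yy}.0.0")      # YY.0.micro  -> 26.0.0
print(f"{yy}.{mm}.0")   # YY.MM.micro -> 26.10.0
```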

Encryption

Researcher Finds Side-Channel Vulnerability in Post-Quantum Key Encapsulation Mechanism (thecyberexpress.com)

Slashdot reader storagedude shared this report from The Cyber Express: A security researcher discovered an exploitable timing leak in the Kyber key encapsulation mechanism (KEM) that's in the process of being adopted by NIST as a post-quantum cryptographic standard. Antoon Purnal of PQShield detailed his findings in a blog post and on social media, and noted that the problem has been fixed with the help of the Kyber team. The issue was found in the reference implementation of the Module-Lattice-Based Key-Encapsulation Mechanism (ML-KEM) that's in the process of being adopted as a NIST post-quantum key encapsulation standard. "A key part of implementation security is resistance against side-channel attacks, which exploit the physical side-effects of cryptographic computations to infer sensitive information," Purnal wrote.

To secure against side-channel attacks, cryptographic algorithms must be implemented so that "no attacker-observable effect of their execution depends on the secrets they process," he wrote. In the ML-KEM reference implementation, "we're concerned with a particular side channel that's observable in almost all cryptographic deployment scenarios: time." The vulnerability can occur when a compiler optimizes the code, in the process silently undoing "measures taken by the skilled implementer." In Purnal's analysis, the Clang compiler was found to emit a vulnerable secret-dependent branch in the poly_frommsg function of the ML-KEM reference code needed in both key encapsulation and decapsulation, corresponding to the expand_secure implementation.
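The pattern at issue is easy to illustrate. ML-KEM's poly_frommsg expands each secret message bit b into the polynomial coefficient b times ceil(q/2), with q = 3329, and the reference C computes this with a mask rather than an if, precisely so that timing does not depend on the bit; the leak arose when the compiler recognized the 0-or-1 structure and emitted a branch anyway. A Python sketch of the two shapes (illustrative only; Python itself makes no constant-time guarantees):

```python
Q = 3329
HALF_Q = (Q + 1) // 2  # 1665: the coefficient that encodes a 1 bit in ML-KEM

def frommsg_branchy(bits):
    # Secret-dependent branch: the execution path (and thus timing) leaks
    # each bit. This is the shape the compiler effectively emitted.
    return [HALF_Q if b else 0 for b in bits]

def frommsg_branchless(bits):
    # Mask-based select: -b is 0 or all-ones in two's complement, so the
    # AND picks 0 or HALF_Q with no data-dependent control flow.
    return [(-b) & HALF_Q for b in bits]

assert frommsg_branchy([0, 1, 1, 0]) == frommsg_branchless([0, 1, 1, 0])
```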

While the reference implementation was patched, "It's important to note that this does not rule out the possibility that other libraries, which are based on the reference implementation but do not use the poly_frommsg function verbatim, may be vulnerable — either now or in the future," Purnal wrote.

Purnal also published a proof-of-concept demo on GitHub. "On an Intel Core i7-13700H, it takes between 5-10 minutes to leak the entire ML-KEM 512 secret key using end-to-end decapsulation timing measurements."
United States

Louisiana Becomes 10th US State to Make CS a High School Graduation Requirement (linkedin.com)

Long-time Slashdot reader theodp writes: "Great news, Louisiana!" tech-backed Code.org exclaimed Wednesday in celebratory LinkedIn, Facebook, and Twitter posts. Louisiana is "officially the 10th state to make computer science a [high school] graduation requirement. Huge thanks to Governor Jeff Landry for signing the bill and to our legislative champions, Rep. Jason Hughes and Sen. Thomas Pressly, for making it happen! This means every Louisiana student gets a chance to learn coding and other tech skills that are super important these days. These skills can help them solve problems, think critically, and open doors to awesome careers!"

Representative Hughes, the sponsor of HB264 — which calls for each public high school student to successfully complete a one credit CS course as a requirement for graduation and also permits students to take two units of CS instead of studying a Foreign Language — tweeted back: "HUGE thanks @codeorg for their partnership in this effort every step of the way! Couldn't have done it without [Code.org Senior Director of State Government Affairs] Anthony [Owen] and the Code.org team!"

Code.org also on Wednesday announced the release of its 2023 Impact Report, which touted its efforts "to include a requirement for every student to take computer science to receive a high school diploma." Since its 2013 launch, Code.org reports it's spent $219.8 million to push coding into K-12 classrooms, including $19 million on Government Affairs (Achievements: "Policies changed in 50 states. More than $343M in state budgets allocated to computer science.").

In Code.org by the Numbers, the nonprofit boasts that 254,683 students started Code.org's AP CS Principles course in the academic year (2025 Goal: 400K), while 21,425 have started Code.org's new Amazon-bankrolled AP CS A course. Estimates peg U.S. public high school enrollment at 15.5M students, annual K-12 public school spending at $16,080 per pupil, and an annual high school student course load at 6-8 credits...

Social Networks

TikTok Preparing a US Copy of the App's Core Algorithm (reuters.com)

An anonymous reader quotes a report from Reuters: TikTok is working on a clone of its recommendation algorithm for its 170 million U.S. users that may result in a version that operates independently of its Chinese parent and is more palatable to American lawmakers who want to ban it, according to sources with direct knowledge of the efforts. The work on splitting the source code ordered by TikTok's Chinese parent ByteDance late last year predated a bill to force a sale of TikTok's U.S. operations that began gaining steam in Congress this year. The bill was signed into law in April. The sources, who were granted anonymity because they are not authorized to speak publicly about the short-form video sharing app, said that once the code is split, it could lay the groundwork for a divestiture of the U.S. assets, although there are no current plans to do so. The company has previously said it had no plans to sell the U.S. assets and such a move would be impossible. [...]

In the past few months, hundreds of ByteDance and TikTok engineers in both the U.S. and China were ordered to begin separating millions of lines of code, sifting through the company's algorithm that pairs users with videos to their liking. The engineers' mission is to create a separate code base that is independent of systems used by ByteDance's Chinese version of TikTok, Douyin, while eliminating any information linking to Chinese users, two sources with direct knowledge of the project told Reuters. [...] The complexity of the task that the sources described to Reuters as tedious "dirty work" underscores the difficulty of splitting the underlying code that binds TikTok's U.S. operations to its Chinese parent. The work is expected to take over a year to complete, these sources said. [...] At one point, TikTok executives considered open sourcing some of TikTok's algorithm, or making it available to others to access and modify, to demonstrate technological transparency, the sources said.

Executives have communicated plans and provided updates on the code-splitting project during a team all-hands, in internal planning documents and on its internal communications system, called Lark, according to one of the sources who attended the meeting and another source who has viewed the messages. Compliance and legal issues involved with determining what parts of the code can be carried over to TikTok are complicating the work, according to one source. Each line of code has to be reviewed to determine if it can go into the separate code base, the sources added. The goal is to create a new source code repository for a recommendation algorithm serving only TikTok U.S. Once completed, TikTok U.S. will run and maintain its recommendation algorithm independent of TikTok apps in other regions and its Chinese version Douyin. That move would cut it off from the massive engineering development power of its parent company in Beijing, the sources said. If TikTok completes the work to split the recommendation engine from its Chinese counterpart, TikTok management is aware of the risk that TikTok U.S. may not be able to deliver the same level of performance as the existing TikTok because it is heavily reliant on ByteDance's engineers in China to update and maintain the code base to maximize user engagement, sources added.

Security

Memory Sealing 'mseal' System Call Merged For Linux 6.10 (phoronix.com)

"Merged this Friday evening into the Linux 6.10 kernel is the new mseal() system call for memory sealing," reports Phoronix: The mseal system call was led by Jeff Xu of Google's Chrome team. The goal with memory sealing is to also protect the memory mapping itself against modification. The new mseal Linux documentation explains:

"Modern CPUs support memory permissions such as RW and NX bits. The memory permission feature improves security stance on memory corruption bugs, i.e. the attacker can't just write to arbitrary memory and point the code to it, the memory has to be marked with X bit, or else an exception will happen. Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system... Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security-critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT flag and on OpenBSD with the mimmutable syscall."

The mseal system call is designed to be used by the likes of the GNU C Library "glibc" while loading ELF executables to seal non-writable memory segments or by the Google Chrome web browser and other browsers for protecting security sensitive data structures.
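To make the behavior concrete, here is a hedged userspace sketch via ctypes. Assumptions worth flagging: a Linux 6.10+ kernel on x86-64, where mseal's syscall number is 462 (other architectures differ), and no libc wrapper yet, hence the raw syscall.

```python
# Hedged sketch: seal an anonymous mapping, then show mprotect is refused.
# Assumes Linux 6.10+ on x86-64 (syscall number 462); not portable.
import ctypes, os

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT_READ, PROT_WRITE = 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
PAGE = os.sysconf("SC_PAGE_SIZE")
SYS_mseal = 462  # x86-64 only; check your architecture's syscall table

# Map one anonymous read-write page, then seal it.
addr = libc.mmap(None, PAGE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
if libc.syscall(SYS_mseal, ctypes.c_void_p(addr), ctypes.c_size_t(PAGE), 0) != 0:
    raise OSError(ctypes.get_errno(), "mseal failed (kernel older than 6.10?)")

# Once sealed, the mapping itself is immutable: mprotect/munmap/mremap on
# this range should now fail with EPERM.
ret = libc.mprotect(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE), PROT_READ)
print("mprotect on sealed page:",
      "EPERM as expected" if ret != 0 else "unexpectedly succeeded")
```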

Programming

Rust Foundation Reports 20% of Rust Crates Use 'Unsafe' Keyword (rust-lang.org)

A Rust Foundation blog post begins by reminding readers that Rust programs "are unable to compile if memory management rules are violated, essentially eliminating the possibility of a memory issue at runtime."

But then it goes on to explore "Unsafe Rust in the wild" (used for a small set of actions like dereferencing a raw pointer, modifying a mutable static variable, or calling unsafe functions). "At a superficial glance, it might appear that Unsafe Rust undercuts the memory-safety benefits Rust is becoming increasingly celebrated for. In reality, the unsafe keyword comes with special safeguards and can be a powerful way to work with fewer restrictions when a function requires flexibility, so long as standard precautions are used."

The Foundation lists those available safeguards — which "make exploits rare — but not impossible." But then they go on to analyze just how much Rust code actually uses the unsafe keyword: The canonical way to distribute Rust code is through a package called a crate. As of May 2024, there are about 145,000 crates, of which approximately 127,000 contain significant code. Of those 127,000 crates, 24,362 make use of the unsafe keyword, which is 19.11% of all crates. And 34.35% make a direct function call into another crate that uses the unsafe keyword [according to numbers derived from the Rust Foundation project Painter]. Nearly 20% of all crates have at least one instance of the unsafe keyword, a non-trivial number.

Most of these Unsafe Rust uses are calls into existing third-party non-Rust language code or libraries, such as C or C++. In fact, the crate with the most uses of the unsafe keyword is the Windows crate, which allows Rust developers to call into various Windows APIs. This does not mean that the code in these Unsafe Rust blocks is inherently exploitable (a majority or all of that code is most likely not), but that special care must be taken while using Unsafe Rust in order to avoid potential vulnerabilities...

Rust lives up to its reputation as an excellent and transformative tool for safe and secure programming, even in an Unsafe context. But this reputation requires resources, collaboration, and constant examination to uphold properly. For example, the Rust Project is continuing to develop tools like Miri to allow the checking of unsafe Rust code. The Rust Foundation is committed to this work through its Security Initiative: a program to support and advance the state of security within the Rust Programming language ecosystem and community. Under the Security Initiative, the Rust Foundation's Technology team has developed new tools like [dependency-graphing] Painter, TypoMania [which checks package registries for typo-squatting] and Sandpit [an internal tool watching for malicious crates]... giving users insight into vulnerabilities before they can happen and allowing for a quick response if an exploitation occurs.

Operating Systems

NetBSD Bans AI-Generated Code (netbsd.org)

Seven Spirals writes: NetBSD committers are now banned from using any AI-generated code from ChatGPT, CoPilot, or other AI tools. Time will tell how this plays out with both their users and core team. "If you commit code that was not written by yourself, double check that the license on that code permits import into the NetBSD source repository, and permits free distribution," reads NetBSD's updated commit guidelines. "Check with the author(s) of the code, make sure that they were the sole author of the code and verify with them that they did not copy any other code. Code generated by a large language model or similar technology, such as GitHub/Microsoft's Copilot, OpenAI's ChatGPT, or Facebook/Meta's Code Llama, is presumed to be tainted code, and must not be committed without prior written approval by core."
Science

Revolutionary Genetics Research Shows RNA May Rule Our Genome (scientificamerican.com)

Philip Ball reports via Scientific American: Thomas Gingeras did not intend to upend basic ideas about how the human body works. In 2012 the geneticist, now at Cold Spring Harbor Laboratory in New York State, was one of a few hundred colleagues who were simply trying to put together a compendium of human DNA functions. Their project was called ENCODE, for the Encyclopedia of DNA Elements. About a decade earlier almost all of the three billion DNA building blocks that make up the human genome had been identified. Gingeras and the other ENCODE scientists were trying to figure out what all that DNA did. The assumption made by most biologists at that time was that most of it didn't do much. The early genome mappers estimated that perhaps 1 to 2 percent of our DNA consisted of genes as classically defined: stretches of the genome that coded for proteins, the workhorses of the human body that carry oxygen to different organs, build heart muscles and brain cells, and do just about everything else people need to stay alive. Making proteins was thought to be the genome's primary job. Genes do this by putting manufacturing instructions into messenger molecules called mRNAs, which in turn travel to a cell's protein-making machinery. As for the rest of the genome's DNA? The "protein-coding regions," Gingeras says, were supposedly "surrounded by oceans of biologically functionless sequences." In other words, it was mostly junk DNA.

So it came as rather a shock when, in several 2012 papers in Nature, he and the rest of the ENCODE team reported that at one time or another, at least 75 percent of the genome gets transcribed into RNAs. The ENCODE work, using techniques that could map RNA activity happening along genome sections, had begun in 2003 and came up with preliminary results in 2007. But not until five years later did the extent of all this transcription become clear. If only 1 to 2 percent of this RNA was encoding proteins, what was the rest for? Some of it, scientists knew, carried out crucial tasks such as turning genes on or off; a lot of the other functions had yet to be pinned down. Still, no one had imagined that three quarters of our DNA turns into RNA, let alone that so much of it could do anything useful. Some biologists greeted this announcement with skepticism bordering on outrage. The ENCODE team was accused of hyping its findings; some critics argued that most of this RNA was made accidentally because the RNA-making enzyme that travels along the genome is rather indiscriminate about which bits of DNA it reads.

Now it looks like ENCODE was basically right. Dozens of other research groups, scoping out activity along the human genome, also have found that much of our DNA is churning out "noncoding" RNA. It doesn't encode proteins, as mRNA does, but engages with other molecules to conduct some biochemical task. By 2020 the ENCODE project said it had identified around 37,600 noncoding genes -- that is, DNA stretches with instructions for RNA molecules that do not code for proteins. That is almost twice as many as there are protein-coding genes. Other tallies vary widely, from around 18,000 to close to 96,000. There are still doubters, but there are also enthusiastic biologists such as Jeanne Lawrence and Lisa Hall of the University of Massachusetts Chan Medical School. In a 2024 commentary for the journal Science, the duo described these findings as part of an "RNA revolution."

What makes these discoveries revolutionary is what all this noncoding RNA -- abbreviated as ncRNA -- does. Much of it indeed seems involved in gene regulation: not simply turning them off or on but also fine-tuning their activity. So although some genes hold the blueprint for proteins, ncRNA can control the activity of those genes and thus ultimately determine whether their proteins are made. This is a far cry from the basic narrative of biology that has held sway since the discovery of the DNA double helix some 70 years ago, which was all about DNA leading to proteins. "It appears that we may have fundamentally misunderstood the nature of genetic programming," wrote molecular biologists Kevin Morris of Queensland University of Technology and John Mattick of the University of New South Wales in Australia in a 2014 article. Another important discovery is that some ncRNAs appear to play a role in disease, for example, by regulating the cell processes involved in some forms of cancer. So researchers are investigating whether it is possible to develop drugs that target such ncRNAs or, conversely, to use ncRNAs themselves as drugs. If a gene codes for a protein that helps a cancer cell grow, for example, an ncRNA that shuts down the gene might help treat the cancer.

Games

Game Dev Says Contract Barring 'Subjective Negative Reviews' Was a Mistake (arstechnica.com)

The developers of team-based shooter Marvel Rivals have apologized for a contract clause that made creators promise not to provide "subjective negative reviews of the game" in exchange for early access to a closed alpha test. From a report: The controversial early access contract gained widespread attention over the weekend when streamer Brandon Larned shared a portion on social media. In the "non-disparagement" clause shared by Larned, creators who are provided with an early download code are asked not to "make any public statements or engage in discussions that are detrimental to the reputation of the game." In addition to the "subjective negative review" example above, the clause also specifically prohibits "making disparaging or satirical comments about any game-related material" and "engaging in malicious comparisons with competitors or belittling the gameplay or differences of Marvel Rivals."
AI

Did OpenAI, Google and Meta 'Cut Corners' to Harvest AI Training Data? (indiatimes.com)

What happened when OpenAI ran out of English-language training data in 2021?

They just created a speech recognition tool that could transcribe the audio from YouTube videos, reports The New York Times, as part of an investigation arguing that tech companies "including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law" in their search for AI training data. [Alternate URL here.] Some OpenAI employees discussed how such a move might go against YouTube's rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are "independent" of the video platform. Ultimately, an OpenAI team transcribed more than 1 million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI's president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4...
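The tool, according to the Times' reporting, was Whisper, which OpenAI later released as open source. For reference, transcribing audio with the released whisper package takes only a few lines; this sketch assumes pip install openai-whisper, ffmpeg available on the PATH, and a placeholder local file talk.mp3.

```python
# Sketch using the open-source whisper package OpenAI later released;
# "talk.mp3" is a hypothetical local file, and ffmpeg must be installed.
import whisper

model = whisper.load_model("base")     # downloads model weights on first use
result = model.transcribe("talk.mp3")  # language is auto-detected
print(result["text"])                  # the full transcript as one string
```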

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works, according to recordings of internal meetings obtained by the Times. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.

Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company's practices said. That potentially violated the copyrights to the videos, which belong to their creators. Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company's privacy team and an internal message viewed by the Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products...

Some Google employees were aware that OpenAI had harvested YouTube videos for data, two people with knowledge of the companies said. But they didn't stop OpenAI because Google had also used transcripts of YouTube videos to train its AI models, the people said. That practice may have violated the copyrights of YouTube creators. So if Google made a fuss about OpenAI, there might be a public outcry against its own methods, the people said.

The article adds that some tech companies are now even developing "synthetic" information to train AI.

"This is not organic data created by humans, but text, images and code that AI models produce — in other words, the systems learn from what they themselves generate."
IT

Some San Francisco Tech Workers are Renting Cheap 'Bed Pods' (sfgate.com)

An anonymous reader shared this report from SFGate: Late last year, tales of tech workers paying $700 a month for tiny "bed pods" in downtown San Francisco went viral. The story provided a perfect distillation of SF's wild (and wildly expensive) housing market — and inspired schadenfreude when the city deemed the situation illegal. But the provocative living situation wasn't an anomaly, according to a city official.

"We've definitely seen an uptick of these 'pod'-type complaints," Kelly Wong, a planner with San Francisco's code enforcement and zoning and compliance team, told SFGATE... Wong stressed that it's not that San Francisco is inherently against bed pod-type arrangements, but that the city is responsible for making sure these spaces are safe and legally zoned.


So Brownstone Shared Housing is still renting one bed pod location — but not accepting new tenants — after citations for failing to get proper permits and having a lock on the front door that required a key to exit.

And SFGate also spoke to Alex Akel, general manager of Olive Rooms, which opened up a co-living and co-working space in SoMa earlier this year (and also faced "a flurry of complaints.") "Unfortunately, we had complaints from neighbors because of foot traffic and noise, and since then we cut the number of people to fit the ordinance by the city," Akel wrote. Olive Rooms describes its space as targeted at "tech founders from Central Asia, giving them opportunities to get involved in the current AI boom." Akel added that its residents are "bringing new energy to SF," but that the program "will not accept new residents before we clarify the status with the city."

In April, the city also received a complaint about a group called Let's Be Buds, which rents out 14 pods in a loft on Divisadero Street that start at $575 per month for an upper bunk.

While this recent burst of complaints is new, bed pods in San Francisco have been catching flak for years... a company called PodShare, which rents — you guessed it — bed pods, squared itself away with the city and has operated in SF since 2019.

Brownstone's CEO told SFGate "A lot of people want to be here for AI, or for school, or different opportunities." He argues that "it's literally impossible without a product like ours," and that their residents had said the option "positively changed the trajectory of their lives."
AI

AI Engineers Report Burnout, Rushed Rollouts As 'Rat Race' To Stay Competitive Hits Tech Industry (cnbc.com)

An anonymous reader quotes a report from CNBC: Late last year, an artificial intelligence engineer at Amazon was wrapping up the work week and getting ready to spend time with some friends visiting from out of town. Then, a Slack message popped up. He suddenly had a deadline to deliver a project by 6 a.m. on Monday. There went the weekend. The AI engineer bailed on his friends, who had traveled from the East Coast to the Seattle area. Instead, he worked day and night to finish the job. But it was all for nothing. The project was ultimately "deprioritized," the engineer told CNBC. He said it was a familiar result. AI specialists, he said, commonly sprint to build new features that are often suddenly shelved in favor of a hectic pivot to another AI project.

The engineer, who requested anonymity out of fear of retaliation, said he had to write thousands of lines of code for new AI features in an environment with zero testing for mistakes. Since code can break if the required tests are postponed, the Amazon engineer recalled periods when team members would have to call one another in the middle of the night to fix aspects of the AI feature's software. AI workers at other Big Tech companies, including Google and Microsoft, told CNBC about the pressure they are similarly under to roll out tools at breakneck speeds due to the internal fear of falling behind the competition in a technology that, according to Nvidia CEO Jensen Huang, is having its "iPhone moment."

Microsoft

Microsoft Overhaul Treats Security as 'Top Priority' After a Series of Failures

Microsoft is making security its number one priority for every employee, following years of security issues and mounting criticisms. The Verge: After a scathing report from the US Cyber Safety Review Board recently concluded that "Microsoft's security culture was inadequate and requires an overhaul," it's doing just that by outlining a set of security principles and goals that are tied to compensation packages for Microsoft's senior leadership team. Last November, Microsoft announced a Secure Future Initiative (SFI) in response to mounting pressure on the company to respond to attacks that allowed Chinese hackers to breach US government email accounts.

Just days after announcing this initiative, Russian hackers managed to breach Microsoft's defenses and spy on the email accounts of some members of Microsoft's senior leadership team. Microsoft only discovered the attack nearly two months later in January, and the same group even went on to steal source code. These recent attacks have been damaging, and the Cyber Safety Review Board report added fuel to Microsoft's security fire recently by concluding that the company could have prevented the 2023 breach of US government email accounts and that a "cascade of security failures" led to that incident. "We are making security our top priority at Microsoft, above all else -- over all other features," explains Charlie Bell, executive vice president for Microsoft security, in a blog post today. "We will instill accountability by basing part of the compensation of the company's Senior Leadership Team on our progress in meeting our security plans and milestones."
