• Open Internet, web scraping, and AI: the unbreakable link

    From TechnologyDaily@1337:1/100 to All on Tue Apr 23 14:15:05 2024
    Open Internet, web scraping, and AI: the unbreakable link

    Date:
    Tue, 23 Apr 2024 14:12:11 +0000

    Description:
    With existing technological capabilities, open access to publicly available web data is the only way to improve the quality of AI outputs.

    FULL STORY ======================================================================

    Last year, the Internet Archive (IA), a nonprofit internet archival organization, lost the first round of Hachette v. Internet Archive, a lawsuit brought by
    four major publishers over its decision to act as a digital library during the pandemic, lending more than one copy of a book at a time.

    Whether it was an ethical decision, and who is right in this battle (the publishers, who use existing copyright law provisions to their advantage, or the IA, which argues that today's copyright law is outdated and does not meet the requirements of digital societies), remains to be answered. The IA has appealed its loss to the Second Circuit Court, a decision supported by many authors themselves.

    The IA case, however, points to a broader issue: the struggle to keep information openly accessible on a free and open internet. In recent years, this mission has been increasingly complicated by mounting legal cases against artificial intelligence firms that gather web data for algorithmic training, contextual advertising services that analyze public data to understand the content of different sites, and even nonprofits that gather web data for socially driven purposes. Earlier this year, X sued the Center for Countering Digital Hate and lost the case.

    Although presented on the surface as a fight over data ownership, it is usually a fight over the distribution of the monetary gains offered by a growing digital economy. Without rethinking current compensation mechanisms, this fight might end in nothing positive: a fragmented society, the proliferation of disinformation, and biased, primitive AI solutions.

    The philosophy of the open Internet

    The concept of the open web is a broad concoction of ideas resting on the basic principles of information as a public good, people's right to share it, and the importance of data neutrality. Its supporters promote equal access to the Internet as a way to distribute knowledge globally, first of all through nonprofit means such as the Creative Commons, open-source scholarship and coding, open licensing, and archival organizations such as the previously mentioned IA.

    The open Internet has its downsides. An easy example would be that cybercrime can benefit significantly from open-source coding, whereas open access to digital content might stimulate piracy. However, crime proliferates in closed social systems, too. Therefore, making the Internet less accessible would hardly solve this issue.

    Open access to information, on the other hand, has been the main driver of human civilization, from the days when our hominid ancestors developed language to the Gutenberg Revolution to the emergence of the World Wide Web.

    The argument for access to public web data

    The Internet Archive is the epitome of the open Internet and free access to data. Holding an archive of 410 billion web pages in its Wayback Machine, tens of millions of books, images, and audio recordings, and over 200,000 software programs (including historic applications), it is a huge historical repository, a sociocultural phenomenon, and an educational project with a mission to distribute knowledge to remote locations.

    Content can be uploaded to the IA by its users, but the lion's share is collected from the web with the help of web crawlers: automated solutions that scour the Internet and store the contents of websites. The IA crawlers collect data only from the publicly accessible web, meaning that information behind logins or paywalls is omitted.
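
    To make the crawling idea concrete, below is a minimal, illustrative Python sketch of a "polite" crawler in this spirit (my own example, not the IA's actual software): it checks a site's robots.txt before fetching anything and skips pages that respond with an error or demand a login. The user agent string and URL are hypothetical.

        # Minimal sketch of a crawler that stays within the public web:
        # honor robots.txt, skip anything that requires authentication.
        import urllib.error
        import urllib.request
        import urllib.robotparser
        from urllib.parse import urlparse

        USER_AGENT = "example-archiver/0.1"  # hypothetical identifier

        def allowed_by_robots(url):
            """Check the site's robots.txt before fetching the page."""
            parts = urlparse(url)
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            try:
                rp.read()
            except OSError:
                return False  # robots.txt unreachable: err on the side of not crawling
            return rp.can_fetch(USER_AGENT, url)

        def fetch_public_page(url):
            """Fetch a page only if robots rules allow it and no login is required."""
            if not allowed_by_robots(url):
                return None  # disallowed by the site's robots rules
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            try:
                with urllib.request.urlopen(req, timeout=10) as resp:
                    return resp.read().decode("utf-8", errors="replace")
            except urllib.error.URLError:
                return None  # 401/402/403 and similar: behind a login or paywall, skip

        if __name__ == "__main__":
            page = fetch_public_page("https://example.com/")
            print("fetched" if page else "skipped")

    Production archival crawlers are far more elaborate (queuing, deduplication, rate limiting, storage formats), but these two checks capture what "public web only" means in practice.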

    There are multiple ways in which free data repositories such as the IA benefit critical social missions. The IA is used for scientific research, to access old court documents, and even as evidence in court proceedings. It can also be used to support investigative journalism and the fight against disinformation.

    AI in the echo chambers

    A relatively new use case that necessitates open access to vast amounts of public web data, including historical repositories, is training artificial intelligence algorithms (AI, not to be confused with the IA). Making AI training and testing data as diverse as possible is a prerequisite not only for developing increasingly complex systems but also for reducing bias in AI algorithms, avoiding hallucinations, and improving accuracy.

    As my colleague has argued, if training datasets are built primarily on data that is either synthetic or too homogeneous, the system will tend to accentuate specific patterns (including biases) inherent in the underlying datasets, resulting in echo chambers and making AI outputs primitive and less reliable. Moreover, probabilistic algorithms would form closed epistemic systems in which the abundance of ideas, theories, and other representations of the real world would slowly vanish.
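
    To give a rough sense of what "too homogeneous" means in practice, here is a small, illustrative Python sketch (a toy example of my own, not taken from any production pipeline) that estimates how repetitive a corpus is by counting word shingles shared across documents; the higher the overlap, the more a model trained on that corpus is fed the same patterns over and over.

        # Toy homogeneity check: what fraction of distinct word shingles
        # appear in more than one document of the corpus?
        def shingles(text, n=5):
            """Return the set of n-word shingles (overlapping word windows) in a document."""
            words = text.lower().split()
            return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

        def corpus_overlap(docs, n=5):
            """Fraction of distinct shingles occurring in more than one document.
            Values near 1.0 indicate a highly repetitive, homogeneous corpus."""
            seen, repeated = set(), set()
            for doc in docs:
                for sh in shingles(doc, n):
                    (repeated if sh in seen else seen).add(sh)
            return len(repeated) / max(len(seen), 1)

        if __name__ == "__main__":
            homogeneous = ["the market went up today and investors were pleased"] * 100
            diverse = [f"document {i} covers an entirely separate subject labelled {i}"
                       for i in range(100)]
            print(corpus_overlap(homogeneous))  # close to 1.0: the corpus repeats itself
            print(corpus_overlap(diverse))      # much lower: little cross-document repetition

    Real diversity audits rely on far more sophisticated measures (near-duplicate detection, topic coverage, source mix), but even this crude ratio shows how quickly a dataset can collapse into repetition.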

    Unfortunately, getting open access to abundant human-created data is the main challenge for AI developers today. AI firms have faced a huge social and legal backlash over using publicly available web data, part of it related to data privacy concerns and part to data ownership and copyright concerns.

    On the one hand, the argument that AI firms developing popular commercial AI solutions must compensate content owners (be it photographers, writers, designers, or scientists) for using their work sounds entirely legitimate. On the other hand, it leaves AI developers in a stalemate.

    First, web content is nearly boundless, and a big part of it might be considered technically copyrighted without having clearly attributed rights. Content actively produced by millions of web users is the best example of this phenomenon: usually, none of them claim their public outputs as copyrighted material, and it would be impossible to identify all potential copyright holders. Moreover, it would also mean negotiating compensation terms with all of them, an effort of such a scale that it makes commercial AI development unfeasible.

    Recognizing the complicated nature of the situation, some major data owners (often called gatekeepers) have hurried to monetize their resources. The BBC announced it is in talks with technology companies to sell access to its content archive for use as AI training data, and other publishers are considering similar revenue diversification models, too.

    However, this solution might still make the costs of AI development too burdensome, especially for smaller companies. Without rethinking current compensation mechanisms and the established copyright regime, which currently favors the big players, the move towards more intelligent, reliable, and responsible AI systems might remain stuck in the realm of science fiction for years to come.

    Concluding remarks

    Due to rapid internet expansion, the way people live their everyday lives has drastically changed over the last few decades. First, we started consuming digital information: reading books, watching movies, listening to music, and talking to each other using our gadgets. Today, it is not only us but also robots that create digital art, gather all sorts of information, and read online, trying to make sense of the content humans have created.

    However, the established copyright regime and the resulting compensation mechanisms haven't been quick enough to adapt, causing trouble for different participants of the digital economy: businesses that gather public web intelligence, historical repositories that store Internet data for future generations, and AI developers that need to make robots smart and, even more importantly, reliable. As the case of the Internet Archive shows, even the concept of a digital library is still legally problematic.

    With existing technological capabilities, open access to publicly available web data is the only way to improve the quality of AI outputs. AI tools that are better at digesting and distributing information would, in turn, make information more accessible and useful to wider audiences. However, if AI developers are forced to pay for all the data they use, there might be no
    business argument for developing these systems further.


    This article was produced as part of TechRadarPro's Expert Insights channel, where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro



    ======================================================================
    Link to news story: https://www.techradar.com/pro/open-internet-web-scraping-and-ai-the-unbreakable-link


    --- Mystic BBS v1.12 A47 (Linux/64)
    * Origin: tqwNet Technology News (1337:1/100)