Chatbots Are Absorbing People’s Content, and the Internet Is Eager to Monetize It!

Millions of People’s Content Is Being Used by Artificial Intelligence Companies Without Their Consent.

Artificial intelligence companies are exploiting content created by countless people on the internet without their permission or remuneration. Now, a growing number of technology and media organisations are starting to charge for that content in order to profit from the chatbot craze.

If you’ve ever blogged, posted on Reddit, or shared anything on the open web, you’ve likely contributed to the birth of the next generation of artificial intelligence.

AI language models are used in Google’s Bard, OpenAI’s ChatGPT, Microsoft’s latest version of Bing, and comparable tools from other businesses.

However, without the vast amount of text freely available on the Internet, these clever robot writers would not be possible.


Today, web content is once again at the centre of controversy, for the first time since the early days of the search engine wars. The tech titans are attempting to extract new value from this precious stream of knowledge.

Technology and media companies that previously overlooked this data are now realising how important it is to building a new generation of language-based artificial intelligence.

Reddit, a significant source of training data for OpenAI, recently stated that it would charge AI businesses for data access. OpenAI declined to comment for this story.

Twitter recently began charging for data access, a development that has an impact on many elements of Twitter’s business, including the use of data by AI businesses.

A news media consortium representing publishers said in a report this month that companies should pay licensing fees when they use work created by its members to train artificial intelligence.

“What’s really important to us is the attribution of information,” said Prashanth Chandrasekar, chief executive of Stack Overflow, a question-and-answer site for programmers.

The company plans to start charging for access to its user-generated content. The Stack Overflow community has put a great deal of energy into answering questions over the past 15 years.

Several AI systems trained to generate images, such as OpenAI’s Dall-E 2, have already been accused of large-scale intellectual property theft.

The businesses that developed these technologies are currently being sued over the allegations. And the battle over AI-generated content could be even more heated, involving not only compensation and credit but also privacy.

However, Emily M. Bender, a computational linguist at the University of Washington, says that AI companies are not held accountable for their actions under current law.

The dispute stems from how artificial intelligence chatbots are built. The underlying algorithm of these bots is known as a “large language model,” and it must absorb and process a vast amount of existing text in order to mimic what humans say and how they say it.

This type of data differs from what we’re used to seeing on the internet, such as behavioural and personal information utilised for targeted advertising by Facebook’s parent firm Meta Platforms.

This data is generated by human users on numerous platforms, such as the hundreds of millions of posts on Reddit. A large enough body of human-written text can only be found on the Internet. Without it, today’s chat-based AI and related technologies would not work.

Jesse Dodge, a research scientist at the nonprofit Allen Institute for Artificial Intelligence, found in a 2021 paper that Wikipedia and countless copyrighted news articles from media outlets large and small are both present in the most commonly used web-crawl dataset. Google and Facebook have used this dataset to train large language models, while OpenAI uses a comparable one.

According to a paper the company published in 2020, OpenAI’s large language model uses posts gathered from Reddit to filter and refine the data used to train its artificial intelligence.

Reddit spokesperson Tim Rathschmidt said the company wasn’t sure how much money it would make by charging others to use its data, but that the data it holds could help improve today’s state-of-the-art large language models.

Legal and Moral Dilemma

Copying data from the open web, often known as scraping, is lawful in some situations, though firms are still arguing about how and when they can do so. Most businesses and organisations publish their material online in order for it to be discovered and indexed by search engines so that consumers may readily locate it.

However, replicating this data to train an AI, thereby eliminating the need to locate the original source, is an entirely different matter.

According to computational linguist Bender, tech corporations that collect information from the web to teach artificial intelligence follow the idea of “we can take it, so it’s ours.”

When text (including books, magazine articles, essays on personal blogs, patents, scientific papers, and Wikipedia content) is converted into chatbot responses, the links back to the sources are lost. That makes it more difficult for users to verify what the bot is telling them, which is a major issue for systems that so often state falsehoods.

This bulk scraping sweeps up our personal information as well. For more than a decade, Common Crawl has been crawling the massive amount of content on the open web and making its database publicly available to academics. Common Crawl’s database is also used as a starting point by firms training AI, such as Google, Meta, OpenAI, and others.

According to Sebastian Nagel, a data scientist and programmer at Common Crawl, a blog post you wrote a few years ago and have since deleted may still be in OpenAI’s training set, because the data the company trained its artificial intelligence on includes web content that is years old.

Unlike removing it from Google’s or Microsoft’s search indexes, deleting personal information from a trained AI requires retraining the entire model, according to Bender.

Dodge further stated that because retraining a large language model can be prohibitively expensive, firms are unlikely to do so even if users can demonstrate that their personal data was used to train the artificial intelligence. Due to the massive computing power required, training such models can cost tens of millions of dollars.

However, Dodge added that it is often difficult to get an AI trained on data sets containing personal information to regurgitate that information.

According to OpenAI, it has modified its chat-based system to deny requests for personal information. The governments of the European Union and the United States are exploring new legislation and regulations to govern this form of artificial intelligence.

Accountability and Profit Sharing

Some AI supporters think that AI should have access to as much data as its engineers can gather, because that is how humans learn. Why, logically, shouldn’t a machine do the same?

Setting aside the fact that AI is not yet comparable to humans, one problem with this view, according to Bender, is that AI cannot be held accountable for its actions under current law. Individuals who plagiarise other people’s work or attempt to repackage falsehoods as fact may suffer harsh consequences, but the machine and its developers are not held to the same standard.

Of course, this is not always true. Just as Getty sued image-generating AI companies for using its intellectual property as training data, businesses and other organisations are likely to sue chat-based AI makers for using their content without permission. Unless the two sides reach a licensing agreement, these disputes will end up in court.

Can the personal writings and messages that many people left on obscure forums and defunct social networks, among other places, really be what makes today’s chatbots such competent writers? Perhaps the only benefit to the creators of this content is knowing that they contributed to how chatbots use language.

Source: Netease Technology Report 


Sukhdev has a passion for sharing insights and experiences on a wide range of topics from technology to personal development!


