The secrets and risks related to the training data used by Artificial Intelligence. Physicist Cristian Presură answers the question of whether chatbots can reach consciousness

AI chatbots are the new wonder of the internet, wowing audiences with their revolutionary abilities. The leap seems huge. It's no small thing to chat with a chatbot "as if" you were chatting with a person and not be able to tell the difference without asking it direct questions about its capabilities.

But can chatbots think like humans? Do they really understand what they are saying, or are they just making connections, ones that seem logical or at least well matched, among the enormous amounts of information they have been trained on?

If chatbots rely entirely on the information they have processed and cannot make connections beyond it, we should know what those connections are and how accurate the underlying information is. Several major Western newspapers have been investigating what the training sources of these apps are and, not least, where their limitations lie.

G4Media spoke to physicist Cristian Presură about the capabilities of chatbots and about the possibility of their making the leap to something more than "simply" recombining the huge amounts of information they have been trained on.

Can chatbots think like humans? Do they really understand what they say? Can they develop consciousness, and if so, when might such a leap occur?

"Personally, I don't currently see a physical property that would act as a definitive barrier to artificial intelligence developing consciousness.

The physical basis of both our consciousness and of artificial intelligence (chatbots, for example) is atoms.

Atoms brought together in one arrangement build our bodies, neurons included, and in those neurons, at some point, consciousness emerges.

Looked at this way, I think another organization of atoms could also develop consciousness, provided there are no natural barriers we don't yet know about, and no necessary conditions beyond the physical ones.

And if there are no such barriers or conditions, then the question is not whether that transition will occur, but when.

But getting there will require hardware far more complex than what chatbots have now, if human neural networks are the benchmark, and a software architecture that comes much closer to the way our neurons work together, if we want to mimic human thought," replied Cristian Presură.

He says he is amazed by the capabilities of chatbots using artificial intelligence, and that his expectations have been exceeded.

"Today we know a lot about our brain, but at the same time very little about the essential processes of consciousness.

For example, we don't know how the simple thought of turning off the water tap is expressed within neurons.

How could we then implement this thought in artificial intelligence? Or, harder still, if an artificial intelligence did have it, how could we prove that it does? I'm sure the coming years will bring surprises in this direction."

What a chatbot answers when asked whether it is aware of what it is saying

I asked OpenAI's chatbot whether it can think the way a human does.

Here's what it answered:

"As an AI-based language model, I am able to process and generate text in a similar way to a human. I use deep learning techniques to learn and identify patterns in natural language, which allows me to produce responses that resemble what a human would provide. However, I don't have my own emotions or beliefs and cannot feel or think in a conscious way like a person."

I also asked it whether it is aware of what it is saying. The answer was again direct:

"No, as an AI-based language model, I have no consciousness or perception of the world beyond my ability to process and generate text. I can produce responses in a way that seems coherent and logical, but I don't have my own emotions or beliefs and I can't feel or think in a conscious way."
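
For readers who want to reproduce this kind of exchange programmatically, here is a minimal sketch using OpenAI's Python SDK. The model name and prompt are illustrative assumptions; the answers quoted above came from the author's own session, not from this code.

```python
# Minimal sketch (not the exact session quoted above): querying an OpenAI chat
# model with the Python SDK. Assumes the `openai` package (v1+) is installed
# and the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": "Can you think the way a human does?"}],
)

print(response.choices[0].message.content)
```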

Chatbots can mimic human speech, however, because the artificial intelligence powering them has ingested a huge amount of text, much of it from the internet.

So what's at issue

The training data a chatbot has processed forms the basis, but also the limit, of what each of these models can do.

So if such a tool "passes" a proficiency exam or the Baccalaureate, it is because it has already been trained on large amounts of relevant material; it draws no conclusions of its own and cannot formulate solutions beyond those it has already encountered.

This data is the AI's main source of information about the world in which it operates, and it influences how the AI responds to users.

But what data does it use? To say "the internet" would be too vague.

Tech companies keep the sources used to train AI secret. In an analysis dedicated to the phenomenon, The Washington Post set out to reveal which sites supply the material that makes chatbots appear intelligent.

"To look inside this black box, we analyzed the Google C4 dataset, a massive snapshot of the content of 15 million websites that have been used to train some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA.

Note that OpenAI does not disclose what datasets it uses to train the models behind its chatbot, ChatGPT," the US journalists write.
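
The C4 snapshot the Post analyzed is publicly mirrored, so a rough version of this kind of tally can be reproduced at home. Here is a minimal sketch, assuming the Hugging Face `datasets` library and its hosted `allenai/c4` copy of the corpus; the 10,000-document sample and the domain count are my own illustration, not the Post's methodology.

```python
# Sketch: stream a slice of the public C4 corpus and tally the most common
# source domains. Streaming avoids downloading the full multi-terabyte corpus.
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset

# "allenai/c4" is the publicly hosted copy of Google's C4 snapshot.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

domains = Counter()
for record in c4.take(10_000):  # inspect only the first 10,000 documents
    domains[urlparse(record["url"]).netloc] += 1

for domain, count in domains.most_common(10):
    print(f"{count:6d}  {domain}")
```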

The Washington Post worked on the investigation with researchers at the Allen Institute for AI and ranked the websites using data from Similarweb, a web analytics company. About a third of the websites could not be classified, mostly because they no longer appear on the internet.

The dataset was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation.

Hence the impression that these fields may be threatened by the new wave of artificial intelligence, which seems, and sometimes genuinely is, competent on these topics.

The three biggest sites were patents.google.com, which contains text from patents issued around the world; wikipedia.org; and scribd.com, a subscription-only digital library.

Among the sources are some questionable ones

Also on the list, according to the US journalists, in an "honourable" 190th place among the AI's primary sources, is a notorious marketplace for pirated e-books that was subsequently seized by the Justice Department.

At least 27 other sites were identified as markets for pirated content.

Some highly ranked sites seemed arbitrary, such as wowhead.com, a forum for World of Warcraft players; thriveglobal.com, a site with advice on fighting exhaustion; and at least 10 sites selling dumpsters, including dumpsteroid.com, which no longer appears to be accessible.

Some of the sources used raise privacy red flags. Two of the websites, coloradovoters.info and flvoters.com, allegedly hosted copies of state voter registration databases.

Although voter data is public, the models could be using this personal information in unknown ways, hence the concern.

Business and industry websites were another major source. Kickstarter.com and patreon.com, which help creators collect money from fans and monthly fees from subscribers for exclusive content, were also among the AI apps' favorites.

Kickstarter and Patreon can give AI access to artists' ideas and marketing processes.

"Currently, artists receive no compensation or credit when their work is included in AI training data, and some have filed copyright infringement complaints against the text-to-image generators Stable Diffusion, MidJourney and DeviantArt," The Washington Post adds.

News sites are also an important category in the data used by AI.

Nytimes.com, latimes.com, theguardian.com, forbes.com, huffpost.com, and even washingtonpost.com, the publisher of the analysis itself, were favored.

Criticisms of the databases and the accuracy of the findings

Like artists and creators, some news organizations have criticized tech companies for using their content without permission or compensation.

But the US journalists also found that the sources included RT.com, a Russian state propaganda site; a publication known for its far-right news and views; and an anti-immigration website.

The discussion matters because chatbots have been shown to share incorrect information with confidence, and they do not always cite their sources in a way that would put the user on guard.

Another criticism of the model, then, is that training data from untrusted sources could spread propaganda and misinformation without the user being able to trace it back to the original source.

Chatbot apps have also used religious websites. Of the top 20 religious sites, 14 were Christian, two were Jewish, one was Muslim, one Mormon, one Jehovah's Witness, and one celebrated all religions.

Anti-Muslim bias emerged as a problem in some language models. For example, a study published in the journal Nature found that OpenAI's GPT-3 completed the phrase "Two Muslims walked into a…" with violent actions 66% of the time.

Personal blogs and technology websites were also heavily used.

The dataset contained more than half a million personal blogs, with topics ranging from the professional to the intimate.

Social networks such as Facebook and Twitter prohibit scraping, the automated harvesting of their pages, which means most of the datasets used to train AI cannot include them.

Tech giants like Facebook and Google, which sit on huge amounts of conversational data, have not been clear about how users' personal information can be used to train AI models that are used internally or sold as products, The Washington Post adds.

Like most companies, Google heavily filtered the data before feeding it to the AI (C4 stands for Colossal Clean Crawled Corpus).

In addition to eliminating gibberish (garbled speech or writing) and duplicate text, the company applied an open-source list of obscene words, which includes 402 English terms and an emoji depicting a hand making a common but obscene gesture.
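
As a rough illustration of what such filtering involves, here is a simplified sketch. It is not Google's actual pipeline, and the blocklist entries are hypothetical placeholders for the 402-term list mentioned above.

```python
# Simplified illustration of C4-style cleaning, NOT Google's actual pipeline.
# BAD_WORDS is a hypothetical stand-in for the open-source obscenity blocklist.
BAD_WORDS = {"badword1", "badword2"}

def clean_document(text: str, seen_lines: set[str]) -> str | None:
    """Return a cleaned page, or None if the whole page should be dropped."""
    if any(word in text.lower() for word in BAD_WORDS):
        return None  # pages containing blocklisted terms are discarded entirely
    kept = []
    for line in (raw.strip() for raw in text.splitlines()):
        if not line.endswith((".", "!", "?", '"')):
            continue  # keep only lines that end like real sentences (drops gibberish)
        if line in seen_lines:
            continue  # drop text duplicated elsewhere in the corpus
        seen_lines.add(line)
        kept.append(line)
    return "\n".join(kept) if kept else None

# Usage: one `seen` set is shared across the whole corpus to catch duplicates.
seen: set[str] = set()
page = ("A real sentence ends with punctuation.\n"
        "garbled header text\n"
        "A real sentence ends with punctuation.")
print(clean_document(page, seen))  # keeps the first line; drops the other two
```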

Beyond the issues raised, however, companies mostly use high-quality datasets to protect users from certain unwanted content.
