Vegas Valley News

GPT-4o’s Chinese token-training data is polluted by spam and porn websites

by Vegas Valley News

May 18, 2024

in Technology

The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.

“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, the cost reduction is “almost four times,” he says.
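The cost argument comes down to simple arithmetic: if a model bills per token and a longer-token vocabulary encodes the same text in fewer tokens, the bill shrinks in proportion. The sketch below illustrates this under stated assumptions. The token counts and the per-million-token price are hypothetical values chosen for illustration, not figures from OpenAI or from Das.

```python
def prompt_cost(num_tokens: int, price_per_million: float) -> float:
    """Cost of a prompt when billing is proportional to token count."""
    return num_tokens * price_per_million / 1_000_000


# Hypothetical numbers: the same non-English prompt encoded by an
# older tokenizer and by a newer one with longer tokens.
old_token_count = 400
new_token_count = 100
assumed_price = 5.0  # assumed price in USD per million tokens

old_cost = prompt_cost(old_token_count, assumed_price)
new_cost = prompt_cost(new_token_count, assumed_price)

# The saving depends only on the ratio of token counts, not on the price.
reduction_factor = old_token_count / new_token_count

print(f"old cost: ${old_cost:.6f}, new cost: ${new_cost:.6f}")
print(f"cost reduction factor: {reduction_factor:.1f}x")
```

Because the price cancels out, the reduction factor is just the ratio of token counts before and after the tokenizer change.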

Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “college,” and “worldwide” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.

That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

Polluted data and a lack of cleaning

However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.

“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It’s not rare for a language model to crawl spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.

These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. And the language is inserted into content farm websites or sometimes legitimate websites so they can be indexed by search engines, circumvent the spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.

Copyright © 2024 Vegasvalleynews.com | All Rights Reserved.
