Thursday, May 18, 2023

How Large Language Models Prove Chomsky Wrong with Steven Piantadosi

Joining SlatorPod this week is Steven Piantadosi, Associate Professor of Psychology at UC Berkeley. Steven also runs the computation and language lab (colala) at UC Berkeley, which studies the basic computational processes involved in human language and cognition.


Steven talks about the emergence of large language models (LLMs) and how it has reshaped our understanding of language processing and language acquisition.

Steven breaks down his March 2023 paper, “Modern language models refute Chomsky’s approach to language”. He argues that LLMs demonstrate a wide range of powerful language abilities, disprove foundational assumptions underpinning Noam Chomsky’s theories, and, as a consequence, negate parts of modern linguistics.

Steven shares how he prompted ChatGPT to generate coherent and sensible responses that go beyond its training data, showcasing its ability to produce creative outputs. While critics argue that it is merely an endless sequence of predicting the next token, Steven explains how the process allows the models to discover insights about language and potentially the world itself.
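To make the “predicting the next token” mechanics concrete, here is a deliberately tiny sketch: a bigram counter with greedy decoding. It is a toy stand-in for illustration only, and in no way how ChatGPT is actually implemented — production LLMs use learned neural networks over subword tokens, not word-count tables.

```python
from collections import defaultdict

def train_bigrams(corpus):
    """Count which token follows which in the training text."""
    counts = defaultdict(lambda: defaultdict(int))
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_tokens=10):
    """Repeatedly emit the most likely next token (greedy decoding)."""
    out = [start]
    for _ in range(max_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no known continuation: stop generating
        out.append(max(followers, key=followers.get))
    return " ".join(out)

model = train_bigrams("the cat sat on the mat the cat ran")
print(generate(model, "the", max_tokens=4))  # → the cat sat on the
```

Even this toy shows the core point: the model only ever predicts one token at a time, yet the repeated predictions chain into structured output.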

Steven acknowledges that LLMs operate differently from humans, as models excel at language generation but lack certain human modes of reasoning when it comes to complex questions or scenarios. He unpacks the BabyLM Challenge which explores whether models can be trained on human-sized amounts of data and still learn syntax or other linguistic aspects effectively.

Despite industry advancements and the trillion-dollar market opportunity, Steven agrees with Chomsky’s ethical concerns, including issues such as the presence of harmful content, misinformation, and the potential impact on job displacement.

Steven remains enthusiastic about the potential of LLMs and believes the recent advancements are a step toward achieving artificial general intelligence, but refrains from making any concrete predictions.

Thursday, May 11, 2023

Why Large Language Models Hallucinate When Machine Translating ‘in the Wild’

 Large language models (LLMs) have demonstrated impressive machine translation (MT) capabilities, but new research shows they can generate different types of hallucinations compared to traditional models when deployed in real-world settings. 

The findings, published in a paper on March 28, 2023, included evidence that the hallucinations were more prevalent when translating into low-resource languages and out of English and that they can introduce toxic text.

Hallucinations present a critical challenge in MT, as they may damage user trust and pose serious safety concerns, according to a 2022 research paper. Yet studies on detecting and mitigating hallucinations in MT have been limited to small models trained on a single English-centric language pair.

This has left “a gap in our understanding of hallucinations […] across diverse translation scenarios,” explained Nuno M. Guerreiro and Duarte M. Alves from the University of Lisbon, Jonas Waldendorf, Barry Haddow, and Alexandra Birch from the University of Edinburgh, Pierre Colombo from the Université Paris-Saclay, and André F. T. Martins, Head of Research at Unbabel, in the newly published research paper.

Looking to fill that gap, the researchers conducted a comprehensive analysis of various massively multilingual translation models and LLMs, including ChatGPT. The study covered a broad spectrum of conditions, spanning over 100 translation directions across various resource levels and going beyond English-centric language pairs.

According to the authors, this research provides key insights into the prevalence, properties, and mitigation of hallucinations, “paving the way towards more responsible and reliable MT systems.”

Detach from the Source 

The authors found that hallucinations are more frequent when translating into low-resource languages and out of English, leading them to conclude that “models tend to detach more from the source text when translating out of English.”

In terms of type of hallucinations, oscillatory hallucinations — erroneous repetitions of words and phrases — are less prevalent in low-resource language pairs, while detached hallucinations — translations that bear minimal or no relation at all to the source — occur more frequently. 
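As an aside, the oscillatory type lends itself to a simple heuristic flag: count repeated n-grams in the output. The detector below is a minimal sketch of that idea, not the detection method used in the paper.

```python
def has_oscillation(text, n=2, max_repeats=2):
    """Flag text whose most frequent n-gram occurs more than `max_repeats`
    times -- a crude proxy for an oscillatory hallucination."""
    tokens = text.split()
    if len(tokens) < n:
        return False
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return max(counts.values()) > max_repeats

print(has_oscillation("der Krieg der Krieg der Krieg der Krieg begann"))  # → True
print(has_oscillation("der Krieg begann 1939"))                           # → False
```

Detached hallucinations are much harder to catch this way, since the output is fluent; detecting them typically requires comparing the translation back against the source.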

According to the authors, “this reveals that models tend to rely less on the source context when translating to or from low-resource languages.”

The rate of hallucinations exceeded 10% in some language pairs, such as English-Pashto, Tamil-English, Azerbaijani-English, English-Azerbaijani, Welsh-English, English-Welsh, and English-Asturian. However, the authors suggest that hallucination rates can be reduced by increasing the size of the model (scaling up) or using smaller distilled models.

Hallucinations and Toxicity

The authors also found that hallucinations may contain toxic text, mainly when translating out of English and into low-resource languages, and that scaling up the model size may not reduce hallucinations. 

This indicates that hallucinations might be attributed to toxic patterns in the training data and underlines the need to filter the training data rigorously to ensure the safe and responsible use of these models in real-world applications.

The authors emphasize that while massive multilingual models have significantly improved the translation quality for low-resource languages, the latest findings underscore potential safety concerns and the need for improvement.

To mitigate hallucinations and improve overall translation quality, they explored fallback systems, finding that hallucinations can be “sticky and difficult to reverse when using models that share the same training data and architecture.” 

However, external tools, such as NLLB, can be leveraged as fallback systems to improve translation quality and eliminate pathologies such as oscillatory hallucinations.
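A fallback setup of that kind can be sketched roughly as follows. The functions `primary_translate`, `fallback_translate`, and `is_hallucination` are hypothetical stand-ins for illustration, not the paper’s actual code.

```python
def translate_with_fallback(source, primary_translate, fallback_translate,
                            is_hallucination):
    """Try the primary MT system first; if the detector flags its output,
    re-translate with an external fallback system (e.g. NLLB)."""
    candidate = primary_translate(source)
    if is_hallucination(source, candidate):
        return fallback_translate(source), "fallback"
    return candidate, "primary"

# Stub systems for illustration only.
primary = lambda s: "ich ich ich ich"                   # pretend the model oscillates
fallback = lambda s: "ich mag Katzen"                   # pretend the fallback does fine
detector = lambda src, hyp: len(set(hyp.split())) == 1  # toy hallucination check
print(translate_with_fallback("I like cats", primary, fallback, detector))
# → ('ich mag Katzen', 'fallback')
```

The key design point from the paper is that the fallback should not share training data and architecture with the primary model; otherwise the hallucination tends to be “sticky” and reappear.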

ChatGPT Surprise

The authors also found that ChatGPT produces different hallucinations compared to traditional MT models. These errors may include off-target translations, overgeneration, or even failed attempts to translate. 

Furthermore, unlike traditional MT models, which frequently produce oscillatory hallucinations, ChatGPT does not generate any such hallucinations under perturbation. “This is further evidence that translation errors, even severely critical ones, obtained via prompting an LLM are different from those produced by traditional machine translation models,” explained the authors.

Moreover, the results revealed that ChatGPT generates more hallucinations for mid-resource languages than for low-resource languages, highlighting that “it surprisingly produces fewer hallucinations for low-resource languages than any other model.”

The authors note that while the majority of the hallucinations can be reversed with further sampling from the model, this does not necessarily indicate a defect in the model’s ability to generate adequate translations, but rather may be a result of “bad luck” during generation, as Guerreiro, Martins, and Elena Voita, AI Research Scientist at Meta, wrote in a 2022 research paper.
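The “further sampling” remedy amounts to drawing candidate translations until one passes a quality check. A minimal sketch, with a toy candidate stream and a toy check standing in for a real model and quality estimator:

```python
def sample_until_ok(draws, is_ok):
    """Return the first candidate translation that passes the quality
    check; if none does, return the last one drawn."""
    candidate = None
    for candidate in draws:
        if is_ok(candidate):
            break
    return candidate

# Toy candidate stream: the first draw is a hallucination ("bad luck"),
# the resample is fine. The quality check is a toy stand-in as well.
draws = ["!!! !!! !!!", "guten Morgen"]
print(sample_until_ok(draws, lambda t: "!" not in t))  # → guten Morgen
```

If a second draw from the same model usually fixes the output, the failure looks like sampling noise rather than a gap in what the model knows — which is exactly the “bad luck” reading the authors propose.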

To facilitate future research in this area, the authors have made their code openly available and released over a million translations and detection results across several models and language pairs.

Saturday, January 28, 2023

Tencent Pits ChatGPT Translation Quality Against DeepL and Google Translate

 


Since OpenAI launched ChatGPT in November 2022, headlines have asked whether workers in a range of fields should worry about being replaced by the advanced AI chatbot. Now, a January 2023 paper from a Chinese tech company, Tencent, asks the question on behalf of the language industry: Is ChatGPT A Good Translator?

The Tencent team goes about answering the question by reviewing, shall we say, a limited set of data. The team said “obtaining the translation results from ChatGPT is time-consuming since it can only be interacted with manually and can not respond to large batches. Thus, we randomly sample 50 sentences from each set for evaluation.” So, let’s see what insights the team gathered by evaluating those 50 sentences.

According to the paper, ChatGPT performs “competitively” with commercial machine translation (MT) products, such as Google Translate, DeepL, and Tencent’s own system, on high-resource European languages, but struggles with low-resource or unrelated language pairs.

In other words, one observer on Twitter quipped, “Potential alternative headline/interpretation: ‘ChatGPT was trained for translation on common publicly available parallel corpora.’”

For this “preliminary study,” Tencent AI Lab researchers Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu evaluated translation prompts, multilingual translation, and translation robustness.

Meta Moment

The experiment started with a “meta” moment when the team asked ChatGPT itself for prompts or templates that would trigger its MT ability. The prompt that produced the best Chinese–English translations was then used for the rest of the study — 12 directions total between Chinese, English, German, and Romanian.

Researchers were curious as to how ChatGPT’s performance might vary by language pair. While ChatGPT performed “competitively” with Google Translate and DeepL for English–German translation, its BLEU score for English–Romanian translation was 46.4% lower than that of Google Translate.
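The 46.4% figure is a relative gap, i.e. the shortfall expressed as a share of the baseline system’s score. The BLEU numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def relative_bleu_gap(baseline_bleu, system_bleu):
    """Shortfall of `system_bleu` versus `baseline_bleu`, as a percentage
    of the baseline score."""
    return 100.0 * (baseline_bleu - system_bleu) / baseline_bleu

# Hypothetical scores chosen only to illustrate the arithmetic:
# a baseline at 30.0 BLEU and a system at 16.08 BLEU.
print(round(relative_bleu_gap(30.0, 16.08), 1))  # → 46.4
```

Reading the gap this way matters: a “46.4% lower” BLEU score does not mean the translations are half as good, only that the automatic-metric score fell that far short of the baseline’s.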

The team attributed the poor performance to the pronounced difference in monolingual data for English and Romanian, which “limits the language modeling capability of Romanian.”

Romanian–English translation, on the other hand, “can benefit from the strong language modeling capability of English such that the resource gap of parallel data can be somewhat compensated,” for a BLEU score just 10.3% below Google Translate.

Beyond the Family

Beyond resource differences, the authors wrote, translating between language families is considered more difficult than translating within language families. The difference in the quality of ChatGPT’s output for German–English versus Chinese–English translation seems to bear this out.  

Researchers observed an even greater performance gap between ChatGPT and commercial MT systems for low-resource language pairs from different families, such as Romanian–Chinese. 

“Since ChatGPT handles different tasks in one model, low-resource translation tasks not only compete with high-resource translation tasks but also with other NLP tasks for the model capacity, which explains their poor performance,” they wrote.

Google Translate and DeepL both surpassed ChatGPT in translation robustness on two out of three test sets: WMT19 Bio (Medline abstracts) and WMT20 Rob2 (Reddit comments), likely thanks to their continuous improvement as real-world applications fed by domain-specific and noisy sentences. 

However, ChatGPT outperformed Google Translate and DeepL “significantly” on the WMT20 Rob3 test set, which contained a crowdsourced speech recognition corpus. The authors believe this finding suggests that ChatGPT is “capable of generating more natural spoken languages than these commercial translation systems,” hinting at a possible future area of study.

Also Read:

We Prompted ChatGPT to be a Translation Manager

Thursday, January 26, 2023

Why Netflix Shut Down Its Translation Portal Hermes

In response to soaring content localization needs, online streaming giant Netflix launched a recruitment drive to attract fresh translation talent in March 2017. The program, named Hermes, was billed as “the first online subtitling and translation test and indexing system by a major content creator,” and advertised that “​Netflix is Looking for the Best Translators Around the Globe.”

At that time, Netflix movies were being translated into more than 20 languages, and the localization effort had gone into overdrive following the global launch of the service just a year earlier, in January 2016.

By March 2018, one year after the Hermes launch, Netflix had issued a statement on its website to announce that the program was being closed. The notification read: “we have reached our capacity for each one of the language tests due to the rapid popularity and response from applicants all over the world. Therefore we are closing the platform to future testing at this time.”

At the time, Slator reached out to Netflix for comment on the closure of the platform, which seemed rather unusual. “We [don’t] have anything to add at this time outside of the messaging posted on our site,” a Netflix spokesperson told Slator in an email.

Leaving Onboarding to the Experts

Now, Netflix has provided more color on the reasons behind the closure of the Hermes project and it seems the company may have bitten off more than it could chew. In a presentation at the Languages & The Media 2018 conference in Berlin, Allison Smith, Program Manager, Localization Solutions at Netflix, explained how the Hermes project had been highly ambitious in its goal of testing, training and onboarding thousands of new translators.


Yet after much introspection, Smith said, the team pivoted and decided that those activities were better left to the ten or so localization vendors that Netflix partners with, allowing Netflix to focus on tasks more aligned to its core competencies such as content localization workflows, engineering and development. It is a tech company after all.

When asked by an audience member during the Q&A how successful Hermes had been in onboarding translators, Smith responded that it had been valuable in other ways. The project generated lots of new ideas that Netflix has taken forward such as scheduling improvements, enhanced style guides and continued development of a cloud-based content localization platform, Smith said.

“Netflix aimed to own the full process from subtitler recruitment through to working in our tools, and this started with Hermes. While we learned a lot and did get value from the test, after introspection and analyzing our core competencies, we decided vendors were better suited to use their core competencies and add value to the content localization ecosystem by owning the recruiting, training and onboarding processes.” — Allison Smith, Program Manager, Localization Solutions, Netflix

Netflix Asks the Audience

Many of Netflix’s preferred localization vendors have their own translation environments, which translators may use as opposed to Netflix’s own tool. It’s not clear how much of the translation work is being done in the Netflix platform itself and how much is being done in external platforms.

Still, Netflix continues to seek feedback on its localization platform to inform the development roadmap. To collect additional on-the-spot feedback from the 370 Languages & the Media audience members, many of whom were translators, a snap poll was taken during Smith’s presentation asking what additional features people would like to see built into Netflix’s timed-text tool. The most popular feature requests were spell check and autocorrect capabilities, an offline version of the cloud platform, and translation memory (TM) integration.


The same audience, professionals from all areas of the media localization industry, was also polled on how they felt about using machine translation (MT) in subtitling. The response was conservative: on a scale of one to four stars, with four being the most positive, 61% gave one or two stars, 19% gave three stars, and 20% gave the full four stars.


Smith was also asked about Netflix’s approach to neural MT during the closing panel. The Localization Manager said that currently “it is not part of the strategic plan but we are certainly aware of it.”

Who’s Doing Dubbing?

Smith also explained that the economics of dubbing versus subtitling are very different, but highlighted that Netflix sees real value in providing customers with choice. “Choice is more important than a particular preference,” Smith said, and the Netflix localization model is not based on the traditional idea that particular countries prefer dubbing over subtitling. For Netflix it is more personal than that and, most notably, it is impossible to get real data on preferences unless the choice is offered, Smith explained.

Netflix has also recently released a list of their dubbing partners, who are arranged into gold, silver and bronze tiers. A list of these partners is included in the table below:

Download the Slator Media Localization Report 2018 for more actionable insights into the media localization industry.

Tuesday, January 17, 2023

Here are the Best and Worst-Performing, Publicly-Traded LSPs in 2022

 In a tough year for global stock and bond markets against a backdrop of rising interest rates, only a couple language service providers (LSPs) were able to buck the downtrend.

Zoo Digital tops the list again in 2022. The multimedia localization company started the year with a 5% stock price increase and was up 14% as January ended. The stock continued its steady climb past the end of Q2, ending with a share price of USD 178 (+33%). The company boldly announced in September that it expected an 89% year-on-year revenue increase.

In a year that saw global markets tumble across the board, it is no surprise that all LSPs except Zoo Digital and, notably, Honyaku Center shed value in 2022. Japan’s largest LSP came in second, rising a mere 8% over the year.

2022 Performance of Listed LSPs

Game localizer Keywords Studios had a good year, with five new acquisitions (59 to date) and a revenue increase of 36.7%. The company’s Globalize division accounted for approximately half of the total revenue. Despite overall positive results, the company’s stock lost value during the first three quarters but was up 1% at the end of the calendar year.

Star7, a Star Group company with 22 years in business, has been publicly traded for only one year. The company’s shares performed comparatively better than those of nine other LSPs, ending the year at –6.44%.

Shares in Japan’s MetaReal Group, Rozetta Corporation’s parent company, greatly fluctuated and ended on the negative end at –16%. The language technology company ended the fiscal year with an increase in revenues of 4% to JPY 4.16Bn (USD 32.2m).

Australia’s Straker Translations shared positive news in November 2022 about getting an extension on the company’s contract with IBM. The markets reacted favorably, but the news was not enough to defy the broader market forces and the stock ended the year down –21%. 

RWS saw its shares tumble during 2022 as investors reassessed its exposure to big tech companies, which for the first time in over a decade are making cuts to staff and budgets. RWS shares shed over half of their value by October but have since regained some ground, ending the year at –46%.

Despite closing FY23 with a positive cash flow and some wins on the sales front, revenues at multilingual captioning company Ai-Media were down 5% year on year. The stock had lost roughly half its value, ending the year down 52%.

2022 Language Industry Market Report Cover

Slator 2022 Language Industry Market Report

100-page flagship report on market size, buyer-segments, competitive landscape, sales and marketing insights, language tech and more.

The big loser this year was data annotation company Appen, whose stock continued its free fall throughout 2022, down 74.5% for the year. Once valued at USD 4bn, Appen’s market cap now stands at about USD 300m.

Keeping Appen company at the bottom of the performance scale is Canada’s VIQ Solutions, which started the year with a stock price above USD 2 and ended it as a penny stock at USD 0.25.

Check out Slator’s Real-Time Charts of Listed LSPs for an up-to-the-minute look at the current performance of listed language and related tech companies.

Language Discordance Raises Risk of Hospital Readmissions, U.S. Study Finds

A June 2024 meta-analysis published in BMJ Quality & Safety was recently brought back into the spotlight by Dr. Lucy Shi, who disc...