Mongolian startup Egune aims to build the AI that understands Mongolia best
蒙古国初创公司Egune:做最懂蒙古的AI
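Silicon Valley companies are racing to train artificial intelligence models on tens of thousands of graphics processing units and are offering star researchers multimillion-dollar salaries.
硅谷公司正争相用数以万计的图形处理单元来训练人工智能模型,并为明星研究人员提供数百万美元的薪水。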
Meanwhile, in Mongolia, Badral Sanlig is building a large language model with only 128 GPUs and a small team of engineers to serve just his country.
与此同时,在蒙古国,巴德拉勒·桑利格(Badral Sanlig)正用128个图形处理单元和一支小规模的工程师团队,为他的国家打造一个大型语言模型。
Sanlig’s startup, Egune AI, is part of a growing movement to build LLMs in low-resource languages to expand AI access for underserved populations.
桑利格(Sanlig)创办的初创公司Egune AI,正投身于一项日益壮大的运动:用低资源语言构建大型语言模型,以扩大人工智能在服务不足人口中的普及。
Despite a shortage of training data, compute power, talent, and funding, these small models are attracting government clients and individual users keen to safeguard their language, cultural identity, and sovereignty in the face of dominance by American and Chinese firms.
尽管缺乏训练数据、算力、人才和资金,但这些小规模模型正吸引着政府客户和个人用户,他们渴望在美国和中国公司的主导地位面前,保护自己的语言、文化认同和主权。
A software engineer trained in Germany, Sanlig began developing Mongolian speech recognition models in 2019, after noticing that smart devices like Alexa often struggled to understand his native tongue.
作为一名在德国受训的软件工程师,桑利格(Sanlig)在2019年注意到,像Alexa这样的智能设备常常难以理解他的母语,于是他开始开发蒙古语语音识别模型。
Weeks after OpenAI launched ChatGPT in November 2022, Sanlig started working on an LLM.
在2022年11月OpenAI发布ChatGPT几周后,桑利格(Sanlig)开始着手开发一个大型语言模型。
He named the startup Egune, which means “it” in Mongolian, he told Rest of World.
他告诉《世界其他地区》杂志,他将这家初创公司命名为Egune,在蒙古语中意为“它”。
“The early stage was very difficult, because we didn’t have a lot of GPUs or a lot of money,” the 45-year-old said.
这位45岁的创始人说:“早期阶段非常困难,因为我们没有很多图形处理单元,也没有很多钱。”
“But we believed that LLMs could solve our language [recognition] problem because OpenAI had proved it.”
“但我们相信大型语言模型能解决我们的语言[识别]问题,因为OpenAI已经证明了这一点。”
Due to Mongolia’s close ties with China and Russia, getting U.S. approval for purchasing the Nvidia chips commonly used to train AI models can be a complicated and long-drawn-out affair — Sanlig flies to Berlin frequently to buy gaming-grade chips.
由于蒙古国与中国和俄罗斯的紧密关系,获得美国批准购买常用于训练人工智能模型的英伟达芯片可能是一个复杂而漫长的过程——桑利格(Sanlig)频繁飞往柏林购买游戏级芯片。
Training data is another hurdle.
训练数据是另一个障碍。
Egune relies on Mongolian university texts, library archives, and synthetic data.
Egune依赖于蒙古国大学的课本、图书馆档案和合成数据。
The team in Ulaanbaatar of about 40 consists mostly of young engineers who studied abroad and do not have much work experience.
位于乌兰巴托的这支约40人的团队,成员大多是在国外学习、工作经验尚浅的年轻工程师。
Top-ranked LLMs such as GPT, Anthropic’s Claude, and Google’s Gemini generally have hundreds of billions of parameters, or variables within a model that determine how it behaves.
像GPT、Anthropic的Claude和谷歌的Gemini等顶级大型语言模型通常拥有数千亿个参数,也就是决定模型行为的变量。
In contrast, the first model from Egune had 5 billion parameters and could only understand Mongolian.
相比之下,Egune的第一个模型只有50亿个参数,并且只能理解蒙古语。
The next model, with 34 billion parameters, was built with multi-language data and more general knowledge, Sanlig said.
桑利格(Sanlig)说,下一个模型拥有340亿个参数,是用多语言数据和更多通用知识构建的。
The latest Egune model has 70 billion parameters and improved coding and reasoning capability.
Egune最新的模型拥有700亿个参数,并且编码和推理能力都有所提升。
Despite their larger scale, the performance of bigger models in low-resource languages can be unpredictable: Companies are more focused on getting the models to excel in English and other widely used languages such as Spanish, French, and German.
尽管规模更大,但大型模型在低资源语言上的表现却可能难以预测:公司更专注于让模型在英语和其他广泛使用的语言(如西班牙语、法语和德语)上表现出色。
With their own models, smaller communities can make improvements and correct inaccuracies more quickly, Sanlig said.
桑利格(Sanlig)说,有了自己的模型,小群体可以更快地进行改进并纠正不准确之处。
Egune focuses on serving a small base of 3.5 million people in Mongolia.
Egune专注于服务蒙古国350万人口这一小众群体。
Its models are better at not only the language but also the history, culture, and nomadic lifestyle, he said.
他说,他们的模型不仅在语言方面表现更好,而且在历史、文化和游牧生活方式方面也更胜一筹。
“Mongolian culture is very different from Western culture, and the nomadic people may have some [questions like]: What’s the problem with this horse?” said Sanlig.
桑利格(Sanlig)说:“蒙古文化与西方文化大相径庭,游牧民族可能会有一些[类似这样的问题]:这匹马出了什么问题?”
“We can decide which information we feed into the models.”
“我们可以决定向模型输入哪些信息。”
Egune Chat currently has about 24,000 daily users.
Egune聊天机器人目前拥有约2.4万名日活跃用户。
The company has also developed AI applications for a telecom company, a bank, and several government agencies, Sanlig said.
桑利格(Sanlig)说,该公司还为一家电信公司、一家银行和多个政府机构开发了人工智能应用程序。
Several similar efforts around the world are seeking similar outcomes with their own models.
世界各地有几项类似的工作正在用他们自己的模型寻求类似的结果。
These include Latin America’s Latam-GPT, Singapore’s Sea-Lion, India’s BharatGPT, and Nigerian startup Awarri, which is building an LLM trained in five local languages and accented English.
其中包括拉丁美洲的Latam-GPT、新加坡的Sea-Lion、印度的BharatGPT,以及尼日利亚初创公司Awarri,后者正在构建一个用五种当地语言和带口音的英语进行训练的大型语言模型。
There are real benefits to such LLMs.
这类大型语言模型确实具有优势。
The unreliability of mainstream models in low-resource languages can exacerbate inequality, causing communities to miss out on AI’s economic benefits, while suffering more from its biases and hallucinations, Caroline Meinhardt, policy research manager at the Stanford Institute for Human-Centered Artificial Intelligence, told Rest of World.
斯坦福大学以人为本人工智能研究所的政策研究经理卡罗琳·迈因哈特(Caroline Meinhardt)告诉《世界其他地区》杂志,主流模型在低资源语言方面表现不可靠,可能加剧不平等,导致相关社区错过人工智能带来的经济效益,同时遭受更多偏见和“幻觉”的困扰。
Besides exploring AI to better serve local populations and represent local values, researchers and authorities are concerned with the national security implications of relying on mostly U.S. AI technologies, said Meinhardt, who studies low-resource language LLMs.
研究低资源语言大型语言模型的迈因哈特(Meinhardt)说,除了探索人工智能以更好地服务当地人口和体现当地价值观之外,研究人员和政府当局还关注主要依赖美国人工智能技术对国家安全的影响。
“There is this desire to be able to develop your own AI systems that align with your own specific cultural and linguistic context, but also perhaps your own political context,” she said.
她说:“人们渴望能够开发出符合自己特定文化和语言背景、或许还有自己的政治背景的人工智能系统。”
But a lack of resources makes it hard for these models to compete with those made by American and Chinese tech giants.
但资源的匮乏使得这些模型难以与美国和中国科技巨头制造的模型竞争。
ChatGPT is much more popular than Egune in Mongolia, said Tserennyam Sukhbaatar, a native Mongolian and a marketing professor at Brigham Young University–Hawaii.
土生土长的蒙古人、杨百翰大学夏威夷分校市场营销学教授策仁尼亚姆·苏赫巴特尔(Tserennyam Sukhbaatar)说,在蒙古国,ChatGPT比Egune受欢迎得多。
“Even though it’s a Mongolian AI, [it’s] not as good as ChatGPT,” he told Rest of World.
他告诉《世界其他地区》杂志:“尽管它是蒙古国的人工智能,但它不如ChatGPT。”
Still, it’s important that a local company has control over Mongolian models and training data, and he tries to use Egune occasionally to show his support, Sukhbaatar said.
苏赫巴特尔(Sukhbaatar)说,不过,由一家本土公司控制蒙古国的模型和训练数据至关重要,他偶尔会尝试使用Egune,以示支持。
“The Mongolian economy is heavily focused on mining, so if somebody is trying to do business in technology, we have to raise both hands to support that business,” he said.
他说:“蒙古国经济严重依赖采矿业,所以如果有人试图在科技领域创业,我们必须举双手支持他们。”
Mongolia, with its vast deposits of coal and copper, derives about a quarter of its gross domestic product from mining.
蒙古国拥有储量丰富的煤炭和铜矿,其国内生产总值(GDP)约有四分之一来自采矿业。
The country has a small tech industry, with a startup ecosystem worth only about $156 million, according to a 2022 report by the Japan International Cooperation Agency.
根据日本国际协力机构2022年的一份报告,蒙古国的科技产业规模很小,其初创企业生态系统的总价值仅为1.56亿美元左右。
Egune is valued at about $39 million, with a recent investment of $3.5 million from Golomt Bank, one of the biggest banks in the country.
Egune的估值约为3900万美元,最近从该国最大的银行之一Golomt银行获得了350万美元的投资。
While Mongolia is not a tech heavyweight, Egune can help other small countries build their own LLMs, Sanlig said.
桑利格(Sanlig)说,虽然蒙古国不是一个科技强国,但Egune可以帮助其他小国建立自己的大型语言模型。
The startup showcases Mongolia’s potential to build a homegrown tech industry, even as the country suffers from a brain drain of young workers seeking education and jobs abroad, he said.
他说,尽管该国正遭受年轻劳动力为寻求海外教育和工作而导致的人才流失,但这家初创公司仍展示了蒙古国建立本土科技产业的潜力。
“If they use Egune, the young generation will understand the importance of this fundamental work,” said Sanlig.
桑利格(Sanlig)说:“如果他们使用Egune,年轻一代就会明白这项基础性工作的重要性。”
“The investment climate will also be warmer.”
“投资环境也会随之回暖。”
