AI Training Strategies Tested on World's Fastest Supercomputer
by Clarence Oxford
Los Angeles CA (SPX) May 16, 2024

Researchers at Oak Ridge National Laboratory (ORNL) investigated techniques for training a large AI model on the Frontier supercomputer.

The study, led by Sajal Dash, Feiyi Wang, and Prasanna Balaprakash, used Frontier, the world's first exascale supercomputer, for the initial stages of training a large language model. The team tested how models with 22 billion, 175 billion, and 1 trillion parameters could run across 128, and later 384, of Frontier's more than 9,400 nodes. The team did not train any model to completion.

Large language models aim to mimic human brain patterns in learning and recognizing words and numbers, improving over time with more training. The goal is to create a model that can apply learned knowledge to new, unfamiliar tasks.

Traditionally, the resources needed for such training are held by private companies, limiting research opportunities and verification. Frontier's supercomputing power, however, offers new possibilities for training AI models more efficiently.

"Traditionally, this process has relied on expert knowledge or on trial and error," said Prasanna Balaprakash, ORNL's director of AI programs. "One of the highlights of our work in this study is the automation of identifying high-performing strategies among a vast array of options. We leveraged DeepHyper, an open-source scalable tuning software, to automatically determine the optimal settings. We plan to extend this automated approach to fine-tune system-level performance and enhance efficiency at an...

Training a large language model with a trillion parameters from start to finish without optimizations would take months, even at Frontier's speeds. The ORNL study looked at data parallelism, which breaks a large problem into smaller parts that are processed simultaneously, both to speed up AI training and to carry that training across different GPU platforms.
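
In the data-parallel pattern, every GPU holds a full replica of the model, processes its own slice of each batch, and gradients are averaged across replicas after every backward pass. Below is a minimal sketch of that pattern, assuming PyTorch's DistributedDataParallel; the study's actual framework and code are not reproduced here:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # One process per GPU, launched e.g. with:
        #   torchrun --nproc_per_node=8 ddp_sketch.py
        dist.init_process_group(backend="nccl")  # ROCm builds route this to RCCL
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for a model block
        model = DDP(model, device_ids=[local_rank])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for _ in range(10):
            x = torch.randn(8, 1024).cuda()  # each rank draws its own data shard
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()                  # DDP averages gradients across ranks
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with one process per GPU, each rank trains an identical copy of the model on different data, which is what lets throughput scale with node count.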

"It's about finding the best combination of training strategies while getting the best throughput," Dash said. "Most deep-learning frameworks target the GPUs made by NVIDIA rather than the GPUs made by AMD that power Frontier. We wanted to see if existing models could run on Frontier, how to make the best use of Frontier's computing power and how to make that level of performance possible across GPU platforms.

"We can't train a model this size on a single GPU or a single node, for example, and every time we cross the barrier between nodes that requires more communication that consumes more time. How do we slice up the model across GPUs so that we can fit and train the model without losing too much time and energy communicating between nodes?"

The researchers found that a blend of parallelism strategies worked best when tailored to the computing platform, but said their work is far from finished.

"The efficiency we achieved on Frontier with this model was decent but not decent enough," Wang said. "At extreme scale, we achieved 30% efficiency - which means we left about 70% of Frontier's computing power on the floor. We need much more optimization to make the machine more efficient at this scale."

Next steps include training a model further with peer-reviewed scientific data across more nodes.

"This study and our findings aren't so much a manual as a potential set of guidelines for users training a large model," Dash said. "They can draw from our experience to decide how to use Frontier's resources to train their particular model and make the most effective use of their allotted computing time."

The study was presented at the International Supercomputing Conference High Performance 2024 in Hamburg, Germany. Collaborators included Isaac Lyngaas, Junqi Yin, Xiao Wang, and Guojing Cong of ORNL and Romain Egele of Paris-Saclay University.

The study focused on optimizing the use of GPUs for training AI, with each of Frontier's nodes relying on four AMD MI250X GPUs, each of which contains two independent graphics compute dies.

The training ran for a few hours on about 100 million tokens of test data, a small fraction of the data needed for a trillion-parameter model.
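
For scale: by the widely cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (an external heuristic, not a figure from this study), a trillion-parameter model would want on the order of 20 trillion tokens:

    # Rough scale comparison using the ~20-tokens-per-parameter heuristic
    # from the Chinchilla scaling work (an assumption, not from the paper).
    params = 1e12
    tokens_used = 100e6               # the test run described above
    tokens_needed = 20 * params       # ~2e13, i.e. 20 trillion tokens

    print(f"fraction of a compute-optimal dataset: {tokens_used / tokens_needed:.0e}")
    # => 5e-06, about five millionths of a full training run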

"This study was largely an exercise to show we can train this particular size of model on Frontier at this particular scale with this particular level of efficiency," Wang said. "We didn't get anywhere near the finish line of a complete large language model."

Research Report: Optimizing Distributed Training on Frontier for Large Language Models

Related Links
Oak Ridge National Laboratory
Innovative and Novel Computational Impact on Theory and Experiment Program
