Distributed System
简介
Distributed systems play a crucial role in AI training, especially for large-scale machine learning models like the ones created by OpenAI’s GPT-3, GPT-4, and other advanced models.
motication
- Training Speed: Training large AI models is computationally intensive and can take a long time if done on a single machine.
- Huge Traning Data
- Large memory consuming
系统常见瓶颈
Communication: time-consuming and energy-consuming
- Solution: cache mode(1)
Database: Distributed data organization
for example in JindoFS
filesystem (for AI?)
Name | Company |
---|---|
JindoFS | aliyun |
S3A FileSystem | based on Amazon Web Services (AWS) S3 (Simple Storage Service) |
ByteFUSE | bytedance |
News Learning
Alluxio Enterprise AI 拥有去中心化元数据的分布式系统架构,可消除访问海量小文件(常见于AI 负载)时的性能瓶颈。
Alluxio Enterprise AI的分布式缓存功能使得AI引擎能够通过高性能Alluxio缓存(而非缓慢的数据湖存储)来读写数据。and call it DORA (Decentralized Object Repository Architecture)
需要进一步的研究学习
暂无
遇到的问题
暂无
开题缘由、总结、反思、吐槽~~
参考文献
上面回答部分来自ChatGPT-3.5,没有进行正确性的交叉校验。
无