Distributed System

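Introduction

Ever since ChatGPT took off, the industry has been paying much more attention to distributed training and distributed systems (AI Infra / AI Infrastructure). I am still a beginner in this area, though, and have plenty of "what" and "why" questions left to learn.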

Overview

Distributed systems play a crucial role in AI training, especially for large-scale machine learning models such as OpenAI's GPT-3 and GPT-4.

motivation

  1. Training speed: Training large AI models is computationally intensive and can take a long time if done on a single machine.
  2. Huge training data
  3. Large memory consumption

Common system bottlenecks

  1. Communication: time-consuming and energy-consuming

    1. Solution: cache mode (for example, in JindoFS)
  2. Database: distributed data organization

filesystem (for AI?)

Name      Company
JindoFS   Aliyun
S3A       FileSystem based on Amazon Web Services (AWS) S3 (Simple Storage Service)
ByteFUSE  ByteDance

News Learning

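Alluxio Enterprise AI has a distributed system architecture with decentralized metadata, which removes the performance bottleneck of accessing massive numbers of small files (common in AI workloads).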

decentralized

The opposite of decentralized is a centralized system or architecture, where data and control are managed from a single central point. In a centralized system, a single location or entity controls and maintains critical functions or resources, which can include data management, decision-making, or metadata management. Centralized systems can become performance bottlenecks under large-scale, distributed workloads, because every request passes through the same point, which is also a single point of failure. A decentralized design instead spreads these functions across many nodes, so no single node has to serve every request.
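One common way to avoid a central lookup service is consistent hashing: every client hashes a file path onto a ring and works out locally which node owns that path's metadata. The sketch below is only an illustration of that idea, not a description of how Alluxio is implemented; the `HashRing` class, the `worker-N` names, and the `/dataset/...` paths are all made up for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring: any client can compute which node owns a path."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points so load spreads more evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, path: str) -> str:
        # First ring point clockwise from the path's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical workers and file paths, for illustration only.
ring = HashRing(["worker-0", "worker-1", "worker-2"])
paths = [f"/dataset/img_{i:05d}.jpg" for i in range(1000)]
owners = {p: ring.owner(p) for p in paths}

# Adding a node only reassigns the paths that the new node takes over.
bigger = HashRing(["worker-0", "worker-1", "worker-2", "worker-3"])
moved = [p for p in paths if bigger.owner(p) != owners[p]]
print(f"{len(moved)} of {len(paths)} paths moved after adding worker-3")
```

No node has to be asked "who owns this file?", so the lookup itself cannot become the bottleneck, and adding a worker leaves most assignments untouched.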

metadata

Metadata provides information about the attributes of the data, such as its size, type, creation date, access permissions, and other properties. For example, in a file system, the metadata for a file might include its name, size, location, and the timestamps of its creation and last modification. Efficient metadata management is crucial in distributed file systems, because metadata access often becomes a performance bottleneck, especially with the large numbers of small files that are common in AI workloads.
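To make the file-system case concrete, here is a small sketch that reads a file's metadata with Python's standard `os.stat`, without touching the file contents. The `describe` helper and the `sample.txt` file are made up for the example; creation time is left out because it is not reported consistently across platforms.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def describe(path: str) -> dict:
    """Return a few common metadata fields for a file, without reading its contents."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": st.st_size,
        "is_regular_file": stat.S_ISREG(st.st_mode),
        "permissions": stat.filemode(st.st_mode),
        "modified_utc": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as f:
        f.write(b"hello metadata")
    info = describe(sample)

print(info)
```

Each such lookup is cheap on a local disk, but a training job that opens millions of small files issues millions of them, which is why the metadata service matters so much in a distributed file system.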

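The distributed caching feature of Alluxio Enterprise AI lets AI engines read and write data through the high-performance Alluxio cache rather than through slow data lake storage. Alluxio calls this architecture DORA (Decentralized Object Repository Architecture).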
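The general idea of putting a cache in front of slow storage can be sketched as a read-through LRU cache. This is a single-process toy covering only the read path, not Alluxio's API; `SlowStore`, `ReadThroughCache`, the shard names, and the artificial latency are all assumptions made for the illustration.

```python
import time
from collections import OrderedDict


class SlowStore:
    """Stand-in for remote data lake storage (hypothetical; adds artificial latency)."""

    def __init__(self, objects, latency_s=0.01):
        self._objects = dict(objects)
        self._latency_s = latency_s
        self.reads = 0

    def get(self, key):
        self.reads += 1
        time.sleep(self._latency_s)  # simulate a slow remote read
        return self._objects[key]


class ReadThroughCache:
    """LRU cache that fetches from the backing store only on a miss."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._store.get(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value


store = SlowStore({f"shard-{i}": bytes([i]) * 4 for i in range(4)})
cache = ReadThroughCache(store, capacity=4)

# Two "epochs" over the same shards: only the first epoch touches the slow store.
for _epoch in range(2):
    for i in range(4):
        cache.get(f"shard-{i}")

print(f"store reads={store.reads}, cache hits={cache.hits}, misses={cache.misses}")
```

Training jobs usually read the same dataset once per epoch, so repeated reads are the common case and a cache like this pays off after the first pass, as long as the working set fits in the cache.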

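Data Lake

A data lake is a data storage architecture typically used to hold large volumes of raw data in many types and formats, without requiring the data to be preprocessed or structured first. It aims to give data scientists, analysts, and application developers a centralized repository from which they can analyze and process data however they need.
Some key characteristics of a data lake:

  1. Raw data storage: A data lake usually stores raw, unprocessed data, including structured data (such as database tables), semi-structured data (such as log files or JSON files), and unstructured data (such as image, audio, and text files).
  2. Flexibility: A data lake supports a variety of data processing tools and frameworks, so users can pick the tools and methods that fit their needs.
  3. Scalability: A data lake is usually distributed, can store data at large scale, and can be expanded easily to keep up with growing data volumes.
  4. Low cost: A data lake is usually built on low-cost storage infrastructure, because it does not require the data to be preprocessed or structured.
  5. Data access control: A data lake usually provides data security and access control features to make sure sensitive data is properly protected.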
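The "store raw, interpret later" idea from the list above is often called schema-on-read. A minimal sketch, assuming a local temporary directory stands in for the lake; the file names, their contents, and the `read_records` helper are made up for the example.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_records(path: Path):
    """Interpret a raw file at read time, based on its extension."""
    suffix = path.suffix.lower()
    if suffix == ".csv":  # structured
        with path.open(newline="") as f:
            return list(csv.DictReader(f))
    if suffix == ".jsonl":  # semi-structured
        lines = path.read_text().splitlines()
        return [json.loads(line) for line in lines if line.strip()]
    # Unstructured: keep the bytes opaque and report only the size.
    return [{"file": path.name, "size_bytes": path.stat().st_size}]


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    # Files land in the lake as-is, with no upfront schema or conversion.
    (lake / "users.csv").write_text("id,name\n1,alice\n2,bob\n")
    (lake / "events.jsonl").write_text('{"user": 1, "action": "login"}\n{"user": 2}\n')
    (lake / "photo.bin").write_bytes(b"\x89PNG\r\n")

    catalog = {p.name: read_records(p) for p in sorted(lake.iterdir())}

for name, records in catalog.items():
    print(name, records)
```

Writing is cheap because nothing is validated or converted on the way in; the cost moves to the reader, who has to cope with missing fields such as the absent `action` in the second event.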

Further study needed

None yet

Problems encountered

None yet

Motivation, summary, reflections, and rants~~

References

Some of the answers above come from ChatGPT-3.5 and have not been cross-checked for correctness.

Author

Shaojie Tan

Posted on

2023-10-20

Updated on

2025-01-30
