
WIP [DeepSeek R1] Add DeepSeekV3 Base + Weight Conversion #2171

Open
wants to merge 3 commits into base: master

Conversation

DavidLandup0 (Collaborator):

Adds the DeepSeekV3 base model and a weight conversion script.

The architecture itself builds and runs, but requires massive RAM. Below is an example of a one-block model running on a few tokens (~5 s/token):

[screenshot: sample token generation from the one-block model]

Needs more refactoring and simplification.

WIP/TODOs

  • The weight download takes around 880 GB of disk space, and instantiating the Keras model while also holding the torch weights in memory requires massive RAM. Figure out whether the conversion can be done iteratively, shard by shard (see the sketch below).
  • Figure out how Keras weight sharding interacts with this.
  • Move from the ModelArgs dataclass syntax to a config.json-style config.
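
One possible direction for the iterative conversion, as a rough sketch only: stream the torch weights shard by shard from the safetensors index instead of materializing the whole checkpoint at once. The paths, the `port_map` from torch parameter names to Keras variables, and the function name are all placeholders, not the actual conversion script:

```python
# Hypothetical sketch: stream weights shard-by-shard instead of loading the
# full ~880 GB torch checkpoint into memory at once.
import json
import os

from safetensors import safe_open


def convert_checkpoint_iteratively(checkpoint_dir, port_map):
    """Assign torch tensors to Keras variables one shard at a time."""
    index_path = os.path.join(checkpoint_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]  # torch name -> shard file

    # Group torch parameter names by the shard file they live in.
    shards = {}
    for torch_name, shard_file in weight_map.items():
        shards.setdefault(shard_file, []).append(torch_name)

    for shard_file, torch_names in shards.items():
        shard_path = os.path.join(checkpoint_dir, shard_file)
        # safe_open lazily reads the shard, so only the tensors we pull
        # out are materialized in RAM at any one time.
        with safe_open(shard_path, framework="numpy") as f:
            for torch_name in torch_names:
                keras_variable = port_map[torch_name]  # hypothetical mapping
                keras_variable.assign(f.get_tensor(torch_name))
        # Tensors from this shard can be garbage-collected before the next one.
```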



@dataclass
class ModelArgs:
DavidLandup0 (Collaborator, Author) commented:

Comes from the original implementation; it's currently here for sanity checking and will be removed in favor of JSON configs.
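
A rough sketch of what that could look like, assuming the dataclass fields map one-to-one onto config keys (the `DeepSeekV3Backbone` constructor and the config filename below are illustrative, not the final schema):

```python
# Illustrative only: dump the ModelArgs dataclass to a config.json-style dict
# and rebuild from it. DeepSeekV3Backbone is a placeholder name.
import dataclasses
import json


def save_config(args, path="config.json"):
    """Serialize a ModelArgs dataclass to a JSON config file."""
    with open(path, "w") as f:
        json.dump(dataclasses.asdict(args), f, indent=2)


def load_config(path="config.json") -> dict:
    """Load the JSON config back into a plain dict of keyword arguments."""
    with open(path) as f:
        return json.load(f)


# Usage sketch: the backbone would take **config instead of a ModelArgs object.
# config = load_config("config.json")
# backbone = DeepSeekV3Backbone(**config)  # hypothetical constructor
```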

return logits


if __name__ == "__main__":
DavidLandup0 (Collaborator, Author) commented:

Sanity check main call - will be removed.

rank = 0


class Embedding(layers.Layer):
DavidLandup0 (Collaborator, Author) commented:

TODO: Remove custom class and just use layers.Embedding.
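
For reference, a drop-in with the built-in layer would look roughly like this (the sizes and argument names are placeholders, not copied from the PR):

```python
# Rough sketch only: replacing the custom Embedding with keras.layers.Embedding.
# vocab_size and hidden_dim stand in for whatever the config actually calls them.
import numpy as np
from keras import layers

vocab_size, hidden_dim = 32_000, 1_024  # placeholder sizes
token_embedding = layers.Embedding(
    input_dim=vocab_size,
    output_dim=hidden_dim,
    name="token_embedding",
)

token_ids = np.array([[1, 5, 42]])        # (batch, seq_len)
embeddings = token_embedding(token_ids)   # (batch, seq_len, hidden_dim)
```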

return linear(x, self.weight, self.bias)


class ColumnParallelLinear(Linear):
DavidLandup0 (Collaborator, Author) commented:

Do we still need the custom XParallel classes if we don't use torch.dist, which boils most of them back down to the standard implementations?
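
To spell the question out: with no torch.distributed process group the world size is 1, so the column/row splits are no-ops and each parallel variant computes the same thing as a plain dense layer. A self-contained illustration of that equivalence (shapes and names are illustrative, not code from the PR):

```python
# Illustration of the point above: with world_size == 1, a "column parallel"
# projection produces exactly the same output shape as a standard Dense layer.
import numpy as np
from keras import layers

world_size = 1                                  # no torch.dist -> single rank
in_features, out_features = 16, 32
part_out_features = out_features // world_size  # == out_features here

column_parallel = layers.Dense(part_out_features, use_bias=False)
plain_dense = layers.Dense(out_features, use_bias=False)

x = np.random.rand(2, in_features).astype("float32")
assert column_parallel(x).shape == plain_dense(x).shape  # (2, 32) in both cases
```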

@divyashreepathihalli self-requested a review on March 28, 2025.