
WIP [DeepSeek R1] Add DeepSeekV3 Base + Weight Conversion #2171

Open
wants to merge 3 commits into base: master

Conversation

DavidLandup0 (Collaborator):

Adds the DeepSeekV3 base model and a weight conversion script.

The architecture itself builds and runs, but requires massive RAM. Below is an example of a one-block model running on a few tokens (~5 s/token):

[screenshot: sample token generation from the one-block model]

Needs more refactoring and simplification.

WIP/TODOs

  • The weight download takes around 880 GB of disk space, and instantiating the Keras model while also holding the torch weights in memory requires massive RAM. Figure out whether the conversion can be done iteratively, shard by shard (see the sketch below).
  • Figure out how Keras weight sharding interacts with this.
  • Move from the ModelArgs dataclass syntax to a config.json-style config.
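
One possible direction for the iterative conversion, as a rough sketch only: stream the torch weights shard by shard from the safetensors index instead of materializing the whole checkpoint at once. The paths, the `port_map` from torch parameter names to Keras variables, and the function name are all placeholders, not the actual conversion script:

```python
# Hypothetical sketch: stream weights shard-by-shard instead of loading the
# full ~880 GB torch checkpoint into memory at once.
import json
import os

from safetensors import safe_open


def convert_checkpoint_iteratively(checkpoint_dir, port_map):
    """Assign torch tensors to Keras variables one shard at a time."""
    index_path = os.path.join(checkpoint_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]  # torch name -> shard file

    # Group torch parameter names by the shard file they live in.
    shards = {}
    for torch_name, shard_file in weight_map.items():
        shards.setdefault(shard_file, []).append(torch_name)

    for shard_file, torch_names in shards.items():
        shard_path = os.path.join(checkpoint_dir, shard_file)
        # safe_open lazily reads the shard, so only the tensors we pull
        # out are materialized in RAM at any one time.
        with safe_open(shard_path, framework="numpy") as f:
            for torch_name in torch_names:
                keras_variable = port_map[torch_name]  # hypothetical mapping
                keras_variable.assign(f.get_tensor(torch_name))
        # Tensors from this shard can be garbage-collected before the next one.
```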



@dataclass
class ModelArgs:
DavidLandup0 (Collaborator, Author) commented:

Comes from the original implementation; it's currently here for sanity checking and will be removed in favor of JSON configs.
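
A rough sketch of what that could look like, assuming the dataclass fields map one-to-one onto config keys (the `DeepSeekV3Backbone` constructor and the config filename below are illustrative, not the final schema):

```python
# Illustrative only: dump the ModelArgs dataclass to a config.json-style dict
# and rebuild from it. DeepSeekV3Backbone is a placeholder name.
import dataclasses
import json


def save_config(args, path="config.json"):
    """Serialize a ModelArgs dataclass to a JSON config file."""
    with open(path, "w") as f:
        json.dump(dataclasses.asdict(args), f, indent=2)


def load_config(path="config.json") -> dict:
    """Load the JSON config back into a plain dict of keyword arguments."""
    with open(path) as f:
        return json.load(f)


# Usage sketch: the backbone would take **config instead of a ModelArgs object.
# config = load_config("config.json")
# backbone = DeepSeekV3Backbone(**config)  # hypothetical constructor
```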

return logits


if __name__ == "__main__":
DavidLandup0 (Collaborator, Author) commented:

Sanity check main call - will be removed.

rank = 0


class Embedding(layers.Layer):
DavidLandup0 (Collaborator, Author) commented:

TODO: Remove custom class and just use layers.Embedding.
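
For reference, a drop-in with the built-in layer would look roughly like this (the sizes and argument names are placeholders, not copied from the PR):

```python
# Rough sketch only: replacing the custom Embedding with keras.layers.Embedding.
# vocab_size and hidden_dim stand in for whatever the config actually calls them.
import numpy as np
from keras import layers

vocab_size, hidden_dim = 32_000, 1_024  # placeholder sizes
token_embedding = layers.Embedding(
    input_dim=vocab_size,
    output_dim=hidden_dim,
    name="token_embedding",
)

token_ids = np.array([[1, 5, 42]])        # (batch, seq_len)
embeddings = token_embedding(token_ids)   # (batch, seq_len, hidden_dim)
```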

return linear(x, self.weight, self.bias)


class ColumnParallelLinear(Linear):
DavidLandup0 (Collaborator, Author) commented:

Do we still need the custom XParallel classes if we don't use torch.dist, which boils most of them back down to the standard implementations?
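
To spell the question out: with no torch.distributed process group the world size is 1, so the column/row splits are no-ops and each parallel variant computes the same thing as a plain dense layer. A self-contained illustration of that equivalence (shapes and names are illustrative, not code from the PR):

```python
# Illustration of the point above: with world_size == 1, a "column parallel"
# projection produces exactly the same output shape as a standard Dense layer.
import numpy as np
from keras import layers

world_size = 1                                  # no torch.dist -> single rank
in_features, out_features = 16, 32
part_out_features = out_features // world_size  # == out_features here

column_parallel = layers.Dense(part_out_features, use_bias=False)
plain_dense = layers.Dense(out_features, use_bias=False)

x = np.random.rand(2, in_features).astype("float32")
assert column_parallel(x).shape == plain_dense(x).shape  # (2, 32) in both cases
```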

@divyashreepathihalli self-requested a review on March 28, 2025.