Critical Vulnerability in NVIDIA Merlin Transformers4Rec Library Enables Remote Code Execution with Root Privileges

A significant security flaw has been identified in NVIDIA’s Merlin Transformers4Rec library, designated as CVE-2025-23298. This vulnerability allows unauthenticated attackers to execute arbitrary code remotely with root-level privileges by exploiting unsafe deserialization processes within the model checkpoint loader. This discovery highlights the ongoing security challenges associated with machine learning (ML) and artificial intelligence (AI) frameworks that utilize Python’s pickle serialization.

Understanding the Vulnerability

The core of this vulnerability lies in the `load_model_trainer_states_from_checkpoint` function within the Merlin Transformers4Rec library. This function calls PyTorch’s `torch.load()` without any safety restrictions, such as the `weights_only=True` parameter. Under the hood, `torch.load()` relies on Python’s pickle module, which can deserialize arbitrary objects. This mechanism becomes a vector for attackers to embed malicious code in crafted checkpoint files: when such a file is loaded, the embedded code executes, potentially leading to full system compromise.

In the vulnerable implementation, the `cloudpickle` library is used to load the model class directly:

```python
model_class = cloudpickle.load(f)
```

This approach grants attackers complete control over the deserialization process. By defining a custom `__reduce__` method, a malicious checkpoint can execute arbitrary system commands upon loading. For instance, an attacker could use `os.system()` to download and execute a remote script, thereby gaining control over the system.
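The danger of the `__reduce__` mechanism can be demonstrated with plain `pickle`, which `torch.load()` uses internally. In this illustrative sketch (the class name and marker file are hypothetical, and a harmless file write stands in for an attacker's download-and-execute command), deserializing the bytes runs a shell command immediately, before any application code ever inspects the loaded object:

```python
import os
import pickle
import tempfile

# Hypothetical marker file; a real attacker would run a far more damaging command.
MARKER = os.path.join(tempfile.gettempdir(), "pickle_rce_demo.txt")

class MaliciousCheckpoint:
    def __reduce__(self):
        # Instructs pickle to call os.system(...) during deserialization.
        return (os.system, (f"echo compromised > {MARKER}",))

# The "checkpoint" an attacker would publish:
blob = pickle.dumps(MaliciousCheckpoint())

# Merely deserializing it executes the shell command -- no method call needed.
pickle.loads(blob)
print(os.path.exists(MARKER))  # the side effect has already happened
```

Note that the payload fires inside `pickle.loads()` itself; there is no opportunity to validate the object after loading, which is why input validation must happen before or during deserialization.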

Potential Impact and Attack Surface

The implications of this vulnerability are extensive. ML practitioners frequently share pre-trained model checkpoints through public repositories or cloud storage platforms. In many cases, production ML pipelines operate with elevated privileges. Consequently, a successful exploit not only compromises the host system but can also escalate to root-level access, providing attackers with full control over the affected environment.

To illustrate the severity, researchers have demonstrated that loading a maliciously crafted checkpoint via the vulnerable function triggers the embedded shell command before any model weight restoration occurs. This results in immediate remote code execution under the context of the ML service, potentially leading to data breaches, system disruptions, and unauthorized access to sensitive information.

NVIDIA’s Response and Mitigation Measures

In response to this critical vulnerability, NVIDIA has implemented a patch in Pull Request #802. The update replaces the raw pickle calls in `serialization.py` with a custom `load()` function that restricts deserialization to an allow-list of approved classes, mitigating the risk of arbitrary code execution. Additionally, developers are encouraged to pass `weights_only=True` to `torch.load()`, which prevents the deserialization of arbitrary objects.
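The allow-list idea behind the patch can be sketched with the standard library alone (the class list and function names below are illustrative, not NVIDIA's actual implementation). A custom `pickle.Unpickler` overrides `find_class` so that only explicitly approved globals can ever be reconstructed:

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allow-list: (module, qualified name) pairs that may be loaded.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}

class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safe_load(data: bytes):
    """Deserialize, refusing any global not on the allow-list."""
    return AllowlistUnpickler(io.BytesIO(data)).load()

# An approved container round-trips normally:
ok = safe_load(pickle.dumps(OrderedDict(a=1)))

# A malicious payload is rejected before its callable is resolved:
import os
class Evil:
    def __reduce__(self):
        return (os.system, ("echo hacked",))

try:
    safe_load(pickle.dumps(Evil()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print(ok, blocked)
```

With PyTorch itself, the equivalent defense is `torch.load(path, weights_only=True)`, which restricts unpickling to tensors and a small set of approved container types.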

Best Practices for Developers and Organizations

To safeguard against such vulnerabilities, developers and organizations should adhere to the following best practices:

1. Avoid Using Pickle with Untrusted Data: The pickle module should never be used to deserialize untrusted data due to its inherent security risks. Instead, restrict deserialization to known, safe classes.

2. Adopt Safer Serialization Formats: Consider using alternative serialization formats such as Safetensors or ONNX, which offer safer model persistence options and reduce the risk of code execution vulnerabilities.

3. Implement Cryptographic Signing: Enforce cryptographic signing of model files to ensure their integrity and authenticity. This practice helps prevent tampering and unauthorized modifications.

4. Sandbox Deserialization Processes: Isolate deserialization processes within secure environments to limit the potential impact of any malicious code execution.

5. Conduct Regular Security Audits: Include ML pipelines in routine security audits to identify and address potential vulnerabilities proactively.
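Item 3 above, cryptographic signing, can be sketched with Python's standard library. This is a simplified HMAC scheme for illustration only: in practice the key would come from a secrets manager, or an asymmetric signature (e.g., Ed25519) would be used so that checkpoint consumers never hold the signing key.

```python
import hashlib
import hmac

# Hypothetical secret; in production, load from a secrets manager, never hard-code.
SIGNING_KEY = b"replace-with-a-managed-secret"

def sign_checkpoint(data: bytes) -> str:
    """Return a hex HMAC-SHA256 tag to publish alongside the model file."""
    return hmac.new(SIGNING_KEY, data, hashlib.sha256).hexdigest()

def verify_checkpoint(data: bytes, tag: str) -> bool:
    """Constant-time check that the file contents match the published tag."""
    return hmac.compare_digest(sign_checkpoint(data), tag)

checkpoint_bytes = b"\x00fake model weights\x00"  # stand-in for real file contents
tag = sign_checkpoint(checkpoint_bytes)

print(verify_checkpoint(checkpoint_bytes, tag))         # intact file verifies
print(verify_checkpoint(checkpoint_bytes + b"x", tag))  # tampering is detected
```

Verification should happen before the file is ever passed to a deserializer, so that tampered checkpoints are rejected without being parsed.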

Technical Details and Risk Assessment

– Affected Products: NVIDIA Merlin Transformers4Rec versions up to and including v1.5.0.

– Impact: Remote code execution with root privileges.

– Exploit Prerequisites: Loading an attacker-supplied model checkpoint via `torch.load()`.

– CVSS 3.1 Score: 9.8 (Critical)

Broader Implications and Community Recommendations

This vulnerability underscores the need for the ML and AI community to prioritize security-first design principles. The reliance on pickle-based mechanisms poses significant risks, and until such dependencies are eliminated, similar vulnerabilities are likely to persist. Vigilance, robust input validation, and a zero-trust approach are essential to protect production ML systems from supply-chain attacks and remote code execution threats.

Conclusion

The discovery of CVE-2025-23298 in NVIDIA’s Merlin Transformers4Rec library serves as a critical reminder of the security challenges inherent in ML and AI frameworks. By implementing the recommended mitigation measures and adhering to best practices, developers and organizations can enhance the security posture of their ML pipelines and safeguard against potential exploits.