July 18, 2024

NVIDIA Launches Open Synthetic Data Generation Pipeline for Training Large Language Models

NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications in healthcare, finance, manufacturing, retail, and other industries.

High-quality training data plays a critical role in the performance, accuracy, and response quality of a custom LLM, but robust data sets can be prohibitively expensive and difficult to access.

Through a uniquely permissive open model license, Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs.

The Nemotron-4 340B family includes base, instruct, and reward models that form a pipeline for generating the synthetic data used to train and refine LLMs. The models are optimized to work with NVIDIA NeMo, an open-source framework for end-to-end model training, including data curation, customization, and evaluation. They are also optimized for inference with the open-source NVIDIA TensorRT-LLM library.

Nemotron-4 340B can be downloaded now from Hugging Face. Developers will soon be able to access the models at ai.nvidia.com, where they will be packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.

Navigating Nemotron to generate synthetic data

LLMs can help developers generate synthetic training data in scenarios where access to large and diverse labeled data sets is limited.

The Nemotron-4 340B Instruct model creates diverse synthetic data that mimics the characteristics of real-world data, helping to improve data quality and thereby increase the performance and robustness of custom LLMs across domains.

Then, to improve the quality of the AI-generated data, developers can use the Nemotron-4 340B Reward model to filter for high-quality responses. Nemotron-4 340B Reward scores responses on five attributes: helpfulness, correctness, coherence, complexity, and verbosity. It currently ranks first on the Hugging Face RewardBench leaderboard, created by AI2, which evaluates the capabilities, safety, and pitfalls of reward models.

[Diagram: Nemotron synthetic data generation pipeline]
In this synthetic data generation process, (1) the Nemotron-4 340B Instruct model is first used to produce synthetic text-based output. An evaluator model, (2) Nemotron-4 340B Reward, then assesses this generated text, providing feedback that guides iterative improvements and ensures the synthetic data is accurate, relevant, and aligned with specific requirements.
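The generate-then-score loop described above can be sketched as follows. This is a minimal illustration, not NVIDIA's implementation: the two model calls are simulated with placeholder functions, and the attribute names come from the five scoring dimensions listed earlier. A real pipeline would call deployed Nemotron-4 340B Instruct and Reward endpoints instead.

```python
import random

def generate_candidates(prompt, n=4):
    """Stand-in for Nemotron-4 340B Instruct producing n candidate responses."""
    return [f"response {i} to: {prompt}" for i in range(n)]

def score_response(response):
    """Stand-in for Nemotron-4 340B Reward scoring a response on five attributes."""
    attrs = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
    return {a: random.uniform(0.0, 4.0) for a in attrs}

def filter_synthetic_data(prompts, threshold=2.0):
    """Keep only responses whose mean attribute score clears the threshold."""
    kept = []
    for prompt in prompts:
        for response in generate_candidates(prompt):
            scores = score_response(response)
            mean_score = sum(scores.values()) / len(scores)
            if mean_score >= threshold:
                kept.append({"prompt": prompt,
                             "response": response,
                             "score": mean_score})
    return kept

data = filter_synthetic_data(["Explain tensor parallelism."])
```

The surviving prompt-response pairs would then feed back into training, closing the iterative-improvement loop the diagram depicts.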

Researchers can also create their own instruct or reward models by customizing the Nemotron-4 340B Base model with their own proprietary data, combined with the included HelpSteer2 dataset.

Fine-tuning with NeMo, optimization for inference with TensorRT-LLM

Using open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimize the efficiency of their instruct and reward models for generating synthetic data and scoring responses.

All Nemotron-4 340B models are optimized with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.
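Tensor parallelism as described above can be illustrated with a toy example (NumPy in place of GPUs): a weight matrix is split column-wise into shards, each "device" computes a partial matrix multiply, and the shards are gathered to recover the full result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))       # activations: batch x hidden
W = rng.standard_normal((16, 32))      # full weight matrix

W0, W1 = np.split(W, 2, axis=1)        # shard columns across two "devices"
y0 = x @ W0                            # partial result on device 0
y1 = x @ W1                            # partial result on device 1
y = np.concatenate([y0, y1], axis=1)   # all-gather the shards

assert np.allclose(y, x @ W)           # matches the unsharded computation
```

Because each device holds only a slice of the weights and computes only its slice of the output, models far larger than a single GPU's memory can be served efficiently.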

The Nemotron-4 340B Base model, trained on 9 trillion tokens, can be customized using the NeMo framework to fit specific use cases or domains. This fine-tuning process takes advantage of extensive pretraining data and yields more accurate outputs for specific downstream tasks.

The NeMo framework offers a variety of customization methods, including supervised fine-tuning and parameter-efficient techniques such as low-rank adaptation, or LoRA.
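The idea behind LoRA can be shown in a few lines (a NumPy sketch of the math, not the NeMo API): the pretrained weight matrix W stays frozen, and only two small low-rank factors A and B are trained, so the effective weight becomes W + (alpha / r) * B @ A.

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weights
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-initialized: no change at start

def lora_forward(x):
    """Forward pass with the low-rank update folded into the frozen weights."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((4, d_in))
# With B still zero, the adapted model matches the frozen base model exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only r * (d_in + d_out) = 1,024 parameters are trainable here versus 4,096 in the full matrix, which is why LoRA makes tuning very large models tractable.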

To improve model quality, developers can align their models with NeMo Aligner and datasets annotated by Nemotron-4 340B Reward. Alignment is a key step in LLM training, in which a model's behavior is fine-tuned using algorithms such as reinforcement learning from human feedback (RLHF) to ensure its outputs are safe, accurate, contextually appropriate, and consistent with intended goals.

Companies looking for enterprise-grade support and security for production environments can also access NeMo and TensorRT-LLM through the cloud-native NVIDIA AI Enterprise software platform, which provides accelerated and efficient runtimes for generative AI foundation models.

Evaluating model safety and getting started

The Nemotron-4 340B Instruct model underwent extensive safety evaluation, including adversarial testing, and performed well across a wide range of risk indicators. Users should still carefully evaluate the model's outputs to ensure the synthetically generated data is suitable, safe, and accurate for their use case.

For more information about model safety and protection evaluation, please read the model card.

Download Nemotron-4 340B models from Hugging Face. For more details, read the research papers on the model and the dataset.

See notice regarding software product information.
