In machine learning, a neural scaling law is a scaling law relating parameters of a family of neural networks. In general, a neural model can be characterized by 4 parameters: size of the model, size of the training dataset, cost of training, performance after training. Each of these four variables can be precisely defined into a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws". These are usually written as (number of parameters, dataset size, computing cost, loss). In most cases, the size of the model is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-expert models. In sparse models, during every inference, only a fraction of the parameters are used. In comparison, most other kinds of neural networks, such as Transformer networks, always use all their parameters during every inference. The size of the training dataset is usually quantified by the number of data points it contains. Larger training datasets are typically preferred as they provide a richer and more diverse source of information for the model to learn from. This in turn can lead to improved generalization performance when the model is applied to unseen data. However, increasing the size of the training dataset also increases the computational resources and time required for model training. With the "pretrain, then finetune" method used in most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes would have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of pretraining dataset. In some cases, a small amount of high quality data suffices for finetuning, and more data does not improve performance. The cost of training is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required to train the model).