|
78 | 78 | # Embeddings are trained in RecSys through the following process: |
79 | 79 | # |
80 | 80 | # * **Input/lookup indices are fed into the model, as unique IDs**. IDs are |
81 | | -# hashed to the total size of the embedding table to prevent issues when |
82 | | -# the ID > number of rows |
| 81 | +# hashed to the total size of the embedding table to prevent issues when |
| 82 | +# the ID > number of rows |
83 | 83 | # |
84 | 84 | # * Embeddings are then retrieved and **pooled, such as taking the sum or |
85 | 85 | # mean of the embeddings**. This is required as there can be a variable number of |
|
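The hashing and pooling steps described in the bullets above can be illustrated with a minimal sketch using plain `torch.nn.EmbeddingBag` rather than TorchRec's own modules; the table size, embedding dimension, and raw IDs below are made-up values, not taken from the tutorial.

```python
import torch
import torch.nn as nn

# Assumed, illustrative sizes (not from the tutorial).
NUM_EMBEDDINGS, EMBEDDING_DIM = 1000, 16
table = nn.EmbeddingBag(NUM_EMBEDDINGS, EMBEDDING_DIM, mode="sum")

# Raw IDs may exceed the number of rows, so hash them into range first.
raw_ids = torch.tensor([7, 1_000_003, 42, 99_999_999])
hashed_ids = raw_ids % NUM_EMBEDDINGS

# Two "bags" with a variable number of IDs each ([7, 1_000_003] and
# [42, 99_999_999]); offsets mark where each bag starts in the flat tensor.
offsets = torch.tensor([0, 2])

# Lookup + sum pooling produces one fixed-size vector per bag.
pooled = table(hashed_ids, offsets)
print(pooled.shape)  # torch.Size([2, 16])
```

Modulo hashing is only the simplest possible scheme; the point is that any raw ID maps to a valid row, and that pooling turns a variable-length list of IDs into a fixed-size vector.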
220 | 220 | # ------------------------------ |
221 | 221 | # |
222 | 222 | # This section goes over TorchRec Modules and data types including such |
223 | | -# entities as ``EmbeddingCollection``and ``EmbeddingBagCollection``, |
| 223 | +# entities as ``EmbeddingCollection`` and ``EmbeddingBagCollection``, |
224 | 224 | # ``JaggedTensor``, ``KeyedJaggedTensor``, ``KeyedTensor`` and more. |
225 | 225 | # |
226 | 226 | # From ``EmbeddingBag`` to ``EmbeddingBagCollection`` |
@@ -918,17 +918,18 @@ def _wait_impl(self) -> torch.Tensor: |
918 | 918 | # very sensitive to **performance and size of the model**. Running just |
919 | 919 | # the trained model in a Python environment is incredibly inefficient. |
920 | 920 | # There are two key differences between inference and training |
921 | | -# environments: \* **Quantization**: Inference models are typically |
922 | | -# quantized, where model parameters lose precision for lower latency in |
923 | | -# predictions and reduced model size. For example FP32 (4 bytes) in |
924 | | -# trained model to INT8 (1 byte) for each embedding weight. This is also |
925 | | -# necessary given the vast scale of embedding tables, as we want to use as |
926 | | -# few devices as possible for inference to minimize latency. |
| 921 | +# environments: |
| 922 | +# * **Quantization**: Inference models are typically |
| 923 | +# quantized, where model parameters lose precision for lower latency in |
| 924 | +# predictions and reduced model size. For example, each embedding weight may |
| 925 | +# go from FP32 (4 bytes) in the trained model to INT8 (1 byte). This is also |
| 926 | +# necessary given the vast scale of embedding tables, as we want to use as |
| 927 | +# few devices as possible for inference to minimize latency. |
927 | 928 | # |
928 | 929 | # * **C++ environment**: Inference latency is very important, so in order to ensure |
929 | | -# ample performance, the model is typically ran in a C++ environment, |
930 | | -# along with the situations where we don't have a Python runtime, like on |
931 | | -# device. |
| 930 | +# ample performance, the model is typically run in a C++ environment, |
| 931 | +# as well as in situations where there is no Python runtime, such as |
| 932 | +# on-device deployment. |
932 | 933 | # |
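As a rough illustration of the FP32-to-INT8 savings mentioned in the quantization bullet above, the sketch below quantizes a made-up embedding weight matrix with plain `torch.quantize_per_tensor`; this is not TorchRec's inference quantization path, and the table shape and scale choice are assumptions for illustration only.

```python
import torch

# Assumed, illustrative table shape (not from the tutorial).
num_embeddings, embedding_dim = 100_000, 128
fp32_weights = torch.randn(num_embeddings, embedding_dim)

# Symmetric per-tensor INT8 quantization: pick a scale so the largest
# magnitude maps near 127.
scale = fp32_weights.abs().max().item() / 127
int8_weights = torch.quantize_per_tensor(
    fp32_weights, scale=scale, zero_point=0, dtype=torch.qint8
)

fp32_bytes = fp32_weights.numel() * fp32_weights.element_size()  # 4 bytes per weight
int8_bytes = int8_weights.numel() * int8_weights.element_size()  # 1 byte per weight
print(f"FP32: {fp32_bytes / 1e6:.1f} MB -> INT8: {int8_bytes / 1e6:.1f} MB "
      f"(~{fp32_bytes / int8_bytes:.0f}x smaller)")
```

TorchRec's inference tooling handles this conversion for you; the point here is only the roughly 4x reduction in storage per embedding weight that the bullet describes.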
933 | 934 | # TorchRec provides primitives for converting a TorchRec model into being |
934 | 935 | # inference ready with: |
|