NumPy uses float64 by default:
>>> import numpy as np
>>> a = np.random.randn(3)
>>> a.dtype
dtype('float64')
To use float32 instead, convert explicitly:
>>> b = np.random.randn(3).astype(np.float32)
>>> b.dtype
dtype('float32')
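As a rough illustration of why the dtype matters, the sketch below compares the memory taken by the same values stored as float64 and as float32. The array size of 1000 is an arbitrary choice made for this example.

```python
import numpy as np

# 1000 random values; the size is arbitrary and only for illustration
a = np.random.randn(1000)      # float64 by default
b = a.astype(np.float32)       # same values, half as many bits each

print(a.itemsize, b.itemsize)  # bytes per element
print(a.nbytes, b.nbytes)      # total bytes held by each array
```

The float32 copy should occupy half the memory of the float64 original.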
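Storing weight data at 16-bit precision takes only half the space needed at 32-bit precision. So the weights are converted to 16-bit floating point only at the point where the trained weights are saved.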
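A minimal sketch of that idea follows, assuming the trained weights live in a dict of float32 arrays. The names W1 and b1, the shapes, and the in-memory buffer standing in for a file are all hypothetical choices for this example.

```python
import io
import numpy as np

# Hypothetical "trained" weights; a real network would supply these.
rng = np.random.default_rng(0)
weights = {
    "W1": rng.standard_normal((4, 3)).astype(np.float32),
    "b1": np.zeros(3, dtype=np.float32),
}

# Cast to float16 only at save time; training itself stays in float32.
half = {name: w.astype(np.float16) for name, w in weights.items()}

# An in-memory buffer stands in for a file on disk in this sketch.
buf = io.BytesIO()
np.savez(buf, **half)

# When loading, cast back to float32 before using the weights again.
buf.seek(0)
loaded = np.load(buf)
restored = {name: loaded[name].astype(np.float32) for name in loaded.files}
```

The restored values differ slightly from the originals, because float16 keeps only about three significant decimal digits.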
Google's TPU supports 8-bit computation.
First, install CUDA: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html
Then verify the installation. The output below shows that the CUDA version is 11.1:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:54:10_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.relgpu_drvr455TC455_06.29190527_0
Next, install the cupy package built for that CUDA version. The versions must match: https://docs.cupy.dev/en/latest/install.html
$ pip install cupy-cuda111
Test cupy:
$ python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cupy as cp
>>> x = cp.arange(6).reshape(2, 3).astype('f')
>>> x
array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)
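Because cupy mirrors much of the NumPy API, the same array code can often run against either module. The sketch below assumes that falling back to NumPy is acceptable when cupy or a usable GPU is missing. The helper name pick_array_module is hypothetical and is not part of cupy.

```python
import numpy as np

def pick_array_module():
    """Return cupy if it imports and a GPU is visible, otherwise numpy."""
    try:
        import cupy as cp
        # Raises if no CUDA device or driver is available
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp
    except Exception:
        pass
    return np

xp = pick_array_module()

# Same calls as in the session above, written against either module
x = xp.arange(6).reshape(2, 3).astype('f')
row_sums = x.sum(axis=1)

# Copy back to host memory when the data lives on the GPU
result = row_sums if xp is np else xp.asnumpy(row_sums)
print(xp.__name__, result)
```

Writing against xp rather than np or cp directly keeps the rest of the code unchanged whichever module is picked.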
