[2025-02-21 20:44:51,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 20:44:51,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 20:44:51,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 20:44:51,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 20:44:51,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 20:44:51,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-02-21 20:44:51,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 02-21 20:45:11 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 20:45:11 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 20:45:11 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 20:45:11 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 20:45:11 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 20:45:11 __init__.py:190] Automatically detected platform cuda.
INFO 02-21 20:45:11 __init__.py:190] Automatically detected platform cuda.
[2025-02-21 20:45:26,419] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 20:45:26,420] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 20:45:26,420] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 20:45:26,421] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 20:45:26,421] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-02-21 20:45:26,421] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 20:45:26,423] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 20:45:26,424] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-21 20:45:27,878] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
[2025-02-21 20:45:27,878] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
[2025-02-21 20:45:27,878] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2025-02-21 20:45:27,982] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
[2025-02-21 20:45:27,983] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` [2025-02-21 20:45:28,131] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7 You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. 
Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` p-phy-ctyun-gz-a800-node-prod-200-104:984037:984037 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:984037 [0] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984037:984037 [0] NCCL INFO cudaDriverVersion 12040 NCCL version 2.21.5+cuda12.4 p-phy-ctyun-gz-a800-node-prod-200-104:984042:984042 [5] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:984038:984038 [1] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:984046:984046 [6] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:984040:984040 [3] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:984041:984041 [4] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:984042:984042 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984038:984038 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984046:984046 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984041:984041 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984040:984040 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984042:984042 [5] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984041:984041 [4] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984040:984040 [3] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984046:984046 [6] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984038:984038 [1] NCCL INFO Bootstrap : Using 
bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Using network IBext_v8 [2025-02-21 20:45:34,371] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 7 You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. 
Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` p-phy-ctyun-gz-a800-node-prod-200-104:984039:984039 [2] NCCL INFO cudaDriverVersion 12040 p-phy-ctyun-gz-a800-node-prod-200-104:984039:984039 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984039:984039 [2] NCCL INFO Bootstrap : Using bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO P2P plugin IBext_v8 
p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so 
p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO P2P plugin IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [RO]; OOB bond0:10.9.200.104<0> p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Using non-device net plugin version 0 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Using network IBext_v8 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO ncclCommInitRank comm 0x55c5fe29c1b0 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 commId 0x9cfcb0e1e75eb1 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO ncclCommInitRank comm 0x55c9fcafa6e0 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 commId 0x9cfcb0e1e75eb1 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO ncclCommInitRank comm 0x55592163e1b0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 commId 0x9cfcb0e1e75eb1 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO ncclCommInitRank comm 0x5654f360c840 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 commId 0x9cfcb0e1e75eb1 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO ncclCommInitRank comm 0x55ec422042d0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 commId 0x9cfcb0e1e75eb1 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO ncclCommInitRank comm 0x55797c1b7150 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0x9cfcb0e1e75eb1 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO ncclCommInitRank comm 0x55e5f0b70590 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 commId 0x9cfcb0e1e75eb1 - Init START p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO NVLS multicast support is not available on dev 3 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO NVLS multicast support is not available on dev 5 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO NVLS multicast support is not available on dev 4 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO NVLS multicast support is not available on dev 0 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO NVLS multicast support is not available on dev 1 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO NVLS multicast support is not available on dev 6 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO NVLS multicast support is not available on dev 2 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO comm 0x55e5f0b70590 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO comm 0x55797c1b7150 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO comm 0x55592163e1b0 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO comm 0x55ec422042d0 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO comm 0x55c5fe29c1b0 rank 4 nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO P2P 
Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO comm 0x55c9fcafa6e0 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO comm 0x5654f360c840 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO P2P Chunksize set to 524288 
p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 
4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] 
NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] 
NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] 
NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] 
NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] 
NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] 
NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO 
Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO Connected all trees 
p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO 16 coll 
channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:984040:992616 [3] NCCL INFO ncclCommInitRank comm 0x5654f360c840 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 commId 0x9cfcb0e1e75eb1 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:984037:992546 [0] NCCL INFO ncclCommInitRank comm 0x55797c1b7150 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 commId 0x9cfcb0e1e75eb1 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984046:992613 [6] NCCL INFO ncclCommInitRank comm 0x55e5f0b70590 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 commId 0x9cfcb0e1e75eb1 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984042:992614 [5] NCCL INFO ncclCommInitRank comm 0x55ec422042d0 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 commId 0x9cfcb0e1e75eb1 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. 
p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:984039:992636 [2] NCCL INFO ncclCommInitRank comm 0x55c9fcafa6e0 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 commId 0x9cfcb0e1e75eb1 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984041:992617 [4] NCCL INFO ncclCommInitRank comm 0x55c5fe29c1b0 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 commId 0x9cfcb0e1e75eb1 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead. p-phy-ctyun-gz-a800-node-prod-200-104:984038:992615 [1] NCCL INFO ncclCommInitRank comm 0x55592163e1b0 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 commId 0x9cfcb0e1e75eb1 - Init COMPLETE [2025-02-21 20:45:38,305] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 730, num_elems = 8.29B Loading checkpoint shards: 0%| | 0/5 [00:00 [2025-02-21 20:46:27,611] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-02-21 20:46:27,612] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 2 [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-02-21 20:46:27,613] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] optimizer_name ............... None [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] optimizer_params ............. None [2025-02-21 20:46:27,614] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-02-21 20:46:27,615] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-02-21 20:46:27,615] [INFO] [config.py:1003:print] pld_params ................... False [2025-02-21 20:46:27,615] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-02-21 20:46:27,615] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-02-21 20:46:27,615] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] train_batch_size ............. 14 [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1 [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False [2025-02-21 20:46:27,618] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-02-21 20:46:27,619] [INFO] [config.py:1003:print] world_size ................... 7 [2025-02-21 20:46:27,619] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-02-21 20:46:27,619] [INFO] [config.py:1003:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-02-21 20:46:27,619] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-02-21 20:46:27,619] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-02-21 20:46:27,619] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
3
[2025-02-21 20:46:27,619] [INFO] [config.py:989:print_user_config] json = {
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 14,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false,
    "zero_optimization.reduce_bucket_size": 1.284506e+07,
    "zero_optimization.stage3_param_persistence_threshold": 3.584000e+04,
    "zero_optimization.stage3_prefetch_bucket_size": 1.156055e+07
}
INFO 02-21 20:47:26 config.py:542] This model supports multiple tasks: {'generate', 'score', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 02-21 20:47:26 arg_utils.py:1079] --enable-prefix-caching is currently not supported for multimodal models in v0 and has been disabled.
INFO 02-21 20:47:26 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/vlm/pretrain_model/Qwen2-VL-7B-Instruct', speculative_config=None, tokenizer='/home/vlm/pretrain_model/Qwen2-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda:7, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/vlm/pretrain_model/Qwen2-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 02-21 20:47:29 cuda.py:230] Using Flash Attention backend.
INFO 02-21 20:47:30 model_runner.py:1110] Starting to load model /home/vlm/pretrain_model/Qwen2-VL-7B-Instruct...
INFO 02-21 20:47:31 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248] Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00 32768). Running this sequence through the model will result in indexing errors WARNING 02-21 20:47:58 profiling.py:187] The context length (32768) of the model is too short to hold the multi-modal embeddings in the worst case (49152 tokens in total, out of which {'image': 16384, 'video': 32768} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`. INFO 02-21 20:48:03 worker.py:267] Memory profiling takes 15.85 seconds INFO 02-21 20:48:03 worker.py:267] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.70) = 55.53GiB INFO 02-21 20:48:03 worker.py:267] model weights take 0.00GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 0.00GiB; the rest of the memory reserved for KV Cache is 55.53GiB. INFO 02-21 20:48:03 executor_base.py:110] # CUDA blocks: 64982, # CPU blocks: 4681 INFO 02-21 20:48:03 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 31.73x INFO 02-21 20:48:06 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. 
You can also reduce the `max_num_seqs` as needed to decrease memory usage. Capturing CUDA graph shapes: 0%| | 0/35 [00:005->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO comm 0x7f8f7006f360 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO comm 0x7f56d8070110 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO comm 0x7efb58070490 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO comm 0x7fc91406f7a0 rank 0 nRanks 7 nNodes 1 localRanks 7 
localRank 0 MNNVL 0 p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 
p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO P2P Chunksize set to 524288 p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via 
P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 01/0 : 
5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL 
INFO Channel 06/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 
15/0 : 3[3] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] 
NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Connected all rings p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via 
P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 04/0 : 
1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL 
INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/IPC/read 
p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] 
via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO 16 
coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/IPC/read p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO Connected all trees p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512 p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO 16 coll channels, 16 collnet channels, 0 nvls channels, 16 p2p 
channels, 16 p2p channels per peer p-phy-ctyun-gz-a800-node-prod-200-104:984040:1005856 [3] NCCL INFO ncclCommSplit comm 0x7f8f7006f360 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 59000 parent 0x5654f360c840 color -1326228412 key 3 commId 0x32885bee1eab349 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984038:1005853 [1] NCCL INFO ncclCommSplit comm 0x7efb58070490 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2d000 parent 0x55592163e1b0 color -1326228412 key 1 commId 0x32885bee1eab349 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984046:1005854 [6] NCCL INFO ncclCommSplit comm 0x7fc1f40707d0 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId bf000 parent 0x55e5f0b70590 color -1326228412 key 6 commId 0x32885bee1eab349 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984041:1005851 [4] NCCL INFO ncclCommSplit comm 0x7f2ce4070610 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 8d000 parent 0x55c5fe29c1b0 color -1326228412 key 4 commId 0x32885bee1eab349 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984037:1005855 [0] NCCL INFO ncclCommSplit comm 0x7fc91406f7a0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 27000 parent 0x55797c1b7150 color -1326228412 key 0 commId 0x32885bee1eab349 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984039:1005852 [2] NCCL INFO ncclCommSplit comm 0x7f56d8070110 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 54000 parent 0x55c9fcafa6e0 color -1326228412 key 2 commId 0x32885bee1eab349 - Init COMPLETE p-phy-ctyun-gz-a800-node-prod-200-104:984042:1005857 [5] NCCL INFO ncclCommSplit comm 0x7f9c0006f480 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 92000 parent 0x55ec422042d0 color -1326228412 key 5 commId 0x32885bee1eab349 - Init COMPLETE 0%| | 1/1610 [00:33<15:10:06, 33.94s/it] {'loss': -0.0, 'grad_norm': 1.9835944476886183, 'learning_rate': 9.993788819875776e-07, 'completion_length': 177.9464340209961, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.7500000596046448, 'reward': 1.1428571939468384, 'reward_std': 
0.5743394494056702, 'kl': 0.0, 'epoch': 0.0} 0%| | 1/1610 [00:33<15:10:06, 33.94s/it] 0%| | 2/1610 [00:56<12:08:51, 27.20s/it] {'loss': 0.0, 'grad_norm': 2.037886881407789, 'learning_rate': 9.987577639751552e-07, 'completion_length': 179.98214721679688, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 0.7500000596046448, 'reward': 1.0714285969734192, 'reward_std': 0.5914213359355927, 'kl': 0.00018835067749023438, 'epoch': 0.01} 0%| | 2/1610 [00:56<12:08:51, 27.20s/it] 0%| | 3/1610 [01:15<10:35:08, 23.71s/it] {'loss': 0.0, 'grad_norm': 2.141410730750218, 'learning_rate': 9.981366459627329e-07, 'completion_length': 161.75000381469727, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.2500000596046448, 'reward_std': 0.46981075406074524, 'kl': 0.000659942626953125, 'epoch': 0.01} 0%| | 3/1610 [01:16<10:35:08, 23.71s/it] 0%| | 4/1610 [07:40<74:06:16, 166.11s/it] {'loss': 0.0, 'grad_norm': 4.439634883322468, 'learning_rate': 9.975155279503105e-07, 'completion_length': 109.9464340209961, 'rewards/accuracy_reward': 0.2321428656578064, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.160714328289032, 'reward_std': 0.3193712383508682, 'kl': 0.0010051727294921875, 'epoch': 0.01} 0%| | 4/1610 [07:40<74:06:16, 166.11s/it] 0%| | 5/1610 [07:58<50:12:12, 112.61s/it] {'loss': 0.0, 'grad_norm': 1.8915437656233771, 'learning_rate': 9.968944099378881e-07, 'completion_length': 136.55357360839844, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3750000596046448, 'reward_std': 0.3606105446815491, 'kl': 0.0009822845458984375, 'epoch': 0.02} 0%| | 5/1610 [07:58<50:12:12, 112.61s/it] 0%| | 6/1610 [08:16<35:58:02, 80.72s/it] {'loss': 0.0001, 'grad_norm': 1.789714712182082, 'learning_rate': 9.962732919254658e-07, 'completion_length': 152.73214721679688, 'rewards/accuracy_reward': 0.3035714328289032, 'rewards/format_reward': 1.0, 'reward': 
1.3035714626312256, 'reward_std': 0.29123930633068085, 'kl': 0.002899169921875, 'epoch': 0.02} 0%| | 6/1610 [08:16<35:58:02, 80.72s/it] 0%| | 7/1610 [08:36<27:05:33, 60.84s/it] {'loss': 0.0001, 'grad_norm': 2.0398610870259617, 'learning_rate': 9.956521739130434e-07, 'completion_length': 169.4464340209961, 'rewards/accuracy_reward': 0.2500000149011612, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.1964285969734192, 'reward_std': 0.30228936672210693, 'kl': 0.0035552978515625, 'epoch': 0.02} 0%| | 7/1610 [08:36<27:05:33, 60.84s/it] 0%| | 8/1610 [08:54<20:58:09, 47.12s/it] {'loss': 0.0001, 'grad_norm': 2.7672505199764443, 'learning_rate': 9.95031055900621e-07, 'completion_length': 134.12500762939453, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.910714328289032, 'reward': 1.321428656578064, 'reward_std': 0.4119163155555725, 'kl': 0.00308990478515625, 'epoch': 0.02} 0%| | 8/1610 [08:54<20:58:09, 47.12s/it] 1%| | 9/1610 [09:08<16:17:16, 36.62s/it] {'loss': 0.0003, 'grad_norm': 2.558930599074516, 'learning_rate': 9.944099378881986e-07, 'completion_length': 99.39286422729492, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3392857909202576, 'reward_std': 0.2610500454902649, 'kl': 0.0063629150390625, 'epoch': 0.03} 1%| | 9/1610 [09:08<16:17:16, 36.62s/it] 1%| | 10/1610 [09:26<13:45:27, 30.95s/it] {'loss': 0.0003, 'grad_norm': 1.93016537421735, 'learning_rate': 9.937888198757763e-07, 'completion_length': 150.0357208251953, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857909202576, 'reward_std': 0.30228936672210693, 'kl': 0.0065460205078125, 'epoch': 0.03} 1%| | 10/1610 [09:26<13:45:27, 30.95s/it] 1%| | 11/1610 [09:47<12:21:00, 27.81s/it] {'loss': 0.0002, 'grad_norm': 1.4878196295514643, 'learning_rate': 9.93167701863354e-07, 'completion_length': 154.17858123779297, 'rewards/accuracy_reward': 0.3571428656578064, 
'rewards/format_reward': 0.9642857313156128, 'reward': 1.3214285969734192, 'reward_std': 0.24241764843463898, 'kl': 0.006134033203125, 'epoch': 0.03} 1%| | 11/1610 [09:47<12:21:00, 27.81s/it] 1%| | 12/1610 [10:04<10:52:39, 24.51s/it] {'loss': 0.0004, 'grad_norm': 1.888604617942588, 'learning_rate': 9.925465838509315e-07, 'completion_length': 131.9464340209961, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.446428656578064, 'reward_std': 0.2937234044075012, 'kl': 0.00982666015625, 'epoch': 0.04} 1%| | 12/1610 [10:04<10:52:39, 24.51s/it] 1%| | 13/1610 [10:19<9:40:51, 21.82s/it] {'loss': 0.0004, 'grad_norm': 2.5085809567658104, 'learning_rate': 9.919254658385092e-07, 'completion_length': 119.66071701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1539071798324585, 'kl': 0.011138916015625, 'epoch': 0.04} 1%| | 13/1610 [10:19<9:40:51, 21.82s/it] 1%| | 14/1610 [10:35<8:52:43, 20.03s/it] {'loss': 0.0005, 'grad_norm': 9.334173874775587, 'learning_rate': 9.91304347826087e-07, 'completion_length': 136.5714340209961, 'rewards/accuracy_reward': 0.3035714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.1785714365541935, 'kl': 0.0118408203125, 'epoch': 0.04} 1%| | 14/1610 [10:35<8:52:43, 20.03s/it] 1%| | 15/1610 [10:50<8:11:19, 18.48s/it] {'loss': 0.0008, 'grad_norm': 1.405300875678505, 'learning_rate': 9.906832298136647e-07, 'completion_length': 109.10715103149414, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.1539071872830391, 'kl': 0.0191650390625, 'epoch': 0.05} 1%| | 15/1610 [10:50<8:11:19, 18.48s/it] 1%| | 16/1610 [11:03<7:24:44, 16.74s/it] {'loss': 0.0007, 'grad_norm': 2.5468552911455404, 'learning_rate': 9.900621118012423e-07, 'completion_length': 105.8214340209961, 'rewards/accuracy_reward': 0.4285714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.26657506078481674, 'kl': 0.016571044921875, 'epoch': 0.05} 1%| | 16/1610 [11:03<7:24:44, 16.74s/it] 1%| | 17/1610 [11:22<7:42:00, 17.40s/it] {'loss': 0.0008, 'grad_norm': 6.135021064883836, 'learning_rate': 9.8944099378882e-07, 'completion_length': 125.92857360839844, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.285714328289032, 'reward_std': 0.3078143745660782, 'kl': 0.02001953125, 'epoch': 0.05} 1%| | 17/1610 [11:22<7:42:00, 17.40s/it] 1%| | 18/1610 [11:37<7:29:11, 16.93s/it] {'loss': 0.001, 'grad_norm': 1.7582308621222174, 'learning_rate': 9.888198757763976e-07, 'completion_length': 125.5535774230957, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.321428656578064, 'reward_std': 0.26657506078481674, 'kl': 0.02374267578125, 'epoch': 0.06} 1%| | 18/1610 [11:37<7:29:11, 16.93s/it] 1%| | 19/1610 [11:57<7:51:15, 17.77s/it] {'loss': 0.0008, 'grad_norm': 2.4926236288077033, 'learning_rate': 9.881987577639752e-07, 'completion_length': 132.4107208251953, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.32695358991622925, 'kl': 0.02008056640625, 'epoch': 0.06} 1%| | 19/1610 [11:57<7:51:15, 17.77s/it] 1%| | 20/1610 [12:12<7:24:48, 16.79s/it] {'loss': 0.0007, 'grad_norm': 1.3503339630780713, 'learning_rate': 9.875776397515528e-07, 'completion_length': 138.50000762939453, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.26657505333423615, 'kl': 0.018096923828125, 'epoch': 0.06} 1%| | 20/1610 [12:12<7:24:48, 16.79s/it] 1%|▏ | 21/1610 [12:32<7:52:21, 17.84s/it] {'loss': 0.0009, 'grad_norm': 2.2201675887354715, 'learning_rate': 9.869565217391304e-07, 'completion_length': 152.08929061889648, 'rewards/accuracy_reward': 0.3750000149011612, 
'rewards/format_reward': 0.9642857313156128, 'reward': 1.3392857313156128, 'reward_std': 0.3762020468711853, 'kl': 0.0213623046875, 'epoch': 0.07} 1%|▏ | 21/1610 [12:32<7:52:21, 17.84s/it] 1%|▏ | 22/1610 [12:53<8:15:13, 18.71s/it] {'loss': 0.0007, 'grad_norm': 1.7267337133305747, 'learning_rate': 9.86335403726708e-07, 'completion_length': 139.1785774230957, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3214285969734192, 'reward_std': 0.20117834210395813, 'kl': 0.01708984375, 'epoch': 0.07} 1%|▏ | 22/1610 [12:53<8:15:13, 18.71s/it] 1%|▏ | 23/1610 [13:08<7:48:34, 17.72s/it] {'loss': 0.001, 'grad_norm': 2.2770908405873826, 'learning_rate': 9.857142857142857e-07, 'completion_length': 112.71429061889648, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.25552502274513245, 'kl': 0.02410888671875, 'epoch': 0.07} 1%|▏ | 23/1610 [13:08<7:48:34, 17.72s/it] 1%|▏ | 24/1610 [13:25<7:40:08, 17.41s/it] {'loss': 0.0007, 'grad_norm': 7.403333043042662, 'learning_rate': 9.850931677018633e-07, 'completion_length': 179.80358123779297, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.3601037263870239, 'kl': 0.01641845703125, 'epoch': 0.07} 1%|▏ | 24/1610 [13:25<7:40:08, 17.41s/it] 2%|▏ | 25/1610 [13:41<7:32:05, 17.11s/it] {'loss': 0.0008, 'grad_norm': 1.8354109952986744, 'learning_rate': 9.84472049689441e-07, 'completion_length': 126.75000762939453, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.3822338879108429, 'kl': 0.0196533203125, 'epoch': 0.08} 2%|▏ | 25/1610 [13:41<7:32:05, 17.11s/it] 2%|▏ | 26/1610 [13:57<7:23:40, 16.81s/it] {'loss': 0.0007, 'grad_norm': 1.5435875630008662, 'learning_rate': 9.838509316770186e-07, 'completion_length': 116.89286422729492, 'rewards/accuracy_reward': 
0.4107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3928571939468384, 'reward_std': 0.1539071798324585, 'kl': 0.018096923828125, 'epoch': 0.08} 2%|▏ | 26/1610 [13:57<7:23:40, 16.81s/it] 2%|▏ | 27/1610 [14:17<7:48:24, 17.75s/it] {'loss': 0.001, 'grad_norm': 1.4112941025219816, 'learning_rate': 9.832298136645962e-07, 'completion_length': 120.35715103149414, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.446428656578064, 'reward_std': 0.3324786126613617, 'kl': 0.02581787109375, 'epoch': 0.08} 2%|▏ | 27/1610 [14:17<7:48:24, 17.75s/it] 2%|▏ | 28/1610 [14:33<7:29:52, 17.06s/it] {'loss': 0.0009, 'grad_norm': 3.2979781550639165, 'learning_rate': 9.826086956521739e-07, 'completion_length': 100.35714721679688, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.2253357619047165, 'kl': 0.02288818359375, 'epoch': 0.09} 2%|▏ | 28/1610 [14:33<7:29:52, 17.06s/it] 2%|▏ | 29/1610 [14:48<7:16:12, 16.55s/it] {'loss': 0.0009, 'grad_norm': 1.959123948179761, 'learning_rate': 9.819875776397515e-07, 'completion_length': 105.96429061889648, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.18409645557403564, 'kl': 0.02178955078125, 'epoch': 0.09} 2%|▏ | 29/1610 [14:48<7:16:12, 16.55s/it] 2%|▏ | 30/1610 [15:06<7:28:52, 17.05s/it] {'loss': 0.001, 'grad_norm': 2.3642777521040554, 'learning_rate': 9.813664596273291e-07, 'completion_length': 124.39286041259766, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.24794264882802963, 'kl': 0.0242919921875, 'epoch': 0.09} 2%|▏ | 30/1610 [15:06<7:28:52, 17.05s/it] 2%|▏ | 31/1610 [15:20<6:59:00, 15.92s/it] {'loss': 0.0008, 'grad_norm': 4.164741966268985, 'learning_rate': 9.807453416149068e-07, 'completion_length': 
98.1964340209961, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.32695358991622925, 'kl': 0.01971435546875, 'epoch': 0.1} 2%|▏ | 31/1610 [15:20<6:59:00, 15.92s/it] 2%|▏ | 32/1610 [15:34<6:49:53, 15.59s/it] {'loss': 0.0008, 'grad_norm': 2.0823084757754824, 'learning_rate': 9.801242236024844e-07, 'completion_length': 114.12500381469727, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.33800364285707474, 'kl': 0.0208740234375, 'epoch': 0.1} 2%|▏ | 32/1610 [15:34<6:49:53, 15.59s/it] 2%|▏ | 33/1610 [15:54<7:23:18, 16.87s/it] {'loss': 0.0008, 'grad_norm': 1.8283944103426535, 'learning_rate': 9.79503105590062e-07, 'completion_length': 127.91072082519531, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571428656578064, 'reward_std': 0.31633032858371735, 'kl': 0.0191650390625, 'epoch': 0.1} 2%|▏ | 33/1610 [15:54<7:23:18, 16.87s/it] 2%|▏ | 34/1610 [16:10<7:15:20, 16.57s/it] {'loss': 0.0007, 'grad_norm': 2.257990032928918, 'learning_rate': 9.788819875776397e-07, 'completion_length': 113.44643020629883, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285715222358704, 'reward_std': 0.2142857238650322, 'kl': 0.018310546875, 'epoch': 0.11} 2%|▏ | 34/1610 [16:10<7:15:20, 16.57s/it] 2%|▏ | 35/1610 [16:26<7:06:49, 16.26s/it] {'loss': 0.0008, 'grad_norm': 1.7492510734277473, 'learning_rate': 9.782608695652173e-07, 'completion_length': 126.91071701049805, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.23086076974868774, 'kl': 0.0205078125, 'epoch': 0.11} 2%|▏ | 35/1610 [16:26<7:06:49, 16.26s/it] 2%|▏ | 36/1610 [16:39<6:41:19, 15.30s/it] {'loss': 0.0008, 'grad_norm': 2.406751753932597, 'learning_rate': 9.77639751552795e-07, 'completion_length': 
128.73214721679688, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.21981073170900345, 'kl': 0.0211181640625, 'epoch': 0.11} 2%|▏ | 36/1610 [16:39<6:41:19, 15.30s/it] 2%|▏ | 37/1610 [16:56<7:00:20, 16.03s/it] {'loss': 0.0008, 'grad_norm': 2.5559111411940023, 'learning_rate': 9.770186335403726e-07, 'completion_length': 120.98215103149414, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285715222358704, 'reward_std': 0.313846230506897, 'kl': 0.02093505859375, 'epoch': 0.11} 2%|▏ | 37/1610 [16:56<7:00:20, 16.03s/it] 2%|▏ | 38/1610 [17:11<6:45:40, 15.48s/it] {'loss': 0.0007, 'grad_norm': 2.5960815611389454, 'learning_rate': 9.763975155279502e-07, 'completion_length': 122.85715103149414, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.26657508313655853, 'kl': 0.0184326171875, 'epoch': 0.12} 2%|▏ | 38/1610 [17:11<6:45:40, 15.48s/it] 2%|▏ | 39/1610 [17:29<7:07:44, 16.34s/it] {'loss': 0.0011, 'grad_norm': 3.093219951875275, 'learning_rate': 9.757763975155278e-07, 'completion_length': 119.14286422729492, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3750000596046448, 'reward_std': 0.41242313385009766, 'kl': 0.02740478515625, 'epoch': 0.12} 2%|▏ | 39/1610 [17:29<7:07:44, 16.34s/it] 2%|▏ | 40/1610 [17:44<6:58:19, 15.99s/it] {'loss': 0.0008, 'grad_norm': 1.7545869525928355, 'learning_rate': 9.751552795031055e-07, 'completion_length': 98.5535774230957, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.2253357470035553, 'kl': 0.01953125, 'epoch': 0.12} 2%|▏ | 40/1610 [17:44<6:58:19, 15.99s/it] 3%|▎ | 41/1610 [18:00<7:00:36, 16.08s/it] {'loss': 0.001, 'grad_norm': 1.5725460540013483, 'learning_rate': 9.745341614906833e-07, 
'completion_length': 119.08929061889648, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.2857142984867096, 'kl': 0.02508544921875, 'epoch': 0.13} 3%|▎ | 41/1610 [18:00<7:00:36, 16.08s/it] 3%|▎ | 42/1610 [18:17<7:01:07, 16.11s/it] {'loss': 0.001, 'grad_norm': 2.370513719641832, 'learning_rate': 9.73913043478261e-07, 'completion_length': 129.00000381469727, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.31333939731121063, 'kl': 0.0245361328125, 'epoch': 0.13} 3%|▎ | 42/1610 [18:17<7:01:07, 16.11s/it] 3%|▎ | 43/1610 [18:32<6:53:38, 15.84s/it] {'loss': 0.0008, 'grad_norm': 3.1721344979436434, 'learning_rate': 9.732919254658386e-07, 'completion_length': 103.37500381469727, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.379242941737175, 'kl': 0.01971435546875, 'epoch': 0.13} 3%|▎ | 43/1610 [18:32<6:53:38, 15.84s/it] 3%|▎ | 44/1610 [18:50<7:08:37, 16.42s/it] {'loss': 0.0009, 'grad_norm': 2.1237862377249628, 'learning_rate': 9.726708074534162e-07, 'completion_length': 141.30357360839844, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.410714328289032, 'reward_std': 0.2610500454902649, 'kl': 0.02374267578125, 'epoch': 0.14} 3%|▎ | 44/1610 [18:50<7:08:37, 16.42s/it] 3%|▎ | 45/1610 [19:04<6:49:35, 15.70s/it] {'loss': 0.0008, 'grad_norm': 3.6596105543242015, 'learning_rate': 9.720496894409938e-07, 'completion_length': 120.6785774230957, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.30228933691978455, 'kl': 0.02020263671875, 'epoch': 0.14} 3%|▎ | 45/1610 [19:04<6:49:35, 15.70s/it] 3%|▎ | 46/1610 [19:16<6:24:35, 14.75s/it] {'loss': 0.001, 'grad_norm': 1.8456061943422162, 'learning_rate': 
9.714285714285715e-07, 'completion_length': 120.3214340209961, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857313156128, 'reward_std': 0.1896214708685875, 'kl': 0.02545166015625, 'epoch': 0.14} 3%|▎ | 46/1610 [19:16<6:24:35, 14.75s/it] 3%|▎ | 47/1610 [19:32<6:32:03, 15.05s/it] {'loss': 0.0009, 'grad_norm': 1.8558760066147566, 'learning_rate': 9.708074534161491e-07, 'completion_length': 119.91071701049805, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.4119163304567337, 'kl': 0.021484375, 'epoch': 0.15} 3%|▎ | 47/1610 [19:32<6:32:03, 15.05s/it] 3%|▎ | 48/1610 [19:49<6:44:06, 15.52s/it] {'loss': 0.0008, 'grad_norm': 1.526516874354635, 'learning_rate': 9.701863354037265e-07, 'completion_length': 123.12500381469727, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.1896214634180069, 'kl': 0.01959228515625, 'epoch': 0.15} 3%|▎ | 48/1610 [19:49<6:44:06, 15.52s/it] 3%|▎ | 49/1610 [20:01<6:20:29, 14.62s/it] {'loss': 0.0013, 'grad_norm': 1.4303928938827741, 'learning_rate': 9.695652173913042e-07, 'completion_length': 110.62500381469727, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.2253357544541359, 'kl': 0.03240966796875, 'epoch': 0.15} 3%|▎ | 49/1610 [20:01<6:20:29, 14.62s/it] 3%|▎ | 50/1610 [20:22<7:11:11, 16.58s/it] {'loss': 0.001, 'grad_norm': 3.5131914444511683, 'learning_rate': 9.68944099378882e-07, 'completion_length': 140.21429443359375, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2321429252624512, 'reward_std': 0.37371791899204254, 'kl': 0.02392578125, 'epoch': 0.16} 3%|▎ | 50/1610 [20:22<7:11:11, 16.58s/it] 3%|▎ | 51/1610 [20:37<6:56:49, 16.04s/it] {'loss': 0.001, 'grad_norm': 4.4613164761344954, 'learning_rate': 
9.683229813664596e-07, 'completion_length': 130.23214721679688, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000001192092896, 'reward_std': 0.3078143820166588, 'kl': 0.0247802734375, 'epoch': 0.16} 3%|▎ | 51/1610 [20:37<6:56:49, 16.04s/it] 3%|▎ | 52/1610 [20:51<6:37:12, 15.30s/it] {'loss': 0.0011, 'grad_norm': 2.3114126768645438, 'learning_rate': 9.677018633540373e-07, 'completion_length': 117.50000381469727, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.1428571492433548, 'kl': 0.0286865234375, 'epoch': 0.16} 3%|▎ | 52/1610 [20:51<6:37:12, 15.30s/it] 3%|▎ | 53/1610 [21:05<6:32:04, 15.11s/it] {'loss': 0.0011, 'grad_norm': 1.5934314364808058, 'learning_rate': 9.67080745341615e-07, 'completion_length': 129.58929061889648, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.1539071872830391, 'kl': 0.02789306640625, 'epoch': 0.16} 3%|▎ | 53/1610 [21:05<6:32:04, 15.11s/it] 3%|▎ | 54/1610 [21:19<6:18:26, 14.59s/it] {'loss': 0.0012, 'grad_norm': 2.074611234416378, 'learning_rate': 9.664596273291925e-07, 'completion_length': 126.9285774230957, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.3324785977602005, 'kl': 0.02923583984375, 'epoch': 0.17} 3%|▎ | 54/1610 [21:19<6:18:26, 14.59s/it] 3%|▎ | 55/1610 [21:34<6:22:25, 14.76s/it] {'loss': 0.001, 'grad_norm': 1.5246415162114049, 'learning_rate': 9.658385093167702e-07, 'completion_length': 128.5535774230957, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.23689261823892593, 'kl': 0.02593994140625, 'epoch': 0.17} 3%|▎ | 55/1610 [21:34<6:22:25, 14.76s/it] 3%|▎ | 56/1610 [21:50<6:30:51, 15.09s/it] {'loss': 0.0013, 'grad_norm': 2.414805477143957, 'learning_rate': 
9.652173913043478e-07, 'completion_length': 119.60715103149414, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.1785714402794838, 'kl': 0.03167724609375, 'epoch': 0.17} 3%|▎ | 56/1610 [21:50<6:30:51, 15.09s/it] 4%|▎ | 57/1610 [22:05<6:31:55, 15.14s/it] {'loss': 0.0012, 'grad_norm': 2.4810290860560205, 'learning_rate': 9.645962732919254e-07, 'completion_length': 123.12500381469727, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6071429252624512, 'reward_std': 0.28819840401411057, 'kl': 0.02874755859375, 'epoch': 0.18} 4%|▎ | 57/1610 [22:05<6:31:55, 15.14s/it] 4%|▎ | 58/1610 [22:21<6:40:27, 15.48s/it] {'loss': 0.0013, 'grad_norm': 8.048109489999325, 'learning_rate': 9.63975155279503e-07, 'completion_length': 121.85714721679688, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857909202576, 'reward_std': 0.3324786275625229, 'kl': 0.0330810546875, 'epoch': 0.18} 4%|▎ | 58/1610 [22:21<6:40:27, 15.48s/it] 4%|▎ | 59/1610 [22:33<6:09:17, 14.29s/it] {'loss': 0.0012, 'grad_norm': 1.7526841735592378, 'learning_rate': 9.633540372670807e-07, 'completion_length': 103.12500381469727, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.25552502274513245, 'kl': 0.02972412109375, 'epoch': 0.18} 4%|▎ | 59/1610 [22:33<6:09:17, 14.29s/it] 4%|▎ | 60/1610 [22:45<5:55:05, 13.75s/it] {'loss': 0.0012, 'grad_norm': 1.9219430489150418, 'learning_rate': 9.627329192546583e-07, 'completion_length': 112.07143783569336, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.19514650478959084, 'kl': 0.02972412109375, 'epoch': 0.19} 4%|▎ | 60/1610 [22:45<5:55:05, 13.75s/it] 4%|▍ | 61/1610 [23:02<6:15:18, 14.54s/it] {'loss': 0.001, 'grad_norm': 1.4883427377888805, 
'learning_rate': 9.62111801242236e-07, 'completion_length': 128.10714721679688, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5357143878936768, 'reward_std': 0.2690591663122177, 'kl': 0.0255126953125, 'epoch': 0.19} 4%|▍ | 61/1610 [23:02<6:15:18, 14.54s/it] 4%|▍ | 62/1610 [23:16<6:15:43, 14.56s/it] {'loss': 0.0015, 'grad_norm': 2.4942924145817607, 'learning_rate': 9.614906832298136e-07, 'completion_length': 129.14286041259766, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1539071798324585, 'kl': 0.03656005859375, 'epoch': 0.19} 4%|▍ | 62/1610 [23:16<6:15:43, 14.56s/it] 4%|▍ | 63/1610 [23:34<6:40:01, 15.51s/it] {'loss': 0.0011, 'grad_norm': 1.3081114175265038, 'learning_rate': 9.608695652173912e-07, 'completion_length': 131.6964340209961, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.24794266000390053, 'kl': 0.0274658203125, 'epoch': 0.2} 4%|▍ | 63/1610 [23:34<6:40:01, 15.51s/it] 4%|▍ | 64/1610 [23:50<6:47:26, 15.81s/it] {'loss': 0.0012, 'grad_norm': 2.1020300798855143, 'learning_rate': 9.602484472049689e-07, 'completion_length': 128.21429443359375, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2678571939468384, 'reward_std': 0.30228933691978455, 'kl': 0.02960205078125, 'epoch': 0.2} 4%|▍ | 64/1610 [23:50<6:47:26, 15.81s/it] 4%|▍ | 65/1610 [24:06<6:42:04, 15.61s/it] {'loss': 0.0014, 'grad_norm': 1.4478889968788813, 'learning_rate': 9.596273291925465e-07, 'completion_length': 123.30357360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1896214783191681, 'kl': 0.03399658203125, 'epoch': 0.2} 4%|▍ | 65/1610 [24:06<6:42:04, 15.61s/it] 4%|▍ | 66/1610 [24:25<7:12:42, 16.82s/it] {'loss': 0.001, 'grad_norm': 
1.395042005939562, 'learning_rate': 9.590062111801241e-07, 'completion_length': 172.42858123779297, 'rewards/accuracy_reward': 0.30357144214212894, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2678571939468384, 'reward_std': 0.23689262568950653, 'kl': 0.02532958984375, 'epoch': 0.2} 4%|▍ | 66/1610 [24:25<7:12:42, 16.82s/it] 4%|▍ | 67/1610 [24:43<7:17:59, 17.03s/it] {'loss': 0.001, 'grad_norm': 1.7807286691185726, 'learning_rate': 9.583850931677018e-07, 'completion_length': 136.1607208251953, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3928571939468384, 'reward_std': 0.2857142984867096, 'kl': 0.0255126953125, 'epoch': 0.21} 4%|▍ | 67/1610 [24:43<7:17:59, 17.03s/it] 4%|▍ | 68/1610 [25:01<7:26:00, 17.35s/it] {'loss': 0.0009, 'grad_norm': 2.610218939883586, 'learning_rate': 9.577639751552796e-07, 'completion_length': 139.10714721679688, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 0.26353414356708527, 'kl': 0.02191162109375, 'epoch': 0.21} 4%|▍ | 68/1610 [25:01<7:26:00, 17.35s/it] 4%|▍ | 69/1610 [25:19<7:34:27, 17.69s/it] {'loss': 0.0009, 'grad_norm': 1.099019301274905, 'learning_rate': 9.571428571428572e-07, 'completion_length': 154.25000762939453, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.21981073915958405, 'kl': 0.02191162109375, 'epoch': 0.21} 4%|▍ | 69/1610 [25:19<7:34:27, 17.69s/it] 4%|▍ | 70/1610 [25:35<7:19:21, 17.12s/it] {'loss': 0.0013, 'grad_norm': 3.669585233730253, 'learning_rate': 9.565217391304349e-07, 'completion_length': 126.4464340209961, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.21981073170900345, 'kl': 0.0333251953125, 'epoch': 0.22} 4%|▍ | 70/1610 [25:35<7:19:21, 17.12s/it] 4%|▍ | 71/1610 [25:51<7:08:57, 16.72s/it] {'loss': 
0.0011, 'grad_norm': 2.0785814343966877, 'learning_rate': 9.559006211180125e-07, 'completion_length': 148.2857208251953, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.3078143671154976, 'kl': 0.0264892578125, 'epoch': 0.22} 4%|▍ | 71/1610 [25:51<7:08:57, 16.72s/it] 4%|▍ | 72/1610 [26:05<6:48:11, 15.92s/it] {'loss': 0.0011, 'grad_norm': 1.3092013586073554, 'learning_rate': 9.5527950310559e-07, 'completion_length': 114.21429061889648, 'rewards/accuracy_reward': 0.1785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.1785714626312256, 'reward_std': 0.19514648616313934, 'kl': 0.02789306640625, 'epoch': 0.22} 4%|▍ | 72/1610 [26:05<6:48:11, 15.92s/it] 5%|▍ | 73/1610 [26:22<6:53:22, 16.14s/it] {'loss': 0.001, 'grad_norm': 1.5908337308227767, 'learning_rate': 9.546583850931676e-07, 'completion_length': 135.41072463989258, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857313156128, 'reward_std': 0.30228935927152634, 'kl': 0.02459716796875, 'epoch': 0.23} 5%|▍ | 73/1610 [26:22<6:53:22, 16.14s/it] 5%|▍ | 74/1610 [26:33<6:14:55, 14.65s/it] {'loss': 0.0008, 'grad_norm': 2.1804802058091415, 'learning_rate': 9.540372670807452e-07, 'completion_length': 94.23214721679688, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2721000760793686, 'kl': 0.0211181640625, 'epoch': 0.23} 5%|▍ | 74/1610 [26:33<6:14:55, 14.65s/it] 5%|▍ | 75/1610 [26:46<6:00:19, 14.08s/it] {'loss': 0.001, 'grad_norm': 2.4310994297953554, 'learning_rate': 9.534161490683229e-07, 'completion_length': 120.4285774230957, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.02459716796875, 'epoch': 0.23} 5%|▍ | 75/1610 [26:46<6:00:19, 14.08s/it] 5%|▍ | 76/1610 [27:02<6:16:21, 14.72s/it] {'loss': 
0.0009, 'grad_norm': 1.617099967992136, 'learning_rate': 9.527950310559006e-07, 'completion_length': 145.50000762939453, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.3324786275625229, 'kl': 0.0233154296875, 'epoch': 0.24} 5%|▍ | 76/1610 [27:02<6:16:21, 14.72s/it] 5%|▍ | 77/1610 [27:14<5:56:48, 13.97s/it] {'loss': 0.0012, 'grad_norm': 1.4283778614249865, 'learning_rate': 9.521739130434783e-07, 'completion_length': 112.87500381469727, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.18409645557403564, 'kl': 0.02960205078125, 'epoch': 0.24} 5%|▍ | 77/1610 [27:14<5:56:48, 13.97s/it] 5%|▍ | 78/1610 [27:27<5:49:50, 13.70s/it] {'loss': 0.001, 'grad_norm': 1.4991873893891443, 'learning_rate': 9.515527950310559e-07, 'completion_length': 129.48215103149414, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.2363857962191105, 'kl': 0.02410888671875, 'epoch': 0.24} 5%|▍ | 78/1610 [27:27<5:49:50, 13.70s/it] 5%|▍ | 79/1610 [27:43<6:04:02, 14.27s/it] {'loss': 0.001, 'grad_norm': 1.4821563207094695, 'learning_rate': 9.509316770186336e-07, 'completion_length': 114.39286422729492, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.2610500380396843, 'kl': 0.0242919921875, 'epoch': 0.25} 5%|▍ | 79/1610 [27:43<6:04:02, 14.27s/it] 5%|▍ | 80/1610 [28:00<6:27:19, 15.19s/it] {'loss': 0.001, 'grad_norm': 1.9118044129827811, 'learning_rate': 9.503105590062112e-07, 'completion_length': 161.10715103149414, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.3404877483844757, 'kl': 0.0240478515625, 'epoch': 0.25} 5%|▍ | 80/1610 [28:00<6:27:19, 15.19s/it] 5%|▌ | 81/1610 [28:15<6:24:57, 15.11s/it] {'loss': 
0.0012, 'grad_norm': 1.040701337925503, 'learning_rate': 9.496894409937888e-07, 'completion_length': 132.01786422729492, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.1649572253227234, 'kl': 0.0308837890625, 'epoch': 0.25} 5%|▌ | 81/1610 [28:15<6:24:57, 15.11s/it] 5%|▌ | 82/1610 [28:31<6:31:38, 15.38s/it] {'loss': 0.0011, 'grad_norm': 0.8551174479407697, 'learning_rate': 9.490683229813665e-07, 'completion_length': 158.94644165039062, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.02752685546875, 'epoch': 0.25} 5%|▌ | 82/1610 [28:31<6:31:38, 15.38s/it] 5%|▌ | 83/1610 [28:51<7:03:53, 16.66s/it] {'loss': 0.0012, 'grad_norm': 1.927301969409563, 'learning_rate': 9.48447204968944e-07, 'completion_length': 175.25000762939453, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.3435286581516266, 'kl': 0.0296630859375, 'epoch': 0.26} 5%|▌ | 83/1610 [28:51<7:03:53, 16.66s/it] 5%|▌ | 84/1610 [29:04<6:42:17, 15.82s/it] {'loss': 0.001, 'grad_norm': 1.6837159036131255, 'learning_rate': 9.478260869565216e-07, 'completion_length': 123.75000381469727, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.23086079210042953, 'kl': 0.026123046875, 'epoch': 0.26} 5%|▌ | 84/1610 [29:04<6:42:17, 15.82s/it] 5%|▌ | 85/1610 [29:23<7:00:31, 16.55s/it] {'loss': 0.001, 'grad_norm': 1.7804525830937628, 'learning_rate': 9.472049689440993e-07, 'completion_length': 154.14286422729492, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3214285969734192, 'reward_std': 0.4179481789469719, 'kl': 0.02618408203125, 'epoch': 0.26} 5%|▌ | 85/1610 [29:23<7:00:31, 16.55s/it] 5%|▌ | 86/1610 [29:40<7:07:41, 16.84s/it] {'loss': 0.0011, 'grad_norm': 
1.2576147898184085, 'learning_rate': 9.46583850931677e-07, 'completion_length': 127.48214721679688, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.1539071798324585, 'kl': 0.028564453125, 'epoch': 0.27} 5%|▌ | 86/1610 [29:40<7:07:41, 16.84s/it] 5%|▌ | 87/1610 [30:02<7:42:12, 18.21s/it] {'loss': 0.0007, 'grad_norm': 1.007390183605123, 'learning_rate': 9.459627329192546e-07, 'completion_length': 188.7678680419922, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178572535514832, 'reward_std': 0.32391267269849777, 'kl': 0.018280029296875, 'epoch': 0.27} 5%|▌ | 87/1610 [30:02<7:42:12, 18.21s/it] 5%|▌ | 88/1610 [30:18<7:30:31, 17.76s/it] {'loss': 0.0012, 'grad_norm': 2.201107121100972, 'learning_rate': 9.453416149068323e-07, 'completion_length': 120.87500762939453, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.2253357619047165, 'kl': 0.02880859375, 'epoch': 0.27} 5%|▌ | 88/1610 [30:18<7:30:31, 17.76s/it] 6%|▌ | 89/1610 [30:34<7:14:39, 17.15s/it] {'loss': 0.0008, 'grad_norm': 2.200392639929192, 'learning_rate': 9.447204968944099e-07, 'completion_length': 124.9464340209961, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.321428656578064, 'reward_std': 0.26657507568597794, 'kl': 0.021240234375, 'epoch': 0.28} 6%|▌ | 89/1610 [30:34<7:14:39, 17.15s/it] 6%|▌ | 90/1610 [30:54<7:35:58, 18.00s/it] {'loss': 0.0008, 'grad_norm': 3.8919920280989064, 'learning_rate': 9.440993788819875e-07, 'completion_length': 164.80358123779297, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3214285969734192, 'reward_std': 0.17098906636238098, 'kl': 0.02008056640625, 'epoch': 0.28} 6%|▌ | 90/1610 [30:54<7:35:58, 18.00s/it] 6%|▌ | 91/1610 [31:10<7:24:03, 17.54s/it] {'loss': 0.0011, 'grad_norm': 
1.7254125208770426, 'learning_rate': 9.434782608695652e-07, 'completion_length': 133.76786041259766, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.23086078464984894, 'kl': 0.02862548828125, 'epoch': 0.28} 6%|▌ | 91/1610 [31:10<7:24:03, 17.54s/it] 6%|▌ | 92/1610 [31:27<7:18:20, 17.33s/it] {'loss': 0.001, 'grad_norm': 1.8534496300240908, 'learning_rate': 9.428571428571428e-07, 'completion_length': 139.19643020629883, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.26657507568597794, 'kl': 0.02490234375, 'epoch': 0.29} 6%|▌ | 92/1610 [31:27<7:18:20, 17.33s/it] 6%|▌ | 93/1610 [31:43<7:04:34, 16.79s/it] {'loss': 0.0012, 'grad_norm': 1.4918715641855143, 'learning_rate': 9.422360248447204e-07, 'completion_length': 121.66072082519531, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.26657505333423615, 'kl': 0.02874755859375, 'epoch': 0.29} 6%|▌ | 93/1610 [31:43<7:04:34, 16.79s/it] 6%|▌ | 94/1610 [31:59<6:56:22, 16.48s/it] {'loss': 0.0024, 'grad_norm': 11.173102432784912, 'learning_rate': 9.41614906832298e-07, 'completion_length': 107.08929061889648, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 'reward_std': 0.379242941737175, 'kl': 0.06060791015625, 'epoch': 0.29} 6%|▌ | 94/1610 [31:59<6:56:22, 16.48s/it] 6%|▌ | 95/1610 [32:17<7:12:52, 17.14s/it] {'loss': 0.0009, 'grad_norm': 1.6098381393770185, 'learning_rate': 9.409937888198758e-07, 'completion_length': 148.4464340209961, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3928571939468384, 'reward_std': 0.3596269488334656, 'kl': 0.02203369140625, 'epoch': 0.3} 6%|▌ | 95/1610 [32:17<7:12:52, 17.14s/it] 6%|▌ | 96/1610 [32:33<7:05:26, 16.86s/it] {'loss': 0.001, 'grad_norm': 
2.2411964576045533, 'learning_rate': 9.403726708074534e-07, 'completion_length': 146.10715103149414, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.24241764098405838, 'kl': 0.02569580078125, 'epoch': 0.3} 6%|▌ | 96/1610 [32:33<7:05:26, 16.86s/it] 6%|▌ | 97/1610 [32:51<7:11:17, 17.10s/it] {'loss': 0.0011, 'grad_norm': 1.8414146831347717, 'learning_rate': 9.39751552795031e-07, 'completion_length': 172.3928680419922, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.26657506823539734, 'kl': 0.02801513671875, 'epoch': 0.3} 6%|▌ | 97/1610 [32:51<7:11:17, 17.10s/it] 6%|▌ | 98/1610 [33:10<7:21:35, 17.52s/it] {'loss': 0.0011, 'grad_norm': 2.5357109424377917, 'learning_rate': 9.391304347826087e-07, 'completion_length': 167.23214721679688, 'rewards/accuracy_reward': 0.2678571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2321429252624512, 'reward_std': 0.21124480664730072, 'kl': 0.0272216796875, 'epoch': 0.3} 6%|▌ | 98/1610 [33:10<7:21:35, 17.52s/it] 6%|▌ | 99/1610 [33:24<6:58:08, 16.60s/it] {'loss': 0.0011, 'grad_norm': 2.231196405453189, 'learning_rate': 9.385093167701863e-07, 'completion_length': 112.3035774230957, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.30228935927152634, 'kl': 0.02655029296875, 'epoch': 0.31} 6%|▌ | 99/1610 [33:24<6:58:08, 16.60s/it] 6%|▌ | 100/1610 [33:44<7:21:00, 17.52s/it] {'loss': 0.001, 'grad_norm': 1.6900361852938919, 'learning_rate': 9.37888198757764e-07, 'completion_length': 163.08929443359375, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.2967643290758133, 'kl': 0.024169921875, 'epoch': 0.31} 6%|▌ | 100/1610 [33:44<7:21:00, 17.52s/it] 6%|▋ | 101/1610 [36:49<28:25:16, 67.80s/it] {'loss': 0.0009, 'grad_norm': 
1.592463459224565, 'learning_rate': 9.372670807453416e-07, 'completion_length': 144.6607208251953, 'rewards/accuracy_reward': 0.3928571715950966, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.20670335739850998, 'kl': 0.02349853515625, 'epoch': 0.31} 6%|▋ | 101/1610 [36:49<28:25:16, 67.80s/it] 6%|▋ | 102/1610 [37:14<23:01:23, 54.96s/it] {'loss': 0.001, 'grad_norm': 1.3153407725525705, 'learning_rate': 9.366459627329192e-07, 'completion_length': 149.76786422729492, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4464285969734192, 'reward_std': 0.2610500380396843, 'kl': 0.02490234375, 'epoch': 0.32} 6%|▋ | 102/1610 [37:14<23:01:23, 54.96s/it] 6%|▋ | 103/1610 [37:47<20:14:04, 48.34s/it] {'loss': 0.0011, 'grad_norm': 2.3701486965321297, 'learning_rate': 9.360248447204968e-07, 'completion_length': 124.41072463989258, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1785714402794838, 'kl': 0.0284423828125, 'epoch': 0.32} 6%|▋ | 103/1610 [37:47<20:14:04, 48.34s/it] 6%|▋ | 104/1610 [38:27<19:13:11, 45.94s/it] {'loss': 0.0011, 'grad_norm': 2.0525355840525994, 'learning_rate': 9.354037267080745e-07, 'completion_length': 135.4464340209961, 'rewards/accuracy_reward': 0.3392857387661934, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.3324786201119423, 'kl': 0.02789306640625, 'epoch': 0.32} 6%|▋ | 104/1610 [38:27<19:13:11, 45.94s/it] 7%|▋ | 105/1610 [39:05<18:10:38, 43.48s/it] {'loss': 0.0011, 'grad_norm': 18.26423789597157, 'learning_rate': 9.347826086956522e-07, 'completion_length': 109.3214340209961, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2253357656300068, 'kl': 0.02655029296875, 'epoch': 0.33} 7%|▋ | 105/1610 [39:05<18:10:38, 43.48s/it] 7%|▋ | 106/1610 [39:48<18:09:05, 43.45s/it] {'loss': 
0.001, 'grad_norm': 2.3912033532059387, 'learning_rate': 9.341614906832299e-07, 'completion_length': 114.53572082519531, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.321428656578064, 'reward_std': 0.3007388412952423, 'kl': 0.02569580078125, 'epoch': 0.33} 7%|▋ | 106/1610 [39:48<18:09:05, 43.45s/it] 7%|▋ | 107/1610 [40:34<18:25:30, 44.13s/it] {'loss': 0.0012, 'grad_norm': 1.7827883062294063, 'learning_rate': 9.335403726708074e-07, 'completion_length': 116.78572082519531, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500454902649, 'kl': 0.03082275390625, 'epoch': 0.33} 7%|▋ | 107/1610 [40:34<18:25:30, 44.13s/it] 7%|▋ | 108/1610 [41:21<18:45:25, 44.96s/it] {'loss': 0.0012, 'grad_norm': 1.2833866389689867, 'learning_rate': 9.32919254658385e-07, 'completion_length': 136.21429443359375, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.2142857238650322, 'kl': 0.0311279296875, 'epoch': 0.34} 7%|▋ | 108/1610 [41:21<18:45:25, 44.96s/it] 7%|▋ | 109/1610 [42:10<19:13:35, 46.11s/it] {'loss': 0.0013, 'grad_norm': 0.9741675149553913, 'learning_rate': 9.322981366459626e-07, 'completion_length': 114.01786041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.14838216453790665, 'kl': 0.03179931640625, 'epoch': 0.34} 7%|▋ | 109/1610 [42:10<19:13:35, 46.11s/it] 7%|▋ | 110/1610 [42:41<17:20:14, 41.61s/it] {'loss': 0.0013, 'grad_norm': 2.499159048768316, 'learning_rate': 9.316770186335403e-07, 'completion_length': 137.2857208251953, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4285714626312256, 'reward_std': 0.40943221747875214, 'kl': 0.03131103515625, 'epoch': 0.34} 7%|▋ | 110/1610 [42:41<17:20:14, 41.61s/it] 7%|▋ | 111/1610 [42:58<14:20:27, 
34.44s/it] {'loss': 0.0012, 'grad_norm': 1.1332832438249656, 'learning_rate': 9.310559006211179e-07, 'completion_length': 98.6785774230957, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.1539071798324585, 'kl': 0.02899169921875, 'epoch': 0.34} 7%|▋ | 111/1610 [42:58<14:20:27, 34.44s/it] 7%|▋ | 112/1610 [43:28<13:45:44, 33.07s/it] {'loss': 0.0011, 'grad_norm': 2.3386224298948375, 'learning_rate': 9.304347826086955e-07, 'completion_length': 137.8035774230957, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.23086076974868774, 'kl': 0.02630615234375, 'epoch': 0.35} 7%|▋ | 112/1610 [43:28<13:45:44, 33.07s/it] 7%|▋ | 113/1610 [44:00<13:38:09, 32.79s/it] {'loss': 0.0011, 'grad_norm': 2.4781922920239583, 'learning_rate': 9.298136645962732e-07, 'completion_length': 145.96429061889648, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.4260072708129883, 'kl': 0.02679443359375, 'epoch': 0.35} 7%|▋ | 113/1610 [44:00<13:38:09, 32.79s/it] 7%|▋ | 114/1610 [44:37<14:02:47, 33.80s/it] {'loss': 0.0011, 'grad_norm': 1.6439361324960495, 'learning_rate': 9.291925465838509e-07, 'completion_length': 141.6607208251953, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.1896214708685875, 'kl': 0.02783203125, 'epoch': 0.35} 7%|▋ | 114/1610 [44:37<14:02:47, 33.80s/it] 7%|▋ | 115/1610 [45:05<13:24:21, 32.28s/it] {'loss': 0.0016, 'grad_norm': 2.2209470729789693, 'learning_rate': 9.285714285714285e-07, 'completion_length': 133.35715103149414, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3928571939468384, 'reward_std': 0.24695908278226852, 'kl': 0.039306640625, 'epoch': 0.36} 7%|▋ | 115/1610 [45:05<13:24:21, 32.28s/it] 7%|▋ | 116/1610 
[45:27<12:01:10, 28.96s/it] {'loss': 0.0012, 'grad_norm': 1.5411450963502151, 'learning_rate': 9.279503105590062e-07, 'completion_length': 120.33929061889648, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.30228935927152634, 'kl': 0.03125, 'epoch': 0.36} 7%|▋ | 116/1610 [45:27<12:01:10, 28.96s/it] 7%|▋ | 117/1610 [45:55<11:58:09, 28.86s/it] {'loss': 0.0011, 'grad_norm': 2.0990966665802846, 'learning_rate': 9.273291925465838e-07, 'completion_length': 113.98214340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.30228933691978455, 'kl': 0.02862548828125, 'epoch': 0.36} 7%|▋ | 117/1610 [45:55<11:58:09, 28.86s/it] 7%|▋ | 118/1610 [46:26<12:12:01, 29.44s/it] {'loss': 0.0014, 'grad_norm': 2.179248373215832, 'learning_rate': 9.267080745341614e-07, 'completion_length': 104.91072082519531, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.2253357470035553, 'kl': 0.03436279296875, 'epoch': 0.37} 7%|▋ | 118/1610 [46:26<12:12:01, 29.44s/it] 7%|▋ | 119/1610 [47:00<12:46:50, 30.86s/it] {'loss': 0.0014, 'grad_norm': 3.5444368449086467, 'learning_rate': 9.260869565217391e-07, 'completion_length': 110.28572082519531, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.034912109375, 'epoch': 0.37} 7%|▋ | 119/1610 [47:00<12:46:50, 30.86s/it] 7%|▋ | 120/1610 [47:29<12:28:06, 30.13s/it] {'loss': 0.0012, 'grad_norm': 3.215415531905321, 'learning_rate': 9.254658385093167e-07, 'completion_length': 118.55357360839844, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.26657506078481674, 'kl': 0.0303955078125, 'epoch': 0.37} 7%|▋ | 120/1610 [47:29<12:28:06, 30.13s/it] 
8%|▊ | 121/1610 [48:02<12:49:06, 30.99s/it] {'loss': 0.0016, 'grad_norm': 1.5368042064670797, 'learning_rate': 9.248447204968943e-07, 'completion_length': 107.5535774230957, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.0404052734375, 'epoch': 0.38} 8%|▊ | 121/1610 [48:02<12:49:06, 30.99s/it] 8%|▊ | 122/1610 [48:29<12:23:40, 29.99s/it] {'loss': 0.0016, 'grad_norm': 3.714668596581508, 'learning_rate': 9.24223602484472e-07, 'completion_length': 81.57143020629883, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.2967643141746521, 'kl': 0.0411376953125, 'epoch': 0.38} 8%|▊ | 122/1610 [48:29<12:23:40, 29.99s/it] 8%|▊ | 123/1610 [48:57<12:08:12, 29.38s/it] {'loss': 0.0014, 'grad_norm': 2.4637864943870746, 'learning_rate': 9.236024844720497e-07, 'completion_length': 93.85714721679688, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.31333939731121063, 'kl': 0.0352783203125, 'epoch': 0.38} 8%|▊ | 123/1610 [48:57<12:08:12, 29.38s/it] 8%|▊ | 124/1610 [49:30<12:34:20, 30.46s/it] {'loss': 0.0016, 'grad_norm': 2.147233229847565, 'learning_rate': 9.229813664596273e-07, 'completion_length': 114.96429443359375, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.2610500380396843, 'kl': 0.038818359375, 'epoch': 0.39} 8%|▊ | 124/1610 [49:30<12:34:20, 30.46s/it] 8%|▊ | 125/1610 [49:55<11:51:42, 28.76s/it] {'loss': 0.0015, 'grad_norm': 1.9452511689675787, 'learning_rate': 9.22360248447205e-07, 'completion_length': 80.1785774230957, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.29123930633068085, 'kl': 0.037109375, 'epoch': 0.39} 8%|▊ | 125/1610 [49:55<11:51:42, 28.76s/it] 8%|▊ | 126/1610 
[50:26<12:07:18, 29.41s/it] {'loss': 0.0015, 'grad_norm': 2.170170157270434, 'learning_rate': 9.217391304347826e-07, 'completion_length': 103.08928680419922, 'rewards/accuracy_reward': 0.1964285857975483, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.1785715222358704, 'reward_std': 0.21222838014364243, 'kl': 0.0369873046875, 'epoch': 0.39} 8%|▊ | 126/1610 [50:26<12:07:18, 29.41s/it] 8%|▊ | 127/1610 [50:50<11:29:10, 27.88s/it] {'loss': 0.0012, 'grad_norm': 1.3699572662941604, 'learning_rate': 9.211180124223602e-07, 'completion_length': 127.32143783569336, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.0294189453125, 'epoch': 0.39} 8%|▊ | 127/1610 [50:50<11:29:10, 27.88s/it] 8%|▊ | 128/1610 [51:12<10:42:34, 26.02s/it] {'loss': 0.0015, 'grad_norm': 1.9419381531322537, 'learning_rate': 9.204968944099379e-07, 'completion_length': 110.28571701049805, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571428656578064, 'reward_std': 0.2253357470035553, 'kl': 0.03662109375, 'epoch': 0.4} 8%|▊ | 128/1610 [51:12<10:42:34, 26.02s/it] 8%|▊ | 129/1610 [51:31<9:54:07, 24.07s/it] {'loss': 0.0015, 'grad_norm': 3.825737550774467, 'learning_rate': 9.198757763975155e-07, 'completion_length': 111.57143020629883, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.29123930633068085, 'kl': 0.036376953125, 'epoch': 0.4} 8%|▊ | 129/1610 [51:31<9:54:07, 24.07s/it] 8%|▊ | 130/1610 [51:51<9:21:07, 22.75s/it] {'loss': 0.0014, 'grad_norm': 1.5766185373904134, 'learning_rate': 9.19254658385093e-07, 'completion_length': 115.92857360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500380396843, 'kl': 0.0343017578125, 'epoch': 0.4} 8%|▊ | 130/1610 [51:51<9:21:07, 
22.75s/it] 8%|▊ | 131/1610 [52:09<8:43:38, 21.24s/it] {'loss': 0.0014, 'grad_norm': 1.440810468867519, 'learning_rate': 9.186335403726707e-07, 'completion_length': 102.91071701049805, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.1896214708685875, 'kl': 0.0340576171875, 'epoch': 0.41} 8%|▊ | 131/1610 [52:09<8:43:38, 21.24s/it] 8%|▊ | 132/1610 [52:29<8:37:23, 21.00s/it] {'loss': 0.0014, 'grad_norm': 1.7081365732607146, 'learning_rate': 9.180124223602484e-07, 'completion_length': 121.44643020629883, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.21981074661016464, 'kl': 0.0361328125, 'epoch': 0.41} 8%|▊ | 132/1610 [52:29<8:37:23, 21.00s/it] 8%|▊ | 133/1610 [52:48<8:20:39, 20.34s/it] {'loss': 0.0014, 'grad_norm': 3.8424134373801886, 'learning_rate': 9.17391304347826e-07, 'completion_length': 115.03572082519531, 'rewards/accuracy_reward': 0.2857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.285714328289032, 'reward_std': 0.32695360481739044, 'kl': 0.035400390625, 'epoch': 0.41} 8%|▊ | 133/1610 [52:48<8:20:39, 20.34s/it] 8%|▊ | 134/1610 [53:07<8:09:08, 19.88s/it] {'loss': 0.0013, 'grad_norm': 1.3191288688349434, 'learning_rate': 9.167701863354037e-07, 'completion_length': 113.92857360839844, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1071428619325161, 'kl': 0.03271484375, 'epoch': 0.42} 8%|▊ | 134/1610 [53:07<8:09:08, 19.88s/it] 8%|▊ | 135/1610 [53:25<7:55:34, 19.35s/it] {'loss': 0.0014, 'grad_norm': 3.184761066932027, 'learning_rate': 9.161490683229813e-07, 'completion_length': 117.16072082519531, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.40943220257759094, 'kl': 0.03607177734375, 'epoch': 0.42} 8%|▊ | 135/1610 [53:25<7:55:34, 19.35s/it] 8%|▊ | 136/1610 
[53:48<8:23:50, 20.51s/it] {'loss': 0.0013, 'grad_norm': 1.609846339389642, 'learning_rate': 9.155279503105589e-07, 'completion_length': 131.91072463989258, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.2610500454902649, 'kl': 0.03253173828125, 'epoch': 0.42} 8%|▊ | 136/1610 [53:48<8:23:50, 20.51s/it] 9%|▊ | 137/1610 [54:10<8:36:31, 21.04s/it] {'loss': 0.0013, 'grad_norm': 1.0557597833949925, 'learning_rate': 9.149068322981366e-07, 'completion_length': 139.6607208251953, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.03131103515625, 'epoch': 0.43} 9%|▊ | 137/1610 [54:10<8:36:31, 21.04s/it] 9%|▊ | 138/1610 [54:33<8:49:17, 21.57s/it] {'loss': 0.0014, 'grad_norm': 2.177760383300279, 'learning_rate': 9.142857142857142e-07, 'completion_length': 110.00000762939453, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.26657506078481674, 'kl': 0.035400390625, 'epoch': 0.43} 9%|▊ | 138/1610 [54:33<8:49:17, 21.57s/it] 9%|▊ | 139/1610 [54:56<8:57:33, 21.93s/it] {'loss': 0.0015, 'grad_norm': 1.769017146075186, 'learning_rate': 9.136645962732918e-07, 'completion_length': 114.87500381469727, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250000596046448, 'reward_std': 0.29123930633068085, 'kl': 0.0379638671875, 'epoch': 0.43} 9%|▊ | 139/1610 [54:56<8:57:33, 21.93s/it] 9%|▊ | 140/1610 [55:19<9:07:57, 22.37s/it] {'loss': 0.0014, 'grad_norm': 1.7171708260240757, 'learning_rate': 9.130434782608695e-07, 'completion_length': 123.57143020629883, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2610500454902649, 'kl': 0.03558349609375, 'epoch': 0.43} 9%|▊ | 140/1610 [55:19<9:07:57, 22.37s/it] 9%|▉ 
| 141/1610 [55:49<9:57:15, 24.39s/it] {'loss': 0.0015, 'grad_norm': 2.416587724250142, 'learning_rate': 9.124223602484472e-07, 'completion_length': 138.25, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1896214708685875, 'kl': 0.036865234375, 'epoch': 0.44} 9%|▉ | 141/1610 [55:49<9:57:15, 24.39s/it] 9%|▉ | 142/1610 [56:21<10:58:54, 26.93s/it] {'loss': 0.0016, 'grad_norm': 1.4887269231064861, 'learning_rate': 9.118012422360248e-07, 'completion_length': 90.64286041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.14838217198848724, 'kl': 0.0411376953125, 'epoch': 0.44} 9%|▉ | 142/1610 [56:21<10:58:54, 26.93s/it] 9%|▉ | 143/1610 [57:08<13:24:43, 32.91s/it] {'loss': 0.0014, 'grad_norm': 2.313867817400797, 'learning_rate': 9.111801242236025e-07, 'completion_length': 118.58929443359375, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.1785714402794838, 'kl': 0.0343017578125, 'epoch': 0.44} 9%|▉ | 143/1610 [57:08<13:24:43, 32.91s/it] 9%|▉ | 144/1610 [57:47<14:04:14, 34.55s/it] {'loss': 0.0015, 'grad_norm': 1.606484097847731, 'learning_rate': 9.105590062111801e-07, 'completion_length': 99.00000381469727, 'rewards/accuracy_reward': 0.4285714402794838, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.18409645557403564, 'kl': 0.038330078125, 'epoch': 0.45} 9%|▉ | 144/1610 [57:47<14:04:14, 34.55s/it] 9%|▉ | 145/1610 [58:30<15:08:31, 37.21s/it] {'loss': 0.0014, 'grad_norm': 1.402863445402668, 'learning_rate': 9.099378881987577e-07, 'completion_length': 111.96429061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.1896214671432972, 'kl': 0.0362548828125, 'epoch': 0.45} 9%|▉ | 145/1610 [58:30<15:08:31, 37.21s/it] 9%|▉ | 146/1610 
[59:21<16:51:11, 41.44s/it] {'loss': 0.0015, 'grad_norm': 1.4205362591671498, 'learning_rate': 9.093167701863354e-07, 'completion_length': 138.01786041259766, 'rewards/accuracy_reward': 0.2857143059372902, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.2678571939468384, 'reward_std': 0.30228936672210693, 'kl': 0.0377197265625, 'epoch': 0.45} 9%|▉ | 146/1610 [59:21<16:51:11, 41.44s/it] 9%|▉ | 147/1610 [1:00:06<17:14:23, 42.42s/it] {'loss': 0.0014, 'grad_norm': 1.6296049012546114, 'learning_rate': 9.08695652173913e-07, 'completion_length': 123.64286041259766, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.0714285746216774, 'kl': 0.0338134765625, 'epoch': 0.46} 9%|▉ | 147/1610 [1:00:06<17:14:23, 42.42s/it] 9%|▉ | 148/1610 [1:00:45<16:51:02, 41.49s/it] {'loss': 0.0015, 'grad_norm': 2.2999333542958094, 'learning_rate': 9.080745341614906e-07, 'completion_length': 96.5714340209961, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.32391268014907837, 'kl': 0.0372314453125, 'epoch': 0.46} 9%|▉ | 148/1610 [1:00:45<16:51:02, 41.49s/it] 9%|▉ | 149/1610 [1:01:31<17:17:57, 42.63s/it] {'loss': 0.0016, 'grad_norm': 1.4949761512703619, 'learning_rate': 9.074534161490683e-07, 'completion_length': 122.14286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2006715089082718, 'kl': 0.0401611328125, 'epoch': 0.46} 9%|▉ | 149/1610 [1:01:31<17:17:57, 42.63s/it] 9%|▉ | 150/1610 [1:02:20<18:04:50, 44.58s/it] {'loss': 0.0014, 'grad_norm': 2.5007635933833487, 'learning_rate': 9.06832298136646e-07, 'completion_length': 121.32143783569336, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.3324786201119423, 'kl': 0.035400390625, 'epoch': 0.47} 9%|▉ | 150/1610 
[1:02:20<18:04:50, 44.58s/it] 9%|▉ | 151/1610 [1:02:54<16:49:05, 41.50s/it] {'loss': 0.0014, 'grad_norm': 1.412143138183067, 'learning_rate': 9.062111801242236e-07, 'completion_length': 123.08929061889648, 'rewards/accuracy_reward': 0.2500000149011612, 'rewards/format_reward': 1.0, 'reward': 1.2500000596046448, 'reward_std': 0.26657506078481674, 'kl': 0.0350341796875, 'epoch': 0.47} 9%|▉ | 151/1610 [1:02:54<16:49:05, 41.50s/it] 9%|▉ | 152/1610 [1:03:36<16:49:54, 41.56s/it] {'loss': 0.0014, 'grad_norm': 1.7673629494180574, 'learning_rate': 9.055900621118013e-07, 'completion_length': 136.8035774230957, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.29123930633068085, 'kl': 0.0357666015625, 'epoch': 0.47} 9%|▉ | 152/1610 [1:03:36<16:49:54, 41.56s/it] 10%|▉ | 153/1610 [1:04:33<18:41:12, 46.17s/it] {'loss': 0.0014, 'grad_norm': 1.716490504211055, 'learning_rate': 9.049689440993789e-07, 'completion_length': 165.58929443359375, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.35204461216926575, 'kl': 0.03515625, 'epoch': 0.48} 10%|▉ | 153/1610 [1:04:33<18:41:12, 46.17s/it] 10%|▉ | 154/1610 [1:05:30<19:59:45, 49.44s/it] {'loss': 0.0013, 'grad_norm': 1.8384455477740567, 'learning_rate': 9.043478260869564e-07, 'completion_length': 134.9464340209961, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.24241765588521957, 'kl': 0.033203125, 'epoch': 0.48} 10%|▉ | 154/1610 [1:05:30<19:59:45, 49.44s/it] 10%|▉ | 155/1610 [1:06:21<20:10:35, 49.92s/it] {'loss': 0.0018, 'grad_norm': 1.28095566435932, 'learning_rate': 9.037267080745341e-07, 'completion_length': 122.05358123779297, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.046142578125, 
'epoch': 0.48} 10%|▉ | 155/1610 [1:06:21<20:10:35, 49.92s/it] 10%|▉ | 156/1610 [1:07:07<19:40:28, 48.71s/it] {'loss': 0.0017, 'grad_norm': 2.349336968925378, 'learning_rate': 9.031055900621117e-07, 'completion_length': 122.25000381469727, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.26657506078481674, 'kl': 0.0413818359375, 'epoch': 0.48} 10%|▉ | 156/1610 [1:07:07<19:40:28, 48.71s/it] 10%|▉ | 157/1610 [1:07:51<19:03:41, 47.23s/it] {'loss': 0.0017, 'grad_norm': 1.967531441400542, 'learning_rate': 9.024844720496893e-07, 'completion_length': 100.12500381469727, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.25552502274513245, 'kl': 0.0416259765625, 'epoch': 0.49} 10%|▉ | 157/1610 [1:07:51<19:03:41, 47.23s/it] 10%|▉ | 158/1610 [1:08:44<19:51:38, 49.24s/it] {'loss': 0.0014, 'grad_norm': 1.194702825625421, 'learning_rate': 9.01863354037267e-07, 'completion_length': 155.0178680419922, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857313156128, 'reward_std': 0.2253357619047165, 'kl': 0.0341796875, 'epoch': 0.49} 10%|▉ | 158/1610 [1:08:45<19:51:38, 49.24s/it] 10%|▉ | 159/1610 [1:09:35<20:03:34, 49.77s/it] {'loss': 0.0015, 'grad_norm': 0.8928620540306376, 'learning_rate': 9.012422360248447e-07, 'completion_length': 114.21428680419922, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.1428571492433548, 'kl': 0.03662109375, 'epoch': 0.49} 10%|▉ | 159/1610 [1:09:36<20:03:34, 49.77s/it] 10%|▉ | 160/1610 [1:10:52<23:19:49, 57.92s/it] {'loss': 0.0015, 'grad_norm': 1.3778915155949636, 'learning_rate': 9.006211180124223e-07, 'completion_length': 161.8214340209961, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 
'reward_std': 0.2781319320201874, 'kl': 0.036376953125, 'epoch': 0.5} 10%|▉ | 160/1610 [1:10:52<23:19:49, 57.92s/it] 10%|█ | 161/1610 [1:11:39<21:57:18, 54.55s/it] {'loss': 0.0016, 'grad_norm': 1.9139149027963787, 'learning_rate': 9e-07, 'completion_length': 115.32143783569336, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.2831501364707947, 'kl': 0.0400390625, 'epoch': 0.5} 10%|█ | 161/1610 [1:11:39<21:57:18, 54.55s/it] 10%|█ | 162/1610 [1:12:36<22:11:14, 55.16s/it] {'loss': 0.0015, 'grad_norm': 1.699158434453846, 'learning_rate': 8.993788819875776e-07, 'completion_length': 145.7857208251953, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.26657506078481674, 'kl': 0.036376953125, 'epoch': 0.5} 10%|█ | 162/1610 [1:12:36<22:11:14, 55.16s/it] 10%|█ | 163/1610 [1:13:27<21:40:07, 53.91s/it] {'loss': 0.0017, 'grad_norm': 1.7462044818314424, 'learning_rate': 8.987577639751552e-07, 'completion_length': 116.10714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2721000760793686, 'kl': 0.04150390625, 'epoch': 0.51} 10%|█ | 163/1610 [1:13:27<21:40:07, 53.91s/it] 10%|█ | 164/1610 [1:14:07<20:03:46, 49.95s/it] {'loss': 0.0019, 'grad_norm': 1.872473118409347, 'learning_rate': 8.981366459627329e-07, 'completion_length': 85.37500381469727, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.21981073170900345, 'kl': 0.0469970703125, 'epoch': 0.51} 10%|█ | 164/1610 [1:14:07<20:03:46, 49.95s/it] 10%|█ | 165/1610 [1:14:59<20:12:42, 50.35s/it] {'loss': 0.0019, 'grad_norm': 2.9685957143447137, 'learning_rate': 8.975155279503105e-07, 'completion_length': 139.42858123779297, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571429252624512, 
'reward_std': 0.42048226296901703, 'kl': 0.048828125, 'epoch': 0.51} 10%|█ | 165/1610 [1:14:59<20:12:42, 50.35s/it] 10%|█ | 166/1610 [1:15:49<20:14:06, 50.45s/it] {'loss': 0.0016, 'grad_norm': 1.3681815282526473, 'learning_rate': 8.968944099378881e-07, 'completion_length': 148.23214721679688, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.24794265627861023, 'kl': 0.0390625, 'epoch': 0.52} 10%|█ | 166/1610 [1:15:49<20:14:06, 50.45s/it] 10%|█ | 167/1610 [1:16:24<18:16:56, 45.61s/it] {'loss': 0.0018, 'grad_norm': 1.378015395072373, 'learning_rate': 8.962732919254658e-07, 'completion_length': 141.5357208251953, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.23638580739498138, 'kl': 0.0445556640625, 'epoch': 0.52} 10%|█ | 167/1610 [1:16:24<18:16:56, 45.61s/it] 10%|█ | 168/1610 [1:17:00<17:07:42, 42.76s/it] {'loss': 0.0016, 'grad_norm': 4.230353352740978, 'learning_rate': 8.956521739130435e-07, 'completion_length': 142.4821548461914, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.18409645557403564, 'kl': 0.03955078125, 'epoch': 0.52} 10%|█ | 168/1610 [1:17:00<17:07:42, 42.76s/it] 10%|█ | 169/1610 [1:17:29<15:32:53, 38.84s/it] {'loss': 0.0019, 'grad_norm': 1.7693430733937168, 'learning_rate': 8.950310559006211e-07, 'completion_length': 124.89286041259766, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.25552503019571304, 'kl': 0.0467529296875, 'epoch': 0.52} 10%|█ | 169/1610 [1:17:29<15:32:53, 38.84s/it] 11%|█ | 170/1610 [1:18:15<16:20:43, 40.86s/it] {'loss': 0.0018, 'grad_norm': 2.6272044156124217, 'learning_rate': 8.944099378881988e-07, 'completion_length': 123.12500381469727, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.25552502274513245, 'kl': 0.044677734375, 'epoch': 0.53} 11%|█ | 170/1610 [1:18:15<16:20:43, 40.86s/it] 11%|█ | 171/1610 [1:18:41<14:31:12, 36.33s/it] {'loss': 0.0022, 'grad_norm': 2.090833778584861, 'learning_rate': 8.937888198757764e-07, 'completion_length': 118.25000381469727, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285715222358704, 'reward_std': 0.3596269488334656, 'kl': 0.0540771484375, 'epoch': 0.53} 11%|█ | 171/1610 [1:18:41<14:31:12, 36.33s/it] 11%|█ | 172/1610 [1:19:22<15:09:22, 37.94s/it] {'loss': 0.0021, 'grad_norm': 1.2661975281331226, 'learning_rate': 8.93167701863354e-07, 'completion_length': 142.48215103149414, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6607143878936768, 'reward_std': 0.21981072798371315, 'kl': 0.052734375, 'epoch': 0.53} 11%|█ | 172/1610 [1:19:23<15:09:22, 37.94s/it] 11%|█ | 173/1610 [1:19:48<13:36:16, 34.08s/it] {'loss': 0.002, 'grad_norm': 1.3359159604357784, 'learning_rate': 8.925465838509317e-07, 'completion_length': 104.64286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.21981073915958405, 'kl': 0.0496826171875, 'epoch': 0.54} 11%|█ | 173/1610 [1:19:48<13:36:16, 34.08s/it] 11%|█ | 174/1610 [1:20:11<12:22:10, 31.01s/it] {'loss': 0.0021, 'grad_norm': 2.1896742209578464, 'learning_rate': 8.919254658385093e-07, 'completion_length': 113.0714340209961, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.2253357619047165, 'kl': 0.0513916015625, 'epoch': 0.54} 11%|█ | 174/1610 [1:20:11<12:22:10, 31.01s/it] 11%|█ | 175/1610 [1:20:44<12:30:42, 31.39s/it] {'loss': 0.0015, 'grad_norm': 1.6319226489703886, 'learning_rate': 8.913043478260869e-07, 'completion_length': 171.12500762939453, 
'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571429252624512, 'reward_std': 0.39632484316825867, 'kl': 0.03692626953125, 'epoch': 0.54} 11%|█ | 175/1610 [1:20:44<12:30:42, 31.39s/it] 11%|█ | 176/1610 [1:21:15<12:27:52, 31.29s/it] {'loss': 0.0019, 'grad_norm': 1.7013172491768813, 'learning_rate': 8.906832298136646e-07, 'completion_length': 152.42857360839844, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6250000596046448, 'reward_std': 0.2831501215696335, 'kl': 0.047607421875, 'epoch': 0.55} 11%|█ | 176/1610 [1:21:15<12:27:52, 31.29s/it] 11%|█ | 177/1610 [1:21:43<12:05:02, 30.36s/it] {'loss': 0.0018, 'grad_norm': 1.8957412359623698, 'learning_rate': 8.900621118012423e-07, 'completion_length': 167.30358123779297, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.2857143878936768, 'reward_std': 0.2967643216252327, 'kl': 0.0455322265625, 'epoch': 0.55} 11%|█ | 177/1610 [1:21:43<12:05:02, 30.36s/it] 11%|█ | 178/1610 [1:22:16<12:21:53, 31.08s/it] {'loss': 0.0018, 'grad_norm': 1.6345678772239065, 'learning_rate': 8.894409937888198e-07, 'completion_length': 161.8214340209961, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3928572535514832, 'reward_std': 0.2967643216252327, 'kl': 0.0450439453125, 'epoch': 0.55} 11%|█ | 178/1610 [1:22:16<12:21:53, 31.08s/it] 11%|█ | 179/1610 [1:22:46<12:12:38, 30.72s/it] {'loss': 0.0017, 'grad_norm': 1.8306944915039576, 'learning_rate': 8.888198757763975e-07, 'completion_length': 149.4464340209961, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4285715222358704, 'reward_std': 0.3721673935651779, 'kl': 0.0428466796875, 'epoch': 0.56} 11%|█ | 179/1610 [1:22:46<12:12:38, 30.72s/it] 11%|█ | 180/1610 [1:23:14<11:54:17, 29.97s/it] {'loss': 0.0015, 'grad_norm': 
1.3990005086521196, 'learning_rate': 8.881987577639751e-07, 'completion_length': 165.64286041259766, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3392857313156128, 'reward_std': 0.29123930633068085, 'kl': 0.0384521484375, 'epoch': 0.56} 11%|█ | 180/1610 [1:23:14<11:54:17, 29.97s/it] 11%|█ | 181/1610 [1:23:44<11:53:40, 29.97s/it] {'loss': 0.0014, 'grad_norm': 1.5560527804152116, 'learning_rate': 8.875776397515527e-07, 'completion_length': 165.37500762939453, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3750000596046448, 'reward_std': 0.2610500454902649, 'kl': 0.034912109375, 'epoch': 0.56} 11%|█ | 181/1610 [1:23:44<11:53:40, 29.97s/it] 11%|█▏ | 182/1610 [1:24:12<11:44:16, 29.59s/it] {'loss': 0.002, 'grad_norm': 1.5279400870823736, 'learning_rate': 8.869565217391303e-07, 'completion_length': 184.1607208251953, 'rewards/accuracy_reward': 0.12500000931322575, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.0535715222358704, 'reward_std': 0.30626385286450386, 'kl': 0.0499267578125, 'epoch': 0.57} 11%|█▏ | 182/1610 [1:24:12<11:44:16, 29.59s/it] 11%|█▏ | 183/1610 [1:24:43<11:49:55, 29.85s/it] {'loss': 0.0012, 'grad_norm': 1.7501607953811784, 'learning_rate': 8.86335403726708e-07, 'completion_length': 201.17858123779297, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.2678571939468384, 'reward_std': 0.21981074661016464, 'kl': 0.031005859375, 'epoch': 0.57} 11%|█▏ | 183/1610 [1:24:43<11:49:55, 29.85s/it] 11%|█▏ | 184/1610 [1:25:26<13:20:28, 33.68s/it] {'loss': 0.0012, 'grad_norm': 1.5571337866208854, 'learning_rate': 8.857142857142856e-07, 'completion_length': 152.26786041259766, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.19514648616313934, 'kl': 0.03033447265625, 'epoch': 0.57} 11%|█▏ | 184/1610 [1:25:26<13:20:28, 
33.68s/it] 11%|█▏ | 185/1610 [1:26:16<15:17:21, 38.63s/it] {'loss': 0.0013, 'grad_norm': 2.4476499927947826, 'learning_rate': 8.850931677018632e-07, 'completion_length': 159.50000762939453, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.19514650478959084, 'kl': 0.032958984375, 'epoch': 0.57} 11%|█▏ | 185/1610 [1:26:16<15:17:21, 38.63s/it] 12%|█▏ | 186/1610 [1:27:11<17:14:13, 43.58s/it] {'loss': 0.0017, 'grad_norm': 1.3361734178928875, 'learning_rate': 8.84472049689441e-07, 'completion_length': 151.12500762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.2610500454902649, 'kl': 0.04296875, 'epoch': 0.58} 12%|█▏ | 186/1610 [1:27:11<17:14:13, 43.58s/it] 12%|█▏ | 187/1610 [1:27:57<17:33:07, 44.40s/it] {'loss': 0.0015, 'grad_norm': 1.764154865173313, 'learning_rate': 8.838509316770186e-07, 'completion_length': 154.25000762939453, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.2610500380396843, 'kl': 0.03857421875, 'epoch': 0.58} 12%|█▏ | 187/1610 [1:27:57<17:33:07, 44.40s/it] 12%|█▏ | 188/1610 [1:28:38<17:08:56, 43.42s/it] {'loss': 0.0015, 'grad_norm': 2.114423269966926, 'learning_rate': 8.832298136645962e-07, 'completion_length': 159.6071548461914, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.3078143745660782, 'kl': 0.036865234375, 'epoch': 0.58} 12%|█▏ | 188/1610 [1:28:38<17:08:56, 43.42s/it] 12%|█▏ | 189/1610 [1:29:22<17:13:35, 43.64s/it] {'loss': 0.0015, 'grad_norm': 1.9883292383851816, 'learning_rate': 8.826086956521739e-07, 'completion_length': 130.17857360839844, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.2253357619047165, 'kl': 0.0369873046875, 
'epoch': 0.59} 12%|█▏ | 189/1610 [1:29:22<17:13:35, 43.64s/it] 12%|█▏ | 190/1610 [1:30:14<18:09:19, 46.03s/it] {'loss': 0.0018, 'grad_norm': 1.7468776018170304, 'learning_rate': 8.819875776397515e-07, 'completion_length': 182.42858123779297, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5714285969734192, 'reward_std': 0.29924844205379486, 'kl': 0.0438232421875, 'epoch': 0.59} 12%|█▏ | 190/1610 [1:30:14<18:09:19, 46.03s/it] 12%|█▏ | 191/1610 [1:31:54<24:28:19, 62.09s/it] {'loss': 0.0015, 'grad_norm': 1.3124990851780354, 'learning_rate': 8.813664596273291e-07, 'completion_length': 182.21429443359375, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4464285969734192, 'reward_std': 0.21981073170900345, 'kl': 0.03729248046875, 'epoch': 0.59} 12%|█▏ | 191/1610 [1:31:54<24:28:19, 62.09s/it] 12%|█▏ | 192/1610 [1:32:38<22:19:19, 56.67s/it] {'loss': 0.0012, 'grad_norm': 1.8634897663781187, 'learning_rate': 8.807453416149068e-07, 'completion_length': 151.9464340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1785714402794838, 'kl': 0.02935791015625, 'epoch': 0.6} 12%|█▏ | 192/1610 [1:32:38<22:19:19, 56.67s/it] 12%|█▏ | 193/1610 [1:33:20<20:35:20, 52.31s/it] {'loss': 0.0013, 'grad_norm': 2.7474537292121664, 'learning_rate': 8.801242236024844e-07, 'completion_length': 128.0535774230957, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.29123931378126144, 'kl': 0.0333251953125, 'epoch': 0.6} 12%|█▏ | 193/1610 [1:33:20<20:35:20, 52.31s/it] 12%|█▏ | 194/1610 [1:34:00<19:12:07, 48.82s/it] {'loss': 0.0014, 'grad_norm': 1.9921873958207663, 'learning_rate': 8.79503105590062e-07, 'completion_length': 119.0714340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 
1.5000000596046448, 'reward_std': 0.1428571529686451, 'kl': 0.035888671875, 'epoch': 0.6} 12%|█▏ | 194/1610 [1:34:00<19:12:07, 48.82s/it] 12%|█▏ | 195/1610 [1:34:48<19:00:09, 48.35s/it] {'loss': 0.0013, 'grad_norm': 1.4892415233776906, 'learning_rate': 8.788819875776398e-07, 'completion_length': 164.94643783569336, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.1785714365541935, 'kl': 0.032470703125, 'epoch': 0.61} 12%|█▏ | 195/1610 [1:34:48<19:00:09, 48.35s/it] 12%|█▏ | 196/1610 [1:35:22<17:18:27, 44.06s/it] {'loss': 0.0013, 'grad_norm': 1.321519869393502, 'learning_rate': 8.782608695652174e-07, 'completion_length': 159.17857360839844, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 'reward_std': 0.3078143745660782, 'kl': 0.03265380859375, 'epoch': 0.61} 12%|█▏ | 196/1610 [1:35:22<17:18:27, 44.06s/it] 12%|█▏ | 197/1610 [1:35:55<16:00:51, 40.80s/it] {'loss': 0.0015, 'grad_norm': 2.2898425321924307, 'learning_rate': 8.77639751552795e-07, 'completion_length': 137.5357208251953, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2253357619047165, 'kl': 0.037841796875, 'epoch': 0.61} 12%|█▏ | 197/1610 [1:35:55<16:00:51, 40.80s/it] 12%|█▏ | 198/1610 [1:36:31<15:27:22, 39.41s/it] {'loss': 0.0018, 'grad_norm': 1.973599061554877, 'learning_rate': 8.770186335403727e-07, 'completion_length': 177.6607208251953, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.446428656578064, 'reward_std': 0.37769243121147156, 'kl': 0.046142578125, 'epoch': 0.61} 12%|█▏ | 198/1610 [1:36:31<15:27:22, 39.41s/it] 12%|█▏ | 199/1610 [1:37:02<14:27:18, 36.88s/it] {'loss': 0.0011, 'grad_norm': 2.302240328142959, 'learning_rate': 8.763975155279503e-07, 'completion_length': 163.92857360839844, 
'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.4260072857141495, 'kl': 0.02874755859375, 'epoch': 0.62} 12%|█▏ | 199/1610 [1:37:02<14:27:18, 36.88s/it] 12%|█▏ | 200/1610 [1:37:27<13:04:10, 33.37s/it] {'loss': 0.0014, 'grad_norm': 1.6982041459552266, 'learning_rate': 8.757763975155279e-07, 'completion_length': 135.87500381469727, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.25552502274513245, 'kl': 0.033935546875, 'epoch': 0.62} 12%|█▏ | 200/1610 [1:37:27<13:04:10, 33.37s/it] 12%|█▏ | 201/1610 [1:40:18<29:13:00, 74.65s/it] {'loss': 0.0015, 'grad_norm': 0.9781259629340419, 'learning_rate': 8.751552795031055e-07, 'completion_length': 137.08929061889648, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.1539071872830391, 'kl': 0.0372314453125, 'epoch': 0.62} 12%|█▏ | 201/1610 [1:40:18<29:13:00, 74.65s/it] 13%|█▎ | 202/1610 [1:40:43<23:19:43, 59.65s/it] {'loss': 0.0012, 'grad_norm': 1.2768964899455406, 'learning_rate': 8.745341614906831e-07, 'completion_length': 142.76786041259766, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.21981073170900345, 'kl': 0.03009033203125, 'epoch': 0.63} 13%|█▎ | 202/1610 [1:40:43<23:19:43, 59.65s/it] 13%|█▎ | 203/1610 [1:41:05<18:56:50, 48.48s/it] {'loss': 0.0017, 'grad_norm': 2.0278957184982276, 'learning_rate': 8.739130434782607e-07, 'completion_length': 142.83929443359375, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.3495604991912842, 'kl': 0.0435791015625, 'epoch': 0.63} 13%|█▎ | 203/1610 [1:41:05<18:56:50, 48.48s/it] 13%|█▎ | 204/1610 [1:41:28<15:51:30, 40.60s/it] {'loss': 0.0012, 'grad_norm': 2.133332903080185, 'learning_rate': 
8.732919254658385e-07, 'completion_length': 133.71428680419922, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.33800363540649414, 'kl': 0.03076171875, 'epoch': 0.63} 13%|█▎ | 204/1610 [1:41:28<15:51:30, 40.60s/it] 13%|█▎ | 205/1610 [1:41:53<14:03:08, 36.01s/it] {'loss': 0.0013, 'grad_norm': 1.7118083169301492, 'learning_rate': 8.726708074534161e-07, 'completion_length': 156.85715103149414, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.31029847264289856, 'kl': 0.03131103515625, 'epoch': 0.64} 13%|█▎ | 205/1610 [1:41:53<14:03:08, 36.01s/it] 13%|█▎ | 206/1610 [1:42:14<12:18:07, 31.54s/it] {'loss': 0.0013, 'grad_norm': 1.8316225741263736, 'learning_rate': 8.720496894409937e-07, 'completion_length': 106.3214340209961, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.23086076974868774, 'kl': 0.03179931640625, 'epoch': 0.64} 13%|█▎ | 206/1610 [1:42:14<12:18:07, 31.54s/it] 13%|█▎ | 207/1610 [1:42:33<10:50:19, 27.81s/it] {'loss': 0.0011, 'grad_norm': 1.429890084134299, 'learning_rate': 8.714285714285714e-07, 'completion_length': 110.25000381469727, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.21981073915958405, 'kl': 0.027587890625, 'epoch': 0.64} 13%|█▎ | 207/1610 [1:42:33<10:50:19, 27.81s/it] 13%|█▎ | 208/1610 [1:42:58<10:33:01, 27.09s/it] {'loss': 0.0014, 'grad_norm': 1.6325550670352769, 'learning_rate': 8.70807453416149e-07, 'completion_length': 140.25000762939453, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.379242941737175, 'kl': 0.0362548828125, 'epoch': 0.65} 13%|█▎ | 208/1610 [1:42:58<10:33:01, 27.09s/it] 13%|█▎ | 209/1610 [1:43:28<10:49:15, 27.81s/it] {'loss': 
0.0014, 'grad_norm': 3.441326215268352, 'learning_rate': 8.701863354037266e-07, 'completion_length': 187.1964340209961, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.446428656578064, 'reward_std': 0.25248410552740097, 'kl': 0.03564453125, 'epoch': 0.65} 13%|█▎ | 209/1610 [1:43:28<10:49:15, 27.81s/it] 13%|█▎ | 210/1610 [1:43:53<10:29:27, 26.98s/it] {'loss': 0.0013, 'grad_norm': 2.534420544857561, 'learning_rate': 8.695652173913043e-07, 'completion_length': 117.16072082519531, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.3324786126613617, 'kl': 0.03192138671875, 'epoch': 0.65} 13%|█▎ | 210/1610 [1:43:53<10:29:27, 26.98s/it] 13%|█▎ | 211/1610 [1:44:15<9:53:32, 25.46s/it] {'loss': 0.0013, 'grad_norm': 1.6578178564482917, 'learning_rate': 8.689440993788819e-07, 'completion_length': 111.85714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1896214708685875, 'kl': 0.031494140625, 'epoch': 0.66} 13%|█▎ | 211/1610 [1:44:15<9:53:32, 25.46s/it] 13%|█▎ | 212/1610 [1:44:37<9:32:34, 24.57s/it] {'loss': 0.0016, 'grad_norm': 1.420161945842727, 'learning_rate': 8.683229813664595e-07, 'completion_length': 101.71429061889648, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428656578064, 'kl': 0.0404052734375, 'epoch': 0.66} 13%|█▎ | 212/1610 [1:44:37<9:32:34, 24.57s/it] 13%|█▎ | 213/1610 [1:44:58<9:02:49, 23.31s/it] {'loss': 0.0014, 'grad_norm': 1.5828958966855227, 'learning_rate': 8.677018633540373e-07, 'completion_length': 101.94643020629883, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.18409644439816475, 'kl': 0.0355224609375, 'epoch': 0.66} 13%|█▎ | 213/1610 [1:44:58<9:02:49, 23.31s/it] 13%|█▎ | 214/1610 
[1:45:22<9:06:15, 23.48s/it] {'loss': 0.0013, 'grad_norm': 1.4757074289967917, 'learning_rate': 8.670807453416149e-07, 'completion_length': 124.71429061889648, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.18409645557403564, 'kl': 0.033447265625, 'epoch': 0.66} 13%|█▎ | 214/1610 [1:45:22<9:06:15, 23.48s/it] 13%|█▎ | 215/1610 [1:45:44<8:57:08, 23.10s/it] {'loss': 0.0012, 'grad_norm': 0.8598799185427782, 'learning_rate': 8.664596273291925e-07, 'completion_length': 124.33929061889648, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.1181928962469101, 'kl': 0.03094482421875, 'epoch': 0.67} 13%|█▎ | 215/1610 [1:45:44<8:57:08, 23.10s/it] 13%|█▎ | 216/1610 [1:46:05<8:43:47, 22.54s/it] {'loss': 0.0021, 'grad_norm': 3.833106518458339, 'learning_rate': 8.658385093167702e-07, 'completion_length': 97.35714721679688, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.3214285969734192, 'reward_std': 0.25552502274513245, 'kl': 0.0521240234375, 'epoch': 0.67} 13%|█▎ | 216/1610 [1:46:05<8:43:47, 22.54s/it] 13%|█▎ | 217/1610 [1:46:28<8:45:41, 22.64s/it] {'loss': 0.0014, 'grad_norm': 1.782206082058303, 'learning_rate': 8.652173913043478e-07, 'completion_length': 105.71429061889648, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 0.2610500380396843, 'kl': 0.0350341796875, 'epoch': 0.67} 13%|█▎ | 217/1610 [1:46:28<8:45:41, 22.64s/it] 14%|█▎ | 218/1610 [1:47:06<10:30:49, 27.19s/it] {'loss': 0.0017, 'grad_norm': 1.5226694839305501, 'learning_rate': 8.645962732919254e-07, 'completion_length': 127.89286041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.23086076974868774, 'kl': 0.04345703125, 'epoch': 0.68} 14%|█▎ | 218/1610 
[1:47:06<10:30:49, 27.19s/it] 14%|█▎ | 219/1610 [1:47:52<12:43:40, 32.94s/it] {'loss': 0.0013, 'grad_norm': 1.9663292884934696, 'learning_rate': 8.639751552795031e-07, 'completion_length': 134.2678680419922, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2363857924938202, 'kl': 0.032958984375, 'epoch': 0.68} 14%|█▎ | 219/1610 [1:47:52<12:43:40, 32.94s/it] 14%|█▎ | 220/1610 [1:51:06<31:23:24, 81.30s/it] {'loss': 0.0019, 'grad_norm': 2.1132582745363284, 'learning_rate': 8.633540372670807e-07, 'completion_length': 123.9285774230957, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.30228933691978455, 'kl': 0.0467529296875, 'epoch': 0.68} 14%|█▎ | 220/1610 [1:51:06<31:23:24, 81.30s/it] 14%|█▎ | 221/1610 [1:53:49<40:50:39, 105.86s/it] {'loss': 0.0018, 'grad_norm': 1.66265600361063, 'learning_rate': 8.627329192546583e-07, 'completion_length': 135.1607208251953, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.11266788095235825, 'kl': 0.044921875, 'epoch': 0.69} 14%|█▎ | 221/1610 [1:53:49<40:50:39, 105.86s/it] 14%|█▍ | 222/1610 [1:57:08<51:31:31, 133.64s/it] {'loss': 0.0021, 'grad_norm': 2.2285200740473576, 'learning_rate': 8.621118012422361e-07, 'completion_length': 102.46429061889648, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.1896214708685875, 'kl': 0.05322265625, 'epoch': 0.69} 14%|█▍ | 222/1610 [1:57:08<51:31:31, 133.64s/it] 14%|█▍ | 223/1610 [1:57:36<39:19:53, 102.09s/it] {'loss': 0.0027, 'grad_norm': 2.0446723832763225, 'learning_rate': 8.614906832298137e-07, 'completion_length': 104.62500381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214783191681, 'kl': 0.0672607421875, 'epoch': 
0.69} 14%|█▍ | 223/1610 [1:57:37<39:19:53, 102.09s/it] 14%|█▍ | 224/1610 [1:58:38<34:36:43, 89.90s/it] {'loss': 0.0031, 'grad_norm': 2.068122079612577, 'learning_rate': 8.608695652173913e-07, 'completion_length': 106.96429443359375, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.1896214708685875, 'kl': 0.076904296875, 'epoch': 0.7} 14%|█▍ | 224/1610 [1:58:38<34:36:43, 89.90s/it] 14%|█▍ | 225/1610 [1:59:52<32:44:59, 85.13s/it] {'loss': 0.0024, 'grad_norm': 2.80850896143313, 'learning_rate': 8.60248447204969e-07, 'completion_length': 98.01786422729492, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1785714402794838, 'kl': 0.059814453125, 'epoch': 0.7} 14%|█▍ | 225/1610 [1:59:52<32:44:59, 85.13s/it] 14%|█▍ | 226/1610 [2:00:38<28:12:39, 73.38s/it] {'loss': 0.0018, 'grad_norm': 1.3833655453306961, 'learning_rate': 8.596273291925465e-07, 'completion_length': 124.53572082519531, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1428571529686451, 'kl': 0.0460205078125, 'epoch': 0.7} 14%|█▍ | 226/1610 [2:00:38<28:12:39, 73.38s/it] 14%|█▍ | 227/1610 [2:02:32<32:55:02, 85.68s/it] {'loss': 0.002, 'grad_norm': 1.160130544980714, 'learning_rate': 8.590062111801241e-07, 'completion_length': 141.8571548461914, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.051025390625, 'epoch': 0.7} 14%|█▍ | 227/1610 [2:02:32<32:55:02, 85.68s/it] 14%|█▍ | 228/1610 [2:04:45<38:22:00, 99.94s/it] {'loss': 0.0036, 'grad_norm': 2.651800174987089, 'learning_rate': 8.583850931677018e-07, 'completion_length': 92.00000762939453, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2006715089082718, 'kl': 0.08984375, 
'epoch': 0.71} 14%|█▍ | 228/1610 [2:04:45<38:22:00, 99.94s/it] 14%|█▍ | 229/1610 [2:07:36<46:25:10, 121.01s/it] {'loss': 0.0038, 'grad_norm': 2.979081441189684, 'learning_rate': 8.577639751552794e-07, 'completion_length': 101.0714340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.343528650701046, 'kl': 0.095703125, 'epoch': 0.71} 14%|█▍ | 229/1610 [2:07:36<46:25:10, 121.01s/it] 14%|█▍ | 230/1610 [2:08:35<39:16:18, 102.45s/it] {'loss': 0.0039, 'grad_norm': 1.6430428346104926, 'learning_rate': 8.57142857142857e-07, 'completion_length': 90.76786422729492, 'rewards/accuracy_reward': 0.357142873108387, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.098876953125, 'epoch': 0.71} 14%|█▍ | 230/1610 [2:08:35<39:16:18, 102.45s/it] 14%|█▍ | 231/1610 [2:09:11<31:38:52, 82.62s/it] {'loss': 0.0042, 'grad_norm': 1.3650591026161947, 'learning_rate': 8.565217391304348e-07, 'completion_length': 96.64286422729492, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.10498046875, 'epoch': 0.72} 14%|█▍ | 231/1610 [2:09:11<31:38:52, 82.62s/it] 14%|█▍ | 232/1610 [2:11:14<36:18:38, 94.86s/it] {'loss': 0.0044, 'grad_norm': 1.628062180729294, 'learning_rate': 8.559006211180124e-07, 'completion_length': 83.75000381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838216826319695, 'kl': 0.1103515625, 'epoch': 0.72} 14%|█▍ | 232/1610 [2:11:14<36:18:38, 94.86s/it] 14%|█▍ | 233/1610 [2:15:08<52:09:18, 136.35s/it] {'loss': 0.0046, 'grad_norm': 2.158487858377924, 'learning_rate': 8.5527950310559e-07, 'completion_length': 102.71428680419922, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.446428656578064, 'reward_std': 
0.23086076974868774, 'kl': 0.114501953125, 'epoch': 0.72} 14%|█▍ | 233/1610 [2:15:08<52:09:18, 136.35s/it] 15%|█▍ | 234/1610 [2:17:11<50:39:04, 132.52s/it] {'loss': 0.0045, 'grad_norm': 2.1120797170348635, 'learning_rate': 8.546583850931677e-07, 'completion_length': 94.14286041259766, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.2500000186264515, 'kl': 0.11279296875, 'epoch': 0.73} 15%|█▍ | 234/1610 [2:17:11<50:39:04, 132.52s/it] 15%|█▍ | 235/1610 [2:18:22<43:35:48, 114.14s/it] {'loss': 0.0043, 'grad_norm': 2.1183855349287537, 'learning_rate': 8.540372670807453e-07, 'completion_length': 101.50000762939453, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.21981073915958405, 'kl': 0.10888671875, 'epoch': 0.73} 15%|█▍ | 235/1610 [2:18:22<43:35:48, 114.14s/it] 15%|█▍ | 236/1610 [2:19:24<37:32:12, 98.35s/it] {'loss': 0.0047, 'grad_norm': 2.0584506846552584, 'learning_rate': 8.534161490683229e-07, 'completion_length': 103.4464340209961, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.23086076974868774, 'kl': 0.118408203125, 'epoch': 0.73} 15%|█▍ | 236/1610 [2:19:24<37:32:12, 98.35s/it] 15%|█▍ | 237/1610 [2:20:41<35:01:08, 91.82s/it] {'loss': 0.0047, 'grad_norm': 1.2301158993293704, 'learning_rate': 8.527950310559006e-07, 'completion_length': 120.8214340209961, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3928572535514832, 'reward_std': 0.20117833465337753, 'kl': 0.1171875, 'epoch': 0.74} 15%|█▍ | 237/1610 [2:20:41<35:01:08, 91.82s/it] 15%|█▍ | 238/1610 [2:21:40<31:16:47, 82.08s/it] {'loss': 0.0052, 'grad_norm': 1.9011127310100169, 'learning_rate': 8.521739130434782e-07, 'completion_length': 100.85714721679688, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 
'reward': 1.3750000596046448, 'reward_std': 0.23086079210042953, 'kl': 0.13037109375, 'epoch': 0.74} 15%|█▍ | 238/1610 [2:21:40<31:16:47, 82.08s/it] 15%|█▍ | 239/1610 [2:22:24<26:52:01, 70.55s/it] {'loss': 0.0041, 'grad_norm': 3.150743380108309, 'learning_rate': 8.515527950310558e-07, 'completion_length': 110.37500381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.3294376879930496, 'kl': 0.101806640625, 'epoch': 0.74} 15%|█▍ | 239/1610 [2:22:24<26:52:01, 70.55s/it] 15%|█▍ | 240/1610 [2:23:30<26:19:30, 69.18s/it] {'loss': 0.005, 'grad_norm': 2.6048505965831494, 'learning_rate': 8.509316770186336e-07, 'completion_length': 109.53572082519531, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.1896214634180069, 'kl': 0.124267578125, 'epoch': 0.75} 15%|█▍ | 240/1610 [2:23:30<26:19:30, 69.18s/it] 15%|█▍ | 241/1610 [2:24:06<22:33:53, 59.34s/it] {'loss': 0.0047, 'grad_norm': 2.7618291452165242, 'learning_rate': 8.503105590062112e-07, 'completion_length': 108.17857360839844, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285715222358704, 'reward_std': 0.3550855293869972, 'kl': 0.116943359375, 'epoch': 0.75} 15%|█▍ | 241/1610 [2:24:06<22:33:53, 59.34s/it] 15%|█▌ | 242/1610 [2:24:43<19:59:25, 52.61s/it] {'loss': 0.0047, 'grad_norm': 4.374162057138571, 'learning_rate': 8.496894409937888e-07, 'completion_length': 92.19643020629883, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.29123930633068085, 'kl': 0.116455078125, 'epoch': 0.75} 15%|█▌ | 242/1610 [2:24:43<19:59:25, 52.61s/it] 15%|█▌ | 243/1610 [2:25:18<17:57:31, 47.29s/it] {'loss': 0.0039, 'grad_norm': 12.286451411459543, 'learning_rate': 8.490683229813665e-07, 'completion_length': 110.41072082519531, 
'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.3435286656022072, 'kl': 0.09765625, 'epoch': 0.75} 15%|█▌ | 243/1610 [2:25:18<17:57:31, 47.29s/it] 15%|█▌ | 244/1610 [2:26:23<19:58:39, 52.65s/it] {'loss': 0.0046, 'grad_norm': 1.9832828749596234, 'learning_rate': 8.484472049689441e-07, 'completion_length': 86.76786041259766, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.1428571492433548, 'kl': 0.11572265625, 'epoch': 0.76} 15%|█▌ | 244/1610 [2:26:23<19:58:39, 52.65s/it] 15%|█▌ | 245/1610 [2:27:57<24:41:46, 65.13s/it] {'loss': 0.0043, 'grad_norm': 1.3446663159640095, 'learning_rate': 8.478260869565217e-07, 'completion_length': 125.25000381469727, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.446428656578064, 'reward_std': 0.14838216826319695, 'kl': 0.108642578125, 'epoch': 0.76} 15%|█▌ | 245/1610 [2:27:57<24:41:46, 65.13s/it] 15%|█▌ | 246/1610 [2:29:35<28:25:32, 75.02s/it] {'loss': 0.0032, 'grad_norm': 3.3876165540413163, 'learning_rate': 8.472049689440994e-07, 'completion_length': 95.6964340209961, 'rewards/accuracy_reward': 0.357142873108387, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857909202576, 'reward_std': 0.24794265627861023, 'kl': 0.08056640625, 'epoch': 0.76} 15%|█▌ | 246/1610 [2:29:35<28:25:32, 75.02s/it] 15%|█▌ | 247/1610 [2:31:00<29:31:54, 78.00s/it] {'loss': 0.0035, 'grad_norm': 2.7537226585238885, 'learning_rate': 8.46583850931677e-07, 'completion_length': 105.87500762939453, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.25552502274513245, 'kl': 0.087158203125, 'epoch': 0.77} 15%|█▌ | 247/1610 [2:31:00<29:31:54, 78.00s/it] 15%|█▌ | 248/1610 [2:32:08<28:18:45, 74.84s/it] {'loss': 0.0026, 'grad_norm': 1.2020862171442421, 'learning_rate': 
8.459627329192546e-07, 'completion_length': 140.0714340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4642857909202576, 'reward_std': 0.23385171964764595, 'kl': 0.0653076171875, 'epoch': 0.77} 15%|█▌ | 248/1610 [2:32:08<28:18:45, 74.84s/it] 15%|█▌ | 249/1610 [2:32:48<24:23:26, 64.52s/it] {'loss': 0.0023, 'grad_norm': 1.7908664897550592, 'learning_rate': 8.453416149068324e-07, 'completion_length': 142.67857360839844, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1785714365541935, 'kl': 0.057373046875, 'epoch': 0.77} 15%|█▌ | 249/1610 [2:32:48<24:23:26, 64.52s/it] 16%|█▌ | 250/1610 [2:33:17<20:21:08, 53.87s/it] {'loss': 0.0025, 'grad_norm': 0.9164593685521049, 'learning_rate': 8.447204968944099e-07, 'completion_length': 108.92857360839844, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1071428619325161, 'kl': 0.0631103515625, 'epoch': 0.78} 16%|█▌ | 250/1610 [2:33:17<20:21:08, 53.87s/it] 16%|█▌ | 251/1610 [2:33:45<17:27:02, 46.23s/it] {'loss': 0.0025, 'grad_norm': 1.7995025106243914, 'learning_rate': 8.440993788819875e-07, 'completion_length': 100.50000381469727, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.1539071835577488, 'kl': 0.063720703125, 'epoch': 0.78} 16%|█▌ | 251/1610 [2:33:45<17:27:02, 46.23s/it] 16%|█▌ | 252/1610 [2:34:14<15:23:18, 40.79s/it] {'loss': 0.0025, 'grad_norm': 5.520811352290147, 'learning_rate': 8.434782608695652e-07, 'completion_length': 117.76786041259766, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500454902649, 'kl': 0.06298828125, 'epoch': 0.78} 16%|█▌ | 252/1610 [2:34:14<15:23:18, 40.79s/it] 16%|█▌ | 253/1610 [2:34:49<14:48:49, 39.30s/it] {'loss': 0.0018, 
'grad_norm': 1.8494966332141134, 'learning_rate': 8.428571428571428e-07, 'completion_length': 120.23215103149414, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409644067287445, 'kl': 0.0458984375, 'epoch': 0.79} 16%|█▌ | 253/1610 [2:34:49<14:48:49, 39.30s/it] 16%|█▌ | 254/1610 [2:35:20<13:47:24, 36.61s/it] {'loss': 0.0028, 'grad_norm': 1.9030501527072512, 'learning_rate': 8.422360248447204e-07, 'completion_length': 108.3214340209961, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.2610500380396843, 'kl': 0.0692138671875, 'epoch': 0.79} 16%|█▌ | 254/1610 [2:35:20<13:47:24, 36.61s/it] 16%|█▌ | 255/1610 [2:35:48<12:53:01, 34.23s/it] {'loss': 0.0024, 'grad_norm': 2.7889432813500936, 'learning_rate': 8.416149068322981e-07, 'completion_length': 109.4285774230957, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.2610500305891037, 'kl': 0.060791015625, 'epoch': 0.79} 16%|█▌ | 255/1610 [2:35:48<12:53:01, 34.23s/it] 16%|█▌ | 256/1610 [2:36:15<11:59:43, 31.89s/it] {'loss': 0.0018, 'grad_norm': 4.020880723178876, 'learning_rate': 8.409937888198757e-07, 'completion_length': 117.39286041259766, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.29123930633068085, 'kl': 0.0460205078125, 'epoch': 0.8} 16%|█▌ | 256/1610 [2:36:15<11:59:43, 31.89s/it] 16%|█▌ | 257/1610 [2:36:47<12:04:14, 32.12s/it] {'loss': 0.0022, 'grad_norm': 4.60179978284463, 'learning_rate': 8.403726708074533e-07, 'completion_length': 142.55357360839844, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.30228935927152634, 'kl': 0.0550537109375, 'epoch': 0.8} 16%|█▌ | 257/1610 [2:36:47<12:04:14, 32.12s/it] 16%|█▌ | 258/1610 
[2:37:17<11:44:14, 31.25s/it] {'loss': 0.0021, 'grad_norm': 3.324294912248981, 'learning_rate': 8.397515527950311e-07, 'completion_length': 106.25000762939453, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.25552502274513245, 'kl': 0.0533447265625, 'epoch': 0.8} 16%|█▌ | 258/1610 [2:37:17<11:44:14, 31.25s/it] 16%|█▌ | 259/1610 [2:37:58<12:53:22, 34.35s/it] {'loss': 0.0018, 'grad_norm': 2.5905562283203483, 'learning_rate': 8.391304347826087e-07, 'completion_length': 151.2678680419922, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.33800364285707474, 'kl': 0.0462646484375, 'epoch': 0.8} 16%|█▌ | 259/1610 [2:37:58<12:53:22, 34.35s/it] 16%|█▌ | 260/1610 [2:38:39<13:34:58, 36.22s/it] {'loss': 0.0021, 'grad_norm': 4.660310598896255, 'learning_rate': 8.385093167701863e-07, 'completion_length': 155.08928680419922, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.2142857238650322, 'kl': 0.051513671875, 'epoch': 0.81} 16%|█▌ | 260/1610 [2:38:39<13:34:58, 36.22s/it] 16%|█▌ | 261/1610 [2:39:20<14:04:52, 37.58s/it] {'loss': 0.0021, 'grad_norm': 1.5431485506833815, 'learning_rate': 8.37888198757764e-07, 'completion_length': 170.9107208251953, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.21124479919672012, 'kl': 0.0535888671875, 'epoch': 0.81} 16%|█▌ | 261/1610 [2:39:20<14:04:52, 37.58s/it] 16%|█▋ | 262/1610 [2:39:54<13:45:51, 36.76s/it] {'loss': 0.0016, 'grad_norm': 4.396345547276898, 'learning_rate': 8.372670807453416e-07, 'completion_length': 108.87500762939453, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.1865805685520172, 'kl': 0.038818359375, 
'epoch': 0.81} 16%|█▋ | 262/1610 [2:39:54<13:45:51, 36.76s/it] 16%|█▋ | 263/1610 [2:40:21<12:34:31, 33.61s/it] {'loss': 0.0019, 'grad_norm': 2.86285988844386, 'learning_rate': 8.366459627329192e-07, 'completion_length': 91.16071701049805, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.3435286581516266, 'kl': 0.0467529296875, 'epoch': 0.82} 16%|█▋ | 263/1610 [2:40:21<12:34:31, 33.61s/it] 16%|█▋ | 264/1610 [2:40:52<12:16:07, 32.81s/it] {'loss': 0.0017, 'grad_norm': 1.4452645631603214, 'learning_rate': 8.360248447204969e-07, 'completion_length': 140.1964340209961, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.18409645557403564, 'kl': 0.042724609375, 'epoch': 0.82} 16%|█▋ | 264/1610 [2:40:52<12:16:07, 32.81s/it] 16%|█▋ | 265/1610 [2:41:28<12:37:27, 33.79s/it] {'loss': 0.0023, 'grad_norm': 2.9296471628080094, 'learning_rate': 8.354037267080745e-07, 'completion_length': 109.08929061889648, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1896214634180069, 'kl': 0.0567626953125, 'epoch': 0.82} 16%|█▋ | 265/1610 [2:41:28<12:37:27, 33.79s/it] 17%|█▋ | 266/1610 [2:42:19<14:30:53, 38.88s/it] {'loss': 0.0022, 'grad_norm': 1.8130864259574881, 'learning_rate': 8.347826086956521e-07, 'completion_length': 140.87500762939453, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.25552503019571304, 'kl': 0.0552978515625, 'epoch': 0.83} 17%|█▋ | 266/1610 [2:42:19<14:30:53, 38.88s/it] 17%|█▋ | 267/1610 [2:43:21<17:09:30, 45.99s/it] {'loss': 0.0022, 'grad_norm': 1.8450590616438274, 'learning_rate': 8.341614906832299e-07, 'completion_length': 114.8035774230957, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 
0.14838216453790665, 'kl': 0.0540771484375, 'epoch': 0.83} 17%|█▋ | 267/1610 [2:43:21<17:09:30, 45.99s/it] 17%|█▋ | 268/1610 [2:44:54<22:20:54, 59.95s/it] {'loss': 0.0021, 'grad_norm': 0.902564662395916, 'learning_rate': 8.335403726708075e-07, 'completion_length': 104.08929061889648, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.07695359364151955, 'kl': 0.051513671875, 'epoch': 0.83} 17%|█▋ | 268/1610 [2:44:54<22:20:54, 59.95s/it] 17%|█▋ | 269/1610 [2:46:15<24:46:33, 66.51s/it] {'loss': 0.0017, 'grad_norm': 0.696528721797528, 'learning_rate': 8.329192546583851e-07, 'completion_length': 127.48215103149414, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.0416259765625, 'epoch': 0.84} 17%|█▋ | 269/1610 [2:46:16<24:46:33, 66.51s/it] 17%|█▋ | 270/1610 [2:47:55<28:28:23, 76.50s/it] {'loss': 0.0019, 'grad_norm': 1.0368493886203194, 'learning_rate': 8.322981366459628e-07, 'completion_length': 110.89286041259766, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0467529296875, 'epoch': 0.84} 17%|█▋ | 270/1610 [2:47:55<28:28:23, 76.50s/it] 17%|█▋ | 271/1610 [2:49:12<28:26:18, 76.46s/it] {'loss': 0.002, 'grad_norm': 2.627142483365279, 'learning_rate': 8.316770186335404e-07, 'completion_length': 123.5714340209961, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1428571492433548, 'kl': 0.0509033203125, 'epoch': 0.84} 17%|█▋ | 271/1610 [2:49:12<28:26:18, 76.46s/it] 17%|█▋ | 272/1610 [2:50:49<30:45:48, 82.77s/it] {'loss': 0.0025, 'grad_norm': 4.940200894240968, 'learning_rate': 8.31055900621118e-07, 'completion_length': 103.78572082519531, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.3928571939468384, 'reward_std': 0.40943220257759094, 'kl': 0.0621337890625, 'epoch': 0.84} 17%|█▋ | 272/1610 [2:50:49<30:45:48, 82.77s/it] 17%|█▋ | 273/1610 [2:52:39<33:45:42, 90.91s/it] {'loss': 0.0026, 'grad_norm': 4.554819892772285, 'learning_rate': 8.304347826086955e-07, 'completion_length': 126.71429443359375, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178572535514832, 'reward_std': 0.2937234118580818, 'kl': 0.066162109375, 'epoch': 0.85} 17%|█▋ | 273/1610 [2:52:39<33:45:42, 90.91s/it] 17%|█▋ | 274/1610 [2:53:53<31:49:02, 85.74s/it] {'loss': 0.0019, 'grad_norm': 2.273360345129114, 'learning_rate': 8.298136645962732e-07, 'completion_length': 100.58929061889648, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.2610500454902649, 'kl': 0.0469970703125, 'epoch': 0.85} 17%|█▋ | 274/1610 [2:53:53<31:49:02, 85.74s/it] 17%|█▋ | 275/1610 [2:55:03<30:03:20, 81.05s/it] {'loss': 0.0026, 'grad_norm': 2.3294211837210153, 'learning_rate': 8.291925465838508e-07, 'completion_length': 81.19643020629883, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.24191081523895264, 'kl': 0.065185546875, 'epoch': 0.85} 17%|█▋ | 275/1610 [2:55:03<30:03:20, 81.05s/it] 17%|█▋ | 276/1610 [2:56:45<32:20:50, 87.29s/it] {'loss': 0.003, 'grad_norm': 2.7332705383405647, 'learning_rate': 8.285714285714285e-07, 'completion_length': 93.33928680419922, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.2610500380396843, 'kl': 0.0751953125, 'epoch': 0.86} 17%|█▋ | 276/1610 [2:56:45<32:20:50, 87.29s/it] 17%|█▋ | 277/1610 [2:57:54<30:18:54, 81.87s/it] {'loss': 0.0023, 'grad_norm': 4.871146932933863, 'learning_rate': 8.279503105590062e-07, 'completion_length': 89.98214721679688, 'rewards/accuracy_reward': 
0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.30228933691978455, 'kl': 0.056640625, 'epoch': 0.86} 17%|█▋ | 277/1610 [2:57:54<30:18:54, 81.87s/it] 17%|█▋ | 278/1610 [2:59:25<31:22:08, 84.78s/it] {'loss': 0.0025, 'grad_norm': 1.5643388563530525, 'learning_rate': 8.273291925465838e-07, 'completion_length': 154.5714340209961, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3214285969734192, 'reward_std': 0.26657508313655853, 'kl': 0.0625, 'epoch': 0.86} 17%|█▋ | 278/1610 [2:59:25<31:22:08, 84.78s/it] 17%|█▋ | 279/1610 [3:02:57<45:23:50, 122.79s/it] {'loss': 0.0022, 'grad_norm': 1.605070539126889, 'learning_rate': 8.267080745341614e-07, 'completion_length': 142.85714721679688, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1896214708685875, 'kl': 0.0540771484375, 'epoch': 0.87} 17%|█▋ | 279/1610 [3:02:57<45:23:50, 122.79s/it] 17%|█▋ | 280/1610 [3:03:15<33:44:39, 91.34s/it] {'loss': 0.0022, 'grad_norm': 1.4625063880947269, 'learning_rate': 8.260869565217391e-07, 'completion_length': 109.00000381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.053955078125, 'epoch': 0.87} 17%|█▋ | 280/1610 [3:03:15<33:44:39, 91.34s/it] 17%|█▋ | 281/1610 [3:04:27<31:33:17, 85.48s/it] {'loss': 0.0016, 'grad_norm': 1.0691654766600525, 'learning_rate': 8.254658385093167e-07, 'completion_length': 130.76786422729492, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.1428571529686451, 'kl': 0.0391845703125, 'epoch': 0.87} 17%|█▋ | 281/1610 [3:04:27<31:33:17, 85.48s/it] 18%|█▊ | 282/1610 [3:04:42<23:47:30, 64.50s/it] {'loss': 0.0019, 'grad_norm': 1.5755205580214913, 'learning_rate': 8.248447204968943e-07, 'completion_length': 
125.5714340209961, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1896214708685875, 'kl': 0.048095703125, 'epoch': 0.88} 18%|█▊ | 282/1610 [3:04:42<23:47:30, 64.50s/it] 18%|█▊ | 283/1610 [3:05:25<21:22:35, 57.99s/it] {'loss': 0.0018, 'grad_norm': 7.807074267141928, 'learning_rate': 8.24223602484472e-07, 'completion_length': 90.41071701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2253357619047165, 'kl': 0.045654296875, 'epoch': 0.88} 18%|█▊ | 283/1610 [3:05:25<21:22:35, 57.99s/it] 18%|█▊ | 284/1610 [3:06:44<23:41:51, 64.34s/it] {'loss': 0.0021, 'grad_norm': 1.9543026468519118, 'learning_rate': 8.236024844720496e-07, 'completion_length': 96.3035774230957, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.30228933691978455, 'kl': 0.052734375, 'epoch': 0.88} 18%|█▊ | 284/1610 [3:06:44<23:41:51, 64.34s/it] 18%|█▊ | 285/1610 [3:07:30<21:39:40, 58.85s/it] {'loss': 0.0024, 'grad_norm': 1.8238783372927867, 'learning_rate': 8.229813664596273e-07, 'completion_length': 109.51786041259766, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.23086079210042953, 'kl': 0.06005859375, 'epoch': 0.89} 18%|█▊ | 285/1610 [3:07:30<21:39:40, 58.85s/it] 18%|█▊ | 286/1610 [3:08:24<21:05:35, 57.35s/it] {'loss': 0.0011, 'grad_norm': 0.9804766162070138, 'learning_rate': 8.22360248447205e-07, 'completion_length': 119.16071701049805, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7321429252624512, 'reward_std': 0.21981073915958405, 'kl': 0.02752685546875, 'epoch': 0.89} 18%|█▊ | 286/1610 [3:08:24<21:05:35, 57.35s/it] 18%|█▊ | 287/1610 [3:08:57<18:24:40, 50.10s/it] {'loss': 0.0025, 'grad_norm': 1.981148660875249, 
'learning_rate': 8.217391304347826e-07, 'completion_length': 114.8035774230957, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.0623779296875, 'epoch': 0.89} 18%|█▊ | 287/1610 [3:08:57<18:24:40, 50.10s/it] 18%|█▊ | 288/1610 [3:10:17<21:38:23, 58.93s/it] {'loss': 0.0021, 'grad_norm': 1.8128288296916524, 'learning_rate': 8.211180124223602e-07, 'completion_length': 129.23214721679688, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.26353415846824646, 'kl': 0.0528564453125, 'epoch': 0.89} 18%|█▊ | 288/1610 [3:10:17<21:38:23, 58.93s/it] 18%|█▊ | 289/1610 [3:11:18<21:55:48, 59.76s/it] {'loss': 0.0017, 'grad_norm': 1.4845659655413768, 'learning_rate': 8.204968944099379e-07, 'completion_length': 106.41071701049805, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 0.25248410552740097, 'kl': 0.042724609375, 'epoch': 0.9} 18%|█▊ | 289/1610 [3:11:18<21:55:48, 59.76s/it] 18%|█▊ | 290/1610 [3:11:58<19:43:19, 53.79s/it] {'loss': 0.0016, 'grad_norm': 2.7129593597979893, 'learning_rate': 8.198757763975155e-07, 'completion_length': 117.46428680419922, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.2610500305891037, 'kl': 0.0399169921875, 'epoch': 0.9} 18%|█▊ | 290/1610 [3:11:58<19:43:19, 53.79s/it] 18%|█▊ | 291/1610 [3:12:22<16:26:23, 44.87s/it] {'loss': 0.0018, 'grad_norm': 1.3573313696924052, 'learning_rate': 8.192546583850931e-07, 'completion_length': 118.14286041259766, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2253357619047165, 'kl': 0.0438232421875, 'epoch': 0.9} 18%|█▊ | 291/1610 [3:12:22<16:26:23, 44.87s/it] 18%|█▊ | 292/1610 
[3:13:02<15:53:14, 43.39s/it] {'loss': 0.0018, 'grad_norm': 1.4502685818173522, 'learning_rate': 8.186335403726708e-07, 'completion_length': 132.50000762939453, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1785714402794838, 'kl': 0.0458984375, 'epoch': 0.91} 18%|█▊ | 292/1610 [3:13:02<15:53:14, 43.39s/it] 18%|█▊ | 293/1610 [3:13:39<15:06:10, 41.28s/it] {'loss': 0.0018, 'grad_norm': 2.3745347337912177, 'learning_rate': 8.180124223602484e-07, 'completion_length': 133.26786041259766, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.18409645557403564, 'kl': 0.044921875, 'epoch': 0.91} 18%|█▊ | 293/1610 [3:13:39<15:06:10, 41.28s/it] 18%|█▊ | 294/1610 [3:14:11<14:06:46, 38.61s/it] {'loss': 0.002, 'grad_norm': 5.391314381838979, 'learning_rate': 8.173913043478261e-07, 'completion_length': 125.16072082519531, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.0494384765625, 'epoch': 0.91} 18%|█▊ | 294/1610 [3:14:11<14:06:46, 38.61s/it] 18%|█▊ | 295/1610 [3:14:55<14:39:28, 40.13s/it] {'loss': 0.0015, 'grad_norm': 1.0347869515426102, 'learning_rate': 8.167701863354038e-07, 'completion_length': 139.67858123779297, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.321428656578064, 'reward_std': 0.11266788095235825, 'kl': 0.0377197265625, 'epoch': 0.92} 18%|█▊ | 295/1610 [3:14:55<14:39:28, 40.13s/it] 18%|█▊ | 296/1610 [3:15:31<14:10:14, 38.82s/it] {'loss': 0.0019, 'grad_norm': 3.995477012366362, 'learning_rate': 8.161490683229814e-07, 'completion_length': 125.01786422729492, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.2500000074505806, 'kl': 0.0482177734375, 'epoch': 0.92} 18%|█▊ | 296/1610 
[3:15:31<14:10:14, 38.82s/it] 18%|█▊ | 297/1610 [3:16:38<17:19:16, 47.49s/it] {'loss': 0.0022, 'grad_norm': 0.6539222622244231, 'learning_rate': 8.155279503105589e-07, 'completion_length': 147.26786422729492, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.05517578125, 'epoch': 0.92} 18%|█▊ | 297/1610 [3:16:38<17:19:16, 47.49s/it] 19%|█▊ | 298/1610 [3:17:01<14:37:46, 40.14s/it] {'loss': 0.0016, 'grad_norm': 2.04335774784418, 'learning_rate': 8.149068322981366e-07, 'completion_length': 111.17857360839844, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.14838216453790665, 'kl': 0.040771484375, 'epoch': 0.93} 19%|█▊ | 298/1610 [3:17:01<14:37:46, 40.14s/it] 19%|█▊ | 299/1610 [3:17:44<14:56:14, 41.02s/it] {'loss': 0.002, 'grad_norm': 1.58951508313347, 'learning_rate': 8.142857142857142e-07, 'completion_length': 143.44643020629883, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.21981073170900345, 'kl': 0.0499267578125, 'epoch': 0.93} 19%|█▊ | 299/1610 [3:17:44<14:56:14, 41.02s/it] 19%|█▊ | 300/1610 [3:18:20<14:17:49, 39.29s/it] {'loss': 0.0021, 'grad_norm': 1.891389471747532, 'learning_rate': 8.136645962732918e-07, 'completion_length': 93.91071701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2253357619047165, 'kl': 0.0531005859375, 'epoch': 0.93} 19%|█▊ | 300/1610 [3:18:20<14:17:49, 39.29s/it] 19%|█▊ | 301/1610 [3:21:49<32:48:49, 90.24s/it] {'loss': 0.0023, 'grad_norm': 8.798191334027747, 'learning_rate': 8.130434782608695e-07, 'completion_length': 93.66071701049805, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.18409644439816475, 'kl': 
0.0587158203125, 'epoch': 0.93} 19%|█▊ | 301/1610 [3:21:49<32:48:49, 90.24s/it] 19%|█▉ | 302/1610 [3:22:20<26:21:18, 72.54s/it] {'loss': 0.002, 'grad_norm': 1.576493467612092, 'learning_rate': 8.124223602484471e-07, 'completion_length': 119.92857360839844, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571428656578064, 'reward_std': 0.2253357619047165, 'kl': 0.050537109375, 'epoch': 0.94} 19%|█▉ | 302/1610 [3:22:20<26:21:18, 72.54s/it] 19%|█▉ | 303/1610 [3:24:38<33:26:54, 92.13s/it] {'loss': 0.002, 'grad_norm': 0.9469545818839638, 'learning_rate': 8.118012422360247e-07, 'completion_length': 106.10714721679688, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.17553051561117172, 'kl': 0.0494384765625, 'epoch': 0.94} 19%|█▉ | 303/1610 [3:24:38<33:26:54, 92.13s/it] 19%|█▉ | 304/1610 [3:28:10<46:31:37, 128.25s/it] {'loss': 0.002, 'grad_norm': 2.2437206837709582, 'learning_rate': 8.111801242236025e-07, 'completion_length': 101.51786422729492, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2967643216252327, 'kl': 0.05078125, 'epoch': 0.94} 19%|█▉ | 304/1610 [3:28:10<46:31:37, 128.25s/it] 19%|█▉ | 305/1610 [3:30:09<45:27:08, 125.39s/it] {'loss': 0.0021, 'grad_norm': 1.511438360593975, 'learning_rate': 8.105590062111801e-07, 'completion_length': 113.64286041259766, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.11266788095235825, 'kl': 0.0537109375, 'epoch': 0.95} 19%|█▉ | 305/1610 [3:30:09<45:27:08, 125.39s/it] 19%|█▉ | 306/1610 [3:31:37<41:23:02, 114.25s/it] {'loss': 0.0021, 'grad_norm': 1.102478935579657, 'learning_rate': 8.099378881987577e-07, 'completion_length': 111.37500381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 
'reward_std': 0.1181928962469101, 'kl': 0.05224609375, 'epoch': 0.95} 19%|█▉ | 306/1610 [3:31:37<41:23:02, 114.25s/it] 19%|█▉ | 307/1610 [3:32:50<36:48:54, 101.71s/it] {'loss': 0.0015, 'grad_norm': 1.5501151664051787, 'learning_rate': 8.093167701863354e-07, 'completion_length': 117.41072082519531, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.036865234375, 'epoch': 0.95} 19%|█▉ | 307/1610 [3:32:50<36:48:54, 101.71s/it] 19%|█▉ | 308/1610 [3:34:07<34:08:31, 94.40s/it] {'loss': 0.0018, 'grad_norm': 1.6585683495061934, 'learning_rate': 8.08695652173913e-07, 'completion_length': 109.23215103149414, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.18409645557403564, 'kl': 0.0462646484375, 'epoch': 0.96} 19%|█▉ | 308/1610 [3:34:07<34:08:31, 94.40s/it] 19%|█▉ | 309/1610 [3:34:55<29:02:04, 80.34s/it] {'loss': 0.0017, 'grad_norm': 1.3927036432656237, 'learning_rate': 8.080745341614906e-07, 'completion_length': 78.51786041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.18409644439816475, 'kl': 0.0428466796875, 'epoch': 0.96} 19%|█▉ | 309/1610 [3:34:55<29:02:04, 80.34s/it] 19%|█▉ | 310/1610 [3:36:14<28:56:18, 80.14s/it] {'loss': 0.0017, 'grad_norm': 1.2483040589119034, 'learning_rate': 8.074534161490683e-07, 'completion_length': 107.10714721679688, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.15943220257759094, 'kl': 0.043212890625, 'epoch': 0.96} 19%|█▉ | 310/1610 [3:36:14<28:56:18, 80.14s/it] 19%|█▉ | 311/1610 [3:37:32<28:41:07, 79.50s/it] {'loss': 0.0017, 'grad_norm': 0.9132353144695513, 'learning_rate': 8.068322981366459e-07, 'completion_length': 101.28571701049805, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 
'reward': 1.6785715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.0413818359375, 'epoch': 0.97} 19%|█▉ | 311/1610 [3:37:32<28:41:07, 79.50s/it] 19%|█▉ | 312/1610 [3:39:31<32:52:44, 91.19s/it] {'loss': 0.0022, 'grad_norm': 1.2964381014986122, 'learning_rate': 8.062111801242235e-07, 'completion_length': 130.4285774230957, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.2721000909805298, 'kl': 0.055419921875, 'epoch': 0.97} 19%|█▉ | 312/1610 [3:39:31<32:52:44, 91.19s/it] 19%|█▉ | 313/1610 [3:41:55<38:36:30, 107.16s/it] {'loss': 0.0024, 'grad_norm': 0.37557666693501107, 'learning_rate': 8.055900621118013e-07, 'completion_length': 119.26786422729492, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.059326171875, 'epoch': 0.97} 19%|█▉ | 313/1610 [3:41:55<38:36:30, 107.16s/it] 20%|█▉ | 314/1610 [3:43:58<40:16:08, 111.86s/it] {'loss': 0.0017, 'grad_norm': 0.6315446803993553, 'learning_rate': 8.049689440993789e-07, 'completion_length': 135.46429443359375, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.07695359364151955, 'kl': 0.04150390625, 'epoch': 0.98} 20%|█▉ | 314/1610 [3:43:58<40:16:08, 111.86s/it] 20%|█▉ | 315/1610 [3:45:59<41:12:33, 114.56s/it] {'loss': 0.0019, 'grad_norm': 2.257668652605337, 'learning_rate': 8.043478260869565e-07, 'completion_length': 120.25000381469727, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.3435286656022072, 'kl': 0.0472412109375, 'epoch': 0.98} 20%|█▉ | 315/1610 [3:45:59<41:12:33, 114.56s/it] 20%|█▉ | 316/1610 [3:46:30<32:13:03, 89.63s/it] {'loss': 0.0022, 'grad_norm': 1.669795582790065, 'learning_rate': 8.037267080745342e-07, 'completion_length': 136.28571701049805, 'rewards/accuracy_reward': 
0.3928571492433548, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.307814359664917, 'kl': 0.0540771484375, 'epoch': 0.98} 20%|█▉ | 316/1610 [3:46:30<32:13:03, 89.63s/it] 20%|█▉ | 317/1610 [3:47:56<31:48:30, 88.56s/it] {'loss': 0.0022, 'grad_norm': 1.3533264908951914, 'learning_rate': 8.031055900621118e-07, 'completion_length': 106.3035774230957, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.1896214708685875, 'kl': 0.056396484375, 'epoch': 0.98} 20%|█▉ | 317/1610 [3:47:57<31:48:30, 88.56s/it] 20%|█▉ | 318/1610 [3:49:49<34:25:16, 95.91s/it] {'loss': 0.0022, 'grad_norm': 1.7415160851933578, 'learning_rate': 8.024844720496894e-07, 'completion_length': 126.66072463989258, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.2610500454902649, 'kl': 0.0560302734375, 'epoch': 0.99} 20%|█▉ | 318/1610 [3:49:49<34:25:16, 95.91s/it] 20%|█▉ | 319/1610 [3:50:49<30:28:04, 84.96s/it] {'loss': 0.002, 'grad_norm': 1.6328325969144064, 'learning_rate': 8.018633540372671e-07, 'completion_length': 103.5535774230957, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.19514649361371994, 'kl': 0.0491943359375, 'epoch': 0.99} 20%|█▉ | 319/1610 [3:50:49<30:28:04, 84.96s/it] 20%|█▉ | 320/1610 [3:52:28<31:59:28, 89.28s/it] {'loss': 0.002, 'grad_norm': 2.247475116509527, 'learning_rate': 8.012422360248446e-07, 'completion_length': 110.08929061889648, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.1428571529686451, 'kl': 0.0494384765625, 'epoch': 0.99} 20%|█▉ | 320/1610 [3:52:28<31:59:28, 89.28s/it] 20%|█▉ | 321/1610 [3:55:07<39:23:54, 110.03s/it] {'loss': 0.002, 'grad_norm': 1.6201661463634809, 'learning_rate': 8.006211180124222e-07, 
'completion_length': 170.12500762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1539071798324585, 'kl': 0.0494384765625, 'epoch': 1.0} 20%|█▉ | 321/1610 [3:55:07<39:23:54, 110.03s/it] 20%|██ | 322/1610 [3:55:36<30:41:44, 85.80s/it] {'loss': 0.002, 'grad_norm': 0.9102521047585209, 'learning_rate': 8e-07, 'completion_length': 127.01786422729492, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.14838216453790665, 'kl': 0.051025390625, 'epoch': 1.0} 20%|██ | 322/1610 [3:55:36<30:41:44, 85.80s/it] 20%|██ | 323/1610 [3:56:36<27:56:37, 78.16s/it] {'loss': 0.0025, 'grad_norm': 1.7852302568024028, 'learning_rate': 7.993788819875776e-07, 'completion_length': 177.80357360839844, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.25552502274513245, 'kl': 0.061279296875, 'epoch': 1.0} 20%|██ | 323/1610 [3:56:36<27:56:37, 78.16s/it] 20%|██ | 324/1610 [3:57:02<22:17:08, 62.39s/it] {'loss': 0.0022, 'grad_norm': 0.8578412391249799, 'learning_rate': 7.987577639751552e-07, 'completion_length': 192.00000762939453, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5892857909202576, 'reward_std': 0.2610500492155552, 'kl': 0.055908203125, 'epoch': 1.01} 20%|██ | 324/1610 [3:57:02<22:17:08, 62.39s/it] 20%|██ | 325/1610 [3:57:27<18:15:17, 51.14s/it] {'loss': 0.0024, 'grad_norm': 1.362180420794446, 'learning_rate': 7.981366459627329e-07, 'completion_length': 157.60714721679688, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6785714626312256, 'reward_std': 0.2253357470035553, 'kl': 0.060791015625, 'epoch': 1.01} 20%|██ | 325/1610 [3:57:27<18:15:17, 51.14s/it] 20%|██ | 326/1610 [3:57:50<15:15:35, 42.78s/it] {'loss': 0.0023, 'grad_norm': 0.8043047858056444, 
'learning_rate': 7.975155279503105e-07, 'completion_length': 160.35715103149414, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4107143878936768, 'reward_std': 0.13981623202562332, 'kl': 0.05810546875, 'epoch': 1.01} 20%|██ | 326/1610 [3:57:50<15:15:35, 42.78s/it] 20%|██ | 327/1610 [3:58:12<13:02:23, 36.59s/it] {'loss': 0.0022, 'grad_norm': 2.042942070123853, 'learning_rate': 7.968944099378881e-07, 'completion_length': 148.4107208251953, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.29123932123184204, 'kl': 0.055908203125, 'epoch': 1.02} 20%|██ | 327/1610 [3:58:12<13:02:23, 36.59s/it] 20%|██ | 328/1610 [3:58:34<11:27:34, 32.18s/it] {'loss': 0.0021, 'grad_norm': 1.1286376087196102, 'learning_rate': 7.962732919254658e-07, 'completion_length': 151.32144165039062, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838216826319695, 'kl': 0.052978515625, 'epoch': 1.02} 20%|██ | 328/1610 [3:58:34<11:27:34, 32.18s/it] 20%|██ | 329/1610 [3:58:58<10:35:34, 29.77s/it] {'loss': 0.0017, 'grad_norm': 3.1549487768640145, 'learning_rate': 7.956521739130434e-07, 'completion_length': 166.1428680419922, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.2721000760793686, 'kl': 0.0435791015625, 'epoch': 1.02} 20%|██ | 329/1610 [3:58:58<10:35:34, 29.77s/it] 20%|██ | 330/1610 [3:59:25<10:14:34, 28.81s/it] {'loss': 0.0026, 'grad_norm': 0.8761881651895262, 'learning_rate': 7.95031055900621e-07, 'completion_length': 167.3214340209961, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.14838216826319695, 'kl': 0.064453125, 'epoch': 1.02} 20%|██ | 330/1610 [3:59:25<10:14:34, 28.81s/it] 21%|██ | 331/1610 
[3:59:46<9:24:17, 26.47s/it] {'loss': 0.0032, 'grad_norm': 2.421737828080679, 'learning_rate': 7.944099378881988e-07, 'completion_length': 141.7857208251953, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.07958984375, 'epoch': 1.03} 21%|██ | 331/1610 [3:59:46<9:24:17, 26.47s/it] 21%|██ | 332/1610 [4:00:11<9:15:05, 26.06s/it] {'loss': 0.002, 'grad_norm': 1.6085496921787406, 'learning_rate': 7.937888198757764e-07, 'completion_length': 147.64286041259766, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.2836569547653198, 'kl': 0.0504150390625, 'epoch': 1.03} 21%|██ | 332/1610 [4:00:11<9:15:05, 26.06s/it] 21%|██ | 333/1610 [4:00:33<8:50:56, 24.95s/it] {'loss': 0.0029, 'grad_norm': 2.1174589329863123, 'learning_rate': 7.93167701863354e-07, 'completion_length': 125.01786422729492, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.30228933691978455, 'kl': 0.072021484375, 'epoch': 1.03} 21%|██ | 333/1610 [4:00:33<8:50:56, 24.95s/it] 21%|██ | 334/1610 [4:00:59<8:56:29, 25.23s/it] {'loss': 0.0021, 'grad_norm': 0.6911832927348471, 'learning_rate': 7.925465838509317e-07, 'completion_length': 177.9107208251953, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.10410194098949432, 'kl': 0.0528564453125, 'epoch': 1.04} 21%|██ | 334/1610 [4:00:59<8:56:29, 25.23s/it] 21%|██ | 335/1610 [4:01:18<8:16:30, 23.37s/it] {'loss': 0.0018, 'grad_norm': 0.9635434223805077, 'learning_rate': 7.919254658385093e-07, 'completion_length': 121.85714721679688, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.07695359364151955, 'kl': 0.044677734375, 'epoch': 1.04} 21%|██ | 335/1610 
[4:01:18<8:16:30, 23.37s/it] 21%|██ | 336/1610 [4:01:41<8:15:38, 23.34s/it] {'loss': 0.0031, 'grad_norm': 2.7320711703500793, 'learning_rate': 7.913043478260869e-07, 'completion_length': 130.60714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.25552502274513245, 'kl': 0.077880859375, 'epoch': 1.04} 21%|██ | 336/1610 [4:01:41<8:15:38, 23.34s/it] 21%|██ | 337/1610 [4:02:03<8:02:49, 22.76s/it] {'loss': 0.002, 'grad_norm': 1.3316342423292586, 'learning_rate': 7.906832298136646e-07, 'completion_length': 139.1428680419922, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.0491943359375, 'epoch': 1.05} 21%|██ | 337/1610 [4:02:03<8:02:49, 22.76s/it] 21%|██ | 338/1610 [4:02:28<8:16:27, 23.42s/it] {'loss': 0.0041, 'grad_norm': 1.202021122274234, 'learning_rate': 7.900621118012422e-07, 'completion_length': 187.4821548461914, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.25552503019571304, 'kl': 0.102783203125, 'epoch': 1.05} 21%|██ | 338/1610 [4:02:28<8:16:27, 23.42s/it] 21%|██ | 339/1610 [4:02:45<7:38:11, 21.63s/it] {'loss': 0.003, 'grad_norm': 1.2077684819578405, 'learning_rate': 7.894409937888198e-07, 'completion_length': 155.10715103149414, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.07695359364151955, 'kl': 0.074951171875, 'epoch': 1.05} 21%|██ | 339/1610 [4:02:45<7:38:11, 21.63s/it] 21%|██ | 340/1610 [4:03:01<7:01:19, 19.91s/it] {'loss': 0.0022, 'grad_norm': 1.8418406132799476, 'learning_rate': 7.888198757763976e-07, 'completion_length': 124.37500762939453, 'rewards/accuracy_reward': 0.3035714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.1896214708685875, 'kl': 
0.05615234375, 'epoch': 1.06} 21%|██ | 340/1610 [4:03:01<7:01:19, 19.91s/it] 21%|██ | 341/1610 [4:03:21<6:57:50, 19.76s/it] {'loss': 0.0029, 'grad_norm': 0.8891950161283559, 'learning_rate': 7.881987577639752e-07, 'completion_length': 210.92858123779297, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.22327841073274612, 'kl': 0.07373046875, 'epoch': 1.06} 21%|██ | 341/1610 [4:03:21<6:57:50, 19.76s/it] 21%|██ | 342/1610 [4:03:35<6:26:10, 18.27s/it] {'loss': 0.0024, 'grad_norm': 1.0661031786229085, 'learning_rate': 7.875776397515528e-07, 'completion_length': 142.37500762939453, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.11266787722706795, 'kl': 0.0604248046875, 'epoch': 1.06} 21%|██ | 342/1610 [4:03:35<6:26:10, 18.27s/it] 21%|██▏ | 343/1610 [4:03:55<6:35:29, 18.73s/it] {'loss': 0.0018, 'grad_norm': 0.8025987973703044, 'learning_rate': 7.869565217391305e-07, 'completion_length': 183.37500762939453, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0443115234375, 'epoch': 1.07} 21%|██▏ | 343/1610 [4:03:55<6:35:29, 18.73s/it] 21%|██▏ | 344/1610 [4:04:13<6:30:14, 18.49s/it] {'loss': 0.0018, 'grad_norm': 0.9178518036694633, 'learning_rate': 7.86335403726708e-07, 'completion_length': 150.46429443359375, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.0462646484375, 'epoch': 1.07} 21%|██▏ | 344/1610 [4:04:13<6:30:14, 18.49s/it] 21%|██▏ | 345/1610 [4:05:46<14:20:18, 40.80s/it] {'loss': 0.0021, 'grad_norm': 1.7621230944364277, 'learning_rate': 7.857142857142856e-07, 'completion_length': 169.9107208251953, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.5178571939468384, 'reward_std': 0.30228933691978455, 'kl': 0.053466796875, 'epoch': 1.07} 21%|██▏ | 345/1610 [4:05:46<14:20:18, 40.80s/it] 21%|██▏ | 346/1610 [4:09:32<33:49:56, 96.36s/it] {'loss': 0.0025, 'grad_norm': 1.6635822628514707, 'learning_rate': 7.850931677018633e-07, 'completion_length': 173.87500762939453, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5000000596046448, 'reward_std': 0.2857142873108387, 'kl': 0.0635986328125, 'epoch': 1.07} 21%|██▏ | 346/1610 [4:09:32<33:49:56, 96.36s/it] 22%|██▏ | 347/1610 [4:11:50<38:09:03, 108.74s/it] {'loss': 0.0021, 'grad_norm': 1.9754078008795044, 'learning_rate': 7.844720496894409e-07, 'completion_length': 119.21429061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.26657506078481674, 'kl': 0.0535888671875, 'epoch': 1.08} 22%|██▏ | 347/1610 [4:11:51<38:09:03, 108.74s/it] 22%|██▏ | 348/1610 [4:12:40<32:00:55, 91.33s/it] {'loss': 0.0029, 'grad_norm': 1.8704797892577278, 'learning_rate': 7.838509316770185e-07, 'completion_length': 165.62500762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.29123930633068085, 'kl': 0.07275390625, 'epoch': 1.08} 22%|██▏ | 348/1610 [4:12:40<32:00:55, 91.33s/it] 22%|██▏ | 349/1610 [4:17:02<49:56:15, 142.57s/it] {'loss': 0.0039, 'grad_norm': 1.3225862283725762, 'learning_rate': 7.832298136645963e-07, 'completion_length': 167.51786041259766, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4107143878936768, 'reward_std': 0.30228935182094574, 'kl': 0.097900390625, 'epoch': 1.08} 22%|██▏ | 349/1610 [4:17:02<49:56:15, 142.57s/it] 22%|██▏ | 350/1610 [4:18:52<46:24:20, 132.59s/it] {'loss': 0.0035, 'grad_norm': 2.4453491706112183, 'learning_rate': 7.826086956521739e-07, 'completion_length': 
176.92858123779297, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3750000596046448, 'reward_std': 0.2826733663678169, 'kl': 0.086181640625, 'epoch': 1.09} 22%|██▏ | 350/1610 [4:18:52<46:24:20, 132.59s/it] 22%|██▏ | 351/1610 [4:21:14<47:21:44, 135.43s/it] {'loss': 0.0025, 'grad_norm': 0.536330425389689, 'learning_rate': 7.819875776397515e-07, 'completion_length': 178.55358123779297, 'rewards/accuracy_reward': 0.696428582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.063720703125, 'epoch': 1.09} 22%|██▏ | 351/1610 [4:21:14<47:21:44, 135.43s/it] 22%|██▏ | 352/1610 [4:27:13<70:48:29, 202.63s/it] {'loss': 0.0027, 'grad_norm': 1.1396275569388854, 'learning_rate': 7.813664596273292e-07, 'completion_length': 166.19644165039062, 'rewards/accuracy_reward': 0.3928571715950966, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.067626953125, 'epoch': 1.09} 22%|██▏ | 352/1610 [4:27:13<70:48:29, 202.63s/it] 22%|██▏ | 353/1610 [4:32:53<85:09:22, 243.88s/it] {'loss': 0.0043, 'grad_norm': 1.3023532926793406, 'learning_rate': 7.807453416149068e-07, 'completion_length': 210.62500762939453, 'rewards/accuracy_reward': 0.3392857387661934, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.21981074661016464, 'kl': 0.107666015625, 'epoch': 1.1} 22%|██▏ | 353/1610 [4:32:54<85:09:22, 243.88s/it] 22%|██▏ | 354/1610 [4:37:11<86:33:32, 248.10s/it] {'loss': 0.002, 'grad_norm': 1.4848794605938014, 'learning_rate': 7.801242236024844e-07, 'completion_length': 123.16072082519531, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838217198848724, 'kl': 0.0489501953125, 'epoch': 1.1} 22%|██▏ | 354/1610 [4:37:11<86:33:32, 248.10s/it] 22%|██▏ | 355/1610 [4:39:08<72:44:58, 208.68s/it] {'loss': 0.0026, 'grad_norm': 
1.9843529218290217, 'learning_rate': 7.79503105590062e-07, 'completion_length': 133.80358123779297, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.2253357619047165, 'kl': 0.06494140625, 'epoch': 1.1} 22%|██▏ | 355/1610 [4:39:08<72:44:58, 208.68s/it] 22%|██▏ | 356/1610 [4:40:00<56:16:18, 161.55s/it] {'loss': 0.0023, 'grad_norm': 1.2187157099095938, 'learning_rate': 7.788819875776397e-07, 'completion_length': 132.1785774230957, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3392857313156128, 'reward_std': 0.23689261823892593, 'kl': 0.058349609375, 'epoch': 1.11} 22%|██▏ | 356/1610 [4:40:00<56:16:18, 161.55s/it] 22%|██▏ | 357/1610 [4:43:57<64:09:22, 184.33s/it] {'loss': 0.0022, 'grad_norm': 2.24937951162751, 'learning_rate': 7.782608695652173e-07, 'completion_length': 133.57143020629883, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.30228933691978455, 'kl': 0.0540771484375, 'epoch': 1.11} 22%|██▏ | 357/1610 [4:43:57<64:09:22, 184.33s/it] 22%|██▏ | 358/1610 [4:48:03<70:29:16, 202.68s/it] {'loss': 0.0026, 'grad_norm': 0.9807413211271282, 'learning_rate': 7.776397515527951e-07, 'completion_length': 138.03571701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.1071428619325161, 'kl': 0.0640869140625, 'epoch': 1.11} 22%|██▏ | 358/1610 [4:48:03<70:29:16, 202.68s/it] 22%|██▏ | 359/1610 [4:52:41<78:22:07, 225.52s/it] {'loss': 0.0042, 'grad_norm': 2.0171381674448474, 'learning_rate': 7.770186335403727e-07, 'completion_length': 139.5357208251953, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.33800363540649414, 'kl': 0.1044921875, 'epoch': 1.11} 22%|██▏ | 359/1610 [4:52:41<78:22:07, 
225.52s/it] 22%|██▏ | 360/1610 [4:56:14<77:00:05, 221.76s/it] {'loss': 0.0018, 'grad_norm': 1.4525576522943602, 'learning_rate': 7.763975155279503e-07, 'completion_length': 113.3035774230957, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.1785714402794838, 'kl': 0.0438232421875, 'epoch': 1.12} 22%|██▏ | 360/1610 [4:56:14<77:00:05, 221.76s/it] 22%|██▏ | 361/1610 [4:59:02<71:15:53, 205.41s/it] {'loss': 0.0016, 'grad_norm': 1.5945539461355227, 'learning_rate': 7.75776397515528e-07, 'completion_length': 104.66071701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.1428571529686451, 'kl': 0.0390625, 'epoch': 1.12} 22%|██▏ | 361/1610 [4:59:02<71:15:53, 205.41s/it] 22%|██▏ | 362/1610 [5:01:49<67:18:36, 194.16s/it] {'loss': 0.0017, 'grad_norm': 3.1983600527296696, 'learning_rate': 7.751552795031056e-07, 'completion_length': 107.1785774230957, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.14838216826319695, 'kl': 0.0418701171875, 'epoch': 1.12} 22%|██▏ | 362/1610 [5:01:50<67:18:36, 194.16s/it] 23%|██▎ | 363/1610 [5:06:44<77:40:15, 224.23s/it] {'loss': 0.0035, 'grad_norm': 1.9837974116225014, 'learning_rate': 7.745341614906832e-07, 'completion_length': 109.9464340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.2967643439769745, 'kl': 0.088134765625, 'epoch': 1.13} 23%|██▎ | 363/1610 [5:06:44<77:40:15, 224.23s/it] 23%|██▎ | 364/1610 [5:10:33<78:06:46, 225.69s/it] {'loss': 0.0032, 'grad_norm': 1.9546972127532696, 'learning_rate': 7.739130434782608e-07, 'completion_length': 113.4285774230957, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.1785714402794838, 'kl': 0.0810546875, 
'epoch': 1.13} 23%|██▎ | 364/1610 [5:10:33<78:06:46, 225.69s/it] 23%|██▎ | 365/1610 [5:13:14<71:17:55, 206.16s/it] {'loss': 0.002, 'grad_norm': 0.6954358152771162, 'learning_rate': 7.732919254658385e-07, 'completion_length': 111.41072082519531, 'rewards/accuracy_reward': 0.1964285746216774, 'rewards/format_reward': 1.0, 'reward': 1.1964285969734192, 'reward_std': 0.07695358991622925, 'kl': 0.05029296875, 'epoch': 1.13} 23%|██▎ | 365/1610 [5:13:14<71:17:55, 206.16s/it] 23%|██▎ | 366/1610 [5:17:13<74:39:25, 216.05s/it] {'loss': 0.0024, 'grad_norm': 1.1387921712055051, 'learning_rate': 7.726708074534161e-07, 'completion_length': 123.51786422729492, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14838215708732605, 'kl': 0.0601806640625, 'epoch': 1.14} 23%|██▎ | 366/1610 [5:17:13<74:39:25, 216.05s/it] 23%|██▎ | 367/1610 [5:20:24<71:59:15, 208.49s/it] {'loss': 0.0024, 'grad_norm': 2.7798464929762488, 'learning_rate': 7.720496894409939e-07, 'completion_length': 114.64286041259766, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.1896214708685875, 'kl': 0.0599365234375, 'epoch': 1.14} 23%|██▎ | 367/1610 [5:20:24<71:59:15, 208.49s/it] 23%|██▎ | 368/1610 [5:24:29<75:46:41, 219.65s/it] {'loss': 0.0046, 'grad_norm': 1.4429908244640302, 'learning_rate': 7.714285714285714e-07, 'completion_length': 139.12500762939453, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.17651408910751343, 'kl': 0.1142578125, 'epoch': 1.14} 23%|██▎ | 368/1610 [5:24:29<75:46:41, 219.65s/it] 23%|██▎ | 369/1610 [5:28:18<76:37:00, 222.26s/it] {'loss': 0.0031, 'grad_norm': 1.3578081699497262, 'learning_rate': 7.70807453416149e-07, 'completion_length': 132.46429061889648, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.17553051561117172, 'kl': 0.0762939453125, 'epoch': 1.15} 23%|██▎ | 369/1610 [5:28:18<76:37:00, 222.26s/it] 23%|██▎ | 370/1610 [5:31:17<72:06:32, 209.35s/it] {'loss': 0.002, 'grad_norm': 2.1378006240615557, 'learning_rate': 7.701863354037266e-07, 'completion_length': 103.85714721679688, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.1539071872830391, 'kl': 0.0504150390625, 'epoch': 1.15} 23%|██▎ | 370/1610 [5:31:17<72:06:32, 209.35s/it] 23%|██▎ | 371/1610 [5:34:03<67:36:18, 196.43s/it] {'loss': 0.0026, 'grad_norm': 1.2965733975688627, 'learning_rate': 7.695652173913043e-07, 'completion_length': 109.98214340209961, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.23638580739498138, 'kl': 0.0640869140625, 'epoch': 1.15} 23%|██▎ | 371/1610 [5:34:03<67:36:18, 196.43s/it] 23%|██▎ | 372/1610 [5:37:35<69:07:41, 201.02s/it] {'loss': 0.0028, 'grad_norm': 2.244165736445397, 'learning_rate': 7.689440993788819e-07, 'completion_length': 124.39286041259766, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.2891819626092911, 'kl': 0.06884765625, 'epoch': 1.16} 23%|██▎ | 372/1610 [5:37:35<69:07:41, 201.02s/it] 23%|██▎ | 373/1610 [5:39:04<57:34:15, 167.55s/it] {'loss': 0.0023, 'grad_norm': 2.004562169890491, 'learning_rate': 7.683229813664595e-07, 'completion_length': 88.6964340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500454902649, 'kl': 0.0584716796875, 'epoch': 1.16} 23%|██▎ | 373/1610 [5:39:04<57:34:15, 167.55s/it] 23%|██▎ | 374/1610 [5:40:07<46:45:51, 136.21s/it] {'loss': 0.0034, 'grad_norm': 1.5180287739111917, 'learning_rate': 7.677018633540372e-07, 'completion_length': 120.1964340209961, 
'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.2142857313156128, 'kl': 0.08544921875, 'epoch': 1.16} 23%|██▎ | 374/1610 [5:40:07<46:45:51, 136.21s/it] 23%|██▎ | 375/1610 [5:40:30<35:03:21, 102.19s/it] {'loss': 0.003, 'grad_norm': 2.815478157084132, 'learning_rate': 7.670807453416148e-07, 'completion_length': 147.0714340209961, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.2253357544541359, 'kl': 0.076171875, 'epoch': 1.16} 23%|██▎ | 375/1610 [5:40:30<35:03:21, 102.19s/it] 23%|██▎ | 376/1610 [5:41:49<32:39:48, 95.29s/it] {'loss': 0.0026, 'grad_norm': 1.9459767508021737, 'learning_rate': 7.664596273291925e-07, 'completion_length': 105.51786041259766, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.1896214708685875, 'kl': 0.064208984375, 'epoch': 1.17} 23%|██▎ | 376/1610 [5:41:49<32:39:48, 95.29s/it] 23%|██▎ | 377/1610 [5:44:27<39:00:06, 113.87s/it] {'loss': 0.0017, 'grad_norm': 3.057980040314074, 'learning_rate': 7.658385093167702e-07, 'completion_length': 113.0535774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.11266787722706795, 'kl': 0.0421142578125, 'epoch': 1.17} 23%|██▎ | 377/1610 [5:44:27<39:00:06, 113.87s/it] 23%|██▎ | 378/1610 [5:47:37<46:50:37, 136.88s/it] {'loss': 0.0032, 'grad_norm': 1.6283051290644666, 'learning_rate': 7.652173913043478e-07, 'completion_length': 134.60714721679688, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.21981074661016464, 'kl': 0.0791015625, 'epoch': 1.17} 23%|██▎ | 378/1610 [5:47:37<46:50:37, 136.88s/it] 24%|██▎ | 379/1610 [5:52:05<60:17:16, 176.31s/it] {'loss': 0.0023, 'grad_norm': 1.495413896367883, 'learning_rate': 
7.645962732919254e-07, 'completion_length': 121.87500381469727, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.20117834210395813, 'kl': 0.0577392578125, 'epoch': 1.18} 24%|██▎ | 379/1610 [5:52:06<60:17:16, 176.31s/it] 24%|██▎ | 380/1610 [5:56:52<71:31:35, 209.35s/it] {'loss': 0.004, 'grad_norm': 1.59003211163333, 'learning_rate': 7.639751552795031e-07, 'completion_length': 126.82143783569336, 'rewards/accuracy_reward': 0.6250000447034836, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.14838215708732605, 'kl': 0.098876953125, 'epoch': 1.18} 24%|██▎ | 380/1610 [5:56:52<71:31:35, 209.35s/it] 24%|██▎ | 381/1610 [5:57:13<52:09:41, 152.79s/it] {'loss': 0.0026, 'grad_norm': 1.0446657795079266, 'learning_rate': 7.633540372670807e-07, 'completion_length': 126.62500762939453, 'rewards/accuracy_reward': 0.3035714328289032, 'rewards/format_reward': 1.0, 'reward': 1.3035715222358704, 'reward_std': 0.07695359364151955, 'kl': 0.0640869140625, 'epoch': 1.18} 24%|██▎ | 381/1610 [5:57:13<52:09:41, 152.79s/it] 24%|██▎ | 382/1610 [5:59:57<53:18:44, 156.29s/it] {'loss': 0.0023, 'grad_norm': 1.9124377041967537, 'learning_rate': 7.627329192546583e-07, 'completion_length': 127.78571701049805, 'rewards/accuracy_reward': 0.589285746216774, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.0565185546875, 'epoch': 1.19} 24%|██▎ | 382/1610 [5:59:57<53:18:44, 156.29s/it] 24%|██▍ | 383/1610 [6:03:26<58:39:18, 172.09s/it] {'loss': 0.0043, 'grad_norm': 1.5410243127245054, 'learning_rate': 7.62111801242236e-07, 'completion_length': 140.85715103149414, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.29123930633068085, 'kl': 0.108642578125, 'epoch': 1.19} 24%|██▍ | 383/1610 [6:03:26<58:39:18, 172.09s/it] 24%|██▍ | 384/1610 [6:05:12<51:52:44, 
152.34s/it] {'loss': 0.0026, 'grad_norm': 1.1865620494383433, 'learning_rate': 7.614906832298136e-07, 'completion_length': 122.73215103149414, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.065673828125, 'epoch': 1.19} 24%|██▍ | 384/1610 [6:05:12<51:52:44, 152.34s/it] 24%|██▍ | 385/1610 [6:08:47<58:13:19, 171.10s/it] {'loss': 0.0049, 'grad_norm': 3.024014189173442, 'learning_rate': 7.608695652173913e-07, 'completion_length': 161.89286422729492, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.571428656578064, 'reward_std': 0.28365693986415863, 'kl': 0.123291015625, 'epoch': 1.2} 24%|██▍ | 385/1610 [6:08:47<58:13:19, 171.10s/it] 24%|██▍ | 386/1610 [6:13:10<67:32:15, 198.64s/it] {'loss': 0.0041, 'grad_norm': 1.63843116282892, 'learning_rate': 7.60248447204969e-07, 'completion_length': 132.87500381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.2253357544541359, 'kl': 0.10205078125, 'epoch': 1.2} 24%|██▍ | 386/1610 [6:13:10<67:32:15, 198.64s/it] 24%|██▍ | 387/1610 [6:16:01<64:37:19, 190.22s/it] {'loss': 0.0042, 'grad_norm': 2.0075429713282285, 'learning_rate': 7.596273291925466e-07, 'completion_length': 142.4107208251953, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.104248046875, 'epoch': 1.2} 24%|██▍ | 387/1610 [6:16:01<64:37:19, 190.22s/it] 24%|██▍ | 388/1610 [6:17:31<54:23:36, 160.24s/it] {'loss': 0.0027, 'grad_norm': 2.1898913977311025, 'learning_rate': 7.590062111801242e-07, 'completion_length': 116.80358123779297, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.23086079210042953, 'kl': 0.066162109375, 'epoch': 1.2} 24%|██▍ | 388/1610 
[6:17:31<54:23:36, 160.24s/it] 24%|██▍ | 389/1610 [6:19:15<48:35:08, 143.25s/it] {'loss': 0.002, 'grad_norm': 1.8795075332827107, 'learning_rate': 7.583850931677019e-07, 'completion_length': 131.83929443359375, 'rewards/accuracy_reward': 0.4107143059372902, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.23086077719926834, 'kl': 0.0491943359375, 'epoch': 1.21} 24%|██▍ | 389/1610 [6:19:15<48:35:08, 143.25s/it] 24%|██▍ | 390/1610 [6:20:41<42:44:48, 126.14s/it] {'loss': 0.0019, 'grad_norm': 1.1688213753731986, 'learning_rate': 7.577639751552795e-07, 'completion_length': 104.3035774230957, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266788095235825, 'kl': 0.0474853515625, 'epoch': 1.21} 24%|██▍ | 390/1610 [6:20:41<42:44:48, 126.14s/it] 24%|██▍ | 391/1610 [6:22:21<40:06:36, 118.46s/it] {'loss': 0.0039, 'grad_norm': 1.4472084960789189, 'learning_rate': 7.57142857142857e-07, 'completion_length': 134.62500762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.2253357544541359, 'kl': 0.0986328125, 'epoch': 1.21} 24%|██▍ | 391/1610 [6:22:21<40:06:36, 118.46s/it] 24%|██▍ | 392/1610 [6:23:59<37:59:41, 112.30s/it] {'loss': 0.0035, 'grad_norm': 2.8938104002627836, 'learning_rate': 7.565217391304347e-07, 'completion_length': 147.25000762939453, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.26657507568597794, 'kl': 0.08740234375, 'epoch': 1.22} 24%|██▍ | 392/1610 [6:23:59<37:59:41, 112.30s/it] 24%|██▍ | 393/1610 [6:25:52<38:02:55, 112.55s/it] {'loss': 0.0018, 'grad_norm': 1.5050591326751284, 'learning_rate': 7.559006211180123e-07, 'completion_length': 119.30357360839844, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 
0.1428571492433548, 'kl': 0.0450439453125, 'epoch': 1.22} 24%|██▍ | 393/1610 [6:25:52<38:02:55, 112.55s/it] 24%|██▍ | 394/1610 [6:27:26<36:04:12, 106.79s/it] {'loss': 0.0023, 'grad_norm': 1.681766923802598, 'learning_rate': 7.5527950310559e-07, 'completion_length': 123.01786041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5714285969734192, 'reward_std': 0.18409645557403564, 'kl': 0.05712890625, 'epoch': 1.22} 24%|██▍ | 394/1610 [6:27:26<36:04:12, 106.79s/it] 25%|██▍ | 395/1610 [6:29:33<38:05:41, 112.87s/it] {'loss': 0.0032, 'grad_norm': 3.8787511786109627, 'learning_rate': 7.546583850931677e-07, 'completion_length': 166.9464340209961, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5357143878936768, 'reward_std': 0.26657505333423615, 'kl': 0.08056640625, 'epoch': 1.23} 25%|██▍ | 395/1610 [6:29:33<38:05:41, 112.87s/it] 25%|██▍ | 396/1610 [6:31:14<36:52:36, 109.35s/it] {'loss': 0.0028, 'grad_norm': 2.8820534297901, 'learning_rate': 7.540372670807453e-07, 'completion_length': 139.5357208251953, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.2610500529408455, 'kl': 0.070068359375, 'epoch': 1.23} 25%|██▍ | 396/1610 [6:31:14<36:52:36, 109.35s/it] 25%|██▍ | 397/1610 [6:31:56<30:01:21, 89.10s/it] {'loss': 0.0031, 'grad_norm': 1.3673831528239377, 'learning_rate': 7.534161490683229e-07, 'completion_length': 131.69643783569336, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.18409645557403564, 'kl': 0.0780029296875, 'epoch': 1.23} 25%|██▍ | 397/1610 [6:31:56<30:01:21, 89.10s/it] 25%|██▍ | 398/1610 [6:34:27<36:17:12, 107.78s/it] {'loss': 0.0023, 'grad_norm': 1.3671184618578642, 'learning_rate': 7.527950310559006e-07, 'completion_length': 116.00000762939453, 'rewards/accuracy_reward': 
0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1428571492433548, 'kl': 0.05712890625, 'epoch': 1.24} 25%|██▍ | 398/1610 [6:34:27<36:17:12, 107.78s/it] 25%|██▍ | 399/1610 [6:40:19<60:53:32, 181.02s/it] {'loss': 0.0029, 'grad_norm': 1.254264467335996, 'learning_rate': 7.521739130434782e-07, 'completion_length': 156.76786041259766, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.21981074661016464, 'kl': 0.07177734375, 'epoch': 1.24} 25%|██▍ | 399/1610 [6:40:19<60:53:32, 181.02s/it] 25%|██▍ | 400/1610 [6:42:26<55:21:23, 164.70s/it] {'loss': 0.0041, 'grad_norm': 1.3138899315071562, 'learning_rate': 7.515527950310558e-07, 'completion_length': 146.21429443359375, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.1019287109375, 'epoch': 1.24} 25%|██▍ | 400/1610 [6:42:26<55:21:23, 164.70s/it] 25%|██▍ | 401/1610 [6:47:29<69:16:12, 206.26s/it] {'loss': 0.0034, 'grad_norm': 1.7810688327315942, 'learning_rate': 7.509316770186335e-07, 'completion_length': 104.1964340209961, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.083984375, 'epoch': 1.25} 25%|██▍ | 401/1610 [6:47:29<69:16:12, 206.26s/it] 25%|██▍ | 402/1610 [6:48:29<54:32:29, 162.54s/it] {'loss': 0.0014, 'grad_norm': 3.692678411614389, 'learning_rate': 7.503105590062111e-07, 'completion_length': 95.03571701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.2610500454902649, 'kl': 0.03424072265625, 'epoch': 1.25} 25%|██▍ | 402/1610 [6:48:30<54:32:29, 162.54s/it] 25%|██▌ | 403/1610 [6:50:03<47:30:46, 141.71s/it] {'loss': 0.0036, 'grad_norm': 1.7895541794680156, 'learning_rate': 7.496894409937888e-07, 
'completion_length': 125.42858123779297, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.25552503019571304, 'kl': 0.090576171875, 'epoch': 1.25} 25%|██▌ | 403/1610 [6:50:03<47:30:46, 141.71s/it] 25%|██▌ | 404/1610 [6:51:22<41:14:08, 123.09s/it] {'loss': 0.002, 'grad_norm': 1.7975549811923712, 'learning_rate': 7.490683229813665e-07, 'completion_length': 104.28572082519531, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.14838216453790665, 'kl': 0.0489501953125, 'epoch': 1.25} 25%|██▌ | 404/1610 [6:51:22<41:14:08, 123.09s/it] 25%|██▌ | 405/1610 [6:53:15<40:08:52, 119.94s/it] {'loss': 0.0042, 'grad_norm': 2.2534022919888947, 'learning_rate': 7.484472049689441e-07, 'completion_length': 134.5535774230957, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 'reward_std': 0.2580091208219528, 'kl': 0.1043701171875, 'epoch': 1.26} 25%|██▌ | 405/1610 [6:53:15<40:08:52, 119.94s/it] 25%|██▌ | 406/1610 [6:55:47<43:19:12, 129.53s/it] {'loss': 0.0035, 'grad_norm': 1.5672513178654135, 'learning_rate': 7.478260869565217e-07, 'completion_length': 171.42858123779297, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.23086076974868774, 'kl': 0.087890625, 'epoch': 1.26} 25%|██▌ | 406/1610 [6:55:47<43:19:12, 129.53s/it] 25%|██▌ | 407/1610 [6:57:15<39:05:47, 117.00s/it] {'loss': 0.0016, 'grad_norm': 1.8570156603974144, 'learning_rate': 7.472049689440994e-07, 'completion_length': 114.6785774230957, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.0399169921875, 'epoch': 1.26} 25%|██▌ | 407/1610 [6:57:15<39:05:47, 117.00s/it] 25%|██▌ | 408/1610 [6:59:42<42:05:42, 126.07s/it] 
{'loss': 0.002, 'grad_norm': 1.4923722527732406, 'learning_rate': 7.46583850931677e-07, 'completion_length': 118.67857360839844, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.0511474609375, 'epoch': 1.27} 25%|██▌ | 408/1610 [6:59:42<42:05:42, 126.07s/it] 25%|██▌ | 409/1610 [7:02:18<45:07:11, 135.25s/it] {'loss': 0.0027, 'grad_norm': 1.8125225778452752, 'learning_rate': 7.459627329192546e-07, 'completion_length': 125.0535774230957, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214671432972, 'kl': 0.067626953125, 'epoch': 1.27} 25%|██▌ | 409/1610 [7:02:19<45:07:11, 135.25s/it] 25%|██▌ | 410/1610 [7:06:24<56:06:49, 168.34s/it] {'loss': 0.0031, 'grad_norm': 1.903030017572788, 'learning_rate': 7.453416149068323e-07, 'completion_length': 138.67857360839844, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.30228936672210693, 'kl': 0.0775146484375, 'epoch': 1.27} 25%|██▌ | 410/1610 [7:06:24<56:06:49, 168.34s/it] 26%|██▌ | 411/1610 [7:06:43<41:07:23, 123.47s/it] {'loss': 0.0018, 'grad_norm': 1.9956724863691113, 'learning_rate': 7.447204968944099e-07, 'completion_length': 101.28571701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.26657505333423615, 'kl': 0.0450439453125, 'epoch': 1.28} 26%|██▌ | 411/1610 [7:06:43<41:07:23, 123.47s/it] 26%|██▌ | 412/1610 [7:07:01<30:34:49, 91.89s/it] {'loss': 0.0028, 'grad_norm': 3.6675289344944986, 'learning_rate': 7.440993788819876e-07, 'completion_length': 96.37500381469727, 'rewards/accuracy_reward': 0.3392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.1785714402794838, 'kl': 0.06884765625, 'epoch': 1.28} 26%|██▌ | 412/1610 [7:07:01<30:34:49, 91.89s/it] 26%|██▌ 
| 413/1610 [7:07:20<23:14:28, 69.90s/it] {'loss': 0.0018, 'grad_norm': 2.8503217732140578, 'learning_rate': 7.434782608695653e-07, 'completion_length': 103.87500381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.29123931378126144, 'kl': 0.0452880859375, 'epoch': 1.28} 26%|██▌ | 413/1610 [7:07:20<23:14:28, 69.90s/it] 26%|██▌ | 414/1610 [7:07:39<18:11:22, 54.75s/it] {'loss': 0.002, 'grad_norm': 1.513837232451069, 'learning_rate': 7.428571428571429e-07, 'completion_length': 123.35715103149414, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.18409645557403564, 'kl': 0.0504150390625, 'epoch': 1.29} 26%|██▌ | 414/1610 [7:07:39<18:11:22, 54.75s/it] 26%|██▌ | 415/1610 [7:08:01<14:53:23, 44.86s/it] {'loss': 0.0019, 'grad_norm': 1.3138410309701771, 'learning_rate': 7.422360248447204e-07, 'completion_length': 129.48215103149414, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.21981073170900345, 'kl': 0.0467529296875, 'epoch': 1.29} 26%|██▌ | 415/1610 [7:08:01<14:53:23, 44.86s/it] 26%|██▌ | 416/1610 [7:08:20<12:22:46, 37.33s/it] {'loss': 0.0017, 'grad_norm': 0.9241328130736972, 'learning_rate': 7.416149068322981e-07, 'completion_length': 104.53572082519531, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0421142578125, 'epoch': 1.29} 26%|██▌ | 416/1610 [7:08:20<12:22:46, 37.33s/it] 26%|██▌ | 417/1610 [7:08:40<10:36:18, 32.00s/it] {'loss': 0.0024, 'grad_norm': 1.454763343108544, 'learning_rate': 7.409937888198757e-07, 'completion_length': 129.16071701049805, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.14838216826319695, 'kl': 0.06103515625, 'epoch': 1.3} 26%|██▌ | 417/1610 
[7:08:40<10:36:18, 32.00s/it] 26%|██▌ | 418/1610 [7:08:56<8:57:23, 27.05s/it] {'loss': 0.0016, 'grad_norm': 1.983256544992785, 'learning_rate': 7.403726708074533e-07, 'completion_length': 94.14286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1785714402794838, 'kl': 0.039306640625, 'epoch': 1.3} 26%|██▌ | 418/1610 [7:08:56<8:57:23, 27.05s/it] 26%|██▌ | 419/1610 [7:09:45<11:11:25, 33.83s/it] {'loss': 0.0028, 'grad_norm': 1.1540098538979766, 'learning_rate': 7.39751552795031e-07, 'completion_length': 164.5357208251953, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571429252624512, 'reward_std': 0.2253357619047165, 'kl': 0.07080078125, 'epoch': 1.3} 26%|██▌ | 419/1610 [7:09:45<11:11:25, 33.83s/it] 26%|██▌ | 420/1610 [7:11:24<17:40:15, 53.46s/it] {'loss': 0.0024, 'grad_norm': 2.3415239922295004, 'learning_rate': 7.391304347826086e-07, 'completion_length': 125.98214340209961, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.25552501529455185, 'kl': 0.0592041015625, 'epoch': 1.3} 26%|██▌ | 420/1610 [7:11:25<17:40:15, 53.46s/it] 26%|██▌ | 421/1610 [7:12:38<19:39:27, 59.52s/it] {'loss': 0.0019, 'grad_norm': 1.0736068536871588, 'learning_rate': 7.385093167701863e-07, 'completion_length': 112.71429061889648, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1071428619325161, 'kl': 0.0467529296875, 'epoch': 1.31} 26%|██▌ | 421/1610 [7:12:38<19:39:27, 59.52s/it] 26%|██▌ | 422/1610 [7:13:01<16:02:02, 48.59s/it] {'loss': 0.0023, 'grad_norm': 1.3512216770315426, 'learning_rate': 7.37888198757764e-07, 'completion_length': 130.69643783569336, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 
0.26353414729237556, 'kl': 0.0565185546875, 'epoch': 1.31} 26%|██▌ | 422/1610 [7:13:01<16:02:02, 48.59s/it] 26%|██▋ | 423/1610 [7:14:28<19:46:26, 59.97s/it] {'loss': 0.0021, 'grad_norm': 3.0015417018910573, 'learning_rate': 7.372670807453416e-07, 'completion_length': 127.28572463989258, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.1539071872830391, 'kl': 0.0518798828125, 'epoch': 1.31} 26%|██▋ | 423/1610 [7:14:28<19:46:26, 59.97s/it] 26%|██▋ | 424/1610 [7:16:31<26:01:51, 79.02s/it] {'loss': 0.0027, 'grad_norm': 1.2986910590614826, 'learning_rate': 7.366459627329192e-07, 'completion_length': 114.87500762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.0682373046875, 'epoch': 1.32} 26%|██▋ | 424/1610 [7:16:31<26:01:51, 79.02s/it] 26%|██▋ | 425/1610 [7:17:17<22:42:36, 68.99s/it] {'loss': 0.0021, 'grad_norm': 1.7501444439351994, 'learning_rate': 7.360248447204969e-07, 'completion_length': 140.89286422729492, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.30228934437036514, 'kl': 0.052978515625, 'epoch': 1.32} 26%|██▋ | 425/1610 [7:17:17<22:42:36, 68.99s/it] 26%|██▋ | 426/1610 [7:17:38<17:58:36, 54.66s/it] {'loss': 0.0017, 'grad_norm': 10.527621039299694, 'learning_rate': 7.354037267080745e-07, 'completion_length': 122.71429443359375, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.21981074661016464, 'kl': 0.0416259765625, 'epoch': 1.32} 26%|██▋ | 426/1610 [7:17:38<17:58:36, 54.66s/it] 27%|██▋ | 427/1610 [7:17:59<14:38:05, 44.54s/it] {'loss': 0.0024, 'grad_norm': 1.429774330016831, 'learning_rate': 7.347826086956521e-07, 'completion_length': 119.00000762939453, 'rewards/accuracy_reward': 0.4642857313156128, 
'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.2253357544541359, 'kl': 0.0596923828125, 'epoch': 1.33} 27%|██▋ | 427/1610 [7:17:59<14:38:05, 44.54s/it] 27%|██▋ | 428/1610 [7:18:20<12:19:18, 37.53s/it] {'loss': 0.002, 'grad_norm': 0.7388277878052791, 'learning_rate': 7.341614906832298e-07, 'completion_length': 145.08929443359375, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.11266788095235825, 'kl': 0.0496826171875, 'epoch': 1.33} 27%|██▋ | 428/1610 [7:18:20<12:19:18, 37.53s/it] 27%|██▋ | 429/1610 [7:18:41<10:40:12, 32.52s/it] {'loss': 0.0026, 'grad_norm': 1.3399484925465286, 'learning_rate': 7.335403726708074e-07, 'completion_length': 135.53571701049805, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.1539071798324585, 'kl': 0.06494140625, 'epoch': 1.33} 27%|██▋ | 429/1610 [7:18:41<10:40:12, 32.52s/it] 27%|██▋ | 430/1610 [7:19:29<12:10:08, 37.13s/it] {'loss': 0.0025, 'grad_norm': 1.468774821557782, 'learning_rate': 7.329192546583851e-07, 'completion_length': 118.8214340209961, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.1896214671432972, 'kl': 0.0616455078125, 'epoch': 1.34} 27%|██▋ | 430/1610 [7:19:29<12:10:08, 37.13s/it] 27%|██▋ | 431/1610 [7:20:35<15:01:42, 45.89s/it] {'loss': 0.0024, 'grad_norm': 3.025547357075443, 'learning_rate': 7.322981366459628e-07, 'completion_length': 126.5714340209961, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.14838216453790665, 'kl': 0.059814453125, 'epoch': 1.34} 27%|██▋ | 431/1610 [7:20:35<15:01:42, 45.89s/it] 27%|██▋ | 432/1610 [7:21:31<15:57:30, 48.77s/it] {'loss': 0.0015, 'grad_norm': 1.5887159712798995, 'learning_rate': 7.316770186335404e-07, 'completion_length': 
82.10714340209961, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.18409645557403564, 'kl': 0.03857421875, 'epoch': 1.34} 27%|██▋ | 432/1610 [7:21:31<15:57:30, 48.77s/it] 27%|██▋ | 433/1610 [7:22:47<18:37:58, 56.99s/it] {'loss': 0.002, 'grad_norm': 1.3080850437910392, 'learning_rate': 7.31055900621118e-07, 'completion_length': 124.89286422729492, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.25552501529455185, 'kl': 0.051025390625, 'epoch': 1.34} 27%|██▋ | 433/1610 [7:22:47<18:37:58, 56.99s/it] 27%|██▋ | 434/1610 [7:23:41<18:21:00, 56.17s/it] {'loss': 0.002, 'grad_norm': 1.024446013178308, 'learning_rate': 7.304347826086957e-07, 'completion_length': 137.75000381469727, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.11266787722706795, 'kl': 0.05029296875, 'epoch': 1.35} 27%|██▋ | 434/1610 [7:23:41<18:21:00, 56.17s/it] 27%|██▋ | 435/1610 [7:24:03<14:58:56, 45.90s/it] {'loss': 0.002, 'grad_norm': 1.9952011242678327, 'learning_rate': 7.298136645962733e-07, 'completion_length': 130.14286041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1896214634180069, 'kl': 0.0504150390625, 'epoch': 1.35} 27%|██▋ | 435/1610 [7:24:03<14:58:56, 45.90s/it] 27%|██▋ | 436/1610 [7:24:22<12:17:47, 37.71s/it] {'loss': 0.0016, 'grad_norm': 1.1719943905060082, 'learning_rate': 7.291925465838509e-07, 'completion_length': 128.9285774230957, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.18409644439816475, 'kl': 0.03955078125, 'epoch': 1.35} 27%|██▋ | 436/1610 [7:24:22<12:17:47, 37.71s/it] 27%|██▋ | 437/1610 [7:24:42<10:33:12, 32.39s/it] {'loss': 0.0019, 'grad_norm': 3.4881719779845826, 'learning_rate': 7.285714285714286e-07, 
'completion_length': 110.35715103149414, 'rewards/accuracy_reward': 0.3750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.21981073915958405, 'kl': 0.0467529296875, 'epoch': 1.36} 27%|██▋ | 437/1610 [7:24:42<10:33:12, 32.39s/it] 27%|██▋ | 438/1610 [7:25:00<9:10:52, 28.20s/it] {'loss': 0.0024, 'grad_norm': 1.8633943076771036, 'learning_rate': 7.279503105590061e-07, 'completion_length': 84.51786422729492, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.2253357619047165, 'kl': 0.0611572265625, 'epoch': 1.36} 27%|██▋ | 438/1610 [7:25:00<9:10:52, 28.20s/it] 27%|██▋ | 439/1610 [7:25:15<7:50:40, 24.12s/it] {'loss': 0.0014, 'grad_norm': 0.94858351009906, 'learning_rate': 7.273291925465838e-07, 'completion_length': 118.73215103149414, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.1428571492433548, 'kl': 0.0343017578125, 'epoch': 1.36} 27%|██▋ | 439/1610 [7:25:15<7:50:40, 24.12s/it] 27%|██▋ | 440/1610 [7:25:27<6:43:25, 20.69s/it] {'loss': 0.0027, 'grad_norm': 3.2289516683399073, 'learning_rate': 7.267080745341615e-07, 'completion_length': 111.58929061889648, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.2142857313156128, 'kl': 0.0665283203125, 'epoch': 1.37} 27%|██▋ | 440/1610 [7:25:27<6:43:25, 20.69s/it] 27%|██▋ | 441/1610 [7:25:44<6:19:27, 19.48s/it] {'loss': 0.0018, 'grad_norm': 5.641303534950194, 'learning_rate': 7.260869565217391e-07, 'completion_length': 129.87500762939453, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.1428571529686451, 'kl': 0.0458984375, 'epoch': 1.37} 27%|██▋ | 441/1610 [7:25:44<6:19:27, 19.48s/it] 27%|██▋ | 442/1610 [7:25:58<5:48:54, 17.92s/it] {'loss': 0.003, 'grad_norm': 2.0462802896785286, 'learning_rate': 
7.254658385093167e-07, 'completion_length': 104.0714340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.2937234118580818, 'kl': 0.0753173828125, 'epoch': 1.37} 27%|██▋ | 442/1610 [7:25:58<5:48:54, 17.92s/it] 28%|██▊ | 443/1610 [7:26:15<5:40:39, 17.51s/it] {'loss': 0.0017, 'grad_norm': 1.346818522139615, 'learning_rate': 7.248447204968943e-07, 'completion_length': 145.9464340209961, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.3435286581516266, 'kl': 0.0416259765625, 'epoch': 1.38} 28%|██▊ | 443/1610 [7:26:15<5:40:39, 17.51s/it] 28%|██▊ | 444/1610 [7:26:26<5:04:59, 15.69s/it] {'loss': 0.002, 'grad_norm': 1.4597566100904564, 'learning_rate': 7.24223602484472e-07, 'completion_length': 84.9464340209961, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266787722706795, 'kl': 0.04931640625, 'epoch': 1.38} 28%|██▊ | 444/1610 [7:26:26<5:04:59, 15.69s/it] 28%|██▊ | 445/1610 [7:26:43<5:11:21, 16.04s/it] {'loss': 0.003, 'grad_norm': 1.9896687051418889, 'learning_rate': 7.236024844720496e-07, 'completion_length': 120.91072082519531, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.1896214708685875, 'kl': 0.0750732421875, 'epoch': 1.38} 28%|██▊ | 445/1610 [7:26:43<5:11:21, 16.04s/it] 28%|██▊ | 446/1610 [7:26:55<4:48:25, 14.87s/it] {'loss': 0.0019, 'grad_norm': 1.4953583253510403, 'learning_rate': 7.229813664596272e-07, 'completion_length': 100.05357360839844, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.18409645557403564, 'kl': 0.048095703125, 'epoch': 1.39} 28%|██▊ | 446/1610 [7:26:55<4:48:25, 14.87s/it] 28%|██▊ | 447/1610 [7:27:08<4:37:33, 14.32s/it] {'loss': 0.0022, 
'grad_norm': 1.7505526240047826, 'learning_rate': 7.223602484472049e-07, 'completion_length': 111.8035774230957, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.21981072798371315, 'kl': 0.054443359375, 'epoch': 1.39} 28%|██▊ | 447/1610 [7:27:08<4:37:33, 14.32s/it] 28%|██▊ | 448/1610 [7:27:24<4:43:34, 14.64s/it] {'loss': 0.0022, 'grad_norm': 1.4904809256113445, 'learning_rate': 7.217391304347826e-07, 'completion_length': 131.00000762939453, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.15943220257759094, 'kl': 0.055419921875, 'epoch': 1.39} 28%|██▊ | 448/1610 [7:27:24<4:43:34, 14.64s/it] 28%|██▊ | 449/1610 [7:27:37<4:35:55, 14.26s/it] {'loss': 0.0021, 'grad_norm': 2.744314970055631, 'learning_rate': 7.211180124223603e-07, 'completion_length': 115.4285774230957, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.30228936672210693, 'kl': 0.0531005859375, 'epoch': 1.39} 28%|██▊ | 449/1610 [7:27:37<4:35:55, 14.26s/it] 28%|██▊ | 450/1610 [7:27:48<4:17:00, 13.29s/it] {'loss': 0.0019, 'grad_norm': 0.7391239764635058, 'learning_rate': 7.204968944099379e-07, 'completion_length': 85.28571701049805, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.0357142873108387, 'kl': 0.046875, 'epoch': 1.4} 28%|██▊ | 450/1610 [7:27:48<4:17:00, 13.29s/it] 28%|██▊ | 451/1610 [7:28:02<4:22:10, 13.57s/it] {'loss': 0.0026, 'grad_norm': 0.5377934261476482, 'learning_rate': 7.198757763975155e-07, 'completion_length': 112.00000762939453, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.064453125, 'epoch': 1.4} 28%|██▊ | 451/1610 [7:28:02<4:22:10, 13.57s/it] 28%|██▊ | 452/1610 [7:28:14<4:08:47, 12.89s/it] 
{'loss': 0.0021, 'grad_norm': 3.4218532458671236, 'learning_rate': 7.192546583850931e-07, 'completion_length': 92.03571701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.18409645557403564, 'kl': 0.05224609375, 'epoch': 1.4} 28%|██▊ | 452/1610 [7:28:14<4:08:47, 12.89s/it] 28%|██▊ | 453/1610 [7:28:24<3:54:58, 12.19s/it] {'loss': 0.0021, 'grad_norm': 0.8956672830674978, 'learning_rate': 7.186335403726708e-07, 'completion_length': 92.14286041259766, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0714285746216774, 'kl': 0.0517578125, 'epoch': 1.41} 28%|██▊ | 453/1610 [7:28:24<3:54:58, 12.19s/it] 28%|██▊ | 454/1610 [7:28:38<4:04:27, 12.69s/it] {'loss': 0.0024, 'grad_norm': 1.9049836089379566, 'learning_rate': 7.180124223602484e-07, 'completion_length': 118.48214721679688, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.1896214671432972, 'kl': 0.058837890625, 'epoch': 1.41} 28%|██▊ | 454/1610 [7:28:38<4:04:27, 12.69s/it] 28%|██▊ | 455/1610 [7:28:53<4:16:07, 13.31s/it] {'loss': 0.002, 'grad_norm': 2.2491239921309267, 'learning_rate': 7.17391304347826e-07, 'completion_length': 109.98214721679688, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.18409644439816475, 'kl': 0.051025390625, 'epoch': 1.41} 28%|██▊ | 455/1610 [7:28:53<4:16:07, 13.31s/it] 28%|██▊ | 456/1610 [7:29:07<4:19:38, 13.50s/it] {'loss': 0.0016, 'grad_norm': 0.8294319695086413, 'learning_rate': 7.167701863354037e-07, 'completion_length': 95.96429061889648, 'rewards/accuracy_reward': 0.7321428656578064, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.07695359364151955, 'kl': 0.038818359375, 'epoch': 1.42} 28%|██▊ | 456/1610 [7:29:07<4:19:38, 13.50s/it] 28%|██▊ | 457/1610 
[7:29:25<4:47:15, 14.95s/it] {'loss': 0.002, 'grad_norm': 2.006928171562083, 'learning_rate': 7.161490683229814e-07, 'completion_length': 141.6428680419922, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.2857142984867096, 'kl': 0.050048828125, 'epoch': 1.42} 28%|██▊ | 457/1610 [7:29:25<4:47:15, 14.95s/it] 28%|██▊ | 458/1610 [7:29:36<4:25:55, 13.85s/it] {'loss': 0.0028, 'grad_norm': 2.0221541150396507, 'learning_rate': 7.15527950310559e-07, 'completion_length': 98.07143020629883, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.1428571492433548, 'kl': 0.0689697265625, 'epoch': 1.42} 28%|██▊ | 458/1610 [7:29:36<4:25:55, 13.85s/it] 29%|██▊ | 459/1610 [7:29:50<4:22:13, 13.67s/it] {'loss': 0.003, 'grad_norm': 1.1156318544883443, 'learning_rate': 7.149068322981367e-07, 'completion_length': 114.01786422729492, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.075927734375, 'epoch': 1.43} 29%|██▊ | 459/1610 [7:29:50<4:22:13, 13.67s/it] 29%|██▊ | 460/1610 [7:30:03<4:22:46, 13.71s/it] {'loss': 0.0019, 'grad_norm': 3.1242614504580692, 'learning_rate': 7.142857142857143e-07, 'completion_length': 123.8035774230957, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.26657506823539734, 'kl': 0.0462646484375, 'epoch': 1.43} 29%|██▊ | 460/1610 [7:30:03<4:22:46, 13.71s/it] 29%|██▊ | 461/1610 [7:30:19<4:34:23, 14.33s/it] {'loss': 0.0022, 'grad_norm': 0.9282450064843556, 'learning_rate': 7.136645962732919e-07, 'completion_length': 123.41072082519531, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.1071428656578064, 'kl': 0.0546875, 'epoch': 1.43} 29%|██▊ | 461/1610 [7:30:19<4:34:23, 
14.33s/it] 29%|██▊ | 462/1610 [7:30:34<4:37:49, 14.52s/it] {'loss': 0.002, 'grad_norm': 5.528580684937611, 'learning_rate': 7.130434782608695e-07, 'completion_length': 111.30357360839844, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.25552501529455185, 'kl': 0.048828125, 'epoch': 1.43} 29%|██▊ | 462/1610 [7:30:34<4:37:49, 14.52s/it] 29%|██▉ | 463/1610 [7:30:53<5:04:25, 15.92s/it] {'loss': 0.0026, 'grad_norm': 0.952492147500249, 'learning_rate': 7.124223602484471e-07, 'completion_length': 144.75000762939453, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.0662841796875, 'epoch': 1.44} 29%|██▉ | 463/1610 [7:30:53<5:04:25, 15.92s/it] 29%|██▉ | 464/1610 [7:31:05<4:42:01, 14.77s/it] {'loss': 0.0019, 'grad_norm': 2.1521926017058166, 'learning_rate': 7.118012422360247e-07, 'completion_length': 89.82143020629883, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.25552503019571304, 'kl': 0.0487060546875, 'epoch': 1.44} 29%|██▉ | 464/1610 [7:31:05<4:42:01, 14.77s/it] 29%|██▉ | 465/1610 [7:31:17<4:26:28, 13.96s/it] {'loss': 0.0018, 'grad_norm': 0.8210523189302467, 'learning_rate': 7.111801242236024e-07, 'completion_length': 92.75000381469727, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.04123930633068085, 'kl': 0.0450439453125, 'epoch': 1.44} 29%|██▉ | 465/1610 [7:31:17<4:26:28, 13.96s/it] 29%|██▉ | 466/1610 [7:31:36<4:53:17, 15.38s/it] {'loss': 0.0021, 'grad_norm': 2.042843823585723, 'learning_rate': 7.105590062111801e-07, 'completion_length': 130.71428680419922, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.2253357619047165, 'kl': 0.05224609375, 'epoch': 1.45} 
29%|██▉ | 466/1610 [7:31:36<4:53:17, 15.38s/it] 29%|██▉ | 467/1610 [7:31:56<5:17:25, 16.66s/it] {'loss': 0.0033, 'grad_norm': 2.877172709433268, 'learning_rate': 7.099378881987577e-07, 'completion_length': 102.14286041259766, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.2253357544541359, 'kl': 0.083251953125, 'epoch': 1.45} 29%|██▉ | 467/1610 [7:31:56<5:17:25, 16.66s/it] 29%|██▉ | 468/1610 [7:32:14<5:28:50, 17.28s/it] {'loss': 0.0015, 'grad_norm': 1.4872297356139164, 'learning_rate': 7.093167701863354e-07, 'completion_length': 116.87500762939453, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409644067287445, 'kl': 0.03857421875, 'epoch': 1.45} 29%|██▉ | 468/1610 [7:32:14<5:28:50, 17.28s/it] 29%|██▉ | 469/1610 [7:32:33<5:36:00, 17.67s/it] {'loss': 0.0026, 'grad_norm': 2.1376329072947167, 'learning_rate': 7.08695652173913e-07, 'completion_length': 98.39286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.14838217198848724, 'kl': 0.0640869140625, 'epoch': 1.46} 29%|██▉ | 469/1610 [7:32:33<5:36:00, 17.67s/it] 29%|██▉ | 470/1610 [7:32:50<5:32:49, 17.52s/it] {'loss': 0.0021, 'grad_norm': 1.8730475572320497, 'learning_rate': 7.080745341614906e-07, 'completion_length': 86.76786041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.21981073915958405, 'kl': 0.052001953125, 'epoch': 1.46} 29%|██▉ | 470/1610 [7:32:50<5:32:49, 17.52s/it] 29%|██▉ | 471/1610 [7:33:09<5:37:10, 17.76s/it] {'loss': 0.0042, 'grad_norm': 1.5947841380622376, 'learning_rate': 7.074534161490683e-07, 'completion_length': 109.17857360839844, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.2142857238650322, 'kl': 
0.103759765625, 'epoch': 1.46} 29%|██▉ | 471/1610 [7:33:09<5:37:10, 17.76s/it] 29%|██▉ | 472/1610 [7:33:21<5:06:21, 16.15s/it] {'loss': 0.002, 'grad_norm': 1.86964386954041, 'learning_rate': 7.068322981366459e-07, 'completion_length': 74.66071891784668, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1896214783191681, 'kl': 0.05029296875, 'epoch': 1.47} 29%|██▉ | 472/1610 [7:33:21<5:06:21, 16.15s/it] 29%|██▉ | 473/1610 [7:33:32<4:36:57, 14.62s/it] {'loss': 0.0026, 'grad_norm': 0.9111090971180226, 'learning_rate': 7.062111801242235e-07, 'completion_length': 83.5535774230957, 'rewards/accuracy_reward': 0.357142873108387, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.11266788095235825, 'kl': 0.0638427734375, 'epoch': 1.47} 29%|██▉ | 473/1610 [7:33:32<4:36:57, 14.62s/it] 29%|██▉ | 474/1610 [7:33:43<4:18:16, 13.64s/it] {'loss': 0.002, 'grad_norm': 1.2807346450097963, 'learning_rate': 7.055900621118012e-07, 'completion_length': 90.07143020629883, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.05029296875, 'epoch': 1.47} 29%|██▉ | 474/1610 [7:33:43<4:18:16, 13.64s/it] 30%|██▉ | 475/1610 [7:33:55<4:04:42, 12.94s/it] {'loss': 0.0023, 'grad_norm': 1.4262882258233875, 'learning_rate': 7.049689440993789e-07, 'completion_length': 89.46428680419922, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.0579833984375, 'epoch': 1.48} 30%|██▉ | 475/1610 [7:33:55<4:04:42, 12.94s/it] 30%|██▉ | 476/1610 [7:34:07<4:02:29, 12.83s/it] {'loss': 0.0018, 'grad_norm': 1.1480253879078481, 'learning_rate': 7.043478260869565e-07, 'completion_length': 92.6785774230957, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 
0.1539071872830391, 'kl': 0.0460205078125, 'epoch': 1.48} 30%|██▉ | 476/1610 [7:34:07<4:02:29, 12.83s/it] 30%|██▉ | 477/1610 [7:34:18<3:51:09, 12.24s/it] {'loss': 0.0022, 'grad_norm': 2.1837908383263898, 'learning_rate': 7.037267080745342e-07, 'completion_length': 75.64286041259766, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035715222358704, 'reward_std': 0.1071428619325161, 'kl': 0.054443359375, 'epoch': 1.48} 30%|██▉ | 477/1610 [7:34:18<3:51:09, 12.24s/it] 30%|██▉ | 478/1610 [7:34:31<3:55:46, 12.50s/it] {'loss': 0.0016, 'grad_norm': 1.757807031136192, 'learning_rate': 7.031055900621118e-07, 'completion_length': 104.98214721679688, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.2363857924938202, 'kl': 0.038818359375, 'epoch': 1.48} 30%|██▉ | 478/1610 [7:34:31<3:55:46, 12.50s/it] 30%|██▉ | 479/1610 [7:34:45<4:03:45, 12.93s/it] {'loss': 0.0024, 'grad_norm': 0.6842644084792515, 'learning_rate': 7.024844720496894e-07, 'completion_length': 107.28572082519531, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.07695359364151955, 'kl': 0.0599365234375, 'epoch': 1.49} 30%|██▉ | 479/1610 [7:34:45<4:03:45, 12.93s/it] 30%|██▉ | 480/1610 [7:34:59<4:10:52, 13.32s/it] {'loss': 0.0017, 'grad_norm': 2.7779568907019128, 'learning_rate': 7.018633540372671e-07, 'completion_length': 101.33929061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.2721000909805298, 'kl': 0.04345703125, 'epoch': 1.49} 30%|██▉ | 480/1610 [7:34:59<4:10:52, 13.32s/it] 30%|██▉ | 481/1610 [7:35:17<4:33:10, 14.52s/it] {'loss': 0.0051, 'grad_norm': 2.4941624458142773, 'learning_rate': 7.012422360248447e-07, 'completion_length': 130.76786041259766, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.660714328289032, 'reward_std': 0.23086077719926834, 'kl': 0.12646484375, 'epoch': 1.49} 30%|██▉ | 481/1610 [7:35:17<4:33:10, 14.52s/it] 30%|██▉ | 482/1610 [7:35:31<4:29:50, 14.35s/it] {'loss': 0.0033, 'grad_norm': 2.01463815319075, 'learning_rate': 7.006211180124223e-07, 'completion_length': 109.03571701049805, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.1539071798324585, 'kl': 0.082763671875, 'epoch': 1.5} 30%|██▉ | 482/1610 [7:35:31<4:29:50, 14.35s/it] 30%|███ | 483/1610 [7:35:44<4:26:40, 14.20s/it] {'loss': 0.0024, 'grad_norm': 1.5976635323439037, 'learning_rate': 7e-07, 'completion_length': 115.28572082519531, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.26657506078481674, 'kl': 0.059814453125, 'epoch': 1.5} 30%|███ | 483/1610 [7:35:44<4:26:40, 14.20s/it] 30%|███ | 484/1610 [7:35:58<4:21:05, 13.91s/it] {'loss': 0.0018, 'grad_norm': 0.9836327729208467, 'learning_rate': 6.993788819875777e-07, 'completion_length': 104.73214721679688, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266787722706795, 'kl': 0.0450439453125, 'epoch': 1.5} 30%|███ | 484/1610 [7:35:58<4:21:05, 13.91s/it] 30%|███ | 485/1610 [7:36:12<4:25:43, 14.17s/it] {'loss': 0.0023, 'grad_norm': 2.0935632584146076, 'learning_rate': 6.987577639751553e-07, 'completion_length': 116.08929061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.25552501529455185, 'kl': 0.0574951171875, 'epoch': 1.51} 30%|███ | 485/1610 [7:36:13<4:25:43, 14.17s/it] 30%|███ | 486/1610 [7:36:27<4:27:32, 14.28s/it] {'loss': 0.0025, 'grad_norm': 3.0303922367032343, 'learning_rate': 6.981366459627329e-07, 'completion_length': 115.41072082519531, 'rewards/accuracy_reward': 0.5892857611179352, 
'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.21981074661016464, 'kl': 0.0638427734375, 'epoch': 1.51} 30%|███ | 486/1610 [7:36:27<4:27:32, 14.28s/it] 30%|███ | 487/1610 [7:36:40<4:22:33, 14.03s/it] {'loss': 0.0025, 'grad_norm': 2.1552828164006046, 'learning_rate': 6.975155279503105e-07, 'completion_length': 101.0714340209961, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.19514648616313934, 'kl': 0.062255859375, 'epoch': 1.51} 30%|███ | 487/1610 [7:36:40<4:22:33, 14.03s/it] 30%|███ | 488/1610 [7:36:58<4:40:53, 15.02s/it] {'loss': 0.0021, 'grad_norm': 0.9057395543006086, 'learning_rate': 6.968944099378881e-07, 'completion_length': 144.75000762939453, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.11266788095235825, 'kl': 0.052734375, 'epoch': 1.52} 30%|███ | 488/1610 [7:36:58<4:40:53, 15.02s/it] 30%|███ | 489/1610 [7:37:11<4:29:00, 14.40s/it] {'loss': 0.0022, 'grad_norm': 1.9706380909619876, 'learning_rate': 6.962732919254658e-07, 'completion_length': 113.4464340209961, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.3078143745660782, 'kl': 0.0550537109375, 'epoch': 1.52} 30%|███ | 489/1610 [7:37:11<4:29:00, 14.40s/it] 30%|███ | 490/1610 [7:37:27<4:39:10, 14.96s/it] {'loss': 0.0019, 'grad_norm': 1.393149534946784, 'learning_rate': 6.956521739130434e-07, 'completion_length': 129.08928680419922, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500454902649, 'kl': 0.046630859375, 'epoch': 1.52} 30%|███ | 490/1610 [7:37:27<4:39:10, 14.96s/it] 30%|███ | 491/1610 [7:37:43<4:46:54, 15.38s/it] {'loss': 0.0018, 'grad_norm': 0.8858184193041018, 'learning_rate': 6.95031055900621e-07, 'completion_length': 124.83929061889648, 'rewards/accuracy_reward': 
0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.1071428619325161, 'kl': 0.045166015625, 'epoch': 1.52} 30%|███ | 491/1610 [7:37:43<4:46:54, 15.38s/it] 31%|███ | 492/1610 [7:37:56<4:30:04, 14.49s/it] {'loss': 0.002, 'grad_norm': 1.6738545662194961, 'learning_rate': 6.944099378881987e-07, 'completion_length': 109.48214721679688, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.18409644439816475, 'kl': 0.049560546875, 'epoch': 1.53} 31%|███ | 492/1610 [7:37:56<4:30:04, 14.49s/it] 31%|███ | 493/1610 [7:38:11<4:35:09, 14.78s/it] {'loss': 0.0027, 'grad_norm': 1.3375128603563564, 'learning_rate': 6.937888198757764e-07, 'completion_length': 120.51786422729492, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.1428571529686451, 'kl': 0.0673828125, 'epoch': 1.53} 31%|███ | 493/1610 [7:38:11<4:35:09, 14.78s/it] 31%|███ | 494/1610 [7:38:25<4:27:59, 14.41s/it] {'loss': 0.0026, 'grad_norm': 1.0765981579853845, 'learning_rate': 6.93167701863354e-07, 'completion_length': 115.46428680419922, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.11266788095235825, 'kl': 0.065673828125, 'epoch': 1.53} 31%|███ | 494/1610 [7:38:25<4:27:59, 14.41s/it] 31%|███ | 495/1610 [7:38:38<4:21:24, 14.07s/it] {'loss': 0.0025, 'grad_norm': 1.2696889035513446, 'learning_rate': 6.925465838509317e-07, 'completion_length': 117.1785774230957, 'rewards/accuracy_reward': 0.6250000447034836, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1785714328289032, 'kl': 0.0626220703125, 'epoch': 1.54} 31%|███ | 495/1610 [7:38:38<4:21:24, 14.07s/it] 31%|███ | 496/1610 [7:38:51<4:12:40, 13.61s/it] {'loss': 0.0022, 'grad_norm': 1.678410372178947, 'learning_rate': 6.919254658385093e-07, 'completion_length': 104.80357360839844, 
'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.14838216453790665, 'kl': 0.055908203125, 'epoch': 1.54} 31%|███ | 496/1610 [7:38:51<4:12:40, 13.61s/it] 31%|███ | 497/1610 [7:39:04<4:08:52, 13.42s/it] {'loss': 0.0021, 'grad_norm': 1.4920502345696476, 'learning_rate': 6.913043478260869e-07, 'completion_length': 108.3035774230957, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266787722706795, 'kl': 0.05303955078125, 'epoch': 1.54} 31%|███ | 497/1610 [7:39:04<4:08:52, 13.42s/it] 31%|███ | 498/1610 [7:39:15<3:56:00, 12.73s/it] {'loss': 0.002, 'grad_norm': 1.779845961642922, 'learning_rate': 6.906832298136646e-07, 'completion_length': 74.0535774230957, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.2142857238650322, 'kl': 0.051025390625, 'epoch': 1.55} 31%|███ | 498/1610 [7:39:15<3:56:00, 12.73s/it] 31%|███ | 499/1610 [7:39:29<4:02:59, 13.12s/it] {'loss': 0.0024, 'grad_norm': 2.6249725492603453, 'learning_rate': 6.900621118012422e-07, 'completion_length': 108.12500381469727, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1785714402794838, 'kl': 0.05908203125, 'epoch': 1.55} 31%|███ | 499/1610 [7:39:29<4:02:59, 13.12s/it] 31%|███ | 500/1610 [7:39:45<4:22:34, 14.19s/it] {'loss': 0.0024, 'grad_norm': 3.1232858617691166, 'learning_rate': 6.894409937888198e-07, 'completion_length': 102.33929061889648, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.26657506078481674, 'kl': 0.05908203125, 'epoch': 1.55} 31%|███ | 500/1610 [7:39:45<4:22:34, 14.19s/it] 31%|███ | 501/1610 [7:43:05<21:32:21, 69.92s/it] {'loss': 0.0024, 'grad_norm': 2.135895322332985, 'learning_rate': 6.888198757763975e-07, 
'completion_length': 116.41072082519531, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2253357544541359, 'kl': 0.0595703125, 'epoch': 1.56} 31%|███ | 501/1610 [7:43:05<21:32:21, 69.92s/it] 31%|███ | 502/1610 [7:43:15<15:57:50, 51.87s/it] {'loss': 0.002, 'grad_norm': 1.7314564412030642, 'learning_rate': 6.881987577639752e-07, 'completion_length': 80.25000381469727, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.18409645557403564, 'kl': 0.0499267578125, 'epoch': 1.56} 31%|███ | 502/1610 [7:43:15<15:57:50, 51.87s/it] 31%|███ | 503/1610 [7:43:28<12:23:50, 40.32s/it] {'loss': 0.0016, 'grad_norm': 1.686953004121729, 'learning_rate': 6.875776397515528e-07, 'completion_length': 120.50000762939453, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.2500000149011612, 'kl': 0.03924560546875, 'epoch': 1.56} 31%|███ | 503/1610 [7:43:29<12:23:50, 40.32s/it] 31%|███▏ | 504/1610 [7:43:42<9:54:13, 32.24s/it] {'loss': 0.0024, 'grad_norm': 1.8835455132190004, 'learning_rate': 6.869565217391305e-07, 'completion_length': 100.9464340209961, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.21981074661016464, 'kl': 0.0606689453125, 'epoch': 1.57} 31%|███▏ | 504/1610 [7:43:42<9:54:13, 32.24s/it] 31%|███▏ | 505/1610 [7:43:55<8:08:01, 26.50s/it] {'loss': 0.0017, 'grad_norm': 0.9000928195511089, 'learning_rate': 6.863354037267081e-07, 'completion_length': 101.57143020629883, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266787722706795, 'kl': 0.043212890625, 'epoch': 1.57} 31%|███▏ | 505/1610 [7:43:55<8:08:01, 26.50s/it] 31%|███▏ | 506/1610 [7:44:07<6:49:00, 22.23s/it] {'loss': 0.0017, 'grad_norm': 5.837016141478413, 
'learning_rate': 6.857142857142857e-07, 'completion_length': 96.41071701049805, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.14838216453790665, 'kl': 0.043701171875, 'epoch': 1.57} 31%|███▏ | 506/1610 [7:44:07<6:49:00, 22.23s/it] 31%|███▏ | 507/1610 [7:44:20<5:59:00, 19.53s/it] {'loss': 0.0023, 'grad_norm': 3.2556158781202464, 'learning_rate': 6.850931677018634e-07, 'completion_length': 104.78572082519531, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.11266787722706795, 'kl': 0.057861328125, 'epoch': 1.57} 31%|███▏ | 507/1610 [7:44:20<5:59:00, 19.53s/it] 32%|███▏ | 508/1610 [7:44:37<5:40:21, 18.53s/it] {'loss': 0.0021, 'grad_norm': 1.8589807939687018, 'learning_rate': 6.84472049689441e-07, 'completion_length': 122.10714721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.2610500454902649, 'kl': 0.0526123046875, 'epoch': 1.58} 32%|███▏ | 508/1610 [7:44:37<5:40:21, 18.53s/it] 32%|███▏ | 509/1610 [7:44:48<4:59:22, 16.31s/it] {'loss': 0.0029, 'grad_norm': 0.9991614344250872, 'learning_rate': 6.838509316770185e-07, 'completion_length': 106.50000381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1071428619325161, 'kl': 0.0714111328125, 'epoch': 1.58} 32%|███▏ | 509/1610 [7:44:48<4:59:22, 16.31s/it] 32%|███▏ | 510/1610 [7:45:02<4:48:13, 15.72s/it] {'loss': 0.0022, 'grad_norm': 4.596051872755181, 'learning_rate': 6.832298136645962e-07, 'completion_length': 122.39286041259766, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2500000074505806, 'kl': 0.054931640625, 'epoch': 1.58} 32%|███▏ | 510/1610 [7:45:02<4:48:13, 15.72s/it] 32%|███▏ | 511/1610 [7:45:15<4:33:43, 14.94s/it] {'loss': 0.0019, 
'grad_norm': 1.9757554775560375, 'learning_rate': 6.826086956521738e-07, 'completion_length': 117.5535774230957, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1785714402794838, 'kl': 0.04736328125, 'epoch': 1.59} 32%|███▏ | 511/1610 [7:45:15<4:33:43, 14.94s/it] 32%|███▏ | 512/1610 [7:45:33<4:46:40, 15.67s/it] {'loss': 0.0024, 'grad_norm': 2.146756634268676, 'learning_rate': 6.819875776397515e-07, 'completion_length': 120.64286041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.1071428619325161, 'kl': 0.0601806640625, 'epoch': 1.59} 32%|███▏ | 512/1610 [7:45:33<4:46:40, 15.67s/it] 32%|███▏ | 513/1610 [7:45:46<4:32:43, 14.92s/it] {'loss': 0.0022, 'grad_norm': 1.1815087839286593, 'learning_rate': 6.813664596273292e-07, 'completion_length': 114.08929061889648, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.1428571529686451, 'kl': 0.053955078125, 'epoch': 1.59} 32%|███▏ | 513/1610 [7:45:46<4:32:43, 14.92s/it] 32%|███▏ | 514/1610 [7:46:00<4:29:27, 14.75s/it] {'loss': 0.0023, 'grad_norm': 0.9006227598132843, 'learning_rate': 6.807453416149068e-07, 'completion_length': 127.87500381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.15943220257759094, 'kl': 0.058349609375, 'epoch': 1.6} 32%|███▏ | 514/1610 [7:46:00<4:29:27, 14.75s/it] 32%|███▏ | 515/1610 [7:46:15<4:29:02, 14.74s/it] {'loss': 0.0023, 'grad_norm': 1.8881289962122303, 'learning_rate': 6.801242236024844e-07, 'completion_length': 114.48215103149414, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.19514649361371994, 'kl': 0.056396484375, 'epoch': 1.6} 32%|███▏ | 515/1610 [7:46:15<4:29:02, 14.74s/it] 32%|███▏ | 516/1610 
[7:46:29<4:25:53, 14.58s/it] {'loss': 0.0022, 'grad_norm': 4.940503468328761, 'learning_rate': 6.795031055900621e-07, 'completion_length': 100.4464340209961, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.0548095703125, 'epoch': 1.6} 32%|███▏ | 516/1610 [7:46:29<4:25:53, 14.58s/it] 32%|███▏ | 517/1610 [7:46:44<4:28:30, 14.74s/it] {'loss': 0.0021, 'grad_norm': 1.4846332704133132, 'learning_rate': 6.788819875776397e-07, 'completion_length': 96.98214721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1785714402794838, 'kl': 0.0531005859375, 'epoch': 1.61} 32%|███▏ | 517/1610 [7:46:44<4:28:30, 14.74s/it] 32%|███▏ | 518/1610 [7:46:59<4:27:47, 14.71s/it] {'loss': 0.0021, 'grad_norm': 1.5414382098778856, 'learning_rate': 6.782608695652173e-07, 'completion_length': 124.4285774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.19514649361371994, 'kl': 0.0535888671875, 'epoch': 1.61} 32%|███▏ | 518/1610 [7:46:59<4:27:47, 14.71s/it] 32%|███▏ | 519/1610 [7:47:12<4:16:12, 14.09s/it] {'loss': 0.0018, 'grad_norm': 2.3084506374434675, 'learning_rate': 6.77639751552795e-07, 'completion_length': 103.0714340209961, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.18409644439816475, 'kl': 0.04443359375, 'epoch': 1.61} 32%|███▏ | 519/1610 [7:47:12<4:16:12, 14.09s/it] 32%|███▏ | 520/1610 [7:47:26<4:17:35, 14.18s/it] {'loss': 0.002, 'grad_norm': 0.8684137434715279, 'learning_rate': 6.770186335403726e-07, 'completion_length': 127.73215103149414, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.07695359364151955, 'kl': 0.049560546875, 'epoch': 1.61} 32%|███▏ | 520/1610 
[7:47:26<4:17:35, 14.18s/it] 32%|███▏ | 521/1610 [7:47:38<4:08:41, 13.70s/it] {'loss': 0.0024, 'grad_norm': 1.5487930654949664, 'learning_rate': 6.763975155279503e-07, 'completion_length': 108.35714721679688, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1896214708685875, 'kl': 0.0589599609375, 'epoch': 1.62} 32%|███▏ | 521/1610 [7:47:38<4:08:41, 13.70s/it] 32%|███▏ | 522/1610 [7:47:57<4:33:01, 15.06s/it] {'loss': 0.0021, 'grad_norm': 3.1264763024806377, 'learning_rate': 6.75776397515528e-07, 'completion_length': 131.3035774230957, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.3545787185430527, 'kl': 0.05126953125, 'epoch': 1.62} 32%|███▏ | 522/1610 [7:47:57<4:33:01, 15.06s/it] 32%|███▏ | 523/1610 [7:48:19<5:11:53, 17.22s/it] {'loss': 0.0024, 'grad_norm': 1.3845515767072742, 'learning_rate': 6.751552795031056e-07, 'completion_length': 129.01786422729492, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785714626312256, 'reward_std': 0.2253357656300068, 'kl': 0.060791015625, 'epoch': 1.62} 32%|███▏ | 523/1610 [7:48:19<5:11:53, 17.22s/it] 33%|███▎ | 524/1610 [7:48:37<5:13:30, 17.32s/it] {'loss': 0.0017, 'grad_norm': 1.420015104201669, 'learning_rate': 6.745341614906832e-07, 'completion_length': 93.64286422729492, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.11266787722706795, 'kl': 0.04345703125, 'epoch': 1.63} 33%|███▎ | 524/1610 [7:48:37<5:13:30, 17.32s/it] 33%|███▎ | 525/1610 [7:48:56<5:24:11, 17.93s/it] {'loss': 0.0021, 'grad_norm': 1.871668125547415, 'learning_rate': 6.739130434782609e-07, 'completion_length': 115.71429061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.21981073170900345, 'kl': 
0.0533447265625, 'epoch': 1.63} 33%|███▎ | 525/1610 [7:48:56<5:24:11, 17.93s/it] 33%|███▎ | 526/1610 [7:49:16<5:37:22, 18.67s/it] {'loss': 0.0021, 'grad_norm': 1.1951199796604925, 'learning_rate': 6.732919254658385e-07, 'completion_length': 115.89286041259766, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1539071798324585, 'kl': 0.0531005859375, 'epoch': 1.63} 33%|███▎ | 526/1610 [7:49:16<5:37:22, 18.67s/it] 33%|███▎ | 527/1610 [7:49:33<5:26:07, 18.07s/it] {'loss': 0.002, 'grad_norm': 10.389077101892438, 'learning_rate': 6.726708074534161e-07, 'completion_length': 112.9285774230957, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1649572253227234, 'kl': 0.050537109375, 'epoch': 1.64} 33%|███▎ | 527/1610 [7:49:33<5:26:07, 18.07s/it] 33%|███▎ | 528/1610 [7:49:49<5:17:05, 17.58s/it] {'loss': 0.0022, 'grad_norm': 2.0153590043514815, 'learning_rate': 6.720496894409938e-07, 'completion_length': 94.32143020629883, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.30228933691978455, 'kl': 0.05517578125, 'epoch': 1.64} 33%|███▎ | 528/1610 [7:49:49<5:17:05, 17.58s/it] 33%|███▎ | 529/1610 [7:50:09<5:29:55, 18.31s/it] {'loss': 0.0018, 'grad_norm': 1.4695021277676907, 'learning_rate': 6.714285714285714e-07, 'completion_length': 120.5535774230957, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.14838216453790665, 'kl': 0.0458984375, 'epoch': 1.64} 33%|███▎ | 529/1610 [7:50:09<5:29:55, 18.31s/it] 33%|███▎ | 530/1610 [7:50:29<5:34:48, 18.60s/it] {'loss': 0.0019, 'grad_norm': 1.436453163812213, 'learning_rate': 6.708074534161491e-07, 'completion_length': 124.33929061889648, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 
'reward_std': 0.1785714402794838, 'kl': 0.0474853515625, 'epoch': 1.65} 33%|███▎ | 530/1610 [7:50:29<5:34:48, 18.60s/it] 33%|███▎ | 531/1610 [7:50:45<5:20:44, 17.84s/it] {'loss': 0.0018, 'grad_norm': 1.5137751096208816, 'learning_rate': 6.701863354037268e-07, 'completion_length': 74.87500381469727, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.0714285746216774, 'kl': 0.0450439453125, 'epoch': 1.65} 33%|███▎ | 531/1610 [7:50:45<5:20:44, 17.84s/it] 33%|███▎ | 532/1610 [7:51:03<5:22:55, 17.97s/it] {'loss': 0.0023, 'grad_norm': 1.1200010614551172, 'learning_rate': 6.695652173913044e-07, 'completion_length': 105.71429061889648, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.1428571529686451, 'kl': 0.0572509765625, 'epoch': 1.65} 33%|███▎ | 532/1610 [7:51:03<5:22:55, 17.97s/it] 33%|███▎ | 533/1610 [7:51:24<5:40:34, 18.97s/it] {'loss': 0.0021, 'grad_norm': 1.2893570374144054, 'learning_rate': 6.689440993788819e-07, 'completion_length': 113.01786041259766, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.0537109375, 'epoch': 1.66} 33%|███▎ | 533/1610 [7:51:24<5:40:34, 18.97s/it] 33%|███▎ | 534/1610 [7:51:44<5:45:27, 19.26s/it] {'loss': 0.0021, 'grad_norm': 4.857010199834846, 'learning_rate': 6.683229813664595e-07, 'completion_length': 88.96429061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.3324785977602005, 'kl': 0.05224609375, 'epoch': 1.66} 33%|███▎ | 534/1610 [7:51:44<5:45:27, 19.26s/it] 33%|███▎ | 535/1610 [7:52:02<5:36:13, 18.77s/it] {'loss': 0.0018, 'grad_norm': 1.0841536666100176, 'learning_rate': 6.677018633540372e-07, 'completion_length': 101.05357360839844, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 
'reward': 1.5892857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.046142578125, 'epoch': 1.66} 33%|███▎ | 535/1610 [7:52:02<5:36:13, 18.77s/it] 33%|███▎ | 536/1610 [7:52:21<5:38:53, 18.93s/it] {'loss': 0.0024, 'grad_norm': 1.4567006476672428, 'learning_rate': 6.670807453416148e-07, 'completion_length': 107.51786422729492, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.17098906636238098, 'kl': 0.0604248046875, 'epoch': 1.66} 33%|███▎ | 536/1610 [7:52:21<5:38:53, 18.93s/it] 33%|███▎ | 537/1610 [7:52:40<5:39:25, 18.98s/it] {'loss': 0.0026, 'grad_norm': 2.007794758362935, 'learning_rate': 6.664596273291924e-07, 'completion_length': 107.87500762939453, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.11266788095235825, 'kl': 0.065673828125, 'epoch': 1.67} 33%|███▎ | 537/1610 [7:52:40<5:39:25, 18.98s/it] 33%|███▎ | 538/1610 [7:52:59<5:38:07, 18.92s/it] {'loss': 0.0024, 'grad_norm': 1.6691629115821458, 'learning_rate': 6.658385093167701e-07, 'completion_length': 93.1964340209961, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.1539071835577488, 'kl': 0.06103515625, 'epoch': 1.67} 33%|███▎ | 538/1610 [7:52:59<5:38:07, 18.92s/it] 33%|███▎ | 539/1610 [7:53:17<5:35:03, 18.77s/it] {'loss': 0.0021, 'grad_norm': 1.4016449597092309, 'learning_rate': 6.652173913043478e-07, 'completion_length': 93.17857360839844, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.1785714402794838, 'kl': 0.0531005859375, 'epoch': 1.67} 33%|███▎ | 539/1610 [7:53:17<5:35:03, 18.77s/it] 34%|███▎ | 540/1610 [7:53:39<5:51:30, 19.71s/it] {'loss': 0.003, 'grad_norm': 1.630601582970776, 'learning_rate': 6.645962732919254e-07, 'completion_length': 114.83929061889648, 'rewards/accuracy_reward': 
0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.073974609375, 'epoch': 1.68} 34%|███▎ | 540/1610 [7:53:39<5:51:30, 19.71s/it] 34%|███▎ | 541/1610 [7:53:56<5:34:53, 18.80s/it] {'loss': 0.0028, 'grad_norm': 2.081816920928544, 'learning_rate': 6.639751552795031e-07, 'completion_length': 87.33929061889648, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.21981073170900345, 'kl': 0.0693359375, 'epoch': 1.68} 34%|███▎ | 541/1610 [7:53:56<5:34:53, 18.80s/it] 34%|███▎ | 542/1610 [7:54:14<5:29:03, 18.49s/it] {'loss': 0.0017, 'grad_norm': 1.7671761809968798, 'learning_rate': 6.633540372670807e-07, 'completion_length': 113.00000762939453, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.18409644439816475, 'kl': 0.0418701171875, 'epoch': 1.68} 34%|███▎ | 542/1610 [7:54:14<5:29:03, 18.49s/it] 34%|███▎ | 543/1610 [7:54:30<5:14:02, 17.66s/it] {'loss': 0.0021, 'grad_norm': 3.8551842750434293, 'learning_rate': 6.627329192546583e-07, 'completion_length': 83.17857360839844, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857909202576, 'reward_std': 0.2721000760793686, 'kl': 0.05224609375, 'epoch': 1.69} 34%|███▎ | 543/1610 [7:54:30<5:14:02, 17.66s/it] 34%|███▍ | 544/1610 [7:54:46<5:06:46, 17.27s/it] {'loss': 0.002, 'grad_norm': 3.5042508214037835, 'learning_rate': 6.62111801242236e-07, 'completion_length': 80.42857360839844, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.3078143820166588, 'kl': 0.05029296875, 'epoch': 1.69} 34%|███▍ | 544/1610 [7:54:46<5:06:46, 17.27s/it] 34%|███▍ | 545/1610 [7:55:04<5:09:12, 17.42s/it] {'loss': 0.0017, 'grad_norm': 2.268018237260785, 'learning_rate': 6.614906832298136e-07, 'completion_length': 93.19643020629883, 
'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.2363857924938202, 'kl': 0.0419921875, 'epoch': 1.69} 34%|███▍ | 545/1610 [7:55:04<5:09:12, 17.42s/it] 34%|███▍ | 546/1610 [7:55:19<4:57:54, 16.80s/it] {'loss': 0.0022, 'grad_norm': 1.9729112986867197, 'learning_rate': 6.608695652173912e-07, 'completion_length': 76.48214721679688, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1785714365541935, 'kl': 0.0548095703125, 'epoch': 1.7} 34%|███▍ | 546/1610 [7:55:19<4:57:54, 16.80s/it] 34%|███▍ | 547/1610 [7:55:36<4:58:18, 16.84s/it] {'loss': 0.0025, 'grad_norm': 1.7391204558557203, 'learning_rate': 6.602484472049689e-07, 'completion_length': 103.37500381469727, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.2610500380396843, 'kl': 0.063232421875, 'epoch': 1.7} 34%|███▍ | 547/1610 [7:55:36<4:58:18, 16.84s/it] 34%|███▍ | 548/1610 [7:55:54<5:04:10, 17.18s/it] {'loss': 0.0024, 'grad_norm': 3.3982135557214006, 'learning_rate': 6.596273291925466e-07, 'completion_length': 110.98214721679688, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.25552502274513245, 'kl': 0.06103515625, 'epoch': 1.7} 34%|███▍ | 548/1610 [7:55:54<5:04:10, 17.18s/it] 34%|███▍ | 549/1610 [7:56:13<5:11:06, 17.59s/it] {'loss': 0.0022, 'grad_norm': 2.0147755130254383, 'learning_rate': 6.590062111801242e-07, 'completion_length': 111.16071701049805, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.1428571492433548, 'kl': 0.0556640625, 'epoch': 1.7} 34%|███▍ | 549/1610 [7:56:13<5:11:06, 17.59s/it] 34%|███▍ | 550/1610 [7:56:33<5:27:39, 18.55s/it] {'loss': 0.002, 'grad_norm': 2.7487272645818206, 'learning_rate': 6.583850931677019e-07, 
'completion_length': 112.46429061889648, 'rewards/accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.0357142873108387, 'kl': 0.0494384765625, 'epoch': 1.71} 34%|███▍ | 550/1610 [7:56:33<5:27:39, 18.55s/it] 34%|███▍ | 551/1610 [7:56:49<5:14:58, 17.85s/it] {'loss': 0.002, 'grad_norm': 1.2457649285816603, 'learning_rate': 6.577639751552795e-07, 'completion_length': 89.1964340209961, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.050537109375, 'epoch': 1.71} 34%|███▍ | 551/1610 [7:56:49<5:14:58, 17.85s/it] 34%|███▍ | 552/1610 [7:57:09<5:23:41, 18.36s/it] {'loss': 0.0021, 'grad_norm': 1.7196685047853846, 'learning_rate': 6.571428571428571e-07, 'completion_length': 112.83928680419922, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.21981073915958405, 'kl': 0.052490234375, 'epoch': 1.71} 34%|███▍ | 552/1610 [7:57:09<5:23:41, 18.36s/it] 34%|███▍ | 553/1610 [7:57:28<5:24:02, 18.39s/it] {'loss': 0.0018, 'grad_norm': 2.4659584016042655, 'learning_rate': 6.565217391304348e-07, 'completion_length': 82.94643020629883, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.11266788095235825, 'kl': 0.0439453125, 'epoch': 1.72} 34%|███▍ | 553/1610 [7:57:28<5:24:02, 18.39s/it] 34%|███▍ | 554/1610 [7:57:46<5:24:59, 18.47s/it] {'loss': 0.0019, 'grad_norm': 1.430448825852895, 'learning_rate': 6.559006211180124e-07, 'completion_length': 117.66071701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.04736328125, 'epoch': 1.72} 34%|███▍ | 554/1610 [7:57:46<5:24:59, 18.47s/it] 34%|███▍ | 555/1610 [7:58:05<5:28:58, 18.71s/it] {'loss': 0.0023, 'grad_norm': 4.3796865530068905, 
'learning_rate': 6.5527950310559e-07, 'completion_length': 118.28571701049805, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.2253357619047165, 'kl': 0.0579833984375, 'epoch': 1.72} 34%|███▍ | 555/1610 [7:58:05<5:28:58, 18.71s/it] 35%|███▍ | 556/1610 [7:58:24<5:27:34, 18.65s/it] {'loss': 0.0021, 'grad_norm': 3.5357972505056194, 'learning_rate': 6.546583850931676e-07, 'completion_length': 90.37500381469727, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.23086077719926834, 'kl': 0.0513916015625, 'epoch': 1.73} 35%|███▍ | 556/1610 [7:58:24<5:27:34, 18.65s/it] 35%|███▍ | 557/1610 [7:58:39<5:11:00, 17.72s/it] {'loss': 0.0019, 'grad_norm': 1.8349927173417302, 'learning_rate': 6.540372670807453e-07, 'completion_length': 81.26786422729492, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266787722706795, 'kl': 0.046630859375, 'epoch': 1.73} 35%|███▍ | 557/1610 [7:58:39<5:11:00, 17.72s/it] 35%|███▍ | 558/1610 [7:58:59<5:20:22, 18.27s/it] {'loss': 0.0024, 'grad_norm': 2.659037678032093, 'learning_rate': 6.534161490683229e-07, 'completion_length': 102.33929061889648, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.25552503019571304, 'kl': 0.059814453125, 'epoch': 1.73} 35%|███▍ | 558/1610 [7:58:59<5:20:22, 18.27s/it] 35%|███▍ | 559/1610 [7:59:19<5:30:10, 18.85s/it] {'loss': 0.0031, 'grad_norm': 2.3485690298169213, 'learning_rate': 6.527950310559006e-07, 'completion_length': 114.10714721679688, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.14838216453790665, 'kl': 0.078369140625, 'epoch': 1.74} 35%|███▍ | 559/1610 [7:59:19<5:30:10, 18.85s/it] 35%|███▍ | 560/1610 [7:59:36<5:20:32, 18.32s/it] {'loss': 
0.0017, 'grad_norm': 1.6340969994743528, 'learning_rate': 6.521739130434782e-07, 'completion_length': 100.9285774230957, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.29123930633068085, 'kl': 0.041748046875, 'epoch': 1.74} 35%|███▍ | 560/1610 [7:59:36<5:20:32, 18.32s/it] 35%|███▍ | 561/1610 [7:59:54<5:14:36, 18.00s/it] {'loss': 0.0021, 'grad_norm': 1.2084543406341188, 'learning_rate': 6.515527950310558e-07, 'completion_length': 115.0714340209961, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0513916015625, 'epoch': 1.74} 35%|███▍ | 561/1610 [7:59:54<5:14:36, 18.00s/it] 35%|███▍ | 562/1610 [8:00:09<4:58:41, 17.10s/it] {'loss': 0.0028, 'grad_norm': 1.0528075231736338, 'learning_rate': 6.509316770186335e-07, 'completion_length': 115.66072082519531, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.1539071872830391, 'kl': 0.0706787109375, 'epoch': 1.75} 35%|███▍ | 562/1610 [8:00:09<4:58:41, 17.10s/it] 35%|███▍ | 563/1610 [8:00:27<5:03:57, 17.42s/it] {'loss': 0.0031, 'grad_norm': 1.5269022726156958, 'learning_rate': 6.503105590062111e-07, 'completion_length': 128.46429061889648, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.1428571529686451, 'kl': 0.0765380859375, 'epoch': 1.75} 35%|███▍ | 563/1610 [8:00:27<5:03:57, 17.42s/it] 35%|███▌ | 564/1610 [8:00:40<4:42:22, 16.20s/it] {'loss': 0.0021, 'grad_norm': 1.457007560731952, 'learning_rate': 6.496894409937887e-07, 'completion_length': 110.32143783569336, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.1428571492433548, 'kl': 0.052734375, 'epoch': 1.75} 35%|███▌ | 564/1610 [8:00:40<4:42:22, 
16.20s/it] 35%|███▌ | 565/1610 [8:00:52<4:19:16, 14.89s/it] {'loss': 0.0017, 'grad_norm': 2.313662294396997, 'learning_rate': 6.490683229813664e-07, 'completion_length': 101.41072082519531, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.0428466796875, 'epoch': 1.75} 35%|███▌ | 565/1610 [8:00:52<4:19:16, 14.89s/it] 35%|███▌ | 566/1610 [8:01:06<4:15:19, 14.67s/it] {'loss': 0.0028, 'grad_norm': 6.899744629650609, 'learning_rate': 6.484472049689441e-07, 'completion_length': 108.66072082519531, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.23086076974868774, 'kl': 0.0711669921875, 'epoch': 1.76} 35%|███▌ | 566/1610 [8:01:06<4:15:19, 14.67s/it] 35%|███▌ | 567/1610 [8:01:19<4:07:02, 14.21s/it] {'loss': 0.0023, 'grad_norm': 3.6686473132098736, 'learning_rate': 6.478260869565217e-07, 'completion_length': 113.50000381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1071428619325161, 'kl': 0.0570068359375, 'epoch': 1.76} 35%|███▌ | 567/1610 [8:01:19<4:07:02, 14.21s/it] 35%|███▌ | 568/1610 [8:01:36<4:19:38, 14.95s/it] {'loss': 0.0018, 'grad_norm': 1.9698491872565163, 'learning_rate': 6.472049689440994e-07, 'completion_length': 136.05357360839844, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.25552502647042274, 'kl': 0.0445556640625, 'epoch': 1.76} 35%|███▌ | 568/1610 [8:01:36<4:19:38, 14.95s/it] 35%|███▌ | 569/1610 [8:01:47<3:59:06, 13.78s/it] {'loss': 0.0025, 'grad_norm': 1.2594771109234406, 'learning_rate': 6.46583850931677e-07, 'completion_length': 83.6964340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.14838215708732605, 'kl': 0.062255859375, 'epoch': 1.77} 
35%|███▌ | 569/1610 [8:01:47<3:59:06, 13.78s/it] 35%|███▌ | 570/1610 [8:01:59<3:51:50, 13.38s/it] {'loss': 0.0016, 'grad_norm': 1.7810067281496091, 'learning_rate': 6.459627329192546e-07, 'completion_length': 102.23214721679688, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.2142857238650322, 'kl': 0.0389404296875, 'epoch': 1.77} 35%|███▌ | 570/1610 [8:01:59<3:51:50, 13.38s/it] 35%|███▌ | 571/1610 [8:02:12<3:46:21, 13.07s/it] {'loss': 0.0022, 'grad_norm': 2.262741362198001, 'learning_rate': 6.453416149068323e-07, 'completion_length': 103.80357360839844, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.2857142984867096, 'kl': 0.0543212890625, 'epoch': 1.77} 35%|███▌ | 571/1610 [8:02:12<3:46:21, 13.07s/it] 36%|███▌ | 572/1610 [8:02:29<4:06:01, 14.22s/it] {'loss': 0.0031, 'grad_norm': 1.7774896426373878, 'learning_rate': 6.447204968944099e-07, 'completion_length': 137.51786041259766, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.2937234044075012, 'kl': 0.077880859375, 'epoch': 1.78} 36%|███▌ | 572/1610 [8:02:29<4:06:01, 14.22s/it] 36%|███▌ | 573/1610 [8:02:42<4:02:11, 14.01s/it] {'loss': 0.0022, 'grad_norm': 5.789934851180757, 'learning_rate': 6.440993788819875e-07, 'completion_length': 100.0535774230957, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.05517578125, 'epoch': 1.78} 36%|███▌ | 573/1610 [8:02:42<4:02:11, 14.01s/it] 36%|███▌ | 574/1610 [8:03:00<4:21:29, 15.14s/it] {'loss': 0.0022, 'grad_norm': 1.3441150968052842, 'learning_rate': 6.434782608695652e-07, 'completion_length': 156.37500762939453, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 
0.33800365030765533, 'kl': 0.0540771484375, 'epoch': 1.78} 36%|███▌ | 574/1610 [8:03:00<4:21:29, 15.14s/it] 36%|███▌ | 575/1610 [8:03:12<4:04:43, 14.19s/it] {'loss': 0.0025, 'grad_norm': 2.467979704215309, 'learning_rate': 6.428571428571429e-07, 'completion_length': 105.3214340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1181928962469101, 'kl': 0.0614013671875, 'epoch': 1.79} 36%|███▌ | 575/1610 [8:03:12<4:04:43, 14.19s/it] 36%|███▌ | 576/1610 [8:03:26<4:04:51, 14.21s/it] {'loss': 0.0026, 'grad_norm': 1.8703017554163213, 'learning_rate': 6.422360248447205e-07, 'completion_length': 117.00000381469727, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1896214783191681, 'kl': 0.0650634765625, 'epoch': 1.79} 36%|███▌ | 576/1610 [8:03:26<4:04:51, 14.21s/it] 36%|███▌ | 577/1610 [8:03:43<4:20:21, 15.12s/it] {'loss': 0.0032, 'grad_norm': 1.5483260400566297, 'learning_rate': 6.416149068322982e-07, 'completion_length': 155.42858123779297, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.2253357619047165, 'kl': 0.079833984375, 'epoch': 1.79} 36%|███▌ | 577/1610 [8:03:43<4:20:21, 15.12s/it] 36%|███▌ | 578/1610 [8:03:58<4:18:21, 15.02s/it] {'loss': 0.0022, 'grad_norm': 1.6411599462276456, 'learning_rate': 6.409937888198758e-07, 'completion_length': 108.67857360839844, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5000000596046448, 'reward_std': 0.2142857238650322, 'kl': 0.0550537109375, 'epoch': 1.8} 36%|███▌ | 578/1610 [8:03:58<4:18:21, 15.02s/it] 36%|███▌ | 579/1610 [8:04:13<4:15:42, 14.88s/it] {'loss': 0.0017, 'grad_norm': 1.5093074729408402, 'learning_rate': 6.403726708074534e-07, 'completion_length': 113.48214340209961, 'rewards/accuracy_reward': 0.5535714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1896214634180069, 'kl': 0.042724609375, 'epoch': 1.8} 36%|███▌ | 579/1610 [8:04:13<4:15:42, 14.88s/it] 36%|███▌ | 580/1610 [8:04:31<4:33:00, 15.90s/it] {'loss': 0.0027, 'grad_norm': 1.2470703655635291, 'learning_rate': 6.39751552795031e-07, 'completion_length': 151.87500762939453, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.22781985998153687, 'kl': 0.0673828125, 'epoch': 1.8} 36%|███▌ | 580/1610 [8:04:31<4:33:00, 15.90s/it] 36%|███▌ | 581/1610 [8:04:44<4:16:10, 14.94s/it] {'loss': 0.0026, 'grad_norm': 1.3137775177556092, 'learning_rate': 6.391304347826086e-07, 'completion_length': 121.00000762939453, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.15943221747875214, 'kl': 0.0660400390625, 'epoch': 1.8} 36%|███▌ | 581/1610 [8:04:44<4:16:10, 14.94s/it] 36%|███▌ | 582/1610 [8:05:03<4:37:53, 16.22s/it] {'loss': 0.0038, 'grad_norm': 2.047829182631946, 'learning_rate': 6.385093167701862e-07, 'completion_length': 148.35714721679688, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.2610500380396843, 'kl': 0.095703125, 'epoch': 1.81} 36%|███▌ | 582/1610 [8:05:03<4:37:53, 16.22s/it] 36%|███▌ | 583/1610 [8:05:19<4:37:27, 16.21s/it] {'loss': 0.0034, 'grad_norm': 1.0427703898669478, 'learning_rate': 6.378881987577639e-07, 'completion_length': 154.89286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.1539071835577488, 'kl': 0.083984375, 'epoch': 1.81} 36%|███▌ | 583/1610 [8:05:19<4:37:27, 16.21s/it] 36%|███▋ | 584/1610 [8:05:32<4:19:02, 15.15s/it] {'loss': 0.0021, 'grad_norm': 1.0908019695808073, 'learning_rate': 6.372670807453416e-07, 
'completion_length': 113.25000762939453, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.0714285746216774, 'kl': 0.0523681640625, 'epoch': 1.81} 36%|███▋ | 584/1610 [8:05:32<4:19:02, 15.15s/it] 36%|███▋ | 585/1610 [8:05:44<4:03:56, 14.28s/it] {'loss': 0.0026, 'grad_norm': 0.8667407489816984, 'learning_rate': 6.366459627329192e-07, 'completion_length': 108.53572082519531, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266788095235825, 'kl': 0.063720703125, 'epoch': 1.82} 36%|███▋ | 585/1610 [8:05:44<4:03:56, 14.28s/it] 36%|███▋ | 586/1610 [8:05:53<3:38:03, 12.78s/it] {'loss': 0.0015, 'grad_norm': 3.122671178650278, 'learning_rate': 6.360248447204969e-07, 'completion_length': 89.21429061889648, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.1539071872830391, 'kl': 0.03857421875, 'epoch': 1.82} 36%|███▋ | 586/1610 [8:05:53<3:38:03, 12.78s/it] 36%|███▋ | 587/1610 [8:06:07<3:40:35, 12.94s/it] {'loss': 0.0028, 'grad_norm': 1.776821618017603, 'learning_rate': 6.354037267080745e-07, 'completion_length': 103.55357360839844, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.2610500529408455, 'kl': 0.0699462890625, 'epoch': 1.82} 36%|███▋ | 587/1610 [8:06:07<3:40:35, 12.94s/it] 37%|███▋ | 588/1610 [8:06:21<3:49:13, 13.46s/it] {'loss': 0.0032, 'grad_norm': 2.529279245974982, 'learning_rate': 6.347826086956521e-07, 'completion_length': 123.26786041259766, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.21981073915958405, 'kl': 0.0791015625, 'epoch': 1.83} 37%|███▋ | 588/1610 [8:06:21<3:49:13, 13.46s/it] 37%|███▋ | 589/1610 [8:06:38<4:05:10, 14.41s/it] {'loss': 0.0032, 'grad_norm': 1.464051379675937, 
'learning_rate': 6.341614906832298e-07, 'completion_length': 130.14286422729492, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1785714402794838, 'kl': 0.080810546875, 'epoch': 1.83} 37%|███▋ | 589/1610 [8:06:38<4:05:10, 14.41s/it] 37%|███▋ | 590/1610 [8:06:50<3:52:33, 13.68s/it] {'loss': 0.0027, 'grad_norm': 1.4180101841721744, 'learning_rate': 6.335403726708074e-07, 'completion_length': 110.50000381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.1896214671432972, 'kl': 0.068359375, 'epoch': 1.83} 37%|███▋ | 590/1610 [8:06:50<3:52:33, 13.68s/it] 37%|███▋ | 591/1610 [8:07:04<3:55:40, 13.88s/it] {'loss': 0.0021, 'grad_norm': 1.6646962045479985, 'learning_rate': 6.32919254658385e-07, 'completion_length': 116.3214340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.15943220257759094, 'kl': 0.05322265625, 'epoch': 1.84} 37%|███▋ | 591/1610 [8:07:04<3:55:40, 13.88s/it] 37%|███▋ | 592/1610 [8:07:19<3:58:06, 14.03s/it] {'loss': 0.003, 'grad_norm': 1.296019590375463, 'learning_rate': 6.322981366459627e-07, 'completion_length': 129.73215103149414, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1539071872830391, 'kl': 0.0740966796875, 'epoch': 1.84} 37%|███▋ | 592/1610 [8:07:19<3:58:06, 14.03s/it] 37%|███▋ | 593/1610 [8:07:35<4:09:47, 14.74s/it] {'loss': 0.002, 'grad_norm': 1.8059739828109087, 'learning_rate': 6.316770186335404e-07, 'completion_length': 115.6785774230957, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285715222358704, 'reward_std': 0.14534124732017517, 'kl': 0.0506591796875, 'epoch': 1.84} 37%|███▋ | 593/1610 [8:07:35<4:09:47, 14.74s/it] 37%|███▋ | 594/1610 [8:07:49<4:05:44, 14.51s/it] {'loss': 
0.0017, 'grad_norm': 0.7717697006149635, 'learning_rate': 6.31055900621118e-07, 'completion_length': 95.80357360839844, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.0357142873108387, 'kl': 0.0435791015625, 'epoch': 1.84} 37%|███▋ | 594/1610 [8:07:49<4:05:44, 14.51s/it] 37%|███▋ | 595/1610 [8:08:07<4:20:53, 15.42s/it] {'loss': 0.0036, 'grad_norm': 2.0540757371070293, 'learning_rate': 6.304347826086957e-07, 'completion_length': 134.6607208251953, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.18409644439816475, 'kl': 0.08935546875, 'epoch': 1.85} 37%|███▋ | 595/1610 [8:08:07<4:20:53, 15.42s/it] 37%|███▋ | 596/1610 [8:08:19<4:05:09, 14.51s/it] {'loss': 0.0026, 'grad_norm': 3.9526349741442908, 'learning_rate': 6.298136645962733e-07, 'completion_length': 91.39286422729492, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1896214708685875, 'kl': 0.0640869140625, 'epoch': 1.85} 37%|███▋ | 596/1610 [8:08:19<4:05:09, 14.51s/it] 37%|███▋ | 597/1610 [8:08:30<3:47:41, 13.49s/it] {'loss': 0.0017, 'grad_norm': 1.2531093169679823, 'learning_rate': 6.291925465838509e-07, 'completion_length': 93.3035774230957, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.04327392578125, 'epoch': 1.85} 37%|███▋ | 597/1610 [8:08:30<3:47:41, 13.49s/it] 37%|███▋ | 598/1610 [8:08:41<3:35:16, 12.76s/it] {'loss': 0.0019, 'grad_norm': 3.147747778067125, 'learning_rate': 6.285714285714286e-07, 'completion_length': 97.01786422729492, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.25552502274513245, 'kl': 0.0479736328125, 'epoch': 1.86} 37%|███▋ | 598/1610 [8:08:41<3:35:16, 
12.76s/it] 37%|███▋ | 599/1610 [8:08:54<3:34:53, 12.75s/it] {'loss': 0.0021, 'grad_norm': 1.9204770776792355, 'learning_rate': 6.279503105590062e-07, 'completion_length': 95.4464340209961, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.21981074661016464, 'kl': 0.0535888671875, 'epoch': 1.86} 37%|███▋ | 599/1610 [8:08:54<3:34:53, 12.75s/it] 37%|███▋ | 600/1610 [8:09:09<3:46:13, 13.44s/it] {'loss': 0.0036, 'grad_norm': 1.873642923523625, 'learning_rate': 6.273291925465838e-07, 'completion_length': 141.1964340209961, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.25552501529455185, 'kl': 0.0894775390625, 'epoch': 1.86} 37%|███▋ | 600/1610 [8:09:09<3:46:13, 13.44s/it] 37%|███▋ | 601/1610 [8:12:26<19:13:14, 68.58s/it] {'loss': 0.0072, 'grad_norm': 1.5145898595065397, 'learning_rate': 6.267080745341615e-07, 'completion_length': 139.28572463989258, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857909202576, 'reward_std': 0.28819839656352997, 'kl': 0.1806640625, 'epoch': 1.87} 37%|███▋ | 601/1610 [8:12:26<19:13:14, 68.58s/it] 37%|███▋ | 602/1610 [8:12:43<14:53:45, 53.20s/it] {'loss': 0.0039, 'grad_norm': 2.792066713861984, 'learning_rate': 6.260869565217392e-07, 'completion_length': 126.76786041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.21222837269306183, 'kl': 0.096435546875, 'epoch': 1.87} 37%|███▋ | 602/1610 [8:12:43<14:53:45, 53.20s/it] 37%|███▋ | 603/1610 [8:13:01<11:51:57, 42.42s/it] {'loss': 0.005, 'grad_norm': 5.16094550101429, 'learning_rate': 6.254658385093168e-07, 'completion_length': 147.4464340209961, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035715222358704, 'reward_std': 0.1896214783191681, 'kl': 
0.125, 'epoch': 1.87} 37%|███▋ | 603/1610 [8:13:01<11:51:57, 42.42s/it] 38%|███▊ | 604/1610 [8:13:14<9:26:55, 33.81s/it] {'loss': 0.0027, 'grad_norm': 2.3594304754759263, 'learning_rate': 6.248447204968945e-07, 'completion_length': 102.32143020629883, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.066650390625, 'epoch': 1.88} 38%|███▊ | 604/1610 [8:13:14<9:26:55, 33.81s/it] 38%|███▊ | 605/1610 [8:13:32<8:05:43, 29.00s/it] {'loss': 0.0075, 'grad_norm': 3.215712803549996, 'learning_rate': 6.24223602484472e-07, 'completion_length': 173.58929443359375, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.25552501529455185, 'kl': 0.185546875, 'epoch': 1.88} 38%|███▊ | 605/1610 [8:13:32<8:05:43, 29.00s/it] 38%|███▊ | 606/1610 [8:13:45<6:43:45, 24.13s/it] {'loss': 0.003, 'grad_norm': 2.038295971829527, 'learning_rate': 6.236024844720496e-07, 'completion_length': 126.4464340209961, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.2253357619047165, 'kl': 0.075927734375, 'epoch': 1.88} 38%|███▊ | 606/1610 [8:13:45<6:43:45, 24.13s/it] 38%|███▊ | 607/1610 [8:14:02<6:07:00, 21.95s/it] {'loss': 0.0035, 'grad_norm': 2.370806148747809, 'learning_rate': 6.229813664596273e-07, 'completion_length': 120.39286041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.21981073170900345, 'kl': 0.087646484375, 'epoch': 1.89} 38%|███▊ | 607/1610 [8:14:02<6:07:00, 21.95s/it] 38%|███▊ | 608/1610 [8:14:16<5:27:42, 19.62s/it] {'loss': 0.0049, 'grad_norm': 2.782622523265657, 'learning_rate': 6.223602484472049e-07, 'completion_length': 107.64286041259766, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 
0.25552503019571304, 'kl': 0.121826171875, 'epoch': 1.89} 38%|███▊ | 608/1610 [8:14:16<5:27:42, 19.62s/it] 38%|███▊ | 609/1610 [8:14:30<5:00:56, 18.04s/it] {'loss': 0.0046, 'grad_norm': 2.2417765720632215, 'learning_rate': 6.217391304347825e-07, 'completion_length': 122.62500381469727, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.31333939731121063, 'kl': 0.11572265625, 'epoch': 1.89} 38%|███▊ | 609/1610 [8:14:30<5:00:56, 18.04s/it] 38%|███▊ | 610/1610 [8:14:46<4:46:50, 17.21s/it] {'loss': 0.0036, 'grad_norm': 1.7798101776103625, 'learning_rate': 6.211180124223601e-07, 'completion_length': 145.55357360839844, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285715222358704, 'reward_std': 0.2253357619047165, 'kl': 0.091064453125, 'epoch': 1.89} 38%|███▊ | 610/1610 [8:14:46<4:46:50, 17.21s/it] 38%|███▊ | 611/1610 [8:15:00<4:29:56, 16.21s/it] {'loss': 0.0045, 'grad_norm': 2.1705052365414805, 'learning_rate': 6.204968944099379e-07, 'completion_length': 118.60714721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1896214708685875, 'kl': 0.11279296875, 'epoch': 1.9} 38%|███▊ | 611/1610 [8:15:00<4:29:56, 16.21s/it] 38%|███▊ | 612/1610 [8:15:14<4:19:32, 15.60s/it] {'loss': 0.006, 'grad_norm': 4.5550355773812115, 'learning_rate': 6.198757763975155e-07, 'completion_length': 122.66071701049805, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.2142857238650322, 'kl': 0.150390625, 'epoch': 1.9} 38%|███▊ | 612/1610 [8:15:14<4:19:32, 15.60s/it] 38%|███▊ | 613/1610 [8:15:27<4:10:09, 15.05s/it] {'loss': 0.0032, 'grad_norm': 2.747595656705141, 'learning_rate': 6.192546583850932e-07, 'completion_length': 119.4464340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 
'reward': 1.5892857909202576, 'reward_std': 0.21981073170900345, 'kl': 0.0804443359375, 'epoch': 1.9} 38%|███▊ | 613/1610 [8:15:27<4:10:09, 15.05s/it] 38%|███▊ | 614/1610 [8:15:40<3:59:10, 14.41s/it] {'loss': 0.0034, 'grad_norm': 1.1707044804631799, 'learning_rate': 6.186335403726708e-07, 'completion_length': 122.12500762939453, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.2253357619047165, 'kl': 0.084716796875, 'epoch': 1.91} 38%|███▊ | 614/1610 [8:15:40<3:59:10, 14.41s/it] 38%|███▊ | 615/1610 [8:15:59<4:17:24, 15.52s/it] {'loss': 0.0066, 'grad_norm': 1.3895655575335464, 'learning_rate': 6.180124223602484e-07, 'completion_length': 154.23214721679688, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.2500000074505806, 'kl': 0.166259765625, 'epoch': 1.91} 38%|███▊ | 615/1610 [8:15:59<4:17:24, 15.52s/it] 38%|███▊ | 616/1610 [8:16:14<4:15:46, 15.44s/it] {'loss': 0.0044, 'grad_norm': 1.32404907061716, 'learning_rate': 6.17391304347826e-07, 'completion_length': 128.71429443359375, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.1428571492433548, 'kl': 0.10986328125, 'epoch': 1.91} 38%|███▊ | 616/1610 [8:16:14<4:15:46, 15.44s/it] 38%|███▊ | 617/1610 [8:16:26<3:59:45, 14.49s/it] {'loss': 0.0021, 'grad_norm': 1.3483244399532766, 'learning_rate': 6.167701863354037e-07, 'completion_length': 107.10714721679688, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214783191681, 'kl': 0.0523681640625, 'epoch': 1.92} 38%|███▊ | 617/1610 [8:16:26<3:59:45, 14.49s/it] 38%|███▊ | 618/1610 [8:16:41<3:59:39, 14.50s/it] {'loss': 0.003, 'grad_norm': 1.9131509782792946, 'learning_rate': 6.161490683229813e-07, 'completion_length': 110.37500381469727, 
'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.2721000760793686, 'kl': 0.074951171875, 'epoch': 1.92} 38%|███▊ | 618/1610 [8:16:41<3:59:39, 14.50s/it] 38%|███▊ | 619/1610 [8:16:58<4:13:33, 15.35s/it] {'loss': 0.0093, 'grad_norm': 1.6353619476693726, 'learning_rate': 6.15527950310559e-07, 'completion_length': 148.0357208251953, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3928572535514832, 'reward_std': 0.1539071872830391, 'kl': 0.23291015625, 'epoch': 1.92} 38%|███▊ | 619/1610 [8:16:58<4:13:33, 15.35s/it] 39%|███▊ | 620/1610 [8:17:16<4:27:02, 16.18s/it] {'loss': 0.0075, 'grad_norm': 1.3124745336907968, 'learning_rate': 6.149068322981367e-07, 'completion_length': 164.89286041259766, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.21222837269306183, 'kl': 0.18896484375, 'epoch': 1.93} 39%|███▊ | 620/1610 [8:17:16<4:27:02, 16.18s/it] 39%|███▊ | 621/1610 [8:17:33<4:31:55, 16.50s/it] {'loss': 0.0076, 'grad_norm': 1.7452398157793088, 'learning_rate': 6.142857142857143e-07, 'completion_length': 132.85714721679688, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.1785714402794838, 'kl': 0.19091796875, 'epoch': 1.93} 39%|███▊ | 621/1610 [8:17:33<4:31:55, 16.50s/it] 39%|███▊ | 622/1610 [8:17:47<4:17:44, 15.65s/it] {'loss': 0.0038, 'grad_norm': 11.797736170873169, 'learning_rate': 6.13664596273292e-07, 'completion_length': 131.87500762939453, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.2142857201397419, 'kl': 0.095458984375, 'epoch': 1.93} 39%|███▊ | 622/1610 [8:17:47<4:17:44, 15.65s/it] 39%|███▊ | 623/1610 [8:18:06<4:32:23, 16.56s/it] {'loss': 0.0049, 'grad_norm': 
1.4810766202277368, 'learning_rate': 6.130434782608696e-07, 'completion_length': 131.30357360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000001192092896, 'reward_std': 0.1539071872830391, 'kl': 0.12158203125, 'epoch': 1.93} 39%|███▊ | 623/1610 [8:18:06<4:32:23, 16.56s/it] 39%|███▉ | 624/1610 [8:18:23<4:38:15, 16.93s/it] {'loss': 0.0036, 'grad_norm': 4.3168877685068585, 'learning_rate': 6.124223602484472e-07, 'completion_length': 133.85714721679688, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4107143878936768, 'reward_std': 0.29123930633068085, 'kl': 0.091064453125, 'epoch': 1.94} 39%|███▉ | 624/1610 [8:18:23<4:38:15, 16.93s/it] 39%|███▉ | 625/1610 [8:18:44<4:54:15, 17.92s/it] {'loss': 0.0067, 'grad_norm': 1.2024810013464462, 'learning_rate': 6.118012422360248e-07, 'completion_length': 157.39286041259766, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.21124479919672012, 'kl': 0.16845703125, 'epoch': 1.94} 39%|███▉ | 625/1610 [8:18:44<4:54:15, 17.92s/it] 39%|███▉ | 626/1610 [8:19:00<4:44:56, 17.37s/it] {'loss': 0.003, 'grad_norm': 2.6381485438765613, 'learning_rate': 6.111801242236025e-07, 'completion_length': 110.6964340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1785714402794838, 'kl': 0.0740966796875, 'epoch': 1.94} 39%|███▉ | 626/1610 [8:19:00<4:44:56, 17.37s/it] 39%|███▉ | 627/1610 [8:19:14<4:30:24, 16.51s/it] {'loss': 0.004, 'grad_norm': 1.9137269390961038, 'learning_rate': 6.105590062111801e-07, 'completion_length': 114.35714721679688, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.0989990234375, 'epoch': 1.95} 39%|███▉ | 627/1610 
[8:19:14<4:30:24, 16.51s/it] 39%|███▉ | 628/1610 [8:19:31<4:33:11, 16.69s/it] {'loss': 0.0087, 'grad_norm': 1.0877540578422709, 'learning_rate': 6.099378881987576e-07, 'completion_length': 143.7857208251953, 'rewards/accuracy_reward': 0.3571428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.218505859375, 'epoch': 1.95} 39%|███▉ | 628/1610 [8:19:31<4:33:11, 16.69s/it] 39%|███▉ | 629/1610 [8:19:48<4:35:08, 16.83s/it] {'loss': 0.0072, 'grad_norm': 1.5829420553506848, 'learning_rate': 6.093167701863354e-07, 'completion_length': 144.66072463989258, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.2580091208219528, 'kl': 0.1796875, 'epoch': 1.95} 39%|███▉ | 629/1610 [8:19:48<4:35:08, 16.83s/it] 39%|███▉ | 630/1610 [8:20:00<4:07:59, 15.18s/it] {'loss': 0.0013, 'grad_norm': 1.6483673975621975, 'learning_rate': 6.08695652173913e-07, 'completion_length': 95.48214721679688, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214783191681, 'kl': 0.033203125, 'epoch': 1.96} 39%|███▉ | 630/1610 [8:20:00<4:07:59, 15.18s/it] 39%|███▉ | 631/1610 [8:20:16<4:14:35, 15.60s/it] {'loss': 0.0078, 'grad_norm': 1.4782051372128497, 'learning_rate': 6.080745341614906e-07, 'completion_length': 137.1785774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.1896214708685875, 'kl': 0.1953125, 'epoch': 1.96} 39%|███▉ | 631/1610 [8:20:16<4:14:35, 15.60s/it] 39%|███▉ | 632/1610 [8:20:31<4:07:16, 15.17s/it] {'loss': 0.0019, 'grad_norm': 1.6585053043205922, 'learning_rate': 6.074534161490683e-07, 'completion_length': 141.9464340209961, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 
0.2721000909805298, 'kl': 0.048095703125, 'epoch': 1.96} 39%|███▉ | 632/1610 [8:20:31<4:07:16, 15.17s/it] 39%|███▉ | 633/1610 [8:20:42<3:48:59, 14.06s/it] {'loss': 0.0022, 'grad_norm': 3.068510157132408, 'learning_rate': 6.068322981366459e-07, 'completion_length': 98.16072082519531, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.05517578125, 'epoch': 1.97} 39%|███▉ | 633/1610 [8:20:42<3:48:59, 14.06s/it] 39%|███▉ | 634/1610 [8:20:56<3:49:47, 14.13s/it] {'loss': 0.0041, 'grad_norm': 2.464536054814395, 'learning_rate': 6.062111801242235e-07, 'completion_length': 125.51786041259766, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.14838216826319695, 'kl': 0.101806640625, 'epoch': 1.97} 39%|███▉ | 634/1610 [8:20:56<3:49:47, 14.13s/it] 39%|███▉ | 635/1610 [8:21:14<4:07:13, 15.21s/it] {'loss': 0.0046, 'grad_norm': 1.6364972830978448, 'learning_rate': 6.055900621118012e-07, 'completion_length': 111.26786041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535714626312256, 'reward_std': 0.20670336484909058, 'kl': 0.11614990234375, 'epoch': 1.97} 39%|███▉ | 635/1610 [8:21:14<4:07:13, 15.21s/it] 40%|███▉ | 636/1610 [8:21:28<3:59:52, 14.78s/it] {'loss': 0.0044, 'grad_norm': 0.8604286396955967, 'learning_rate': 6.049689440993788e-07, 'completion_length': 124.01786041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.0357142873108387, 'kl': 0.11083984375, 'epoch': 1.98} 40%|███▉ | 636/1610 [8:21:28<3:59:52, 14.78s/it] 40%|███▉ | 637/1610 [8:21:41<3:50:40, 14.22s/it] {'loss': 0.0038, 'grad_norm': 1.3962627019535245, 'learning_rate': 6.043478260869564e-07, 'completion_length': 112.10714721679688, 'rewards/accuracy_reward': 0.6785714626312256, 
'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.11266787722706795, 'kl': 0.0947265625, 'epoch': 1.98} 40%|███▉ | 637/1610 [8:21:41<3:50:40, 14.22s/it] 40%|███▉ | 638/1610 [8:21:54<3:43:45, 13.81s/it] {'loss': 0.0026, 'grad_norm': 1.7531182291934402, 'learning_rate': 6.037267080745342e-07, 'completion_length': 100.78571701049805, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.19514649361371994, 'kl': 0.064208984375, 'epoch': 1.98} 40%|███▉ | 638/1610 [8:21:54<3:43:45, 13.81s/it] 40%|███▉ | 639/1610 [8:22:08<3:46:51, 14.02s/it] {'loss': 0.0018, 'grad_norm': 0.9108292013653183, 'learning_rate': 6.031055900621118e-07, 'completion_length': 119.23214721679688, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.1181928962469101, 'kl': 0.04473876953125, 'epoch': 1.98} 40%|███▉ | 639/1610 [8:22:08<3:46:51, 14.02s/it] 40%|███▉ | 640/1610 [8:22:21<3:42:25, 13.76s/it] {'loss': 0.0029, 'grad_norm': 3.9407112232049597, 'learning_rate': 6.024844720496894e-07, 'completion_length': 120.33929443359375, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500305891037, 'kl': 0.0726318359375, 'epoch': 1.99} 40%|███▉ | 640/1610 [8:22:21<3:42:25, 13.76s/it] 40%|███▉ | 641/1610 [8:22:39<4:02:57, 15.04s/it] {'loss': 0.0033, 'grad_norm': 1.428024678609846, 'learning_rate': 6.018633540372671e-07, 'completion_length': 117.12500762939453, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.1428571529686451, 'kl': 0.0821533203125, 'epoch': 1.99} 40%|███▉ | 641/1610 [8:22:39<4:02:57, 15.04s/it] 40%|███▉ | 642/1610 [8:22:56<4:10:00, 15.50s/it] {'loss': 0.0018, 'grad_norm': 1.2369676670765024, 'learning_rate': 6.012422360248447e-07, 'completion_length': 
106.66072082519531, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.14838216826319695, 'kl': 0.0452880859375, 'epoch': 1.99} 40%|███▉ | 642/1610 [8:22:56<4:10:00, 15.50s/it] 40%|███▉ | 643/1610 [8:23:11<4:09:58, 15.51s/it] {'loss': 0.0058, 'grad_norm': 1.8565055369216263, 'learning_rate': 6.006211180124223e-07, 'completion_length': 119.6964340209961, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.3404877483844757, 'kl': 0.1461181640625, 'epoch': 2.0} 40%|███▉ | 643/1610 [8:23:11<4:09:58, 15.51s/it] 40%|████ | 644/1610 [8:23:24<3:54:32, 14.57s/it] {'loss': 0.005, 'grad_norm': 2.2442683628222464, 'learning_rate': 6e-07, 'completion_length': 106.37500381469727, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.1259765625, 'epoch': 2.0} 40%|████ | 644/1610 [8:23:24<3:54:32, 14.57s/it] 40%|████ | 645/1610 [8:23:37<3:48:39, 14.22s/it] {'loss': 0.0023, 'grad_norm': 1.5533050379241276, 'learning_rate': 5.993788819875776e-07, 'completion_length': 111.16071701049805, 'rewards/accuracy_reward': 0.4464286044239998, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.1785714402794838, 'kl': 0.057861328125, 'epoch': 2.0} 40%|████ | 645/1610 [8:23:37<3:48:39, 14.22s/it] 40%|████ | 646/1610 [8:23:50<3:40:40, 13.73s/it] {'loss': 0.002, 'grad_norm': 0.7984516578481915, 'learning_rate': 5.987577639751552e-07, 'completion_length': 120.30357360839844, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.05078125, 'epoch': 2.01} 40%|████ | 646/1610 [8:23:50<3:40:40, 13.73s/it] 40%|████ | 647/1610 [8:24:02<3:32:17, 13.23s/it] {'loss': 0.0026, 'grad_norm': 4.329620222569222, 'learning_rate': 
5.98136645962733e-07, 'completion_length': 94.17857360839844, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.19514648616313934, 'kl': 0.066162109375, 'epoch': 2.01} 40%|████ | 647/1610 [8:24:02<3:32:17, 13.23s/it] 40%|████ | 648/1610 [8:24:16<3:34:20, 13.37s/it] {'loss': 0.0028, 'grad_norm': 1.851575842631447, 'learning_rate': 5.975155279503106e-07, 'completion_length': 106.60714721679688, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.2253357544541359, 'kl': 0.0697021484375, 'epoch': 2.01} 40%|████ | 648/1610 [8:24:16<3:34:20, 13.37s/it] 40%|████ | 649/1610 [8:24:29<3:36:45, 13.53s/it] {'loss': 0.0022, 'grad_norm': 1.4056595283788984, 'learning_rate': 5.968944099378882e-07, 'completion_length': 111.83929061889648, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0557861328125, 'epoch': 2.02} 40%|████ | 649/1610 [8:24:29<3:36:45, 13.53s/it] 40%|████ | 650/1610 [8:24:44<3:39:28, 13.72s/it] {'loss': 0.0034, 'grad_norm': 3.138090662871653, 'learning_rate': 5.962732919254659e-07, 'completion_length': 113.41071701049805, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.23086077719926834, 'kl': 0.0863037109375, 'epoch': 2.02} 40%|████ | 650/1610 [8:24:44<3:39:28, 13.72s/it] 40%|████ | 651/1610 [8:24:56<3:33:02, 13.33s/it] {'loss': 0.002, 'grad_norm': 1.7574630849909496, 'learning_rate': 5.956521739130435e-07, 'completion_length': 102.55357360839844, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.2363857999444008, 'kl': 0.0501708984375, 'epoch': 2.02} 40%|████ | 651/1610 [8:24:56<3:33:02, 13.33s/it] 40%|████ | 652/1610 [8:25:08<3:28:35, 13.06s/it] {'loss': 0.0017, 'grad_norm': 
2.949028677266978, 'learning_rate': 5.95031055900621e-07, 'completion_length': 110.48215103149414, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214671432972, 'kl': 0.04150390625, 'epoch': 2.02} 40%|████ | 652/1610 [8:25:08<3:28:35, 13.06s/it] 41%|████ | 653/1610 [8:25:27<3:52:17, 14.56s/it] {'loss': 0.0014, 'grad_norm': 1.6922247628782001, 'learning_rate': 5.944099378881987e-07, 'completion_length': 117.41072082519531, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.25248411297798157, 'kl': 0.0361328125, 'epoch': 2.03} 41%|████ | 653/1610 [8:25:27<3:52:17, 14.56s/it] 41%|████ | 654/1610 [8:25:40<3:48:47, 14.36s/it] {'loss': 0.0021, 'grad_norm': 1.8870106460881098, 'learning_rate': 5.937888198757763e-07, 'completion_length': 127.76786422729492, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.23086078464984894, 'kl': 0.0531005859375, 'epoch': 2.03} 41%|████ | 654/1610 [8:25:40<3:48:47, 14.36s/it] 41%|████ | 655/1610 [8:25:54<3:46:09, 14.21s/it] {'loss': 0.0027, 'grad_norm': 1.588721893843084, 'learning_rate': 5.931677018633539e-07, 'completion_length': 136.9464340209961, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1896214634180069, 'kl': 0.0675048828125, 'epoch': 2.03} 41%|████ | 655/1610 [8:25:54<3:46:09, 14.21s/it] 41%|████ | 656/1610 [8:26:08<3:43:40, 14.07s/it] {'loss': 0.002, 'grad_norm': 1.9775887257822804, 'learning_rate': 5.925465838509317e-07, 'completion_length': 101.26786041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.0509033203125, 'epoch': 2.04} 41%|████ | 656/1610 [8:26:08<3:43:40, 14.07s/it] 41%|████ | 657/1610 
[8:26:20<3:31:37, 13.32s/it] {'loss': 0.0018, 'grad_norm': 1.8310417702701458, 'learning_rate': 5.919254658385093e-07, 'completion_length': 104.76786422729492, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.2610500380396843, 'kl': 0.0440673828125, 'epoch': 2.04} 41%|████ | 657/1610 [8:26:20<3:31:37, 13.32s/it] 41%|████ | 658/1610 [8:26:35<3:39:28, 13.83s/it] {'loss': 0.0021, 'grad_norm': 0.09415787955448365, 'learning_rate': 5.913043478260869e-07, 'completion_length': 134.4821548461914, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.0, 'kl': 0.0516357421875, 'epoch': 2.04} 41%|████ | 658/1610 [8:26:35<3:39:28, 13.83s/it] 41%|████ | 659/1610 [8:26:48<3:38:54, 13.81s/it] {'loss': 0.0024, 'grad_norm': 1.296646117740129, 'learning_rate': 5.906832298136646e-07, 'completion_length': 106.00000381469727, 'rewards/accuracy_reward': 0.75, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.0714285746216774, 'kl': 0.0596923828125, 'epoch': 2.05} 41%|████ | 659/1610 [8:26:48<3:38:54, 13.81s/it] 41%|████ | 660/1610 [8:27:03<3:44:53, 14.20s/it] {'loss': 0.0017, 'grad_norm': 1.1336802954794407, 'learning_rate': 5.900621118012422e-07, 'completion_length': 140.85714721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.14838216826319695, 'kl': 0.0418701171875, 'epoch': 2.05} 41%|████ | 660/1610 [8:27:03<3:44:53, 14.20s/it] 41%|████ | 661/1610 [8:27:15<3:32:36, 13.44s/it] {'loss': 0.0017, 'grad_norm': 1.1941785349143017, 'learning_rate': 5.894409937888198e-07, 'completion_length': 97.80357360839844, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.14838216826319695, 'kl': 0.04345703125, 'epoch': 2.05} 41%|████ | 661/1610 [8:27:15<3:32:36, 13.44s/it] 41%|████ | 
662/1610 [8:27:27<3:26:37, 13.08s/it] {'loss': 0.0027, 'grad_norm': 1.5707102970364388, 'learning_rate': 5.888198757763975e-07, 'completion_length': 99.9285774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.1539071798324585, 'kl': 0.0662841796875, 'epoch': 2.06} 41%|████ | 662/1610 [8:27:27<3:26:37, 13.08s/it] 41%|████ | 663/1610 [8:27:39<3:21:09, 12.74s/it] {'loss': 0.0023, 'grad_norm': 1.8683729460176783, 'learning_rate': 5.881987577639751e-07, 'completion_length': 101.10714721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.23086079210042953, 'kl': 0.0582275390625, 'epoch': 2.06} 41%|████ | 663/1610 [8:27:39<3:21:09, 12.74s/it] 41%|████ | 664/1610 [8:27:52<3:19:26, 12.65s/it] {'loss': 0.002, 'grad_norm': 1.8767593287016486, 'learning_rate': 5.875776397515527e-07, 'completion_length': 108.78572082519531, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.2610500305891037, 'kl': 0.0491943359375, 'epoch': 2.06} 41%|████ | 664/1610 [8:27:52<3:19:26, 12.65s/it] 41%|████▏ | 665/1610 [8:28:04<3:15:58, 12.44s/it] {'loss': 0.0018, 'grad_norm': 1.2343220269372164, 'learning_rate': 5.869565217391305e-07, 'completion_length': 98.50000762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.04123930633068085, 'kl': 0.0462646484375, 'epoch': 2.07} 41%|████▏ | 665/1610 [8:28:04<3:15:58, 12.44s/it] 41%|████▏ | 666/1610 [8:28:18<3:22:58, 12.90s/it] {'loss': 0.0031, 'grad_norm': 1.0463608421660935, 'learning_rate': 5.863354037267081e-07, 'completion_length': 121.9285774230957, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.0765380859375, 'epoch': 2.07} 41%|████▏ | 666/1610 
[8:28:18<3:22:58, 12.90s/it] 41%|████▏ | 667/1610 [8:28:31<3:22:21, 12.88s/it] {'loss': 0.0019, 'grad_norm': 1.1992468834303356, 'learning_rate': 5.857142857142857e-07, 'completion_length': 101.8035774230957, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.0474853515625, 'epoch': 2.07} 41%|████▏ | 667/1610 [8:28:31<3:22:21, 12.88s/it] 41%|████▏ | 668/1610 [8:28:45<3:30:58, 13.44s/it] {'loss': 0.0017, 'grad_norm': 1.5468331964080555, 'learning_rate': 5.850931677018634e-07, 'completion_length': 137.87500762939453, 'rewards/accuracy_reward': 0.482142873108387, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.29123931378126144, 'kl': 0.0435791015625, 'epoch': 2.07} 41%|████▏ | 668/1610 [8:28:45<3:30:58, 13.44s/it] 42%|████▏ | 669/1610 [8:28:58<3:28:50, 13.32s/it] {'loss': 0.0022, 'grad_norm': 4.765399827673934, 'learning_rate': 5.84472049689441e-07, 'completion_length': 99.98214721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2006715126335621, 'kl': 0.0560302734375, 'epoch': 2.08} 42%|████▏ | 669/1610 [8:28:58<3:28:50, 13.32s/it] 42%|████▏ | 670/1610 [8:29:13<3:35:04, 13.73s/it] {'loss': 0.0018, 'grad_norm': 2.0086318335896003, 'learning_rate': 5.838509316770186e-07, 'completion_length': 124.62500762939453, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.1428571492433548, 'kl': 0.044921875, 'epoch': 2.08} 42%|████▏ | 670/1610 [8:29:13<3:35:04, 13.73s/it] 42%|████▏ | 671/1610 [8:29:29<3:43:19, 14.27s/it] {'loss': 0.0051, 'grad_norm': 1.5267977766128764, 'learning_rate': 5.832298136645963e-07, 'completion_length': 116.10715103149414, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.18409645557403564, 
'kl': 0.12646484375, 'epoch': 2.08} 42%|████▏ | 671/1610 [8:29:29<3:43:19, 14.27s/it] 42%|████▏ | 672/1610 [8:29:41<3:35:57, 13.81s/it] {'loss': 0.0015, 'grad_norm': 3.457482222845197, 'learning_rate': 5.826086956521739e-07, 'completion_length': 118.26786422729492, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.1539071798324585, 'kl': 0.03759765625, 'epoch': 2.09} 42%|████▏ | 672/1610 [8:29:41<3:35:57, 13.81s/it] 42%|████▏ | 673/1610 [8:29:57<3:43:15, 14.30s/it] {'loss': 0.0054, 'grad_norm': 2.6738704513848868, 'learning_rate': 5.819875776397515e-07, 'completion_length': 149.89286041259766, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2721000760793686, 'kl': 0.13525390625, 'epoch': 2.09} 42%|████▏ | 673/1610 [8:29:57<3:43:15, 14.30s/it] 42%|████▏ | 674/1610 [8:30:13<3:53:42, 14.98s/it] {'loss': 0.0026, 'grad_norm': 1.4333671226414837, 'learning_rate': 5.813664596273293e-07, 'completion_length': 118.0714340209961, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.29123930633068085, 'kl': 0.0638427734375, 'epoch': 2.09} 42%|████▏ | 674/1610 [8:30:13<3:53:42, 14.98s/it] 42%|████▏ | 675/1610 [8:30:26<3:42:58, 14.31s/it] {'loss': 0.002, 'grad_norm': 2.0737769736591254, 'learning_rate': 5.807453416149069e-07, 'completion_length': 105.4464340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.15943220257759094, 'kl': 0.04931640625, 'epoch': 2.1} 42%|████▏ | 675/1610 [8:30:26<3:42:58, 14.31s/it] 42%|████▏ | 676/1610 [8:30:41<3:45:50, 14.51s/it] {'loss': 0.004, 'grad_norm': 1.87700054340091, 'learning_rate': 5.801242236024844e-07, 'completion_length': 133.62500762939453, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 
'reward_std': 0.26657506823539734, 'kl': 0.09912109375, 'epoch': 2.1} 42%|████▏ | 676/1610 [8:30:41<3:45:50, 14.51s/it] 42%|████▏ | 677/1610 [8:30:55<3:43:19, 14.36s/it] {'loss': 0.0016, 'grad_norm': 0.7345889218910313, 'learning_rate': 5.795031055900621e-07, 'completion_length': 109.3214340209961, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.0824786126613617, 'kl': 0.0400390625, 'epoch': 2.1} 42%|████▏ | 677/1610 [8:30:55<3:43:19, 14.36s/it] 42%|████▏ | 678/1610 [8:31:05<3:24:29, 13.16s/it] {'loss': 0.0015, 'grad_norm': 1.7317223986912365, 'learning_rate': 5.788819875776397e-07, 'completion_length': 88.28571701049805, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1785714402794838, 'kl': 0.03717041015625, 'epoch': 2.11} 42%|████▏ | 678/1610 [8:31:05<3:24:29, 13.16s/it] 42%|████▏ | 679/1610 [8:31:19<3:26:42, 13.32s/it] {'loss': 0.0017, 'grad_norm': 2.211305556721473, 'learning_rate': 5.782608695652173e-07, 'completion_length': 127.60715103149414, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.18409645557403564, 'kl': 0.0428466796875, 'epoch': 2.11} 42%|████▏ | 679/1610 [8:31:19<3:26:42, 13.32s/it] 42%|████▏ | 680/1610 [8:31:34<3:35:41, 13.92s/it] {'loss': 0.0024, 'grad_norm': 1.687052706790347, 'learning_rate': 5.77639751552795e-07, 'completion_length': 130.98214721679688, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.14838217198848724, 'kl': 0.059814453125, 'epoch': 2.11} 42%|████▏ | 680/1610 [8:31:34<3:35:41, 13.92s/it] 42%|████▏ | 681/1610 [8:31:47<3:28:58, 13.50s/it] {'loss': 0.0017, 'grad_norm': 1.5871590059374618, 'learning_rate': 5.770186335403726e-07, 'completion_length': 120.75000381469727, 'rewards/accuracy_reward': 0.5178571790456772, 
'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.15943220257759094, 'kl': 0.042236328125, 'epoch': 2.11} 42%|████▏ | 681/1610 [8:31:47<3:28:58, 13.50s/it] 42%|████▏ | 682/1610 [8:31:58<3:19:40, 12.91s/it] {'loss': 0.0041, 'grad_norm': 4.6133707437254525, 'learning_rate': 5.763975155279502e-07, 'completion_length': 96.17857360839844, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.14838215708732605, 'kl': 0.1026611328125, 'epoch': 2.12} 42%|████▏ | 682/1610 [8:31:58<3:19:40, 12.91s/it] 42%|████▏ | 683/1610 [8:32:12<3:24:19, 13.23s/it] {'loss': 0.0018, 'grad_norm': 0.7134773594667997, 'learning_rate': 5.75776397515528e-07, 'completion_length': 126.9285774230957, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266788095235825, 'kl': 0.04443359375, 'epoch': 2.12} 42%|████▏ | 683/1610 [8:32:12<3:24:19, 13.23s/it] 42%|████▏ | 684/1610 [8:32:24<3:16:39, 12.74s/it] {'loss': 0.0019, 'grad_norm': 1.1761869205713265, 'learning_rate': 5.751552795031056e-07, 'completion_length': 97.37500381469727, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.15943220257759094, 'kl': 0.048095703125, 'epoch': 2.12} 42%|████▏ | 684/1610 [8:32:24<3:16:39, 12.74s/it] 43%|████▎ | 685/1610 [8:32:39<3:25:19, 13.32s/it] {'loss': 0.0045, 'grad_norm': 1.8589097922265438, 'learning_rate': 5.745341614906832e-07, 'completion_length': 105.76786041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500380396843, 'kl': 0.11328125, 'epoch': 2.13} 43%|████▎ | 685/1610 [8:32:39<3:25:19, 13.32s/it] 43%|████▎ | 686/1610 [8:32:52<3:25:14, 13.33s/it] {'loss': 0.0018, 'grad_norm': 1.2734544996920936, 'learning_rate': 5.739130434782609e-07, 'completion_length': 
102.73214721679688, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.07695358991622925, 'kl': 0.0455322265625, 'epoch': 2.13} 43%|████▎ | 686/1610 [8:32:52<3:25:14, 13.33s/it] 43%|████▎ | 687/1610 [8:33:04<3:18:53, 12.93s/it] {'loss': 0.0024, 'grad_norm': 3.3883929171692926, 'learning_rate': 5.732919254658385e-07, 'completion_length': 104.5714340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.33800364285707474, 'kl': 0.060546875, 'epoch': 2.13} 43%|████▎ | 687/1610 [8:33:04<3:18:53, 12.93s/it] 43%|████▎ | 688/1610 [8:33:21<3:36:30, 14.09s/it] {'loss': 0.0035, 'grad_norm': 2.4537391856034274, 'learning_rate': 5.726708074534161e-07, 'completion_length': 133.21429061889648, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3750000596046448, 'reward_std': 0.31937122344970703, 'kl': 0.087646484375, 'epoch': 2.14} 43%|████▎ | 688/1610 [8:33:21<3:36:30, 14.09s/it] 43%|████▎ | 689/1610 [8:33:38<3:52:42, 15.16s/it] {'loss': 0.0048, 'grad_norm': 1.5199290686595859, 'learning_rate': 5.720496894409938e-07, 'completion_length': 160.14286041259766, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.2610500454902649, 'kl': 0.120849609375, 'epoch': 2.14} 43%|████▎ | 689/1610 [8:33:38<3:52:42, 15.16s/it] 43%|████▎ | 690/1610 [8:33:53<3:51:38, 15.11s/it] {'loss': 0.0034, 'grad_norm': 2.7699049048318543, 'learning_rate': 5.714285714285714e-07, 'completion_length': 117.53571701049805, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.1428571492433548, 'kl': 0.0855712890625, 'epoch': 2.14} 43%|████▎ | 690/1610 [8:33:53<3:51:38, 15.11s/it] 43%|████▎ | 691/1610 [8:34:08<3:48:27, 14.92s/it] {'loss': 0.0034, 'grad_norm': 1.6952077733711461, 
'learning_rate': 5.70807453416149e-07, 'completion_length': 115.64286041259766, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.1539071872830391, 'kl': 0.0863037109375, 'epoch': 2.15} 43%|████▎ | 691/1610 [8:34:08<3:48:27, 14.92s/it] 43%|████▎ | 692/1610 [8:34:20<3:34:27, 14.02s/it] {'loss': 0.0015, 'grad_norm': 0.81304044570947, 'learning_rate': 5.701863354037268e-07, 'completion_length': 103.01786041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.036376953125, 'epoch': 2.15} 43%|████▎ | 692/1610 [8:34:20<3:34:27, 14.02s/it] 43%|████▎ | 693/1610 [8:34:37<3:49:19, 15.00s/it] {'loss': 0.0063, 'grad_norm': 2.0465787203462393, 'learning_rate': 5.695652173913044e-07, 'completion_length': 133.5357208251953, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4107143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.158203125, 'epoch': 2.15} 43%|████▎ | 693/1610 [8:34:37<3:49:19, 15.00s/it] 43%|████▎ | 694/1610 [8:34:50<3:39:04, 14.35s/it] {'loss': 0.0019, 'grad_norm': 2.2258659798209286, 'learning_rate': 5.68944099378882e-07, 'completion_length': 119.17858123779297, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.19514648616313934, 'kl': 0.0474853515625, 'epoch': 2.16} 43%|████▎ | 694/1610 [8:34:50<3:39:04, 14.35s/it] 43%|████▎ | 695/1610 [8:35:06<3:45:48, 14.81s/it] {'loss': 0.0023, 'grad_norm': 1.3998363762519832, 'learning_rate': 5.683229813664597e-07, 'completion_length': 116.46429061889648, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.0714285746216774, 'kl': 0.05712890625, 'epoch': 2.16} 43%|████▎ | 695/1610 [8:35:06<3:45:48, 14.81s/it] 43%|████▎ | 696/1610 [8:35:20<3:40:57, 
14.51s/it] {'loss': 0.0027, 'grad_norm': 1.070573107133565, 'learning_rate': 5.677018633540373e-07, 'completion_length': 137.0178680419922, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.1181928999722004, 'kl': 0.0682373046875, 'epoch': 2.16} 43%|████▎ | 696/1610 [8:35:20<3:40:57, 14.51s/it] 43%|████▎ | 697/1610 [8:35:35<3:43:21, 14.68s/it] {'loss': 0.0021, 'grad_norm': 2.5244304346658195, 'learning_rate': 5.670807453416149e-07, 'completion_length': 118.35714721679688, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2721000984311104, 'kl': 0.0517578125, 'epoch': 2.16} 43%|████▎ | 697/1610 [8:35:35<3:43:21, 14.68s/it] 43%|████▎ | 698/1610 [8:35:47<3:29:56, 13.81s/it] {'loss': 0.0027, 'grad_norm': 1.6685727137605173, 'learning_rate': 5.664596273291926e-07, 'completion_length': 126.57143020629883, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.19514648616313934, 'kl': 0.068359375, 'epoch': 2.17} 43%|████▎ | 698/1610 [8:35:47<3:29:56, 13.81s/it] 43%|████▎ | 699/1610 [8:35:59<3:25:03, 13.51s/it] {'loss': 0.0021, 'grad_norm': 4.562453178874777, 'learning_rate': 5.658385093167701e-07, 'completion_length': 102.17857360839844, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.0533447265625, 'epoch': 2.17} 43%|████▎ | 699/1610 [8:35:59<3:25:03, 13.51s/it] 43%|████▎ | 700/1610 [8:36:11<3:17:35, 13.03s/it] {'loss': 0.0027, 'grad_norm': 1.6979230739944622, 'learning_rate': 5.652173913043477e-07, 'completion_length': 114.4285774230957, 'rewards/accuracy_reward': 0.6428571790456772, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.26657506823539734, 'kl': 0.067626953125, 'epoch': 2.17} 43%|████▎ | 700/1610 [8:36:11<3:17:35, 13.03s/it] 
44%|████▎ | 701/1610 [8:39:16<16:18:07, 64.56s/it] {'loss': 0.0036, 'grad_norm': 2.136764532133357, 'learning_rate': 5.645962732919255e-07, 'completion_length': 110.87500762939453, 'rewards/accuracy_reward': 0.6250000447034836, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2610500454902649, 'kl': 0.08935546875, 'epoch': 2.18} 44%|████▎ | 701/1610 [8:39:16<16:18:07, 64.56s/it] 44%|████▎ | 702/1610 [8:39:33<12:40:58, 50.28s/it] {'loss': 0.0034, 'grad_norm': 1.353729431346559, 'learning_rate': 5.639751552795031e-07, 'completion_length': 160.23214721679688, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.19514648616313934, 'kl': 0.0863037109375, 'epoch': 2.18} 44%|████▎ | 702/1610 [8:39:33<12:40:58, 50.28s/it] 44%|████▎ | 703/1610 [8:39:49<10:02:50, 39.88s/it] {'loss': 0.0024, 'grad_norm': 2.0130853437882554, 'learning_rate': 5.633540372670807e-07, 'completion_length': 124.35714721679688, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1539071835577488, 'kl': 0.0589599609375, 'epoch': 2.18} 44%|████▎ | 703/1610 [8:39:49<10:02:50, 39.88s/it] 44%|████▎ | 704/1610 [8:40:02<8:03:34, 32.03s/it] {'loss': 0.002, 'grad_norm': 0.9333821243698652, 'learning_rate': 5.627329192546583e-07, 'completion_length': 122.50000762939453, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1539071872830391, 'kl': 0.0506591796875, 'epoch': 2.19} 44%|████▎ | 704/1610 [8:40:02<8:03:34, 32.03s/it] 44%|████▍ | 705/1610 [8:40:17<6:43:48, 26.77s/it] {'loss': 0.0021, 'grad_norm': 3.646185925689395, 'learning_rate': 5.62111801242236e-07, 'completion_length': 115.62500381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1785714365541935, 'kl': 0.0535888671875, 'epoch': 2.19} 
44%|████▍ | 705/1610 [8:40:17<6:43:48, 26.77s/it] 44%|████▍ | 706/1610 [8:40:33<5:56:31, 23.66s/it] {'loss': 0.002, 'grad_norm': 1.301050178448044, 'learning_rate': 5.614906832298136e-07, 'completion_length': 122.14286422729492, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3392857313156128, 'reward_std': 0.15943220257759094, 'kl': 0.049072265625, 'epoch': 2.19} 44%|████▍ | 706/1610 [8:40:33<5:56:31, 23.66s/it] 44%|████▍ | 707/1610 [8:40:47<5:11:17, 20.68s/it] {'loss': 0.0014, 'grad_norm': 1.4031703455741462, 'learning_rate': 5.608695652173912e-07, 'completion_length': 115.4285774230957, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.1896214708685875, 'kl': 0.0345458984375, 'epoch': 2.2} 44%|████▍ | 707/1610 [8:40:47<5:11:17, 20.68s/it] 44%|████▍ | 708/1610 [8:41:02<4:43:42, 18.87s/it] {'loss': 0.002, 'grad_norm': 1.4305746595371052, 'learning_rate': 5.602484472049689e-07, 'completion_length': 124.98215103149414, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.2253357619047165, 'kl': 0.0499267578125, 'epoch': 2.2} 44%|████▍ | 708/1610 [8:41:02<4:43:42, 18.87s/it] 44%|████▍ | 709/1610 [8:41:16<4:21:25, 17.41s/it] {'loss': 0.002, 'grad_norm': 1.5190940596772708, 'learning_rate': 5.596273291925465e-07, 'completion_length': 122.0535774230957, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.18409645557403564, 'kl': 0.0504150390625, 'epoch': 2.2} 44%|████▍ | 709/1610 [8:41:16<4:21:25, 17.41s/it] 44%|████▍ | 710/1610 [8:41:27<3:54:53, 15.66s/it] {'loss': 0.0027, 'grad_norm': 1.0078086389976795, 'learning_rate': 5.590062111801241e-07, 'completion_length': 104.41071701049805, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 
0.1071428619325161, 'kl': 0.06689453125, 'epoch': 2.2} 44%|████▍ | 710/1610 [8:41:27<3:54:53, 15.66s/it] 44%|████▍ | 711/1610 [8:41:40<3:42:54, 14.88s/it] {'loss': 0.0024, 'grad_norm': 1.8756292475676968, 'learning_rate': 5.583850931677019e-07, 'completion_length': 127.21429061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.2721000984311104, 'kl': 0.0604248046875, 'epoch': 2.21} 44%|████▍ | 711/1610 [8:41:40<3:42:54, 14.88s/it] 44%|████▍ | 712/1610 [8:41:53<3:33:47, 14.29s/it] {'loss': 0.0017, 'grad_norm': 0.7046297864956126, 'learning_rate': 5.577639751552795e-07, 'completion_length': 114.87500381469727, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0714285746216774, 'kl': 0.041748046875, 'epoch': 2.21} 44%|████▍ | 712/1610 [8:41:53<3:33:47, 14.29s/it] 44%|████▍ | 713/1610 [8:42:09<3:39:27, 14.68s/it] {'loss': 0.0026, 'grad_norm': 2.1769205188857614, 'learning_rate': 5.571428571428571e-07, 'completion_length': 120.3035774230957, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.1785714402794838, 'kl': 0.06591796875, 'epoch': 2.21} 44%|████▍ | 713/1610 [8:42:09<3:39:27, 14.68s/it] 44%|████▍ | 714/1610 [8:42:21<3:26:45, 13.85s/it] {'loss': 0.0021, 'grad_norm': 2.453283682852455, 'learning_rate': 5.565217391304348e-07, 'completion_length': 90.64286041259766, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.2967643290758133, 'kl': 0.053466796875, 'epoch': 2.22} 44%|████▍ | 714/1610 [8:42:21<3:26:45, 13.85s/it] 44%|████▍ | 715/1610 [8:42:34<3:24:14, 13.69s/it] {'loss': 0.0039, 'grad_norm': 1.3034264075853297, 'learning_rate': 5.559006211180124e-07, 'completion_length': 151.0714340209961, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 
'reward': 1.5357143878936768, 'reward_std': 0.18409645557403564, 'kl': 0.098388671875, 'epoch': 2.22} 44%|████▍ | 715/1610 [8:42:34<3:24:14, 13.69s/it] 44%|████▍ | 716/1610 [8:42:51<3:38:22, 14.66s/it] {'loss': 0.0035, 'grad_norm': 2.2018044874979976, 'learning_rate': 5.5527950310559e-07, 'completion_length': 152.375, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.3138462007045746, 'kl': 0.087158203125, 'epoch': 2.22} 44%|████▍ | 716/1610 [8:42:51<3:38:22, 14.66s/it] 45%|████▍ | 717/1610 [8:43:07<3:42:42, 14.96s/it] {'loss': 0.002, 'grad_norm': 1.3750939838100933, 'learning_rate': 5.546583850931677e-07, 'completion_length': 142.8035774230957, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.04931640625, 'epoch': 2.23} 45%|████▍ | 717/1610 [8:43:07<3:42:42, 14.96s/it] 45%|████▍ | 718/1610 [8:43:22<3:44:45, 15.12s/it] {'loss': 0.0028, 'grad_norm': 1.3679889846381312, 'learning_rate': 5.540372670807453e-07, 'completion_length': 135.96429061889648, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.26657505333423615, 'kl': 0.0693359375, 'epoch': 2.23} 45%|████▍ | 718/1610 [8:43:22<3:44:45, 15.12s/it] 45%|████▍ | 719/1610 [8:43:37<3:44:39, 15.13s/it] {'loss': 0.0024, 'grad_norm': 0.9936690075191721, 'learning_rate': 5.534161490683229e-07, 'completion_length': 127.69643783569336, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1181928962469101, 'kl': 0.0596923828125, 'epoch': 2.23} 45%|████▍ | 719/1610 [8:43:37<3:44:39, 15.13s/it] 45%|████▍ | 720/1610 [8:43:50<3:34:26, 14.46s/it] {'loss': 0.0018, 'grad_norm': 1.7383861958234859, 'learning_rate': 5.527950310559007e-07, 'completion_length': 110.91072082519531, 'rewards/accuracy_reward': 
0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.18409645557403564, 'kl': 0.045654296875, 'epoch': 2.24} 45%|████▍ | 720/1610 [8:43:50<3:34:26, 14.46s/it] 45%|████▍ | 721/1610 [8:44:01<3:20:02, 13.50s/it] {'loss': 0.0017, 'grad_norm': 3.084508830966026, 'learning_rate': 5.521739130434783e-07, 'completion_length': 107.9285774230957, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035714626312256, 'reward_std': 0.1896214708685875, 'kl': 0.042236328125, 'epoch': 2.24} 45%|████▍ | 721/1610 [8:44:01<3:20:02, 13.50s/it] 45%|████▍ | 722/1610 [8:44:17<3:30:47, 14.24s/it] {'loss': 0.0022, 'grad_norm': 0.8651553271681155, 'learning_rate': 5.515527950310559e-07, 'completion_length': 169.42858123779297, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266787722706795, 'kl': 0.0548095703125, 'epoch': 2.24} 45%|████▍ | 722/1610 [8:44:17<3:30:47, 14.24s/it] 45%|████▍ | 723/1610 [8:44:31<3:27:47, 14.06s/it] {'loss': 0.0023, 'grad_norm': 1.9405869311873964, 'learning_rate': 5.509316770186335e-07, 'completion_length': 108.96428680419922, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.0562744140625, 'epoch': 2.25} 45%|████▍ | 723/1610 [8:44:31<3:27:47, 14.06s/it] 45%|████▍ | 724/1610 [8:44:44<3:21:39, 13.66s/it] {'loss': 0.0018, 'grad_norm': 0.5089754329915841, 'learning_rate': 5.503105590062111e-07, 'completion_length': 96.0714340209961, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.0438232421875, 'epoch': 2.25} 45%|████▍ | 724/1610 [8:44:44<3:21:39, 13.66s/it] 45%|████▌ | 725/1610 [8:45:02<3:40:54, 14.98s/it] {'loss': 0.0027, 'grad_norm': 1.273556698148802, 'learning_rate': 5.496894409937887e-07, 'completion_length': 
140.1607208251953, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.2253357470035553, 'kl': 0.0673828125, 'epoch': 2.25} 45%|████▌ | 725/1610 [8:45:02<3:40:54, 14.98s/it] 45%|████▌ | 726/1610 [8:45:12<3:19:57, 13.57s/it] {'loss': 0.0025, 'grad_norm': 1.6666729670163623, 'learning_rate': 5.490683229813664e-07, 'completion_length': 82.60714340209961, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1428571529686451, 'kl': 0.0633544921875, 'epoch': 2.25} 45%|████▌ | 726/1610 [8:45:12<3:19:57, 13.57s/it] 45%|████▌ | 727/1610 [8:45:28<3:31:02, 14.34s/it] {'loss': 0.002, 'grad_norm': 0.9858325920000791, 'learning_rate': 5.48447204968944e-07, 'completion_length': 134.01786041259766, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.19514649361371994, 'kl': 0.049072265625, 'epoch': 2.26} 45%|████▌ | 727/1610 [8:45:28<3:31:02, 14.34s/it] 45%|████▌ | 728/1610 [8:45:41<3:23:39, 13.85s/it] {'loss': 0.002, 'grad_norm': 0.5660856703799046, 'learning_rate': 5.478260869565216e-07, 'completion_length': 121.03572082519531, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0498046875, 'epoch': 2.26} 45%|████▌ | 728/1610 [8:45:41<3:23:39, 13.85s/it] 45%|████▌ | 729/1610 [8:45:56<3:30:35, 14.34s/it] {'loss': 0.0018, 'grad_norm': 0.8542706419746796, 'learning_rate': 5.472049689440994e-07, 'completion_length': 119.4285774230957, 'rewards/accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.11266788095235825, 'kl': 0.0445556640625, 'epoch': 2.26} 45%|████▌ | 729/1610 [8:45:56<3:30:35, 14.34s/it] 45%|████▌ | 730/1610 [8:46:11<3:33:18, 14.54s/it] {'loss': 0.002, 'grad_norm': 1.657616141558973, 'learning_rate': 
5.46583850931677e-07, 'completion_length': 139.33929443359375, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.23086076974868774, 'kl': 0.05078125, 'epoch': 2.27} 45%|████▌ | 730/1610 [8:46:11<3:33:18, 14.54s/it] 45%|████▌ | 731/1610 [8:46:25<3:28:27, 14.23s/it] {'loss': 0.0018, 'grad_norm': 2.2600023025278135, 'learning_rate': 5.459627329192546e-07, 'completion_length': 105.60714721679688, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.21981074661016464, 'kl': 0.04498291015625, 'epoch': 2.27} 45%|████▌ | 731/1610 [8:46:25<3:28:27, 14.23s/it] 45%|████▌ | 732/1610 [8:46:38<3:25:02, 14.01s/it] {'loss': 0.0023, 'grad_norm': 1.8235378014787915, 'learning_rate': 5.453416149068323e-07, 'completion_length': 130.21429061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.24191083014011383, 'kl': 0.0584716796875, 'epoch': 2.27} 45%|████▌ | 732/1610 [8:46:38<3:25:02, 14.01s/it] 46%|████▌ | 733/1610 [8:46:52<3:22:19, 13.84s/it] {'loss': 0.0021, 'grad_norm': 1.0520340613443977, 'learning_rate': 5.447204968944099e-07, 'completion_length': 102.41071701049805, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1181928999722004, 'kl': 0.0535888671875, 'epoch': 2.28} 46%|████▌ | 733/1610 [8:46:52<3:22:19, 13.84s/it] 46%|████▌ | 734/1610 [8:47:05<3:17:24, 13.52s/it] {'loss': 0.0019, 'grad_norm': 1.865092705578448, 'learning_rate': 5.440993788819875e-07, 'completion_length': 99.6964340209961, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.14838216453790665, 'kl': 0.0472412109375, 'epoch': 2.28} 46%|████▌ | 734/1610 [8:47:05<3:17:24, 13.52s/it] 46%|████▌ | 735/1610 [8:47:18<3:17:41, 13.56s/it] {'loss': 0.0016, 
'grad_norm': 1.5881065297585184, 'learning_rate': 5.434782608695652e-07, 'completion_length': 121.46429061889648, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.11266788095235825, 'kl': 0.0406494140625, 'epoch': 2.28} 46%|████▌ | 735/1610 [8:47:18<3:17:41, 13.56s/it] 46%|████▌ | 736/1610 [8:47:33<3:21:14, 13.82s/it] {'loss': 0.0016, 'grad_norm': 1.3733705546617285, 'learning_rate': 5.428571428571428e-07, 'completion_length': 147.1607208251953, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.2253357619047165, 'kl': 0.0394287109375, 'epoch': 2.29} 46%|████▌ | 736/1610 [8:47:33<3:21:14, 13.82s/it] 46%|████▌ | 737/1610 [8:47:46<3:19:42, 13.73s/it] {'loss': 0.0023, 'grad_norm': 0.7561051594432452, 'learning_rate': 5.422360248447204e-07, 'completion_length': 101.5714340209961, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035714626312256, 'reward_std': 0.07695359364151955, 'kl': 0.0582275390625, 'epoch': 2.29} 46%|████▌ | 737/1610 [8:47:46<3:19:42, 13.73s/it] 46%|████▌ | 738/1610 [8:47:59<3:13:34, 13.32s/it] {'loss': 0.0023, 'grad_norm': 0.6477771299943758, 'learning_rate': 5.416149068322982e-07, 'completion_length': 106.0714340209961, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.057861328125, 'epoch': 2.29} 46%|████▌ | 738/1610 [8:47:59<3:13:34, 13.32s/it] 46%|████▌ | 739/1610 [8:48:13<3:19:37, 13.75s/it] {'loss': 0.0026, 'grad_norm': 2.6963518856882684, 'learning_rate': 5.409937888198758e-07, 'completion_length': 120.58929061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.1896214634180069, 'kl': 0.06591796875, 'epoch': 2.3} 46%|████▌ | 739/1610 [8:48:13<3:19:37, 13.75s/it] 46%|████▌ | 740/1610 
[8:48:29<3:28:39, 14.39s/it] {'loss': 0.0018, 'grad_norm': 2.432577420764626, 'learning_rate': 5.403726708074534e-07, 'completion_length': 108.73215103149414, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4107143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.0452880859375, 'epoch': 2.3} 46%|████▌ | 740/1610 [8:48:29<3:28:39, 14.39s/it] 46%|████▌ | 741/1610 [8:48:43<3:27:33, 14.33s/it] {'loss': 0.0016, 'grad_norm': 1.133200381005401, 'learning_rate': 5.397515527950311e-07, 'completion_length': 131.37500762939453, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.0401611328125, 'epoch': 2.3} 46%|████▌ | 741/1610 [8:48:43<3:27:33, 14.33s/it] 46%|████▌ | 742/1610 [8:48:57<3:22:52, 14.02s/it] {'loss': 0.0014, 'grad_norm': 2.6156491476218813, 'learning_rate': 5.391304347826087e-07, 'completion_length': 129.46428680419922, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1181928999722004, 'kl': 0.0360107421875, 'epoch': 2.3} 46%|████▌ | 742/1610 [8:48:57<3:22:52, 14.02s/it] 46%|████▌ | 743/1610 [8:49:12<3:29:27, 14.50s/it] {'loss': 0.0019, 'grad_norm': 0.9571760245550255, 'learning_rate': 5.385093167701863e-07, 'completion_length': 123.82143020629883, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1071428656578064, 'kl': 0.047119140625, 'epoch': 2.31} 46%|████▌ | 743/1610 [8:49:12<3:29:27, 14.50s/it] 46%|████▌ | 744/1610 [8:49:26<3:24:38, 14.18s/it] {'loss': 0.0019, 'grad_norm': 1.177921869043649, 'learning_rate': 5.37888198757764e-07, 'completion_length': 130.64286422729492, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1539071872830391, 'kl': 0.0469970703125, 'epoch': 2.31} 46%|████▌ | 
744/1610 [8:49:26<3:24:38, 14.18s/it] 46%|████▋ | 745/1610 [8:49:42<3:33:24, 14.80s/it] {'loss': 0.0024, 'grad_norm': 1.5149121094117628, 'learning_rate': 5.372670807453416e-07, 'completion_length': 140.62500762939453, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.19514648616313934, 'kl': 0.0587158203125, 'epoch': 2.31} 46%|████▋ | 745/1610 [8:49:42<3:33:24, 14.80s/it] 46%|████▋ | 746/1610 [8:49:56<3:29:57, 14.58s/it] {'loss': 0.0014, 'grad_norm': 1.3178679271554814, 'learning_rate': 5.366459627329191e-07, 'completion_length': 112.30357360839844, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.1071428619325161, 'kl': 0.033935546875, 'epoch': 2.32} 46%|████▋ | 746/1610 [8:49:56<3:29:57, 14.58s/it] 46%|████▋ | 747/1610 [8:50:09<3:21:19, 14.00s/it] {'loss': 0.0017, 'grad_norm': 1.7133160304963237, 'learning_rate': 5.360248447204969e-07, 'completion_length': 109.14286422729492, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1896214634180069, 'kl': 0.0413818359375, 'epoch': 2.32} 46%|████▋ | 747/1610 [8:50:09<3:21:19, 14.00s/it] 46%|████▋ | 748/1610 [8:50:21<3:11:49, 13.35s/it] {'loss': 0.0017, 'grad_norm': 3.4553117106545312, 'learning_rate': 5.354037267080745e-07, 'completion_length': 90.3214340209961, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.15943220257759094, 'kl': 0.0421142578125, 'epoch': 2.32} 46%|████▋ | 748/1610 [8:50:21<3:11:49, 13.35s/it] 47%|████▋ | 749/1610 [8:50:39<3:33:22, 14.87s/it] {'loss': 0.002, 'grad_norm': 1.6390705646372363, 'learning_rate': 5.347826086956521e-07, 'completion_length': 157.83929443359375, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5357143878936768, 'reward_std': 
0.2580091431736946, 'kl': 0.0498046875, 'epoch': 2.33} 47%|████▋ | 749/1610 [8:50:39<3:33:22, 14.87s/it] 47%|████▋ | 750/1610 [8:50:54<3:36:10, 15.08s/it] {'loss': 0.0017, 'grad_norm': 0.6233711551573585, 'learning_rate': 5.341614906832298e-07, 'completion_length': 159.3214340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.0714285746216774, 'kl': 0.04296875, 'epoch': 2.33} 47%|████▋ | 750/1610 [8:50:54<3:36:10, 15.08s/it] 47%|████▋ | 751/1610 [8:51:08<3:28:37, 14.57s/it] {'loss': 0.0018, 'grad_norm': 1.6275881602215747, 'learning_rate': 5.335403726708074e-07, 'completion_length': 98.21429061889648, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.2610500305891037, 'kl': 0.0458984375, 'epoch': 2.33} 47%|████▋ | 751/1610 [8:51:08<3:28:37, 14.57s/it] 47%|████▋ | 752/1610 [8:51:24<3:36:47, 15.16s/it] {'loss': 0.0016, 'grad_norm': 0.900840007749196, 'learning_rate': 5.32919254658385e-07, 'completion_length': 119.64286041259766, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.1539071872830391, 'kl': 0.039306640625, 'epoch': 2.34} 47%|████▋ | 752/1610 [8:51:24<3:36:47, 15.16s/it] 47%|████▋ | 753/1610 [8:51:37<3:27:06, 14.50s/it] {'loss': 0.0021, 'grad_norm': 0.7727107249951113, 'learning_rate': 5.322981366459627e-07, 'completion_length': 114.98215103149414, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0714285746216774, 'kl': 0.0526123046875, 'epoch': 2.34} 47%|████▋ | 753/1610 [8:51:37<3:27:06, 14.50s/it] 47%|████▋ | 754/1610 [8:51:51<3:24:54, 14.36s/it] {'loss': 0.0017, 'grad_norm': 1.3677286789597072, 'learning_rate': 5.316770186335403e-07, 'completion_length': 119.01786041259766, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 
1.6071429252624512, 'reward_std': 0.1539071798324585, 'kl': 0.0433349609375, 'epoch': 2.34} 47%|████▋ | 754/1610 [8:51:51<3:24:54, 14.36s/it] 47%|████▋ | 755/1610 [8:52:09<3:36:54, 15.22s/it] {'loss': 0.0018, 'grad_norm': 0.09956169677028275, 'learning_rate': 5.310559006211179e-07, 'completion_length': 129.64286422729492, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.046142578125, 'epoch': 2.34} 47%|████▋ | 755/1610 [8:52:09<3:36:54, 15.22s/it] 47%|████▋ | 756/1610 [8:52:23<3:32:26, 14.93s/it] {'loss': 0.0023, 'grad_norm': 1.5129050361177978, 'learning_rate': 5.304347826086957e-07, 'completion_length': 135.33928680419922, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.26657506078481674, 'kl': 0.057373046875, 'epoch': 2.35} 47%|████▋ | 756/1610 [8:52:23<3:32:26, 14.93s/it] 47%|████▋ | 757/1610 [8:52:38<3:32:59, 14.98s/it] {'loss': 0.0018, 'grad_norm': 3.7664342887953417, 'learning_rate': 5.298136645962733e-07, 'completion_length': 122.41072082519531, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.1539071835577488, 'kl': 0.0438232421875, 'epoch': 2.35} 47%|████▋ | 757/1610 [8:52:38<3:32:59, 14.98s/it] 47%|████▋ | 758/1610 [8:52:54<3:36:06, 15.22s/it] {'loss': 0.0026, 'grad_norm': 1.5851094826205208, 'learning_rate': 5.291925465838509e-07, 'completion_length': 144.46429443359375, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.33800362050533295, 'kl': 0.0657958984375, 'epoch': 2.35} 47%|████▋ | 758/1610 [8:52:54<3:36:06, 15.22s/it] 47%|████▋ | 759/1610 [8:53:08<3:32:00, 14.95s/it] {'loss': 0.002, 'grad_norm': 1.6182115023417523, 'learning_rate': 5.285714285714286e-07, 'completion_length': 132.08929061889648, 'rewards/accuracy_reward': 0.6250000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1785714402794838, 'kl': 0.051025390625, 'epoch': 2.36} 47%|████▋ | 759/1610 [8:53:08<3:32:00, 14.95s/it] 47%|████▋ | 760/1610 [8:53:21<3:22:45, 14.31s/it] {'loss': 0.0018, 'grad_norm': 1.06033093828921, 'learning_rate': 5.279503105590062e-07, 'completion_length': 113.50000381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.044921875, 'epoch': 2.36} 47%|████▋ | 760/1610 [8:53:21<3:22:45, 14.31s/it] 47%|████▋ | 761/1610 [8:53:36<3:26:27, 14.59s/it] {'loss': 0.0024, 'grad_norm': 1.2143139063033908, 'learning_rate': 5.273291925465838e-07, 'completion_length': 140.55357360839844, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.2610500305891037, 'kl': 0.060302734375, 'epoch': 2.36} 47%|████▋ | 761/1610 [8:53:36<3:26:27, 14.59s/it] 47%|████▋ | 762/1610 [8:53:51<3:28:31, 14.75s/it] {'loss': 0.0039, 'grad_norm': 1.0707004846156205, 'learning_rate': 5.267080745341615e-07, 'completion_length': 121.1785774230957, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.15943220257759094, 'kl': 0.09716796875, 'epoch': 2.37} 47%|████▋ | 762/1610 [8:53:51<3:28:31, 14.75s/it] 47%|████▋ | 763/1610 [8:54:09<3:39:12, 15.53s/it] {'loss': 0.0025, 'grad_norm': 2.4071825527763555, 'learning_rate': 5.260869565217391e-07, 'completion_length': 151.7678680419922, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.2781319245696068, 'kl': 0.0614013671875, 'epoch': 2.37} 47%|████▋ | 763/1610 [8:54:09<3:39:12, 15.53s/it] 47%|████▋ | 764/1610 [8:54:24<3:37:41, 15.44s/it] {'loss': 0.0018, 'grad_norm': 0.6255812576730648, 'learning_rate': 5.254658385093167e-07, 'completion_length': 
127.10714721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.0357142873108387, 'kl': 0.0447998046875, 'epoch': 2.37} 47%|████▋ | 764/1610 [8:54:24<3:37:41, 15.44s/it] 48%|████▊ | 765/1610 [8:54:40<3:40:18, 15.64s/it] {'loss': 0.0028, 'grad_norm': 1.561564100542864, 'learning_rate': 5.248447204968945e-07, 'completion_length': 138.42857360839844, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1785714402794838, 'kl': 0.0692138671875, 'epoch': 2.38} 48%|████▊ | 765/1610 [8:54:40<3:40:18, 15.64s/it] 48%|████▊ | 766/1610 [8:54:57<3:46:20, 16.09s/it] {'loss': 0.004, 'grad_norm': 1.302264034951205, 'learning_rate': 5.242236024844721e-07, 'completion_length': 124.00000381469727, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.17098906636238098, 'kl': 0.101318359375, 'epoch': 2.38} 48%|████▊ | 766/1610 [8:54:57<3:46:20, 16.09s/it] 48%|████▊ | 767/1610 [8:55:14<3:49:21, 16.32s/it] {'loss': 0.0035, 'grad_norm': 1.0483403627068268, 'learning_rate': 5.236024844720497e-07, 'completion_length': 136.37500381469727, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250001192092896, 'reward_std': 0.2006715089082718, 'kl': 0.086669921875, 'epoch': 2.38} 48%|████▊ | 767/1610 [8:55:14<3:49:21, 16.32s/it] 48%|████▊ | 768/1610 [8:55:29<3:43:55, 15.96s/it] {'loss': 0.0032, 'grad_norm': 1.4546115236151236, 'learning_rate': 5.229813664596274e-07, 'completion_length': 127.60715103149414, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1785714402794838, 'kl': 0.0810546875, 'epoch': 2.39} 48%|████▊ | 768/1610 [8:55:29<3:43:55, 15.96s/it] 48%|████▊ | 769/1610 [8:55:43<3:36:38, 15.46s/it] {'loss': 0.0027, 'grad_norm': 
0.9407920326687376, 'learning_rate': 5.22360248447205e-07, 'completion_length': 156.94644165039062, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 1.0, 'reward': 1.3928571939468384, 'reward_std': 0.18409644439816475, 'kl': 0.066650390625, 'epoch': 2.39} 48%|████▊ | 769/1610 [8:55:43<3:36:38, 15.46s/it] 48%|████▊ | 770/1610 [8:55:55<3:20:36, 14.33s/it] {'loss': 0.0024, 'grad_norm': 4.532839423879304, 'learning_rate': 5.217391304347825e-07, 'completion_length': 92.98214721679688, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.2142857313156128, 'kl': 0.0604248046875, 'epoch': 2.39} 48%|████▊ | 770/1610 [8:55:55<3:20:36, 14.33s/it] 48%|████▊ | 771/1610 [8:56:11<3:26:02, 14.74s/it] {'loss': 0.0021, 'grad_norm': 2.5698808918912137, 'learning_rate': 5.211180124223602e-07, 'completion_length': 133.01786041259766, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.23086078464984894, 'kl': 0.0526123046875, 'epoch': 2.39} 48%|████▊ | 771/1610 [8:56:11<3:26:02, 14.74s/it] 48%|████▊ | 772/1610 [8:56:24<3:19:03, 14.25s/it] {'loss': 0.0019, 'grad_norm': 1.160770529639758, 'learning_rate': 5.204968944099378e-07, 'completion_length': 114.58929061889648, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0472412109375, 'epoch': 2.4} 48%|████▊ | 772/1610 [8:56:24<3:19:03, 14.25s/it] 48%|████▊ | 773/1610 [8:56:38<3:17:23, 14.15s/it] {'loss': 0.0018, 'grad_norm': 9.436192975663692, 'learning_rate': 5.198757763975154e-07, 'completion_length': 111.33929061889648, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.21981073170900345, 'kl': 0.0439453125, 'epoch': 2.4} 48%|████▊ | 773/1610 [8:56:38<3:17:23, 14.15s/it] 48%|████▊ | 774/1610 
[8:56:50<3:09:09, 13.58s/it] {'loss': 0.0017, 'grad_norm': 1.5329287854259128, 'learning_rate': 5.192546583850932e-07, 'completion_length': 110.9464340209961, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.19514648616313934, 'kl': 0.0430908203125, 'epoch': 2.4} 48%|████▊ | 774/1610 [8:56:50<3:09:09, 13.58s/it] 48%|████▊ | 775/1610 [8:57:04<3:11:21, 13.75s/it] {'loss': 0.0027, 'grad_norm': 1.6681686087631473, 'learning_rate': 5.186335403726708e-07, 'completion_length': 124.60715103149414, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.21981073915958405, 'kl': 0.0673828125, 'epoch': 2.41} 48%|████▊ | 775/1610 [8:57:04<3:11:21, 13.75s/it] 48%|████▊ | 776/1610 [8:57:19<3:16:33, 14.14s/it] {'loss': 0.0023, 'grad_norm': 0.9557460978350135, 'learning_rate': 5.180124223602484e-07, 'completion_length': 125.16072463989258, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.18409645557403564, 'kl': 0.056884765625, 'epoch': 2.41} 48%|████▊ | 776/1610 [8:57:19<3:16:33, 14.14s/it] 48%|████▊ | 777/1610 [8:57:35<3:21:43, 14.53s/it] {'loss': 0.0045, 'grad_norm': 0.8797102204490028, 'learning_rate': 5.173913043478261e-07, 'completion_length': 120.03571701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.09403547644615173, 'kl': 0.111572265625, 'epoch': 2.41} 48%|████▊ | 777/1610 [8:57:35<3:21:43, 14.53s/it] 48%|████▊ | 778/1610 [8:57:49<3:21:31, 14.53s/it] {'loss': 0.0028, 'grad_norm': 1.5321491935805254, 'learning_rate': 5.167701863354037e-07, 'completion_length': 127.3035774230957, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838216826319695, 'kl': 0.0709228515625, 'epoch': 2.42} 48%|████▊ 
| 778/1610 [8:57:49<3:21:31, 14.53s/it] 48%|████▊ | 779/1610 [8:58:05<3:26:26, 14.91s/it] {'loss': 0.0021, 'grad_norm': 2.920039753124924, 'learning_rate': 5.161490683229813e-07, 'completion_length': 107.60714721679688, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.05322265625, 'epoch': 2.42} 48%|████▊ | 779/1610 [8:58:05<3:26:26, 14.91s/it] 48%|████▊ | 780/1610 [8:58:21<3:31:57, 15.32s/it] {'loss': 0.0017, 'grad_norm': 2.354224717333618, 'learning_rate': 5.15527950310559e-07, 'completion_length': 139.25000381469727, 'rewards/accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.21124479547142982, 'kl': 0.0421142578125, 'epoch': 2.42} 48%|████▊ | 780/1610 [8:58:21<3:31:57, 15.32s/it] 49%|████▊ | 781/1610 [8:58:31<3:07:45, 13.59s/it] {'loss': 0.0019, 'grad_norm': 1.225106182053518, 'learning_rate': 5.149068322981366e-07, 'completion_length': 90.78571701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.0469970703125, 'epoch': 2.43} 49%|████▊ | 781/1610 [8:58:31<3:07:45, 13.59s/it] 49%|████▊ | 782/1610 [8:58:49<3:25:47, 14.91s/it] {'loss': 0.0019, 'grad_norm': 5.84769357423679, 'learning_rate': 5.142857142857142e-07, 'completion_length': 117.4285774230957, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.2580091208219528, 'kl': 0.0469970703125, 'epoch': 2.43} 49%|████▊ | 782/1610 [8:58:49<3:25:47, 14.91s/it] 49%|████▊ | 783/1610 [8:59:03<3:21:32, 14.62s/it] {'loss': 0.0019, 'grad_norm': 0.6915696791302148, 'learning_rate': 5.13664596273292e-07, 'completion_length': 114.6964340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 
0.07695359364151955, 'kl': 0.0487060546875, 'epoch': 2.43} 49%|████▊ | 783/1610 [8:59:03<3:21:32, 14.62s/it] 49%|████▊ | 784/1610 [8:59:15<3:12:29, 13.98s/it] {'loss': 0.0022, 'grad_norm': 1.1066809158536945, 'learning_rate': 5.130434782608696e-07, 'completion_length': 105.21429061889648, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.14838217198848724, 'kl': 0.0555419921875, 'epoch': 2.43} 49%|████▊ | 784/1610 [8:59:15<3:12:29, 13.98s/it] 49%|████▉ | 785/1610 [8:59:29<3:12:57, 14.03s/it] {'loss': 0.0036, 'grad_norm': 1.4262132976402997, 'learning_rate': 5.124223602484472e-07, 'completion_length': 116.64286422729492, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1896214708685875, 'kl': 0.0897216796875, 'epoch': 2.44} 49%|████▉ | 785/1610 [8:59:29<3:12:57, 14.03s/it] 49%|████▉ | 786/1610 [8:59:43<3:11:08, 13.92s/it] {'loss': 0.0022, 'grad_norm': 1.5554583491956346, 'learning_rate': 5.118012422360249e-07, 'completion_length': 135.2857208251953, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.23086078464984894, 'kl': 0.0546875, 'epoch': 2.44} 49%|████▉ | 786/1610 [8:59:43<3:11:08, 13.92s/it] 49%|████▉ | 787/1610 [8:59:59<3:19:44, 14.56s/it] {'loss': 0.0028, 'grad_norm': 1.3617851878662768, 'learning_rate': 5.111801242236025e-07, 'completion_length': 133.4285774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.0694580078125, 'epoch': 2.44} 49%|████▉ | 787/1610 [8:59:59<3:19:44, 14.56s/it] 49%|████▉ | 788/1610 [9:00:15<3:25:25, 15.00s/it] {'loss': 0.0039, 'grad_norm': 1.4932147818240769, 'learning_rate': 5.105590062111801e-07, 'completion_length': 130.46429443359375, 'rewards/accuracy_reward': 0.4642857313156128, 
'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.11266787722706795, 'kl': 0.09814453125, 'epoch': 2.45} 49%|████▉ | 788/1610 [9:00:15<3:25:25, 15.00s/it] 49%|████▉ | 789/1610 [9:00:26<3:07:55, 13.73s/it] {'loss': 0.0022, 'grad_norm': 1.7044242851987095, 'learning_rate': 5.099378881987578e-07, 'completion_length': 99.85714721679688, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.15943220257759094, 'kl': 0.054443359375, 'epoch': 2.45} 49%|████▉ | 789/1610 [9:00:26<3:07:55, 13.73s/it] 49%|████▉ | 790/1610 [9:00:40<3:10:28, 13.94s/it] {'loss': 0.0041, 'grad_norm': 1.5178181031743714, 'learning_rate': 5.093167701863354e-07, 'completion_length': 127.41072082519531, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.1896214708685875, 'kl': 0.1029052734375, 'epoch': 2.45} 49%|████▉ | 790/1610 [9:00:40<3:10:28, 13.94s/it] 49%|████▉ | 791/1610 [9:00:53<3:06:02, 13.63s/it] {'loss': 0.0043, 'grad_norm': 1.414834993598954, 'learning_rate': 5.08695652173913e-07, 'completion_length': 118.83929061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.2253357619047165, 'kl': 0.10791015625, 'epoch': 2.46} 49%|████▉ | 791/1610 [9:00:53<3:06:02, 13.63s/it] 49%|████▉ | 792/1610 [9:01:07<3:07:57, 13.79s/it] {'loss': 0.0022, 'grad_norm': 0.7916022273530533, 'learning_rate': 5.080745341614908e-07, 'completion_length': 129.7857208251953, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0546875, 'epoch': 2.46} 49%|████▉ | 792/1610 [9:01:07<3:07:57, 13.79s/it] 49%|████▉ | 793/1610 [9:01:23<3:16:44, 14.45s/it] {'loss': 0.0029, 'grad_norm': 1.3294787981178189, 'learning_rate': 5.074534161490684e-07, 'completion_length': 
138.4464340209961, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.18409645557403564, 'kl': 0.0716552734375, 'epoch': 2.46} 49%|████▉ | 793/1610 [9:01:23<3:16:44, 14.45s/it] 49%|████▉ | 794/1610 [9:01:39<3:23:06, 14.93s/it] {'loss': 0.0055, 'grad_norm': 0.8992468907948115, 'learning_rate': 5.068322981366459e-07, 'completion_length': 131.01786041259766, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1181928962469101, 'kl': 0.1376953125, 'epoch': 2.47} 49%|████▉ | 794/1610 [9:01:39<3:23:06, 14.93s/it] 49%|████▉ | 795/1610 [9:01:54<3:21:34, 14.84s/it] {'loss': 0.0081, 'grad_norm': 1.3826515062864577, 'learning_rate': 5.062111801242235e-07, 'completion_length': 156.25000762939453, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.3392857313156128, 'reward_std': 0.23086077719926834, 'kl': 0.203857421875, 'epoch': 2.47} 49%|████▉ | 795/1610 [9:01:54<3:21:34, 14.84s/it] 49%|████▉ | 796/1610 [9:02:11<3:27:50, 15.32s/it] {'loss': 0.0033, 'grad_norm': 0.9707560391860346, 'learning_rate': 5.055900621118012e-07, 'completion_length': 135.89286041259766, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857313156128, 'reward_std': 0.1428571529686451, 'kl': 0.08251953125, 'epoch': 2.47} 49%|████▉ | 796/1610 [9:02:11<3:27:50, 15.32s/it] 50%|████▉ | 797/1610 [9:02:26<3:26:27, 15.24s/it] {'loss': 0.0037, 'grad_norm': 1.5306405167748838, 'learning_rate': 5.049689440993788e-07, 'completion_length': 108.76786041259766, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857909202576, 'reward_std': 0.10410194471478462, 'kl': 0.09326171875, 'epoch': 2.48} 50%|████▉ | 797/1610 [9:02:26<3:26:27, 15.24s/it] 50%|████▉ | 798/1610 [9:02:44<3:40:26, 16.29s/it] {'loss': 0.0048, 'grad_norm': 
1.8977571342329782, 'learning_rate': 5.043478260869564e-07, 'completion_length': 172.35714721679688, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.3495604991912842, 'kl': 0.120849609375, 'epoch': 2.48} 50%|████▉ | 798/1610 [9:02:44<3:40:26, 16.29s/it] 50%|████▉ | 799/1610 [9:03:03<3:49:30, 16.98s/it] {'loss': 0.0103, 'grad_norm': 1.8363235321000286, 'learning_rate': 5.037267080745341e-07, 'completion_length': 163.46428680419922, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.2881983816623688, 'kl': 0.2568359375, 'epoch': 2.48} 50%|████▉ | 799/1610 [9:03:03<3:49:30, 16.98s/it] 50%|████▉ | 800/1610 [9:03:22<3:58:06, 17.64s/it] {'loss': 0.0051, 'grad_norm': 1.2889016124317503, 'learning_rate': 5.031055900621117e-07, 'completion_length': 167.67858123779297, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535715222358704, 'reward_std': 0.23086077719926834, 'kl': 0.12646484375, 'epoch': 2.48} 50%|████▉ | 800/1610 [9:03:22<3:58:06, 17.64s/it] 50%|████▉ | 801/1610 [9:06:29<15:24:36, 68.57s/it] {'loss': 0.0021, 'grad_norm': 13.731212566330656, 'learning_rate': 5.024844720496894e-07, 'completion_length': 117.64286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.05322265625, 'epoch': 2.49} 50%|████▉ | 801/1610 [9:06:29<15:24:36, 68.57s/it] 50%|████▉ | 802/1610 [9:06:43<11:41:59, 52.13s/it] {'loss': 0.0072, 'grad_norm': 2.8352878818249834, 'learning_rate': 5.018633540372671e-07, 'completion_length': 114.3035774230957, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.1896214671432972, 'kl': 0.17919921875, 'epoch': 2.49} 50%|████▉ | 802/1610 
[9:06:43<11:41:59, 52.13s/it] 50%|████▉ | 803/1610 [9:07:01<9:20:46, 41.69s/it] {'loss': 0.0056, 'grad_norm': 5.958829634459467, 'learning_rate': 5.012422360248447e-07, 'completion_length': 143.1785774230957, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.1428571529686451, 'kl': 0.13916015625, 'epoch': 2.49} 50%|████▉ | 803/1610 [9:07:01<9:20:46, 41.69s/it] 50%|████▉ | 804/1610 [9:07:16<7:34:49, 33.86s/it] {'loss': 0.0056, 'grad_norm': 1.51181565780349, 'learning_rate': 5.006211180124223e-07, 'completion_length': 157.50000762939453, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1785714402794838, 'kl': 0.140869140625, 'epoch': 2.5} 50%|████▉ | 804/1610 [9:07:16<7:34:49, 33.86s/it] 50%|█████ | 805/1610 [9:07:29<6:08:17, 27.45s/it] {'loss': 0.0045, 'grad_norm': 1.8675432697305245, 'learning_rate': 5e-07, 'completion_length': 110.87500762939453, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857313156128, 'reward_std': 0.25552502274513245, 'kl': 0.11279296875, 'epoch': 2.5} 50%|█████ | 805/1610 [9:07:29<6:08:17, 27.45s/it] 50%|█████ | 806/1610 [9:07:44<5:20:07, 23.89s/it] {'loss': 0.0074, 'grad_norm': 1.613238244932594, 'learning_rate': 4.993788819875776e-07, 'completion_length': 155.67857360839844, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250000596046448, 'reward_std': 0.1896214708685875, 'kl': 0.184326171875, 'epoch': 2.5} 50%|█████ | 806/1610 [9:07:44<5:20:07, 23.89s/it] 50%|█████ | 807/1610 [9:08:00<4:47:48, 21.50s/it] {'loss': 0.0078, 'grad_norm': 1.6794708070509636, 'learning_rate': 4.987577639751552e-07, 'completion_length': 133.89286041259766, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.18409645557403564, 'kl': 
0.1962890625, 'epoch': 2.51} 50%|█████ | 807/1610 [9:08:00<4:47:48, 21.50s/it] 50%|█████ | 808/1610 [9:08:15<4:20:23, 19.48s/it] {'loss': 0.0031, 'grad_norm': 0.8666811381901502, 'learning_rate': 4.981366459627329e-07, 'completion_length': 149.05357360839844, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.11266788095235825, 'kl': 0.078125, 'epoch': 2.51} 50%|█████ | 808/1610 [9:08:15<4:20:23, 19.48s/it] 50%|█████ | 809/1610 [9:08:34<4:18:17, 19.35s/it] {'loss': 0.0103, 'grad_norm': 3.1932073254438387, 'learning_rate': 4.975155279503105e-07, 'completion_length': 149.44644165039062, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5357143878936768, 'reward_std': 0.39880891144275665, 'kl': 0.2578125, 'epoch': 2.51} 50%|█████ | 809/1610 [9:08:34<4:18:17, 19.35s/it] 50%|█████ | 810/1610 [9:08:54<4:20:00, 19.50s/it] {'loss': 0.0133, 'grad_norm': 1.728383791834142, 'learning_rate': 4.968944099378881e-07, 'completion_length': 164.5357208251953, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 'reward_std': 0.3294376879930496, 'kl': 0.3310546875, 'epoch': 2.52} 50%|█████ | 810/1610 [9:08:54<4:20:00, 19.50s/it] 50%|█████ | 811/1610 [9:09:09<4:03:37, 18.30s/it] {'loss': 0.0022, 'grad_norm': 2.9723376757234616, 'learning_rate': 4.962732919254658e-07, 'completion_length': 124.75000381469727, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.18409645557403564, 'kl': 0.0562744140625, 'epoch': 2.52} 50%|█████ | 811/1610 [9:09:09<4:03:37, 18.30s/it] 50%|█████ | 812/1610 [9:09:28<4:05:27, 18.46s/it] {'loss': 0.0059, 'grad_norm': 0.9099962648159866, 'learning_rate': 4.956521739130435e-07, 'completion_length': 129.6607208251953, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.1071428619325161, 'kl': 0.147216796875, 'epoch': 2.52} 50%|█████ | 812/1610 [9:09:28<4:05:27, 18.46s/it] 50%|█████ | 813/1610 [9:09:43<3:52:01, 17.47s/it] {'loss': 0.0069, 'grad_norm': 0.7404008518963227, 'learning_rate': 4.950310559006211e-07, 'completion_length': 154.75000762939453, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.173095703125, 'epoch': 2.52} 50%|█████ | 813/1610 [9:09:43<3:52:01, 17.47s/it] 51%|█████ | 814/1610 [9:10:02<3:56:36, 17.84s/it] {'loss': 0.0063, 'grad_norm': 1.1882851695143055, 'learning_rate': 4.944099378881988e-07, 'completion_length': 167.05357360839844, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266788095235825, 'kl': 0.157958984375, 'epoch': 2.53} 51%|█████ | 814/1610 [9:10:02<3:56:36, 17.84s/it] 51%|█████ | 815/1610 [9:10:16<3:39:44, 16.58s/it] {'loss': 0.002, 'grad_norm': 0.984189572581243, 'learning_rate': 4.937888198757764e-07, 'completion_length': 141.5714340209961, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1539071798324585, 'kl': 0.050048828125, 'epoch': 2.53} 51%|█████ | 815/1610 [9:10:16<3:39:44, 16.58s/it] 51%|█████ | 816/1610 [9:10:30<3:31:12, 15.96s/it] {'loss': 0.0024, 'grad_norm': 1.584928907056938, 'learning_rate': 4.93167701863354e-07, 'completion_length': 109.53571701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1428571529686451, 'kl': 0.0594482421875, 'epoch': 2.53} 51%|█████ | 816/1610 [9:10:30<3:31:12, 15.96s/it] 51%|█████ | 817/1610 [9:10:44<3:22:39, 15.33s/it] {'loss': 0.0072, 'grad_norm': 2.963496587505281, 'learning_rate': 4.925465838509317e-07, 'completion_length': 128.92857360839844, 
'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.178955078125, 'epoch': 2.54} 51%|█████ | 817/1610 [9:10:44<3:22:39, 15.33s/it] 51%|█████ | 818/1610 [9:10:57<3:13:06, 14.63s/it] {'loss': 0.0025, 'grad_norm': 1.8373322757778079, 'learning_rate': 4.919254658385093e-07, 'completion_length': 100.4464340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.0625, 'epoch': 2.54} 51%|█████ | 818/1610 [9:10:57<3:13:06, 14.63s/it] 51%|█████ | 819/1610 [9:11:16<3:32:01, 16.08s/it] {'loss': 0.0025, 'grad_norm': 1.596508694303788, 'learning_rate': 4.913043478260869e-07, 'completion_length': 168.4464340209961, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5714285969734192, 'reward_std': 0.25552502274513245, 'kl': 0.0621337890625, 'epoch': 2.54} 51%|█████ | 819/1610 [9:11:16<3:32:01, 16.08s/it] 51%|█████ | 820/1610 [9:11:32<3:27:40, 15.77s/it] {'loss': 0.0025, 'grad_norm': 0.7209895098424204, 'learning_rate': 4.906832298136646e-07, 'completion_length': 145.80358123779297, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.0714285746216774, 'kl': 0.0625, 'epoch': 2.55} 51%|█████ | 820/1610 [9:11:32<3:27:40, 15.77s/it] 51%|█████ | 821/1610 [9:11:52<3:45:56, 17.18s/it] {'loss': 0.0065, 'grad_norm': 2.1127062315187968, 'learning_rate': 4.900621118012422e-07, 'completion_length': 161.6964340209961, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4285714626312256, 'reward_std': 0.3294377028942108, 'kl': 0.16357421875, 'epoch': 2.55} 51%|█████ | 821/1610 [9:11:52<3:45:56, 17.18s/it] 51%|█████ | 822/1610 [9:12:06<3:31:08, 16.08s/it] {'loss': 0.0017, 'grad_norm': 2.544559384129079, 'learning_rate': 
4.894409937888198e-07, 'completion_length': 119.48214721679688, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.11266787722706795, 'kl': 0.041259765625, 'epoch': 2.55} 51%|█████ | 822/1610 [9:12:06<3:31:08, 16.08s/it] 51%|█████ | 823/1610 [9:12:20<3:23:52, 15.54s/it] {'loss': 0.0053, 'grad_norm': 1.691595279016639, 'learning_rate': 4.888198757763975e-07, 'completion_length': 108.21429061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2253357656300068, 'kl': 0.1324462890625, 'epoch': 2.56} 51%|█████ | 823/1610 [9:12:20<3:23:52, 15.54s/it] 51%|█████ | 824/1610 [9:12:36<3:28:03, 15.88s/it] {'loss': 0.0055, 'grad_norm': 1.2096612390057304, 'learning_rate': 4.881987577639751e-07, 'completion_length': 190.92857360839844, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.136962890625, 'epoch': 2.56} 51%|█████ | 824/1610 [9:12:36<3:28:03, 15.88s/it] 51%|█████ | 825/1610 [9:12:52<3:24:26, 15.63s/it] {'loss': 0.0051, 'grad_norm': 1.4437844246755467, 'learning_rate': 4.875776397515527e-07, 'completion_length': 119.8214340209961, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.25248411297798157, 'kl': 0.127197265625, 'epoch': 2.56} 51%|█████ | 825/1610 [9:12:52<3:24:26, 15.63s/it] 51%|█████▏ | 826/1610 [9:13:07<3:23:12, 15.55s/it] {'loss': 0.0067, 'grad_norm': 1.0843921048353953, 'learning_rate': 4.869565217391305e-07, 'completion_length': 104.26786041259766, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.571428656578064, 'reward_std': 0.1649572253227234, 'kl': 0.167724609375, 'epoch': 2.57} 51%|█████▏ | 826/1610 [9:13:07<3:23:12, 15.55s/it] 51%|█████▏ | 827/1610 
[9:13:22<3:22:09, 15.49s/it] {'loss': 0.0036, 'grad_norm': 1.5753322895521604, 'learning_rate': 4.863354037267081e-07, 'completion_length': 146.9107208251953, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.25552502647042274, 'kl': 0.0909423828125, 'epoch': 2.57} 51%|█████▏ | 827/1610 [9:13:22<3:22:09, 15.49s/it] 51%|█████▏ | 828/1610 [9:13:37<3:19:27, 15.30s/it] {'loss': 0.0038, 'grad_norm': 0.920784945334175, 'learning_rate': 4.857142857142857e-07, 'completion_length': 117.3214340209961, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.094482421875, 'epoch': 2.57} 51%|█████▏ | 828/1610 [9:13:37<3:19:27, 15.30s/it] 51%|█████▏ | 829/1610 [9:13:57<3:37:28, 16.71s/it] {'loss': 0.007, 'grad_norm': 1.0215222036551397, 'learning_rate': 4.850931677018633e-07, 'completion_length': 176.1607208251953, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.13527478277683258, 'kl': 0.1748046875, 'epoch': 2.57} 51%|█████▏ | 829/1610 [9:13:57<3:37:28, 16.71s/it] 52%|█████▏ | 830/1610 [9:14:16<3:43:59, 17.23s/it] {'loss': 0.0103, 'grad_norm': 2.0538427212804153, 'learning_rate': 4.84472049689441e-07, 'completion_length': 166.1964340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.3495604991912842, 'kl': 0.2568359375, 'epoch': 2.58} 52%|█████▏ | 830/1610 [9:14:16<3:43:59, 17.23s/it] 52%|█████▏ | 831/1610 [9:14:31<3:35:45, 16.62s/it] {'loss': 0.0024, 'grad_norm': 1.6126177530216261, 'learning_rate': 4.838509316770186e-07, 'completion_length': 133.67857360839844, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.060546875, 'epoch': 2.58} 
52%|█████▏ | 831/1610 [9:14:31<3:35:45, 16.62s/it] 52%|█████▏ | 832/1610 [9:14:45<3:27:59, 16.04s/it] {'loss': 0.0041, 'grad_norm': 1.1801121960950047, 'learning_rate': 4.832298136645963e-07, 'completion_length': 124.28571701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.21981073170900345, 'kl': 0.10205078125, 'epoch': 2.58} 52%|█████▏ | 832/1610 [9:14:45<3:27:59, 16.04s/it] 52%|█████▏ | 833/1610 [9:15:04<3:37:42, 16.81s/it] {'loss': 0.0062, 'grad_norm': 1.1089397694090841, 'learning_rate': 4.826086956521739e-07, 'completion_length': 136.8928680419922, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.13981622830033302, 'kl': 0.155029296875, 'epoch': 2.59} 52%|█████▏ | 833/1610 [9:15:04<3:37:42, 16.81s/it] 52%|█████▏ | 834/1610 [9:15:19<3:30:31, 16.28s/it] {'loss': 0.0062, 'grad_norm': 1.6292901228563845, 'learning_rate': 4.819875776397515e-07, 'completion_length': 141.0357208251953, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.0714285746216774, 'kl': 0.1552734375, 'epoch': 2.59} 52%|█████▏ | 834/1610 [9:15:19<3:30:31, 16.28s/it] 52%|█████▏ | 835/1610 [9:15:38<3:39:07, 16.96s/it] {'loss': 0.0151, 'grad_norm': 1.7224993577573828, 'learning_rate': 4.813664596273292e-07, 'completion_length': 164.5357208251953, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4464285969734192, 'reward_std': 0.21981073915958405, 'kl': 0.37841796875, 'epoch': 2.59} 52%|█████▏ | 835/1610 [9:15:38<3:39:07, 16.96s/it] 52%|█████▏ | 836/1610 [9:15:56<3:44:50, 17.43s/it] {'loss': 0.0109, 'grad_norm': 1.2762668373877428, 'learning_rate': 4.807453416149068e-07, 'completion_length': 146.83929443359375, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.13527478277683258, 'kl': 0.271484375, 'epoch': 2.6} 52%|█████▏ | 836/1610 [9:15:56<3:44:50, 17.43s/it] 52%|█████▏ | 837/1610 [9:16:13<3:42:31, 17.27s/it] {'loss': 0.0089, 'grad_norm': 1.8987064958830584, 'learning_rate': 4.801242236024844e-07, 'completion_length': 108.66072082519531, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.23086078464984894, 'kl': 0.22314453125, 'epoch': 2.6} 52%|█████▏ | 837/1610 [9:16:13<3:42:31, 17.27s/it] 52%|█████▏ | 838/1610 [9:16:31<3:46:01, 17.57s/it] {'loss': 0.0175, 'grad_norm': 2.22922225688835, 'learning_rate': 4.795031055900621e-07, 'completion_length': 159.5178680419922, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535714328289032, 'reward_std': 0.1428571529686451, 'kl': 0.43896484375, 'epoch': 2.6} 52%|█████▏ | 838/1610 [9:16:31<3:46:01, 17.57s/it] 52%|█████▏ | 839/1610 [9:16:47<3:36:47, 16.87s/it] {'loss': 0.0031, 'grad_norm': 1.3357472175885279, 'learning_rate': 4.788819875776398e-07, 'completion_length': 128.46429443359375, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.07695358991622925, 'kl': 0.0782470703125, 'epoch': 2.61} 52%|█████▏ | 839/1610 [9:16:47<3:36:47, 16.87s/it] 52%|█████▏ | 840/1610 [9:17:01<3:27:40, 16.18s/it] {'loss': 0.0027, 'grad_norm': 13.954744064210592, 'learning_rate': 4.782608695652174e-07, 'completion_length': 125.17858123779297, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.0660400390625, 'epoch': 2.61} 52%|█████▏ | 840/1610 [9:17:01<3:27:40, 16.18s/it] 52%|█████▏ | 841/1610 [9:17:13<3:11:39, 14.95s/it] {'loss': 0.0032, 'grad_norm': 1.6458184512030918, 'learning_rate': 4.77639751552795e-07, 'completion_length': 
104.66072082519531, 'rewards/accuracy_reward': 0.6428571790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.14838216453790665, 'kl': 0.0784912109375, 'epoch': 2.61} 52%|█████▏ | 841/1610 [9:17:13<3:11:39, 14.95s/it] 52%|█████▏ | 842/1610 [9:17:28<3:10:25, 14.88s/it] {'loss': 0.0036, 'grad_norm': 6.1543596480550775, 'learning_rate': 4.770186335403726e-07, 'completion_length': 135.78571701049805, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1785714402794838, 'kl': 0.0902099609375, 'epoch': 2.61} 52%|█████▏ | 842/1610 [9:17:28<3:10:25, 14.88s/it] 52%|█████▏ | 843/1610 [9:17:46<3:21:06, 15.73s/it] {'loss': 0.008, 'grad_norm': 10.083229208280049, 'learning_rate': 4.763975155279503e-07, 'completion_length': 151.42858123779297, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4821429252624512, 'reward_std': 0.31937122344970703, 'kl': 0.200927734375, 'epoch': 2.62} 52%|█████▏ | 843/1610 [9:17:46<3:21:06, 15.73s/it] 52%|█████▏ | 844/1610 [9:18:00<3:16:16, 15.37s/it] {'loss': 0.002, 'grad_norm': 1.390844926413403, 'learning_rate': 4.7577639751552796e-07, 'completion_length': 106.83929061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.0496826171875, 'epoch': 2.62} 52%|█████▏ | 844/1610 [9:18:00<3:16:16, 15.37s/it] 52%|█████▏ | 845/1610 [9:18:14<3:11:29, 15.02s/it] {'loss': 0.0019, 'grad_norm': 0.8298236058133196, 'learning_rate': 4.751552795031056e-07, 'completion_length': 116.44643783569336, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.0474853515625, 'epoch': 2.62} 52%|█████▏ | 845/1610 [9:18:14<3:11:29, 15.02s/it] 53%|█████▎ | 846/1610 [9:18:27<3:03:26, 14.41s/it] 
{'loss': 0.0021, 'grad_norm': 1.7833258982547462, 'learning_rate': 4.7453416149068323e-07, 'completion_length': 122.03572082519531, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.0714285746216774, 'kl': 0.0518798828125, 'epoch': 2.63} 53%|█████▎ | 846/1610 [9:18:27<3:03:26, 14.41s/it] 53%|█████▎ | 847/1610 [9:18:43<3:08:59, 14.86s/it] {'loss': 0.0049, 'grad_norm': 5.910805791719298, 'learning_rate': 4.739130434782608e-07, 'completion_length': 137.26786041259766, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.1539071835577488, 'kl': 0.1220703125, 'epoch': 2.63} 53%|█████▎ | 847/1610 [9:18:43<3:08:59, 14.86s/it] 53%|█████▎ | 848/1610 [9:18:58<3:07:16, 14.75s/it] {'loss': 0.0028, 'grad_norm': 1.5098769483993042, 'learning_rate': 4.732919254658385e-07, 'completion_length': 135.33929061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.5714285969734192, 'reward_std': 0.19514648616313934, 'kl': 0.0687255859375, 'epoch': 2.63} 53%|█████▎ | 848/1610 [9:18:58<3:07:16, 14.75s/it] 53%|█████▎ | 849/1610 [9:19:11<3:02:12, 14.37s/it] {'loss': 0.0022, 'grad_norm': 2.587704206557348, 'learning_rate': 4.7267080745341613e-07, 'completion_length': 127.33929443359375, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.2142857238650322, 'kl': 0.055419921875, 'epoch': 2.64} 53%|█████▎ | 849/1610 [9:19:11<3:02:12, 14.37s/it] 53%|█████▎ | 850/1610 [9:19:29<3:16:08, 15.48s/it] {'loss': 0.0048, 'grad_norm': 2.9953904056122336, 'learning_rate': 4.7204968944099376e-07, 'completion_length': 155.0714340209961, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6785715222358704, 'reward_std': 0.24241764098405838, 'kl': 0.1192626953125, 'epoch': 2.64} 53%|█████▎ | 850/1610 
[9:19:29<3:16:08, 15.48s/it] 53%|█████▎ | 851/1610 [9:19:44<3:13:50, 15.32s/it] {'loss': 0.0019, 'grad_norm': 1.0995847378567254, 'learning_rate': 4.714285714285714e-07, 'completion_length': 106.10714721679688, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.04833984375, 'epoch': 2.64} 53%|█████▎ | 851/1610 [9:19:44<3:13:50, 15.32s/it] 53%|█████▎ | 852/1610 [9:19:58<3:06:00, 14.72s/it] {'loss': 0.0019, 'grad_norm': 1.3656369779204682, 'learning_rate': 4.70807453416149e-07, 'completion_length': 126.46429061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.07695359364151955, 'kl': 0.048095703125, 'epoch': 2.65} 53%|█████▎ | 852/1610 [9:19:58<3:06:00, 14.72s/it] 53%|█████▎ | 853/1610 [9:20:13<3:07:36, 14.87s/it] {'loss': 0.0021, 'grad_norm': 3.481885648342102, 'learning_rate': 4.701863354037267e-07, 'completion_length': 104.62500381469727, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.2500000149011612, 'kl': 0.0513916015625, 'epoch': 2.65} 53%|█████▎ | 853/1610 [9:20:13<3:07:36, 14.87s/it] 53%|█████▎ | 854/1610 [9:20:26<2:59:13, 14.22s/it] {'loss': 0.0021, 'grad_norm': 1.2415322001052287, 'learning_rate': 4.6956521739130434e-07, 'completion_length': 124.03572082519531, 'rewards/accuracy_reward': 0.321428582072258, 'rewards/format_reward': 1.0, 'reward': 1.321428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.0517578125, 'epoch': 2.65} 53%|█████▎ | 854/1610 [9:20:26<2:59:13, 14.22s/it] 53%|█████▎ | 855/1610 [9:20:38<2:50:32, 13.55s/it] {'loss': 0.0018, 'grad_norm': 1.3835937990442007, 'learning_rate': 4.68944099378882e-07, 'completion_length': 102.58929061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409644439816475, 'kl': 
0.0460205078125, 'epoch': 2.66} 53%|█████▎ | 855/1610 [9:20:38<2:50:32, 13.55s/it] 53%|█████▎ | 856/1610 [9:20:48<2:39:35, 12.70s/it] {'loss': 0.0016, 'grad_norm': 1.9711032815593446, 'learning_rate': 4.683229813664596e-07, 'completion_length': 99.01786422729492, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.0390625, 'epoch': 2.66} 53%|█████▎ | 856/1610 [9:20:48<2:39:35, 12.70s/it] 53%|█████▎ | 857/1610 [9:21:03<2:48:16, 13.41s/it] {'loss': 0.0022, 'grad_norm': 0.3171885000628985, 'learning_rate': 4.6770186335403724e-07, 'completion_length': 135.0714340209961, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.05419921875, 'epoch': 2.66} 53%|█████▎ | 857/1610 [9:21:03<2:48:16, 13.41s/it] 53%|█████▎ | 858/1610 [9:21:18<2:54:17, 13.91s/it] {'loss': 0.0026, 'grad_norm': 3.4863623956350356, 'learning_rate': 4.670807453416149e-07, 'completion_length': 138.55357360839844, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.21981073170900345, 'kl': 0.0643310546875, 'epoch': 2.66} 53%|█████▎ | 858/1610 [9:21:18<2:54:17, 13.91s/it] 53%|█████▎ | 859/1610 [9:21:34<2:59:46, 14.36s/it] {'loss': 0.0023, 'grad_norm': 1.8720564982994998, 'learning_rate': 4.664596273291925e-07, 'completion_length': 129.0357208251953, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.23086077719926834, 'kl': 0.0576171875, 'epoch': 2.67} 53%|█████▎ | 859/1610 [9:21:34<2:59:46, 14.36s/it] 53%|█████▎ | 860/1610 [9:21:49<3:03:44, 14.70s/it] {'loss': 0.0059, 'grad_norm': 1.6969062793711867, 'learning_rate': 4.6583850931677014e-07, 'completion_length': 124.8214340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 
'reward': 1.4821429252624512, 'reward_std': 0.2937234044075012, 'kl': 0.14794921875, 'epoch': 2.67} 53%|█████▎ | 860/1610 [9:21:49<3:03:44, 14.70s/it] 53%|█████▎ | 861/1610 [9:22:04<3:03:10, 14.67s/it] {'loss': 0.0026, 'grad_norm': 1.425814605529147, 'learning_rate': 4.6521739130434777e-07, 'completion_length': 109.8035774230957, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857313156128, 'reward_std': 0.11266787722706795, 'kl': 0.0655517578125, 'epoch': 2.67} 53%|█████▎ | 861/1610 [9:22:04<3:03:10, 14.67s/it] 54%|█████▎ | 862/1610 [9:22:22<3:14:20, 15.59s/it] {'loss': 0.0022, 'grad_norm': 1.0112726583569862, 'learning_rate': 4.6459627329192546e-07, 'completion_length': 119.89286041259766, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.1785714365541935, 'kl': 0.055419921875, 'epoch': 2.68} 54%|█████▎ | 862/1610 [9:22:22<3:14:20, 15.59s/it] 54%|█████▎ | 863/1610 [9:22:37<3:12:03, 15.43s/it] {'loss': 0.0015, 'grad_norm': 1.3690073893471444, 'learning_rate': 4.639751552795031e-07, 'completion_length': 122.50000762939453, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266787722706795, 'kl': 0.037109375, 'epoch': 2.68} 54%|█████▎ | 863/1610 [9:22:37<3:12:03, 15.43s/it] 54%|█████▎ | 864/1610 [9:22:50<3:05:23, 14.91s/it] {'loss': 0.0017, 'grad_norm': 1.3010272229064368, 'learning_rate': 4.633540372670807e-07, 'completion_length': 101.64286422729492, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.041259765625, 'epoch': 2.68} 54%|█████▎ | 864/1610 [9:22:50<3:05:23, 14.91s/it] 54%|█████▎ | 865/1610 [9:23:07<3:13:04, 15.55s/it] {'loss': 0.002, 'grad_norm': 1.6674197705446219, 'learning_rate': 4.6273291925465835e-07, 'completion_length': 124.25000381469727, 
'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.2610500305891037, 'kl': 0.0506591796875, 'epoch': 2.69} 54%|█████▎ | 865/1610 [9:23:07<3:13:04, 15.55s/it] 54%|█████▍ | 866/1610 [9:23:22<3:09:40, 15.30s/it] {'loss': 0.0017, 'grad_norm': 0.9473072990529187, 'learning_rate': 4.62111801242236e-07, 'completion_length': 128.4107208251953, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1071428619325161, 'kl': 0.0426025390625, 'epoch': 2.69} 54%|█████▍ | 866/1610 [9:23:22<3:09:40, 15.30s/it] 54%|█████▍ | 867/1610 [9:23:37<3:07:51, 15.17s/it] {'loss': 0.0025, 'grad_norm': 3.07510011601301, 'learning_rate': 4.6149068322981367e-07, 'completion_length': 105.9285774230957, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.2253357619047165, 'kl': 0.0634765625, 'epoch': 2.69} 54%|█████▍ | 867/1610 [9:23:37<3:07:51, 15.17s/it] 54%|█████▍ | 868/1610 [9:23:56<3:22:14, 16.35s/it] {'loss': 0.0064, 'grad_norm': 1.6028086822176801, 'learning_rate': 4.608695652173913e-07, 'completion_length': 143.55357360839844, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5000000596046448, 'reward_std': 0.2142857238650322, 'kl': 0.1605224609375, 'epoch': 2.7} 54%|█████▍ | 868/1610 [9:23:56<3:22:14, 16.35s/it] 54%|█████▍ | 869/1610 [9:24:09<3:10:53, 15.46s/it] {'loss': 0.002, 'grad_norm': 1.8763195136005284, 'learning_rate': 4.6024844720496894e-07, 'completion_length': 125.28571701049805, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.0511474609375, 'epoch': 2.7} 54%|█████▍ | 869/1610 [9:24:09<3:10:53, 15.46s/it] 54%|█████▍ | 870/1610 [9:24:23<3:05:14, 15.02s/it] {'loss': 0.0029, 'grad_norm': 
1.6038767865370793, 'learning_rate': 4.596273291925465e-07, 'completion_length': 102.4285774230957, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.19514648616313934, 'kl': 0.0726318359375, 'epoch': 2.7} 54%|█████▍ | 870/1610 [9:24:23<3:05:14, 15.02s/it] 54%|█████▍ | 871/1610 [9:24:35<2:53:07, 14.06s/it] {'loss': 0.0013, 'grad_norm': 2.496579594361731, 'learning_rate': 4.590062111801242e-07, 'completion_length': 97.41072082519531, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.2610500529408455, 'kl': 0.03369140625, 'epoch': 2.7} 54%|█████▍ | 871/1610 [9:24:35<2:53:07, 14.06s/it] 54%|█████▍ | 872/1610 [9:24:48<2:49:07, 13.75s/it] {'loss': 0.0025, 'grad_norm': 0.7781122475186133, 'learning_rate': 4.5838509316770183e-07, 'completion_length': 112.14286422729492, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.0714285746216774, 'kl': 0.0611572265625, 'epoch': 2.71} 54%|█████▍ | 872/1610 [9:24:48<2:49:07, 13.75s/it] 54%|█████▍ | 873/1610 [9:25:04<2:57:21, 14.44s/it] {'loss': 0.004, 'grad_norm': 3.859736493946111, 'learning_rate': 4.5776397515527947e-07, 'completion_length': 124.66072082519531, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.2610500380396843, 'kl': 0.0989990234375, 'epoch': 2.71} 54%|█████▍ | 873/1610 [9:25:04<2:57:21, 14.44s/it] 54%|█████▍ | 874/1610 [9:25:18<2:52:35, 14.07s/it] {'loss': 0.0019, 'grad_norm': 1.3306308514490088, 'learning_rate': 4.571428571428571e-07, 'completion_length': 102.96429061889648, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.2253357544541359, 'kl': 0.0479736328125, 'epoch': 2.71} 54%|█████▍ | 874/1610 [9:25:18<2:52:35, 14.07s/it] 54%|█████▍ 
| 875/1610 [9:25:29<2:43:09, 13.32s/it] {'loss': 0.002, 'grad_norm': 1.6459595294106046, 'learning_rate': 4.5652173913043473e-07, 'completion_length': 85.07143020629883, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1896214671432972, 'kl': 0.0498046875, 'epoch': 2.72} 54%|█████▍ | 875/1610 [9:25:29<2:43:09, 13.32s/it] 54%|█████▍ | 876/1610 [9:25:45<2:52:33, 14.10s/it] {'loss': 0.0048, 'grad_norm': 0.9722808794952262, 'learning_rate': 4.559006211180124e-07, 'completion_length': 130.4464340209961, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.07695359364151955, 'kl': 0.119384765625, 'epoch': 2.72} 54%|█████▍ | 876/1610 [9:25:45<2:52:33, 14.10s/it] 54%|█████▍ | 877/1610 [9:25:57<2:45:01, 13.51s/it] {'loss': 0.0013, 'grad_norm': 1.414181835141683, 'learning_rate': 4.5527950310559005e-07, 'completion_length': 94.82143020629883, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.14838217198848724, 'kl': 0.03369140625, 'epoch': 2.72} 54%|█████▍ | 877/1610 [9:25:57<2:45:01, 13.51s/it] 55%|█████▍ | 878/1610 [9:26:09<2:40:22, 13.15s/it] {'loss': 0.0032, 'grad_norm': 1.955801873947124, 'learning_rate': 4.546583850931677e-07, 'completion_length': 109.96429061889648, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.33800362050533295, 'kl': 0.078857421875, 'epoch': 2.73} 55%|█████▍ | 878/1610 [9:26:09<2:40:22, 13.15s/it] 55%|█████▍ | 879/1610 [9:26:26<2:54:11, 14.30s/it] {'loss': 0.0062, 'grad_norm': 1.0229084406136773, 'learning_rate': 4.540372670807453e-07, 'completion_length': 126.89286422729492, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.16546404734253883, 'kl': 0.15380859375, 'epoch': 2.73} 55%|█████▍ | 
879/1610 [9:26:26<2:54:11, 14.30s/it] 55%|█████▍ | 880/1610 [9:26:39<2:46:49, 13.71s/it] {'loss': 0.0018, 'grad_norm': 1.558751480603917, 'learning_rate': 4.53416149068323e-07, 'completion_length': 118.41072082519531, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.1539071798324585, 'kl': 0.043701171875, 'epoch': 2.73} 55%|█████▍ | 880/1610 [9:26:39<2:46:49, 13.71s/it] 55%|█████▍ | 881/1610 [9:26:56<2:59:40, 14.79s/it] {'loss': 0.003, 'grad_norm': 0.9681057664465951, 'learning_rate': 4.5279503105590063e-07, 'completion_length': 152.62500762939453, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 1.0, 'reward': 1.5000001192092896, 'reward_std': 0.1539071872830391, 'kl': 0.074951171875, 'epoch': 2.74} 55%|█████▍ | 881/1610 [9:26:56<2:59:40, 14.79s/it] 55%|█████▍ | 882/1610 [9:27:11<3:01:34, 14.97s/it] {'loss': 0.0056, 'grad_norm': 2.5060467592210136, 'learning_rate': 4.521739130434782e-07, 'completion_length': 119.78571701049805, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.1539071835577488, 'kl': 0.140625, 'epoch': 2.74} 55%|█████▍ | 882/1610 [9:27:11<3:01:34, 14.97s/it] 55%|█████▍ | 883/1610 [9:27:27<3:03:38, 15.16s/it] {'loss': 0.0033, 'grad_norm': 0.7397342441469407, 'learning_rate': 4.5155279503105585e-07, 'completion_length': 128.57143020629883, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.1539071872830391, 'kl': 0.083251953125, 'epoch': 2.74} 55%|█████▍ | 883/1610 [9:27:27<3:03:38, 15.16s/it] 55%|█████▍ | 884/1610 [9:27:42<3:02:54, 15.12s/it] {'loss': 0.0048, 'grad_norm': 2.1115602465163263, 'learning_rate': 4.509316770186335e-07, 'completion_length': 139.21429443359375, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.1428571492433548, 'kl': 0.12109375, 
'epoch': 2.75} 55%|█████▍ | 884/1610 [9:27:42<3:02:54, 15.12s/it] 55%|█████▍ | 885/1610 [9:27:57<3:00:52, 14.97s/it] {'loss': 0.0027, 'grad_norm': 2.0817564297258317, 'learning_rate': 4.5031055900621116e-07, 'completion_length': 113.51786041259766, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.2142857238650322, 'kl': 0.0672607421875, 'epoch': 2.75} 55%|█████▍ | 885/1610 [9:27:57<3:00:52, 14.97s/it] 55%|█████▌ | 886/1610 [9:28:11<2:57:43, 14.73s/it] {'loss': 0.0043, 'grad_norm': 0.7058839951411132, 'learning_rate': 4.496894409937888e-07, 'completion_length': 127.60715103149414, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.07695358991622925, 'kl': 0.1083984375, 'epoch': 2.75} 55%|█████▌ | 886/1610 [9:28:11<2:57:43, 14.73s/it] 55%|█████▌ | 887/1610 [9:28:24<2:49:53, 14.10s/it] {'loss': 0.0022, 'grad_norm': 1.884556376166681, 'learning_rate': 4.4906832298136643e-07, 'completion_length': 109.14286041259766, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214783191681, 'kl': 0.0543212890625, 'epoch': 2.75} 55%|█████▌ | 887/1610 [9:28:24<2:49:53, 14.10s/it] 55%|█████▌ | 888/1610 [9:28:36<2:44:48, 13.70s/it] {'loss': 0.0025, 'grad_norm': 2.3946303690991346, 'learning_rate': 4.4844720496894406e-07, 'completion_length': 119.33929061889648, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.11266788095235825, 'kl': 0.0616455078125, 'epoch': 2.76} 55%|█████▌ | 888/1610 [9:28:36<2:44:48, 13.70s/it] 55%|█████▌ | 889/1610 [9:28:51<2:49:14, 14.08s/it] {'loss': 0.0022, 'grad_norm': 1.0916667311375154, 'learning_rate': 4.4782608695652175e-07, 'completion_length': 124.50000762939453, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 
1.8035715222358704, 'reward_std': 0.1181928962469101, 'kl': 0.0548095703125, 'epoch': 2.76} 55%|█████▌ | 889/1610 [9:28:51<2:49:14, 14.08s/it] 55%|█████▌ | 890/1610 [9:29:07<2:56:32, 14.71s/it] {'loss': 0.0028, 'grad_norm': 1.8677155651522668, 'learning_rate': 4.472049689440994e-07, 'completion_length': 122.1785774230957, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.18409645557403564, 'kl': 0.0703125, 'epoch': 2.76} 55%|█████▌ | 890/1610 [9:29:07<2:56:32, 14.71s/it] 55%|█████▌ | 891/1610 [9:29:19<2:43:56, 13.68s/it] {'loss': 0.002, 'grad_norm': 1.504390737038129, 'learning_rate': 4.46583850931677e-07, 'completion_length': 95.83929061889648, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035714626312256, 'reward_std': 0.1896214708685875, 'kl': 0.05126953125, 'epoch': 2.77} 55%|█████▌ | 891/1610 [9:29:19<2:43:56, 13.68s/it] 55%|█████▌ | 892/1610 [9:29:33<2:47:15, 13.98s/it] {'loss': 0.003, 'grad_norm': 1.1094082059998687, 'learning_rate': 4.4596273291925464e-07, 'completion_length': 112.1964340209961, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.073974609375, 'epoch': 2.77} 55%|█████▌ | 892/1610 [9:29:33<2:47:15, 13.98s/it] 55%|█████▌ | 893/1610 [9:29:48<2:49:56, 14.22s/it] {'loss': 0.0087, 'grad_norm': 1.3205948213440584, 'learning_rate': 4.453416149068323e-07, 'completion_length': 133.23214721679688, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.2186279296875, 'epoch': 2.77} 55%|█████▌ | 893/1610 [9:29:48<2:49:56, 14.22s/it] 56%|█████▌ | 894/1610 [9:30:03<2:51:24, 14.36s/it] {'loss': 0.0024, 'grad_norm': 1.8247424588216619, 'learning_rate': 4.447204968944099e-07, 'completion_length': 121.98214721679688, 'rewards/accuracy_reward': 
0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.2253357470035553, 'kl': 0.0606689453125, 'epoch': 2.78} 56%|█████▌ | 894/1610 [9:30:03<2:51:24, 14.36s/it] 56%|█████▌ | 895/1610 [9:30:14<2:40:20, 13.45s/it] {'loss': 0.0022, 'grad_norm': 0.9796846819882619, 'learning_rate': 4.4409937888198754e-07, 'completion_length': 84.78572082519531, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.0357142873108387, 'kl': 0.0555419921875, 'epoch': 2.78} 56%|█████▌ | 895/1610 [9:30:14<2:40:20, 13.45s/it] 56%|█████▌ | 896/1610 [9:30:29<2:44:21, 13.81s/it] {'loss': 0.0017, 'grad_norm': 0.8288704885120125, 'learning_rate': 4.434782608695652e-07, 'completion_length': 108.0714340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.0357142873108387, 'kl': 0.04150390625, 'epoch': 2.78} 56%|█████▌ | 896/1610 [9:30:29<2:44:21, 13.81s/it] 56%|█████▌ | 897/1610 [9:30:44<2:47:43, 14.11s/it] {'loss': 0.0026, 'grad_norm': 1.741858964788862, 'learning_rate': 4.428571428571428e-07, 'completion_length': 131.05358123779297, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 1.0, 'reward': 1.446428656578064, 'reward_std': 0.2610500380396843, 'kl': 0.0660400390625, 'epoch': 2.79} 56%|█████▌ | 897/1610 [9:30:44<2:47:43, 14.11s/it] 56%|█████▌ | 898/1610 [9:30:56<2:42:08, 13.66s/it] {'loss': 0.0018, 'grad_norm': 0.9959774364797785, 'learning_rate': 4.422360248447205e-07, 'completion_length': 104.62500381469727, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.0452880859375, 'epoch': 2.79} 56%|█████▌ | 898/1610 [9:30:56<2:42:08, 13.66s/it] 56%|█████▌ | 899/1610 [9:31:11<2:45:25, 13.96s/it] {'loss': 0.0022, 'grad_norm': 1.9905404269783489, 'learning_rate': 4.416149068322981e-07, 
'completion_length': 124.00000381469727, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0714285746216774, 'kl': 0.0550537109375, 'epoch': 2.79} 56%|█████▌ | 899/1610 [9:31:11<2:45:25, 13.96s/it] 56%|█████▌ | 900/1610 [9:31:27<2:52:40, 14.59s/it] {'loss': 0.0075, 'grad_norm': 2.4259740603860798, 'learning_rate': 4.4099378881987576e-07, 'completion_length': 117.00000762939453, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.2610500380396843, 'kl': 0.18701171875, 'epoch': 2.8} 56%|█████▌ | 900/1610 [9:31:27<2:52:40, 14.59s/it] 56%|█████▌ | 901/1610 [9:34:50<14:01:06, 71.18s/it] {'loss': 0.0088, 'grad_norm': 1.8949732066541507, 'learning_rate': 4.403726708074534e-07, 'completion_length': 143.05357360839844, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.17651408910751343, 'kl': 0.220703125, 'epoch': 2.8} 56%|█████▌ | 901/1610 [9:34:50<14:01:06, 71.18s/it] 56%|█████▌ | 902/1610 [9:35:08<10:50:16, 55.11s/it] {'loss': 0.0051, 'grad_norm': 1.7737266820106874, 'learning_rate': 4.39751552795031e-07, 'completion_length': 138.89286041259766, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.1896214708685875, 'kl': 0.12841796875, 'epoch': 2.8} 56%|█████▌ | 902/1610 [9:35:08<10:50:16, 55.11s/it] 56%|█████▌ | 903/1610 [9:35:18<8:10:35, 41.63s/it] {'loss': 0.002, 'grad_norm': 0.9723499170523815, 'learning_rate': 4.391304347826087e-07, 'completion_length': 90.51786041259766, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266788095235825, 'kl': 0.05078125, 'epoch': 2.8} 56%|█████▌ | 903/1610 [9:35:18<8:10:35, 41.63s/it] 56%|█████▌ | 904/1610 [9:35:31<6:28:34, 33.02s/it] 
{'loss': 0.0032, 'grad_norm': 1.2813608315809213, 'learning_rate': 4.3850931677018634e-07, 'completion_length': 132.0535774230957, 'rewards/accuracy_reward': 0.6250000447034836, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1181928999722004, 'kl': 0.0802001953125, 'epoch': 2.81} 56%|█████▌ | 904/1610 [9:35:31<6:28:34, 33.02s/it] 56%|█████▌ | 905/1610 [9:35:47<5:29:47, 28.07s/it] {'loss': 0.0073, 'grad_norm': 1.8476056292547103, 'learning_rate': 4.3788819875776397e-07, 'completion_length': 116.33929061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.26657506078481674, 'kl': 0.1826171875, 'epoch': 2.81} 56%|█████▌ | 905/1610 [9:35:47<5:29:47, 28.07s/it] 56%|█████▋ | 906/1610 [9:36:03<4:44:49, 24.27s/it] {'loss': 0.0035, 'grad_norm': 2.3453059515115213, 'learning_rate': 4.3726708074534155e-07, 'completion_length': 127.0714340209961, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.2610500380396843, 'kl': 0.087890625, 'epoch': 2.81} 56%|█████▋ | 906/1610 [9:36:03<4:44:49, 24.27s/it] 56%|█████▋ | 907/1610 [9:36:18<4:13:30, 21.64s/it] {'loss': 0.0057, 'grad_norm': 1.1766754001832425, 'learning_rate': 4.3664596273291924e-07, 'completion_length': 140.5357208251953, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.1416015625, 'epoch': 2.82} 56%|█████▋ | 907/1610 [9:36:18<4:13:30, 21.64s/it] 56%|█████▋ | 908/1610 [9:36:32<3:44:30, 19.19s/it] {'loss': 0.0021, 'grad_norm': 0.5883574271460527, 'learning_rate': 4.3602484472049687e-07, 'completion_length': 118.98214340209961, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.0357142873108387, 'kl': 0.0528564453125, 'epoch': 2.82} 56%|█████▋ | 
908/1610 [9:36:32<3:44:30, 19.19s/it] 56%|█████▋ | 909/1610 [9:36:46<3:27:11, 17.73s/it] {'loss': 0.0039, 'grad_norm': 1.988185483871449, 'learning_rate': 4.354037267080745e-07, 'completion_length': 131.69643783569336, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 1.0, 'reward': 1.3750000596046448, 'reward_std': 0.2610500454902649, 'kl': 0.09619140625, 'epoch': 2.82} 56%|█████▋ | 909/1610 [9:36:46<3:27:11, 17.73s/it] 57%|█████▋ | 910/1610 [9:36:59<3:09:29, 16.24s/it] {'loss': 0.0026, 'grad_norm': 0.842422680244412, 'learning_rate': 4.3478260869565214e-07, 'completion_length': 117.4464340209961, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.0657958984375, 'epoch': 2.83} 57%|█████▋ | 910/1610 [9:36:59<3:09:29, 16.24s/it] 57%|█████▋ | 911/1610 [9:37:13<3:00:34, 15.50s/it] {'loss': 0.0032, 'grad_norm': 2.0556491383315705, 'learning_rate': 4.3416149068322977e-07, 'completion_length': 124.58929061889648, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1181928999722004, 'kl': 0.0797119140625, 'epoch': 2.83} 57%|█████▋ | 911/1610 [9:37:13<3:00:34, 15.50s/it] 57%|█████▋ | 912/1610 [9:37:25<2:50:42, 14.67s/it] {'loss': 0.003, 'grad_norm': 1.1631402986016148, 'learning_rate': 4.3354037267080745e-07, 'completion_length': 102.00000381469727, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.14838215708732605, 'kl': 0.074951171875, 'epoch': 2.83} 57%|█████▋ | 912/1610 [9:37:25<2:50:42, 14.67s/it] 57%|█████▋ | 913/1610 [9:37:37<2:40:59, 13.86s/it] {'loss': 0.0018, 'grad_norm': 0.8701682731108782, 'learning_rate': 4.329192546583851e-07, 'completion_length': 107.83929061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 
0.11266788095235825, 'kl': 0.0445556640625, 'epoch': 2.84} 57%|█████▋ | 913/1610 [9:37:37<2:40:59, 13.86s/it] 57%|█████▋ | 914/1610 [9:37:50<2:37:55, 13.61s/it] {'loss': 0.0026, 'grad_norm': 1.376082435082348, 'learning_rate': 4.322981366459627e-07, 'completion_length': 106.25000381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.1428571492433548, 'kl': 0.06396484375, 'epoch': 2.84} 57%|█████▋ | 914/1610 [9:37:50<2:37:55, 13.61s/it] 57%|█████▋ | 915/1610 [9:38:07<2:48:03, 14.51s/it] {'loss': 0.0084, 'grad_norm': 1.2937090168097034, 'learning_rate': 4.3167701863354035e-07, 'completion_length': 145.67858123779297, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535714626312256, 'reward_std': 0.21981073170900345, 'kl': 0.2110595703125, 'epoch': 2.84} 57%|█████▋ | 915/1610 [9:38:07<2:48:03, 14.51s/it] 57%|█████▋ | 916/1610 [9:38:21<2:47:00, 14.44s/it] {'loss': 0.0015, 'grad_norm': 1.9863005637738447, 'learning_rate': 4.3105590062111804e-07, 'completion_length': 85.26786041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266787722706795, 'kl': 0.0382080078125, 'epoch': 2.84} 57%|█████▋ | 916/1610 [9:38:21<2:47:00, 14.44s/it] 57%|█████▋ | 917/1610 [9:38:35<2:43:05, 14.12s/it] {'loss': 0.0024, 'grad_norm': 1.5981056022632103, 'learning_rate': 4.3043478260869567e-07, 'completion_length': 127.1785774230957, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.25552502274513245, 'kl': 0.060302734375, 'epoch': 2.85} 57%|█████▋ | 917/1610 [9:38:35<2:43:05, 14.12s/it] 57%|█████▋ | 918/1610 [9:38:48<2:41:39, 14.02s/it] {'loss': 0.0019, 'grad_norm': 1.1792266236046258, 'learning_rate': 4.2981366459627325e-07, 'completion_length': 124.08928680419922, 'rewards/accuracy_reward': 
0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1181928999722004, 'kl': 0.046142578125, 'epoch': 2.85} 57%|█████▋ | 918/1610 [9:38:48<2:41:39, 14.02s/it] 57%|█████▋ | 919/1610 [9:39:02<2:38:09, 13.73s/it] {'loss': 0.0031, 'grad_norm': 1.0179313740268463, 'learning_rate': 4.291925465838509e-07, 'completion_length': 118.4285774230957, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.078369140625, 'epoch': 2.85} 57%|█████▋ | 919/1610 [9:39:02<2:38:09, 13.73s/it] 57%|█████▋ | 920/1610 [9:39:14<2:34:57, 13.47s/it] {'loss': 0.0021, 'grad_norm': 1.5407142284045483, 'learning_rate': 4.285714285714285e-07, 'completion_length': 105.82143020629883, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.18409644439816475, 'kl': 0.052978515625, 'epoch': 2.86} 57%|█████▋ | 920/1610 [9:39:14<2:34:57, 13.47s/it] 57%|█████▋ | 921/1610 [9:39:27<2:31:00, 13.15s/it] {'loss': 0.0036, 'grad_norm': 1.2062195432112885, 'learning_rate': 4.279503105590062e-07, 'completion_length': 100.01786041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.089111328125, 'epoch': 2.86} 57%|█████▋ | 921/1610 [9:39:27<2:31:00, 13.15s/it] 57%|█████▋ | 922/1610 [9:39:40<2:30:11, 13.10s/it] {'loss': 0.002, 'grad_norm': 1.5108444333074462, 'learning_rate': 4.2732919254658383e-07, 'completion_length': 119.35714721679688, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.14838216453790665, 'kl': 0.050537109375, 'epoch': 2.86} 57%|█████▋ | 922/1610 [9:39:40<2:30:11, 13.10s/it] 57%|█████▋ | 923/1610 [9:39:53<2:31:11, 13.20s/it] {'loss': 0.0025, 'grad_norm': 1.3645367807294477, 'learning_rate': 4.2670807453416146e-07, 
'completion_length': 129.62500381469727, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1896214634180069, 'kl': 0.062744140625, 'epoch': 2.87} 57%|█████▋ | 923/1610 [9:39:53<2:31:11, 13.20s/it] 57%|█████▋ | 924/1610 [9:40:09<2:38:04, 13.83s/it] {'loss': 0.0019, 'grad_norm': 1.6162434282411628, 'learning_rate': 4.260869565217391e-07, 'completion_length': 135.89286041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.19514648616313934, 'kl': 0.047607421875, 'epoch': 2.87} 57%|█████▋ | 924/1610 [9:40:09<2:38:04, 13.83s/it] 57%|█████▋ | 925/1610 [9:40:20<2:30:29, 13.18s/it] {'loss': 0.0015, 'grad_norm': 8.297921449895245, 'learning_rate': 4.254658385093168e-07, 'completion_length': 91.37500381469727, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 1.0, 'reward': 1.3928572535514832, 'reward_std': 0.2142857238650322, 'kl': 0.0377197265625, 'epoch': 2.87} 57%|█████▋ | 925/1610 [9:40:20<2:30:29, 13.18s/it] 58%|█████▊ | 926/1610 [9:40:37<2:40:57, 14.12s/it] {'loss': 0.0019, 'grad_norm': 2.534012627258087, 'learning_rate': 4.248447204968944e-07, 'completion_length': 117.41072463989258, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.26657508313655853, 'kl': 0.0482177734375, 'epoch': 2.88} 58%|█████▊ | 926/1610 [9:40:37<2:40:57, 14.12s/it] 58%|█████▊ | 927/1610 [9:40:52<2:45:05, 14.50s/it] {'loss': 0.0024, 'grad_norm': 1.611928771294561, 'learning_rate': 4.2422360248447205e-07, 'completion_length': 113.33929443359375, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.060791015625, 'epoch': 2.88} 58%|█████▊ | 927/1610 [9:40:52<2:45:05, 14.50s/it] 58%|█████▊ | 928/1610 [9:41:05<2:41:11, 14.18s/it] {'loss': 0.0023, 'grad_norm': 
0.8677159057502164, 'learning_rate': 4.236024844720497e-07, 'completion_length': 133.46429061889648, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0565185546875, 'epoch': 2.88} 58%|█████▊ | 928/1610 [9:41:05<2:41:11, 14.18s/it] 58%|█████▊ | 929/1610 [9:41:18<2:36:09, 13.76s/it] {'loss': 0.0021, 'grad_norm': 1.8020707159732732, 'learning_rate': 4.229813664596273e-07, 'completion_length': 107.50000381469727, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.0357142873108387, 'kl': 0.051513671875, 'epoch': 2.89} 58%|█████▊ | 929/1610 [9:41:18<2:36:09, 13.76s/it] 58%|█████▊ | 930/1610 [9:41:30<2:30:02, 13.24s/it] {'loss': 0.0024, 'grad_norm': 1.3582676143384, 'learning_rate': 4.2236024844720495e-07, 'completion_length': 120.23214340209961, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.11266788095235825, 'kl': 0.060302734375, 'epoch': 2.89} 58%|█████▊ | 930/1610 [9:41:30<2:30:02, 13.24s/it] 58%|█████▊ | 931/1610 [9:41:44<2:33:34, 13.57s/it] {'loss': 0.0015, 'grad_norm': 1.8811589761493208, 'learning_rate': 4.217391304347826e-07, 'completion_length': 122.16072463989258, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.14838216453790665, 'kl': 0.0374755859375, 'epoch': 2.89} 58%|█████▊ | 931/1610 [9:41:44<2:33:34, 13.57s/it] 58%|█████▊ | 932/1610 [9:42:00<2:39:41, 14.13s/it] {'loss': 0.0021, 'grad_norm': 2.0840473624552107, 'learning_rate': 4.211180124223602e-07, 'completion_length': 137.48214721679688, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.052734375, 'epoch': 2.89} 58%|█████▊ | 932/1610 [9:42:00<2:39:41, 14.13s/it] 58%|█████▊ | 933/1610 
[9:42:16<2:44:21, 14.57s/it] {'loss': 0.0024, 'grad_norm': 0.9068644361785534, 'learning_rate': 4.2049689440993784e-07, 'completion_length': 131.85715103149414, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.1071428656578064, 'kl': 0.061279296875, 'epoch': 2.9} 58%|█████▊ | 933/1610 [9:42:16<2:44:21, 14.57s/it] 58%|█████▊ | 934/1610 [9:42:31<2:45:49, 14.72s/it] {'loss': 0.0037, 'grad_norm': 1.6178089070069814, 'learning_rate': 4.1987577639751553e-07, 'completion_length': 127.14286422729492, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.0924072265625, 'epoch': 2.9} 58%|█████▊ | 934/1610 [9:42:31<2:45:49, 14.72s/it] 58%|█████▊ | 935/1610 [9:42:44<2:40:21, 14.25s/it] {'loss': 0.0027, 'grad_norm': 1.665147120366698, 'learning_rate': 4.1925465838509316e-07, 'completion_length': 114.8035774230957, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.14838217198848724, 'kl': 0.06689453125, 'epoch': 2.9} 58%|█████▊ | 935/1610 [9:42:44<2:40:21, 14.25s/it] 58%|█████▊ | 936/1610 [9:42:57<2:35:10, 13.81s/it] {'loss': 0.0023, 'grad_norm': 12.97574090233003, 'learning_rate': 4.186335403726708e-07, 'completion_length': 94.25000762939453, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.19514648616313934, 'kl': 0.0570068359375, 'epoch': 2.91} 58%|█████▊ | 936/1610 [9:42:57<2:35:10, 13.81s/it] 58%|█████▊ | 937/1610 [9:43:12<2:39:04, 14.18s/it] {'loss': 0.0032, 'grad_norm': 1.5590958684764775, 'learning_rate': 4.180124223602484e-07, 'completion_length': 137.10715103149414, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.20117833465337753, 'kl': 0.0794677734375, 'epoch': 2.91} 
58%|█████▊ | 937/1610 [9:43:12<2:39:04, 14.18s/it] 58%|█████▊ | 938/1610 [9:43:27<2:44:31, 14.69s/it] {'loss': 0.0037, 'grad_norm': 1.0875498564916701, 'learning_rate': 4.1739130434782606e-07, 'completion_length': 153.23215103149414, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0938720703125, 'epoch': 2.91} 58%|█████▊ | 938/1610 [9:43:27<2:44:31, 14.69s/it] 58%|█████▊ | 939/1610 [9:43:41<2:41:52, 14.47s/it] {'loss': 0.0018, 'grad_norm': 0.8652182337360915, 'learning_rate': 4.1677018633540374e-07, 'completion_length': 115.58929061889648, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.0452880859375, 'epoch': 2.92} 58%|█████▊ | 939/1610 [9:43:41<2:41:52, 14.47s/it] 58%|█████▊ | 940/1610 [9:43:57<2:45:58, 14.86s/it] {'loss': 0.0021, 'grad_norm': 1.270925793845233, 'learning_rate': 4.161490683229814e-07, 'completion_length': 117.16072463989258, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.18409645557403564, 'kl': 0.05322265625, 'epoch': 2.92} 58%|█████▊ | 940/1610 [9:43:57<2:45:58, 14.86s/it] 58%|█████▊ | 941/1610 [9:44:12<2:47:03, 14.98s/it] {'loss': 0.0036, 'grad_norm': 2.4580538472436513, 'learning_rate': 4.15527950310559e-07, 'completion_length': 117.14286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1896214708685875, 'kl': 0.089111328125, 'epoch': 2.92} 58%|█████▊ | 941/1610 [9:44:12<2:47:03, 14.98s/it] 59%|█████▊ | 942/1610 [9:44:27<2:45:30, 14.87s/it] {'loss': 0.0039, 'grad_norm': 2.5416576970725613, 'learning_rate': 4.149068322981366e-07, 'completion_length': 122.48215103149414, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.6964285969734192, 
'reward_std': 0.1071428619325161, 'kl': 0.0960693359375, 'epoch': 2.93} 59%|█████▊ | 942/1610 [9:44:27<2:45:30, 14.87s/it] 59%|█████▊ | 943/1610 [9:44:43<2:49:57, 15.29s/it] {'loss': 0.0022, 'grad_norm': 1.9220421959880185, 'learning_rate': 4.142857142857143e-07, 'completion_length': 145.17857360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178572535514832, 'reward_std': 0.1181928962469101, 'kl': 0.0538330078125, 'epoch': 2.93} 59%|█████▊ | 943/1610 [9:44:43<2:49:57, 15.29s/it] 59%|█████▊ | 944/1610 [9:44:58<2:46:54, 15.04s/it] {'loss': 0.0019, 'grad_norm': 1.3793401759551707, 'learning_rate': 4.136645962732919e-07, 'completion_length': 112.33929061889648, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.19514648616313934, 'kl': 0.046875, 'epoch': 2.93} 59%|█████▊ | 944/1610 [9:44:58<2:46:54, 15.04s/it] 59%|█████▊ | 945/1610 [9:45:11<2:40:33, 14.49s/it] {'loss': 0.0016, 'grad_norm': 2.1775762487497654, 'learning_rate': 4.1304347826086954e-07, 'completion_length': 108.76786422729492, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.14838217198848724, 'kl': 0.0396728515625, 'epoch': 2.93} 59%|█████▊ | 945/1610 [9:45:11<2:40:33, 14.49s/it] 59%|█████▉ | 946/1610 [9:45:27<2:46:20, 15.03s/it] {'loss': 0.0019, 'grad_norm': 1.2449342173800722, 'learning_rate': 4.1242236024844717e-07, 'completion_length': 120.28572082519531, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.0482177734375, 'epoch': 2.94} 59%|█████▉ | 946/1610 [9:45:27<2:46:20, 15.03s/it] 59%|█████▉ | 947/1610 [9:45:41<2:43:06, 14.76s/it] {'loss': 0.0026, 'grad_norm': 1.9041987259924043, 'learning_rate': 4.118012422360248e-07, 'completion_length': 107.4464340209961, 'rewards/accuracy_reward': 
0.6428571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.1785714402794838, 'kl': 0.0662841796875, 'epoch': 2.94} 59%|█████▉ | 947/1610 [9:45:41<2:43:06, 14.76s/it] 59%|█████▉ | 948/1610 [9:45:54<2:35:05, 14.06s/it] {'loss': 0.0019, 'grad_norm': 1.73637468483313, 'learning_rate': 4.111801242236025e-07, 'completion_length': 94.25000381469727, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409645557403564, 'kl': 0.0479736328125, 'epoch': 2.94} 59%|█████▉ | 948/1610 [9:45:54<2:35:05, 14.06s/it] 59%|█████▉ | 949/1610 [9:46:09<2:39:43, 14.50s/it] {'loss': 0.0069, 'grad_norm': 1.8047066478507618, 'learning_rate': 4.105590062111801e-07, 'completion_length': 142.76786422729492, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.31937122344970703, 'kl': 0.172119140625, 'epoch': 2.95} 59%|█████▉ | 949/1610 [9:46:09<2:39:43, 14.50s/it] 59%|█████▉ | 950/1610 [9:46:24<2:39:52, 14.53s/it] {'loss': 0.002, 'grad_norm': 1.2525172084518903, 'learning_rate': 4.0993788819875776e-07, 'completion_length': 128.71429061889648, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.1071428619325161, 'kl': 0.0496826171875, 'epoch': 2.95} 59%|█████▉ | 950/1610 [9:46:24<2:39:52, 14.53s/it] 59%|█████▉ | 951/1610 [9:46:41<2:47:36, 15.26s/it] {'loss': 0.0042, 'grad_norm': 1.5301030413909178, 'learning_rate': 4.093167701863354e-07, 'completion_length': 113.92857360839844, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1181928962469101, 'kl': 0.104736328125, 'epoch': 2.95} 59%|█████▉ | 951/1610 [9:46:41<2:47:36, 15.26s/it] 59%|█████▉ | 952/1610 [9:46:56<2:47:20, 15.26s/it] {'loss': 0.0019, 'grad_norm': 0.780894980613591, 'learning_rate': 
4.0869565217391307e-07, 'completion_length': 135.75000762939453, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.07695359364151955, 'kl': 0.0477294921875, 'epoch': 2.96} 59%|█████▉ | 952/1610 [9:46:56<2:47:20, 15.26s/it] 59%|█████▉ | 953/1610 [9:47:10<2:43:10, 14.90s/it] {'loss': 0.0068, 'grad_norm': 1.3987998048209451, 'learning_rate': 4.080745341614907e-07, 'completion_length': 130.00000762939453, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.1710205078125, 'epoch': 2.96} 59%|█████▉ | 953/1610 [9:47:10<2:43:10, 14.90s/it] 59%|█████▉ | 954/1610 [9:47:25<2:43:48, 14.98s/it] {'loss': 0.0052, 'grad_norm': 1.1090505641602935, 'learning_rate': 4.074534161490683e-07, 'completion_length': 134.48215103149414, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1071428656578064, 'kl': 0.13037109375, 'epoch': 2.96} 59%|█████▉ | 954/1610 [9:47:25<2:43:48, 14.98s/it] 59%|█████▉ | 955/1610 [9:47:40<2:41:52, 14.83s/it] {'loss': 0.0024, 'grad_norm': 1.8983085215275082, 'learning_rate': 4.068322981366459e-07, 'completion_length': 124.4464340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.11266787722706795, 'kl': 0.0589599609375, 'epoch': 2.97} 59%|█████▉ | 955/1610 [9:47:40<2:41:52, 14.83s/it] 59%|█████▉ | 956/1610 [9:47:54<2:38:38, 14.56s/it] {'loss': 0.0025, 'grad_norm': 1.2984923497080052, 'learning_rate': 4.0621118012422355e-07, 'completion_length': 111.00000381469727, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.1539071798324585, 'kl': 0.063232421875, 'epoch': 2.97} 59%|█████▉ | 956/1610 [9:47:54<2:38:38, 14.56s/it] 59%|█████▉ | 957/1610 [9:48:08<2:36:57, 
14.42s/it] {'loss': 0.0023, 'grad_norm': 5.668480792668886, 'learning_rate': 4.0559006211180124e-07, 'completion_length': 116.62500762939453, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.2967643141746521, 'kl': 0.057373046875, 'epoch': 2.97} 59%|█████▉ | 957/1610 [9:48:08<2:36:57, 14.42s/it] 60%|█████▉ | 958/1610 [9:48:24<2:43:41, 15.06s/it] {'loss': 0.0027, 'grad_norm': 1.9936504281720833, 'learning_rate': 4.0496894409937887e-07, 'completion_length': 122.01786422729492, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.26657507568597794, 'kl': 0.06640625, 'epoch': 2.98} 60%|█████▉ | 958/1610 [9:48:24<2:43:41, 15.06s/it] 60%|█████▉ | 959/1610 [9:48:38<2:37:33, 14.52s/it] {'loss': 0.0028, 'grad_norm': 1.2078631670841915, 'learning_rate': 4.043478260869565e-07, 'completion_length': 103.23214721679688, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.04123930633068085, 'kl': 0.070068359375, 'epoch': 2.98} 60%|█████▉ | 959/1610 [9:48:38<2:37:33, 14.52s/it] 60%|█████▉ | 960/1610 [9:48:54<2:43:19, 15.08s/it] {'loss': 0.0045, 'grad_norm': 2.3418993194845767, 'learning_rate': 4.0372670807453413e-07, 'completion_length': 97.66072082519531, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.13527478277683258, 'kl': 0.11181640625, 'epoch': 2.98} 60%|█████▉ | 960/1610 [9:48:54<2:43:19, 15.08s/it] 60%|█████▉ | 961/1610 [9:49:07<2:36:41, 14.49s/it] {'loss': 0.0033, 'grad_norm': 3.7444540727781135, 'learning_rate': 4.0310559006211177e-07, 'completion_length': 104.46429061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.2610500454902649, 'kl': 0.0826416015625, 'epoch': 2.98} 60%|█████▉ | 961/1610 
[9:49:07<2:36:41, 14.49s/it] 60%|█████▉ | 962/1610 [9:49:22<2:37:22, 14.57s/it] {'loss': 0.0037, 'grad_norm': 1.235001565434983, 'learning_rate': 4.0248447204968945e-07, 'completion_length': 135.75000381469727, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.07695358991622925, 'kl': 0.0933837890625, 'epoch': 2.99} 60%|█████▉ | 962/1610 [9:49:22<2:37:22, 14.57s/it] 60%|█████▉ | 963/1610 [9:49:34<2:30:25, 13.95s/it] {'loss': 0.0019, 'grad_norm': 1.2745087010000236, 'learning_rate': 4.018633540372671e-07, 'completion_length': 100.98214721679688, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.14838216453790665, 'kl': 0.0469970703125, 'epoch': 2.99} 60%|█████▉ | 963/1610 [9:49:34<2:30:25, 13.95s/it] 60%|█████▉ | 964/1610 [9:49:48<2:27:58, 13.74s/it] {'loss': 0.0026, 'grad_norm': 2.619808245010495, 'learning_rate': 4.012422360248447e-07, 'completion_length': 102.53572082519531, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.0357142873108387, 'kl': 0.06591796875, 'epoch': 2.99} 60%|█████▉ | 964/1610 [9:49:48<2:27:58, 13.74s/it] 60%|█████▉ | 965/1610 [9:50:00<2:23:22, 13.34s/it] {'loss': 0.0024, 'grad_norm': 3.0934562120181757, 'learning_rate': 4.006211180124223e-07, 'completion_length': 114.33929061889648, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 1.0, 'reward': 1.4464285969734192, 'reward_std': 0.21981074661016464, 'kl': 0.05908203125, 'epoch': 3.0} 60%|█████▉ | 965/1610 [9:50:00<2:23:22, 13.34s/it] 60%|██████ | 966/1610 [9:50:17<2:33:33, 14.31s/it] {'loss': 0.0114, 'grad_norm': 2.1168031993066307, 'learning_rate': 4e-07, 'completion_length': 139.9107208251953, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.2610500380396843, 'kl': 
0.28369140625, 'epoch': 3.0} 60%|██████ | 966/1610 [9:50:17<2:33:33, 14.31s/it] 60%|██████ | 967/1610 [9:50:33<2:40:12, 14.95s/it] {'loss': 0.0034, 'grad_norm': 1.9617436659785576, 'learning_rate': 3.993788819875776e-07, 'completion_length': 124.76786041259766, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.410714328289032, 'reward_std': 0.1896214783191681, 'kl': 0.083740234375, 'epoch': 3.0} 60%|██████ | 967/1610 [9:50:33<2:40:12, 14.95s/it] 60%|██████ | 968/1610 [9:50:49<2:41:52, 15.13s/it] {'loss': 0.0021, 'grad_norm': 1.0391162265035392, 'learning_rate': 3.9875776397515525e-07, 'completion_length': 92.07143020629883, 'rewards/accuracy_reward': 0.6785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.14838215708732605, 'kl': 0.05224609375, 'epoch': 3.01} 60%|██████ | 968/1610 [9:50:49<2:41:52, 15.13s/it] 60%|██████ | 969/1610 [9:51:06<2:47:19, 15.66s/it] {'loss': 0.0023, 'grad_norm': 1.7352973634984847, 'learning_rate': 3.981366459627329e-07, 'completion_length': 117.76786422729492, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.1785714365541935, 'kl': 0.0565185546875, 'epoch': 3.01} 60%|██████ | 969/1610 [9:51:06<2:47:19, 15.66s/it] 60%|██████ | 970/1610 [9:51:18<2:37:44, 14.79s/it] {'loss': 0.0021, 'grad_norm': 2.3984865314891732, 'learning_rate': 3.975155279503105e-07, 'completion_length': 92.3214340209961, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.26657506078481674, 'kl': 0.051513671875, 'epoch': 3.01} 60%|██████ | 970/1610 [9:51:18<2:37:44, 14.79s/it] 60%|██████ | 971/1610 [9:51:32<2:33:16, 14.39s/it] {'loss': 0.0024, 'grad_norm': 2.1652720414159683, 'learning_rate': 3.968944099378882e-07, 'completion_length': 117.58929061889648, 'rewards/accuracy_reward': 0.5000000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.0824786126613617, 'kl': 0.058837890625, 'epoch': 3.02} 60%|██████ | 971/1610 [9:51:32<2:33:16, 14.39s/it] 60%|██████ | 972/1610 [9:51:47<2:36:45, 14.74s/it] {'loss': 0.0084, 'grad_norm': 1.5351247548848603, 'learning_rate': 3.9627329192546583e-07, 'completion_length': 126.8035774230957, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.535714328289032, 'reward_std': 0.2253357619047165, 'kl': 0.210205078125, 'epoch': 3.02} 60%|██████ | 972/1610 [9:51:47<2:36:45, 14.74s/it] 60%|██████ | 973/1610 [9:52:02<2:35:09, 14.61s/it] {'loss': 0.0026, 'grad_norm': 1.35053600869931, 'learning_rate': 3.9565217391304346e-07, 'completion_length': 125.07143783569336, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.1896214708685875, 'kl': 0.0660400390625, 'epoch': 3.02} 60%|██████ | 973/1610 [9:52:02<2:35:09, 14.61s/it] 60%|██████ | 974/1610 [9:52:16<2:34:36, 14.59s/it] {'loss': 0.0045, 'grad_norm': 2.392203483977218, 'learning_rate': 3.950310559006211e-07, 'completion_length': 124.48214721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1785714402794838, 'kl': 0.111572265625, 'epoch': 3.02} 60%|██████ | 974/1610 [9:52:16<2:34:36, 14.59s/it] 61%|██████ | 975/1610 [9:52:33<2:39:55, 15.11s/it] {'loss': 0.0028, 'grad_norm': 0.7324771107480433, 'learning_rate': 3.944099378881988e-07, 'completion_length': 120.67858123779297, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0714285746216774, 'kl': 0.069580078125, 'epoch': 3.03} 61%|██████ | 975/1610 [9:52:33<2:39:55, 15.11s/it] 61%|██████ | 976/1610 [9:52:47<2:38:42, 15.02s/it] {'loss': 0.004, 'grad_norm': 3.0853890018402588, 'learning_rate': 3.937888198757764e-07, 'completion_length': 
123.50000762939453, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.11266788095235825, 'kl': 0.1005859375, 'epoch': 3.03} 61%|██████ | 976/1610 [9:52:47<2:38:42, 15.02s/it] 61%|██████ | 977/1610 [9:53:00<2:32:03, 14.41s/it] {'loss': 0.0024, 'grad_norm': 1.9178775407007504, 'learning_rate': 3.93167701863354e-07, 'completion_length': 122.3035774230957, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.1428571529686451, 'kl': 0.060302734375, 'epoch': 3.03} 61%|██████ | 977/1610 [9:53:00<2:32:03, 14.41s/it] 61%|██████ | 978/1610 [9:53:18<2:41:02, 15.29s/it] {'loss': 0.0074, 'grad_norm': 3.1896772674883516, 'learning_rate': 3.925465838509316e-07, 'completion_length': 121.05357360839844, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.30228933691978455, 'kl': 0.185546875, 'epoch': 3.04} 61%|██████ | 978/1610 [9:53:18<2:41:02, 15.29s/it] 61%|██████ | 979/1610 [9:53:35<2:45:45, 15.76s/it] {'loss': 0.0062, 'grad_norm': 1.754399450793363, 'learning_rate': 3.9192546583850926e-07, 'completion_length': 119.01786041259766, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5714285969734192, 'reward_std': 0.2142857313156128, 'kl': 0.154541015625, 'epoch': 3.04} 61%|██████ | 979/1610 [9:53:35<2:45:45, 15.76s/it] 61%|██████ | 980/1610 [9:53:49<2:40:37, 15.30s/it] {'loss': 0.0024, 'grad_norm': 1.083157764722969, 'learning_rate': 3.9130434782608694e-07, 'completion_length': 108.76786422729492, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.14838216453790665, 'kl': 0.0606689453125, 'epoch': 3.04} 61%|██████ | 980/1610 [9:53:49<2:40:37, 15.30s/it] 61%|██████ | 981/1610 [9:54:02<2:32:55, 14.59s/it] {'loss': 0.0027, 'grad_norm': 
1.6230914826097715, 'learning_rate': 3.906832298136646e-07, 'completion_length': 108.98214721679688, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.1428571492433548, 'kl': 0.066650390625, 'epoch': 3.05} 61%|██████ | 981/1610 [9:54:02<2:32:55, 14.59s/it] 61%|██████ | 982/1610 [9:54:21<2:48:23, 16.09s/it] {'loss': 0.0059, 'grad_norm': 1.53063776224178, 'learning_rate': 3.900621118012422e-07, 'completion_length': 133.91071701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.17098906636238098, 'kl': 0.14697265625, 'epoch': 3.05} 61%|██████ | 982/1610 [9:54:21<2:48:23, 16.09s/it] 61%|██████ | 983/1610 [9:54:35<2:41:55, 15.49s/it] {'loss': 0.0021, 'grad_norm': 1.920451353844692, 'learning_rate': 3.8944099378881984e-07, 'completion_length': 129.9107208251953, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.23638580739498138, 'kl': 0.0518798828125, 'epoch': 3.05} 61%|██████ | 983/1610 [9:54:35<2:41:55, 15.49s/it] 61%|██████ | 984/1610 [9:54:52<2:46:13, 15.93s/it] {'loss': 0.0026, 'grad_norm': 1.9651445373972043, 'learning_rate': 3.8881987577639753e-07, 'completion_length': 128.3035774230957, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.14838216826319695, 'kl': 0.0645751953125, 'epoch': 3.06} 61%|██████ | 984/1610 [9:54:52<2:46:13, 15.93s/it] 61%|██████ | 985/1610 [9:55:09<2:48:51, 16.21s/it] {'loss': 0.0096, 'grad_norm': 1.9822427228254098, 'learning_rate': 3.8819875776397516e-07, 'completion_length': 137.37500762939453, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.24241765588521957, 'kl': 0.2396240234375, 'epoch': 3.06} 61%|██████ | 985/1610 [9:55:09<2:48:51, 
16.21s/it] 61%|██████ | 986/1610 [9:55:23<2:39:36, 15.35s/it] {'loss': 0.0021, 'grad_norm': 1.033118140164736, 'learning_rate': 3.875776397515528e-07, 'completion_length': 110.28571701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.05126953125, 'epoch': 3.06} 61%|██████ | 986/1610 [9:55:23<2:39:36, 15.35s/it] 61%|██████▏ | 987/1610 [9:55:36<2:32:16, 14.67s/it] {'loss': 0.0025, 'grad_norm': 5.9344140334423034, 'learning_rate': 3.869565217391304e-07, 'completion_length': 94.14286041259766, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.07695359364151955, 'kl': 0.0635986328125, 'epoch': 3.07} 61%|██████▏ | 987/1610 [9:55:36<2:32:16, 14.67s/it] 61%|██████▏ | 988/1610 [9:55:48<2:23:47, 13.87s/it] {'loss': 0.0023, 'grad_norm': 1.1025934976276508, 'learning_rate': 3.8633540372670806e-07, 'completion_length': 106.58929061889648, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.1539071872830391, 'kl': 0.0565185546875, 'epoch': 3.07} 61%|██████▏ | 988/1610 [9:55:48<2:23:47, 13.87s/it] 61%|██████▏ | 989/1610 [9:55:58<2:13:46, 12.92s/it] {'loss': 0.0016, 'grad_norm': 3.3802701120056153, 'learning_rate': 3.857142857142857e-07, 'completion_length': 81.73214721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1896214671432972, 'kl': 0.0396728515625, 'epoch': 3.07} 61%|██████▏ | 989/1610 [9:55:58<2:13:46, 12.92s/it] 61%|██████▏ | 990/1610 [9:56:14<2:21:16, 13.67s/it] {'loss': 0.0082, 'grad_norm': 1.966978427165534, 'learning_rate': 3.850931677018633e-07, 'completion_length': 124.76786422729492, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.2253357470035553, 
'kl': 0.20556640625, 'epoch': 3.07} 61%|██████▏ | 990/1610 [9:56:14<2:21:16, 13.67s/it] 62%|██████▏ | 991/1610 [9:56:30<2:29:41, 14.51s/it] {'loss': 0.0026, 'grad_norm': 1.7501386308131561, 'learning_rate': 3.8447204968944095e-07, 'completion_length': 128.10714721679688, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1785714365541935, 'kl': 0.06494140625, 'epoch': 3.08} 62%|██████▏ | 991/1610 [9:56:30<2:29:41, 14.51s/it] 62%|██████▏ | 992/1610 [9:56:43<2:24:02, 13.98s/it] {'loss': 0.0017, 'grad_norm': 0.5329069334214473, 'learning_rate': 3.838509316770186e-07, 'completion_length': 109.8214340209961, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.0418701171875, 'epoch': 3.08} 62%|██████▏ | 992/1610 [9:56:43<2:24:02, 13.98s/it] 62%|██████▏ | 993/1610 [9:56:56<2:22:19, 13.84s/it] {'loss': 0.0023, 'grad_norm': 1.2172285607718687, 'learning_rate': 3.8322981366459627e-07, 'completion_length': 104.85714721679688, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.1539071798324585, 'kl': 0.056396484375, 'epoch': 3.08} 62%|██████▏ | 993/1610 [9:56:56<2:22:19, 13.84s/it] 62%|██████▏ | 994/1610 [9:57:10<2:20:42, 13.71s/it] {'loss': 0.0026, 'grad_norm': 1.8066448624478821, 'learning_rate': 3.826086956521739e-07, 'completion_length': 107.9464340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.18409644439816475, 'kl': 0.06591796875, 'epoch': 3.09} 62%|██████▏ | 994/1610 [9:57:10<2:20:42, 13.71s/it] 62%|██████▏ | 995/1610 [9:57:26<2:26:57, 14.34s/it] {'loss': 0.0023, 'grad_norm': 1.1415526636624906, 'learning_rate': 3.8198757763975154e-07, 'completion_length': 145.32143020629883, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 
'reward': 1.6071429252624512, 'reward_std': 0.1539071835577488, 'kl': 0.0579833984375, 'epoch': 3.09} 62%|██████▏ | 995/1610 [9:57:26<2:26:57, 14.34s/it] 62%|██████▏ | 996/1610 [9:57:37<2:16:18, 13.32s/it] {'loss': 0.0021, 'grad_norm': 1.7313986157329997, 'learning_rate': 3.8136645962732917e-07, 'completion_length': 78.83928680419922, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.11266788095235825, 'kl': 0.0518798828125, 'epoch': 3.09} 62%|██████▏ | 996/1610 [9:57:37<2:16:18, 13.32s/it] 62%|██████▏ | 997/1610 [9:57:50<2:15:11, 13.23s/it] {'loss': 0.0043, 'grad_norm': 2.0294980269764586, 'learning_rate': 3.807453416149068e-07, 'completion_length': 113.1964340209961, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857909202576, 'reward_std': 0.25552502274513245, 'kl': 0.1080322265625, 'epoch': 3.1} 62%|██████▏ | 997/1610 [9:57:50<2:15:11, 13.23s/it] 62%|██████▏ | 998/1610 [9:58:03<2:15:03, 13.24s/it] {'loss': 0.0041, 'grad_norm': 1.0520290792044378, 'learning_rate': 3.801242236024845e-07, 'completion_length': 115.85714721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.07695358991622925, 'kl': 0.1015625, 'epoch': 3.1} 62%|██████▏ | 998/1610 [9:58:03<2:15:03, 13.24s/it] 62%|██████▏ | 999/1610 [9:58:19<2:24:48, 14.22s/it] {'loss': 0.0056, 'grad_norm': 1.806999932849579, 'learning_rate': 3.795031055900621e-07, 'completion_length': 136.48215103149414, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.21981074661016464, 'kl': 0.140380859375, 'epoch': 3.1} 62%|██████▏ | 999/1610 [9:58:19<2:24:48, 14.22s/it] 62%|██████▏ | 1000/1610 [9:58:38<2:38:22, 15.58s/it] {'loss': 0.0028, 'grad_norm': 0.5169370841419287, 'learning_rate': 3.7888198757763975e-07, 'completion_length': 
142.39286041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.0357142873108387, 'kl': 0.069580078125, 'epoch': 3.11} 62%|██████▏ | 1000/1610 [9:58:38<2:38:22, 15.58s/it] 62%|██████▏ | 1001/1610 [10:01:41<11:08:23, 65.85s/it] {'loss': 0.0019, 'grad_norm': 0.6146080404372699, 'learning_rate': 3.7826086956521733e-07, 'completion_length': 130.05358123779297, 'rewards/accuracy_reward': 0.8750000596046448, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.0472412109375, 'epoch': 3.11} 62%|██████▏ | 1001/1610 [10:01:41<11:08:23, 65.85s/it] 62%|██████▏ | 1002/1610 [10:01:53<8:21:42, 49.51s/it] {'loss': 0.0027, 'grad_norm': 1.0691678347192695, 'learning_rate': 3.77639751552795e-07, 'completion_length': 90.55357360839844, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 1.0, 'reward': 1.410714328289032, 'reward_std': 0.07695359364151955, 'kl': 0.0665283203125, 'epoch': 3.11} 62%|██████▏ | 1002/1610 [10:01:53<8:21:42, 49.51s/it] 62%|██████▏ | 1003/1610 [10:02:06<6:31:37, 38.71s/it] {'loss': 0.0021, 'grad_norm': 2.0590240267326565, 'learning_rate': 3.7701863354037265e-07, 'completion_length': 112.39286422729492, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1896214708685875, 'kl': 0.05322265625, 'epoch': 3.11} 62%|██████▏ | 1003/1610 [10:02:06<6:31:37, 38.71s/it] 62%|██████▏ | 1004/1610 [10:02:19<5:12:22, 30.93s/it] {'loss': 0.0024, 'grad_norm': 1.6955994342903107, 'learning_rate': 3.763975155279503e-07, 'completion_length': 124.98214721679688, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.11266788095235825, 'kl': 0.06005859375, 'epoch': 3.12} 62%|██████▏ | 1004/1610 [10:02:19<5:12:22, 30.93s/it] 62%|██████▏ | 1005/1610 [10:02:35<4:26:38, 26.44s/it] {'loss': 0.0137, 
'grad_norm': 1.7909428826811826, 'learning_rate': 3.757763975155279e-07, 'completion_length': 107.16071701049805, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250001192092896, 'reward_std': 0.31333938986063004, 'kl': 0.34228515625, 'epoch': 3.12} 62%|██████▏ | 1005/1610 [10:02:35<4:26:38, 26.44s/it] 62%|██████▏ | 1006/1610 [10:02:48<3:44:58, 22.35s/it] {'loss': 0.0027, 'grad_norm': 1.4531290655822977, 'learning_rate': 3.7515527950310555e-07, 'completion_length': 110.76786422729492, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1896214783191681, 'kl': 0.0665283203125, 'epoch': 3.12} 62%|██████▏ | 1006/1610 [10:02:48<3:44:58, 22.35s/it] 63%|██████▎ | 1007/1610 [10:03:04<3:25:46, 20.47s/it] {'loss': 0.0036, 'grad_norm': 1.1171841409585141, 'learning_rate': 3.7453416149068323e-07, 'completion_length': 106.48214721679688, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250001192092896, 'reward_std': 0.14838215708732605, 'kl': 0.08984375, 'epoch': 3.13} 63%|██████▎ | 1007/1610 [10:03:04<3:25:46, 20.47s/it] 63%|██████▎ | 1008/1610 [10:03:21<3:15:15, 19.46s/it] {'loss': 0.0065, 'grad_norm': 0.8305016826523194, 'learning_rate': 3.7391304347826087e-07, 'completion_length': 121.98214721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3928571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.162109375, 'epoch': 3.13} 63%|██████▎ | 1008/1610 [10:03:21<3:15:15, 19.46s/it] 63%|██████▎ | 1009/1610 [10:03:34<2:55:20, 17.51s/it] {'loss': 0.0036, 'grad_norm': 1.148699856000728, 'learning_rate': 3.732919254658385e-07, 'completion_length': 90.10714721679688, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.11266787722706795, 'kl': 0.08984375, 'epoch': 3.13} 
63%|██████▎ | 1009/1610 [10:03:34<2:55:20, 17.51s/it] 63%|██████▎ | 1010/1610 [10:03:47<2:41:07, 16.11s/it] {'loss': 0.0016, 'grad_norm': 1.5070709878390662, 'learning_rate': 3.7267080745341613e-07, 'completion_length': 116.60714721679688, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.14838216826319695, 'kl': 0.0390625, 'epoch': 3.14} 63%|██████▎ | 1010/1610 [10:03:47<2:41:07, 16.11s/it] 63%|██████▎ | 1011/1610 [10:04:01<2:35:19, 15.56s/it] {'loss': 0.0037, 'grad_norm': 0.9082630062648192, 'learning_rate': 3.720496894409938e-07, 'completion_length': 115.03572082519531, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.0928955078125, 'epoch': 3.14} 63%|██████▎ | 1011/1610 [10:04:01<2:35:19, 15.56s/it] 63%|██████▎ | 1012/1610 [10:04:18<2:40:35, 16.11s/it] {'loss': 0.0029, 'grad_norm': 2.0979418340752325, 'learning_rate': 3.7142857142857145e-07, 'completion_length': 157.1964340209961, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1428571529686451, 'kl': 0.0723876953125, 'epoch': 3.14} 63%|██████▎ | 1012/1610 [10:04:18<2:40:35, 16.11s/it] 63%|██████▎ | 1013/1610 [10:04:35<2:40:45, 16.16s/it] {'loss': 0.0089, 'grad_norm': 1.2669331036050375, 'learning_rate': 3.7080745341614903e-07, 'completion_length': 132.6785774230957, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.22265625, 'epoch': 3.15} 63%|██████▎ | 1013/1610 [10:04:35<2:40:45, 16.16s/it] 63%|██████▎ | 1014/1610 [10:04:49<2:34:00, 15.50s/it] {'loss': 0.0035, 'grad_norm': 2.5329846810421532, 'learning_rate': 3.7018633540372666e-07, 'completion_length': 104.3035774230957, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.1539071872830391, 'kl': 0.0872802734375, 'epoch': 3.15} 63%|██████▎ | 1014/1610 [10:04:49<2:34:00, 15.50s/it] 63%|██████▎ | 1015/1610 [10:05:05<2:36:15, 15.76s/it] {'loss': 0.0095, 'grad_norm': 3.77984214434267, 'learning_rate': 3.695652173913043e-07, 'completion_length': 141.98214721679688, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.5000001192092896, 'reward_std': 0.2142857201397419, 'kl': 0.23486328125, 'epoch': 3.15} 63%|██████▎ | 1015/1610 [10:05:05<2:36:15, 15.76s/it] 63%|██████▎ | 1016/1610 [10:05:22<2:38:56, 16.05s/it] {'loss': 0.0055, 'grad_norm': 2.483585987545676, 'learning_rate': 3.68944099378882e-07, 'completion_length': 137.98214721679688, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.25552502274513245, 'kl': 0.138427734375, 'epoch': 3.16} 63%|██████▎ | 1016/1610 [10:05:22<2:38:56, 16.05s/it] 63%|██████▎ | 1017/1610 [10:05:39<2:43:31, 16.55s/it] {'loss': 0.0102, 'grad_norm': 1.1489930076121015, 'learning_rate': 3.683229813664596e-07, 'completion_length': 125.83929443359375, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.12974976375699043, 'kl': 0.256591796875, 'epoch': 3.16} 63%|██████▎ | 1017/1610 [10:05:39<2:43:31, 16.55s/it] 63%|██████▎ | 1018/1610 [10:05:56<2:42:31, 16.47s/it] {'loss': 0.0022, 'grad_norm': 1.9039725654238702, 'learning_rate': 3.6770186335403724e-07, 'completion_length': 123.51786041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.25552502274513245, 'kl': 0.0560302734375, 'epoch': 3.16} 63%|██████▎ | 1018/1610 [10:05:56<2:42:31, 16.47s/it] 63%|██████▎ | 1019/1610 [10:06:09<2:33:40, 15.60s/it] {'loss': 0.003, 'grad_norm': 1.7907355814155137, 
'learning_rate': 3.670807453416149e-07, 'completion_length': 120.37500762939453, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.2006715089082718, 'kl': 0.075439453125, 'epoch': 3.16} 63%|██████▎ | 1019/1610 [10:06:09<2:33:40, 15.60s/it] 63%|██████▎ | 1020/1610 [10:06:26<2:37:07, 15.98s/it] {'loss': 0.0103, 'grad_norm': 0.9948315955230218, 'learning_rate': 3.6645962732919256e-07, 'completion_length': 133.2321548461914, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.2569580078125, 'epoch': 3.17} 63%|██████▎ | 1020/1610 [10:06:26<2:37:07, 15.98s/it] 63%|██████▎ | 1021/1610 [10:06:40<2:31:32, 15.44s/it] {'loss': 0.0021, 'grad_norm': 1.4776353463647236, 'learning_rate': 3.658385093167702e-07, 'completion_length': 113.89286041259766, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.1539071872830391, 'kl': 0.0523681640625, 'epoch': 3.17} 63%|██████▎ | 1021/1610 [10:06:40<2:31:32, 15.44s/it] 63%|██████▎ | 1022/1610 [10:06:53<2:24:27, 14.74s/it] {'loss': 0.0032, 'grad_norm': 2.6555045519112634, 'learning_rate': 3.6521739130434783e-07, 'completion_length': 107.1964340209961, 'rewards/accuracy_reward': 0.803571492433548, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.14838216826319695, 'kl': 0.07958984375, 'epoch': 3.17} 63%|██████▎ | 1022/1610 [10:06:53<2:24:27, 14.74s/it] 64%|██████▎ | 1023/1610 [10:07:10<2:30:11, 15.35s/it] {'loss': 0.0024, 'grad_norm': 1.5380343473927875, 'learning_rate': 3.6459627329192546e-07, 'completion_length': 127.96429061889648, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.0606689453125, 'epoch': 3.18} 64%|██████▎ | 1023/1610 
[10:07:10<2:30:11, 15.35s/it] 64%|██████▎ | 1024/1610 [10:07:23<2:23:34, 14.70s/it] {'loss': 0.0036, 'grad_norm': 1.1071691452208334, 'learning_rate': 3.6397515527950304e-07, 'completion_length': 116.33929061889648, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.14838216453790665, 'kl': 0.090576171875, 'epoch': 3.18} 64%|██████▎ | 1024/1610 [10:07:23<2:23:34, 14.70s/it] 64%|██████▎ | 1025/1610 [10:07:41<2:31:55, 15.58s/it] {'loss': 0.0086, 'grad_norm': 1.937189595278816, 'learning_rate': 3.633540372670807e-07, 'completion_length': 124.05357360839844, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6071429252624512, 'reward_std': 0.2142857313156128, 'kl': 0.21435546875, 'epoch': 3.18} 64%|██████▎ | 1025/1610 [10:07:41<2:31:55, 15.58s/it] 64%|██████▎ | 1026/1610 [10:07:57<2:33:20, 15.75s/it] {'loss': 0.0174, 'grad_norm': 2.41152858472115, 'learning_rate': 3.6273291925465836e-07, 'completion_length': 105.78572082519531, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.3324785977602005, 'kl': 0.43359375, 'epoch': 3.19} 64%|██████▎ | 1026/1610 [10:07:57<2:33:20, 15.75s/it] 64%|██████▍ | 1027/1610 [10:08:10<2:24:34, 14.88s/it] {'loss': 0.0022, 'grad_norm': 1.546754485109646, 'learning_rate': 3.62111801242236e-07, 'completion_length': 93.55357360839844, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.1428571529686451, 'kl': 0.0545654296875, 'epoch': 3.19} 64%|██████▍ | 1027/1610 [10:08:10<2:24:34, 14.88s/it] 64%|██████▍ | 1028/1610 [10:08:26<2:27:02, 15.16s/it] {'loss': 0.0105, 'grad_norm': 1.7409195795684316, 'learning_rate': 3.614906832298136e-07, 'completion_length': 125.83929443359375, 'rewards/accuracy_reward': 0.3214285969734192, 'rewards/format_reward': 0.9464285969734192, 
'reward': 1.2678571939468384, 'reward_std': 0.21981074661016464, 'kl': 0.26220703125, 'epoch': 3.19} 64%|██████▍ | 1028/1610 [10:08:26<2:27:02, 15.16s/it] 64%|██████▍ | 1029/1610 [10:08:37<2:16:12, 14.07s/it] {'loss': 0.0016, 'grad_norm': 1.9149114564539351, 'learning_rate': 3.608695652173913e-07, 'completion_length': 103.37500381469727, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.1539071872830391, 'kl': 0.04071044921875, 'epoch': 3.2} 64%|██████▍ | 1029/1610 [10:08:37<2:16:12, 14.07s/it] 64%|██████▍ | 1030/1610 [10:08:50<2:11:09, 13.57s/it] {'loss': 0.0019, 'grad_norm': 5.861790207204798, 'learning_rate': 3.6024844720496894e-07, 'completion_length': 96.42857360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.14838215708732605, 'kl': 0.04833984375, 'epoch': 3.2} 64%|██████▍ | 1030/1610 [10:08:50<2:11:09, 13.57s/it] 64%|██████▍ | 1031/1610 [10:09:00<2:02:18, 12.67s/it] {'loss': 0.0017, 'grad_norm': 0.772291794621318, 'learning_rate': 3.596273291925466e-07, 'completion_length': 98.28572082519531, 'rewards/accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 1.0, 'reward': 1.785714328289032, 'reward_std': 0.0714285746216774, 'kl': 0.0433349609375, 'epoch': 3.2} 64%|██████▍ | 1031/1610 [10:09:00<2:02:18, 12.67s/it] 64%|██████▍ | 1032/1610 [10:09:14<2:05:46, 13.06s/it] {'loss': 0.0019, 'grad_norm': 1.810205479841748, 'learning_rate': 3.590062111801242e-07, 'completion_length': 131.4107208251953, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.1539071872830391, 'kl': 0.0482177734375, 'epoch': 3.2} 64%|██████▍ | 1032/1610 [10:09:14<2:05:46, 13.06s/it] 64%|██████▍ | 1033/1610 [10:09:29<2:09:16, 13.44s/it] {'loss': 0.002, 'grad_norm': 1.0117195666593584, 'learning_rate': 3.5838509316770184e-07, 'completion_length': 
112.66072082519531, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.14838216826319695, 'kl': 0.0496826171875, 'epoch': 3.21} 64%|██████▍ | 1033/1610 [10:09:29<2:09:16, 13.44s/it] 64%|██████▍ | 1034/1610 [10:09:44<2:15:23, 14.10s/it] {'loss': 0.009, 'grad_norm': 1.8121681610658196, 'learning_rate': 3.577639751552795e-07, 'completion_length': 110.83929061889648, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.1428571529686451, 'kl': 0.22412109375, 'epoch': 3.21} 64%|██████▍ | 1034/1610 [10:09:44<2:15:23, 14.10s/it] 64%|██████▍ | 1035/1610 [10:09:57<2:11:32, 13.73s/it] {'loss': 0.0052, 'grad_norm': 1.08961370343603, 'learning_rate': 3.5714285714285716e-07, 'completion_length': 117.10714340209961, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535714626312256, 'reward_std': 0.10962696373462677, 'kl': 0.12890625, 'epoch': 3.21} 64%|██████▍ | 1035/1610 [10:09:57<2:11:32, 13.73s/it] 64%|██████▍ | 1036/1610 [10:10:10<2:09:53, 13.58s/it] {'loss': 0.01, 'grad_norm': 2.1107464439930976, 'learning_rate': 3.5652173913043474e-07, 'completion_length': 120.16071701049805, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4285714626312256, 'reward_std': 0.25552502274513245, 'kl': 0.249755859375, 'epoch': 3.22} 64%|██████▍ | 1036/1610 [10:10:10<2:09:53, 13.58s/it] 64%|██████▍ | 1037/1610 [10:10:27<2:16:58, 14.34s/it] {'loss': 0.0059, 'grad_norm': 2.5387757762878813, 'learning_rate': 3.5590062111801237e-07, 'completion_length': 109.30357360839844, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7500000596046448, 'reward_std': 0.19514648616313934, 'kl': 0.1466064453125, 'epoch': 3.22} 64%|██████▍ | 1037/1610 [10:10:27<2:16:58, 14.34s/it] 64%|██████▍ | 
1038/1610 [10:10:42<2:20:16, 14.71s/it] {'loss': 0.0137, 'grad_norm': 2.7586439889430263, 'learning_rate': 3.5527950310559005e-07, 'completion_length': 115.87500381469727, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.342041015625, 'epoch': 3.22} 64%|██████▍ | 1038/1610 [10:10:42<2:20:16, 14.71s/it] 65%|██████▍ | 1039/1610 [10:10:56<2:17:03, 14.40s/it] {'loss': 0.0019, 'grad_norm': 2.2215484357515036, 'learning_rate': 3.546583850931677e-07, 'completion_length': 120.08929061889648, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.1539071872830391, 'kl': 0.0467529296875, 'epoch': 3.23} 65%|██████▍ | 1039/1610 [10:10:56<2:17:03, 14.40s/it] 65%|██████▍ | 1040/1610 [10:11:12<2:20:49, 14.82s/it] {'loss': 0.0043, 'grad_norm': 2.513514128012837, 'learning_rate': 3.540372670807453e-07, 'completion_length': 117.58928680419922, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.2363857924938202, 'kl': 0.1082763671875, 'epoch': 3.23} 65%|██████▍ | 1040/1610 [10:11:12<2:20:49, 14.82s/it] 65%|██████▍ | 1041/1610 [10:11:27<2:21:36, 14.93s/it] {'loss': 0.008, 'grad_norm': 1.6121078333549883, 'learning_rate': 3.5341614906832295e-07, 'completion_length': 104.92857360839844, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4464285969734192, 'reward_std': 0.21981074661016464, 'kl': 0.199462890625, 'epoch': 3.23} 65%|██████▍ | 1041/1610 [10:11:27<2:21:36, 14.93s/it] 65%|██████▍ | 1042/1610 [10:11:43<2:23:50, 15.19s/it] {'loss': 0.0152, 'grad_norm': 1.2003194773183627, 'learning_rate': 3.527950310559006e-07, 'completion_length': 124.35715103149414, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 
1.696428656578064, 'reward_std': 0.23689262941479683, 'kl': 0.3798828125, 'epoch': 3.24} 65%|██████▍ | 1042/1610 [10:11:43<2:23:50, 15.19s/it] 65%|██████▍ | 1043/1610 [10:11:57<2:21:21, 14.96s/it] {'loss': 0.0066, 'grad_norm': 2.1063479643453165, 'learning_rate': 3.5217391304347827e-07, 'completion_length': 128.4285774230957, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4285715222358704, 'reward_std': 0.20117834210395813, 'kl': 0.16455078125, 'epoch': 3.24} 65%|██████▍ | 1043/1610 [10:11:57<2:21:21, 14.96s/it] 65%|██████▍ | 1044/1610 [10:12:13<2:23:17, 15.19s/it] {'loss': 0.0145, 'grad_norm': 2.2245136238263346, 'learning_rate': 3.515527950310559e-07, 'completion_length': 128.16071701049805, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5357143878936768, 'reward_std': 0.2580091208219528, 'kl': 0.36328125, 'epoch': 3.24} 65%|██████▍ | 1044/1610 [10:12:13<2:23:17, 15.19s/it] 65%|██████▍ | 1045/1610 [10:12:29<2:26:17, 15.54s/it] {'loss': 0.0104, 'grad_norm': 1.1717896398399081, 'learning_rate': 3.5093167701863354e-07, 'completion_length': 126.0535774230957, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785714626312256, 'reward_std': 0.1428571529686451, 'kl': 0.2611083984375, 'epoch': 3.25} 65%|██████▍ | 1045/1610 [10:12:29<2:26:17, 15.54s/it] 65%|██████▍ | 1046/1610 [10:12:46<2:29:25, 15.90s/it] {'loss': 0.012, 'grad_norm': 1.5601881711814023, 'learning_rate': 3.5031055900621117e-07, 'completion_length': 147.73214721679688, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250001192092896, 'reward_std': 0.22229483723640442, 'kl': 0.2998046875, 'epoch': 3.25} 65%|██████▍ | 1046/1610 [10:12:46<2:29:25, 15.90s/it] 65%|██████▌ | 1047/1610 [10:12:59<2:22:49, 15.22s/it] {'loss': 0.0098, 'grad_norm': 3.643726213563598, 'learning_rate': 
3.4968944099378885e-07, 'completion_length': 103.01786041259766, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3750000596046448, 'reward_std': 0.29123931378126144, 'kl': 0.24560546875, 'epoch': 3.25} 65%|██████▌ | 1047/1610 [10:12:59<2:22:49, 15.22s/it] 65%|██████▌ | 1048/1610 [10:13:19<2:35:03, 16.55s/it] {'loss': 0.0174, 'grad_norm': 4.7414997245991515, 'learning_rate': 3.4906832298136643e-07, 'completion_length': 130.8214340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.38677529990673065, 'kl': 0.43359375, 'epoch': 3.25} 65%|██████▌ | 1048/1610 [10:13:19<2:35:03, 16.55s/it] 65%|██████▌ | 1049/1610 [10:13:34<2:29:38, 16.01s/it] {'loss': 0.0022, 'grad_norm': 0.9141021622924754, 'learning_rate': 3.4844720496894407e-07, 'completion_length': 134.1607208251953, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.12371791899204254, 'kl': 0.0543212890625, 'epoch': 3.26} 65%|██████▌ | 1049/1610 [10:13:34<2:29:38, 16.01s/it] 65%|██████▌ | 1050/1610 [10:13:46<2:19:13, 14.92s/it] {'loss': 0.0027, 'grad_norm': 0.912853651677156, 'learning_rate': 3.478260869565217e-07, 'completion_length': 105.78571701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1181928999722004, 'kl': 0.067138671875, 'epoch': 3.26} 65%|██████▌ | 1050/1610 [10:13:46<2:19:13, 14.92s/it] 65%|██████▌ | 1051/1610 [10:13:58<2:10:30, 14.01s/it] {'loss': 0.0026, 'grad_norm': 2.936081622516616, 'learning_rate': 3.4720496894409933e-07, 'completion_length': 103.33929061889648, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2967643439769745, 'kl': 0.0660400390625, 'epoch': 3.26} 65%|██████▌ | 1051/1610 [10:13:58<2:10:30, 14.01s/it] 
65%|██████▌ | 1052/1610 [10:14:18<2:25:45, 15.67s/it] {'loss': 0.0222, 'grad_norm': 2.2718764231193393, 'learning_rate': 3.46583850931677e-07, 'completion_length': 129.17857360839844, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5535715222358704, 'reward_std': 0.3762020319700241, 'kl': 0.552734375, 'epoch': 3.27} 65%|██████▌ | 1052/1610 [10:14:18<2:25:45, 15.67s/it] 65%|██████▌ | 1053/1610 [10:14:34<2:26:46, 15.81s/it] {'loss': 0.0132, 'grad_norm': 2.5273420974944423, 'learning_rate': 3.4596273291925465e-07, 'completion_length': 100.5535774230957, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5000000596046448, 'reward_std': 0.2253357544541359, 'kl': 0.3302001953125, 'epoch': 3.27} 65%|██████▌ | 1053/1610 [10:14:34<2:26:46, 15.81s/it] 65%|██████▌ | 1054/1610 [10:14:51<2:30:28, 16.24s/it] {'loss': 0.0101, 'grad_norm': 24.217754824783555, 'learning_rate': 3.453416149068323e-07, 'completion_length': 121.91072463989258, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.18409644439816475, 'kl': 0.251708984375, 'epoch': 3.27} 65%|██████▌ | 1054/1610 [10:14:51<2:30:28, 16.24s/it] 66%|██████▌ | 1055/1610 [10:15:06<2:27:55, 15.99s/it] {'loss': 0.0101, 'grad_norm': 1.936638278111634, 'learning_rate': 3.447204968944099e-07, 'completion_length': 121.17858123779297, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.2947069779038429, 'kl': 0.25048828125, 'epoch': 3.28} 66%|██████▌ | 1055/1610 [10:15:06<2:27:55, 15.99s/it] 66%|██████▌ | 1056/1610 [10:15:22<2:25:19, 15.74s/it] {'loss': 0.0104, 'grad_norm': 1.1724685426266854, 'learning_rate': 3.440993788819876e-07, 'completion_length': 112.66072082519531, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.7142857909202576, 'reward_std': 0.21676981821656227, 'kl': 0.2607421875, 'epoch': 3.28} 66%|██████▌ | 1056/1610 [10:15:22<2:25:19, 15.74s/it] 66%|██████▌ | 1057/1610 [10:15:38<2:27:52, 16.05s/it] {'loss': 0.014, 'grad_norm': 1.6684404873021583, 'learning_rate': 3.4347826086956523e-07, 'completion_length': 124.96429061889648, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.2500000149011612, 'kl': 0.349609375, 'epoch': 3.28} 66%|██████▌ | 1057/1610 [10:15:38<2:27:52, 16.05s/it] 66%|██████▌ | 1058/1610 [10:15:54<2:26:36, 15.94s/it] {'loss': 0.0127, 'grad_norm': 1.6951128253937855, 'learning_rate': 3.4285714285714286e-07, 'completion_length': 123.48214721679688, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6428572535514832, 'reward_std': 0.18409645557403564, 'kl': 0.31903076171875, 'epoch': 3.29} 66%|██████▌ | 1058/1610 [10:15:54<2:26:36, 15.94s/it] 66%|██████▌ | 1059/1610 [10:16:05<2:13:26, 14.53s/it] {'loss': 0.0095, 'grad_norm': 2.19866754750811, 'learning_rate': 3.422360248447205e-07, 'completion_length': 105.01786041259766, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4107143878936768, 'reward_std': 0.32391268014907837, 'kl': 0.23828125, 'epoch': 3.29} 66%|██████▌ | 1059/1610 [10:16:05<2:13:26, 14.53s/it] 66%|██████▌ | 1060/1610 [10:16:21<2:16:59, 14.94s/it] {'loss': 0.0294, 'grad_norm': 2.215475334065113, 'learning_rate': 3.416149068322981e-07, 'completion_length': 114.39286041259766, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6785714626312256, 'reward_std': 0.2253357619047165, 'kl': 0.73486328125, 'epoch': 3.29} 66%|██████▌ | 1060/1610 [10:16:21<2:16:59, 14.94s/it] 66%|██████▌ | 1061/1610 [10:16:35<2:14:17, 14.68s/it] {'loss': 0.0044, 'grad_norm': 1.2172959501769227, 'learning_rate': 
3.4099378881987576e-07, 'completion_length': 112.08929061889648, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250001192092896, 'reward_std': 0.16546405479311943, 'kl': 0.10986328125, 'epoch': 3.3} 66%|██████▌ | 1061/1610 [10:16:35<2:14:17, 14.68s/it] 66%|██████▌ | 1062/1610 [10:16:56<2:29:31, 16.37s/it] {'loss': 0.0323, 'grad_norm': 2.198351404080532, 'learning_rate': 3.403726708074534e-07, 'completion_length': 124.1785774230957, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5357143878936768, 'reward_std': 0.17098907008767128, 'kl': 0.810546875, 'epoch': 3.3} 66%|██████▌ | 1062/1610 [10:16:56<2:29:31, 16.37s/it] 66%|██████▌ | 1063/1610 [10:17:16<2:40:06, 17.56s/it] {'loss': 0.0238, 'grad_norm': 2.1123554441203902, 'learning_rate': 3.3975155279503103e-07, 'completion_length': 127.73215103149414, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3928571939468384, 'reward_std': 0.39838217198848724, 'kl': 0.59375, 'epoch': 3.3} 66%|██████▌ | 1063/1610 [10:17:16<2:40:06, 17.56s/it] 66%|██████▌ | 1064/1610 [10:17:33<2:38:45, 17.45s/it] {'loss': 0.0217, 'grad_norm': 1.3580757409248467, 'learning_rate': 3.3913043478260866e-07, 'completion_length': 130.30358123779297, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.17553051188588142, 'kl': 0.5400390625, 'epoch': 3.3} 66%|██████▌ | 1064/1610 [10:17:33<2:38:45, 17.45s/it] 66%|██████▌ | 1065/1610 [10:17:48<2:31:26, 16.67s/it] {'loss': 0.0044, 'grad_norm': 4.789451374689688, 'learning_rate': 3.385093167701863e-07, 'completion_length': 116.17857360839844, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.410714328289032, 'reward_std': 0.14838216826319695, 'kl': 0.110595703125, 'epoch': 3.31} 66%|██████▌ | 1065/1610 
[10:17:48<2:31:26, 16.67s/it] 66%|██████▌ | 1066/1610 [10:17:59<2:15:17, 14.92s/it] {'loss': 0.0074, 'grad_norm': 1.112066150788986, 'learning_rate': 3.37888198757764e-07, 'completion_length': 103.35714721679688, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.1854248046875, 'epoch': 3.31} 66%|██████▌ | 1066/1610 [10:17:59<2:15:17, 14.92s/it] 66%|██████▋ | 1067/1610 [10:18:16<2:22:24, 15.74s/it] {'loss': 0.0204, 'grad_norm': 1.6493162400486738, 'learning_rate': 3.372670807453416e-07, 'completion_length': 113.37500762939453, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4285715222358704, 'reward_std': 0.2142857201397419, 'kl': 0.510498046875, 'epoch': 3.31} 66%|██████▋ | 1067/1610 [10:18:16<2:22:24, 15.74s/it] 66%|██████▋ | 1068/1610 [10:18:33<2:24:47, 16.03s/it] {'loss': 0.0216, 'grad_norm': 1.5991488908506901, 'learning_rate': 3.3664596273291924e-07, 'completion_length': 140.08929061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5000000596046448, 'reward_std': 0.27260691300034523, 'kl': 0.5419921875, 'epoch': 3.32} 66%|██████▋ | 1068/1610 [10:18:33<2:24:47, 16.03s/it] 66%|██████▋ | 1069/1610 [10:18:50<2:25:36, 16.15s/it] {'loss': 0.0182, 'grad_norm': 1.1893593854435862, 'learning_rate': 3.360248447204969e-07, 'completion_length': 129.7857208251953, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.453857421875, 'epoch': 3.32} 66%|██████▋ | 1069/1610 [10:18:50<2:25:36, 16.15s/it] 66%|██████▋ | 1070/1610 [10:19:00<2:10:13, 14.47s/it] {'loss': 0.0082, 'grad_norm': 1.82430022785456, 'learning_rate': 3.3540372670807456e-07, 'completion_length': 89.69643020629883, 'rewards/accuracy_reward': 0.4464285969734192, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.1428571529686451, 'kl': 0.206298828125, 'epoch': 3.32} 66%|██████▋ | 1070/1610 [10:19:00<2:10:13, 14.47s/it] 67%|██████▋ | 1071/1610 [10:19:12<2:02:42, 13.66s/it] {'loss': 0.0061, 'grad_norm': 1.8144494808706637, 'learning_rate': 3.347826086956522e-07, 'completion_length': 101.14286041259766, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.23689261823892593, 'kl': 0.151611328125, 'epoch': 3.33} 67%|██████▋ | 1071/1610 [10:19:12<2:02:42, 13.66s/it] 67%|██████▋ | 1072/1610 [10:19:23<1:55:45, 12.91s/it] {'loss': 0.0092, 'grad_norm': 3.7630075345103795, 'learning_rate': 3.3416149068322977e-07, 'completion_length': 103.4464340209961, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7500001192092896, 'reward_std': 0.17098906636238098, 'kl': 0.23046875, 'epoch': 3.33} 67%|██████▋ | 1072/1610 [10:19:23<1:55:45, 12.91s/it] 67%|██████▋ | 1073/1610 [10:19:39<2:03:25, 13.79s/it] {'loss': 0.0096, 'grad_norm': 1.3729599183639443, 'learning_rate': 3.335403726708074e-07, 'completion_length': 109.26786041259766, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.1896214783191681, 'kl': 0.2392578125, 'epoch': 3.33} 67%|██████▋ | 1073/1610 [10:19:39<2:03:25, 13.79s/it] 67%|██████▋ | 1074/1610 [10:19:51<1:57:55, 13.20s/it] {'loss': 0.0264, 'grad_norm': 2.1581127472879036, 'learning_rate': 3.3291925465838504e-07, 'completion_length': 107.35714721679688, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.4642857313156128, 'reward_std': 0.3792429566383362, 'kl': 0.658203125, 'epoch': 3.34} 67%|██████▋ | 1074/1610 [10:19:51<1:57:55, 13.20s/it] 67%|██████▋ | 1075/1610 [10:20:07<2:05:44, 14.10s/it] {'loss': 0.034, 
'grad_norm': 1.4959625979197515, 'learning_rate': 3.322981366459627e-07, 'completion_length': 110.05357360839844, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.732142984867096, 'reward_std': 0.1896214783191681, 'kl': 0.849365234375, 'epoch': 3.34} 67%|██████▋ | 1075/1610 [10:20:07<2:05:44, 14.10s/it] 67%|██████▋ | 1076/1610 [10:20:19<1:59:44, 13.45s/it] {'loss': 0.0082, 'grad_norm': 1.1924643161769166, 'learning_rate': 3.3167701863354036e-07, 'completion_length': 91.76786041259766, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.732142984867096, 'reward_std': 0.21981074661016464, 'kl': 0.20361328125, 'epoch': 3.34} 67%|██████▋ | 1076/1610 [10:20:19<1:59:44, 13.45s/it] 67%|██████▋ | 1077/1610 [10:20:31<1:55:39, 13.02s/it] {'loss': 0.0169, 'grad_norm': 1.8727263021156688, 'learning_rate': 3.31055900621118e-07, 'completion_length': 97.85714721679688, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6428572535514832, 'reward_std': 0.2534676864743233, 'kl': 0.421630859375, 'epoch': 3.34} 67%|██████▋ | 1077/1610 [10:20:31<1:55:39, 13.02s/it] 67%|██████▋ | 1078/1610 [10:20:43<1:52:49, 12.72s/it] {'loss': 0.0021, 'grad_norm': 1.7181591122937816, 'learning_rate': 3.304347826086956e-07, 'completion_length': 85.9285774230957, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1071428619325161, 'kl': 0.052001953125, 'epoch': 3.35} 67%|██████▋ | 1078/1610 [10:20:43<1:52:49, 12.72s/it] 67%|██████▋ | 1079/1610 [10:20:58<1:58:47, 13.42s/it] {'loss': 0.0229, 'grad_norm': 3.5037665331244385, 'learning_rate': 3.298136645962733e-07, 'completion_length': 112.14286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.5178571939468384, 'reward_std': 0.36266787350177765, 'kl': 0.57421875, 
'epoch': 3.35} 67%|██████▋ | 1079/1610 [10:20:58<1:58:47, 13.42s/it] 67%|██████▋ | 1080/1610 [10:21:09<1:53:00, 12.79s/it] {'loss': 0.0029, 'grad_norm': 1.7483019067023728, 'learning_rate': 3.2919254658385094e-07, 'completion_length': 91.85714721679688, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.21981073170900345, 'kl': 0.072265625, 'epoch': 3.35} 67%|██████▋ | 1080/1610 [10:21:09<1:53:00, 12.79s/it] 67%|██████▋ | 1081/1610 [10:21:21<1:49:08, 12.38s/it] {'loss': 0.0281, 'grad_norm': 2.816684434061445, 'learning_rate': 3.2857142857142857e-07, 'completion_length': 92.83929061889648, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.446428656578064, 'reward_std': 0.1896214708685875, 'kl': 0.7017822265625, 'epoch': 3.36} 67%|██████▋ | 1081/1610 [10:21:21<1:49:08, 12.38s/it] 67%|██████▋ | 1082/1610 [10:21:31<1:43:56, 11.81s/it] {'loss': 0.0122, 'grad_norm': 1.8340305008391555, 'learning_rate': 3.279503105590062e-07, 'completion_length': 89.4464340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5714285969734192, 'reward_std': 0.2967643439769745, 'kl': 0.3055419921875, 'epoch': 3.36} 67%|██████▋ | 1082/1610 [10:21:31<1:43:56, 11.81s/it] 67%|██████▋ | 1083/1610 [10:21:45<1:48:50, 12.39s/it] {'loss': 0.0204, 'grad_norm': 2.9888386456450147, 'learning_rate': 3.273291925465838e-07, 'completion_length': 109.3214340209961, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6071429252624512, 'reward_std': 0.2967643290758133, 'kl': 0.508544921875, 'epoch': 3.36} 67%|██████▋ | 1083/1610 [10:21:45<1:48:50, 12.39s/it] 67%|██████▋ | 1084/1610 [10:21:57<1:47:18, 12.24s/it] {'loss': 0.0068, 'grad_norm': 1.9573358328127723, 'learning_rate': 3.2670807453416147e-07, 'completion_length': 91.10714721679688, 
'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.20670336857438087, 'kl': 0.16943359375, 'epoch': 3.37} 67%|██████▋ | 1084/1610 [10:21:57<1:47:18, 12.24s/it] 67%|██████▋ | 1085/1610 [10:22:08<1:44:43, 11.97s/it] {'loss': 0.0148, 'grad_norm': 1.0310102751130084, 'learning_rate': 3.260869565217391e-07, 'completion_length': 99.0714340209961, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6964285969734192, 'reward_std': 0.1071428656578064, 'kl': 0.3670654296875, 'epoch': 3.37} 67%|██████▋ | 1085/1610 [10:22:08<1:44:43, 11.97s/it] 67%|██████▋ | 1086/1610 [10:22:18<1:37:54, 11.21s/it] {'loss': 0.0016, 'grad_norm': 0.0986109297315979, 'learning_rate': 3.2546583850931673e-07, 'completion_length': 82.17857360839844, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.0, 'kl': 0.0391845703125, 'epoch': 3.37} 67%|██████▋ | 1086/1610 [10:22:18<1:37:54, 11.21s/it] 68%|██████▊ | 1087/1610 [10:22:29<1:37:14, 11.16s/it] {'loss': 0.0042, 'grad_norm': 0.8101112771132012, 'learning_rate': 3.2484472049689437e-07, 'completion_length': 97.71429061889648, 'rewards/accuracy_reward': 0.6607143133878708, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250000596046448, 'reward_std': 0.15943220257759094, 'kl': 0.1051025390625, 'epoch': 3.38} 68%|██████▊ | 1087/1610 [10:22:29<1:37:14, 11.16s/it] 68%|██████▊ | 1088/1610 [10:22:43<1:45:07, 12.08s/it] {'loss': 0.0243, 'grad_norm': 2.753681418639235, 'learning_rate': 3.2422360248447205e-07, 'completion_length': 108.89286422729492, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.40639129281044006, 'kl': 0.609375, 'epoch': 3.38} 68%|██████▊ | 1088/1610 [10:22:43<1:45:07, 12.08s/it] 68%|██████▊ | 1089/1610 [10:22:54<1:43:15, 11.89s/it] 
{'loss': 0.0022, 'grad_norm': 1.3519620616031867, 'learning_rate': 3.236024844720497e-07, 'completion_length': 92.51786041259766, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.0546875, 'epoch': 3.38} 68%|██████▊ | 1089/1610 [10:22:54<1:43:15, 11.89s/it] 68%|██████▊ | 1090/1610 [10:23:05<1:40:26, 11.59s/it] {'loss': 0.0027, 'grad_norm': 2.6798285723271906, 'learning_rate': 3.229813664596273e-07, 'completion_length': 92.4285774230957, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.14838215708732605, 'kl': 0.067626953125, 'epoch': 3.39} 68%|██████▊ | 1090/1610 [10:23:05<1:40:26, 11.59s/it] 68%|██████▊ | 1091/1610 [10:23:19<1:45:16, 12.17s/it] {'loss': 0.0121, 'grad_norm': 2.002429857217018, 'learning_rate': 3.2236024844720495e-07, 'completion_length': 95.50000381469727, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.2937234044075012, 'kl': 0.302734375, 'epoch': 3.39} 68%|██████▊ | 1091/1610 [10:23:19<1:45:16, 12.17s/it] 68%|██████▊ | 1092/1610 [10:23:29<1:40:16, 11.61s/it] {'loss': 0.0123, 'grad_norm': 2.7390020543779117, 'learning_rate': 3.217391304347826e-07, 'completion_length': 101.75000381469727, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.7500000596046448, 'reward_std': 0.2580091208219528, 'kl': 0.306884765625, 'epoch': 3.39} 68%|██████▊ | 1092/1610 [10:23:29<1:40:16, 11.61s/it] 68%|██████▊ | 1093/1610 [10:23:45<1:50:08, 12.78s/it] {'loss': 0.0267, 'grad_norm': 1.5460970849959408, 'learning_rate': 3.2111801242236027e-07, 'completion_length': 104.91071701049805, 'rewards/accuracy_reward': 0.5357143133878708, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.18409645557403564, 'kl': 
0.66552734375, 'epoch': 3.39} 68%|██████▊ | 1093/1610 [10:23:45<1:50:08, 12.78s/it] 68%|██████▊ | 1094/1610 [10:23:55<1:42:48, 11.96s/it] {'loss': 0.0169, 'grad_norm': 1.9930973274274901, 'learning_rate': 3.204968944099379e-07, 'completion_length': 83.1785774230957, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5357143878936768, 'reward_std': 0.38172708451747894, 'kl': 0.423095703125, 'epoch': 3.4} 68%|██████▊ | 1094/1610 [10:23:55<1:42:48, 11.96s/it] 68%|██████▊ | 1095/1610 [10:24:05<1:39:48, 11.63s/it] {'loss': 0.0084, 'grad_norm': 1.3039959580405505, 'learning_rate': 3.198757763975155e-07, 'completion_length': 94.87500762939453, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.14838216826319695, 'kl': 0.20947265625, 'epoch': 3.4} 68%|██████▊ | 1095/1610 [10:24:05<1:39:48, 11.63s/it] 68%|██████▊ | 1096/1610 [10:24:19<1:45:39, 12.33s/it] {'loss': 0.0343, 'grad_norm': 2.5130524021866814, 'learning_rate': 3.192546583850931e-07, 'completion_length': 114.98214340209961, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.3571428656578064, 'reward_std': 0.38172706961631775, 'kl': 0.857421875, 'epoch': 3.4} 68%|██████▊ | 1096/1610 [10:24:19<1:45:39, 12.33s/it] 68%|██████▊ | 1097/1610 [10:24:31<1:42:14, 11.96s/it] {'loss': 0.0078, 'grad_norm': 2.376216551888104, 'learning_rate': 3.186335403726708e-07, 'completion_length': 89.26786422729492, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.20117834210395813, 'kl': 0.195068359375, 'epoch': 3.41} 68%|██████▊ | 1097/1610 [10:24:31<1:42:14, 11.96s/it] 68%|██████▊ | 1098/1610 [10:24:40<1:35:10, 11.15s/it] {'loss': 0.0026, 'grad_norm': 1.7913936336601688, 'learning_rate': 3.1801242236024843e-07, 'completion_length': 75.50000381469727, 
'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.14838215708732605, 'kl': 0.0643310546875, 'epoch': 3.41} 68%|██████▊ | 1098/1610 [10:24:40<1:35:10, 11.15s/it] 68%|██████▊ | 1099/1610 [10:24:50<1:33:18, 10.96s/it] {'loss': 0.0033, 'grad_norm': 1.267105568937748, 'learning_rate': 3.1739130434782606e-07, 'completion_length': 86.76786041259766, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1071428619325161, 'kl': 0.0823974609375, 'epoch': 3.41} 68%|██████▊ | 1099/1610 [10:24:50<1:33:18, 10.96s/it] 68%|██████▊ | 1100/1610 [10:25:00<1:31:09, 10.73s/it] {'loss': 0.0064, 'grad_norm': 3.1815993681661388, 'learning_rate': 3.167701863354037e-07, 'completion_length': 76.50000381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.321428582072258, 'kl': 0.15966796875, 'epoch': 3.42} 68%|██████▊ | 1100/1610 [10:25:00<1:31:09, 10.73s/it] 68%|██████▊ | 1101/1610 [10:28:11<9:08:52, 64.70s/it] {'loss': 0.0023, 'grad_norm': 1.00375575287413, 'learning_rate': 3.1614906832298133e-07, 'completion_length': 97.33929061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.0565185546875, 'epoch': 3.42} 68%|██████▊ | 1101/1610 [10:28:11<9:08:52, 64.70s/it] 68%|██████▊ | 1102/1610 [10:28:22<6:50:03, 48.43s/it] {'loss': 0.0056, 'grad_norm': 2.0199810347967078, 'learning_rate': 3.15527950310559e-07, 'completion_length': 91.87500762939453, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 'reward_std': 0.2253357470035553, 'kl': 0.138916015625, 'epoch': 3.42} 68%|██████▊ | 1102/1610 [10:28:22<6:50:03, 48.43s/it] 69%|██████▊ | 1103/1610 [10:28:36<5:23:45, 38.31s/it] 
{'loss': 0.0178, 'grad_norm': 2.880391668778059, 'learning_rate': 3.1490683229813665e-07, 'completion_length': 86.78571701049805, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.892857164144516, 'reward': 1.535714328289032, 'reward_std': 0.2967643439769745, 'kl': 0.44384765625, 'epoch': 3.43} 69%|██████▊ | 1103/1610 [10:28:36<5:23:45, 38.31s/it] 69%|██████▊ | 1104/1610 [10:28:47<4:13:46, 30.09s/it] {'loss': 0.02, 'grad_norm': 1.9076119248822072, 'learning_rate': 3.142857142857143e-07, 'completion_length': 95.98214721679688, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3035715222358704, 'reward_std': 0.32639679312705994, 'kl': 0.4990234375, 'epoch': 3.43} 69%|██████▊ | 1104/1610 [10:28:47<4:13:46, 30.09s/it] 69%|██████▊ | 1105/1610 [10:28:57<3:23:11, 24.14s/it] {'loss': 0.0145, 'grad_norm': 3.6470483757661265, 'learning_rate': 3.136645962732919e-07, 'completion_length': 89.60714721679688, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.910714328289032, 'reward': 1.410714328289032, 'reward_std': 0.33496272563934326, 'kl': 0.3603515625, 'epoch': 3.43} 69%|██████▊ | 1105/1610 [10:28:57<3:23:11, 24.14s/it] 69%|██████▊ | 1106/1610 [10:29:12<2:59:29, 21.37s/it] {'loss': 0.0226, 'grad_norm': 2.353720779925604, 'learning_rate': 3.130434782608696e-07, 'completion_length': 90.21428680419922, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.446428656578064, 'reward_std': 0.30626384913921356, 'kl': 0.56689453125, 'epoch': 3.43} 69%|██████▊ | 1106/1610 [10:29:12<2:59:29, 21.37s/it] 69%|██████▉ | 1107/1610 [10:29:23<2:32:46, 18.22s/it] {'loss': 0.004, 'grad_norm': 3.012107835715237, 'learning_rate': 3.1242236024844723e-07, 'completion_length': 83.35714721679688, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7500001192092896, 'reward_std': 0.2857143059372902, 
'kl': 0.100341796875, 'epoch': 3.44} 69%|██████▉ | 1107/1610 [10:29:23<2:32:46, 18.22s/it] 69%|██████▉ | 1108/1610 [10:29:38<2:24:04, 17.22s/it] {'loss': 0.0522, 'grad_norm': 2.8230777039457062, 'learning_rate': 3.118012422360248e-07, 'completion_length': 94.48214721679688, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.892857164144516, 'reward': 1.3750000596046448, 'reward_std': 0.29166606068611145, 'kl': 1.30224609375, 'epoch': 3.44} 69%|██████▉ | 1108/1610 [10:29:38<2:24:04, 17.22s/it] 69%|██████▉ | 1109/1610 [10:29:49<2:07:42, 15.29s/it] {'loss': 0.022, 'grad_norm': 2.192301148572249, 'learning_rate': 3.1118012422360244e-07, 'completion_length': 90.64286041259766, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5000000596046448, 'reward_std': 0.357569620013237, 'kl': 0.548828125, 'epoch': 3.44} 69%|██████▉ | 1109/1610 [10:29:49<2:07:42, 15.29s/it] 69%|██████▉ | 1110/1610 [10:30:04<2:08:08, 15.38s/it] {'loss': 0.0429, 'grad_norm': 2.192845996695257, 'learning_rate': 3.105590062111801e-07, 'completion_length': 96.00000381469727, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.446428656578064, 'reward_std': 0.3434786796569824, 'kl': 1.07421875, 'epoch': 3.45} 69%|██████▉ | 1110/1610 [10:30:05<2:08:08, 15.38s/it] 69%|██████▉ | 1111/1610 [10:30:15<1:55:03, 13.83s/it] {'loss': 0.021, 'grad_norm': 5.654341670698152, 'learning_rate': 3.0993788819875776e-07, 'completion_length': 70.58928871154785, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.535714328289032, 'reward_std': 0.24695909023284912, 'kl': 0.526123046875, 'epoch': 3.45} 69%|██████▉ | 1111/1610 [10:30:15<1:55:03, 13.83s/it] 69%|██████▉ | 1112/1610 [10:30:29<1:55:57, 13.97s/it] {'loss': 0.0348, 'grad_norm': 3.0768999630487905, 'learning_rate': 3.093167701863354e-07, 'completion_length': 102.08929061889648, 
'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.3571429252624512, 'reward_std': 0.32695361226797104, 'kl': 0.8720703125, 'epoch': 3.45} 69%|██████▉ | 1112/1610 [10:30:29<1:55:57, 13.97s/it] 69%|██████▉ | 1113/1610 [10:30:38<1:44:22, 12.60s/it] {'loss': 0.0078, 'grad_norm': 1.8112992505213874, 'learning_rate': 3.08695652173913e-07, 'completion_length': 75.41071701049805, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.21981074661016464, 'kl': 0.19482421875, 'epoch': 3.46} 69%|██████▉ | 1113/1610 [10:30:38<1:44:22, 12.60s/it] 69%|██████▉ | 1114/1610 [10:30:53<1:49:16, 13.22s/it] {'loss': 0.0343, 'grad_norm': 3.503383717139712, 'learning_rate': 3.0807453416149066e-07, 'completion_length': 98.17857360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.8392857611179352, 'reward': 1.3571429252624512, 'reward_std': 0.43353964388370514, 'kl': 0.857421875, 'epoch': 3.46} 69%|██████▉ | 1114/1610 [10:30:53<1:49:16, 13.22s/it] 69%|██████▉ | 1115/1610 [10:31:09<1:55:15, 13.97s/it] {'loss': 0.0478, 'grad_norm': 3.5600909014096747, 'learning_rate': 3.0745341614906834e-07, 'completion_length': 88.64286041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.3928571939468384, 'reward_std': 0.24695908278226852, 'kl': 1.19775390625, 'epoch': 3.46} 69%|██████▉ | 1115/1610 [10:31:09<1:55:15, 13.97s/it] 69%|██████▉ | 1116/1610 [10:31:24<1:58:42, 14.42s/it] {'loss': 0.0373, 'grad_norm': 3.1199359866573055, 'learning_rate': 3.06832298136646e-07, 'completion_length': 101.9464340209961, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 0.8035714626312256, 'reward': 1.285714328289032, 'reward_std': 0.26657506823539734, 'kl': 0.9296875, 'epoch': 3.47} 69%|██████▉ | 1116/1610 [10:31:24<1:58:42, 14.42s/it] 69%|██████▉ | 1117/1610 
[10:31:36<1:51:34, 13.58s/it] {'loss': 0.0331, 'grad_norm': 3.298954877300793, 'learning_rate': 3.062111801242236e-07, 'completion_length': 90.05357360839844, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.3928571939468384, 'reward_std': 0.37919294834136963, 'kl': 0.82421875, 'epoch': 3.47} 69%|██████▉ | 1117/1610 [10:31:36<1:51:34, 13.58s/it] 69%|██████▉ | 1118/1610 [10:31:47<1:44:11, 12.71s/it] {'loss': 0.0363, 'grad_norm': 2.8586410089135037, 'learning_rate': 3.0559006211180124e-07, 'completion_length': 76.10714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.5178571939468384, 'reward_std': 0.40639129281044006, 'kl': 0.91015625, 'epoch': 3.47} 69%|██████▉ | 1118/1610 [10:31:47<1:44:11, 12.71s/it] 70%|██████▉ | 1119/1610 [10:31:58<1:40:59, 12.34s/it] {'loss': 0.0416, 'grad_norm': 3.2982781494192306, 'learning_rate': 3.049689440993788e-07, 'completion_length': 78.17857360839844, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.3928571939468384, 'reward_std': 0.18409644439816475, 'kl': 1.04296875, 'epoch': 3.48} 70%|██████▉ | 1119/1610 [10:31:58<1:40:59, 12.34s/it] 70%|██████▉ | 1120/1610 [10:32:08<1:33:55, 11.50s/it] {'loss': 0.0279, 'grad_norm': 5.702741245493839, 'learning_rate': 3.043478260869565e-07, 'completion_length': 72.01786041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.8750000596046448, 'reward': 1.446428656578064, 'reward_std': 0.42098909616470337, 'kl': 0.6953125, 'epoch': 3.48} 70%|██████▉ | 1120/1610 [10:32:08<1:33:55, 11.50s/it] 70%|██████▉ | 1121/1610 [10:32:18<1:31:12, 11.19s/it] {'loss': 0.0217, 'grad_norm': 2.153294198978398, 'learning_rate': 3.0372670807453414e-07, 'completion_length': 78.64286041259766, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 
'reward_std': 0.2836569547653198, 'kl': 0.54296875, 'epoch': 3.48} 70%|██████▉ | 1121/1610 [10:32:18<1:31:12, 11.19s/it] 70%|██████▉ | 1122/1610 [10:32:27<1:25:49, 10.55s/it] {'loss': 0.0088, 'grad_norm': 3.3391899460365635, 'learning_rate': 3.0310559006211177e-07, 'completion_length': 68.83929061889648, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4642857909202576, 'reward_std': 0.2253357619047165, 'kl': 0.22119140625, 'epoch': 3.48} 70%|██████▉ | 1122/1610 [10:32:27<1:25:49, 10.55s/it] 70%|██████▉ | 1123/1610 [10:32:38<1:25:32, 10.54s/it] {'loss': 0.0083, 'grad_norm': 2.2020250449411147, 'learning_rate': 3.024844720496894e-07, 'completion_length': 90.26786041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.535714328289032, 'reward_std': 0.21676981821656227, 'kl': 0.208251953125, 'epoch': 3.49} 70%|██████▉ | 1123/1610 [10:32:38<1:25:32, 10.54s/it] 70%|██████▉ | 1124/1610 [10:32:46<1:19:55, 9.87s/it] {'loss': 0.0091, 'grad_norm': 1.976667111963289, 'learning_rate': 3.018633540372671e-07, 'completion_length': 62.10714530944824, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678571939468384, 'reward_std': 0.1785714402794838, 'kl': 0.226806640625, 'epoch': 3.49} 70%|██████▉ | 1124/1610 [10:32:46<1:19:55, 9.87s/it] 70%|██████▉ | 1125/1610 [10:32:56<1:19:27, 9.83s/it] {'loss': 0.0152, 'grad_norm': 1.8886413009943357, 'learning_rate': 3.012422360248447e-07, 'completion_length': 82.50000381469727, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.910714328289032, 'reward': 1.321428656578064, 'reward_std': 0.1539071872830391, 'kl': 0.3818359375, 'epoch': 3.49} 70%|██████▉ | 1125/1610 [10:32:56<1:19:27, 9.83s/it] 70%|██████▉ | 1126/1610 [10:33:06<1:19:45, 9.89s/it] {'loss': 0.0048, 'grad_norm': 1.7000767038358706, 'learning_rate': 3.0062111801242235e-07, 'completion_length': 
72.48214340209961, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7857143878936768, 'reward_std': 0.18409644439816475, 'kl': 0.120361328125, 'epoch': 3.5} 70%|██████▉ | 1126/1610 [10:33:06<1:19:45, 9.89s/it] 70%|███████ | 1127/1610 [10:33:15<1:19:09, 9.83s/it] {'loss': 0.0095, 'grad_norm': 2.7722819088894504, 'learning_rate': 3e-07, 'completion_length': 73.41071701049805, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535714328289032, 'reward_std': 0.25552502274513245, 'kl': 0.23828125, 'epoch': 3.5} 70%|███████ | 1127/1610 [10:33:15<1:19:09, 9.83s/it] 70%|███████ | 1128/1610 [10:33:25<1:18:18, 9.75s/it] {'loss': 0.0096, 'grad_norm': 1.174696846375138, 'learning_rate': 2.993788819875776e-07, 'completion_length': 76.51786041259766, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.1071428656578064, 'kl': 0.241455078125, 'epoch': 3.5} 70%|███████ | 1128/1610 [10:33:25<1:18:18, 9.75s/it] 70%|███████ | 1129/1610 [10:33:34<1:16:31, 9.54s/it] {'loss': 0.0052, 'grad_norm': 3.5105587735450916, 'learning_rate': 2.987577639751553e-07, 'completion_length': 72.76786041259766, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3750000596046448, 'reward_std': 0.29123932123184204, 'kl': 0.1298828125, 'epoch': 3.51} 70%|███████ | 1129/1610 [10:33:34<1:16:31, 9.54s/it] 70%|███████ | 1130/1610 [10:33:50<1:30:55, 11.37s/it] {'loss': 0.0197, 'grad_norm': 2.3929837819323616, 'learning_rate': 2.9813664596273294e-07, 'completion_length': 91.62500381469727, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2857143878936768, 'reward_std': 0.3294377103447914, 'kl': 0.49169921875, 'epoch': 3.51} 70%|███████ | 1130/1610 [10:33:50<1:30:55, 11.37s/it] 70%|███████ | 1131/1610 
[10:34:04<1:38:16, 12.31s/it] {'loss': 0.013, 'grad_norm': 4.153945467001085, 'learning_rate': 2.975155279503105e-07, 'completion_length': 80.94643020629883, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5892857909202576, 'reward_std': 0.21981073915958405, 'kl': 0.32421875, 'epoch': 3.51} 70%|███████ | 1131/1610 [10:34:04<1:38:16, 12.31s/it] 70%|███████ | 1132/1610 [10:34:14<1:32:17, 11.59s/it] {'loss': 0.027, 'grad_norm': 2.9733074847573575, 'learning_rate': 2.9689440993788815e-07, 'completion_length': 76.58929061889648, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3392857313156128, 'reward_std': 0.29123930633068085, 'kl': 0.6748046875, 'epoch': 3.52} 70%|███████ | 1132/1610 [10:34:14<1:32:17, 11.59s/it] 70%|███████ | 1133/1610 [10:34:24<1:28:53, 11.18s/it] {'loss': 0.0139, 'grad_norm': 2.372921113600613, 'learning_rate': 2.9627329192546583e-07, 'completion_length': 78.44643020629883, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5357143878936768, 'reward_std': 0.18807094916701317, 'kl': 0.34765625, 'epoch': 3.52} 70%|███████ | 1133/1610 [10:34:24<1:28:53, 11.18s/it] 70%|███████ | 1134/1610 [10:34:39<1:37:31, 12.29s/it] {'loss': 0.0235, 'grad_norm': 2.2378622433369806, 'learning_rate': 2.9565217391304347e-07, 'completion_length': 77.78571701049805, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4285715222358704, 'reward_std': 0.21676982194185257, 'kl': 0.5869140625, 'epoch': 3.52} 70%|███████ | 1134/1610 [10:34:39<1:37:31, 12.29s/it] 70%|███████ | 1135/1610 [10:34:49<1:30:54, 11.48s/it] {'loss': 0.0095, 'grad_norm': 1.5545429036336662, 'learning_rate': 2.950310559006211e-07, 'completion_length': 76.03571701049805, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 
'reward_std': 0.1071428619325161, 'kl': 0.237060546875, 'epoch': 3.52} 70%|███████ | 1135/1610 [10:34:49<1:30:54, 11.48s/it] 71%|███████ | 1136/1610 [10:34:59<1:28:26, 11.20s/it] {'loss': 0.0261, 'grad_norm': 2.9874675700766224, 'learning_rate': 2.9440993788819873e-07, 'completion_length': 70.07143020629883, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.2534676790237427, 'kl': 0.6552734375, 'epoch': 3.53} 71%|███████ | 1136/1610 [10:34:59<1:28:26, 11.20s/it] 71%|███████ | 1137/1610 [10:35:09<1:25:01, 10.79s/it] {'loss': 0.0175, 'grad_norm': 2.3298350244072665, 'learning_rate': 2.9378881987577636e-07, 'completion_length': 73.89286041259766, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4285714626312256, 'reward_std': 0.21676981449127197, 'kl': 0.4365234375, 'epoch': 3.53} 71%|███████ | 1137/1610 [10:35:09<1:25:01, 10.79s/it] 71%|███████ | 1138/1610 [10:35:20<1:23:56, 10.67s/it] {'loss': 0.013, 'grad_norm': 1.7617456972197674, 'learning_rate': 2.9316770186335405e-07, 'completion_length': 77.39286041259766, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.18409645557403564, 'kl': 0.32470703125, 'epoch': 3.53} 71%|███████ | 1138/1610 [10:35:20<1:23:56, 10.67s/it] 71%|███████ | 1139/1610 [10:35:29<1:20:30, 10.26s/it] {'loss': 0.0042, 'grad_norm': 1.4089390727040374, 'learning_rate': 2.925465838509317e-07, 'completion_length': 70.37500381469727, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.1044921875, 'epoch': 3.54} 71%|███████ | 1139/1610 [10:35:29<1:20:30, 10.26s/it] 71%|███████ | 1140/1610 [10:35:38<1:17:26, 9.89s/it] {'loss': 0.0058, 'grad_norm': 3.4415021526803566, 'learning_rate': 2.919254658385093e-07, 'completion_length': 68.5, 
'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.3324786126613617, 'kl': 0.143798828125, 'epoch': 3.54} 71%|███████ | 1140/1610 [10:35:38<1:17:26, 9.89s/it] 71%|███████ | 1141/1610 [10:35:48<1:17:41, 9.94s/it] {'loss': 0.0191, 'grad_norm': 5.154209632809036, 'learning_rate': 2.9130434782608695e-07, 'completion_length': 76.19643020629883, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6607143878936768, 'reward_std': 0.2937234044075012, 'kl': 0.4775390625, 'epoch': 3.54} 71%|███████ | 1141/1610 [10:35:48<1:17:41, 9.94s/it] 71%|███████ | 1142/1610 [10:35:57<1:15:45, 9.71s/it] {'loss': 0.0194, 'grad_norm': 2.1733969177276644, 'learning_rate': 2.9068322981366463e-07, 'completion_length': 65.10714530944824, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.15788168087601662, 'kl': 0.484375, 'epoch': 3.55} 71%|███████ | 1142/1610 [10:35:57<1:15:45, 9.71s/it] 71%|███████ | 1143/1610 [10:36:07<1:14:56, 9.63s/it] {'loss': 0.0524, 'grad_norm': 3.1133035104582807, 'learning_rate': 2.900621118012422e-07, 'completion_length': 77.83929061889648, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.5535714626312256, 'reward_std': 0.42098909616470337, 'kl': 1.3095703125, 'epoch': 3.55} 71%|███████ | 1143/1610 [10:36:07<1:14:56, 9.63s/it] 71%|███████ | 1144/1610 [10:36:17<1:16:10, 9.81s/it] {'loss': 0.0119, 'grad_norm': 2.683634295960676, 'learning_rate': 2.8944099378881985e-07, 'completion_length': 63.25000190734863, 'rewards/accuracy_reward': 0.3928571715950966, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.18409644439816475, 'kl': 0.2978515625, 'epoch': 3.55} 71%|███████ | 1144/1610 [10:36:17<1:16:10, 9.81s/it] 71%|███████ | 1145/1610 [10:36:26<1:15:01, 
9.68s/it] {'loss': 0.0093, 'grad_norm': 2.5455409657205066, 'learning_rate': 2.888198757763975e-07, 'completion_length': 81.75000381469727, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178572535514832, 'reward_std': 0.20670334994792938, 'kl': 0.232177734375, 'epoch': 3.56} 71%|███████ | 1145/1610 [10:36:26<1:15:01, 9.68s/it] 71%|███████ | 1146/1610 [10:36:35<1:13:07, 9.46s/it] {'loss': 0.0116, 'grad_norm': 2.326479460110233, 'learning_rate': 2.881987577639751e-07, 'completion_length': 67.28571701049805, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5000000596046448, 'reward_std': 0.2142857313156128, 'kl': 0.289794921875, 'epoch': 3.56} 71%|███████ | 1146/1610 [10:36:35<1:13:07, 9.46s/it] 71%|███████ | 1147/1610 [10:36:50<1:25:08, 11.03s/it] {'loss': 0.0037, 'grad_norm': 1.6040337285718198, 'learning_rate': 2.875776397515528e-07, 'completion_length': 77.48214721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.1428571492433548, 'kl': 0.0914306640625, 'epoch': 3.56} 71%|███████ | 1147/1610 [10:36:50<1:25:08, 11.03s/it] 71%|███████▏ | 1148/1610 [10:36:59<1:21:41, 10.61s/it] {'loss': 0.0079, 'grad_norm': 2.2897659349155712, 'learning_rate': 2.8695652173913043e-07, 'completion_length': 73.01786041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250000596046448, 'reward_std': 0.1785714402794838, 'kl': 0.19775390625, 'epoch': 3.57} 71%|███████▏ | 1148/1610 [10:36:59<1:21:41, 10.61s/it] 71%|███████▏ | 1149/1610 [10:37:14<1:30:39, 11.80s/it] {'loss': 0.0363, 'grad_norm': 2.147263627726297, 'learning_rate': 2.8633540372670806e-07, 'completion_length': 78.76786041259766, 'rewards/accuracy_reward': 0.589285746216774, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5178572535514832, 'reward_std': 
0.26956599950790405, 'kl': 0.91015625, 'epoch': 3.57} 71%|███████▏ | 1149/1610 [10:37:14<1:30:39, 11.80s/it] 71%|███████▏ | 1150/1610 [10:37:24<1:25:39, 11.17s/it] {'loss': 0.0376, 'grad_norm': 2.773335709415106, 'learning_rate': 2.857142857142857e-07, 'completion_length': 75.0, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.3392857909202576, 'reward_std': 0.3022393584251404, 'kl': 0.935546875, 'epoch': 3.57} 71%|███████▏ | 1150/1610 [10:37:24<1:25:39, 11.17s/it] 71%|███████▏ | 1151/1610 [10:37:41<1:39:47, 13.04s/it] {'loss': 0.0805, 'grad_norm': 4.003710892857772, 'learning_rate': 2.850931677018634e-07, 'completion_length': 103.67857360839844, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.785714328289032, 'reward': 1.285714328289032, 'reward_std': 0.6130946576595306, 'kl': 2.0078125, 'epoch': 3.57} 71%|███████▏ | 1151/1610 [10:37:41<1:39:47, 13.04s/it] 72%|███████▏ | 1152/1610 [10:37:50<1:29:44, 11.76s/it] {'loss': 0.0221, 'grad_norm': 12.164400514103825, 'learning_rate': 2.84472049689441e-07, 'completion_length': 62.55357360839844, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.2826733738183975, 'kl': 0.55029296875, 'epoch': 3.58} 72%|███████▏ | 1152/1610 [10:37:50<1:29:44, 11.76s/it] 72%|███████▏ | 1153/1610 [10:37:59<1:23:00, 10.90s/it] {'loss': 0.003, 'grad_norm': 2.2269850741030917, 'learning_rate': 2.8385093167701864e-07, 'completion_length': 61.07143211364746, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.607142984867096, 'reward_std': 0.18409645557403564, 'kl': 0.073974609375, 'epoch': 3.58} 72%|███████▏ | 1153/1610 [10:37:59<1:23:00, 10.90s/it] 72%|███████▏ | 1154/1610 [10:38:07<1:17:10, 10.16s/it] {'loss': 0.0149, 'grad_norm': 2.3968509623944128, 'learning_rate': 2.832298136645963e-07, 'completion_length': 64.37500381469727, 
'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5000000596046448, 'reward_std': 0.1428571529686451, 'kl': 0.372314453125, 'epoch': 3.58} 72%|███████▏ | 1154/1610 [10:38:07<1:17:10, 10.16s/it] 72%|███████▏ | 1155/1610 [10:38:18<1:19:23, 10.47s/it] {'loss': 0.0378, 'grad_norm': 2.7917222057529307, 'learning_rate': 2.8260869565217386e-07, 'completion_length': 80.82143020629883, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5535715222358704, 'reward_std': 0.278131939470768, 'kl': 0.9453125, 'epoch': 3.59} 72%|███████▏ | 1155/1610 [10:38:18<1:19:23, 10.47s/it] 72%|███████▏ | 1156/1610 [10:38:28<1:16:48, 10.15s/it] {'loss': 0.0364, 'grad_norm': 3.7411811716915824, 'learning_rate': 2.8198757763975154e-07, 'completion_length': 72.30357360839844, 'rewards/accuracy_reward': 0.2678571566939354, 'rewards/format_reward': 0.892857164144516, 'reward': 1.160714328289032, 'reward_std': 0.2851574718952179, 'kl': 0.914794921875, 'epoch': 3.59} 72%|███████▏ | 1156/1610 [10:38:28<1:16:48, 10.15s/it] 72%|███████▏ | 1157/1610 [10:38:42<1:26:09, 11.41s/it] {'loss': 0.0717, 'grad_norm': 5.7856941410135265, 'learning_rate': 2.813664596273292e-07, 'completion_length': 80.9464340209961, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.285714328289032, 'reward_std': 0.4637289196252823, 'kl': 1.79296875, 'epoch': 3.59} 72%|███████▏ | 1157/1610 [10:38:42<1:26:09, 11.41s/it] 72%|███████▏ | 1158/1610 [10:38:52<1:22:36, 10.97s/it] {'loss': 0.0145, 'grad_norm': 2.8212988352788253, 'learning_rate': 2.807453416149068e-07, 'completion_length': 64.19643020629883, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.23689261823892593, 'kl': 0.36328125, 'epoch': 3.6} 72%|███████▏ | 1158/1610 [10:38:52<1:22:36, 10.97s/it] 72%|███████▏ | 1159/1610 
[10:39:07<1:31:47, 12.21s/it] {'loss': 0.0669, 'grad_norm': 4.927282332970749, 'learning_rate': 2.8012422360248444e-07, 'completion_length': 83.28572082519531, 'rewards/accuracy_reward': 0.4107142984867096, 'rewards/format_reward': 0.785714328289032, 'reward': 1.196428656578064, 'reward_std': 0.37371793389320374, 'kl': 1.671875, 'epoch': 3.6} 72%|███████▏ | 1159/1610 [10:39:07<1:31:47, 12.21s/it] 72%|███████▏ | 1160/1610 [10:39:17<1:25:13, 11.36s/it] {'loss': 0.013, 'grad_norm': 2.25659327256268, 'learning_rate': 2.7950310559006207e-07, 'completion_length': 65.17857360839844, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.32470703125, 'epoch': 3.6} 72%|███████▏ | 1160/1610 [10:39:17<1:25:13, 11.36s/it] 72%|███████▏ | 1161/1610 [10:39:31<1:32:08, 12.31s/it] {'loss': 0.0397, 'grad_norm': 2.868418874927156, 'learning_rate': 2.7888198757763976e-07, 'completion_length': 78.85714721679688, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.571428656578064, 'reward_std': 0.2967643328011036, 'kl': 0.9921875, 'epoch': 3.61} 72%|███████▏ | 1161/1610 [10:39:31<1:32:08, 12.31s/it] 72%|███████▏ | 1162/1610 [10:39:40<1:25:17, 11.42s/it] {'loss': 0.0063, 'grad_norm': 3.3918215255745574, 'learning_rate': 2.782608695652174e-07, 'completion_length': 70.33928680419922, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1896214708685875, 'kl': 0.1572265625, 'epoch': 3.61} 72%|███████▏ | 1162/1610 [10:39:40<1:25:17, 11.42s/it] 72%|███████▏ | 1163/1610 [10:39:56<1:34:44, 12.72s/it] {'loss': 0.0757, 'grad_norm': 4.215897206205773, 'learning_rate': 2.77639751552795e-07, 'completion_length': 81.9464340209961, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.446428656578064, 'reward_std': 
0.32391268014907837, 'kl': 1.890625, 'epoch': 3.61} 72%|███████▏ | 1163/1610 [10:39:56<1:34:44, 12.72s/it] 72%|███████▏ | 1164/1610 [10:40:05<1:26:51, 11.69s/it] {'loss': 0.0142, 'grad_norm': 1.2923367469443552, 'learning_rate': 2.7701863354037266e-07, 'completion_length': 64.94643211364746, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7857143878936768, 'reward_std': 0.14079980179667473, 'kl': 0.35595703125, 'epoch': 3.61} 72%|███████▏ | 1164/1610 [10:40:05<1:26:51, 11.69s/it] 72%|███████▏ | 1165/1610 [10:40:15<1:20:54, 10.91s/it] {'loss': 0.0081, 'grad_norm': 1.956747828970122, 'learning_rate': 2.7639751552795034e-07, 'completion_length': 58.267860412597656, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.2253357619047165, 'kl': 0.20263671875, 'epoch': 3.62} 72%|███████▏ | 1165/1610 [10:40:15<1:20:54, 10.91s/it] 72%|███████▏ | 1166/1610 [10:40:25<1:20:26, 10.87s/it] {'loss': 0.0394, 'grad_norm': 2.3275078306804486, 'learning_rate': 2.7577639751552797e-07, 'completion_length': 68.58928680419922, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6250001192092896, 'reward_std': 0.2851574793457985, 'kl': 0.986328125, 'epoch': 3.62} 72%|███████▏ | 1166/1610 [10:40:25<1:20:26, 10.87s/it] 72%|███████▏ | 1167/1610 [10:40:33<1:13:48, 10.00s/it] {'loss': 0.0079, 'grad_norm': 3.104391588760859, 'learning_rate': 2.7515527950310555e-07, 'completion_length': 56.12500190734863, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.21981073170900345, 'kl': 0.197265625, 'epoch': 3.62} 72%|███████▏ | 1167/1610 [10:40:33<1:13:48, 10.00s/it] 73%|███████▎ | 1168/1610 [10:40:47<1:22:01, 11.13s/it] {'loss': 0.0284, 'grad_norm': 1.7409669085681831, 'learning_rate': 2.745341614906832e-07, 'completion_length': 
74.4285774230957, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5892857909202576, 'reward_std': 0.14838217198848724, 'kl': 0.7109375, 'epoch': 3.63} 73%|███████▎ | 1168/1610 [10:40:47<1:22:01, 11.13s/it] 73%|███████▎ | 1169/1610 [10:40:56<1:17:49, 10.59s/it] {'loss': 0.016, 'grad_norm': 2.682012241619492, 'learning_rate': 2.739130434782608e-07, 'completion_length': 68.30357551574707, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535714626312256, 'reward_std': 0.14838216453790665, 'kl': 0.4013671875, 'epoch': 3.63} 73%|███████▎ | 1169/1610 [10:40:56<1:17:49, 10.59s/it] 73%|███████▎ | 1170/1610 [10:41:06<1:16:31, 10.44s/it] {'loss': 0.0167, 'grad_norm': 3.17205508494937, 'learning_rate': 2.732919254658385e-07, 'completion_length': 63.42857360839844, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.41552734375, 'epoch': 3.63} 73%|███████▎ | 1170/1610 [10:41:06<1:16:31, 10.44s/it] 73%|███████▎ | 1171/1610 [10:41:16<1:13:30, 10.05s/it] {'loss': 0.0196, 'grad_norm': 2.6702508537805643, 'learning_rate': 2.7267080745341614e-07, 'completion_length': 67.71428871154785, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.489990234375, 'epoch': 3.64} 73%|███████▎ | 1171/1610 [10:41:16<1:13:30, 10.05s/it] 73%|███████▎ | 1172/1610 [10:41:25<1:11:57, 9.86s/it] {'loss': 0.0032, 'grad_norm': 1.352365356528889, 'learning_rate': 2.7204968944099377e-07, 'completion_length': 73.91071701049805, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.07695358991622925, 'kl': 0.080810546875, 'epoch': 3.64} 73%|███████▎ | 1172/1610 [10:41:25<1:11:57, 9.86s/it] 73%|███████▎ | 1173/1610 
[10:41:39<1:21:30, 11.19s/it] {'loss': 0.0161, 'grad_norm': 2.519262476989013, 'learning_rate': 2.714285714285714e-07, 'completion_length': 80.53571701049805, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5892857909202576, 'reward_std': 0.2826733775436878, 'kl': 0.402587890625, 'epoch': 3.64} 73%|███████▎ | 1173/1610 [10:41:39<1:21:30, 11.19s/it] 73%|███████▎ | 1174/1610 [10:41:48<1:16:07, 10.48s/it] {'loss': 0.0048, 'grad_norm': 1.983436134991586, 'learning_rate': 2.708074534161491e-07, 'completion_length': 61.00000190734863, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.14838216453790665, 'kl': 0.120849609375, 'epoch': 3.65} 73%|███████▎ | 1174/1610 [10:41:48<1:16:07, 10.48s/it] 73%|███████▎ | 1175/1610 [10:41:57<1:12:11, 9.96s/it] {'loss': 0.0051, 'grad_norm': 2.7728482750174654, 'learning_rate': 2.701863354037267e-07, 'completion_length': 67.53571701049805, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.126953125, 'epoch': 3.65} 73%|███████▎ | 1175/1610 [10:41:57<1:12:11, 9.96s/it] 73%|███████▎ | 1176/1610 [10:42:06<1:09:15, 9.57s/it] {'loss': 0.0052, 'grad_norm': 2.135734311497738, 'learning_rate': 2.6956521739130435e-07, 'completion_length': 62.250003814697266, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.1896214708685875, 'kl': 0.12939453125, 'epoch': 3.65} 73%|███████▎ | 1176/1610 [10:42:06<1:09:15, 9.57s/it] 73%|███████▎ | 1177/1610 [10:42:21<1:21:24, 11.28s/it] {'loss': 0.0498, 'grad_norm': 4.120429028517874, 'learning_rate': 2.68944099378882e-07, 'completion_length': 79.89286041259766, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.910714328289032, 'reward': 1.410714328289032, 
'reward_std': 0.1785714402794838, 'kl': 1.24609375, 'epoch': 3.66} 73%|███████▎ | 1177/1610 [10:42:21<1:21:24, 11.28s/it] 73%|███████▎ | 1178/1610 [10:42:30<1:17:39, 10.78s/it] {'loss': 0.0029, 'grad_norm': 1.1978456735391874, 'learning_rate': 2.6832298136645956e-07, 'completion_length': 65.83929061889648, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.07373046875, 'epoch': 3.66} 73%|███████▎ | 1178/1610 [10:42:30<1:17:39, 10.78s/it] 73%|███████▎ | 1179/1610 [10:42:41<1:16:52, 10.70s/it] {'loss': 0.0205, 'grad_norm': 2.539059665057293, 'learning_rate': 2.6770186335403725e-07, 'completion_length': 85.3214340209961, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.5107421875, 'epoch': 3.66} 73%|███████▎ | 1179/1610 [10:42:41<1:16:52, 10.70s/it] 73%|███████▎ | 1180/1610 [10:42:51<1:15:25, 10.52s/it] {'loss': 0.0112, 'grad_norm': 2.372774100510462, 'learning_rate': 2.670807453416149e-07, 'completion_length': 84.23214340209961, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4464285969734192, 'reward_std': 0.30228935927152634, 'kl': 0.27978515625, 'epoch': 3.66} 73%|███████▎ | 1180/1610 [10:42:51<1:15:25, 10.52s/it] 73%|███████▎ | 1181/1610 [10:43:03<1:18:54, 11.04s/it] {'loss': 0.0235, 'grad_norm': 2.488700018897818, 'learning_rate': 2.664596273291925e-07, 'completion_length': 77.37500381469727, 'rewards/accuracy_reward': 0.6250000447034836, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.17098906636238098, 'kl': 0.587890625, 'epoch': 3.67} 73%|███████▎ | 1181/1610 [10:43:03<1:18:54, 11.04s/it] 73%|███████▎ | 1182/1610 [10:43:13<1:15:07, 10.53s/it] {'loss': 0.0222, 'grad_norm': 2.9619301558341364, 'learning_rate': 2.6583850931677015e-07, 'completion_length': 
58.30357360839844, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6607143878936768, 'reward_std': 0.29123930633068085, 'kl': 0.5546875, 'epoch': 3.67} 73%|███████▎ | 1182/1610 [10:43:13<1:15:07, 10.53s/it] 73%|███████▎ | 1183/1610 [10:43:23<1:13:51, 10.38s/it] {'loss': 0.0197, 'grad_norm': 2.0663206825118556, 'learning_rate': 2.6521739130434783e-07, 'completion_length': 69.85714721679688, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535714328289032, 'reward_std': 0.18203910440206528, 'kl': 0.4931640625, 'epoch': 3.67} 73%|███████▎ | 1183/1610 [10:43:23<1:13:51, 10.38s/it] 74%|███████▎ | 1184/1610 [10:43:37<1:21:45, 11.52s/it] {'loss': 0.0445, 'grad_norm': 4.407930302437439, 'learning_rate': 2.6459627329192547e-07, 'completion_length': 75.98214340209961, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.446428656578064, 'reward_std': 0.23086076974868774, 'kl': 1.111328125, 'epoch': 3.68} 74%|███████▎ | 1184/1610 [10:43:37<1:21:45, 11.52s/it] 74%|███████▎ | 1185/1610 [10:43:47<1:19:26, 11.22s/it] {'loss': 0.0069, 'grad_norm': 2.0261586425300417, 'learning_rate': 2.639751552795031e-07, 'completion_length': 88.55357360839844, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.07695359364151955, 'kl': 0.1728515625, 'epoch': 3.68} 74%|███████▎ | 1185/1610 [10:43:47<1:19:26, 11.22s/it] 74%|███████▎ | 1186/1610 [10:43:56<1:13:38, 10.42s/it] {'loss': 0.0118, 'grad_norm': 3.659199341517798, 'learning_rate': 2.6335403726708073e-07, 'completion_length': 68.03571891784668, 'rewards/accuracy_reward': 0.464285746216774, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.27762509882450104, 'kl': 0.2958984375, 'epoch': 3.68} 74%|███████▎ | 1186/1610 [10:43:56<1:13:38, 10.42s/it] 74%|███████▎ | 1187/1610 
[10:44:06<1:12:06, 10.23s/it] {'loss': 0.0033, 'grad_norm': 2.000231258349146, 'learning_rate': 2.6273291925465836e-07, 'completion_length': 66.60714721679688, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.1539071872830391, 'kl': 0.0826416015625, 'epoch': 3.69} 74%|███████▎ | 1187/1610 [10:44:06<1:12:06, 10.23s/it] 74%|███████▍ | 1188/1610 [10:44:14<1:08:30, 9.74s/it] {'loss': 0.0026, 'grad_norm': 1.5863441072991462, 'learning_rate': 2.6211180124223605e-07, 'completion_length': 63.875003814697266, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.0714285746216774, 'kl': 0.0657958984375, 'epoch': 3.69} 74%|███████▍ | 1188/1610 [10:44:14<1:08:30, 9.74s/it] 74%|███████▍ | 1189/1610 [10:44:24<1:07:22, 9.60s/it] {'loss': 0.0089, 'grad_norm': 0.7619992722287028, 'learning_rate': 2.614906832298137e-07, 'completion_length': 68.05357360839844, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142858505249023, 'reward_std': 0.0, 'kl': 0.221923828125, 'epoch': 3.69} 74%|███████▍ | 1189/1610 [10:44:24<1:07:22, 9.60s/it] 74%|███████▍ | 1190/1610 [10:44:33<1:07:14, 9.61s/it] {'loss': 0.01, 'grad_norm': 1.6756140321627084, 'learning_rate': 2.6086956521739126e-07, 'completion_length': 80.0535774230957, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.17098907008767128, 'kl': 0.25048828125, 'epoch': 3.7} 74%|███████▍ | 1190/1610 [10:44:33<1:07:14, 9.61s/it] 74%|███████▍ | 1191/1610 [10:44:42<1:04:40, 9.26s/it] {'loss': 0.0057, 'grad_norm': 1.7364938199038107, 'learning_rate': 2.602484472049689e-07, 'completion_length': 62.96428871154785, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.1539071872830391, 'kl': 0.14306640625, 'epoch': 3.7} 
74%|███████▍ | 1191/1610 [10:44:42<1:04:40, 9.26s/it] 74%|███████▍ | 1192/1610 [10:44:50<1:03:36, 9.13s/it] {'loss': 0.0033, 'grad_norm': 2.437593887000652, 'learning_rate': 2.596273291925466e-07, 'completion_length': 70.51786041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.14838216826319695, 'kl': 0.082763671875, 'epoch': 3.7} 74%|███████▍ | 1192/1610 [10:44:50<1:03:36, 9.13s/it] 74%|███████▍ | 1193/1610 [10:45:04<1:13:36, 10.59s/it] {'loss': 0.0188, 'grad_norm': 2.0776530086466645, 'learning_rate': 2.590062111801242e-07, 'completion_length': 79.89286231994629, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5535714626312256, 'reward_std': 0.2610500454902649, 'kl': 0.4686279296875, 'epoch': 3.7} 74%|███████▍ | 1193/1610 [10:45:04<1:13:36, 10.59s/it] 74%|███████▍ | 1194/1610 [10:45:19<1:22:18, 11.87s/it] {'loss': 0.0141, 'grad_norm': 2.2899437186547003, 'learning_rate': 2.5838509316770184e-07, 'completion_length': 84.83929061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178571939468384, 'reward_std': 0.3495605140924454, 'kl': 0.353515625, 'epoch': 3.71} 74%|███████▍ | 1194/1610 [10:45:19<1:22:18, 11.87s/it] 74%|███████▍ | 1195/1610 [10:45:33<1:26:36, 12.52s/it] {'loss': 0.0124, 'grad_norm': 2.211787447490736, 'learning_rate': 2.577639751552795e-07, 'completion_length': 78.12500381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.2253357470035553, 'kl': 0.310546875, 'epoch': 3.71} 74%|███████▍ | 1195/1610 [10:45:33<1:26:36, 12.52s/it] 74%|███████▍ | 1196/1610 [10:45:44<1:23:07, 12.05s/it] {'loss': 0.0122, 'grad_norm': 1.4720874063614127, 'learning_rate': 2.571428571428571e-07, 'completion_length': 83.10714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 
'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.14838215708732605, 'kl': 0.304443359375, 'epoch': 3.71} 74%|███████▍ | 1196/1610 [10:45:44<1:23:07, 12.05s/it] 74%|███████▍ | 1197/1610 [10:45:54<1:18:21, 11.38s/it] {'loss': 0.0049, 'grad_norm': 2.444543647372491, 'learning_rate': 2.565217391304348e-07, 'completion_length': 75.58929061889648, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.1539071798324585, 'kl': 0.123291015625, 'epoch': 3.72} 74%|███████▍ | 1197/1610 [10:45:54<1:18:21, 11.38s/it] 74%|███████▍ | 1198/1610 [10:46:07<1:20:36, 11.74s/it] {'loss': 0.0081, 'grad_norm': 2.8098617183590138, 'learning_rate': 2.5590062111801243e-07, 'completion_length': 88.4464340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.15943220630288124, 'kl': 0.20361328125, 'epoch': 3.72} 74%|███████▍ | 1198/1610 [10:46:07<1:20:36, 11.74s/it] 74%|███████▍ | 1199/1610 [10:46:16<1:15:32, 11.03s/it] {'loss': 0.0033, 'grad_norm': 1.275652190236347, 'learning_rate': 2.5527950310559006e-07, 'completion_length': 72.33928680419922, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266787722706795, 'kl': 0.081787109375, 'epoch': 3.72} 74%|███████▍ | 1199/1610 [10:46:16<1:15:32, 11.03s/it] 75%|███████▍ | 1200/1610 [10:46:25<1:11:11, 10.42s/it] {'loss': 0.0024, 'grad_norm': 1.130882852840487, 'learning_rate': 2.546583850931677e-07, 'completion_length': 66.46428680419922, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.0611572265625, 'epoch': 3.73} 75%|███████▍ | 1200/1610 [10:46:25<1:11:11, 10.42s/it] 75%|███████▍ | 1201/1610 [10:49:11<6:29:17, 57.11s/it] {'loss': 0.0034, 'grad_norm': 3.6600726356307582, 'learning_rate': 
2.540372670807454e-07, 'completion_length': 59.392860412597656, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.18409645557403564, 'kl': 0.083740234375, 'epoch': 3.73} 75%|███████▍ | 1201/1610 [10:49:11<6:29:17, 57.11s/it] 75%|███████▍ | 1202/1610 [10:49:22<4:53:03, 43.10s/it] {'loss': 0.0058, 'grad_norm': 1.8371000854870254, 'learning_rate': 2.5341614906832296e-07, 'completion_length': 87.39286422729492, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.1785714402794838, 'kl': 0.144775390625, 'epoch': 3.73} 75%|███████▍ | 1202/1610 [10:49:22<4:53:03, 43.10s/it] 75%|███████▍ | 1203/1610 [10:49:36<3:55:06, 34.66s/it] {'loss': 0.0188, 'grad_norm': 2.006297467886213, 'learning_rate': 2.527950310559006e-07, 'completion_length': 93.1785774230957, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5357143878936768, 'reward_std': 0.2253357470035553, 'kl': 0.47021484375, 'epoch': 3.74} 75%|███████▍ | 1203/1610 [10:49:36<3:55:06, 34.66s/it] 75%|███████▍ | 1204/1610 [10:49:47<3:05:45, 27.45s/it] {'loss': 0.0036, 'grad_norm': 1.7135647525984972, 'learning_rate': 2.521739130434782e-07, 'completion_length': 78.21429061889648, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.0908203125, 'epoch': 3.74} 75%|███████▍ | 1204/1610 [10:49:47<3:05:45, 27.45s/it] 75%|███████▍ | 1205/1610 [10:50:00<2:35:14, 23.00s/it] {'loss': 0.0055, 'grad_norm': 1.4860743580469873, 'learning_rate': 2.5155279503105585e-07, 'completion_length': 86.75000381469727, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.1376953125, 'epoch': 3.74} 75%|███████▍ | 1205/1610 [10:50:00<2:35:14, 23.00s/it] 
75%|███████▍ | 1206/1610 [10:50:10<2:08:48, 19.13s/it] {'loss': 0.0043, 'grad_norm': 2.781974846921685, 'learning_rate': 2.5093167701863354e-07, 'completion_length': 83.12500381469727, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.18409645557403564, 'kl': 0.1075439453125, 'epoch': 3.75} 75%|███████▍ | 1206/1610 [10:50:10<2:08:48, 19.13s/it] 75%|███████▍ | 1207/1610 [10:50:20<1:49:28, 16.30s/it] {'loss': 0.0104, 'grad_norm': 2.0147121067037927, 'learning_rate': 2.5031055900621117e-07, 'completion_length': 81.98214721679688, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.26171875, 'epoch': 3.75} 75%|███████▍ | 1207/1610 [10:50:20<1:49:28, 16.30s/it] 75%|███████▌ | 1208/1610 [10:50:30<1:37:46, 14.59s/it] {'loss': 0.0027, 'grad_norm': 3.9199065358714513, 'learning_rate': 2.496894409937888e-07, 'completion_length': 88.98214721679688, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.21981073915958405, 'kl': 0.0675048828125, 'epoch': 3.75} 75%|███████▌ | 1208/1610 [10:50:30<1:37:46, 14.59s/it] 75%|███████▌ | 1209/1610 [10:50:40<1:28:57, 13.31s/it] {'loss': 0.0046, 'grad_norm': 1.6855490852978088, 'learning_rate': 2.4906832298136644e-07, 'completion_length': 82.85714721679688, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.11572265625, 'epoch': 3.75} 75%|███████▌ | 1209/1610 [10:50:40<1:28:57, 13.31s/it] 75%|███████▌ | 1210/1610 [10:50:52<1:24:28, 12.67s/it] {'loss': 0.0027, 'grad_norm': 2.062375736125467, 'learning_rate': 2.4844720496894407e-07, 'completion_length': 91.66072082519531, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 
0.11266787722706795, 'kl': 0.06640625, 'epoch': 3.76} 75%|███████▌ | 1210/1610 [10:50:52<1:24:28, 12.67s/it] 75%|███████▌ | 1211/1610 [10:51:03<1:20:41, 12.13s/it] {'loss': 0.0081, 'grad_norm': 1.4871750569802435, 'learning_rate': 2.4782608695652176e-07, 'completion_length': 89.35714721679688, 'rewards/accuracy_reward': 0.8928571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8750000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.203125, 'epoch': 3.76} 75%|███████▌ | 1211/1610 [10:51:03<1:20:41, 12.13s/it] 75%|███████▌ | 1212/1610 [10:51:11<1:14:10, 11.18s/it] {'loss': 0.002, 'grad_norm': 1.3679615971612789, 'learning_rate': 2.472049689440994e-07, 'completion_length': 70.01786231994629, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1071428656578064, 'kl': 0.0496826171875, 'epoch': 3.76} 75%|███████▌ | 1212/1610 [10:51:11<1:14:10, 11.18s/it] 75%|███████▌ | 1213/1610 [10:51:23<1:14:02, 11.19s/it] {'loss': 0.0032, 'grad_norm': 0.4726960753190523, 'learning_rate': 2.46583850931677e-07, 'completion_length': 78.64286041259766, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.04123930633068085, 'kl': 0.079833984375, 'epoch': 3.77} 75%|███████▌ | 1213/1610 [10:51:23<1:14:02, 11.19s/it] 75%|███████▌ | 1214/1610 [10:51:34<1:13:23, 11.12s/it] {'loss': 0.0032, 'grad_norm': 4.158778084975251, 'learning_rate': 2.4596273291925465e-07, 'completion_length': 86.44643020629883, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.21981074661016464, 'kl': 0.078857421875, 'epoch': 3.77} 75%|███████▌ | 1214/1610 [10:51:34<1:13:23, 11.12s/it] 75%|███████▌ | 1215/1610 [10:51:43<1:10:12, 10.67s/it] {'loss': 0.0029, 'grad_norm': 2.3675481841730175, 'learning_rate': 2.453416149068323e-07, 'completion_length': 78.83929061889648, 
'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1071428656578064, 'kl': 0.072021484375, 'epoch': 3.77} 75%|███████▌ | 1215/1610 [10:51:43<1:10:12, 10.67s/it] 76%|███████▌ | 1216/1610 [10:51:54<1:09:21, 10.56s/it] {'loss': 0.0023, 'grad_norm': 2.9206530556224344, 'learning_rate': 2.447204968944099e-07, 'completion_length': 85.69643020629883, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 1.0, 'reward': 1.4107143878936768, 'reward_std': 0.14838216826319695, 'kl': 0.0584716796875, 'epoch': 3.78} 76%|███████▌ | 1216/1610 [10:51:54<1:09:21, 10.56s/it] 76%|███████▌ | 1217/1610 [10:52:04<1:08:07, 10.40s/it] {'loss': 0.003, 'grad_norm': 1.2712297030122575, 'learning_rate': 2.4409937888198755e-07, 'completion_length': 79.80357360839844, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.0760498046875, 'epoch': 3.78} 76%|███████▌ | 1217/1610 [10:52:04<1:08:07, 10.40s/it] 76%|███████▌ | 1218/1610 [10:52:14<1:07:52, 10.39s/it] {'loss': 0.0101, 'grad_norm': 2.2331372623655916, 'learning_rate': 2.4347826086956524e-07, 'completion_length': 87.87500381469727, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.2142857201397419, 'kl': 0.25146484375, 'epoch': 3.78} 76%|███████▌ | 1218/1610 [10:52:14<1:07:52, 10.39s/it] 76%|███████▌ | 1219/1610 [10:52:25<1:09:26, 10.66s/it] {'loss': 0.0036, 'grad_norm': 2.2958830091654194, 'learning_rate': 2.4285714285714287e-07, 'completion_length': 88.37500381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1896214634180069, 'kl': 0.090087890625, 'epoch': 3.79} 76%|███████▌ | 1219/1610 [10:52:25<1:09:26, 10.66s/it] 76%|███████▌ | 1220/1610 [10:52:35<1:07:14, 10.35s/it] {'loss': 0.0031, 'grad_norm': 
1.8247290342805738, 'learning_rate': 2.422360248447205e-07, 'completion_length': 80.10714721679688, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.1896214708685875, 'kl': 0.07666015625, 'epoch': 3.79} 76%|███████▌ | 1220/1610 [10:52:35<1:07:14, 10.35s/it] 76%|███████▌ | 1221/1610 [10:52:45<1:05:55, 10.17s/it] {'loss': 0.0023, 'grad_norm': 0.8719992677241141, 'learning_rate': 2.4161490683229813e-07, 'completion_length': 84.01786041259766, 'rewards/accuracy_reward': 0.3571428656578064, 'rewards/format_reward': 1.0, 'reward': 1.3571428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.0570068359375, 'epoch': 3.79} 76%|███████▌ | 1221/1610 [10:52:45<1:05:55, 10.17s/it] 76%|███████▌ | 1222/1610 [10:52:55<1:06:18, 10.25s/it] {'loss': 0.0024, 'grad_norm': 1.9240750119033334, 'learning_rate': 2.4099378881987577e-07, 'completion_length': 85.3214340209961, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.18409645557403564, 'kl': 0.061279296875, 'epoch': 3.8} 76%|███████▌ | 1222/1610 [10:52:55<1:06:18, 10.25s/it] 76%|███████▌ | 1223/1610 [10:53:06<1:08:11, 10.57s/it] {'loss': 0.0022, 'grad_norm': 1.1123627996432803, 'learning_rate': 2.403726708074534e-07, 'completion_length': 79.42857360839844, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0714285746216774, 'kl': 0.054931640625, 'epoch': 3.8} 76%|███████▌ | 1223/1610 [10:53:06<1:08:11, 10.57s/it] 76%|███████▌ | 1224/1610 [10:53:17<1:07:27, 10.48s/it] {'loss': 0.0024, 'grad_norm': 1.6137596678475394, 'learning_rate': 2.3975155279503103e-07, 'completion_length': 87.89286041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.14838216826319695, 'kl': 0.0592041015625, 'epoch': 3.8} 76%|███████▌ | 1224/1610 [10:53:17<1:07:27, 
10.48s/it] 76%|███████▌ | 1225/1610 [10:53:28<1:08:40, 10.70s/it] {'loss': 0.0025, 'grad_norm': 1.2236256462494386, 'learning_rate': 2.391304347826087e-07, 'completion_length': 88.87500381469727, 'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.0625, 'epoch': 3.8} 76%|███████▌ | 1225/1610 [10:53:28<1:08:40, 10.70s/it] 76%|███████▌ | 1226/1610 [10:53:38<1:06:37, 10.41s/it] {'loss': 0.0022, 'grad_norm': 1.9489555864070993, 'learning_rate': 2.385093167701863e-07, 'completion_length': 86.0714340209961, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.910714328289032, 'reward_std': 0.1181928962469101, 'kl': 0.0546875, 'epoch': 3.81} 76%|███████▌ | 1226/1610 [10:53:38<1:06:37, 10.41s/it] 76%|███████▌ | 1227/1610 [10:53:49<1:08:32, 10.74s/it] {'loss': 0.0091, 'grad_norm': 2.019988650128239, 'learning_rate': 2.3788819875776398e-07, 'completion_length': 92.8035774230957, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.910714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.22705078125, 'epoch': 3.81} 76%|███████▌ | 1227/1610 [10:53:49<1:08:32, 10.74s/it] 76%|███████▋ | 1228/1610 [10:54:00<1:09:27, 10.91s/it] {'loss': 0.0058, 'grad_norm': 2.086129942160762, 'learning_rate': 2.3726708074534161e-07, 'completion_length': 87.3035774230957, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1896214708685875, 'kl': 0.14501953125, 'epoch': 3.81} 76%|███████▋ | 1228/1610 [10:54:00<1:09:27, 10.91s/it] 76%|███████▋ | 1229/1610 [10:54:10<1:05:54, 10.38s/it] {'loss': 0.0025, 'grad_norm': 1.9153284770856012, 'learning_rate': 2.3664596273291925e-07, 'completion_length': 71.66071701049805, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 
0.1896214634180069, 'kl': 0.0628662109375, 'epoch': 3.82} 76%|███████▋ | 1229/1610 [10:54:10<1:05:54, 10.38s/it] 76%|███████▋ | 1230/1610 [10:54:24<1:13:33, 11.62s/it] {'loss': 0.0164, 'grad_norm': 2.5209802586117207, 'learning_rate': 2.3602484472049688e-07, 'completion_length': 90.83928680419922, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5714285969734192, 'reward_std': 0.27260690927505493, 'kl': 0.407958984375, 'epoch': 3.82} 76%|███████▋ | 1230/1610 [10:54:24<1:13:33, 11.62s/it] 76%|███████▋ | 1231/1610 [10:54:33<1:08:15, 10.81s/it] {'loss': 0.0022, 'grad_norm': 1.202159294790654, 'learning_rate': 2.354037267080745e-07, 'completion_length': 76.91071701049805, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.054443359375, 'epoch': 3.82} 76%|███████▋ | 1231/1610 [10:54:33<1:08:15, 10.81s/it] 77%|███████▋ | 1232/1610 [10:54:43<1:06:10, 10.50s/it] {'loss': 0.003, 'grad_norm': 1.3677683328384835, 'learning_rate': 2.3478260869565217e-07, 'completion_length': 78.42857360839844, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.0714285746216774, 'kl': 0.074462890625, 'epoch': 3.83} 77%|███████▋ | 1232/1610 [10:54:43<1:06:10, 10.50s/it] 77%|███████▋ | 1233/1610 [10:54:54<1:07:22, 10.72s/it] {'loss': 0.0022, 'grad_norm': 1.8151030474399774, 'learning_rate': 2.341614906832298e-07, 'completion_length': 100.46429061889648, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266788095235825, 'kl': 0.055908203125, 'epoch': 3.83} 77%|███████▋ | 1233/1610 [10:54:54<1:07:22, 10.72s/it] 77%|███████▋ | 1234/1610 [10:55:05<1:08:08, 10.87s/it] {'loss': 0.0035, 'grad_norm': 1.7036784435041121, 'learning_rate': 2.3354037267080746e-07, 'completion_length': 106.08929061889648, 
'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.3078143745660782, 'kl': 0.0869140625, 'epoch': 3.83} 77%|███████▋ | 1234/1610 [10:55:05<1:08:08, 10.87s/it] 77%|███████▋ | 1235/1610 [10:55:16<1:08:06, 10.90s/it] {'loss': 0.0042, 'grad_norm': 1.5919195286366403, 'learning_rate': 2.3291925465838507e-07, 'completion_length': 100.87500381469727, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.14838216453790665, 'kl': 0.104248046875, 'epoch': 3.84} 77%|███████▋ | 1235/1610 [10:55:16<1:08:06, 10.90s/it] 77%|███████▋ | 1236/1610 [10:55:28<1:09:25, 11.14s/it] {'loss': 0.0119, 'grad_norm': 1.8914634124690177, 'learning_rate': 2.3229813664596273e-07, 'completion_length': 99.08929061889648, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.22781985998153687, 'kl': 0.2989501953125, 'epoch': 3.84} 77%|███████▋ | 1236/1610 [10:55:28<1:09:25, 11.14s/it] 77%|███████▋ | 1237/1610 [10:55:38<1:07:37, 10.88s/it] {'loss': 0.0043, 'grad_norm': 1.9517460968185494, 'learning_rate': 2.3167701863354036e-07, 'completion_length': 96.83929061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.29123930633068085, 'kl': 0.10693359375, 'epoch': 3.84} 77%|███████▋ | 1237/1610 [10:55:38<1:07:37, 10.88s/it] 77%|███████▋ | 1238/1610 [10:55:48<1:05:53, 10.63s/it] {'loss': 0.0087, 'grad_norm': 1.6772004292147673, 'learning_rate': 2.31055900621118e-07, 'completion_length': 84.9464340209961, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.13981622457504272, 'kl': 0.216064453125, 'epoch': 3.84} 77%|███████▋ | 1238/1610 [10:55:48<1:05:53, 10.63s/it] 77%|███████▋ | 1239/1610 [10:56:03<1:13:20, 11.86s/it] {'loss': 
0.02, 'grad_norm': 2.143428269002827, 'learning_rate': 2.3043478260869565e-07, 'completion_length': 94.53572082519531, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.21981073170900345, 'kl': 0.501708984375, 'epoch': 3.85} 77%|███████▋ | 1239/1610 [10:56:03<1:13:20, 11.86s/it] 77%|███████▋ | 1240/1610 [10:56:13<1:10:17, 11.40s/it] {'loss': 0.0046, 'grad_norm': 2.823055660515307, 'learning_rate': 2.2981366459627326e-07, 'completion_length': 89.8035774230957, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.2142857313156128, 'kl': 0.114990234375, 'epoch': 3.85} 77%|███████▋ | 1240/1610 [10:56:13<1:10:17, 11.40s/it] 77%|███████▋ | 1241/1610 [10:56:28<1:15:33, 12.29s/it] {'loss': 0.0215, 'grad_norm': 1.7548224309562876, 'learning_rate': 2.2919254658385092e-07, 'completion_length': 87.89286041259766, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7500001192092896, 'reward_std': 0.2967643290758133, 'kl': 0.5400390625, 'epoch': 3.85} 77%|███████▋ | 1241/1610 [10:56:28<1:15:33, 12.29s/it] 77%|███████▋ | 1242/1610 [10:56:37<1:10:33, 11.50s/it] {'loss': 0.0025, 'grad_norm': 2.295886346887834, 'learning_rate': 2.2857142857142855e-07, 'completion_length': 84.44643020629883, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.2500000074505806, 'kl': 0.0628662109375, 'epoch': 3.86} 77%|███████▋ | 1242/1610 [10:56:37<1:10:33, 11.50s/it] 77%|███████▋ | 1243/1610 [10:56:53<1:17:25, 12.66s/it] {'loss': 0.0094, 'grad_norm': 2.2944961928226775, 'learning_rate': 2.279503105590062e-07, 'completion_length': 93.48214721679688, 'rewards/accuracy_reward': 0.5535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.24695909023284912, 'kl': 0.23486328125, 
'epoch': 3.86} 77%|███████▋ | 1243/1610 [10:56:53<1:17:25, 12.66s/it] 77%|███████▋ | 1244/1610 [10:57:07<1:19:58, 13.11s/it] {'loss': 0.0254, 'grad_norm': 2.1667720571872104, 'learning_rate': 2.2732919254658384e-07, 'completion_length': 90.91072082519531, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.446428656578064, 'reward_std': 0.23086076974868774, 'kl': 0.637451171875, 'epoch': 3.86} 77%|███████▋ | 1244/1610 [10:57:07<1:19:58, 13.11s/it] 77%|███████▋ | 1245/1610 [10:57:18<1:15:40, 12.44s/it] {'loss': 0.0022, 'grad_norm': 0.986019629953823, 'learning_rate': 2.267080745341615e-07, 'completion_length': 97.39286422729492, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.053955078125, 'epoch': 3.87} 77%|███████▋ | 1245/1610 [10:57:18<1:15:40, 12.44s/it] 77%|███████▋ | 1246/1610 [10:57:28<1:12:16, 11.91s/it] {'loss': 0.0031, 'grad_norm': 2.3593680215928825, 'learning_rate': 2.260869565217391e-07, 'completion_length': 86.57143020629883, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.076904296875, 'epoch': 3.87} 77%|███████▋ | 1246/1610 [10:57:28<1:12:16, 11.91s/it] 77%|███████▋ | 1247/1610 [10:57:39<1:10:05, 11.59s/it] {'loss': 0.0026, 'grad_norm': 3.49220052783408, 'learning_rate': 2.2546583850931674e-07, 'completion_length': 99.39286422729492, 'rewards/accuracy_reward': 0.4821428805589676, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.1896214783191681, 'kl': 0.06640625, 'epoch': 3.87} 77%|███████▋ | 1247/1610 [10:57:39<1:10:05, 11.59s/it] 78%|███████▊ | 1248/1610 [10:57:55<1:17:15, 12.81s/it] {'loss': 0.0158, 'grad_norm': 1.7548811050776028, 'learning_rate': 2.248447204968944e-07, 'completion_length': 117.64286422729492, 'rewards/accuracy_reward': 0.660714328289032, 
'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.2690591663122177, 'kl': 0.396240234375, 'epoch': 3.88} 78%|███████▊ | 1248/1610 [10:57:55<1:17:15, 12.81s/it] 78%|███████▊ | 1249/1610 [10:58:05<1:12:25, 12.04s/it] {'loss': 0.0048, 'grad_norm': 1.7281520450640229, 'learning_rate': 2.2422360248447203e-07, 'completion_length': 91.17857360839844, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1181928962469101, 'kl': 0.1201171875, 'epoch': 3.88} 78%|███████▊ | 1249/1610 [10:58:05<1:12:25, 12.04s/it] 78%|███████▊ | 1250/1610 [10:58:16<1:09:24, 11.57s/it] {'loss': 0.0088, 'grad_norm': 2.4636758824770024, 'learning_rate': 2.236024844720497e-07, 'completion_length': 98.62500381469727, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 1.0, 'reward': 1.571428656578064, 'reward_std': 0.2253357619047165, 'kl': 0.220458984375, 'epoch': 3.88} 78%|███████▊ | 1250/1610 [10:58:16<1:09:24, 11.57s/it] 78%|███████▊ | 1251/1610 [10:58:27<1:08:40, 11.48s/it] {'loss': 0.0093, 'grad_norm': 2.869769297821933, 'learning_rate': 2.2298136645962732e-07, 'completion_length': 94.37500381469727, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.19514648616313934, 'kl': 0.2315673828125, 'epoch': 3.89} 78%|███████▊ | 1251/1610 [10:58:27<1:08:40, 11.48s/it] 78%|███████▊ | 1252/1610 [10:58:36<1:04:57, 10.89s/it] {'loss': 0.0052, 'grad_norm': 1.803097026855725, 'learning_rate': 2.2236024844720495e-07, 'completion_length': 82.28572082519531, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.1298828125, 'epoch': 3.89} 78%|███████▊ | 1252/1610 [10:58:36<1:04:57, 10.89s/it] 78%|███████▊ | 1253/1610 [10:58:52<1:12:59, 12.27s/it] {'loss': 0.0173, 'grad_norm': 1.4330092815860598, 
'learning_rate': 2.217391304347826e-07, 'completion_length': 118.78572082519531, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.20117834210395813, 'kl': 0.433837890625, 'epoch': 3.89} 78%|███████▊ | 1253/1610 [10:58:52<1:12:59, 12.27s/it] 78%|███████▊ | 1254/1610 [10:59:03<1:10:24, 11.87s/it] {'loss': 0.0028, 'grad_norm': 1.265918689706466, 'learning_rate': 2.2111801242236025e-07, 'completion_length': 96.0535774230957, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1071428656578064, 'kl': 0.069091796875, 'epoch': 3.89} 78%|███████▊ | 1254/1610 [10:59:03<1:10:24, 11.87s/it] 78%|███████▊ | 1255/1610 [10:59:15<1:10:42, 11.95s/it] {'loss': 0.0164, 'grad_norm': 1.5252722949735675, 'learning_rate': 2.2049689440993788e-07, 'completion_length': 92.33928680419922, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857909202576, 'reward_std': 0.18409644439816475, 'kl': 0.4091796875, 'epoch': 3.9} 78%|███████▊ | 1255/1610 [10:59:15<1:10:42, 11.95s/it] 78%|███████▊ | 1256/1610 [10:59:30<1:15:42, 12.83s/it] {'loss': 0.0498, 'grad_norm': 3.9584469645251756, 'learning_rate': 2.198757763975155e-07, 'completion_length': 98.28571701049805, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.732142984867096, 'reward_std': 0.0357142873108387, 'kl': 1.2442626953125, 'epoch': 3.9} 78%|███████▊ | 1256/1610 [10:59:30<1:15:42, 12.83s/it] 78%|███████▊ | 1257/1610 [10:59:41<1:11:48, 12.21s/it] {'loss': 0.0056, 'grad_norm': 1.7859137672720702, 'learning_rate': 2.1925465838509317e-07, 'completion_length': 94.62500381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.14534124732017517, 'kl': 0.140380859375, 'epoch': 3.9} 
78%|███████▊ | 1257/1610 [10:59:41<1:11:48, 12.21s/it] 78%|███████▊ | 1258/1610 [11:00:00<1:24:17, 14.37s/it] {'loss': 0.0461, 'grad_norm': 2.1260893421505154, 'learning_rate': 2.1863354037267078e-07, 'completion_length': 104.53572082519531, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5535715222358704, 'reward_std': 0.2393767237663269, 'kl': 1.158203125, 'epoch': 3.91} 78%|███████▊ | 1258/1610 [11:00:00<1:24:17, 14.37s/it] 78%|███████▊ | 1259/1610 [11:00:16<1:27:19, 14.93s/it] {'loss': 0.0116, 'grad_norm': 2.0021474011492297, 'learning_rate': 2.1801242236024844e-07, 'completion_length': 112.14286422729492, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 0.261050034314394, 'kl': 0.290283203125, 'epoch': 3.91} 78%|███████▊ | 1259/1610 [11:00:16<1:27:19, 14.93s/it] 78%|███████▊ | 1260/1610 [11:00:26<1:18:58, 13.54s/it] {'loss': 0.0035, 'grad_norm': 0.9620044701915684, 'learning_rate': 2.1739130434782607e-07, 'completion_length': 95.30357360839844, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.0888671875, 'epoch': 3.91} 78%|███████▊ | 1260/1610 [11:00:26<1:18:58, 13.54s/it] 78%|███████▊ | 1261/1610 [11:00:38<1:15:46, 13.03s/it] {'loss': 0.0351, 'grad_norm': 3.570344906238507, 'learning_rate': 2.1677018633540373e-07, 'completion_length': 97.98214721679688, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.3750000596046448, 'reward_std': 0.3822338730096817, 'kl': 0.880859375, 'epoch': 3.92} 78%|███████▊ | 1261/1610 [11:00:38<1:15:46, 13.03s/it] 78%|███████▊ | 1262/1610 [11:00:49<1:11:32, 12.33s/it] {'loss': 0.0094, 'grad_norm': 2.2540894589083322, 'learning_rate': 2.1614906832298136e-07, 'completion_length': 86.0535774230957, 'rewards/accuracy_reward': 0.5892857611179352, 
'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535714626312256, 'reward_std': 0.14838217198848724, 'kl': 0.235107421875, 'epoch': 3.92} 78%|███████▊ | 1262/1610 [11:00:49<1:11:32, 12.33s/it] 78%|███████▊ | 1263/1610 [11:01:00<1:09:08, 11.96s/it] {'loss': 0.0109, 'grad_norm': 3.3809678539417765, 'learning_rate': 2.1552795031055902e-07, 'completion_length': 89.78572082519531, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.15943220257759094, 'kl': 0.2724609375, 'epoch': 3.92} 78%|███████▊ | 1263/1610 [11:01:00<1:09:08, 11.96s/it] 79%|███████▊ | 1264/1610 [11:01:20<1:22:19, 14.28s/it] {'loss': 0.0234, 'grad_norm': 3.3269007511216317, 'learning_rate': 2.1490683229813662e-07, 'completion_length': 98.4464340209961, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.21981073170900345, 'kl': 0.583984375, 'epoch': 3.93} 79%|███████▊ | 1264/1610 [11:01:20<1:22:19, 14.28s/it] 79%|███████▊ | 1265/1610 [11:01:36<1:25:04, 14.80s/it] {'loss': 0.029, 'grad_norm': 1.2482404075029023, 'learning_rate': 2.1428571428571426e-07, 'completion_length': 107.91071701049805, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857909202576, 'reward_std': 0.2036624401807785, 'kl': 0.722900390625, 'epoch': 3.93} 79%|███████▊ | 1265/1610 [11:01:36<1:25:04, 14.80s/it] 79%|███████▊ | 1266/1610 [11:01:53<1:28:09, 15.38s/it] {'loss': 0.0398, 'grad_norm': 3.6406063777465687, 'learning_rate': 2.1366459627329192e-07, 'completion_length': 113.37500381469727, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.3750000596046448, 'reward_std': 0.1896214708685875, 'kl': 0.998291015625, 'epoch': 3.93} 79%|███████▊ | 1266/1610 [11:01:53<1:28:09, 15.38s/it] 79%|███████▊ | 1267/1610 [11:02:08<1:27:36, 15.33s/it] {'loss': 
0.0102, 'grad_norm': 1.970457614207248, 'learning_rate': 2.1304347826086955e-07, 'completion_length': 101.71429061889648, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.14838215708732605, 'kl': 0.255859375, 'epoch': 3.93} 79%|███████▊ | 1267/1610 [11:02:08<1:27:36, 15.33s/it] 79%|███████▉ | 1268/1610 [11:02:23<1:26:56, 15.25s/it] {'loss': 0.0188, 'grad_norm': 32.99810638595112, 'learning_rate': 2.124223602484472e-07, 'completion_length': 95.3035774230957, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6785715222358704, 'reward_std': 0.21676982194185257, 'kl': 0.470947265625, 'epoch': 3.94} 79%|███████▉ | 1268/1610 [11:02:23<1:26:56, 15.25s/it] 79%|███████▉ | 1269/1610 [11:02:36<1:22:47, 14.57s/it] {'loss': 0.032, 'grad_norm': 3.311268715362234, 'learning_rate': 2.1180124223602484e-07, 'completion_length': 97.50000381469727, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5000000596046448, 'reward_std': 0.26657505333423615, 'kl': 0.80029296875, 'epoch': 3.94} 79%|███████▉ | 1269/1610 [11:02:36<1:22:47, 14.57s/it] 79%|███████▉ | 1270/1610 [11:02:46<1:15:34, 13.34s/it] {'loss': 0.0023, 'grad_norm': 2.7206886375072465, 'learning_rate': 2.1118012422360247e-07, 'completion_length': 83.69643020629883, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.04123930633068085, 'kl': 0.05810546875, 'epoch': 3.94} 79%|███████▉ | 1270/1610 [11:02:46<1:15:34, 13.34s/it] 79%|███████▉ | 1271/1610 [11:02:59<1:14:29, 13.19s/it] {'loss': 0.0093, 'grad_norm': 1.6124539395543873, 'learning_rate': 2.105590062111801e-07, 'completion_length': 108.37500381469727, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.1785714402794838, 'kl': 
0.233642578125, 'epoch': 3.95} 79%|███████▉ | 1271/1610 [11:02:59<1:14:29, 13.19s/it] 79%|███████▉ | 1272/1610 [11:03:14<1:17:16, 13.72s/it] {'loss': 0.0195, 'grad_norm': 2.5380019548148964, 'learning_rate': 2.0993788819875776e-07, 'completion_length': 99.67857360839844, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.20117833837866783, 'kl': 0.486328125, 'epoch': 3.95} 79%|███████▉ | 1272/1610 [11:03:14<1:17:16, 13.72s/it] 79%|███████▉ | 1273/1610 [11:03:27<1:15:23, 13.42s/it] {'loss': 0.013, 'grad_norm': 1.6373708819870094, 'learning_rate': 2.093167701863354e-07, 'completion_length': 99.96428680419922, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.2253357470035553, 'kl': 0.32568359375, 'epoch': 3.95} 79%|███████▉ | 1273/1610 [11:03:27<1:15:23, 13.42s/it] 79%|███████▉ | 1274/1610 [11:03:42<1:18:55, 14.09s/it] {'loss': 0.0191, 'grad_norm': 1.5982674810535815, 'learning_rate': 2.0869565217391303e-07, 'completion_length': 118.14286041259766, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6785715222358704, 'reward_std': 0.18409645557403564, 'kl': 0.47607421875, 'epoch': 3.96} 79%|███████▉ | 1274/1610 [11:03:42<1:18:55, 14.09s/it] 79%|███████▉ | 1275/1610 [11:03:53<1:13:28, 13.16s/it] {'loss': 0.0022, 'grad_norm': 1.1250190331089855, 'learning_rate': 2.080745341614907e-07, 'completion_length': 94.08929061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.1071428656578064, 'kl': 0.0557861328125, 'epoch': 3.96} 79%|███████▉ | 1275/1610 [11:03:53<1:13:28, 13.16s/it] 79%|███████▉ | 1276/1610 [11:04:04<1:08:31, 12.31s/it] {'loss': 0.0081, 'grad_norm': 2.1758938173749907, 'learning_rate': 2.074534161490683e-07, 'completion_length': 84.82143020629883, 
'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.1428571492433548, 'kl': 0.20263671875, 'epoch': 3.96} 79%|███████▉ | 1276/1610 [11:04:04<1:08:31, 12.31s/it] 79%|███████▉ | 1277/1610 [11:04:19<1:13:17, 13.21s/it] {'loss': 0.0193, 'grad_norm': 1.360066136617007, 'learning_rate': 2.0683229813664595e-07, 'completion_length': 95.41071701049805, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.48193359375, 'epoch': 3.97} 79%|███████▉ | 1277/1610 [11:04:19<1:13:17, 13.21s/it] 79%|███████▉ | 1278/1610 [11:04:31<1:10:25, 12.73s/it] {'loss': 0.0303, 'grad_norm': 4.512861016698354, 'learning_rate': 2.0621118012422359e-07, 'completion_length': 91.96429061889648, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178572535514832, 'reward_std': 0.29521381482481956, 'kl': 0.755859375, 'epoch': 3.97} 79%|███████▉ | 1278/1610 [11:04:31<1:10:25, 12.73s/it] 79%|███████▉ | 1279/1610 [11:04:40<1:05:07, 11.81s/it] {'loss': 0.0083, 'grad_norm': 1.9687778924295958, 'learning_rate': 2.0559006211180125e-07, 'completion_length': 85.87500381469727, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.2253357619047165, 'kl': 0.2059326171875, 'epoch': 3.97} 79%|███████▉ | 1279/1610 [11:04:40<1:05:07, 11.81s/it] 80%|███████▉ | 1280/1610 [11:05:00<1:17:41, 14.13s/it] {'loss': 0.0444, 'grad_norm': 2.796811787521982, 'learning_rate': 2.0496894409937888e-07, 'completion_length': 112.76786422729492, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.910714328289032, 'reward': 1.660714328289032, 'reward_std': 0.3822338730096817, 'kl': 1.111328125, 'epoch': 3.98} 80%|███████▉ | 1280/1610 [11:05:00<1:17:41, 14.13s/it] 80%|███████▉ | 1281/1610 
[11:05:14<1:17:38, 14.16s/it] {'loss': 0.0638, 'grad_norm': 3.174326286492456, 'learning_rate': 2.0434782608695654e-07, 'completion_length': 111.42857360839844, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.8571428954601288, 'reward': 1.446428656578064, 'reward_std': 0.4450965076684952, 'kl': 1.59765625, 'epoch': 3.98} 80%|███████▉ | 1281/1610 [11:05:14<1:17:38, 14.16s/it] 80%|███████▉ | 1282/1610 [11:05:24<1:11:13, 13.03s/it] {'loss': 0.0048, 'grad_norm': 0.8177272388970175, 'learning_rate': 2.0372670807453414e-07, 'completion_length': 90.03571701049805, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.11266787722706795, 'kl': 0.118896484375, 'epoch': 3.98} 80%|███████▉ | 1282/1610 [11:05:24<1:11:13, 13.03s/it] 80%|███████▉ | 1283/1610 [11:05:38<1:12:08, 13.24s/it] {'loss': 0.0168, 'grad_norm': 2.1302986841163465, 'learning_rate': 2.0310559006211178e-07, 'completion_length': 104.10714721679688, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.785714328289032, 'reward_std': 0.17553052306175232, 'kl': 0.4208984375, 'epoch': 3.98} 80%|███████▉ | 1283/1610 [11:05:38<1:12:08, 13.24s/it] 80%|███████▉ | 1284/1610 [11:05:49<1:08:34, 12.62s/it] {'loss': 0.0151, 'grad_norm': 3.275569837475392, 'learning_rate': 2.0248447204968943e-07, 'completion_length': 103.1785774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6250000596046448, 'reward_std': 0.3324786126613617, 'kl': 0.37890625, 'epoch': 3.99} 80%|███████▉ | 1284/1610 [11:05:49<1:08:34, 12.62s/it] 80%|███████▉ | 1285/1610 [11:06:05<1:12:50, 13.45s/it] {'loss': 0.0385, 'grad_norm': 2.6578649041114675, 'learning_rate': 2.0186335403726707e-07, 'completion_length': 100.21429061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178572535514832, 
'reward_std': 0.25248411297798157, 'kl': 0.9638671875, 'epoch': 3.99} 80%|███████▉ | 1285/1610 [11:06:05<1:12:50, 13.45s/it] 80%|███████▉ | 1286/1610 [11:06:24<1:21:55, 15.17s/it] {'loss': 0.0468, 'grad_norm': 2.978763343576752, 'learning_rate': 2.0124223602484473e-07, 'completion_length': 124.6964340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5000001192092896, 'reward_std': 0.2967643216252327, 'kl': 1.16796875, 'epoch': 3.99} 80%|███████▉ | 1286/1610 [11:06:24<1:21:55, 15.17s/it] 80%|███████▉ | 1287/1610 [11:06:40<1:23:00, 15.42s/it] {'loss': 0.0844, 'grad_norm': 4.492489808982987, 'learning_rate': 2.0062111801242236e-07, 'completion_length': 100.3035774230957, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.3750000596046448, 'reward_std': 0.34099456667900085, 'kl': 2.11328125, 'epoch': 4.0} 80%|███████▉ | 1287/1610 [11:06:40<1:23:00, 15.42s/it] 80%|████████ | 1288/1610 [11:06:50<1:14:43, 13.92s/it] {'loss': 0.0154, 'grad_norm': 1.7356009835762876, 'learning_rate': 2e-07, 'completion_length': 83.01786041259766, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7500000596046448, 'reward_std': 0.12974976375699043, 'kl': 0.3837890625, 'epoch': 4.0} 80%|████████ | 1288/1610 [11:06:50<1:14:43, 13.92s/it] 80%|████████ | 1289/1610 [11:07:01<1:09:03, 12.91s/it] {'loss': 0.0067, 'grad_norm': 2.8176065743159207, 'learning_rate': 1.9937888198757762e-07, 'completion_length': 90.23214721679688, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.785714328289032, 'reward_std': 0.20117833465337753, 'kl': 0.168701171875, 'epoch': 4.0} 80%|████████ | 1289/1610 [11:07:01<1:09:03, 12.91s/it] 80%|████████ | 1290/1610 [11:07:11<1:04:06, 12.02s/it] {'loss': 0.0111, 'grad_norm': 1.2284970168816327, 'learning_rate': 1.9875776397515526e-07, 'completion_length': 
78.1964340209961, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7321429252624512, 'reward_std': 0.1071428656578064, 'kl': 0.2783203125, 'epoch': 4.01} 80%|████████ | 1290/1610 [11:07:11<1:04:06, 12.02s/it] 80%|████████ | 1291/1610 [11:07:22<1:02:50, 11.82s/it] {'loss': 0.0022, 'grad_norm': 1.2124769189938969, 'learning_rate': 1.9813664596273292e-07, 'completion_length': 84.87500381469727, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.0714285746216774, 'kl': 0.053955078125, 'epoch': 4.01} 80%|████████ | 1291/1610 [11:07:22<1:02:50, 11.82s/it] 80%|████████ | 1292/1610 [11:07:33<1:00:30, 11.42s/it] {'loss': 0.0096, 'grad_norm': 1.7396748343546171, 'learning_rate': 1.9751552795031055e-07, 'completion_length': 89.14286041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.1071428619325161, 'kl': 0.2391357421875, 'epoch': 4.01} 80%|████████ | 1292/1610 [11:07:33<1:00:30, 11.42s/it] 80%|████████ | 1293/1610 [11:07:48<1:05:55, 12.48s/it] {'loss': 0.0285, 'grad_norm': 1.578065074531352, 'learning_rate': 1.968944099378882e-07, 'completion_length': 99.6785774230957, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535714626312256, 'reward_std': 0.1181928999722004, 'kl': 0.710205078125, 'epoch': 4.02} 80%|████████ | 1293/1610 [11:07:48<1:05:55, 12.48s/it] 80%|████████ | 1294/1610 [11:07:57<1:01:13, 11.63s/it] {'loss': 0.0114, 'grad_norm': 1.892803728444226, 'learning_rate': 1.962732919254658e-07, 'completion_length': 71.85714721679688, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.535714328289032, 'reward_std': 0.2534676790237427, 'kl': 0.284423828125, 'epoch': 4.02} 80%|████████ | 1294/1610 [11:07:57<1:01:13, 11.63s/it] 80%|████████ | 
1295/1610 [11:08:10<1:02:32, 11.91s/it] {'loss': 0.0348, 'grad_norm': 2.6596021503520104, 'learning_rate': 1.9565217391304347e-07, 'completion_length': 94.16071701049805, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6250000596046448, 'reward_std': 0.2937234044075012, 'kl': 0.87109375, 'epoch': 4.02} 80%|████████ | 1295/1610 [11:08:10<1:02:32, 11.91s/it] 80%|████████ | 1296/1610 [11:08:21<1:00:30, 11.56s/it] {'loss': 0.0243, 'grad_norm': 2.281222000641509, 'learning_rate': 1.950310559006211e-07, 'completion_length': 82.03571701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.571428656578064, 'reward_std': 0.17098906636238098, 'kl': 0.6103515625, 'epoch': 4.02} 80%|████████ | 1296/1610 [11:08:21<1:00:30, 11.56s/it] 81%|████████ | 1297/1610 [11:08:39<1:10:48, 13.57s/it] {'loss': 0.0435, 'grad_norm': 2.3036388116028808, 'learning_rate': 1.9440993788819876e-07, 'completion_length': 114.76786422729492, 'rewards/accuracy_reward': 0.6785714477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6428571939468384, 'reward_std': 0.17553051561117172, 'kl': 1.087890625, 'epoch': 4.03} 81%|████████ | 1297/1610 [11:08:39<1:10:48, 13.57s/it] 81%|████████ | 1298/1610 [11:08:49<1:04:42, 12.45s/it] {'loss': 0.0256, 'grad_norm': 3.1123490614036453, 'learning_rate': 1.937888198757764e-07, 'completion_length': 81.5535774230957, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.571428656578064, 'reward_std': 0.26657508313655853, 'kl': 0.638671875, 'epoch': 4.03} 81%|████████ | 1298/1610 [11:08:49<1:04:42, 12.45s/it] 81%|████████ | 1299/1610 [11:09:09<1:16:10, 14.70s/it] {'loss': 0.0449, 'grad_norm': 2.8696505973832003, 'learning_rate': 1.9316770186335403e-07, 'completion_length': 118.33929443359375, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.8928571939468384, 'reward': 
1.5178571939468384, 'reward_std': 0.21981074661016464, 'kl': 1.1171875, 'epoch': 4.03} 81%|████████ | 1299/1610 [11:09:09<1:16:10, 14.70s/it] 81%|████████ | 1300/1610 [11:09:20<1:10:00, 13.55s/it] {'loss': 0.0161, 'grad_norm': 4.836168635409376, 'learning_rate': 1.9254658385093166e-07, 'completion_length': 104.1785774230957, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6071429252624512, 'reward_std': 0.25552502274513245, 'kl': 0.40478515625, 'epoch': 4.04} 81%|████████ | 1300/1610 [11:09:20<1:10:00, 13.55s/it] 81%|████████ | 1301/1610 [11:12:14<5:18:41, 61.88s/it] {'loss': 0.0088, 'grad_norm': 1.6609330553753232, 'learning_rate': 1.919254658385093e-07, 'completion_length': 81.51786041259766, 'rewards/accuracy_reward': 0.8571428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8392857313156128, 'reward_std': 0.18105553090572357, 'kl': 0.220458984375, 'epoch': 4.04} 81%|████████ | 1301/1610 [11:12:14<5:18:41, 61.88s/it] 81%|████████ | 1302/1610 [11:12:24<3:57:32, 46.27s/it] {'loss': 0.029, 'grad_norm': 2.802005335192752, 'learning_rate': 1.9130434782608695e-07, 'completion_length': 82.8214340209961, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5535715222358704, 'reward_std': 0.25248410552740097, 'kl': 0.7265625, 'epoch': 4.04} 81%|████████ | 1302/1610 [11:12:24<3:57:32, 46.27s/it] 81%|████████ | 1303/1610 [11:12:39<3:09:03, 36.95s/it] {'loss': 0.0371, 'grad_norm': 2.4239147073671785, 'learning_rate': 1.9068322981366459e-07, 'completion_length': 96.53572082519531, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.7321429252624512, 'reward_std': 0.18105553090572357, 'kl': 0.9296875, 'epoch': 4.05} 81%|████████ | 1303/1610 [11:12:39<3:09:03, 36.95s/it] 81%|████████ | 1304/1610 [11:12:52<2:30:56, 29.60s/it] {'loss': 0.0214, 'grad_norm': 2.435374013495923, 'learning_rate': 
1.9006211180124224e-07, 'completion_length': 88.53571701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250000596046448, 'reward_std': 0.14838216826319695, 'kl': 0.53515625, 'epoch': 4.05} 81%|████████ | 1304/1610 [11:12:52<2:30:56, 29.60s/it] 81%|████████ | 1305/1610 [11:13:08<2:10:40, 25.71s/it] {'loss': 0.0425, 'grad_norm': 2.381556276039884, 'learning_rate': 1.8944099378881988e-07, 'completion_length': 116.01786422729492, 'rewards/accuracy_reward': 0.3392857313156128, 'rewards/format_reward': 0.910714328289032, 'reward': 1.2500000596046448, 'reward_std': 0.32695358991622925, 'kl': 1.064453125, 'epoch': 4.05} 81%|████████ | 1305/1610 [11:13:08<2:10:40, 25.71s/it] 81%|████████ | 1306/1610 [11:13:18<1:46:17, 20.98s/it] {'loss': 0.004, 'grad_norm': 1.6533632736392556, 'learning_rate': 1.888198757763975e-07, 'completion_length': 78.66071701049805, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.1181928962469101, 'kl': 0.0987548828125, 'epoch': 4.06} 81%|████████ | 1306/1610 [11:13:18<1:46:17, 20.98s/it] 81%|████████ | 1307/1610 [11:13:28<1:28:15, 17.48s/it] {'loss': 0.0137, 'grad_norm': 1.9251999653411571, 'learning_rate': 1.8819875776397514e-07, 'completion_length': 84.91072082519531, 'rewards/accuracy_reward': 0.6785714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.1785714402794838, 'kl': 0.341552734375, 'epoch': 4.06} 81%|████████ | 1307/1610 [11:13:28<1:28:15, 17.48s/it] 81%|████████ | 1308/1610 [11:13:38<1:17:56, 15.48s/it] {'loss': 0.0029, 'grad_norm': 1.9602316606329728, 'learning_rate': 1.8757763975155277e-07, 'completion_length': 79.10714721679688, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1539071798324585, 'kl': 0.07177734375, 'epoch': 4.06} 81%|████████ | 1308/1610 [11:13:38<1:17:56, 
15.48s/it] 81%|████████▏ | 1309/1610 [11:13:50<1:12:36, 14.48s/it] {'loss': 0.0114, 'grad_norm': 1.10249744661381, 'learning_rate': 1.8695652173913043e-07, 'completion_length': 94.83928680419922, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.0714285746216774, 'kl': 0.2845458984375, 'epoch': 4.07} 81%|████████▏ | 1309/1610 [11:13:51<1:12:36, 14.48s/it] 81%|████████▏ | 1310/1610 [11:14:06<1:13:13, 14.65s/it] {'loss': 0.0281, 'grad_norm': 1.9088635637083324, 'learning_rate': 1.8633540372670807e-07, 'completion_length': 99.96429061889648, 'rewards/accuracy_reward': 0.6071428805589676, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.571428656578064, 'reward_std': 0.2253357619047165, 'kl': 0.7001953125, 'epoch': 4.07} 81%|████████▏ | 1310/1610 [11:14:06<1:13:13, 14.65s/it] 81%|████████▏ | 1311/1610 [11:14:24<1:18:43, 15.80s/it] {'loss': 0.0373, 'grad_norm': 2.527360399259366, 'learning_rate': 1.8571428571428572e-07, 'completion_length': 107.64286422729492, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.892857164144516, 'reward': 1.4107143878936768, 'reward_std': 0.1896214671432972, 'kl': 0.9317626953125, 'epoch': 4.07} 81%|████████▏ | 1311/1610 [11:14:24<1:18:43, 15.80s/it] 81%|████████▏ | 1312/1610 [11:14:35<1:11:31, 14.40s/it] {'loss': 0.0117, 'grad_norm': 1.376188710039067, 'learning_rate': 1.8509316770186333e-07, 'completion_length': 82.9464340209961, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857909202576, 'reward_std': 0.16242313385009766, 'kl': 0.292724609375, 'epoch': 4.07} 81%|████████▏ | 1312/1610 [11:14:35<1:11:31, 14.40s/it] 82%|████████▏ | 1313/1610 [11:14:48<1:08:40, 13.87s/it] {'loss': 0.0458, 'grad_norm': 2.315804568478023, 'learning_rate': 1.84472049689441e-07, 'completion_length': 96.5535774230957, 'rewards/accuracy_reward': 0.5714285969734192, 
'rewards/format_reward': 0.8928571939468384, 'reward': 1.4642857313156128, 'reward_std': 0.3007388263940811, 'kl': 1.140625, 'epoch': 4.08} 82%|████████▏ | 1313/1610 [11:14:48<1:08:40, 13.87s/it] 82%|████████▏ | 1314/1610 [11:14:58<1:03:16, 12.83s/it] {'loss': 0.004, 'grad_norm': 1.8798460091943212, 'learning_rate': 1.8385093167701862e-07, 'completion_length': 95.1785774230957, 'rewards/accuracy_reward': 0.8750000298023224, 'rewards/format_reward': 1.0, 'reward': 1.8750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.09912109375, 'epoch': 4.08} 82%|████████▏ | 1314/1610 [11:14:58<1:03:16, 12.83s/it] 82%|████████▏ | 1315/1610 [11:15:16<1:10:41, 14.38s/it] {'loss': 0.0485, 'grad_norm': 11.542768802625183, 'learning_rate': 1.8322981366459628e-07, 'completion_length': 108.33929061889648, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5000000596046448, 'reward_std': 0.3294376879930496, 'kl': 1.2109375, 'epoch': 4.08} 82%|████████▏ | 1315/1610 [11:15:16<1:10:41, 14.38s/it] 82%|████████▏ | 1316/1610 [11:15:27<1:04:39, 13.20s/it] {'loss': 0.015, 'grad_norm': 2.366988331806244, 'learning_rate': 1.8260869565217391e-07, 'completion_length': 90.21429061889648, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.1428571492433548, 'kl': 0.37548828125, 'epoch': 4.09} 82%|████████▏ | 1316/1610 [11:15:27<1:04:39, 13.20s/it] 82%|████████▏ | 1317/1610 [11:15:40<1:04:17, 13.16s/it] {'loss': 0.0125, 'grad_norm': 2.3943246531584323, 'learning_rate': 1.8198757763975152e-07, 'completion_length': 92.01786422729492, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.29123930633068085, 'kl': 0.3134765625, 'epoch': 4.09} 82%|████████▏ | 1317/1610 [11:15:40<1:04:17, 13.16s/it] 82%|████████▏ | 1318/1610 [11:15:51<1:01:08, 12.56s/it] {'loss': 0.0261, 'grad_norm': 
3.4191802113709637, 'learning_rate': 1.8136645962732918e-07, 'completion_length': 86.05357360839844, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.3032229393720627, 'kl': 0.65234375, 'epoch': 4.09} 82%|████████▏ | 1318/1610 [11:15:51<1:01:08, 12.56s/it] 82%|████████▏ | 1319/1610 [11:16:07<1:06:28, 13.70s/it] {'loss': 0.0205, 'grad_norm': 1.981519970187102, 'learning_rate': 1.807453416149068e-07, 'completion_length': 104.1785774230957, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5178572535514832, 'reward_std': 0.1896214708685875, 'kl': 0.513671875, 'epoch': 4.1} 82%|████████▏ | 1319/1610 [11:16:07<1:06:28, 13.70s/it] 82%|████████▏ | 1320/1610 [11:16:16<59:45, 12.36s/it] {'loss': 0.0102, 'grad_norm': 1.9481233718733744, 'learning_rate': 1.8012422360248447e-07, 'completion_length': 83.16071701049805, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535714626312256, 'reward_std': 0.21981072798371315, 'kl': 0.2548828125, 'epoch': 4.1} 82%|████████▏ | 1320/1610 [11:16:16<59:45, 12.36s/it] 82%|████████▏ | 1321/1610 [11:16:32<1:04:37, 13.42s/it] {'loss': 0.0412, 'grad_norm': 3.4672890276008004, 'learning_rate': 1.795031055900621e-07, 'completion_length': 97.5714340209961, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4821429252624512, 'reward_std': 0.2006715089082718, 'kl': 1.02734375, 'epoch': 4.1} 82%|████████▏ | 1321/1610 [11:16:32<1:04:37, 13.42s/it] 82%|████████▏ | 1322/1610 [11:16:43<59:44, 12.45s/it] {'loss': 0.0103, 'grad_norm': 3.1151736263438163, 'learning_rate': 1.7888198757763976e-07, 'completion_length': 89.96428680419922, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.14534125104546547, 'kl': 0.2578125, 
'epoch': 4.11} 82%|████████▏ | 1322/1610 [11:16:43<59:44, 12.45s/it] 82%|████████▏ | 1323/1610 [11:16:57<1:02:36, 13.09s/it] {'loss': 0.0266, 'grad_norm': 1.6698095091086653, 'learning_rate': 1.7826086956521737e-07, 'completion_length': 90.75000381469727, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.1785714365541935, 'kl': 0.662109375, 'epoch': 4.11} 82%|████████▏ | 1323/1610 [11:16:57<1:02:36, 13.09s/it] 82%|████████▏ | 1324/1610 [11:17:12<1:04:15, 13.48s/it] {'loss': 0.032, 'grad_norm': 2.6980590895779963, 'learning_rate': 1.7763975155279503e-07, 'completion_length': 92.71429061889648, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4642857313156128, 'reward_std': 0.24695909023284912, 'kl': 0.7998046875, 'epoch': 4.11} 82%|████████▏ | 1324/1610 [11:17:12<1:04:15, 13.48s/it] 82%|████████▏ | 1325/1610 [11:17:24<1:02:39, 13.19s/it] {'loss': 0.0254, 'grad_norm': 1.8133399833844412, 'learning_rate': 1.7701863354037266e-07, 'completion_length': 90.9464340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.2500000074505806, 'kl': 0.63623046875, 'epoch': 4.11} 82%|████████▏ | 1325/1610 [11:17:24<1:02:39, 13.19s/it] 82%|████████▏ | 1326/1610 [11:17:38<1:04:02, 13.53s/it] {'loss': 0.0719, 'grad_norm': 4.172038569785155, 'learning_rate': 1.763975155279503e-07, 'completion_length': 100.39286422729492, 'rewards/accuracy_reward': 0.4107143133878708, 'rewards/format_reward': 0.8214285969734192, 'reward': 1.2321429252624512, 'reward_std': 0.1896214634180069, 'kl': 1.794921875, 'epoch': 4.12} 82%|████████▏ | 1326/1610 [11:17:38<1:04:02, 13.53s/it] 82%|████████▏ | 1327/1610 [11:17:49<59:19, 12.58s/it] {'loss': 0.0079, 'grad_norm': 1.8501536115563433, 'learning_rate': 1.7577639751552795e-07, 'completion_length': 86.5535774230957, 
'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7857143878936768, 'reward_std': 0.18409645557403564, 'kl': 0.1971435546875, 'epoch': 4.12} 82%|████████▏ | 1327/1610 [11:17:49<59:19, 12.58s/it] 82%|████████▏ | 1328/1610 [11:17:58<55:00, 11.70s/it] {'loss': 0.0099, 'grad_norm': 2.1888226113397757, 'learning_rate': 1.7515527950310558e-07, 'completion_length': 88.5535774230957, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.607142984867096, 'reward_std': 0.18409645557403564, 'kl': 0.248291015625, 'epoch': 4.12} 82%|████████▏ | 1328/1610 [11:17:58<55:00, 11.70s/it] 83%|████████▎ | 1329/1610 [11:18:08<52:18, 11.17s/it] {'loss': 0.0136, 'grad_norm': 3.6368673456656118, 'learning_rate': 1.7453416149068322e-07, 'completion_length': 84.39286041259766, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.365151971578598, 'kl': 0.33984375, 'epoch': 4.13} 83%|████████▎ | 1329/1610 [11:18:08<52:18, 11.17s/it] 83%|████████▎ | 1330/1610 [11:18:21<54:19, 11.64s/it] {'loss': 0.0114, 'grad_norm': 2.3375805173651187, 'learning_rate': 1.7391304347826085e-07, 'completion_length': 106.73214721679688, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.21981074661016464, 'kl': 0.285400390625, 'epoch': 4.13} 83%|████████▎ | 1330/1610 [11:18:21<54:19, 11.64s/it] 83%|████████▎ | 1331/1610 [11:18:32<53:01, 11.40s/it] {'loss': 0.0246, 'grad_norm': 3.213173606410736, 'learning_rate': 1.732919254658385e-07, 'completion_length': 85.9285774230957, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5357143878936768, 'reward_std': 0.3681929111480713, 'kl': 0.615234375, 'epoch': 4.13} 83%|████████▎ | 1331/1610 [11:18:32<53:01, 11.40s/it] 83%|████████▎ | 1332/1610 
[11:18:42<50:52, 10.98s/it] {'loss': 0.0154, 'grad_norm': 2.8636271349534943, 'learning_rate': 1.7267080745341614e-07, 'completion_length': 87.64286422729492, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.17098907008767128, 'kl': 0.38525390625, 'epoch': 4.14} 83%|████████▎ | 1332/1610 [11:18:42<50:52, 10.98s/it] 83%|████████▎ | 1333/1610 [11:18:52<49:51, 10.80s/it] {'loss': 0.0079, 'grad_norm': 6.901042045554203, 'learning_rate': 1.720496894409938e-07, 'completion_length': 87.51786422729492, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.1539071872830391, 'kl': 0.19677734375, 'epoch': 4.14} 83%|████████▎ | 1333/1610 [11:18:52<49:51, 10.80s/it] 83%|████████▎ | 1334/1610 [11:19:08<56:01, 12.18s/it] {'loss': 0.0429, 'grad_norm': 2.544574592957148, 'learning_rate': 1.7142857142857143e-07, 'completion_length': 90.3035774230957, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3214285969734192, 'reward_std': 0.240360289812088, 'kl': 1.07421875, 'epoch': 4.14} 83%|████████▎ | 1334/1610 [11:19:08<56:01, 12.18s/it] 83%|████████▎ | 1335/1610 [11:19:22<58:39, 12.80s/it] {'loss': 0.0321, 'grad_norm': 5.257444630147581, 'learning_rate': 1.7080745341614904e-07, 'completion_length': 81.85714721679688, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6785714626312256, 'reward_std': 0.18409645557403564, 'kl': 0.802734375, 'epoch': 4.15} 83%|████████▎ | 1335/1610 [11:19:22<58:39, 12.80s/it] 83%|████████▎ | 1336/1610 [11:19:32<54:10, 11.86s/it] {'loss': 0.0268, 'grad_norm': 2.5384353340490167, 'learning_rate': 1.701863354037267e-07, 'completion_length': 79.48214721679688, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.4285715222358704, 
'reward_std': 0.30173254013061523, 'kl': 0.672119140625, 'epoch': 4.15} 83%|████████▎ | 1336/1610 [11:19:32<54:10, 11.86s/it] 83%|████████▎ | 1337/1610 [11:19:48<59:40, 13.12s/it] {'loss': 0.0558, 'grad_norm': 2.511842197354715, 'learning_rate': 1.6956521739130433e-07, 'completion_length': 100.48214721679688, 'rewards/accuracy_reward': 0.4285714477300644, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3392857909202576, 'reward_std': 0.32391269505023956, 'kl': 1.39453125, 'epoch': 4.15} 83%|████████▎ | 1337/1610 [11:19:48<59:40, 13.12s/it] 83%|████████▎ | 1338/1610 [11:20:00<59:02, 13.02s/it] {'loss': 0.024, 'grad_norm': 2.985477831537463, 'learning_rate': 1.68944099378882e-07, 'completion_length': 95.41072082519531, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6785714626312256, 'reward_std': 0.2967643141746521, 'kl': 0.6005859375, 'epoch': 4.16} 83%|████████▎ | 1338/1610 [11:20:00<59:02, 13.02s/it] 83%|████████▎ | 1339/1610 [11:20:16<1:01:50, 13.69s/it] {'loss': 0.0426, 'grad_norm': 3.1667129515697683, 'learning_rate': 1.6832298136645962e-07, 'completion_length': 109.64286422729492, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6071429252624512, 'reward_std': 0.1539071835577488, 'kl': 1.060546875, 'epoch': 4.16} 83%|████████▎ | 1339/1610 [11:20:16<1:01:50, 13.69s/it] 83%|████████▎ | 1340/1610 [11:20:28<59:41, 13.27s/it] {'loss': 0.022, 'grad_norm': 3.313088570023216, 'learning_rate': 1.6770186335403728e-07, 'completion_length': 84.57143020629883, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857909202576, 'reward_std': 0.2142857313156128, 'kl': 0.5498046875, 'epoch': 4.16} 83%|████████▎ | 1340/1610 [11:20:28<59:41, 13.27s/it] 83%|████████▎ | 1341/1610 [11:20:43<1:02:01, 13.83s/it] {'loss': 0.0441, 'grad_norm': 3.519822537550145, 'learning_rate': 1.6708074534161489e-07, 
'completion_length': 100.83929061889648, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3750000596046448, 'reward_std': 0.3651519864797592, 'kl': 1.1015625, 'epoch': 4.16} 83%|████████▎ | 1341/1610 [11:20:43<1:02:01, 13.83s/it] 83%|████████▎ | 1342/1610 [11:20:54<58:24, 13.08s/it] {'loss': 0.026, 'grad_norm': 2.092390287528405, 'learning_rate': 1.6645962732919252e-07, 'completion_length': 90.4464340209961, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6607143878936768, 'reward_std': 0.32391269505023956, 'kl': 0.65234375, 'epoch': 4.17} 83%|████████▎ | 1342/1610 [11:20:54<58:24, 13.08s/it] 83%|████████▎ | 1343/1610 [11:21:04<54:01, 12.14s/it] {'loss': 0.0225, 'grad_norm': 2.237236384028182, 'learning_rate': 1.6583850931677018e-07, 'completion_length': 84.89286041259766, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4107143878936768, 'reward_std': 0.23086076974868774, 'kl': 0.56298828125, 'epoch': 4.17} 83%|████████▎ | 1343/1610 [11:21:04<54:01, 12.14s/it] 83%|████████▎ | 1344/1610 [11:21:19<57:36, 12.99s/it] {'loss': 0.0469, 'grad_norm': 5.614392699564262, 'learning_rate': 1.652173913043478e-07, 'completion_length': 93.35714721679688, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5892857909202576, 'reward_std': 0.23086077719926834, 'kl': 1.1748046875, 'epoch': 4.17} 83%|████████▎ | 1344/1610 [11:21:19<57:36, 12.99s/it] 84%|████████▎ | 1345/1610 [11:21:36<1:01:45, 13.98s/it] {'loss': 0.0438, 'grad_norm': 3.7543178548520935, 'learning_rate': 1.6459627329192547e-07, 'completion_length': 112.91072082519531, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.6250000596046448, 'reward_std': 0.22229483723640442, 'kl': 1.095703125, 'epoch': 4.18} 84%|████████▎ | 1345/1610 [11:21:36<1:01:45, 
13.98s/it] 84%|████████▎ | 1346/1610 [11:21:46<56:03, 12.74s/it] {'loss': 0.0076, 'grad_norm': 2.38769800302218, 'learning_rate': 1.639751552795031e-07, 'completion_length': 82.50000381469727, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 1.0, 'reward': 1.5178571939468384, 'reward_std': 0.1071428619325161, 'kl': 0.189697265625, 'epoch': 4.18} 84%|████████▎ | 1346/1610 [11:21:46<56:03, 12.74s/it] 84%|████████▎ | 1347/1610 [11:22:00<58:14, 13.29s/it] {'loss': 0.0333, 'grad_norm': 3.1763098667272334, 'learning_rate': 1.6335403726708073e-07, 'completion_length': 87.71429061889648, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.910714328289032, 'reward': 1.4285715222358704, 'reward_std': 0.2967643141746521, 'kl': 0.8330078125, 'epoch': 4.18} 84%|████████▎ | 1347/1610 [11:22:00<58:14, 13.29s/it] 84%|████████▎ | 1348/1610 [11:22:13<56:54, 13.03s/it] {'loss': 0.0307, 'grad_norm': 3.3891075106505006, 'learning_rate': 1.6273291925465837e-07, 'completion_length': 94.78571701049805, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5535715222358704, 'reward_std': 0.36266787350177765, 'kl': 0.76953125, 'epoch': 4.19} 84%|████████▎ | 1348/1610 [11:22:13<56:54, 13.03s/it] 84%|████████▍ | 1349/1610 [11:22:22<52:15, 12.01s/it] {'loss': 0.0049, 'grad_norm': 1.432712426302788, 'learning_rate': 1.6211180124223603e-07, 'completion_length': 92.53571701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.1428571529686451, 'kl': 0.122314453125, 'epoch': 4.19} 84%|████████▍ | 1349/1610 [11:22:22<52:15, 12.01s/it] 84%|████████▍ | 1350/1610 [11:22:32<48:53, 11.28s/it] {'loss': 0.0171, 'grad_norm': 2.78898504166625, 'learning_rate': 1.6149068322981366e-07, 'completion_length': 73.57143020629883, 'rewards/accuracy_reward': 0.6428571790456772, 'rewards/format_reward': 0.9464285969734192, 'reward': 
1.5892857909202576, 'reward_std': 0.16546405479311943, 'kl': 0.42724609375, 'epoch': 4.19} 84%|████████▍ | 1350/1610 [11:22:32<48:53, 11.28s/it] 84%|████████▍ | 1351/1610 [11:22:42<47:25, 10.99s/it] {'loss': 0.0159, 'grad_norm': 5.350408072706796, 'learning_rate': 1.608695652173913e-07, 'completion_length': 79.00000381469727, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.18409645557403564, 'kl': 0.39794921875, 'epoch': 4.2} 84%|████████▍ | 1351/1610 [11:22:42<47:25, 10.99s/it] 84%|████████▍ | 1352/1610 [11:22:53<47:08, 10.96s/it] {'loss': 0.0214, 'grad_norm': 3.3135205688998766, 'learning_rate': 1.6024844720496895e-07, 'completion_length': 80.98214721679688, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 0.910714328289032, 'reward': 1.410714328289032, 'reward_std': 0.3193712383508682, 'kl': 0.537109375, 'epoch': 4.2} 84%|████████▍ | 1352/1610 [11:22:53<47:08, 10.96s/it] 84%|████████▍ | 1353/1610 [11:23:08<52:26, 12.24s/it] {'loss': 0.0288, 'grad_norm': 2.081395342662895, 'learning_rate': 1.5962732919254656e-07, 'completion_length': 106.03572082519531, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.8750000298023224, 'reward': 1.5178571939468384, 'reward_std': 0.22378524392843246, 'kl': 0.72265625, 'epoch': 4.2} 84%|████████▍ | 1353/1610 [11:23:08<52:26, 12.24s/it] 84%|████████▍ | 1354/1610 [11:23:23<55:55, 13.11s/it] {'loss': 0.0051, 'grad_norm': 3.6977735615405734, 'learning_rate': 1.5900621118012422e-07, 'completion_length': 82.21429061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.1428571492433548, 'kl': 0.127197265625, 'epoch': 4.2} 84%|████████▍ | 1354/1610 [11:23:23<55:55, 13.11s/it] 84%|████████▍ | 1355/1610 [11:23:34<52:02, 12.24s/it] {'loss': 0.0256, 'grad_norm': 4.582376447483899, 'learning_rate': 1.5838509316770185e-07, 
'completion_length': 87.94643020629883, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.910714328289032, 'reward': 1.5892857909202576, 'reward_std': 0.2826733738183975, 'kl': 0.640625, 'epoch': 4.21} 84%|████████▍ | 1355/1610 [11:23:34<52:02, 12.24s/it] 84%|████████▍ | 1356/1610 [11:23:44<49:30, 11.70s/it] {'loss': 0.0343, 'grad_norm': 2.912229270630809, 'learning_rate': 1.577639751552795e-07, 'completion_length': 87.53571701049805, 'rewards/accuracy_reward': 0.375, 'rewards/format_reward': 0.910714328289032, 'reward': 1.285714328289032, 'reward_std': 0.29924844205379486, 'kl': 0.8603515625, 'epoch': 4.21} 84%|████████▍ | 1356/1610 [11:23:44<49:30, 11.70s/it] 84%|████████▍ | 1357/1610 [11:23:54<47:40, 11.31s/it] {'loss': 0.0283, 'grad_norm': 2.4517600711130467, 'learning_rate': 1.5714285714285714e-07, 'completion_length': 85.0714340209961, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3214285969734192, 'reward_std': 0.18409645557403564, 'kl': 0.707763671875, 'epoch': 4.21} 84%|████████▍ | 1357/1610 [11:23:54<47:40, 11.31s/it] 84%|████████▍ | 1358/1610 [11:24:04<44:59, 10.71s/it] {'loss': 0.0066, 'grad_norm': 3.4959983287646725, 'learning_rate': 1.565217391304348e-07, 'completion_length': 63.33928871154785, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6250000596046448, 'reward_std': 0.20670336484909058, 'kl': 0.1640625, 'epoch': 4.22} 84%|████████▍ | 1358/1610 [11:24:04<44:59, 10.71s/it] 84%|████████▍ | 1359/1610 [11:24:22<54:13, 12.96s/it] {'loss': 0.0515, 'grad_norm': 2.9837715881677833, 'learning_rate': 1.559006211180124e-07, 'completion_length': 108.08929061889648, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.2857143878936768, 'reward_std': 0.21676982194185257, 'kl': 1.2890625, 'epoch': 4.22} 84%|████████▍ | 1359/1610 [11:24:22<54:13, 12.96s/it] 84%|████████▍ | 
1360/1610 [11:24:31<49:35, 11.90s/it] {'loss': 0.0114, 'grad_norm': 3.6897934547777576, 'learning_rate': 1.5527950310559004e-07, 'completion_length': 69.03571701049805, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.2967643290758133, 'kl': 0.283203125, 'epoch': 4.22} 84%|████████▍ | 1360/1610 [11:24:31<49:35, 11.90s/it] 85%|████████▍ | 1361/1610 [11:24:41<46:52, 11.30s/it] {'loss': 0.0399, 'grad_norm': 2.2050218715291257, 'learning_rate': 1.546583850931677e-07, 'completion_length': 74.48214721679688, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.892857164144516, 'reward': 1.5357143878936768, 'reward_std': 0.1428571529686451, 'kl': 0.9970703125, 'epoch': 4.23} 85%|████████▍ | 1361/1610 [11:24:41<46:52, 11.30s/it] 85%|████████▍ | 1362/1610 [11:24:50<44:00, 10.65s/it] {'loss': 0.0235, 'grad_norm': 5.37166505134377, 'learning_rate': 1.5403726708074533e-07, 'completion_length': 77.82143020629883, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5000001192092896, 'reward_std': 0.25552503019571304, 'kl': 0.5869140625, 'epoch': 4.23} 85%|████████▍ | 1362/1610 [11:24:50<44:00, 10.65s/it] 85%|████████▍ | 1363/1610 [11:25:01<43:49, 10.64s/it] {'loss': 0.0023, 'grad_norm': 1.1134058595097382, 'learning_rate': 1.53416149068323e-07, 'completion_length': 88.30357360839844, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.07695359364151955, 'kl': 0.0584716796875, 'epoch': 4.23} 85%|████████▍ | 1363/1610 [11:25:01<43:49, 10.64s/it] 85%|████████▍ | 1364/1610 [11:25:10<41:45, 10.18s/it] {'loss': 0.0043, 'grad_norm': 2.947018896146771, 'learning_rate': 1.5279503105590062e-07, 'completion_length': 74.64286041259766, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.1428571529686451, 
'kl': 0.107666015625, 'epoch': 4.24} 85%|████████▍ | 1364/1610 [11:25:10<41:45, 10.18s/it] 85%|████████▍ | 1365/1610 [11:25:20<40:56, 10.03s/it] {'loss': 0.0137, 'grad_norm': 1.4494034245577647, 'learning_rate': 1.5217391304347825e-07, 'completion_length': 84.0535774230957, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6071429252624512, 'reward_std': 0.12974976375699043, 'kl': 0.3408203125, 'epoch': 4.24} 85%|████████▍ | 1365/1610 [11:25:20<40:56, 10.03s/it] 85%|████████▍ | 1366/1610 [11:25:29<40:01, 9.84s/it] {'loss': 0.0218, 'grad_norm': 4.298532233674548, 'learning_rate': 1.5155279503105589e-07, 'completion_length': 72.23214721679688, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.27813194692134857, 'kl': 0.5458984375, 'epoch': 4.24} 85%|████████▍ | 1366/1610 [11:25:29<40:01, 9.84s/it] 85%|████████▍ | 1367/1610 [11:25:40<40:45, 10.06s/it] {'loss': 0.0063, 'grad_norm': 1.5965218647698256, 'learning_rate': 1.5093167701863354e-07, 'completion_length': 85.76786041259766, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.2006715089082718, 'kl': 0.158203125, 'epoch': 4.25} 85%|████████▍ | 1367/1610 [11:25:40<40:45, 10.06s/it] 85%|████████▍ | 1368/1610 [11:25:50<41:23, 10.26s/it] {'loss': 0.0364, 'grad_norm': 4.474308035061628, 'learning_rate': 1.5031055900621118e-07, 'completion_length': 75.91071701049805, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.8928571939468384, 'reward': 1.4285715222358704, 'reward_std': 0.33800362050533295, 'kl': 0.912109375, 'epoch': 4.25} 85%|████████▍ | 1368/1610 [11:25:50<41:23, 10.26s/it] 85%|████████▌ | 1369/1610 [11:26:01<41:03, 10.22s/it] {'loss': 0.0067, 'grad_norm': 1.3420832646121303, 'learning_rate': 1.496894409937888e-07, 'completion_length': 84.25000381469727, 
'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.11266787722706795, 'kl': 0.167724609375, 'epoch': 4.25} 85%|████████▌ | 1369/1610 [11:26:01<41:03, 10.22s/it] 85%|████████▌ | 1370/1610 [11:26:15<45:24, 11.35s/it] {'loss': 0.0319, 'grad_norm': 3.2193082464325196, 'learning_rate': 1.4906832298136647e-07, 'completion_length': 82.4464340209961, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5178571939468384, 'reward_std': 0.21981073170900345, 'kl': 0.79833984375, 'epoch': 4.25} 85%|████████▌ | 1370/1610 [11:26:15<45:24, 11.35s/it] 85%|████████▌ | 1371/1610 [11:26:34<55:29, 13.93s/it] {'loss': 0.0557, 'grad_norm': 3.194280160612576, 'learning_rate': 1.4844720496894407e-07, 'completion_length': 114.9285774230957, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.8571429252624512, 'reward': 1.4642857909202576, 'reward_std': 0.37067699432373047, 'kl': 1.390625, 'epoch': 4.26} 85%|████████▌ | 1371/1610 [11:26:34<55:29, 13.93s/it] 85%|████████▌ | 1372/1610 [11:26:44<50:19, 12.69s/it] {'loss': 0.0172, 'grad_norm': 5.133241712686629, 'learning_rate': 1.4782608695652173e-07, 'completion_length': 77.82143020629883, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.571428656578064, 'reward_std': 0.31633031368255615, 'kl': 0.4296875, 'epoch': 4.26} 85%|████████▌ | 1372/1610 [11:26:44<50:19, 12.69s/it] 85%|████████▌ | 1373/1610 [11:26:54<46:19, 11.73s/it] {'loss': 0.0301, 'grad_norm': 3.1257129093631626, 'learning_rate': 1.4720496894409937e-07, 'completion_length': 77.53571701049805, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.6250000596046448, 'reward_std': 0.07695358991622925, 'kl': 0.7537841796875, 'epoch': 4.26} 85%|████████▌ | 1373/1610 [11:26:54<46:19, 11.73s/it] 85%|████████▌ | 1374/1610 [11:27:06<47:07, 
11.98s/it] {'loss': 0.0451, 'grad_norm': 2.9394484153821048, 'learning_rate': 1.4658385093167703e-07, 'completion_length': 82.30357360839844, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.4642857909202576, 'reward_std': 0.25552502647042274, 'kl': 1.1298828125, 'epoch': 4.27} 85%|████████▌ | 1374/1610 [11:27:06<47:07, 11.98s/it] 85%|████████▌ | 1375/1610 [11:27:18<46:25, 11.85s/it] {'loss': 0.0244, 'grad_norm': 2.5499275582446157, 'learning_rate': 1.4596273291925466e-07, 'completion_length': 92.35714721679688, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4821429252624512, 'reward_std': 0.1785714402794838, 'kl': 0.61279296875, 'epoch': 4.27} 85%|████████▌ | 1375/1610 [11:27:18<46:25, 11.85s/it] 85%|████████▌ | 1376/1610 [11:27:27<43:13, 11.08s/it] {'loss': 0.0066, 'grad_norm': 2.466410929670738, 'learning_rate': 1.4534161490683232e-07, 'completion_length': 62.42857551574707, 'rewards/accuracy_reward': 0.4821428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4642857909202576, 'reward_std': 0.1539071798324585, 'kl': 0.1650390625, 'epoch': 4.27} 85%|████████▌ | 1376/1610 [11:27:27<43:13, 11.08s/it] 86%|████████▌ | 1377/1610 [11:27:43<48:13, 12.42s/it] {'loss': 0.0061, 'grad_norm': 1.7659838604902183, 'learning_rate': 1.4472049689440992e-07, 'completion_length': 97.98214721679688, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.23086077719926834, 'kl': 0.1533203125, 'epoch': 4.28} 86%|████████▌ | 1377/1610 [11:27:43<48:13, 12.42s/it] 86%|████████▌ | 1378/1610 [11:27:52<44:24, 11.48s/it] {'loss': 0.0065, 'grad_norm': 3.817037062098173, 'learning_rate': 1.4409937888198756e-07, 'completion_length': 68.01786041259766, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857909202576, 'reward_std': 
0.24695908278226852, 'kl': 0.1630859375, 'epoch': 4.28} 86%|████████▌ | 1378/1610 [11:27:52<44:24, 11.48s/it] 86%|████████▌ | 1379/1610 [11:28:02<42:45, 11.11s/it] {'loss': 0.0036, 'grad_norm': 1.7384798167697395, 'learning_rate': 1.4347826086956521e-07, 'completion_length': 82.73214721679688, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.090087890625, 'epoch': 4.28} 86%|████████▌ | 1379/1610 [11:28:02<42:45, 11.11s/it] 86%|████████▌ | 1380/1610 [11:28:16<45:35, 11.89s/it] {'loss': 0.0206, 'grad_norm': 2.6732565846263654, 'learning_rate': 1.4285714285714285e-07, 'completion_length': 92.64286041259766, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178572535514832, 'reward_std': 0.21981073170900345, 'kl': 0.5166015625, 'epoch': 4.29} 86%|████████▌ | 1380/1610 [11:28:16<45:35, 11.89s/it] 86%|████████▌ | 1381/1610 [11:28:25<41:59, 11.00s/it] {'loss': 0.0054, 'grad_norm': 0.9100954230974851, 'learning_rate': 1.422360248447205e-07, 'completion_length': 74.39286041259766, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.1337890625, 'epoch': 4.29} 86%|████████▌ | 1381/1610 [11:28:25<41:59, 11.00s/it] 86%|████████▌ | 1382/1610 [11:28:37<43:18, 11.40s/it] {'loss': 0.0191, 'grad_norm': 1.9390541501146006, 'learning_rate': 1.4161490683229814e-07, 'completion_length': 78.39286041259766, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7321429252624512, 'reward_std': 0.07695358991622925, 'kl': 0.4765625, 'epoch': 4.29} 86%|████████▌ | 1382/1610 [11:28:37<43:18, 11.40s/it] 86%|████████▌ | 1383/1610 [11:28:49<43:29, 11.49s/it] {'loss': 0.0336, 'grad_norm': 4.189460542867316, 'learning_rate': 1.4099378881987577e-07, 'completion_length': 83.76786041259766, 
'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.535714328289032, 'reward_std': 0.28819840401411057, 'kl': 0.8408203125, 'epoch': 4.3} 86%|████████▌ | 1383/1610 [11:28:49<43:29, 11.49s/it] 86%|████████▌ | 1384/1610 [11:29:04<47:19, 12.57s/it] {'loss': 0.0141, 'grad_norm': 2.0487404042151334, 'learning_rate': 1.403726708074534e-07, 'completion_length': 84.42857360839844, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.3083212077617645, 'kl': 0.352294921875, 'epoch': 4.3} 86%|████████▌ | 1384/1610 [11:29:04<47:19, 12.57s/it] 86%|████████▌ | 1385/1610 [11:29:14<44:10, 11.78s/it] {'loss': 0.0221, 'grad_norm': 2.4077086532515866, 'learning_rate': 1.3975155279503104e-07, 'completion_length': 78.42857360839844, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535715222358704, 'reward_std': 0.29123931378126144, 'kl': 0.551025390625, 'epoch': 4.3} 86%|████████▌ | 1385/1610 [11:29:14<44:10, 11.78s/it] 86%|████████▌ | 1386/1610 [11:29:24<42:26, 11.37s/it] {'loss': 0.0029, 'grad_norm': 0.8526804426466292, 'learning_rate': 1.391304347826087e-07, 'completion_length': 69.44643211364746, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.072021484375, 'epoch': 4.3} 86%|████████▌ | 1386/1610 [11:29:24<42:26, 11.37s/it] 86%|████████▌ | 1387/1610 [11:29:39<46:21, 12.47s/it] {'loss': 0.0243, 'grad_norm': 1.6939457441216157, 'learning_rate': 1.3850931677018633e-07, 'completion_length': 85.87500381469727, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.6064453125, 'epoch': 4.31} 86%|████████▌ | 1387/1610 [11:29:39<46:21, 12.47s/it] 86%|████████▌ | 1388/1610 [11:29:51<45:07, 
12.19s/it] {'loss': 0.0097, 'grad_norm': 2.0757472294937727, 'learning_rate': 1.3788819875776399e-07, 'completion_length': 92.46429061889648, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4464285969734192, 'reward_std': 0.2610500454902649, 'kl': 0.244140625, 'epoch': 4.31} 86%|████████▌ | 1388/1610 [11:29:51<45:07, 12.19s/it] 86%|████████▋ | 1389/1610 [11:30:02<43:12, 11.73s/it] {'loss': 0.0259, 'grad_norm': 2.168089531747938, 'learning_rate': 1.372670807453416e-07, 'completion_length': 87.71429061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.571428656578064, 'reward_std': 0.11266787722706795, 'kl': 0.6484375, 'epoch': 4.31} 86%|████████▋ | 1389/1610 [11:30:02<43:12, 11.73s/it] 86%|████████▋ | 1390/1610 [11:30:17<46:58, 12.81s/it] {'loss': 0.0068, 'grad_norm': 3.3215669603889433, 'learning_rate': 1.3664596273291925e-07, 'completion_length': 90.76786041259766, 'rewards/accuracy_reward': 0.446428582072258, 'rewards/format_reward': 0.910714328289032, 'reward': 1.3571429252624512, 'reward_std': 0.4008662849664688, 'kl': 0.170654296875, 'epoch': 4.32} 86%|████████▋ | 1390/1610 [11:30:17<46:58, 12.81s/it] 86%|████████▋ | 1391/1610 [11:30:26<42:29, 11.64s/it] {'loss': 0.0138, 'grad_norm': 4.009903877955375, 'learning_rate': 1.3602484472049688e-07, 'completion_length': 63.55357551574707, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.14838216826319695, 'kl': 0.346923828125, 'epoch': 4.32} 86%|████████▋ | 1391/1610 [11:30:26<42:29, 11.64s/it] 86%|████████▋ | 1392/1610 [11:30:41<46:06, 12.69s/it] {'loss': 0.0171, 'grad_norm': 2.9261037057000863, 'learning_rate': 1.3540372670807454e-07, 'completion_length': 90.85714721679688, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4464285969734192, 'reward_std': 
0.21981073170900345, 'kl': 0.429443359375, 'epoch': 4.32} 86%|████████▋ | 1392/1610 [11:30:41<46:06, 12.69s/it] 87%|████████▋ | 1393/1610 [11:30:50<42:21, 11.71s/it] {'loss': 0.0202, 'grad_norm': 3.2094373443936597, 'learning_rate': 1.3478260869565218e-07, 'completion_length': 75.08928680419922, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.14838216453790665, 'kl': 0.504150390625, 'epoch': 4.33} 87%|████████▋ | 1393/1610 [11:30:50<42:21, 11.71s/it] 87%|████████▋ | 1394/1610 [11:31:01<41:05, 11.41s/it] {'loss': 0.0038, 'grad_norm': 2.601699146848381, 'learning_rate': 1.3416149068322978e-07, 'completion_length': 77.67857360839844, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.23086076974868774, 'kl': 0.095458984375, 'epoch': 4.33} 87%|████████▋ | 1394/1610 [11:31:01<41:05, 11.41s/it] 87%|████████▋ | 1395/1610 [11:31:12<39:54, 11.14s/it] {'loss': 0.0345, 'grad_norm': 3.7742470384325615, 'learning_rate': 1.3354037267080744e-07, 'completion_length': 71.14286041259766, 'rewards/accuracy_reward': 0.5714286118745804, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178571939468384, 'reward_std': 0.21124479919672012, 'kl': 0.86328125, 'epoch': 4.33} 87%|████████▋ | 1395/1610 [11:31:12<39:54, 11.14s/it] 87%|████████▋ | 1396/1610 [11:31:21<37:57, 10.64s/it] {'loss': 0.0196, 'grad_norm': 2.849826470697926, 'learning_rate': 1.3291925465838507e-07, 'completion_length': 82.66071701049805, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.446428656578064, 'reward_std': 0.21981074661016464, 'kl': 0.490234375, 'epoch': 4.34} 87%|████████▋ | 1396/1610 [11:31:21<37:57, 10.64s/it] 87%|████████▋ | 1397/1610 [11:31:32<38:01, 10.71s/it] {'loss': 0.0303, 'grad_norm': 3.5365899932420475, 'learning_rate': 1.3229813664596273e-07, 'completion_length': 
81.64286041259766, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.5000000596046448, 'reward_std': 0.40943218767642975, 'kl': 0.759765625, 'epoch': 4.34} 87%|████████▋ | 1397/1610 [11:31:32<38:01, 10.71s/it] 87%|████████▋ | 1398/1610 [11:31:43<38:25, 10.88s/it] {'loss': 0.0145, 'grad_norm': 3.445449096816921, 'learning_rate': 1.3167701863354037e-07, 'completion_length': 88.42857360839844, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.446428656578064, 'reward_std': 0.278131939470768, 'kl': 0.3623046875, 'epoch': 4.34} 87%|████████▋ | 1398/1610 [11:31:43<38:25, 10.88s/it] 87%|████████▋ | 1399/1610 [11:31:52<36:19, 10.33s/it] {'loss': 0.009, 'grad_norm': 0.9848708605335822, 'learning_rate': 1.3105590062111802e-07, 'completion_length': 64.80357360839844, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.223876953125, 'epoch': 4.34} 87%|████████▋ | 1399/1610 [11:31:52<36:19, 10.33s/it] 87%|████████▋ | 1400/1610 [11:32:07<40:40, 11.62s/it] {'loss': 0.043, 'grad_norm': 3.3375902691777983, 'learning_rate': 1.3043478260869563e-07, 'completion_length': 83.23214721679688, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.446428656578064, 'reward_std': 0.3193712458014488, 'kl': 1.076171875, 'epoch': 4.35} 87%|████████▋ | 1400/1610 [11:32:07<40:40, 11.62s/it] 87%|████████▋ | 1401/1610 [11:35:08<3:37:14, 62.37s/it] {'loss': 0.0528, 'grad_norm': 4.490248580192749, 'learning_rate': 1.298136645962733e-07, 'completion_length': 85.75000381469727, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9107142984867096, 'reward': 1.5000000596046448, 'reward_std': 0.11266788095235825, 'kl': 1.315673828125, 'epoch': 4.35} 87%|████████▋ | 1401/1610 [11:35:08<3:37:14, 62.37s/it] 87%|████████▋ | 
1402/1610 [11:35:18<2:41:33, 46.60s/it] {'loss': 0.0052, 'grad_norm': 1.997542747846838, 'learning_rate': 1.2919254658385092e-07, 'completion_length': 70.92857360839844, 'rewards/accuracy_reward': 0.8571428954601288, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.12939453125, 'epoch': 4.35} 87%|████████▋ | 1402/1610 [11:35:18<2:41:33, 46.60s/it] 87%|████████▋ | 1403/1610 [11:35:27<2:02:24, 35.48s/it] {'loss': 0.017, 'grad_norm': 1.7170462952350365, 'learning_rate': 1.2857142857142855e-07, 'completion_length': 84.03571701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.1428571529686451, 'kl': 0.426025390625, 'epoch': 4.36} 87%|████████▋ | 1403/1610 [11:35:27<2:02:24, 35.48s/it] 87%|████████▋ | 1404/1610 [11:35:36<1:34:21, 27.48s/it] {'loss': 0.0316, 'grad_norm': 2.975940507211001, 'learning_rate': 1.2795031055900621e-07, 'completion_length': 60.60714530944824, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.1785714328289032, 'kl': 0.79150390625, 'epoch': 4.36} 87%|████████▋ | 1404/1610 [11:35:36<1:34:21, 27.48s/it] 87%|████████▋ | 1405/1610 [11:35:47<1:16:41, 22.45s/it] {'loss': 0.0052, 'grad_norm': 1.4175240649974115, 'learning_rate': 1.2732919254658385e-07, 'completion_length': 81.00000381469727, 'rewards/accuracy_reward': 0.8214286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.8035715222358704, 'reward_std': 0.18105553835630417, 'kl': 0.12890625, 'epoch': 4.36} 87%|████████▋ | 1405/1610 [11:35:47<1:16:41, 22.45s/it] 87%|████████▋ | 1406/1610 [11:35:57<1:03:47, 18.76s/it] {'loss': 0.0028, 'grad_norm': 1.6103218961219379, 'learning_rate': 1.2670807453416148e-07, 'completion_length': 71.33928871154785, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 
'reward_std': 0.14838216453790665, 'kl': 0.0701904296875, 'epoch': 4.37} 87%|████████▋ | 1406/1610 [11:35:57<1:03:47, 18.76s/it] 87%|████████▋ | 1407/1610 [11:36:07<54:25, 16.08s/it] {'loss': 0.0045, 'grad_norm': 0.8522854002578919, 'learning_rate': 1.260869565217391e-07, 'completion_length': 74.8214340209961, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.112548828125, 'epoch': 4.37} 87%|████████▋ | 1407/1610 [11:36:07<54:25, 16.08s/it] 87%|████████▋ | 1408/1610 [11:36:17<48:02, 14.27s/it] {'loss': 0.0149, 'grad_norm': 2.829495795072239, 'learning_rate': 1.2546583850931677e-07, 'completion_length': 74.21429061889648, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.607142984867096, 'reward_std': 0.2967643365263939, 'kl': 0.37158203125, 'epoch': 4.37} 87%|████████▋ | 1408/1610 [11:36:17<48:02, 14.27s/it] 88%|████████▊ | 1409/1610 [11:36:26<43:07, 12.87s/it] {'loss': 0.0211, 'grad_norm': 3.0603730779520824, 'learning_rate': 1.248447204968944e-07, 'completion_length': 77.125, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.5283203125, 'epoch': 4.38} 88%|████████▊ | 1409/1610 [11:36:26<43:07, 12.87s/it] 88%|████████▊ | 1410/1610 [11:36:37<40:37, 12.19s/it] {'loss': 0.0038, 'grad_norm': 1.9461401455434009, 'learning_rate': 1.2422360248447204e-07, 'completion_length': 93.8035774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.1428571529686451, 'kl': 0.09423828125, 'epoch': 4.38} 88%|████████▊ | 1410/1610 [11:36:37<40:37, 12.19s/it] 88%|████████▊ | 1411/1610 [11:36:47<38:40, 11.66s/it] {'loss': 0.0118, 'grad_norm': 1.9007354637389664, 'learning_rate': 1.236024844720497e-07, 'completion_length': 78.94643020629883, 
'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6428571939468384, 'reward_std': 0.18658055365085602, 'kl': 0.2939453125, 'epoch': 4.38} 88%|████████▊ | 1411/1610 [11:36:47<38:40, 11.66s/it] 88%|████████▊ | 1412/1610 [11:36:57<36:43, 11.13s/it] {'loss': 0.0097, 'grad_norm': 2.3186971718869316, 'learning_rate': 1.2298136645962733e-07, 'completion_length': 75.01786041259766, 'rewards/accuracy_reward': 0.75, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.7142857313156128, 'reward_std': 0.17098908126354218, 'kl': 0.2421875, 'epoch': 4.39} 88%|████████▊ | 1412/1610 [11:36:57<36:43, 11.13s/it] 88%|████████▊ | 1413/1610 [11:37:12<40:28, 12.33s/it] {'loss': 0.0092, 'grad_norm': 2.4732635293389493, 'learning_rate': 1.2236024844720496e-07, 'completion_length': 98.26786041259766, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.6071429252624512, 'reward_std': 0.2142857313156128, 'kl': 0.2294921875, 'epoch': 4.39} 88%|████████▊ | 1413/1610 [11:37:12<40:28, 12.33s/it] 88%|████████▊ | 1414/1610 [11:37:22<38:02, 11.65s/it] {'loss': 0.0034, 'grad_norm': 3.2610247201011044, 'learning_rate': 1.2173913043478262e-07, 'completion_length': 71.85714530944824, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1785714402794838, 'kl': 0.083740234375, 'epoch': 4.39} 88%|████████▊ | 1414/1610 [11:37:22<38:02, 11.65s/it] 88%|████████▊ | 1415/1610 [11:37:38<41:34, 12.79s/it] {'loss': 0.0032, 'grad_norm': 1.621551376119367, 'learning_rate': 1.2111801242236025e-07, 'completion_length': 87.53571701049805, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.1785714402794838, 'kl': 0.081298828125, 'epoch': 4.39} 88%|████████▊ | 1415/1610 [11:37:38<41:34, 12.79s/it] 88%|████████▊ | 1416/1610 [11:37:49<40:06, 12.40s/it] {'loss': 
0.0037, 'grad_norm': 1.1647265967564662, 'learning_rate': 1.2049689440993788e-07, 'completion_length': 80.62500381469727, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 1.0, 'reward': 1.8928572535514832, 'reward_std': 0.0714285746216774, 'kl': 0.091552734375, 'epoch': 4.4} 88%|████████▊ | 1416/1610 [11:37:49<40:06, 12.40s/it] 88%|████████▊ | 1417/1610 [11:37:59<37:41, 11.72s/it] {'loss': 0.0115, 'grad_norm': 1.7096067823086631, 'learning_rate': 1.1987577639751552e-07, 'completion_length': 76.30357360839844, 'rewards/accuracy_reward': 0.589285746216774, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.1896214708685875, 'kl': 0.28759765625, 'epoch': 4.4} 88%|████████▊ | 1417/1610 [11:37:59<37:41, 11.72s/it] 88%|████████▊ | 1418/1610 [11:38:10<35:58, 11.24s/it] {'loss': 0.0123, 'grad_norm': 1.8225006132060493, 'learning_rate': 1.1925465838509315e-07, 'completion_length': 91.73214721679688, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.12974976375699043, 'kl': 0.306640625, 'epoch': 4.4} 88%|████████▊ | 1418/1610 [11:38:10<35:58, 11.24s/it] 88%|████████▊ | 1419/1610 [11:38:19<34:21, 10.79s/it] {'loss': 0.0027, 'grad_norm': 1.2175699948883012, 'learning_rate': 1.1863354037267081e-07, 'completion_length': 74.42857360839844, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.19514648616313934, 'kl': 0.0682373046875, 'epoch': 4.41} 88%|████████▊ | 1419/1610 [11:38:19<34:21, 10.79s/it] 88%|████████▊ | 1420/1610 [11:38:29<33:26, 10.56s/it] {'loss': 0.0046, 'grad_norm': 1.9021826030538318, 'learning_rate': 1.1801242236024844e-07, 'completion_length': 81.3214340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.14838217198848724, 'kl': 0.11376953125, 'epoch': 4.41} 88%|████████▊ | 
1420/1610 [11:38:29<33:26, 10.56s/it] 88%|████████▊ | 1421/1610 [11:38:44<37:29, 11.90s/it] {'loss': 0.0183, 'grad_norm': 4.817683646213568, 'learning_rate': 1.1739130434782609e-07, 'completion_length': 84.71428680419922, 'rewards/accuracy_reward': 0.4285714626312256, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.3571429252624512, 'reward_std': 0.2836569547653198, 'kl': 0.4580078125, 'epoch': 4.41} 88%|████████▊ | 1421/1610 [11:38:44<37:29, 11.90s/it] 88%|████████▊ | 1422/1610 [11:38:57<38:21, 12.24s/it] {'loss': 0.0189, 'grad_norm': 1.6898204048201786, 'learning_rate': 1.1677018633540373e-07, 'completion_length': 83.55357360839844, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.17098907008767128, 'kl': 0.4765625, 'epoch': 4.42} 88%|████████▊ | 1422/1610 [11:38:57<38:21, 12.24s/it] 88%|████████▊ | 1423/1610 [11:39:06<34:35, 11.10s/it] {'loss': 0.0044, 'grad_norm': 1.5026550123160174, 'learning_rate': 1.1614906832298136e-07, 'completion_length': 68.82143020629883, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.1539071835577488, 'kl': 0.1103515625, 'epoch': 4.42} 88%|████████▊ | 1423/1610 [11:39:06<34:35, 11.10s/it] 88%|████████▊ | 1424/1610 [11:39:17<34:28, 11.12s/it] {'loss': 0.0124, 'grad_norm': 1.3677691400856087, 'learning_rate': 1.15527950310559e-07, 'completion_length': 82.41071701049805, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178571939468384, 'reward_std': 0.1071428656578064, 'kl': 0.30908203125, 'epoch': 4.42} 88%|████████▊ | 1424/1610 [11:39:17<34:28, 11.12s/it] 89%|████████▊ | 1425/1610 [11:39:26<32:40, 10.60s/it] {'loss': 0.0085, 'grad_norm': 2.1809354547766655, 'learning_rate': 1.1490683229813663e-07, 'completion_length': 71.75000381469727, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 
'reward': 1.6071429252624512, 'reward_std': 0.1428571529686451, 'kl': 0.212646484375, 'epoch': 4.43} 89%|████████▊ | 1425/1610 [11:39:26<32:40, 10.60s/it] 89%|████████▊ | 1426/1610 [11:39:36<31:44, 10.35s/it] {'loss': 0.018, 'grad_norm': 3.5760216797897897, 'learning_rate': 1.1428571428571427e-07, 'completion_length': 75.50000381469727, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.24241764098405838, 'kl': 0.450927734375, 'epoch': 4.43} 89%|████████▊ | 1426/1610 [11:39:36<31:44, 10.35s/it] 89%|████████▊ | 1427/1610 [11:39:45<30:16, 9.92s/it] {'loss': 0.0043, 'grad_norm': 4.46108584103826, 'learning_rate': 1.1366459627329192e-07, 'completion_length': 70.26786041259766, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.07695358991622925, 'kl': 0.108154296875, 'epoch': 4.43} 89%|████████▊ | 1427/1610 [11:39:45<30:16, 9.92s/it] 89%|████████▊ | 1428/1610 [11:39:54<29:24, 9.70s/it] {'loss': 0.0127, 'grad_norm': 1.907238202659363, 'learning_rate': 1.1304347826086955e-07, 'completion_length': 71.05357360839844, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535715222358704, 'reward_std': 0.1071428656578064, 'kl': 0.31689453125, 'epoch': 4.43} 89%|████████▊ | 1428/1610 [11:39:54<29:24, 9.70s/it] 89%|████████▉ | 1429/1610 [11:40:04<29:26, 9.76s/it] {'loss': 0.0091, 'grad_norm': 1.9417929458187464, 'learning_rate': 1.124223602484472e-07, 'completion_length': 75.625, 'rewards/accuracy_reward': 0.3035714477300644, 'rewards/format_reward': 1.0, 'reward': 1.3035714626312256, 'reward_std': 0.1785714402794838, 'kl': 0.2275390625, 'epoch': 4.44} 89%|████████▉ | 1429/1610 [11:40:04<29:26, 9.76s/it] 89%|████████▉ | 1430/1610 [11:40:14<29:41, 9.90s/it] {'loss': 0.0044, 'grad_norm': 2.4501832853122116, 'learning_rate': 1.1180124223602484e-07, 'completion_length': 
84.10714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.10888671875, 'epoch': 4.44} 89%|████████▉ | 1430/1610 [11:40:14<29:41, 9.90s/it] 89%|████████▉ | 1431/1610 [11:40:25<30:24, 10.19s/it] {'loss': 0.0031, 'grad_norm': 1.313247665792695, 'learning_rate': 1.1118012422360248e-07, 'completion_length': 82.0714340209961, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.12974976375699043, 'kl': 0.0771484375, 'epoch': 4.44} 89%|████████▉ | 1431/1610 [11:40:25<30:24, 10.19s/it] 89%|████████▉ | 1432/1610 [11:40:40<33:54, 11.43s/it] {'loss': 0.0032, 'grad_norm': 1.886907336678126, 'learning_rate': 1.1055900621118012e-07, 'completion_length': 80.55357360839844, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785714626312256, 'reward_std': 0.1539071872830391, 'kl': 0.0810546875, 'epoch': 4.45} 89%|████████▉ | 1432/1610 [11:40:40<33:54, 11.43s/it] 89%|████████▉ | 1433/1610 [11:40:50<32:39, 11.07s/it] {'loss': 0.0038, 'grad_norm': 1.6431730208181154, 'learning_rate': 1.0993788819875776e-07, 'completion_length': 70.66072082519531, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1071428619325161, 'kl': 0.095458984375, 'epoch': 4.45} 89%|████████▉ | 1433/1610 [11:40:50<32:39, 11.07s/it] 89%|████████▉ | 1434/1610 [11:41:01<32:30, 11.08s/it] {'loss': 0.0079, 'grad_norm': 1.2271390120116936, 'learning_rate': 1.0931677018633539e-07, 'completion_length': 83.83929061889648, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.0714285746216774, 'kl': 0.19775390625, 'epoch': 4.45} 89%|████████▉ | 1434/1610 [11:41:01<32:30, 11.08s/it] 89%|████████▉ | 1435/1610 [11:41:10<30:12, 10.36s/it] 
{'loss': 0.0036, 'grad_norm': 1.8710970071381043, 'learning_rate': 1.0869565217391303e-07, 'completion_length': 69.26786041259766, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.14838216826319695, 'kl': 0.089111328125, 'epoch': 4.46} 89%|████████▉ | 1435/1610 [11:41:10<30:12, 10.36s/it] 89%|████████▉ | 1436/1610 [11:41:19<29:29, 10.17s/it] {'loss': 0.0068, 'grad_norm': 1.9592674309458487, 'learning_rate': 1.0807453416149068e-07, 'completion_length': 74.58929061889648, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.11266788095235825, 'kl': 0.170166015625, 'epoch': 4.46} 89%|████████▉ | 1436/1610 [11:41:19<29:29, 10.17s/it] 89%|████████▉ | 1437/1610 [11:41:29<28:58, 10.05s/it] {'loss': 0.0071, 'grad_norm': 1.5021958654443703, 'learning_rate': 1.0745341614906831e-07, 'completion_length': 75.71428680419922, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.178466796875, 'epoch': 4.46} 89%|████████▉ | 1437/1610 [11:41:29<28:58, 10.05s/it] 89%|████████▉ | 1438/1610 [11:41:40<29:16, 10.21s/it] {'loss': 0.0036, 'grad_norm': 1.6625222515405935, 'learning_rate': 1.0683229813664596e-07, 'completion_length': 76.87500381469727, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.11266787722706795, 'kl': 0.09033203125, 'epoch': 4.47} 89%|████████▉ | 1438/1610 [11:41:40<29:16, 10.21s/it] 89%|████████▉ | 1439/1610 [11:41:49<28:19, 9.94s/it] {'loss': 0.0065, 'grad_norm': 3.1767433603640245, 'learning_rate': 1.062111801242236e-07, 'completion_length': 70.94643020629883, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.14838215708732605, 'kl': 0.161376953125, 
'epoch': 4.47} 89%|████████▉ | 1439/1610 [11:41:49<28:19, 9.94s/it] 89%|████████▉ | 1440/1610 [11:41:58<27:30, 9.71s/it] {'loss': 0.0098, 'grad_norm': 1.7237127164915227, 'learning_rate': 1.0559006211180124e-07, 'completion_length': 76.5714340209961, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.21981073915958405, 'kl': 0.246826171875, 'epoch': 4.47} 89%|████████▉ | 1440/1610 [11:41:58<27:30, 9.71s/it] 90%|████████▉ | 1441/1610 [11:42:08<27:41, 9.83s/it] {'loss': 0.0045, 'grad_norm': 0.6743583713846417, 'learning_rate': 1.0496894409937888e-07, 'completion_length': 84.19643020629883, 'rewards/accuracy_reward': 0.696428582072258, 'rewards/format_reward': 1.0, 'reward': 1.6964285969734192, 'reward_std': 0.0357142873108387, 'kl': 0.112548828125, 'epoch': 4.48} 90%|████████▉ | 1441/1610 [11:42:08<27:41, 9.83s/it] 90%|████████▉ | 1442/1610 [11:42:17<27:00, 9.65s/it] {'loss': 0.0148, 'grad_norm': 2.2036457509624197, 'learning_rate': 1.0434782608695651e-07, 'completion_length': 68.66071701049805, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.1896214708685875, 'kl': 0.3707275390625, 'epoch': 4.48} 90%|████████▉ | 1442/1610 [11:42:17<27:00, 9.65s/it] 90%|████████▉ | 1443/1610 [11:42:28<27:27, 9.87s/it] {'loss': 0.0063, 'grad_norm': 1.5606982738707074, 'learning_rate': 1.0372670807453415e-07, 'completion_length': 89.07143020629883, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.821428656578064, 'reward_std': 0.1539071835577488, 'kl': 0.15673828125, 'epoch': 4.48} 90%|████████▉ | 1443/1610 [11:42:28<27:27, 9.87s/it] 90%|████████▉ | 1444/1610 [11:42:37<26:53, 9.72s/it] {'loss': 0.0107, 'grad_norm': 2.895668778902427, 'learning_rate': 1.0310559006211179e-07, 'completion_length': 73.23214340209961, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 
0.9642857313156128, 'reward': 1.660714328289032, 'reward_std': 0.14838216826319695, 'kl': 0.2685546875, 'epoch': 4.48} 90%|████████▉ | 1444/1610 [11:42:37<26:53, 9.72s/it] 90%|████████▉ | 1445/1610 [11:42:47<26:50, 9.76s/it] {'loss': 0.0128, 'grad_norm': 2.197742058929996, 'learning_rate': 1.0248447204968944e-07, 'completion_length': 82.14286041259766, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.571428656578064, 'reward_std': 0.26657506078481674, 'kl': 0.318359375, 'epoch': 4.49} 90%|████████▉ | 1445/1610 [11:42:47<26:50, 9.76s/it] 90%|████████▉ | 1446/1610 [11:43:01<30:08, 11.03s/it] {'loss': 0.0059, 'grad_norm': 2.1021213495020845, 'learning_rate': 1.0186335403726707e-07, 'completion_length': 82.37500381469727, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.26657505333423615, 'kl': 0.14794921875, 'epoch': 4.49} 90%|████████▉ | 1446/1610 [11:43:01<30:08, 11.03s/it] 90%|████████▉ | 1447/1610 [11:43:10<28:18, 10.42s/it] {'loss': 0.0111, 'grad_norm': 4.244078712292141, 'learning_rate': 1.0124223602484472e-07, 'completion_length': 66.62500381469727, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5357143878936768, 'reward_std': 0.0714285746216774, 'kl': 0.27978515625, 'epoch': 4.49} 90%|████████▉ | 1447/1610 [11:43:10<28:18, 10.42s/it] 90%|████████▉ | 1448/1610 [11:43:19<26:40, 9.88s/it] {'loss': 0.0069, 'grad_norm': 3.0246312965161333, 'learning_rate': 1.0062111801242236e-07, 'completion_length': 58.78571701049805, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1896214671432972, 'kl': 0.171875, 'epoch': 4.5} 90%|████████▉ | 1448/1610 [11:43:19<26:40, 9.88s/it] 90%|█████████ | 1449/1610 [11:43:28<26:25, 9.85s/it] {'loss': 0.0076, 'grad_norm': 1.6144143977643906, 'learning_rate': 1e-07, 
'completion_length': 71.57143020629883, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.07695358991622925, 'kl': 0.190673828125, 'epoch': 4.5} 90%|█████████ | 1449/1610 [11:43:28<26:25, 9.85s/it] 90%|█████████ | 1450/1610 [11:43:39<26:43, 10.02s/it] {'loss': 0.0139, 'grad_norm': 2.2134737676308722, 'learning_rate': 9.937888198757763e-08, 'completion_length': 86.14286041259766, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.7321429252624512, 'reward_std': 0.21124479919672012, 'kl': 0.3486328125, 'epoch': 4.5} 90%|█████████ | 1450/1610 [11:43:39<26:43, 10.02s/it] 90%|█████████ | 1451/1610 [11:43:50<27:13, 10.27s/it] {'loss': 0.0036, 'grad_norm': 3.12514568760906, 'learning_rate': 9.875776397515527e-08, 'completion_length': 88.83929061889648, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535715222358704, 'reward_std': 0.1896214708685875, 'kl': 0.0888671875, 'epoch': 4.51} 90%|█████████ | 1451/1610 [11:43:50<27:13, 10.27s/it] 90%|█████████ | 1452/1610 [11:43:59<26:36, 10.11s/it] {'loss': 0.0147, 'grad_norm': 2.1474961971649384, 'learning_rate': 9.81366459627329e-08, 'completion_length': 91.57143020629883, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4285714626312256, 'reward_std': 0.18409645557403564, 'kl': 0.3681640625, 'epoch': 4.51} 90%|█████████ | 1452/1610 [11:43:59<26:36, 10.11s/it] 90%|█████████ | 1453/1610 [11:44:10<26:33, 10.15s/it] {'loss': 0.0187, 'grad_norm': 2.745393557254067, 'learning_rate': 9.751552795031055e-08, 'completion_length': 83.08929061889648, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4285714626312256, 'reward_std': 0.25346769392490387, 'kl': 0.4658203125, 'epoch': 4.51} 90%|█████████ | 1453/1610 [11:44:10<26:33, 10.15s/it] 90%|█████████ | 1454/1610 
[11:44:26<31:13, 12.01s/it] {'loss': 0.0446, 'grad_norm': 11.894379282705776, 'learning_rate': 9.68944099378882e-08, 'completion_length': 93.03571701049805, 'rewards/accuracy_reward': 0.5, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.4464285969734192, 'reward_std': 0.3324786275625229, 'kl': 1.111328125, 'epoch': 4.52} 90%|█████████ | 1454/1610 [11:44:26<31:13, 12.01s/it] 90%|█████████ | 1455/1610 [11:44:41<33:00, 12.77s/it] {'loss': 0.0077, 'grad_norm': 2.504791138404411, 'learning_rate': 9.627329192546583e-08, 'completion_length': 91.58929061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.29123932123184204, 'kl': 0.19287109375, 'epoch': 4.52} 90%|█████████ | 1455/1610 [11:44:41<33:00, 12.77s/it] 90%|█████████ | 1456/1610 [11:44:55<33:40, 13.12s/it] {'loss': 0.0349, 'grad_norm': 1.4589383707227292, 'learning_rate': 9.565217391304348e-08, 'completion_length': 88.50000381469727, 'rewards/accuracy_reward': 0.803571492433548, 'rewards/format_reward': 0.9285714626312256, 'reward': 1.7321429252624512, 'reward_std': 0.13527477905154228, 'kl': 0.875, 'epoch': 4.52} 90%|█████████ | 1456/1610 [11:44:55<33:40, 13.12s/it] 90%|█████████ | 1457/1610 [11:45:04<30:21, 11.91s/it] {'loss': 0.0032, 'grad_norm': 2.1059824995333183, 'learning_rate': 9.503105590062112e-08, 'completion_length': 75.6964340209961, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.15943220257759094, 'kl': 0.080810546875, 'epoch': 4.52} 90%|█████████ | 1457/1610 [11:45:04<30:21, 11.91s/it] 91%|█████████ | 1458/1610 [11:45:15<29:46, 11.75s/it] {'loss': 0.0201, 'grad_norm': 2.3321440731341383, 'learning_rate': 9.440993788819875e-08, 'completion_length': 81.55357360839844, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.21124479919672012, 'kl': 
0.50146484375, 'epoch': 4.53} 91%|█████████ | 1458/1610 [11:45:15<29:46, 11.75s/it] 91%|█████████ | 1459/1610 [11:45:24<27:11, 10.80s/it] {'loss': 0.0034, 'grad_norm': 2.990535991067799, 'learning_rate': 9.378881987577639e-08, 'completion_length': 66.28571510314941, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.0859375, 'epoch': 4.53} 91%|█████████ | 1459/1610 [11:45:24<27:11, 10.80s/it] 91%|█████████ | 1460/1610 [11:45:38<29:43, 11.89s/it] {'loss': 0.0317, 'grad_norm': 2.984707738310746, 'learning_rate': 9.316770186335403e-08, 'completion_length': 75.98214721679688, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.7919921875, 'epoch': 4.53} 91%|█████████ | 1460/1610 [11:45:38<29:43, 11.89s/it] 91%|█████████ | 1461/1610 [11:45:49<28:32, 11.49s/it] {'loss': 0.0064, 'grad_norm': 0.8699558030597452, 'learning_rate': 9.254658385093167e-08, 'completion_length': 90.05357360839844, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.0714285746216774, 'kl': 0.16064453125, 'epoch': 4.54} 91%|█████████ | 1461/1610 [11:45:49<28:32, 11.49s/it] 91%|█████████ | 1462/1610 [11:46:00<28:19, 11.48s/it] {'loss': 0.0101, 'grad_norm': 1.65981751566373, 'learning_rate': 9.192546583850931e-08, 'completion_length': 89.51786041259766, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285714626312256, 'reward_std': 0.1428571529686451, 'kl': 0.25390625, 'epoch': 4.54} 91%|█████████ | 1462/1610 [11:46:00<28:19, 11.48s/it] 91%|█████████ | 1463/1610 [11:46:09<26:20, 10.75s/it] {'loss': 0.0082, 'grad_norm': 1.3316263942104343, 'learning_rate': 9.130434782608696e-08, 'completion_length': 72.23214721679688, 'rewards/accuracy_reward': 0.892857164144516, 'rewards/format_reward': 
0.9821428656578064, 'reward': 1.8750001192092896, 'reward_std': 0.1071428619325161, 'kl': 0.20654296875, 'epoch': 4.54} 91%|█████████ | 1463/1610 [11:46:09<26:20, 10.75s/it] 91%|█████████ | 1464/1610 [11:46:21<26:56, 11.07s/it] {'loss': 0.0068, 'grad_norm': 1.9968071420687656, 'learning_rate': 9.068322981366459e-08, 'completion_length': 79.4285774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.1705322265625, 'epoch': 4.55} 91%|█████████ | 1464/1610 [11:46:21<26:56, 11.07s/it] 91%|█████████ | 1465/1610 [11:46:31<26:22, 10.92s/it] {'loss': 0.0031, 'grad_norm': 0.4250053767499055, 'learning_rate': 9.006211180124224e-08, 'completion_length': 92.92857360839844, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.07763671875, 'epoch': 4.55} 91%|█████████ | 1465/1610 [11:46:31<26:22, 10.92s/it] 91%|█████████ | 1466/1610 [11:46:41<25:27, 10.61s/it] {'loss': 0.0128, 'grad_norm': 2.6053442997123573, 'learning_rate': 8.944099378881988e-08, 'completion_length': 81.66071701049805, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.16546404734253883, 'kl': 0.32177734375, 'epoch': 4.55} 91%|█████████ | 1466/1610 [11:46:41<25:27, 10.61s/it] 91%|█████████ | 1467/1610 [11:46:52<24:58, 10.48s/it] {'loss': 0.0059, 'grad_norm': 2.457549687993377, 'learning_rate': 8.881987577639751e-08, 'completion_length': 76.48214340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.147216796875, 'epoch': 4.56} 91%|█████████ | 1467/1610 [11:46:52<24:58, 10.48s/it] 91%|█████████ | 1468/1610 [11:47:01<24:23, 10.31s/it] {'loss': 0.0067, 'grad_norm': 2.354611254959747, 'learning_rate': 
8.819875776397515e-08, 'completion_length': 75.41071891784668, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.1896214708685875, 'kl': 0.166748046875, 'epoch': 4.56} 91%|█████████ | 1468/1610 [11:47:01<24:23, 10.31s/it] 91%|█████████ | 1469/1610 [11:47:11<23:50, 10.15s/it] {'loss': 0.0056, 'grad_norm': 1.178617540933919, 'learning_rate': 8.757763975155279e-08, 'completion_length': 81.6964340209961, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 1.0, 'reward': 1.7857143878936768, 'reward_std': 0.0714285746216774, 'kl': 0.140380859375, 'epoch': 4.56} 91%|█████████ | 1469/1610 [11:47:11<23:50, 10.15s/it] 91%|█████████▏| 1470/1610 [11:47:22<24:20, 10.43s/it] {'loss': 0.0051, 'grad_norm': 1.1433397408042674, 'learning_rate': 8.695652173913042e-08, 'completion_length': 83.75000381469727, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.0714285746216774, 'kl': 0.127197265625, 'epoch': 4.57} 91%|█████████▏| 1470/1610 [11:47:22<24:20, 10.43s/it] 91%|█████████▏| 1471/1610 [11:47:34<25:19, 10.93s/it] {'loss': 0.0091, 'grad_norm': 1.2369823722853017, 'learning_rate': 8.633540372670807e-08, 'completion_length': 87.23214721679688, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.14838215708732605, 'kl': 0.2276611328125, 'epoch': 4.57} 91%|█████████▏| 1471/1610 [11:47:34<25:19, 10.93s/it] 91%|█████████▏| 1472/1610 [11:47:46<25:49, 11.23s/it] {'loss': 0.0095, 'grad_norm': 1.643718409073598, 'learning_rate': 8.571428571428572e-08, 'completion_length': 82.10714721679688, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.732142984867096, 'reward_std': 0.14838215708732605, 'kl': 0.238525390625, 'epoch': 4.57} 91%|█████████▏| 1472/1610 [11:47:46<25:49, 11.23s/it] 91%|█████████▏| 
1473/1610 [11:47:56<24:39, 10.80s/it] {'loss': 0.0042, 'grad_norm': 1.965754361668725, 'learning_rate': 8.509316770186335e-08, 'completion_length': 86.64286041259766, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.14838216453790665, 'kl': 0.104248046875, 'epoch': 4.57} 91%|█████████▏| 1473/1610 [11:47:56<24:39, 10.80s/it] 92%|█████████▏| 1474/1610 [11:48:06<24:00, 10.59s/it] {'loss': 0.0058, 'grad_norm': 2.0004545709653443, 'learning_rate': 8.4472049689441e-08, 'completion_length': 73.76786041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.19514649361371994, 'kl': 0.14501953125, 'epoch': 4.58} 92%|█████████▏| 1474/1610 [11:48:06<24:00, 10.59s/it] 92%|█████████▏| 1475/1610 [11:48:18<24:37, 10.95s/it] {'loss': 0.0232, 'grad_norm': 1.852173469996504, 'learning_rate': 8.385093167701864e-08, 'completion_length': 86.30357360839844, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6428571939468384, 'reward_std': 0.2363857924938202, 'kl': 0.578857421875, 'epoch': 4.58} 92%|█████████▏| 1475/1610 [11:48:18<24:37, 10.95s/it] 92%|█████████▏| 1476/1610 [11:48:30<25:08, 11.26s/it] {'loss': 0.0101, 'grad_norm': 3.3917878078493824, 'learning_rate': 8.322981366459626e-08, 'completion_length': 78.75000381469727, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.250732421875, 'epoch': 4.58} 92%|█████████▏| 1476/1610 [11:48:30<25:08, 11.26s/it] 92%|█████████▏| 1477/1610 [11:48:41<24:59, 11.28s/it] {'loss': 0.0091, 'grad_norm': 1.4607704066290819, 'learning_rate': 8.26086956521739e-08, 'completion_length': 81.62500381469727, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.785714328289032, 'reward_std': 
0.1428571529686451, 'kl': 0.228759765625, 'epoch': 4.59} 92%|█████████▏| 1477/1610 [11:48:41<24:59, 11.28s/it] 92%|█████████▏| 1478/1610 [11:48:56<27:08, 12.34s/it] {'loss': 0.0064, 'grad_norm': 2.732377572320306, 'learning_rate': 8.198757763975155e-08, 'completion_length': 91.26786041259766, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.410714328289032, 'reward_std': 0.30228935182094574, 'kl': 0.160888671875, 'epoch': 4.59} 92%|█████████▏| 1478/1610 [11:48:56<27:08, 12.34s/it] 92%|█████████▏| 1479/1610 [11:49:12<29:35, 13.55s/it] {'loss': 0.0243, 'grad_norm': 3.388482860170753, 'learning_rate': 8.136645962732918e-08, 'completion_length': 85.53571701049805, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6607143878936768, 'reward_std': 0.1785714402794838, 'kl': 0.607421875, 'epoch': 4.59} 92%|█████████▏| 1479/1610 [11:49:12<29:35, 13.55s/it] 92%|█████████▏| 1480/1610 [11:49:25<28:44, 13.26s/it] {'loss': 0.0387, 'grad_norm': 2.815986540990031, 'learning_rate': 8.074534161490683e-08, 'completion_length': 106.71429061889648, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5178571939468384, 'reward_std': 0.14838216453790665, 'kl': 0.968505859375, 'epoch': 4.6} 92%|█████████▏| 1480/1610 [11:49:25<28:44, 13.26s/it] 92%|█████████▏| 1481/1610 [11:49:35<26:21, 12.26s/it] {'loss': 0.0107, 'grad_norm': 1.715808663280359, 'learning_rate': 8.012422360248448e-08, 'completion_length': 68.69643020629883, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 0.1071428619325161, 'kl': 0.267578125, 'epoch': 4.6} 92%|█████████▏| 1481/1610 [11:49:35<26:21, 12.26s/it] 92%|█████████▏| 1482/1610 [11:49:45<24:48, 11.63s/it] {'loss': 0.0173, 'grad_norm': 1.480928116977331, 'learning_rate': 7.950310559006211e-08, 'completion_length': 
71.03571701049805, 'rewards/accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678571939468384, 'reward_std': 0.1181928962469101, 'kl': 0.4329833984375, 'epoch': 4.6} 92%|█████████▏| 1482/1610 [11:49:45<24:48, 11.63s/it] 92%|█████████▏| 1483/1610 [11:49:54<23:01, 10.88s/it] {'loss': 0.0024, 'grad_norm': 1.6491446862417107, 'learning_rate': 7.888198757763975e-08, 'completion_length': 71.69643020629883, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1071428619325161, 'kl': 0.0611572265625, 'epoch': 4.61} 92%|█████████▏| 1483/1610 [11:49:54<23:01, 10.88s/it] 92%|█████████▏| 1484/1610 [11:50:09<25:09, 11.98s/it] {'loss': 0.0157, 'grad_norm': 1.628332057394531, 'learning_rate': 7.82608695652174e-08, 'completion_length': 90.12500381469727, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.13527477905154228, 'kl': 0.392578125, 'epoch': 4.61} 92%|█████████▏| 1484/1610 [11:50:09<25:09, 11.98s/it] 92%|█████████▏| 1485/1610 [11:50:19<23:45, 11.40s/it] {'loss': 0.0064, 'grad_norm': 2.474931884308228, 'learning_rate': 7.763975155279502e-08, 'completion_length': 80.33929061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.23086077719926834, 'kl': 0.1611328125, 'epoch': 4.61} 92%|█████████▏| 1485/1610 [11:50:19<23:45, 11.40s/it] 92%|█████████▏| 1486/1610 [11:50:30<23:17, 11.27s/it] {'loss': 0.0113, 'grad_norm': 2.9894280438229055, 'learning_rate': 7.701863354037266e-08, 'completion_length': 73.6785774230957, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.2253357470035553, 'kl': 0.283935546875, 'epoch': 4.61} 92%|█████████▏| 1486/1610 [11:50:30<23:17, 11.27s/it] 92%|█████████▏| 1487/1610 [11:50:39<21:51, 
10.67s/it] {'loss': 0.0115, 'grad_norm': 2.068489371534918, 'learning_rate': 7.639751552795031e-08, 'completion_length': 66.94643020629883, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785715222358704, 'reward_std': 0.12974976375699043, 'kl': 0.2861328125, 'epoch': 4.62} 92%|█████████▏| 1487/1610 [11:50:39<21:51, 10.67s/it] 92%|█████████▏| 1488/1610 [11:50:51<22:30, 11.07s/it] {'loss': 0.0067, 'grad_norm': 2.2192148234337403, 'learning_rate': 7.577639751552794e-08, 'completion_length': 92.10714721679688, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428571939468384, 'reward_std': 0.19514648616313934, 'kl': 0.167724609375, 'epoch': 4.62} 92%|█████████▏| 1488/1610 [11:50:51<22:30, 11.07s/it] 92%|█████████▏| 1489/1610 [11:51:01<21:38, 10.73s/it] {'loss': 0.0078, 'grad_norm': 1.3083749906946216, 'learning_rate': 7.515527950310559e-08, 'completion_length': 78.58928680419922, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.0714285746216774, 'kl': 0.1947021484375, 'epoch': 4.62} 92%|█████████▏| 1489/1610 [11:51:01<21:38, 10.73s/it] 93%|█████████▎| 1490/1610 [11:51:11<21:08, 10.57s/it] {'loss': 0.0056, 'grad_norm': 1.964075942808271, 'learning_rate': 7.453416149068323e-08, 'completion_length': 82.08929061889648, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.18105553835630417, 'kl': 0.139404296875, 'epoch': 4.63} 93%|█████████▎| 1490/1610 [11:51:11<21:08, 10.57s/it] 93%|█████████▎| 1491/1610 [11:51:22<20:54, 10.54s/it] {'loss': 0.0109, 'grad_norm': 1.7184704910871555, 'learning_rate': 7.391304347826087e-08, 'completion_length': 89.71428680419922, 'rewards/accuracy_reward': 0.785714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678572535514832, 'reward_std': 
0.2610500529408455, 'kl': 0.2724609375, 'epoch': 4.63} 93%|█████████▎| 1491/1610 [11:51:22<20:54, 10.54s/it] 93%|█████████▎| 1492/1610 [11:51:32<20:17, 10.32s/it] {'loss': 0.007, 'grad_norm': 5.462815333805028, 'learning_rate': 7.329192546583851e-08, 'completion_length': 73.19643020629883, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.1181928962469101, 'kl': 0.1748046875, 'epoch': 4.63} 93%|█████████▎| 1492/1610 [11:51:32<20:17, 10.32s/it] 93%|█████████▎| 1493/1610 [11:51:42<20:04, 10.30s/it] {'loss': 0.0028, 'grad_norm': 1.641846844430044, 'learning_rate': 7.267080745341616e-08, 'completion_length': 76.87500381469727, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.070068359375, 'epoch': 4.64} 93%|█████████▎| 1493/1610 [11:51:42<20:04, 10.30s/it] 93%|█████████▎| 1494/1610 [11:51:53<20:34, 10.65s/it] {'loss': 0.0094, 'grad_norm': 0.9112248007834073, 'learning_rate': 7.204968944099378e-08, 'completion_length': 81.73214721679688, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.1071428656578064, 'kl': 0.236572265625, 'epoch': 4.64} 93%|█████████▎| 1494/1610 [11:51:53<20:34, 10.65s/it] 93%|█████████▎| 1495/1610 [11:52:04<20:33, 10.73s/it] {'loss': 0.0079, 'grad_norm': 4.221993275550625, 'learning_rate': 7.142857142857142e-08, 'completion_length': 71.60714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.21981074661016464, 'kl': 0.197265625, 'epoch': 4.64} 93%|█████████▎| 1495/1610 [11:52:04<20:33, 10.73s/it] 93%|█████████▎| 1496/1610 [11:52:15<20:42, 10.90s/it] {'loss': 0.0059, 'grad_norm': 1.239947656878848, 'learning_rate': 7.080745341614907e-08, 'completion_length': 95.14286041259766, 'rewards/accuracy_reward': 
0.5892857313156128, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.07695358991622925, 'kl': 0.1484375, 'epoch': 4.65} 93%|█████████▎| 1496/1610 [11:52:15<20:42, 10.90s/it] 93%|█████████▎| 1497/1610 [11:52:27<20:38, 10.96s/it] {'loss': 0.0039, 'grad_norm': 1.460124881774192, 'learning_rate': 7.01863354037267e-08, 'completion_length': 77.1964340209961, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.660714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.09814453125, 'epoch': 4.65} 93%|█████████▎| 1497/1610 [11:52:27<20:38, 10.96s/it] 93%|█████████▎| 1498/1610 [11:52:40<22:02, 11.80s/it] {'loss': 0.01, 'grad_norm': 9.742127737252188, 'learning_rate': 6.956521739130435e-08, 'completion_length': 83.96428680419922, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.1539071872830391, 'kl': 0.24951171875, 'epoch': 4.65} 93%|█████████▎| 1498/1610 [11:52:40<22:02, 11.80s/it] 93%|█████████▎| 1499/1610 [11:52:51<21:24, 11.57s/it] {'loss': 0.0313, 'grad_norm': 2.3022893265263367, 'learning_rate': 6.894409937888199e-08, 'completion_length': 90.71428680419922, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5892857909202576, 'reward_std': 0.1896214634180069, 'kl': 0.78271484375, 'epoch': 4.66} 93%|█████████▎| 1499/1610 [11:52:51<21:24, 11.57s/it] 93%|█████████▎| 1500/1610 [11:53:02<20:41, 11.29s/it] {'loss': 0.0156, 'grad_norm': 3.3839656551134047, 'learning_rate': 6.832298136645963e-08, 'completion_length': 83.32143020629883, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.660714328289032, 'reward_std': 0.23483527451753616, 'kl': 0.3916015625, 'epoch': 4.66} 93%|█████████▎| 1500/1610 [11:53:02<20:41, 11.29s/it] 93%|█████████▎| 1501/1610 [11:56:14<1:58:54, 65.45s/it] {'loss': 0.0024, 'grad_norm': 
1.5590019643332838, 'learning_rate': 6.770186335403727e-08, 'completion_length': 82.73214721679688, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.11266788095235825, 'kl': 0.0606689453125, 'epoch': 4.66} 93%|█████████▎| 1501/1610 [11:56:14<1:58:54, 65.45s/it] 93%|█████████▎| 1502/1610 [11:56:25<1:28:29, 49.16s/it] {'loss': 0.0271, 'grad_norm': 3.4174655810734236, 'learning_rate': 6.708074534161489e-08, 'completion_length': 93.14286041259766, 'rewards/accuracy_reward': 0.392857164144516, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.3571429252624512, 'reward_std': 0.2253357619047165, 'kl': 0.6748046875, 'epoch': 4.66} 93%|█████████▎| 1502/1610 [11:56:25<1:28:29, 49.16s/it] 93%|█████████▎| 1503/1610 [11:56:36<1:07:04, 37.61s/it] {'loss': 0.0057, 'grad_norm': 0.7979435488403236, 'learning_rate': 6.645962732919254e-08, 'completion_length': 91.41071701049805, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.1417236328125, 'epoch': 4.67} 93%|█████████▎| 1503/1610 [11:56:36<1:07:04, 37.61s/it] 93%|█████████▎| 1504/1610 [11:56:48<53:02, 30.02s/it] {'loss': 0.0303, 'grad_norm': 3.204020568296875, 'learning_rate': 6.583850931677018e-08, 'completion_length': 77.42857360839844, 'rewards/accuracy_reward': 0.3750000074505806, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.321428656578064, 'reward_std': 0.17553051561117172, 'kl': 0.7568359375, 'epoch': 4.67} 93%|█████████▎| 1504/1610 [11:56:48<53:02, 30.02s/it] 93%|█████████▎| 1505/1610 [11:56:57<41:45, 23.86s/it] {'loss': 0.0031, 'grad_norm': 0.14367027988052122, 'learning_rate': 6.521739130434782e-08, 'completion_length': 70.87500381469727, 'rewards/accuracy_reward': 0.8571429252624512, 'rewards/format_reward': 1.0, 'reward': 1.8571429252624512, 'reward_std': 0.0, 'kl': 0.0762939453125, 'epoch': 4.67} 93%|█████████▎| 1505/1610 
[11:56:57<41:45, 23.86s/it] 94%|█████████▎| 1506/1610 [11:57:07<33:53, 19.56s/it] {'loss': 0.0057, 'grad_norm': 3.164973113546604, 'learning_rate': 6.459627329192546e-08, 'completion_length': 79.35714721679688, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.2253357470035553, 'kl': 0.143310546875, 'epoch': 4.68} 94%|█████████▎| 1506/1610 [11:57:07<33:53, 19.56s/it] 94%|█████████▎| 1507/1610 [11:57:17<28:36, 16.67s/it] {'loss': 0.0053, 'grad_norm': 2.6973418409423586, 'learning_rate': 6.397515527950311e-08, 'completion_length': 78.69643020629883, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.1539071798324585, 'kl': 0.131591796875, 'epoch': 4.68} 94%|█████████▎| 1507/1610 [11:57:17<28:36, 16.67s/it] 94%|█████████▎| 1508/1610 [11:57:28<25:43, 15.13s/it] {'loss': 0.0061, 'grad_norm': 1.7595155234381044, 'learning_rate': 6.335403726708074e-08, 'completion_length': 98.80357360839844, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1785714402794838, 'kl': 0.151611328125, 'epoch': 4.68} 94%|█████████▎| 1508/1610 [11:57:28<25:43, 15.13s/it] 94%|█████████▎| 1509/1610 [11:57:39<23:03, 13.70s/it] {'loss': 0.0024, 'grad_norm': 1.216759075968677, 'learning_rate': 6.273291925465838e-08, 'completion_length': 91.78571701049805, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.07695358991622925, 'kl': 0.060302734375, 'epoch': 4.69} 94%|█████████▎| 1509/1610 [11:57:39<23:03, 13.70s/it] 94%|█████████▍| 1510/1610 [11:57:48<20:34, 12.34s/it] {'loss': 0.0105, 'grad_norm': 1.3967798762315724, 'learning_rate': 6.211180124223602e-08, 'completion_length': 72.33928680419922, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9821428656578064, 
'reward': 1.6250001192092896, 'reward_std': 0.1181928962469101, 'kl': 0.263427734375, 'epoch': 4.69} 94%|█████████▍| 1510/1610 [11:57:48<20:34, 12.34s/it] 94%|█████████▍| 1511/1610 [11:57:59<19:33, 11.86s/it] {'loss': 0.0089, 'grad_norm': 3.3670307648276614, 'learning_rate': 6.149068322981366e-08, 'completion_length': 87.1964340209961, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.14838216453790665, 'kl': 0.22412109375, 'epoch': 4.69} 94%|█████████▍| 1511/1610 [11:57:59<19:33, 11.86s/it] 94%|█████████▍| 1512/1610 [11:58:13<20:24, 12.49s/it] {'loss': 0.0044, 'grad_norm': 1.6256373877104056, 'learning_rate': 6.086956521739131e-08, 'completion_length': 110.98214721679688, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.11266787722706795, 'kl': 0.110595703125, 'epoch': 4.7} 94%|█████████▍| 1512/1610 [11:58:13<20:24, 12.49s/it] 94%|█████████▍| 1513/1610 [11:58:24<19:25, 12.02s/it] {'loss': 0.003, 'grad_norm': 2.23263958045781, 'learning_rate': 6.024844720496894e-08, 'completion_length': 76.33929061889648, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.19514648616313934, 'kl': 0.07421875, 'epoch': 4.7} 94%|█████████▍| 1513/1610 [11:58:24<19:25, 12.02s/it] 94%|█████████▍| 1514/1610 [11:58:34<18:32, 11.59s/it] {'loss': 0.0065, 'grad_norm': 1.4472736510885418, 'learning_rate': 5.962732919254657e-08, 'completion_length': 87.28571701049805, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.0714285746216774, 'kl': 0.162353515625, 'epoch': 4.7} 94%|█████████▍| 1514/1610 [11:58:34<18:32, 11.59s/it] 94%|█████████▍| 1515/1610 [11:58:46<18:21, 11.59s/it] {'loss': 0.004, 'grad_norm': 1.0995718301342647, 'learning_rate': 5.900621118012422e-08, 'completion_length': 
91.41072082519531, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.10107421875, 'epoch': 4.7} 94%|█████████▍| 1515/1610 [11:58:46<18:21, 11.59s/it] 94%|█████████▍| 1516/1610 [11:58:54<16:44, 10.69s/it] {'loss': 0.019, 'grad_norm': 3.4412479483441354, 'learning_rate': 5.8385093167701866e-08, 'completion_length': 73.51786041259766, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.5000000596046448, 'reward_std': 0.2142857201397419, 'kl': 0.4736328125, 'epoch': 4.71} 94%|█████████▍| 1516/1610 [11:58:54<16:44, 10.69s/it] 94%|█████████▍| 1517/1610 [11:59:06<17:10, 11.08s/it] {'loss': 0.0279, 'grad_norm': 1.9813431419334384, 'learning_rate': 5.77639751552795e-08, 'completion_length': 97.98214721679688, 'rewards/accuracy_reward': 0.5178571790456772, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4821429252624512, 'reward_std': 0.15943220257759094, 'kl': 0.7022705078125, 'epoch': 4.71} 94%|█████████▍| 1517/1610 [11:59:06<17:10, 11.08s/it] 94%|█████████▍| 1518/1610 [11:59:16<16:07, 10.51s/it] {'loss': 0.0022, 'grad_norm': 0.11515519099790558, 'learning_rate': 5.714285714285714e-08, 'completion_length': 77.98214340209961, 'rewards/accuracy_reward': 1.0, 'rewards/format_reward': 1.0, 'reward': 2.0, 'reward_std': 0.0, 'kl': 0.05419921875, 'epoch': 4.71} 94%|█████████▍| 1518/1610 [11:59:16<16:07, 10.51s/it] 94%|█████████▍| 1519/1610 [11:59:26<16:07, 10.63s/it] {'loss': 0.0174, 'grad_norm': 1.474985074760203, 'learning_rate': 5.6521739130434777e-08, 'completion_length': 82.85714721679688, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.4642857313156128, 'reward_std': 0.21676982939243317, 'kl': 0.4345703125, 'epoch': 4.72} 94%|█████████▍| 1519/1610 [11:59:26<16:07, 10.63s/it] 94%|█████████▍| 1520/1610 [11:59:37<16:00, 10.67s/it] {'loss': 0.0087, 'grad_norm': 
1.7294387009367764, 'learning_rate': 5.590062111801242e-08, 'completion_length': 97.12500381469727, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.1539071835577488, 'kl': 0.2177734375, 'epoch': 4.72} 94%|█████████▍| 1520/1610 [11:59:37<16:00, 10.67s/it] 94%|█████████▍| 1521/1610 [11:59:47<15:27, 10.42s/it] {'loss': 0.006, 'grad_norm': 2.5165343253690913, 'learning_rate': 5.527950310559006e-08, 'completion_length': 88.10714340209961, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.18409645557403564, 'kl': 0.149169921875, 'epoch': 4.72} 94%|█████████▍| 1521/1610 [11:59:47<15:27, 10.42s/it] 95%|█████████▍| 1522/1610 [11:59:56<14:42, 10.03s/it] {'loss': 0.0022, 'grad_norm': 0.34271700063562227, 'learning_rate': 5.4658385093167694e-08, 'completion_length': 68.66071701049805, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.04123930633068085, 'kl': 0.054443359375, 'epoch': 4.73} 95%|█████████▍| 1522/1610 [11:59:56<14:42, 10.03s/it] 95%|█████████▍| 1523/1610 [12:00:07<14:50, 10.24s/it] {'loss': 0.0254, 'grad_norm': 3.9141576033202656, 'learning_rate': 5.403726708074534e-08, 'completion_length': 90.96428680419922, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5535715222358704, 'reward_std': 0.2610500454902649, 'kl': 0.634765625, 'epoch': 4.73} 95%|█████████▍| 1523/1610 [12:00:07<14:50, 10.24s/it] 95%|█████████▍| 1524/1610 [12:00:16<14:05, 9.83s/it] {'loss': 0.0032, 'grad_norm': 1.0538912691986688, 'learning_rate': 5.341614906832298e-08, 'completion_length': 69.44643020629883, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 1.0, 'reward': 1.4642857909202576, 'reward_std': 0.04123930633068085, 'kl': 0.081298828125, 'epoch': 4.73} 95%|█████████▍| 1524/1610 [12:00:16<14:05, 
9.83s/it] 95%|█████████▍| 1525/1610 [12:00:31<16:09, 11.40s/it] {'loss': 0.0191, 'grad_norm': 1.1980958761536165, 'learning_rate': 5.279503105590062e-08, 'completion_length': 96.14286041259766, 'rewards/accuracy_reward': 0.3928571492433548, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3750000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.476806640625, 'epoch': 4.74} 95%|█████████▍| 1525/1610 [12:00:31<16:09, 11.40s/it] 95%|█████████▍| 1526/1610 [12:00:40<15:09, 10.82s/it] {'loss': 0.0079, 'grad_norm': 6.994830813384549, 'learning_rate': 5.217391304347826e-08, 'completion_length': 74.73214721679688, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142858505249023, 'reward_std': 0.18409645557403564, 'kl': 0.198486328125, 'epoch': 4.74} 95%|█████████▍| 1526/1610 [12:00:40<15:09, 10.82s/it] 95%|█████████▍| 1527/1610 [12:00:51<14:45, 10.67s/it] {'loss': 0.0327, 'grad_norm': 2.1063108330844775, 'learning_rate': 5.1552795031055897e-08, 'completion_length': 83.14286041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5714285969734192, 'reward_std': 0.23638580739498138, 'kl': 0.814208984375, 'epoch': 4.74} 95%|█████████▍| 1527/1610 [12:00:51<14:45, 10.67s/it] 95%|█████████▍| 1528/1610 [12:01:00<13:58, 10.23s/it] {'loss': 0.0056, 'grad_norm': 1.3805383766585053, 'learning_rate': 5.0931677018633536e-08, 'completion_length': 69.10714340209961, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.0714285746216774, 'kl': 0.139892578125, 'epoch': 4.75} 95%|█████████▍| 1528/1610 [12:01:00<13:58, 10.23s/it] 95%|█████████▍| 1529/1610 [12:01:11<14:16, 10.57s/it] {'loss': 0.0031, 'grad_norm': 1.467863859309965, 'learning_rate': 5.031055900621118e-08, 'completion_length': 89.35714721679688, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 
1.8392857313156128, 'reward_std': 0.1071428656578064, 'kl': 0.0771484375, 'epoch': 4.75} 95%|█████████▍| 1529/1610 [12:01:11<14:16, 10.57s/it] 95%|█████████▌| 1530/1610 [12:01:21<13:39, 10.24s/it] {'loss': 0.007, 'grad_norm': 2.4048489596618228, 'learning_rate': 4.9689440993788814e-08, 'completion_length': 83.58929061889648, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214708685875, 'kl': 0.173828125, 'epoch': 4.75} 95%|█████████▌| 1530/1610 [12:01:21<13:39, 10.24s/it] 95%|█████████▌| 1531/1610 [12:01:31<13:25, 10.20s/it] {'loss': 0.0037, 'grad_norm': 2.6843593053792807, 'learning_rate': 4.906832298136645e-08, 'completion_length': 77.51786041259766, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 1.0, 'reward': 1.5000000596046448, 'reward_std': 0.18409645557403564, 'kl': 0.093017578125, 'epoch': 4.75} 95%|█████████▌| 1531/1610 [12:01:31<13:25, 10.20s/it] 95%|█████████▌| 1532/1610 [12:01:42<13:34, 10.44s/it] {'loss': 0.0057, 'grad_norm': 1.399898032795412, 'learning_rate': 4.84472049689441e-08, 'completion_length': 93.37500381469727, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.14079980179667473, 'kl': 0.1424560546875, 'epoch': 4.76} 95%|█████████▌| 1532/1610 [12:01:42<13:34, 10.44s/it] 95%|█████████▌| 1533/1610 [12:01:51<13:00, 10.14s/it] {'loss': 0.0026, 'grad_norm': 2.321086602771314, 'learning_rate': 4.782608695652174e-08, 'completion_length': 83.62500381469727, 'rewards/accuracy_reward': 0.7678571939468384, 'rewards/format_reward': 1.0, 'reward': 1.7678571939468384, 'reward_std': 0.1071428619325161, 'kl': 0.064208984375, 'epoch': 4.76} 95%|█████████▌| 1533/1610 [12:01:51<13:00, 10.14s/it] 95%|█████████▌| 1534/1610 [12:02:03<13:29, 10.65s/it] {'loss': 0.0161, 'grad_norm': 1.944409559900532, 'learning_rate': 4.720496894409938e-08, 'completion_length': 
87.8214340209961, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.400146484375, 'epoch': 4.76} 95%|█████████▌| 1534/1610 [12:02:03<13:29, 10.65s/it] 95%|█████████▌| 1535/1610 [12:02:12<12:41, 10.15s/it] {'loss': 0.0049, 'grad_norm': 7.366426818472819, 'learning_rate': 4.6583850931677016e-08, 'completion_length': 70.26785850524902, 'rewards/accuracy_reward': 0.6607142984867096, 'rewards/format_reward': 1.0, 'reward': 1.6607143878936768, 'reward_std': 0.1896214783191681, 'kl': 0.1220703125, 'epoch': 4.77} 95%|█████████▌| 1535/1610 [12:02:12<12:41, 10.15s/it] 95%|█████████▌| 1536/1610 [12:02:23<12:50, 10.41s/it] {'loss': 0.007, 'grad_norm': 1.717862205920487, 'learning_rate': 4.5962732919254656e-08, 'completion_length': 87.69643020629883, 'rewards/accuracy_reward': 0.6071428656578064, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5892857909202576, 'reward_std': 0.21981074661016464, 'kl': 0.17431640625, 'epoch': 4.77} 95%|█████████▌| 1536/1610 [12:02:23<12:50, 10.41s/it] 95%|█████████▌| 1537/1610 [12:02:35<13:20, 10.97s/it] {'loss': 0.0032, 'grad_norm': 2.2708803076893487, 'learning_rate': 4.5341614906832295e-08, 'completion_length': 92.87500381469727, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.535714328289032, 'reward_std': 0.1539071835577488, 'kl': 0.080322265625, 'epoch': 4.77} 95%|█████████▌| 1537/1610 [12:02:35<13:20, 10.97s/it] 96%|█████████▌| 1538/1610 [12:02:44<12:13, 10.18s/it] {'loss': 0.0035, 'grad_norm': 2.636139252514531, 'learning_rate': 4.472049689440994e-08, 'completion_length': 69.4464340209961, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.18409644439816475, 'kl': 0.08642578125, 'epoch': 4.78} 96%|█████████▌| 1538/1610 [12:02:44<12:13, 10.18s/it] 96%|█████████▌| 1539/1610 [12:02:55<12:37, 10.67s/it] {'loss': 0.0136, 
'grad_norm': 2.657411664650712, 'learning_rate': 4.409937888198757e-08, 'completion_length': 75.39286041259766, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.535714328289032, 'reward_std': 0.2253357619047165, 'kl': 0.341796875, 'epoch': 4.78} 96%|█████████▌| 1539/1610 [12:02:55<12:37, 10.67s/it] 96%|█████████▌| 1540/1610 [12:03:05<12:00, 10.30s/it] {'loss': 0.0027, 'grad_norm': 1.1870450482542756, 'learning_rate': 4.347826086956521e-08, 'completion_length': 80.05357360839844, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.732142984867096, 'reward_std': 0.1071428619325161, 'kl': 0.06787109375, 'epoch': 4.78} 96%|█████████▌| 1540/1610 [12:03:05<12:00, 10.30s/it] 96%|█████████▌| 1541/1610 [12:03:14<11:26, 9.95s/it] {'loss': 0.0038, 'grad_norm': 2.2412689301341913, 'learning_rate': 4.285714285714286e-08, 'completion_length': 69.48214340209961, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.660714328289032, 'reward_std': 0.16546405106782913, 'kl': 0.095458984375, 'epoch': 4.79} 96%|█████████▌| 1541/1610 [12:03:14<11:26, 9.95s/it] 96%|█████████▌| 1542/1610 [12:03:24<11:26, 10.10s/it] {'loss': 0.0144, 'grad_norm': 2.48181197126554, 'learning_rate': 4.22360248447205e-08, 'completion_length': 89.4464340209961, 'rewards/accuracy_reward': 0.3750000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.3571429252624512, 'reward_std': 0.1428571529686451, 'kl': 0.36083984375, 'epoch': 4.79} 96%|█████████▌| 1542/1610 [12:03:24<11:26, 10.10s/it] 96%|█████████▌| 1543/1610 [12:03:37<12:06, 10.84s/it] {'loss': 0.0102, 'grad_norm': 1.661592515854065, 'learning_rate': 4.161490683229813e-08, 'completion_length': 89.78572082519531, 'rewards/accuracy_reward': 0.5000000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.1896214708685875, 'kl': 0.255859375, 'epoch': 4.79} 
96%|█████████▌| 1543/1610 [12:03:37<12:06, 10.84s/it] 96%|█████████▌| 1544/1610 [12:03:47<11:44, 10.68s/it] {'loss': 0.0056, 'grad_norm': 2.05895658237546, 'learning_rate': 4.0993788819875776e-08, 'completion_length': 70.85714721679688, 'rewards/accuracy_reward': 0.6428571790456772, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250000596046448, 'reward_std': 0.14838216453790665, 'kl': 0.14013671875, 'epoch': 4.8} 96%|█████████▌| 1544/1610 [12:03:47<11:44, 10.68s/it] 96%|█████████▌| 1545/1610 [12:03:58<11:34, 10.68s/it] {'loss': 0.0065, 'grad_norm': 1.0922724893440985, 'learning_rate': 4.0372670807453415e-08, 'completion_length': 81.73214721679688, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.0357142873108387, 'kl': 0.162841796875, 'epoch': 4.8} 96%|█████████▌| 1545/1610 [12:03:58<11:34, 10.68s/it] 96%|█████████▌| 1546/1610 [12:04:10<11:50, 11.09s/it] {'loss': 0.0145, 'grad_norm': 2.284424486365993, 'learning_rate': 3.9751552795031054e-08, 'completion_length': 97.33929061889648, 'rewards/accuracy_reward': 0.4464285969734192, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.410714328289032, 'reward_std': 0.20670336484909058, 'kl': 0.363525390625, 'epoch': 4.8} 96%|█████████▌| 1546/1610 [12:04:10<11:50, 11.09s/it] 96%|█████████▌| 1547/1610 [12:04:20<11:19, 10.79s/it] {'loss': 0.006, 'grad_norm': 3.0779286509636274, 'learning_rate': 3.91304347826087e-08, 'completion_length': 70.94643211364746, 'rewards/accuracy_reward': 0.7142857313156128, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.18409645557403564, 'kl': 0.150146484375, 'epoch': 4.8} 96%|█████████▌| 1547/1610 [12:04:20<11:19, 10.79s/it] 96%|█████████▌| 1548/1610 [12:04:30<10:52, 10.52s/it] {'loss': 0.0025, 'grad_norm': 1.336127959074169, 'learning_rate': 3.850931677018633e-08, 'completion_length': 86.91072082519531, 'rewards/accuracy_reward': 0.6071428656578064, 
'rewards/format_reward': 1.0, 'reward': 1.607142984867096, 'reward_std': 0.18409645557403564, 'kl': 0.0628662109375, 'epoch': 4.81} 96%|█████████▌| 1548/1610 [12:04:30<10:52, 10.52s/it] 96%|█████████▌| 1549/1610 [12:04:40<10:29, 10.32s/it] {'loss': 0.016, 'grad_norm': 2.656297480092272, 'learning_rate': 3.788819875776397e-08, 'completion_length': 78.12500381469727, 'rewards/accuracy_reward': 0.5535714477300644, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.5178571939468384, 'reward_std': 0.1071428619325161, 'kl': 0.39990234375, 'epoch': 4.81} 96%|█████████▌| 1549/1610 [12:04:40<10:29, 10.32s/it] 96%|█████████▋| 1550/1610 [12:04:52<10:43, 10.73s/it] {'loss': 0.0058, 'grad_norm': 3.1565401784138154, 'learning_rate': 3.726708074534162e-08, 'completion_length': 99.01786422729492, 'rewards/accuracy_reward': 0.4821428656578064, 'rewards/format_reward': 1.0, 'reward': 1.4821429252624512, 'reward_std': 0.15943220257759094, 'kl': 0.144775390625, 'epoch': 4.81} 96%|█████████▋| 1550/1610 [12:04:52<10:43, 10.73s/it] 96%|█████████▋| 1551/1610 [12:05:03<10:48, 10.99s/it] {'loss': 0.0051, 'grad_norm': 0.9007170157568085, 'learning_rate': 3.6645962732919256e-08, 'completion_length': 97.75000381469727, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.07695358991622925, 'kl': 0.126708984375, 'epoch': 4.82} 96%|█████████▋| 1551/1610 [12:05:03<10:48, 10.99s/it] 96%|█████████▋| 1552/1610 [12:05:14<10:31, 10.89s/it] {'loss': 0.0142, 'grad_norm': 2.756263598466851, 'learning_rate': 3.602484472049689e-08, 'completion_length': 90.25000381469727, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.446428656578064, 'reward_std': 0.21981073170900345, 'kl': 0.35400390625, 'epoch': 4.82} 96%|█████████▋| 1552/1610 [12:05:14<10:31, 10.89s/it] 96%|█████████▋| 1553/1610 [12:05:23<09:47, 10.30s/it] {'loss': 0.014, 'grad_norm': 1.9943257928641926, 'learning_rate': 
3.5403726708074535e-08, 'completion_length': 73.58929061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7142857909202576, 'reward_std': 0.11266788095235825, 'kl': 0.350341796875, 'epoch': 4.82} 96%|█████████▋| 1553/1610 [12:05:23<09:47, 10.30s/it] 97%|█████████▋| 1554/1610 [12:05:35<10:01, 10.75s/it] {'loss': 0.0028, 'grad_norm': 1.0561112316293892, 'learning_rate': 3.4782608695652174e-08, 'completion_length': 92.55357360839844, 'rewards/accuracy_reward': 0.5357142984867096, 'rewards/format_reward': 1.0, 'reward': 1.5357143878936768, 'reward_std': 0.0714285746216774, 'kl': 0.0697021484375, 'epoch': 4.83} 97%|█████████▋| 1554/1610 [12:05:35<10:01, 10.75s/it] 97%|█████████▋| 1555/1610 [12:05:45<09:40, 10.56s/it] {'loss': 0.0092, 'grad_norm': 2.1926440343496134, 'learning_rate': 3.416149068322981e-08, 'completion_length': 77.98214721679688, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.1539071872830391, 'kl': 0.22998046875, 'epoch': 4.83} 97%|█████████▋| 1555/1610 [12:05:45<09:40, 10.56s/it] 97%|█████████▋| 1556/1610 [12:05:55<09:23, 10.43s/it] {'loss': 0.0048, 'grad_norm': 0.9544392214809956, 'learning_rate': 3.3540372670807445e-08, 'completion_length': 87.60714721679688, 'rewards/accuracy_reward': 0.5892857611179352, 'rewards/format_reward': 1.0, 'reward': 1.5892857909202576, 'reward_std': 0.07695359364151955, 'kl': 0.11962890625, 'epoch': 4.83} 97%|█████████▋| 1556/1610 [12:05:55<09:23, 10.43s/it] 97%|█████████▋| 1557/1610 [12:06:06<09:23, 10.64s/it] {'loss': 0.0082, 'grad_norm': 1.514980298937874, 'learning_rate': 3.291925465838509e-08, 'completion_length': 86.53572082519531, 'rewards/accuracy_reward': 0.8035714626312256, 'rewards/format_reward': 1.0, 'reward': 1.8035715222358704, 'reward_std': 0.07695359364151955, 'kl': 0.2060546875, 'epoch': 4.84} 97%|█████████▋| 1557/1610 [12:06:06<09:23, 10.64s/it] 97%|█████████▋| 
1558/1610 [12:06:15<08:53, 10.25s/it] {'loss': 0.0063, 'grad_norm': 3.848755506780094, 'learning_rate': 3.229813664596273e-08, 'completion_length': 80.21429061889648, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6785714626312256, 'reward_std': 0.0714285746216774, 'kl': 0.158447265625, 'epoch': 4.84} 97%|█████████▋| 1558/1610 [12:06:15<08:53, 10.25s/it] 97%|█████████▋| 1559/1610 [12:06:26<08:56, 10.52s/it] {'loss': 0.0072, 'grad_norm': 2.291533355764639, 'learning_rate': 3.167701863354037e-08, 'completion_length': 94.9464340209961, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785714626312256, 'reward_std': 0.2253357619047165, 'kl': 0.1805419921875, 'epoch': 4.84} 97%|█████████▋| 1559/1610 [12:06:26<08:56, 10.52s/it] 97%|█████████▋| 1560/1610 [12:06:37<08:47, 10.54s/it] {'loss': 0.0071, 'grad_norm': 1.4119134466861698, 'learning_rate': 3.105590062111801e-08, 'completion_length': 81.07143020629883, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250001192092896, 'reward_std': 0.1181928962469101, 'kl': 0.177734375, 'epoch': 4.84} 97%|█████████▋| 1560/1610 [12:06:37<08:47, 10.54s/it] 97%|█████████▋| 1561/1610 [12:06:52<09:42, 11.89s/it] {'loss': 0.0222, 'grad_norm': 2.3964112481544926, 'learning_rate': 3.0434782608695655e-08, 'completion_length': 97.3214340209961, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.2610500454902649, 'kl': 0.5556640625, 'epoch': 4.85} 97%|█████████▋| 1561/1610 [12:06:52<09:42, 11.89s/it] 97%|█████████▋| 1562/1610 [12:07:02<08:55, 11.16s/it] {'loss': 0.0136, 'grad_norm': 1.6921628055511566, 'learning_rate': 2.981366459627329e-08, 'completion_length': 75.69643020629883, 'rewards/accuracy_reward': 0.8392857313156128, 'rewards/format_reward': 1.0, 'reward': 1.8392858505249023, 'reward_std': 0.1071428619325161, 
'kl': 0.33837890625, 'epoch': 4.85} 97%|█████████▋| 1562/1610 [12:07:02<08:55, 11.16s/it] 97%|█████████▋| 1563/1610 [12:07:11<08:19, 10.62s/it] {'loss': 0.0061, 'grad_norm': 3.1030777509670315, 'learning_rate': 2.9192546583850933e-08, 'completion_length': 76.6964340209961, 'rewards/accuracy_reward': 0.7857142984867096, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.7678571939468384, 'reward_std': 0.14838217198848724, 'kl': 0.15234375, 'epoch': 4.85} 97%|█████████▋| 1563/1610 [12:07:11<08:19, 10.62s/it] 97%|█████████▋| 1564/1610 [12:07:21<07:59, 10.42s/it] {'loss': 0.0083, 'grad_norm': 2.561567201939669, 'learning_rate': 2.857142857142857e-08, 'completion_length': 81.26786041259766, 'rewards/accuracy_reward': 0.6071428954601288, 'rewards/format_reward': 1.0, 'reward': 1.6071429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.2080078125, 'epoch': 4.86} 97%|█████████▋| 1564/1610 [12:07:21<07:59, 10.42s/it] 97%|█████████▋| 1565/1610 [12:07:31<07:43, 10.29s/it] {'loss': 0.0109, 'grad_norm': 2.614245059827418, 'learning_rate': 2.795031055900621e-08, 'completion_length': 75.6964340209961, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.2947069853544235, 'kl': 0.273193359375, 'epoch': 4.86} 97%|█████████▋| 1565/1610 [12:07:31<07:43, 10.29s/it] 97%|█████████▋| 1566/1610 [12:07:42<07:50, 10.70s/it] {'loss': 0.0086, 'grad_norm': 1.2598499522984528, 'learning_rate': 2.7329192546583847e-08, 'completion_length': 95.16071701049805, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535714626312256, 'reward_std': 0.1071428619325161, 'kl': 0.21435546875, 'epoch': 4.86} 97%|█████████▋| 1566/1610 [12:07:42<07:50, 10.70s/it] 97%|█████████▋| 1567/1610 [12:07:53<07:40, 10.71s/it] {'loss': 0.002, 'grad_norm': 0.8591148789862684, 'learning_rate': 2.670807453416149e-08, 'completion_length': 89.76786041259766, 
'rewards/accuracy_reward': 0.8392857611179352, 'rewards/format_reward': 1.0, 'reward': 1.8392857909202576, 'reward_std': 0.0357142873108387, 'kl': 0.0504150390625, 'epoch': 4.87} 97%|█████████▋| 1567/1610 [12:07:53<07:40, 10.71s/it] 97%|█████████▋| 1568/1610 [12:08:03<07:20, 10.49s/it] {'loss': 0.0034, 'grad_norm': 0.7588157048692561, 'learning_rate': 2.608695652173913e-08, 'completion_length': 84.83929061889648, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 1.0, 'reward': 1.7321429252624512, 'reward_std': 0.0357142873108387, 'kl': 0.08544921875, 'epoch': 4.87} 97%|█████████▋| 1568/1610 [12:08:03<07:20, 10.49s/it] 97%|█████████▋| 1569/1610 [12:08:14<07:10, 10.49s/it] {'loss': 0.0046, 'grad_norm': 1.3049176366204949, 'learning_rate': 2.5465838509316768e-08, 'completion_length': 82.94643020629883, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0824786126613617, 'kl': 0.11376953125, 'epoch': 4.87} 97%|█████████▋| 1569/1610 [12:08:14<07:10, 10.49s/it] 98%|█████████▊| 1570/1610 [12:08:29<08:02, 12.05s/it] {'loss': 0.0202, 'grad_norm': 0.4190172011210485, 'learning_rate': 2.4844720496894407e-08, 'completion_length': 90.98214721679688, 'rewards/accuracy_reward': 0.9285714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.9107143878936768, 'reward_std': 0.0357142873108387, 'kl': 0.50537109375, 'epoch': 4.88} 98%|█████████▊| 1570/1610 [12:08:29<08:02, 12.05s/it] 98%|█████████▊| 1571/1610 [12:08:40<07:29, 11.53s/it] {'loss': 0.0134, 'grad_norm': 2.310345814001422, 'learning_rate': 2.422360248447205e-08, 'completion_length': 72.50000381469727, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4821429252624512, 'reward_std': 0.14838216453790665, 'kl': 0.33642578125, 'epoch': 4.88} 98%|█████████▊| 1571/1610 [12:08:40<07:29, 11.53s/it] 98%|█████████▊| 1572/1610 [12:08:50<07:08, 11.29s/it] {'loss': 0.0038, 
'grad_norm': 1.5530166955111724, 'learning_rate': 2.360248447204969e-08, 'completion_length': 85.80357360839844, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.1071428619325161, 'kl': 0.09521484375, 'epoch': 4.88} 98%|█████████▊| 1572/1610 [12:08:50<07:08, 11.29s/it] 98%|█████████▊| 1573/1610 [12:09:01<06:48, 11.03s/it] {'loss': 0.0043, 'grad_norm': 1.1306716121031881, 'learning_rate': 2.2981366459627328e-08, 'completion_length': 84.26786422729492, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.1428571492433548, 'kl': 0.108154296875, 'epoch': 4.89} 98%|█████████▊| 1573/1610 [12:09:01<06:48, 11.03s/it] 98%|█████████▊| 1574/1610 [12:09:11<06:31, 10.87s/it] {'loss': 0.0035, 'grad_norm': 0.7802082945102631, 'learning_rate': 2.236024844720497e-08, 'completion_length': 94.0535774230957, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 1.0, 'reward': 1.6250000596046448, 'reward_std': 0.07695359364151955, 'kl': 0.0869140625, 'epoch': 4.89} 98%|█████████▊| 1574/1610 [12:09:11<06:31, 10.87s/it] 98%|█████████▊| 1575/1610 [12:09:26<07:01, 12.04s/it] {'loss': 0.0198, 'grad_norm': 1.522717812473059, 'learning_rate': 2.1739130434782606e-08, 'completion_length': 92.58929061889648, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.18658055365085602, 'kl': 0.496337890625, 'epoch': 4.89} 98%|█████████▊| 1575/1610 [12:09:26<07:01, 12.04s/it] 98%|█████████▊| 1576/1610 [12:09:35<06:17, 11.10s/it] {'loss': 0.0075, 'grad_norm': 2.6561073106906203, 'learning_rate': 2.111801242236025e-08, 'completion_length': 74.19643020629883, 'rewards/accuracy_reward': 0.5892857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.571428656578064, 'reward_std': 0.2142857238650322, 'kl': 0.1865234375, 'epoch': 4.89} 98%|█████████▊| 
1576/1610 [12:09:35<06:17, 11.10s/it] 98%|█████████▊| 1577/1610 [12:09:46<06:09, 11.20s/it] {'loss': 0.017, 'grad_norm': 1.7443070510970629, 'learning_rate': 2.0496894409937888e-08, 'completion_length': 83.12500381469727, 'rewards/accuracy_reward': 0.7321428954601288, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.696428656578064, 'reward_std': 0.1181928962469101, 'kl': 0.426513671875, 'epoch': 4.9} 98%|█████████▊| 1577/1610 [12:09:46<06:09, 11.20s/it] 98%|█████████▊| 1578/1610 [12:09:57<05:53, 11.04s/it] {'loss': 0.0064, 'grad_norm': 2.2137427749890835, 'learning_rate': 1.9875776397515527e-08, 'completion_length': 87.57143020629883, 'rewards/accuracy_reward': 0.517857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5000000596046448, 'reward_std': 0.19514650478959084, 'kl': 0.16064453125, 'epoch': 4.9} 98%|█████████▊| 1578/1610 [12:09:57<05:53, 11.04s/it] 98%|█████████▊| 1579/1610 [12:10:07<05:27, 10.58s/it] {'loss': 0.0033, 'grad_norm': 1.5439995021057549, 'learning_rate': 1.9254658385093166e-08, 'completion_length': 73.6785774230957, 'rewards/accuracy_reward': 0.910714328289032, 'rewards/format_reward': 1.0, 'reward': 1.9107143878936768, 'reward_std': 0.1071428619325161, 'kl': 0.083251953125, 'epoch': 4.9} 98%|█████████▊| 1579/1610 [12:10:07<05:27, 10.58s/it] 98%|█████████▊| 1580/1610 [12:10:17<05:11, 10.40s/it] {'loss': 0.0095, 'grad_norm': 1.8673108116449828, 'learning_rate': 1.863354037267081e-08, 'completion_length': 86.14286041259766, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.1428571529686451, 'kl': 0.236328125, 'epoch': 4.91} 98%|█████████▊| 1580/1610 [12:10:17<05:11, 10.40s/it] 98%|█████████▊| 1581/1610 [12:10:26<04:53, 10.11s/it] {'loss': 0.0086, 'grad_norm': 3.931247194519565, 'learning_rate': 1.8012422360248444e-08, 'completion_length': 85.8035774230957, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 
'reward': 1.6785715222358704, 'reward_std': 0.18409644439816475, 'kl': 0.21484375, 'epoch': 4.91} 98%|█████████▊| 1581/1610 [12:10:26<04:53, 10.11s/it] 98%|█████████▊| 1582/1610 [12:10:42<05:28, 11.73s/it] {'loss': 0.0213, 'grad_norm': 1.5209972129853253, 'learning_rate': 1.7391304347826087e-08, 'completion_length': 107.6785774230957, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.1785714328289032, 'kl': 0.5321044921875, 'epoch': 4.91} 98%|█████████▊| 1582/1610 [12:10:42<05:28, 11.73s/it] 98%|█████████▊| 1583/1610 [12:10:52<05:03, 11.23s/it] {'loss': 0.0093, 'grad_norm': 4.60436778493603, 'learning_rate': 1.6770186335403723e-08, 'completion_length': 79.19643020629883, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250001192092896, 'reward_std': 0.1896214708685875, 'kl': 0.233642578125, 'epoch': 4.92} 98%|█████████▊| 1583/1610 [12:10:52<05:03, 11.23s/it] 98%|█████████▊| 1584/1610 [12:11:06<05:16, 12.16s/it] {'loss': 0.0122, 'grad_norm': 1.251100734157806, 'learning_rate': 1.6149068322981365e-08, 'completion_length': 86.9285774230957, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6785715222358704, 'reward_std': 0.12974976375699043, 'kl': 0.304931640625, 'epoch': 4.92} 98%|█████████▊| 1584/1610 [12:11:06<05:16, 12.16s/it] 98%|█████████▊| 1585/1610 [12:11:19<05:08, 12.34s/it] {'loss': 0.0149, 'grad_norm': 1.2623177614353074, 'learning_rate': 1.5527950310559004e-08, 'completion_length': 81.78572082519531, 'rewards/accuracy_reward': 0.5714285969734192, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5535714626312256, 'reward_std': 0.1181928962469101, 'kl': 0.372314453125, 'epoch': 4.92} 98%|█████████▊| 1585/1610 [12:11:19<05:08, 12.34s/it] 99%|█████████▊| 1586/1610 [12:11:29<04:44, 11.84s/it] {'loss': 0.0029, 'grad_norm': 0.900759999594808, 'learning_rate': 
1.4906832298136644e-08, 'completion_length': 79.94643020629883, 'rewards/accuracy_reward': 0.4285714328289032, 'rewards/format_reward': 1.0, 'reward': 1.4285715222358704, 'reward_std': 0.11266788095235825, 'kl': 0.072265625, 'epoch': 4.93} 99%|█████████▊| 1586/1610 [12:11:29<04:44, 11.84s/it] 99%|█████████▊| 1587/1610 [12:11:42<04:39, 12.16s/it] {'loss': 0.0128, 'grad_norm': 1.6306576603238931, 'learning_rate': 1.4285714285714284e-08, 'completion_length': 97.03571701049805, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.696428656578064, 'reward_std': 0.1785714402794838, 'kl': 0.31884765625, 'epoch': 4.93} 99%|█████████▊| 1587/1610 [12:11:42<04:39, 12.16s/it] 99%|█████████▊| 1588/1610 [12:11:52<04:11, 11.43s/it] {'loss': 0.0124, 'grad_norm': 2.111041502005878, 'learning_rate': 1.3664596273291924e-08, 'completion_length': 73.46429061889648, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6250001192092896, 'reward_std': 0.24794266372919083, 'kl': 0.310546875, 'epoch': 4.93} 99%|█████████▊| 1588/1610 [12:11:52<04:11, 11.43s/it] 99%|█████████▊| 1589/1610 [12:12:02<03:53, 11.14s/it] {'loss': 0.0135, 'grad_norm': 1.8131740943111994, 'learning_rate': 1.3043478260869564e-08, 'completion_length': 76.25000381469727, 'rewards/accuracy_reward': 0.8214285969734192, 'rewards/format_reward': 1.0, 'reward': 1.8214285969734192, 'reward_std': 0.11266788095235825, 'kl': 0.33740234375, 'epoch': 4.93} 99%|█████████▊| 1589/1610 [12:12:02<03:53, 11.14s/it] 99%|█████████▉| 1590/1610 [12:12:18<04:08, 12.40s/it] {'loss': 0.0196, 'grad_norm': 2.0842096481802113, 'learning_rate': 1.2422360248447204e-08, 'completion_length': 99.75000381469727, 'rewards/accuracy_reward': 0.535714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.5178572535514832, 'reward_std': 0.14838216453790665, 'kl': 0.4891357421875, 'epoch': 4.94} 99%|█████████▉| 1590/1610 [12:12:18<04:08, 
12.40s/it] 99%|█████████▉| 1591/1610 [12:12:26<03:34, 11.28s/it] {'loss': 0.0155, 'grad_norm': 3.008843420379642, 'learning_rate': 1.1801242236024844e-08, 'completion_length': 62.83928871154785, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9642857313156128, 'reward': 1.6428571939468384, 'reward_std': 0.2253357470035553, 'kl': 0.38623046875, 'epoch': 4.94} 99%|█████████▉| 1591/1610 [12:12:26<03:34, 11.28s/it] 99%|█████████▉| 1592/1610 [12:12:37<03:17, 10.95s/it] {'loss': 0.0029, 'grad_norm': 13.27091935359041, 'learning_rate': 1.1180124223602485e-08, 'completion_length': 73.73214721679688, 'rewards/accuracy_reward': 0.7500000298023224, 'rewards/format_reward': 1.0, 'reward': 1.7500001192092896, 'reward_std': 0.18409645557403564, 'kl': 0.072509765625, 'epoch': 4.94} 99%|█████████▉| 1592/1610 [12:12:37<03:17, 10.95s/it] 99%|█████████▉| 1593/1610 [12:12:47<03:05, 10.92s/it] {'loss': 0.0352, 'grad_norm': 3.65627278180808, 'learning_rate': 1.0559006211180124e-08, 'completion_length': 87.28572082519531, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.18409645557403564, 'kl': 0.8828125, 'epoch': 4.95} 99%|█████████▉| 1593/1610 [12:12:48<03:05, 10.92s/it] 99%|█████████▉| 1594/1610 [12:12:57<02:49, 10.62s/it] {'loss': 0.005, 'grad_norm': 3.5678064548569592, 'learning_rate': 9.937888198757763e-09, 'completion_length': 85.71429061889648, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428571939468384, 'reward_std': 0.11266787722706795, 'kl': 0.1259765625, 'epoch': 4.95} 99%|█████████▉| 1594/1610 [12:12:57<02:49, 10.62s/it] 99%|█████████▉| 1595/1610 [12:13:12<02:56, 11.74s/it] {'loss': 0.0409, 'grad_norm': 3.2901248720044807, 'learning_rate': 9.316770186335404e-09, 'completion_length': 81.35714721679688, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9464285969734192, 'reward': 
1.607142984867096, 'reward_std': 0.3138462230563164, 'kl': 1.0185546875, 'epoch': 4.95} 99%|█████████▉| 1595/1610 [12:13:12<02:56, 11.74s/it] 99%|█████████▉| 1596/1610 [12:13:22<02:39, 11.43s/it] {'loss': 0.0117, 'grad_norm': 6.25294740367062, 'learning_rate': 8.695652173913043e-09, 'completion_length': 78.25000381469727, 'rewards/accuracy_reward': 0.7142857611179352, 'rewards/format_reward': 1.0, 'reward': 1.7142857909202576, 'reward_std': 0.0714285746216774, 'kl': 0.29345703125, 'epoch': 4.96} 99%|█████████▉| 1596/1610 [12:13:22<02:39, 11.43s/it] 99%|█████████▉| 1597/1610 [12:13:33<02:26, 11.26s/it] {'loss': 0.0041, 'grad_norm': 1.25112090317022, 'learning_rate': 8.074534161490683e-09, 'completion_length': 83.91071701049805, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.07695358991622925, 'kl': 0.103515625, 'epoch': 4.96} 99%|█████████▉| 1597/1610 [12:13:33<02:26, 11.26s/it] 99%|█████████▉| 1598/1610 [12:13:43<02:10, 10.89s/it] {'loss': 0.0021, 'grad_norm': 1.2134048930215522, 'learning_rate': 7.453416149068322e-09, 'completion_length': 75.60714721679688, 'rewards/accuracy_reward': 0.642857164144516, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.0824786126613617, 'kl': 0.0533447265625, 'epoch': 4.96} 99%|█████████▉| 1598/1610 [12:13:43<02:10, 10.89s/it] 99%|█████████▉| 1599/1610 [12:13:54<01:59, 10.85s/it] {'loss': 0.0038, 'grad_norm': 1.0548936579240205, 'learning_rate': 6.832298136645962e-09, 'completion_length': 94.25000762939453, 'rewards/accuracy_reward': 0.767857164144516, 'rewards/format_reward': 1.0, 'reward': 1.7678572535514832, 'reward_std': 0.1181928999722004, 'kl': 0.09521484375, 'epoch': 4.97} 99%|█████████▉| 1599/1610 [12:13:54<01:59, 10.85s/it] 99%|█████████▉| 1600/1610 [12:14:09<02:00, 12.04s/it] {'loss': 0.0136, 'grad_norm': 1.9921140266139763, 'learning_rate': 6.211180124223602e-09, 'completion_length': 92.44643020629883, 
'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6607143878936768, 'reward_std': 0.21124479547142982, 'kl': 0.33984375, 'epoch': 4.97} 99%|█████████▉| 1600/1610 [12:14:09<02:00, 12.04s/it] 99%|█████████▉| 1601/1610 [12:17:08<09:20, 62.26s/it] {'loss': 0.0023, 'grad_norm': 1.8130338631491425, 'learning_rate': 5.5900621118012426e-09, 'completion_length': 78.3214340209961, 'rewards/accuracy_reward': 0.5535714626312256, 'rewards/format_reward': 1.0, 'reward': 1.5535714626312256, 'reward_std': 0.1896214708685875, 'kl': 0.0579833984375, 'epoch': 4.97} 99%|█████████▉| 1601/1610 [12:17:08<09:20, 62.26s/it] 100%|█████████▉| 1602/1610 [12:17:19<06:14, 46.83s/it] {'loss': 0.0095, 'grad_norm': 1.730323269281731, 'learning_rate': 4.968944099378882e-09, 'completion_length': 90.37500381469727, 'rewards/accuracy_reward': 0.6964285969734192, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.1071428619325161, 'kl': 0.237548828125, 'epoch': 4.98} 100%|█████████▉| 1602/1610 [12:17:19<06:14, 46.83s/it] 100%|█████████▉| 1603/1610 [12:17:30<04:12, 36.09s/it] {'loss': 0.0251, 'grad_norm': 2.7183833689618004, 'learning_rate': 4.347826086956522e-09, 'completion_length': 83.26786041259766, 'rewards/accuracy_reward': 0.4642857313156128, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.4464285969734192, 'reward_std': 0.20670336857438087, 'kl': 0.62744140625, 'epoch': 4.98} 100%|█████████▉| 1603/1610 [12:17:30<04:12, 36.09s/it] 100%|█████████▉| 1604/1610 [12:17:41<02:51, 28.56s/it] {'loss': 0.0055, 'grad_norm': 1.0180480727224637, 'learning_rate': 3.726708074534161e-09, 'completion_length': 87.12500381469727, 'rewards/accuracy_reward': 0.7500000596046448, 'rewards/format_reward': 1.0, 'reward': 1.7500000596046448, 'reward_std': 0.04123930633068085, 'kl': 0.138671875, 'epoch': 4.98} 100%|█████████▉| 1604/1610 [12:17:41<02:51, 28.56s/it] 100%|█████████▉| 1605/1610 [12:17:58<02:04, 24.92s/it] {'loss': 0.0317, 
'grad_norm': 2.1882383379884427, 'learning_rate': 3.105590062111801e-09, 'completion_length': 111.5535774230957, 'rewards/accuracy_reward': 0.5000000149011612, 'rewards/format_reward': 0.9464285969734192, 'reward': 1.446428656578064, 'reward_std': 0.20670336857438087, 'kl': 0.79296875, 'epoch': 4.98} 100%|█████████▉| 1605/1610 [12:17:58<02:04, 24.92s/it] 100%|█████████▉| 1606/1610 [12:18:08<01:22, 20.54s/it] {'loss': 0.0027, 'grad_norm': 2.439852621751633, 'learning_rate': 2.484472049689441e-09, 'completion_length': 83.00000381469727, 'rewards/accuracy_reward': 0.6964286267757416, 'rewards/format_reward': 1.0, 'reward': 1.696428656578064, 'reward_std': 0.21981073170900345, 'kl': 0.067626953125, 'epoch': 4.99} 100%|█████████▉| 1606/1610 [12:18:08<01:22, 20.54s/it] 100%|█████████▉| 1607/1610 [12:18:17<00:50, 16.99s/it] {'loss': 0.0083, 'grad_norm': 2.187170068543717, 'learning_rate': 1.8633540372670804e-09, 'completion_length': 64.50000381469727, 'rewards/accuracy_reward': 0.6785714626312256, 'rewards/format_reward': 1.0, 'reward': 1.6785715222358704, 'reward_std': 0.0714285746216774, 'kl': 0.205810546875, 'epoch': 4.99} 100%|█████████▉| 1607/1610 [12:18:17<00:50, 16.99s/it] 100%|█████████▉| 1608/1610 [12:18:26<00:29, 14.76s/it] {'loss': 0.0144, 'grad_norm': 1.5912631728986255, 'learning_rate': 1.2422360248447204e-09, 'completion_length': 76.46429061889648, 'rewards/accuracy_reward': 0.6428571939468384, 'rewards/format_reward': 1.0, 'reward': 1.6428572535514832, 'reward_std': 0.11266787722706795, 'kl': 0.357421875, 'epoch': 4.99} 100%|█████████▉| 1608/1610 [12:18:26<00:29, 14.76s/it] 100%|█████████▉| 1609/1610 [12:18:39<00:14, 14.20s/it] {'loss': 0.0157, 'grad_norm': 0.9095838290645217, 'learning_rate': 6.211180124223602e-10, 'completion_length': 98.1964340209961, 'rewards/accuracy_reward': 0.6250000298023224, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6071429252624512, 'reward_std': 0.10410194098949432, 'kl': 0.3931884765625, 'epoch': 5.0} 
100%|█████████▉| 1609/1610 [12:18:39<00:14, 14.20s/it]
100%|██████████| 1610/1610 [12:18:51<00:00, 13.60s/it]
{'loss': 0.0128, 'grad_norm': 3.4635786935303896, 'learning_rate': 0.0, 'completion_length': 82.12500381469727, 'rewards/accuracy_reward': 0.660714328289032, 'rewards/format_reward': 0.9821428656578064, 'reward': 1.6428572535514832, 'reward_std': 0.28365693986415863, 'kl': 0.318115234375, 'epoch': 5.0}
100%|██████████| 1610/1610 [12:18:51<00:00, 13.60s/it]
{'train_runtime': 44499.4805, 'train_samples_per_second': 0.507, 'train_steps_per_second': 0.036, 'train_loss': 0.007075942645478701, 'epoch': 5.0}
100%|██████████| 1610/1610 [12:21:30<00:00, 13.60s/it]
100%|██████████| 1610/1610 [12:21:30<00:00, 27.63s/it]
wandb:
wandb: 🚀 View run VLLM-Correct-Qwen2-VL-7B-GRPO-GEOQA-4k5-2025-02-21-20-41-14 at: https://wandb.ai/tanhuajie264-peking-university/vison-open-r1/runs/nrvnbvl0
wandb: Find logs at: wandb/run-20250221_204845-nrvnbvl0/logs