Bug 8056 - [ModelScope][Container image] The tensorboard used by default in the image appears to be missing a dependency, causing ModelScope community test cases to fail.
Summary: [ModelScope][Container image] The tensorboard used by default in the image appears to be missing a dependency, causing ModelScope community test cases to fail.
Status: NEW
Alias: None
Product: Anolis OS 8
Classification: Anolis OS
Component: Images&Installations
Version: 8.8
Hardware: All
OS: Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: zhongling
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-25 23:02 UTC by yunmeng365524
Modified: 2024-01-29 15:09 UTC
CC List: 1 user

See Also:


Attachments

Description yunmeng365524 2024-01-25 23:02:44 UTC
Description of problem:
The tensorboard used by default in the image appears to be missing a dependency, causing ModelScope community test cases to fail.
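
Judging from the tracebacks below, the failure can likely be reproduced without running the full test suite; a minimal check inside the container would be (sketch, assuming the image's default python3 from /opt/conda):

python3 -c "from torch.utils.tensorboard import SummaryWriter"
# expected to fail with the same chain: tensorboard falls back to importing tensorflow,
# which imports h5py, which raises AttributeError: module 'numpy' has no attribute 'typeDict'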

Version-Release number of selected component (if applicable):
Container image information:
[root@localhost modelscope]# docker images
REPOSITORY                                     TAG          IMAGE ID       CREATED       SIZE
registry.openanolis.cn/openanolis/modelscope   1.10.0-an8   736efcaf9a7c   5 weeks ago   25.3GB


How reproducible:
Create and start the container with GPU support:
docker create --gpus all -it -v /tmp:/tmp 736efcaf9a7c
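
The docker create command above only creates the container; a sketch of the remaining steps to start and enter it (here <container-id> stands for the ID printed by docker create, and bash is assumed to be the image's shell):

docker start <container-id>
docker exec -it <container-id> bash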
After entering the container, clone the ModelScope community code:
git clone https://github.com/modelscope/modelscope.git
cd modelscope

Run the test case:
[root@1a186991ac23 modelscope]# python3 tests/trainers/test_translation_evaluation_trainer.py
2024-01-25 22:52:38,805 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2024-01-25 22:52:38,809 - modelscope - INFO - TensorFlow version 2.9.2 Found.
2024-01-25 22:52:38,809 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-01-25 22:52:38,850 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 da292646adab6334a8f0cd8a272bf9b1 and a total number of 946 components indexed
2024-01-25 22:52:45,061 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:52:45,942 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
/opt/conda/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:326: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  np.bool8: (False, True),
/opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/dtypes.py:205: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  np.bool8: (False, True),
<frozen importlib._bootstrap>:219: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility. Expected 80 from C header, got 96 from PyObject
2024-01-25 22:52:47,942 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:52:48,687 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
E2024-01-25 22:52:49,647 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:52:50,525 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:52:51,175 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:52:51,861 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
E
======================================================================
ERROR: test_run_with_unite_mup_base (__main__.TranslationEvaluationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/opt/conda/lib/python3.8/site-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tests/trainers/test_translation_evaluation_trainer.py", line 37, in test_run_with_unite_mup_base
    trainer = build_trainer(name=self.name, default_args=default_args)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/trainers/builder.py", line 39, in build_trainer
    return build_from_cfg(cfg, TRAINERS, default_args=default_args)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 184, in build_from_cfg
    LazyImportModule.import_module(sig)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/import_utils.py", line 475, in import_module
    importlib.import_module(module_name)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/modelscope/trainers/nlp/translation_evaluation_trainer.py", line 17, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py", line 12, in <module>
    from .writer import FileWriter, SummaryWriter  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 16, in <module>
    from ._embedding import (
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/_embedding.py", line 9, in <module>
    _HAS_GFILE_JOIN = hasattr(tf.io.gfile, "join")
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/lazy.py", line 65, in __getattr__
    return getattr(load_once(self), attr_name)
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/lazy.py", line 97, in wrapper
    cache[arg] = f(arg)
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/lazy.py", line 50, in load_once
    module = load_fn()
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/__init__.py", line 45, in tf
    import tensorflow
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 45, in <module>
    from tensorflow.python.feature_column import feature_column_lib as feature_column
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/feature_column/feature_column_lib.py", line 18, in <module>
    from tensorflow.python.feature_column.feature_column import *
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/feature_column/feature_column.py", line 143, in <module>
    from tensorflow.python.layers import base
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/layers/base.py", line 16, in <module>
    from tensorflow.python.keras.legacy_tf_layers import base
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/__init__.py", line 25, in <module>
    from tensorflow.python.keras import models
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/models.py", line 22, in <module>
    from tensorflow.python.keras.engine import functional
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/functional.py", line 32, in <module>
    from tensorflow.python.keras.engine import training as training_lib
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 52, in <module>
    from tensorflow.python.keras.saving import hdf5_format
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 37, in <module>
    import h5py
  File "/opt/conda/lib/python3.8/site-packages/h5py/__init__.py", line 46, in <module>
    from ._conv import register_converters as _register_converters
  File "h5py/h5t.pxd", line 14, in init h5py._conv
  File "h5py/h5t.pyx", line 293, in init h5py.h5t
  File "/opt/conda/lib/python3.8/site-packages/numpy/__init__.py", line 320, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'typeDict'

======================================================================
ERROR: test_run_with_unite_mup_large (__main__.TranslationEvaluationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/opt/conda/lib/python3.8/site-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tests/trainers/test_translation_evaluation_trainer.py", line 31, in test_run_with_unite_mup_large
    trainer = build_trainer(name=self.name, default_args=default_args)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/trainers/builder.py", line 39, in build_trainer
    return build_from_cfg(cfg, TRAINERS, default_args=default_args)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 184, in build_from_cfg
    LazyImportModule.import_module(sig)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/import_utils.py", line 475, in import_module
    importlib.import_module(module_name)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/modelscope/trainers/nlp/translation_evaluation_trainer.py", line 17, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py", line 12, in <module>
    from .writer import FileWriter, SummaryWriter  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 16, in <module>
    from ._embedding import (
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/_embedding.py", line 9, in <module>
    _HAS_GFILE_JOIN = hasattr(tf.io.gfile, "join")
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/lazy.py", line 65, in __getattr__
    return getattr(load_once(self), attr_name)
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/lazy.py", line 97, in wrapper
    cache[arg] = f(arg)
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/lazy.py", line 50, in load_once
    module = load_fn()
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/__init__.py", line 45, in tf
    import tensorflow
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 45, in <module>
    from tensorflow.python.feature_column import feature_column_lib as feature_column
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/feature_column/feature_column_lib.py", line 18, in <module>
    from tensorflow.python.feature_column.feature_column import *
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/feature_column/feature_column.py", line 143, in <module>
    from tensorflow.python.layers import base
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/layers/base.py", line 16, in <module>
    from tensorflow.python.keras.legacy_tf_layers import base
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/__init__.py", line 25, in <module>
    from tensorflow.python.keras import models
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/models.py", line 22, in <module>
    from tensorflow.python.keras.engine import functional
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/functional.py", line 32, in <module>
    from tensorflow.python.keras.engine import training as training_lib
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 52, in <module>
    from tensorflow.python.keras.saving import hdf5_format
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 37, in <module>
    import h5py
  File "/opt/conda/lib/python3.8/site-packages/h5py/__init__.py", line 46, in <module>
    from ._conv import register_converters as _register_converters
  File "h5py/h5t.pxd", line 14, in init h5py._conv
  File "h5py/h5t.pyx", line 293, in init h5py.h5t
  File "/opt/conda/lib/python3.8/site-packages/numpy/__init__.py", line 320, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'typeDict'

----------------------------------------------------------------------
Ran 2 tests in 7.831s

FAILED (errors=2)
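
The ImportError for 'notf' at the top of each traceback is tensorboard's normal fallback path (it then imports the real tensorflow), so the actual failure is the final AttributeError: np.typeDict was removed in NumPy 1.24, and the h5py build shipped in the image appears to predate that removal (the "numpy.ndarray size changed, may indicate binary incompatibility" warning above points the same way). An untested sketch of a check and possible workaround inside the container (assuming pip is available in the /opt/conda environment):

python3 -c "import numpy; print(numpy.__version__)"    # expected to report 1.24 or newer
pip3 install "numpy<1.24"                              # candidate workaround: pin NumPy below 1.24
python3 tests/trainers/test_translation_evaluation_trainer.py    # re-run the failing test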


Steps to Reproduce:
As above.

Actual results:
The test fails.

Expected results:
The test passes.

Additional info:
For comparison, the same test on the official ModelScope Ubuntu image:
root@ee8906738fa2:/tmp/modelscope# python3 tests/trainers/test_translation_evaluation_trainer.py
2024-01-25 22:53:33,569 - modelscope - INFO - PyTorch version 2.1.0+cu118 Found.
2024-01-25 22:53:33,571 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2024-01-25 22:53:33,571 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer
2024-01-25 22:53:33,616 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 44f0b88effe82ceea94a98cf99709694 and a total number of 946 components indexed
2024-01-25 22:53:41,046 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:53:41,904 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
/opt/conda/lib/python3.10/site-packages/tensorflow/__init__.py:29: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
  import distutils as _distutils
2024-01-25 22:53:42.491129: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-25 22:53:42.491173: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-25 22:53:42.491216: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-25 22:53:42.501214: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/opt/conda/lib/python3.10/site-packages/tensorflow/python/framework/dtypes.py:35: DeprecationWarning: ml_dtypes.float8_e4m3b11 is deprecated. Use ml_dtypes.float8_e4m3b11fnuz
  from tensorflow.tsl.python.lib.core import pywrap_ml_dtypes
2024-01-25 22:53:43.462065: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-25 22:53:44,670 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:53:45,076 - modelscope - INFO - initialize model from /mnt/workspace/.cache/modelscope/damo/nlp_unite_mup_translation_evaluation_multilingual_base
2024-01-25 22:53:53,190 - modelscope - INFO - ==========================Training Config Start==========================
2024-01-25 22:53:53,191 - modelscope - INFO - {
    "framework": "pytorch",
    "task": "translation-evaluation",
    "pipeline": {
        "type": "translation-evaluation"
    },
    "preprocessor": {
        "type": "translation-evaluation-preprocessor",
        "max_len": 510,
        "pad_token_id": 1,
        "eos_token_id": 2
    },
    "model": {
        "attention_probs_dropout_prob": 0.1,
        "bos_token_id": 0,
        "eos_token_id": 2,
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.1,
        "hidden_size": 768,
        "initializer_range": 0.02,
        "intermediate_size": 3072,
        "layer_norm_eps": 1e-05,
        "max_position_embeddings": 514,
        "model_type": "unite",
        "num_attention_heads": 12,
        "num_hidden_layers": 12,
        "output_past": true,
        "pad_token_id": 1,
        "type_vocab_size": 1,
        "use_cache": true,
        "vocab_size": 250002,
        "mlp_hidden_sizes": [
            3072,
            1024
        ],
        "mlp_act": "tanh",
        "mlp_final_act": null,
        "mlp_dropout": 0.1,
        "type": "unite"
    },
    "dataset": {
        "train": {
            "name": "train.csv",
            "split": "train"
        },
        "valid": {
            "name": "eval.csv",
            "split": "eval"
        }
    },
    "train": {
        "initialize_model_with_checkpoint": true,
        "num_gpus": 1,
        "batch_size": 2,
        "seed": 12,
        "optimizer": {
            "type": "AdamW",
            "plm_lr": 1e-05,
            "betas": [
                0.9,
                0.98
            ],
            "eps": 1e-09,
            "weight_decay": 0.0,
            "plm_lr_layerwise_decay": 0.95,
            "mlp_lr": 3e-05,
            "options": {
                "cumulative_iters": 4,
                "grad_clip": null
            }
        },
        "lr_scheduler": {
            "type": "ConstantLR",
            "factor": 1.0,
            "total_iters": 3
        },
        "max_epochs": 3,
        "work_dir": "experiments_unite_base/",
        "hooks": [
            {
                "type": "TensorboardHook",
                "interval": 1
            },
            {
                "type": "IterTimerHook"
            }
        ],
        "logging": {
            "interval": 1
        },
        "checkpoint": {
            "best": {
                "metric_key": "src-ref_avg",
                "rule": "max"
            },
            "period": {
                "interval": 1
            }
        }
    },
    "evaluation": {
        "batch_size": 4,
        "save_outputs": true,
        "metrics": [
            {
                "type": "translation-evaluation-metric",
                "gap_threshold": 25.0
            }
        ],
        "period": {
            "interval": 1
        }
    }
}
2024-01-25 22:53:53,191 - modelscope - INFO - ===========================Training Config End===========================
2024-01-25 22:53:53,191 - modelscope - INFO - Building dataloader for training ...
2024-01-25 22:53:53,191 - modelscope - INFO - Reading train csv file from train.csv ...
/opt/conda/lib/python3.10/site-packages/datasets/load.py:2096: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification_mode' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=no_checks' instead.
  warnings.warn(
2024-01-25 22:53:54,475 - modelscope - INFO - 109 samples are given for training. Using 36 samples for each input format. Leaving the last 1 samples unused.
2024-01-25 22:53:54,476 - modelscope - INFO - Reading done, 109 items in total
2024-01-25 22:53:54,476 - modelscope - INFO - Building AdamW optimizer ...
2024-01-25 22:53:54,478 - modelscope - WARNING - ('LR_SCHEDULER', 'default', 'ConstantLR') not found in ast index file
2024-01-25 22:53:54,479 - modelscope - INFO - Stage: before_run:
    (ABOVE_NORMAL) OptimizerHook
    (LOW         ) LrSchedulerHook
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: before_train_epoch:
    (LOW         ) LrSchedulerHook
 --------------------
Stage: before_train_iter:
    (ABOVE_NORMAL) OptimizerHook
 --------------------
Stage: after_train_iter:
    (ABOVE_NORMAL) OptimizerHook
    (NORMAL      ) EvaluationHook
    (LOW         ) LrSchedulerHook
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: after_train_epoch:
    (NORMAL      ) EvaluationHook
    (LOW         ) LrSchedulerHook
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: after_val_epoch:
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: after_run:
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TensorboardHook
 --------------------
2024-01-25 22:53:54,495 - modelscope - INFO - Checkpoints will be saved to experiments_unite_base/
2024-01-25 22:53:54,509 - modelscope - INFO - Checkpoints will be saved to experiments_unite_base/
2024-01-25 22:53:54,509 - modelscope - INFO - Text logs will be saved to experiments_unite_base/
2024-01-25 22:53:54,509 - modelscope - INFO - tensorboard files will be saved to experiments_unite_base/tensorboard_output
2024-01-25 22:53:57,830 - modelscope - INFO - epoch [1][1/18]	lr: 1.000e-05, eta: 0:02:52, iter_time: 3.246, data_load_time: 2.394, memory: 1639, loss: 1.5867
2024-01-25 22:53:58,017 - modelscope - INFO - epoch [1][2/18]	lr: 1.000e-05, eta: 0:01:28, iter_time: 0.173, data_load_time: 0.077, memory: 2315, loss: 0.4219
2024-01-25 22:53:58,131 - modelscope - INFO - epoch [1][3/18]	lr: 1.000e-05, eta: 0:01:00, iter_time: 0.144, data_load_time: 0.089, memory: 2315, loss: 1.2569
2024-01-25 22:53:58,306 - modelscope - INFO - epoch [1][4/18]	lr: 1.000e-05, eta: 0:00:46, iter_time: 0.147, data_load_time: 0.061, memory: 2410, loss: 0.9410
2024-01-25 22:53:58,539 - modelscope - INFO - epoch [1][5/18]	lr: 1.000e-05, eta: 0:00:38, iter_time: 0.185, data_load_time: 0.087, memory: 3746, loss: 1.7758
2024-01-25 22:53:58,648 - modelscope - INFO - epoch [1][6/18]	lr: 1.000e-05, eta: 0:00:32, iter_time: 0.189, data_load_time: 0.134, memory: 3746, loss: 1.5786
2024-01-25 22:53:58,832 - modelscope - INFO - epoch [1][7/18]	lr: 1.000e-05, eta: 0:00:28, iter_time: 0.137, data_load_time: 0.056, memory: 3746, loss: 1.5095
2024-01-25 22:53:58,972 - modelscope - INFO - epoch [1][8/18]	lr: 1.000e-05, eta: 0:00:25, iter_time: 0.162, data_load_time: 0.101, memory: 3746, loss: 1.4335
2024-01-25 22:53:59,087 - modelscope - INFO - epoch [1][9/18]	lr: 1.000e-05, eta: 0:00:22, iter_time: 0.134, data_load_time: 0.079, memory: 3746, loss: 0.2827
2024-01-25 22:53:59,195 - modelscope - INFO - epoch [1][10/18]	lr: 1.000e-05, eta: 0:00:20, iter_time: 0.114, data_load_time: 0.061, memory: 3746, loss: 0.7218
2024-01-25 22:53:59,294 - modelscope - INFO - epoch [1][11/18]	lr: 1.000e-05, eta: 0:00:18, iter_time: 0.104, data_load_time: 0.055, memory: 3746, loss: 1.3647
2024-01-25 22:53:59,465 - modelscope - INFO - epoch [1][12/18]	lr: 1.000e-05, eta: 0:00:16, iter_time: 0.122, data_load_time: 0.049, memory: 3746, loss: 1.1365
2024-01-25 22:53:59,551 - modelscope - INFO - epoch [1][13/18]	lr: 1.000e-05, eta: 0:00:15, iter_time: 0.143, data_load_time: 0.099, memory: 3746, loss: 1.1352
2024-01-25 22:53:59,719 - modelscope - INFO - epoch [1][14/18]	lr: 1.000e-05, eta: 0:00:14, iter_time: 0.118, data_load_time: 0.043, memory: 3746, loss: 0.3775
2024-01-25 22:53:59,865 - modelscope - INFO - epoch [1][15/18]	lr: 1.000e-05, eta: 0:00:13, iter_time: 0.158, data_load_time: 0.091, memory: 3746, loss: 0.9682
2024-01-25 22:54:00,055 - modelscope - INFO - epoch [1][16/18]	lr: 1.000e-05, eta: 0:00:12, iter_time: 0.160, data_load_time: 0.080, memory: 3746, loss: 0.8070
2024-01-25 22:54:00,176 - modelscope - INFO - epoch [1][17/18]	lr: 1.000e-05, eta: 0:00:12, iter_time: 0.165, data_load_time: 0.110, memory: 3746, loss: 2.2128
2024-01-25 22:54:00,270 - modelscope - INFO - epoch [1][18/18]	lr: 1.000e-05, eta: 0:00:11, iter_time: 0.114, data_load_time: 0.066, memory: 3746, loss: 5.4629
2024-01-25 22:54:00,330 - modelscope - INFO - Building dataloader for evaluating ...
2024-01-25 22:54:00,331 - modelscope - INFO - Reading eval csv file from eval.csv ...
2024-01-25 22:54:01,364 - modelscope - INFO - Reading done, 120 items in total
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:04<00:00, 29.08it/s]
2024-01-25 22:54:05,495 - modelscope - INFO - Evaluation results for src-ref input format
2024-01-25 22:54:05,508 - modelscope - INFO - 	zh-en: 33.061224
2024-01-25 22:54:05,508 - modelscope - INFO - Average evaluation result for src-ref input format: 0.330612
2024-01-25 22:54:05,508 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:03<00:00, 34.11it/s]
2024-01-25 22:54:09,029 - modelscope - INFO - Evaluation results for src input format
2024-01-25 22:54:09,040 - modelscope - INFO - 	zh-en: 40.408163
2024-01-25 22:54:09,041 - modelscope - INFO - Average evaluation result for src input format: 0.404082
2024-01-25 22:54:09,041 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:03<00:00, 33.15it/s]
2024-01-25 22:54:12,664 - modelscope - INFO - Evaluation results for ref input format
2024-01-25 22:54:12,675 - modelscope - INFO - 	zh-en: 7.755102
2024-01-25 22:54:12,675 - modelscope - INFO - Average evaluation result for ref input format: 0.077551
2024-01-25 22:54:12,675 - modelscope - INFO -
2024-01-25 22:54:12,676 - modelscope - INFO - Saving checkpoint at 1 epoch
2024-01-25 22:54:17,963 - modelscope - INFO - Saving checkpoint at 1 epoch
2024-01-25 22:54:23,228 - modelscope - INFO - epoch(eval) [1][30]	memory: 3746, evaluation/src-ref_avg: 0.3306, evaluation/src-ref_zh-en: 0.3306, evaluation/src_avg: 0.4041, evaluation/src_zh-en: 0.4041, evaluation/ref_avg: 0.0776, evaluation/ref_zh-en: 0.0776
2024-01-25 22:54:25,752 - modelscope - INFO - epoch [2][1/18]	lr: 1.000e-05, eta: 0:00:15, iter_time: 2.463, data_load_time: 2.399, memory: 3746, loss: 2.6457
2024-01-25 22:54:25,987 - modelscope - INFO - epoch [2][2/18]	lr: 1.000e-05, eta: 0:00:14, iter_time: 0.165, data_load_time: 0.062, memory: 3746, loss: 0.8232
2024-01-25 22:54:26,108 - modelscope - INFO - epoch [2][3/18]	lr: 1.000e-05, eta: 0:00:13, iter_time: 0.187, data_load_time: 0.132, memory: 3746, loss: 0.6192
2024-01-25 22:54:26,217 - modelscope - INFO - epoch [2][4/18]	lr: 1.000e-05, eta: 0:00:12, iter_time: 0.121, data_load_time: 0.067, memory: 3746, loss: 0.1971
2024-01-25 22:54:26,324 - modelscope - INFO - epoch [2][5/18]	lr: 1.000e-05, eta: 0:00:11, iter_time: 0.106, data_load_time: 0.053, memory: 3746, loss: 1.7651
2024-01-25 22:54:26,458 - modelscope - INFO - epoch [2][6/18]	lr: 1.000e-05, eta: 0:00:11, iter_time: 0.113, data_load_time: 0.053, memory: 3746, loss: 0.4991
2024-01-25 22:54:26,694 - modelscope - INFO - epoch [2][7/18]	lr: 1.000e-05, eta: 0:00:10, iter_time: 0.174, data_load_time: 0.075, memory: 3747, loss: 1.0642
2024-01-25 22:54:26,820 - modelscope - INFO - epoch [2][8/18]	lr: 1.000e-05, eta: 0:00:09, iter_time: 0.195, data_load_time: 0.136, memory: 3747, loss: 0.3571
2024-01-25 22:54:26,969 - modelscope - INFO - epoch [2][9/18]	lr: 1.000e-05, eta: 0:00:09, iter_time: 0.134, data_load_time: 0.066, memory: 3747, loss: 0.9691
2024-01-25 22:54:27,085 - modelscope - INFO - epoch [2][10/18]	lr: 1.000e-05, eta: 0:00:08, iter_time: 0.134, data_load_time: 0.082, memory: 3747, loss: 0.4898
2024-01-25 22:54:27,175 - modelscope - INFO - epoch [2][11/18]	lr: 1.000e-05, eta: 0:00:08, iter_time: 0.109, data_load_time: 0.063, memory: 3747, loss: 5.4889
2024-01-25 22:54:27,283 - modelscope - INFO - epoch [2][12/18]	lr: 1.000e-05, eta: 0:00:07, iter_time: 0.098, data_load_time: 0.044, memory: 3747, loss: 1.4372
2024-01-25 22:54:27,432 - modelscope - INFO - epoch [2][13/18]	lr: 1.000e-05, eta: 0:00:07, iter_time: 0.121, data_load_time: 0.054, memory: 3747, loss: 0.5515
2024-01-25 22:54:27,603 - modelscope - INFO - epoch [2][14/18]	lr: 1.000e-05, eta: 0:00:06, iter_time: 0.154, data_load_time: 0.082, memory: 3747, loss: 1.0365
2024-01-25 22:54:27,719 - modelscope - INFO - epoch [2][15/18]	lr: 1.000e-05, eta: 0:00:06, iter_time: 0.153, data_load_time: 0.099, memory: 3747, loss: 0.4552
2024-01-25 22:54:27,823 - modelscope - INFO - epoch [2][16/18]	lr: 1.000e-05, eta: 0:00:06, iter_time: 0.114, data_load_time: 0.061, memory: 3747, loss: 1.9504
2024-01-25 22:54:28,067 - modelscope - INFO - epoch [2][17/18]	lr: 1.000e-05, eta: 0:00:05, iter_time: 0.155, data_load_time: 0.052, memory: 4064, loss: 0.7299
2024-01-25 22:54:28,256 - modelscope - INFO - epoch [2][18/18]	lr: 1.000e-05, eta: 0:00:05, iter_time: 0.221, data_load_time: 0.141, memory: 4064, loss: 0.1739
2024-01-25 22:54:28,326 - modelscope - INFO - Building dataloader for evaluating ...
2024-01-25 22:54:28,326 - modelscope - INFO - Reading done, 120 items in total
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:04<00:00, 29.07it/s]
2024-01-25 22:54:32,458 - modelscope - INFO - Evaluation results for src-ref input format
2024-01-25 22:54:32,470 - modelscope - INFO - 	zh-en: 33.061224
2024-01-25 22:54:32,470 - modelscope - INFO - Average evaluation result for src-ref input format: 0.330612
2024-01-25 22:54:32,470 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:03<00:00, 34.04it/s]
2024-01-25 22:54:35,998 - modelscope - INFO - Evaluation results for src input format
2024-01-25 22:54:36,010 - modelscope - INFO - 	zh-en: 39.591837
2024-01-25 22:54:36,010 - modelscope - INFO - Average evaluation result for src input format: 0.395918
2024-01-25 22:54:36,010 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:03<00:00, 33.36it/s]
2024-01-25 22:54:39,610 - modelscope - INFO - Evaluation results for ref input format
2024-01-25 22:54:39,621 - modelscope - INFO - 	zh-en: 23.265306
2024-01-25 22:54:39,621 - modelscope - INFO - Average evaluation result for ref input format: 0.232653
2024-01-25 22:54:39,621 - modelscope - INFO -
2024-01-25 22:54:39,622 - modelscope - INFO - Saving checkpoint at 2 epoch
2024-01-25 22:54:44,904 - modelscope - INFO - epoch(eval) [2][30]	memory: 4064, evaluation/src-ref_avg: 0.3306, evaluation/src-ref_zh-en: 0.3306, evaluation/src_avg: 0.3959, evaluation/src_zh-en: 0.3959, evaluation/ref_avg: 0.2327, evaluation/ref_zh-en: 0.2327
2024-01-25 22:54:47,449 - modelscope - INFO - epoch [3][1/18]	lr: 1.000e-05, eta: 0:00:06, iter_time: 2.480, data_load_time: 2.415, memory: 4064, loss: 0.9880
2024-01-25 22:54:47,664 - modelscope - INFO - epoch [3][2/18]	lr: 1.000e-05, eta: 0:00:05, iter_time: 0.160, data_load_time: 0.066, memory: 4064, loss: 0.7404
2024-01-25 22:54:47,763 - modelscope - INFO - epoch [3][3/18]	lr: 1.000e-05, eta: 0:00:05, iter_time: 0.169, data_load_time: 0.119, memory: 4064, loss: 1.2241
2024-01-25 22:54:47,915 - modelscope - INFO - epoch [3][4/18]	lr: 1.000e-05, eta: 0:00:04, iter_time: 0.116, data_load_time: 0.049, memory: 4064, loss: 0.4311
2024-01-25 22:54:48,003 - modelscope - INFO - epoch [3][5/18]	lr: 1.000e-05, eta: 0:00:04, iter_time: 0.130, data_load_time: 0.085, memory: 4064, loss: 2.2497
2024-01-25 22:54:48,116 - modelscope - INFO - epoch [3][6/18]	lr: 1.000e-05, eta: 0:00:03, iter_time: 0.099, data_load_time: 0.044, memory: 4064, loss: 1.6233
2024-01-25 22:54:48,208 - modelscope - INFO - epoch [3][7/18]	lr: 1.000e-05, eta: 0:00:03, iter_time: 0.105, data_load_time: 0.056, memory: 4064, loss: 1.0620
2024-01-25 22:54:48,380 - modelscope - INFO - epoch [3][8/18]	lr: 1.000e-05, eta: 0:00:03, iter_time: 0.116, data_load_time: 0.043, memory: 4064, loss: 0.7042
2024-01-25 22:54:48,483 - modelscope - INFO - epoch [3][9/18]	lr: 1.000e-05, eta: 0:00:02, iter_time: 0.150, data_load_time: 0.100, memory: 4064, loss: 0.7766
2024-01-25 22:54:48,616 - modelscope - INFO - epoch [3][10/18]	lr: 1.000e-05, eta: 0:00:02, iter_time: 0.113, data_load_time: 0.052, memory: 4064, loss: 0.8106
2024-01-25 22:54:48,715 - modelscope - INFO - epoch [3][11/18]	lr: 1.000e-05, eta: 0:00:02, iter_time: 0.122, data_load_time: 0.073, memory: 4064, loss: 1.2490
2024-01-25 22:54:48,851 - modelscope - INFO - epoch [3][12/18]	lr: 1.000e-05, eta: 0:00:01, iter_time: 0.109, data_load_time: 0.049, memory: 4064, loss: 1.4198
2024-01-25 22:54:48,980 - modelscope - INFO - epoch [3][13/18]	lr: 1.000e-05, eta: 0:00:01, iter_time: 0.135, data_load_time: 0.076, memory: 4064, loss: 1.2823
2024-01-25 22:54:49,148 - modelscope - INFO - epoch [3][14/18]	lr: 1.000e-05, eta: 0:00:01, iter_time: 0.146, data_load_time: 0.071, memory: 4064, loss: 0.8499
2024-01-25 22:54:49,392 - modelscope - INFO - epoch [3][15/18]	lr: 1.000e-05, eta: 0:00:00, iter_time: 0.196, data_load_time: 0.093, memory: 4086, loss: 0.8231
2024-01-25 22:54:49,566 - modelscope - INFO - epoch [3][16/18]	lr: 1.000e-05, eta: 0:00:00, iter_time: 0.213, data_load_time: 0.140, memory: 4086, loss: 1.3057
2024-01-25 22:54:49,658 - modelscope - INFO - epoch [3][17/18]	lr: 1.000e-05, eta: 0:00:00, iter_time: 0.146, data_load_time: 0.101, memory: 4086, loss: 0.7469
2024-01-25 22:54:49,824 - modelscope - INFO - epoch [3][18/18]	lr: 1.000e-05, eta: 0:00:00, iter_time: 0.123, data_load_time: 0.046, memory: 4086, loss: 0.7546
2024-01-25 22:54:49,892 - modelscope - INFO - Building dataloader for evaluating ...
2024-01-25 22:54:49,892 - modelscope - INFO - Reading done, 120 items in total
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:04<00:00, 29.18it/s]
2024-01-25 22:54:54,008 - modelscope - INFO - Evaluation results for src-ref input format
2024-01-25 22:54:54,020 - modelscope - INFO - 	zh-en: 11.020408
2024-01-25 22:54:54,020 - modelscope - INFO - Average evaluation result for src-ref input format: 0.110204
2024-01-25 22:54:54,020 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:03<00:00, 33.83it/s]
2024-01-25 22:54:57,570 - modelscope - INFO - Evaluation results for src input format
2024-01-25 22:54:57,581 - modelscope - INFO - 	zh-en: 18.367347
2024-01-25 22:54:57,582 - modelscope - INFO - Average evaluation result for src input format: 0.183673
2024-01-25 22:54:57,582 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:03<00:00, 33.40it/s]
2024-01-25 22:55:01,177 - modelscope - INFO - Evaluation results for ref input format
2024-01-25 22:55:01,189 - modelscope - INFO - 	zh-en: 3.673469
2024-01-25 22:55:01,189 - modelscope - INFO - Average evaluation result for ref input format: 0.036735
2024-01-25 22:55:01,189 - modelscope - INFO -
2024-01-25 22:55:01,189 - modelscope - INFO - Saving checkpoint at 3 epoch
2024-01-25 22:55:06,473 - modelscope - INFO - epoch(eval) [3][30]	memory: 4086, evaluation/src-ref_avg: 0.1102, evaluation/src-ref_zh-en: 0.1102, evaluation/src_avg: 0.1837, evaluation/src_zh-en: 0.1837, evaluation/ref_avg: 0.0367, evaluation/ref_zh-en: 0.0367
2024-01-25 22:55:06,474 - modelscope - INFO - Train finished. Uploading models, waiting...
2024-01-25 22:55:06,557 - modelscope - INFO - {'done': True}
2024-01-25 22:55:06,967 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:55:08,777 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
.2024-01-25 22:55:09,731 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:55:10,548 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:55:11,271 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:55:11,563 - modelscope - INFO - initialize model from /mnt/workspace/.cache/modelscope/damo/nlp_unite_mup_translation_evaluation_multilingual_large
2024-01-25 22:55:21,075 - modelscope - INFO - ==========================Training Config Start==========================
2024-01-25 22:55:21,076 - modelscope - INFO - {
    "framework": "pytorch",
    "task": "translation-evaluation",
    "pipeline": {
        "type": "translation-evaluation"
    },
    "preprocessor": {
        "type": "translation-evaluation-preprocessor",
        "max_len": 510,
        "pad_token_id": 1,
        "eos_token_id": 2
    },
    "model": {
        "attention_probs_dropout_prob": 0.1,
        "bos_token_id": 0,
        "eos_token_id": 2,
        "hidden_act": "gelu",
        "hidden_dropout_prob": 0.1,
        "hidden_size": 1024,
        "initializer_range": 0.02,
        "intermediate_size": 4096,
        "layer_norm_eps": 1e-05,
        "max_position_embeddings": 514,
        "model_type": "unite",
        "num_attention_heads": 16,
        "num_hidden_layers": 24,
        "output_past": true,
        "pad_token_id": 1,
        "type_vocab_size": 1,
        "use_cache": true,
        "vocab_size": 250002,
        "mlp_hidden_sizes": [
            3072,
            1024
        ],
        "mlp_act": "tanh",
        "mlp_final_act": null,
        "mlp_dropout": 0.1,
        "type": "unite"
    },
    "dataset": {
        "train": {
            "name": "train.csv",
            "split": "train"
        },
        "valid": {
            "name": "eval.csv",
            "split": "eval"
        }
    },
    "train": {
        "initialize_model_with_checkpoint": true,
        "num_gpus": 1,
        "batch_size": 2,
        "seed": 12,
        "optimizer": {
            "type": "AdamW",
            "plm_lr": 1e-05,
            "betas": [
                0.9,
                0.98
            ],
            "eps": 1e-09,
            "weight_decay": 0.0,
            "plm_lr_layerwise_decay": 0.95,
            "mlp_lr": 3e-05,
            "options": {
                "cumulative_iters": 4,
                "grad_clip": null
            }
        },
        "lr_scheduler": {
            "type": "ConstantLR",
            "factor": 1.0,
            "total_iters": 3
        },
        "max_epochs": 3,
        "work_dir": "experiments_unite_large/",
        "hooks": [
            {
                "type": "TensorboardHook",
                "interval": 1
            },
            {
                "type": "IterTimerHook"
            }
        ],
        "logging": {
            "interval": 1
        },
        "checkpoint": {
            "best": {
                "metric_key": "src-ref_avg",
                "rule": "max"
            },
            "period": {
                "interval": 1
            }
        }
    },
    "evaluation": {
        "batch_size": 4,
        "save_outputs": true,
        "metrics": [
            {
                "type": "translation-evaluation-metric",
                "gap_threshold": 25.0
            }
        ],
        "period": {
            "interval": 1
        }
    }
}
2024-01-25 22:55:21,076 - modelscope - INFO - ===========================Training Config End===========================
2024-01-25 22:55:21,076 - modelscope - INFO - Building dataloader for training ...
2024-01-25 22:55:21,076 - modelscope - INFO - Reading train csv file from train.csv ...
/opt/conda/lib/python3.10/site-packages/datasets/load.py:2096: FutureWarning: 'ignore_verifications' was deprecated in favor of 'verification_mode' in version 2.9.1 and will be removed in 3.0.0.
You can remove this warning by passing 'verification_mode=no_checks' instead.
  warnings.warn(
2024-01-25 22:55:22,253 - modelscope - INFO - 109 samples are given for training. Using 36 samples for each input format. Leaving the last 1 samples unused.
2024-01-25 22:55:22,254 - modelscope - INFO - Reading done, 109 items in total
2024-01-25 22:55:22,254 - modelscope - INFO - Building AdamW optimizer ...
2024-01-25 22:55:22,259 - modelscope - WARNING - ('LR_SCHEDULER', 'default', 'ConstantLR') not found in ast index file
2024-01-25 22:55:22,260 - modelscope - INFO - Stage: before_run:
    (ABOVE_NORMAL) OptimizerHook
    (LOW         ) LrSchedulerHook
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: before_train_epoch:
    (LOW         ) LrSchedulerHook
 --------------------
Stage: before_train_iter:
    (ABOVE_NORMAL) OptimizerHook
 --------------------
Stage: after_train_iter:
    (ABOVE_NORMAL) OptimizerHook
    (NORMAL      ) EvaluationHook
    (LOW         ) LrSchedulerHook
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: after_train_epoch:
    (NORMAL      ) EvaluationHook
    (LOW         ) LrSchedulerHook
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: after_val_epoch:
    (VERY_LOW    ) TextLoggerHook
    (VERY_LOW    ) TensorboardHook
 --------------------
Stage: after_run:
    (LOW         ) BestCkptSaverHook
    (LOW         ) CheckpointHook
    (VERY_LOW    ) TensorboardHook
 --------------------
2024-01-25 22:55:22,277 - modelscope - INFO - Checkpoints will be saved to experiments_unite_large/
2024-01-25 22:55:22,291 - modelscope - INFO - Checkpoints will be saved to experiments_unite_large/
2024-01-25 22:55:22,291 - modelscope - INFO - Text logs will be saved to experiments_unite_large/
2024-01-25 22:55:22,292 - modelscope - INFO - tensorboard files will be saved to experiments_unite_large/tensorboard_output
2024-01-25 22:55:25,202 - modelscope - INFO - epoch [1][1/18]	lr: 1.000e-05, eta: 0:02:26, iter_time: 2.769, data_load_time: 2.557, memory: 5773, loss: 0.2842
2024-01-25 22:55:25,492 - modelscope - INFO - epoch [1][2/18]	lr: 1.000e-05, eta: 0:01:20, iter_time: 0.316, data_load_time: 0.140, memory: 6946, loss: 1.8496
2024-01-25 22:55:25,789 - modelscope - INFO - epoch [1][3/18]	lr: 1.000e-05, eta: 0:00:57, iter_time: 0.286, data_load_time: 0.117, memory: 7067, loss: 0.9591
2024-01-25 22:55:26,447 - modelscope - INFO - epoch [1][4/18]	lr: 1.000e-05, eta: 0:00:50, iter_time: 0.670, data_load_time: 0.125, memory: 8401, loss: 0.1568
2024-01-25 22:55:26,852 - modelscope - INFO - epoch [1][5/18]	lr: 1.000e-05, eta: 0:00:42, iter_time: 0.329, data_load_time: 0.113, memory: 8782, loss: 1.4333
2024-01-25 22:55:27,424 - modelscope - INFO - epoch [1][6/18]	lr: 1.000e-05, eta: 0:00:39, iter_time: 0.572, data_load_time: 0.189, memory: 10944, loss: 0.5080
2024-01-25 22:55:27,851 - modelscope - INFO - epoch [1][7/18]	lr: 1.000e-05, eta: 0:00:36, iter_time: 0.459, data_load_time: 0.190, memory: 10944, loss: 1.4123
2024-01-25 22:55:28,422 - modelscope - INFO - epoch [1][8/18]	lr: 1.000e-05, eta: 0:00:33, iter_time: 0.493, data_load_time: 0.157, memory: 10944, loss: 0.5202
2024-01-25 22:55:28,829 - modelscope - INFO - epoch [1][9/18]	lr: 1.000e-05, eta: 0:00:31, iter_time: 0.449, data_load_time: 0.235, memory: 10944, loss: 0.7124
2024-01-25 22:55:29,125 - modelscope - INFO - epoch [1][10/18]	lr: 1.000e-05, eta: 0:00:29, iter_time: 0.374, data_load_time: 0.193, memory: 10944, loss: 1.3857
2024-01-25 22:55:29,493 - modelscope - INFO - epoch [1][11/18]	lr: 1.000e-05, eta: 0:00:27, iter_time: 0.347, data_load_time: 0.116, memory: 10944, loss: 1.0442
2024-01-25 22:55:29,958 - modelscope - INFO - epoch [1][12/18]	lr: 1.000e-05, eta: 0:00:26, iter_time: 0.395, data_load_time: 0.136, memory: 10944, loss: 1.0455
2024-01-25 22:55:30,376 - modelscope - INFO - epoch [1][13/18]	lr: 1.000e-05, eta: 0:00:24, iter_time: 0.426, data_load_time: 0.207, memory: 10944, loss: 1.3795
2024-01-25 22:55:30,891 - modelscope - INFO - epoch [1][14/18]	lr: 1.000e-05, eta: 0:00:24, iter_time: 0.527, data_load_time: 0.199, memory: 10944, loss: 1.1123
2024-01-25 22:55:31,459 - modelscope - INFO - epoch [1][15/18]	lr: 1.000e-05, eta: 0:00:23, iter_time: 0.565, data_load_time: 0.188, memory: 11147, loss: 0.3557
2024-01-25 22:55:31,946 - modelscope - INFO - epoch [1][16/18]	lr: 1.000e-05, eta: 0:00:22, iter_time: 0.464, data_load_time: 0.190, memory: 11147, loss: 5.1896
2024-01-25 22:55:32,301 - modelscope - INFO - epoch [1][17/18]	lr: 1.000e-05, eta: 0:00:21, iter_time: 0.399, data_load_time: 0.213, memory: 11147, loss: 0.7192
2024-01-25 22:55:32,712 - modelscope - INFO - epoch [1][18/18]	lr: 1.000e-05, eta: 0:00:20, iter_time: 0.428, data_load_time: 0.169, memory: 11147, loss: 0.5781
2024-01-25 22:55:32,798 - modelscope - INFO - Building dataloader for evaluating ...
2024-01-25 22:55:32,798 - modelscope - INFO - Reading eval csv file from eval.csv ...
Downloading data files: 100%|███████████████████████████████████| 1/1 [00:00<00:00, 9554.22it/s]
Extracting data files: 100%|█████████████████████████████████████| 1/1 [00:00<00:00, 786.33it/s]
Generating train split: 0 examples [00:00, ? examples/s]/opt/conda/lib/python3.10/site-packages/pandas/io/common.py:131: ResourceWarning: unclosed file <_io.BufferedReader name='/mnt/workspace/.cache/modelscope/damo/nlp_unite_mup_translation_evaluation_multilingual_large/eval.csv'>
  self.handle.detach()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Generating train split: 120 examples [00:00, 6942.10 examples/s]
2024-01-25 22:55:34,031 - modelscope - INFO - Reading done, 120 items in total
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:08<00:00, 14.92it/s]
2024-01-25 22:55:42,077 - modelscope - INFO - Evaluation results for src-ref input format
2024-01-25 22:55:42,089 - modelscope - INFO - 	zh-en: -31.428571
2024-01-25 22:55:42,089 - modelscope - INFO - Average evaluation result for src-ref input format: -0.314286
2024-01-25 22:55:42,089 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:06<00:00, 19.51it/s]
2024-01-25 22:55:48,243 - modelscope - INFO - Evaluation results for src input format
2024-01-25 22:55:48,255 - modelscope - INFO - 	zh-en: -15.918367
2024-01-25 22:55:48,255 - modelscope - INFO - Average evaluation result for src input format: -0.159184
2024-01-25 22:55:48,255 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:06<00:00, 19.27it/s]
2024-01-25 22:55:54,485 - modelscope - INFO - Evaluation results for ref input format
2024-01-25 22:55:54,497 - modelscope - INFO - 	zh-en: -24.081633
2024-01-25 22:55:54,497 - modelscope - INFO - Average evaluation result for ref input format: -0.240816
2024-01-25 22:55:54,497 - modelscope - INFO -
2024-01-25 22:55:54,498 - modelscope - INFO - Saving checkpoint at 1 epoch
2024-01-25 22:56:07,231 - modelscope - INFO - Saving checkpoint at 1 epoch
2024-01-25 22:56:19,765 - modelscope - INFO - epoch(eval) [1][30]	memory: 11147, evaluation/src-ref_avg: -0.3143, evaluation/src-ref_zh-en: -0.3143, evaluation/src_avg: -0.1592, evaluation/src_zh-en: -0.1592, evaluation/ref_avg: -0.2408, evaluation/ref_zh-en: -0.2408
2024-01-25 22:56:22,925 - modelscope - INFO - epoch [2][1/18]	lr: 1.000e-05, eta: 0:00:24, iter_time: 2.979, data_load_time: 2.565, memory: 11147, loss: 4.8808
2024-01-25 22:56:23,279 - modelscope - INFO - epoch [2][2/18]	lr: 1.000e-05, eta: 0:00:23, iter_time: 0.362, data_load_time: 0.182, memory: 11147, loss: 1.4385
2024-01-25 22:56:23,639 - modelscope - INFO - epoch [2][3/18]	lr: 1.000e-05, eta: 0:00:21, iter_time: 0.354, data_load_time: 0.177, memory: 11147, loss: 0.6582
2024-01-25 22:56:24,201 - modelscope - INFO - epoch [2][4/18]	lr: 1.000e-05, eta: 0:00:21, iter_time: 0.555, data_load_time: 0.179, memory: 11147, loss: 1.1768
2024-01-25 22:56:24,578 - modelscope - INFO - epoch [2][5/18]	lr: 1.000e-05, eta: 0:00:20, iter_time: 0.417, data_load_time: 0.185, memory: 11147, loss: 0.7451
2024-01-25 22:56:24,975 - modelscope - INFO - epoch [2][6/18]	lr: 1.000e-05, eta: 0:00:19, iter_time: 0.343, data_load_time: 0.145, memory: 11147, loss: 1.2105
2024-01-25 22:56:25,385 - modelscope - INFO - epoch [2][7/18]	lr: 1.000e-05, eta: 0:00:18, iter_time: 0.418, data_load_time: 0.199, memory: 11147, loss: 0.8132
2024-01-25 22:56:25,822 - modelscope - INFO - epoch [2][8/18]	lr: 1.000e-05, eta: 0:00:17, iter_time: 0.467, data_load_time: 0.190, memory: 11147, loss: 0.3505
2024-01-25 22:56:26,120 - modelscope - INFO - epoch [2][9/18]	lr: 1.000e-05, eta: 0:00:16, iter_time: 0.343, data_load_time: 0.161, memory: 11147, loss: 1.6900
2024-01-25 22:56:26,699 - modelscope - INFO - epoch [2][10/18]	lr: 1.000e-05, eta: 0:00:15, iter_time: 0.457, data_load_time: 0.116, memory: 11147, loss: 0.7491
2024-01-25 22:56:27,104 - modelscope - INFO - epoch [2][11/18]	lr: 1.000e-05, eta: 0:00:15, iter_time: 0.454, data_load_time: 0.239, memory: 11147, loss: 0.8758
2024-01-25 22:56:27,343 - modelscope - INFO - epoch [2][12/18]	lr: 1.000e-05, eta: 0:00:14, iter_time: 0.337, data_load_time: 0.189, memory: 11147, loss: 0.4616
2024-01-25 22:56:27,907 - modelscope - INFO - epoch [2][13/18]	lr: 1.000e-05, eta: 0:00:13, iter_time: 0.474, data_load_time: 0.091, memory: 11147, loss: 0.8847
2024-01-25 22:56:28,294 - modelscope - INFO - epoch [2][14/18]	lr: 1.000e-05, eta: 0:00:12, iter_time: 0.373, data_load_time: 0.180, memory: 11147, loss: 0.1583
2024-01-25 22:56:28,651 - modelscope - INFO - epoch [2][15/18]	lr: 1.000e-05, eta: 0:00:12, iter_time: 0.374, data_load_time: 0.195, memory: 11147, loss: 1.1917
2024-01-25 22:56:29,201 - modelscope - INFO - epoch [2][16/18]	lr: 1.000e-05, eta: 0:00:11, iter_time: 0.545, data_load_time: 0.177, memory: 11147, loss: 0.6198
2024-01-25 22:56:29,501 - modelscope - INFO - epoch [2][17/18]	lr: 1.000e-05, eta: 0:00:10, iter_time: 0.354, data_load_time: 0.182, memory: 11147, loss: 0.5833
2024-01-25 22:56:30,035 - modelscope - INFO - epoch [2][18/18]	lr: 1.000e-05, eta: 0:00:10, iter_time: 0.437, data_load_time: 0.128, memory: 11147, loss: 0.5790
2024-01-25 22:56:30,122 - modelscope - INFO - Building dataloader for evaluating ...
2024-01-25 22:56:30,123 - modelscope - INFO - Reading done, 120 items in total
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:08<00:00, 14.96it/s]
2024-01-25 22:56:38,147 - modelscope - INFO - Evaluation results for src-ref input format
2024-01-25 22:56:38,159 - modelscope - INFO - 	zh-en: -23.265306
2024-01-25 22:56:38,159 - modelscope - INFO - Average evaluation result for src-ref input format: -0.232653
2024-01-25 22:56:38,159 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:06<00:00, 19.64it/s]
2024-01-25 22:56:44,271 - modelscope - INFO - Evaluation results for src input format
2024-01-25 22:56:44,282 - modelscope - INFO - 	zh-en: -33.061224
2024-01-25 22:56:44,283 - modelscope - INFO - Average evaluation result for src input format: -0.330612
2024-01-25 22:56:44,283 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:06<00:00, 19.26it/s]
2024-01-25 22:56:50,515 - modelscope - INFO - Evaluation results for ref input format
2024-01-25 22:56:50,526 - modelscope - INFO - 	zh-en: -23.265306
2024-01-25 22:56:50,527 - modelscope - INFO - Average evaluation result for ref input format: -0.232653
2024-01-25 22:56:50,527 - modelscope - INFO -
2024-01-25 22:56:50,527 - modelscope - INFO - Saving checkpoint at 2 epoch
2024-01-25 22:57:03,227 - modelscope - INFO - deleting checkpoint: experiments_unite_large/best_epoch1_src-ref_avg-0.3142857142857143
2024-01-25 22:57:03,822 - modelscope - INFO - Saving checkpoint at 2 epoch
2024-01-25 22:57:16,294 - modelscope - INFO - epoch(eval) [2][30]	memory: 11147, evaluation/src-ref_avg: -0.2327, evaluation/src-ref_zh-en: -0.2327, evaluation/src_avg: -0.3306, evaluation/src_zh-en: -0.3306, evaluation/ref_avg: -0.2327, evaluation/ref_zh-en: -0.2327
2024-01-25 22:57:19,375 - modelscope - INFO - epoch [3][1/18]	lr: 1.000e-05, eta: 0:00:10, iter_time: 2.887, data_load_time: 2.608, memory: 11147, loss: 0.5929
2024-01-25 22:57:19,759 - modelscope - INFO - epoch [3][2/18]	lr: 1.000e-05, eta: 0:00:09, iter_time: 0.430, data_load_time: 0.195, memory: 11147, loss: 2.2666
2024-01-25 22:57:20,053 - modelscope - INFO - epoch [3][3/18]	lr: 1.000e-05, eta: 0:00:09, iter_time: 0.322, data_load_time: 0.150, memory: 11147, loss: 0.3718
2024-01-25 22:57:20,445 - modelscope - INFO - epoch [3][4/18]	lr: 1.000e-05, eta: 0:00:08, iter_time: 0.326, data_load_time: 0.119, memory: 11147, loss: 1.7617
2024-01-25 22:57:20,694 - modelscope - INFO - epoch [3][5/18]	lr: 1.000e-05, eta: 0:00:07, iter_time: 0.313, data_load_time: 0.186, memory: 11147, loss: 1.5502
2024-01-25 22:57:20,935 - modelscope - INFO - epoch [3][6/18]	lr: 1.000e-05, eta: 0:00:07, iter_time: 0.271, data_load_time: 0.122, memory: 11147, loss: 0.6524
2024-01-25 22:57:21,308 - modelscope - INFO - epoch [3][7/18]	lr: 1.000e-05, eta: 0:00:06, iter_time: 0.323, data_load_time: 0.091, memory: 11147, loss: 1.2068
2024-01-25 22:57:21,885 - modelscope - INFO - epoch [3][8/18]	lr: 1.000e-05, eta: 0:00:05, iter_time: 0.480, data_load_time: 0.141, memory: 11147, loss: 0.3999
2024-01-25 22:57:22,429 - modelscope - INFO - epoch [3][9/18]	lr: 1.000e-05, eta: 0:00:05, iter_time: 0.558, data_load_time: 0.238, memory: 11147, loss: 1.3902
2024-01-25 22:57:22,944 - modelscope - INFO - epoch [3][10/18]	lr: 1.000e-05, eta: 0:00:04, iter_time: 0.554, data_load_time: 0.224, memory: 11147, loss: 0.9272
2024-01-25 22:57:23,327 - modelscope - INFO - epoch [3][11/18]	lr: 1.000e-05, eta: 0:00:04, iter_time: 0.423, data_load_time: 0.185, memory: 11147, loss: 0.6970
2024-01-25 22:57:23,814 - modelscope - INFO - epoch [3][12/18]	lr: 1.000e-05, eta: 0:00:03, iter_time: 0.416, data_load_time: 0.145, memory: 11147, loss: 0.5413
2024-01-25 22:57:24,240 - modelscope - INFO - epoch [3][13/18]	lr: 1.000e-05, eta: 0:00:02, iter_time: 0.442, data_load_time: 0.216, memory: 11147, loss: 0.8826
2024-01-25 22:57:24,708 - modelscope - INFO - epoch [3][14/18]	lr: 1.000e-05, eta: 0:00:02, iter_time: 0.499, data_load_time: 0.200, memory: 11147, loss: 0.5340
2024-01-25 22:57:24,943 - modelscope - INFO - epoch [3][15/18]	lr: 1.000e-05, eta: 0:00:01, iter_time: 0.311, data_load_time: 0.169, memory: 11147, loss: 0.6754
2024-01-25 22:57:25,618 - modelscope - INFO - epoch [3][16/18]	lr: 1.000e-05, eta: 0:00:01, iter_time: 0.511, data_load_time: 0.091, memory: 11147, loss: 0.6638
2024-01-25 22:57:26,036 - modelscope - INFO - epoch [3][17/18]	lr: 1.000e-05, eta: 0:00:00, iter_time: 0.481, data_load_time: 0.256, memory: 11147, loss: 1.0455
2024-01-25 22:57:26,480 - modelscope - INFO - epoch [3][18/18]	lr: 1.000e-05, eta: 0:00:00, iter_time: 0.474, data_load_time: 0.192, memory: 11147, loss: 0.3829
2024-01-25 22:57:26,566 - modelscope - INFO - Building dataloader for evaluating ...
2024-01-25 22:57:26,567 - modelscope - INFO - Reading done, 120 items in total
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:08<00:00, 14.86it/s]
2024-01-25 22:57:34,646 - modelscope - INFO - Evaluation results for src-ref input format
2024-01-25 22:57:34,657 - modelscope - INFO - 	zh-en: -24.897959
2024-01-25 22:57:34,657 - modelscope - INFO - Average evaluation result for src-ref input format: -0.248980
2024-01-25 22:57:34,657 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:06<00:00, 19.50it/s]
2024-01-25 22:57:40,815 - modelscope - INFO - Evaluation results for src input format
2024-01-25 22:57:40,826 - modelscope - INFO - 	zh-en: -42.857143
2024-01-25 22:57:40,826 - modelscope - INFO - Average evaluation result for src input format: -0.428571
2024-01-25 22:57:40,827 - modelscope - INFO -
Total test samples: 100%|█████████████████████████████████████| 120/120 [00:06<00:00, 19.13it/s]
2024-01-25 22:57:47,103 - modelscope - INFO - Evaluation results for ref input format
2024-01-25 22:57:47,114 - modelscope - INFO - 	zh-en: -24.897959
2024-01-25 22:57:47,114 - modelscope - INFO - Average evaluation result for ref input format: -0.248980
2024-01-25 22:57:47,115 - modelscope - INFO -
2024-01-25 22:57:47,115 - modelscope - INFO - Saving checkpoint at 3 epoch
2024-01-25 22:57:59,812 - modelscope - INFO - epoch(eval) [3][30]	memory: 11147, evaluation/src-ref_avg: -0.2490, evaluation/src-ref_zh-en: -0.2490, evaluation/src_avg: -0.4286, evaluation/src_zh-en: -0.4286, evaluation/ref_avg: -0.2490, evaluation/ref_zh-en: -0.2490
2024-01-25 22:57:59,814 - modelscope - INFO - Train finished. Uploading models, waiting...
2024-01-25 22:57:59,861 - modelscope - INFO - {'done': True}
2024-01-25 22:58:00,930 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
2024-01-25 22:58:01,641 - modelscope - WARNING - Model revision not specified, use revision: v2.6.0
.
----------------------------------------------------------------------
Ran 2 tests in 263.994s

OK
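
A quicker comparison between the two images than re-running the trainer test could be a direct version check (sketch; <official-ubuntu-image> is a placeholder for the ModelScope Ubuntu image used above, and passing python3 as the container command assumes the entrypoint can be overridden):

docker run --rm 736efcaf9a7c python3 -c "import numpy, h5py; print(numpy.__version__, h5py.__version__)"
docker run --rm <official-ubuntu-image> python3 -c "import numpy, h5py; print(numpy.__version__, h5py.__version__)"
# in the Anolis image, importing h5py alone is likely to raise the same typeDict AttributeError,
# while the Ubuntu image should print both versions normally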