Description of problem:
The image does not ship the libblas.so.3 library by default, which causes a test case bundled with the ModelScope community repository to fail.

Version-Release number of selected component (if applicable):
Container image information:
[root@localhost modelscope]# docker images
REPOSITORY                                     TAG          IMAGE ID       CREATED       SIZE
registry.openanolis.cn/openanolis/modelscope   1.10.0-an8   736efcaf9a7c   5 weeks ago   25.3GB

How reproducible:
Create and start the container with GPU access:
docker create --gpus all -it -v /tmp:/tmp 736efcaf9a7c
Inside the container, clone the ModelScope community code and run the test:
git clone https://github.com/modelscope/modelscope.git
cd modelscope
[root@1a186991ac23 modelscope]# python3 tests/trainers/test_document_grounded_dialog_rerank_trainer.py
2024-01-25 21:49:08,082 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2024-01-25 21:49:08,086 - modelscope - INFO - TensorFlow version 2.9.2 Found.
2024-01-25 21:49:08,086 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-01-25 21:49:08,127 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 da292646adab6334a8f0cd8a272bf9b1 and a total number of 946 components indexed
2024-01-25 21:49:10,639 - modelscope - INFO - No subset_name specified, defaulting to the default
2024-01-25 21:49:11,298 - modelscope - INFO - Generating dataset dataset_builder (/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files)
2024-01-25 21:49:11,298 - modelscope - INFO - Loading meta-data file ...
3513it [00:01, 2544.77it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
/opt/conda/lib/python3.8/site-packages/datasets/download/streaming_download_manager.py:765: ResourceWarning: unclosed file <_io.BufferedReader name='/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files/167d5ff1e9ecd650c2550b1a1501d09c'>
  return pd.read_csv(xopen(filepath_or_buffer, "rb", download_config=download_config), **kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
2024-01-25 21:49:14,071 - modelscope - INFO - Use user-specified model revision: v1.0.0
Downloading: 100%|█████████████████████████████████████████████████| 721/721 [00:00<00:00, 143kB/s]
Downloading: 100%|█████████████████████████████████████████████████| 721/721 [00:00<00:00, 168kB/s]
Downloading: 100%|█████████████████████████████████████████████████| 279/279 [00:00<00:00, 114kB/s]
Downloading: 100%|███████████████████████████████████████████▉| 1.04G/1.04G [01:28<00:00, 12.6MB/s]
Downloading: 100%|████████████████████████████████████████████| 8.39k/8.39k [00:00<00:00, 2.33MB/s]
Downloading: 100%|████████████████████████████████████████████| 4.83M/4.83M [00:00<00:00, 18.2MB/s]
Downloading: 100%|█████████████████████████████████████████████████| 279/279 [00:00<00:00, 106kB/s]
Downloading: 100%|█████████████████████████████████████████████████| 476/476 [00:00<00:00, 203kB/s]
2024-01-25 21:50:48,034 - modelscope - INFO - initialize model from /root/.cache/modelscope/hub/DAMO_ConvAI/nlp_convai_ranking_pretrain
EE
======================================================================
ERROR: test_trainer_with_model_and_args (__main__.TestDialogIntentTrainer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/rag/modeling_rag.py", line 30, in <module>
    from .retrieval_rag import RagRetriever
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/rag/retrieval_rag.py", line 35, in <module>
    import faiss
  File "/opt/conda/lib/python3.8/site-packages/faiss/__init__.py", line 18, in <module>
    from .loader import *
  File "/opt/conda/lib/python3.8/site-packages/faiss/loader.py", line 65, in <module>
    from .swigfaiss import *
  File "/opt/conda/lib/python3.8/site-packages/faiss/swigfaiss.py", line 13, in <module>
    from . import _swigfaiss
ImportError: libblas.so.3: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tests/trainers/test_document_grounded_dialog_rerank_trainer.py", line 76, in test_trainer_with_model_and_args
    trainer = DocumentGroundedDialogRerankTrainer(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/trainers/nlp/document_grounded_dialog_rerank_trainer.py", line 35, in __init__
    self.model = Model.from_pretrained(model, revision='v1.0.0')
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/base/base_model.py", line 183, in from_pretrained
    model = build_model(model_cfg, task_name=task_name)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/builder.py", line 35, in build_model
    model = build_from_cfg(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 184, in build_from_cfg
    LazyImportModule.import_module(sig)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/import_utils.py", line 475, in import_module
    importlib.import_module(module_name)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/nlp/dgds/document_grounded_dialog_rerank.py", line 12, in <module>
    from .backbone import ClassifyRerank
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/nlp/dgds/backbone.py", line 24, in <module>
    from transformers import (AutoConfig, DPRConfig, DPRQuestionEncoder,
  File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1373, in __getattr__
    value = getattr(module, name)
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1372, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1384, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.rag.modeling_rag because of the following error (look up to see its traceback):
libblas.so.3: cannot open shared object file: No such file or directory

======================================================================
ERROR: test_trainer_with_model_and_args (__main__.TestDialogIntentTrainer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/trainers/test_document_grounded_dialog_rerank_trainer.py", line 24, in tearDown
    shutil.rmtree('./model')
  File "/opt/conda/lib/python3.8/shutil.py", line 709, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/opt/conda/lib/python3.8/shutil.py", line 707, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: './model'

----------------------------------------------------------------------
Ran 1 test in 98.394s

FAILED (errors=2)

Steps to Reproduce:
Same as above.

Actual results:
The test fails because the library is missing.

Expected results:
The test runs and passes.

Additional info:
For comparison, the same test on the official ModelScope Ubuntu image:
root@ee8906738fa2:/tmp/modelscope# python3 tests/trainers/test_document_grounded_dialog_rerank_trainer.py
2024-01-25 21:49:36,017 - modelscope - INFO - PyTorch version 2.1.0+cu118 Found.
2024-01-25 21:49:36,019 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2024-01-25 21:49:36,019 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer
2024-01-25 21:49:36,071 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 44f0b88effe82ceea94a98cf99709694 and a total number of 946 components indexed
2024-01-25 21:49:44,082 - modelscope - INFO - No subset_name specified, defaulting to the default
2024-01-25 21:49:44,608 - modelscope - INFO - Generating dataset dataset_builder (/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files)
2024-01-25 21:49:44,608 - modelscope - INFO - Loading meta-data file ...
3513it [00:01, 2601.70it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
/opt/conda/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py:765: ResourceWarning: unclosed file <_io.BufferedReader name='/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files/167d5ff1e9ecd650c2550b1a1501d09c'>
  return pd.read_csv(xopen(filepath_or_buffer, "rb", download_config=download_config), **kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
2024-01-25 21:49:47,286 - modelscope - INFO - Use user-specified model revision: v1.0.0
Downloading: 100%|█████████████████████████████████████████████| 721/721 [00:00<00:00, 7.71MB/s]
Downloading: 100%|█████████████████████████████████████████████| 721/721 [00:00<00:00, 8.31MB/s]
Downloading: 100%|█████████████████████████████████████████████| 279/279 [00:00<00:00, 3.38MB/s]
Downloading: 100%|████████████████████████████████████████▉| 1.04G/1.04G [01:28<00:00, 12.6MB/s]
Downloading: 100%|█████████████████████████████████████████| 8.39k/8.39k [00:00<00:00, 40.9MB/s]
Downloading: 100%|█████████████████████████████████████████| 4.83M/4.83M [00:00<00:00, 15.8MB/s]
Downloading: 100%|█████████████████████████████████████████████| 279/279 [00:00<00:00, 3.07MB/s]
Downloading: 100%|█████████████████████████████████████████████| 476/476 [00:00<00:00, 5.64MB/s]
2024-01-25 21:51:21,323 - modelscope - INFO - initialize model from /mnt/workspace/.cache/modelscope/DAMO_ConvAI/nlp_convai_ranking_pretrain
2024-01-25 21:51:32,455 - modelscope - INFO - gathered positive pids for 10 instances
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2024-01-25 21:51:32,463 - modelscope - INFO - ***** Running training *****
2024-01-25 21:51:32,463 - modelscope - INFO - Instantaneous batch size per GPU = 1
2024-01-25 21:51:32,463 - modelscope - INFO - Total train batch size (w. parallel, distributed & accumulation) = 32
2024-01-25 21:51:32,463 - modelscope - INFO - Gradient Accumulation steps = 32
2024-01-25 21:51:32,463 - modelscope - INFO - Total optimization steps = 0
2024-01-25 21:51:32,463 - modelscope - INFO - Num Epochs = 1
2024-01-25 21:51:32,464 - modelscope - INFO - loss_history = []
2024-01-25 21:51:32,464 - modelscope - INFO - truncated to max length (128) 0 times
2024-01-25 21:51:32,464 - modelscope - INFO - Saving model checkpoint to ./model
.
----------------------------------------------------------------------
Ran 1 test in 111.888s

OK
Resolved by installing libnccl in the image by default.
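
The "cannot open shared object file" failure in the traceback can be checked without running the whole ModelScope test. The following is a minimal sketch, not part of the original report: the helper name can_load_shared_library is hypothetical, and the only library name taken from the report is libblas.so.3. It asks the dynamic loader to open the library through Python's standard ctypes module, which is roughly what happens when faiss imports _swigfaiss.

```python
import ctypes


def can_load_shared_library(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`.

    Hypothetical helper for illustration; it only reports whether the
    library resolves, not whether it is the right version.
    """
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the loader cannot find or open the shared object.
        return False
    return True


if __name__ == "__main__":
    # libblas.so.3 is the library named in the ImportError above.
    for soname in ("libblas.so.3",):
        status = "found" if can_load_shared_library(soname) else "MISSING"
        print(f"{soname}: {status}")
```

Running this inside the container before and after changing the image would show whether the library named in the ImportError has become resolvable.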