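Bug 8055 - [ModelScope][Container image] The image lacks the libblas.so.3 library by default, causing a test case bundled with the ModelScope community code to fail.
Summary: [ModelScope][Container image] The image lacks the libblas.so.3 library by default, causing a test case bundled with the ModelScope community code to fail.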
Status: RESOLVED FIXED
Alias: None
Product: Anolis OS 8
Classification: Anolis OS
Component: Images&Installations
Version: 8.8
Hardware: All Linux
Importance: P2-High S2-major
Target Milestone: ---
Assignee: zhongling
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-25 22:05 UTC by yunmeng365524
Modified: 2024-02-21 10:07 UTC (History)

See Also:


Attachments

Description yunmeng365524 2024-01-25 22:05:17 UTC
Description of problem:
The image lacks the libblas.so.3 library by default, which causes a test case bundled with the ModelScope community code to fail.

Version-Release number of selected component (if applicable):
Container image information:
[root@localhost modelscope]# docker images
REPOSITORY                                     TAG          IMAGE ID       CREATED       SIZE
registry.openanolis.cn/openanolis/modelscope   1.10.0-an8   736efcaf9a7c   5 weeks ago   25.3GB


How reproducible:
Create and start the container in GPU mode:
docker create --gpus all -it -v /tmp:/tmp 736efcaf9a7c
After entering the container, clone the ModelScope community code and run the test:
git clone https://github.com/modelscope/modelscope.git
cd modelscope
[root@1a186991ac23 modelscope]# python3 tests/trainers/test_document_grounded_dialog_rerank_trainer.py
2024-01-25 21:49:08,082 - modelscope - INFO - PyTorch version 2.0.1+cu118 Found.
2024-01-25 21:49:08,086 - modelscope - INFO - TensorFlow version 2.9.2 Found.
2024-01-25 21:49:08,086 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-01-25 21:49:08,127 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 da292646adab6334a8f0cd8a272bf9b1 and a total number of 946 components indexed
2024-01-25 21:49:10,639 - modelscope - INFO - No subset_name specified, defaulting to the default
2024-01-25 21:49:11,298 - modelscope - INFO - Generating dataset dataset_builder (/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files)
2024-01-25 21:49:11,298 - modelscope - INFO - Loading meta-data file ...
3513it [00:01, 2544.77it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
/opt/conda/lib/python3.8/site-packages/datasets/download/streaming_download_manager.py:765: ResourceWarning: unclosed file <_io.BufferedReader name='/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files/167d5ff1e9ecd650c2550b1a1501d09c'>
  return pd.read_csv(xopen(filepath_or_buffer, "rb", download_config=download_config), **kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
2024-01-25 21:49:14,071 - modelscope - INFO - Use user-specified model revision: v1.0.0
Downloading: 100%|█████████████████████████████████████████████████| 721/721 [00:00<00:00, 143kB/s]
Downloading: 100%|█████████████████████████████████████████████████| 721/721 [00:00<00:00, 168kB/s]
Downloading: 100%|█████████████████████████████████████████████████| 279/279 [00:00<00:00, 114kB/s]
Downloading: 100%|███████████████████████████████████████████▉| 1.04G/1.04G [01:28<00:00, 12.6MB/s]
Downloading: 100%|████████████████████████████████████████████| 8.39k/8.39k [00:00<00:00, 2.33MB/s]
Downloading: 100%|████████████████████████████████████████████| 4.83M/4.83M [00:00<00:00, 18.2MB/s]
Downloading: 100%|█████████████████████████████████████████████████| 279/279 [00:00<00:00, 106kB/s]
Downloading: 100%|█████████████████████████████████████████████████| 476/476 [00:00<00:00, 203kB/s]
2024-01-25 21:50:48,034 - modelscope - INFO - initialize model from /root/.cache/modelscope/hub/DAMO_ConvAI/nlp_convai_ranking_pretrain
EE
======================================================================
ERROR: test_trainer_with_model_and_args (__main__.TestDialogIntentTrainer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1382, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/rag/modeling_rag.py", line 30, in <module>
    from .retrieval_rag import RagRetriever
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/rag/retrieval_rag.py", line 35, in <module>
    import faiss
  File "/opt/conda/lib/python3.8/site-packages/faiss/__init__.py", line 18, in <module>
    from .loader import *
  File "/opt/conda/lib/python3.8/site-packages/faiss/loader.py", line 65, in <module>
    from .swigfaiss import *
  File "/opt/conda/lib/python3.8/site-packages/faiss/swigfaiss.py", line 13, in <module>
    from . import _swigfaiss
ImportError: libblas.so.3: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tests/trainers/test_document_grounded_dialog_rerank_trainer.py", line 76, in test_trainer_with_model_and_args
    trainer = DocumentGroundedDialogRerankTrainer(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/trainers/nlp/document_grounded_dialog_rerank_trainer.py", line 35, in __init__
    self.model = Model.from_pretrained(model, revision='v1.0.0')
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/base/base_model.py", line 183, in from_pretrained
    model = build_model(model_cfg, task_name=task_name)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/builder.py", line 35, in build_model
    model = build_from_cfg(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/registry.py", line 184, in build_from_cfg
    LazyImportModule.import_module(sig)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/utils/import_utils.py", line 475, in import_module
    importlib.import_module(module_name)
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/nlp/dgds/document_grounded_dialog_rerank.py", line 12, in <module>
    from .backbone import ClassifyRerank
  File "/opt/conda/lib/python3.8/site-packages/modelscope/models/nlp/dgds/backbone.py", line 24, in <module>
    from transformers import (AutoConfig, DPRConfig, DPRQuestionEncoder,
  File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1373, in __getattr__
    value = getattr(module, name)
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1372, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/opt/conda/lib/python3.8/site-packages/transformers/utils/import_utils.py", line 1384, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.rag.modeling_rag because of the following error (look up to see its traceback):
libblas.so.3: cannot open shared object file: No such file or directory

======================================================================
ERROR: test_trainer_with_model_and_args (__main__.TestDialogIntentTrainer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/trainers/test_document_grounded_dialog_rerank_trainer.py", line 24, in tearDown
    shutil.rmtree('./model')
  File "/opt/conda/lib/python3.8/shutil.py", line 709, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/opt/conda/lib/python3.8/shutil.py", line 707, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: './model'

----------------------------------------------------------------------
Ran 1 test in 98.394s

FAILED (errors=2)
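
The first traceback above ends in a dynamic loader error rather than a Python-level bug, so the missing library can be confirmed independently of faiss or transformers. The following is a minimal sketch, not part of the original report: the helper names `missing_library` and `can_load` are hypothetical, and it assumes a Linux environment where `ctypes.CDLL` raises `OSError` when the loader cannot find a shared object.

```python
import ctypes
import re

# Matches the loader message seen in the traceback, e.g.
# "libblas.so.3: cannot open shared object file: No such file or directory".
_MISSING_LIB = re.compile(r"(\S+\.so(?:\.\d+)*): cannot open shared object file")


def missing_library(message):
    """Return the shared-library name named in a loader error, or None."""
    match = _MISSING_LIB.search(message)
    return match.group(1) if match else None


def can_load(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is absent from the loader search path.
        return False
    return True


if __name__ == "__main__":
    # Error text copied from the failing run in this report.
    error = ("ImportError: libblas.so.3: cannot open shared object file: "
             "No such file or directory")
    lib = missing_library(error)
    print(lib, "loadable:", can_load(lib))
```

Running this inside the affected container would be expected to report that libblas.so.3 is not loadable, and running it again after the image is fixed gives a quick regression check that does not need the 1 GB model download.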


Steps to Reproduce:
Same as above.

Actual results:
The test fails because the library is missing.
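
Of the two errors reported in the failing run, only the first is the missing library. The second (`FileNotFoundError` from `shutil.rmtree('./model')` in `tearDown`) appears to be a side effect: the trainer failed before it could write `./model`, so there was nothing to remove. The sketch below illustrates a cleanup that tolerates a missing directory. It is an illustration only, not the ModelScope test code: the class name `CleanupExample` and the temporary paths are hypothetical.

```python
import os
import shutil
import tempfile
import unittest


class CleanupExample(unittest.TestCase):
    """Hypothetical test case whose output directory may never be created."""

    def setUp(self):
        # Use a private temporary directory instead of a path relative to cwd.
        self.workdir = tempfile.mkdtemp()
        self.model_dir = os.path.join(self.workdir, "model")

    def tearDown(self):
        # ignore_errors=True keeps a missing directory from adding a second,
        # misleading error on top of the real failure.
        shutil.rmtree(self.model_dir, ignore_errors=True)
        shutil.rmtree(self.workdir, ignore_errors=True)

    def test_output_dir_never_created(self):
        # Simulates a run that stops before any checkpoint is written.
        self.assertFalse(os.path.exists(self.model_dir))
```

Saved as a module, this can be run with `python -m unittest`. With a cleanup of this kind, the failing run would report a single error that points directly at libblas.so.3.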

Expected results:
The test runs and passes.

Additional info:
For comparison, the same test on the official ModelScope Ubuntu image:
root@ee8906738fa2:/tmp/modelscope# python3 tests/trainers/test_document_grounded_dialog_rerank_trainer.py
2024-01-25 21:49:36,017 - modelscope - INFO - PyTorch version 2.1.0+cu118 Found.
2024-01-25 21:49:36,019 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2024-01-25 21:49:36,019 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer
2024-01-25 21:49:36,071 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 44f0b88effe82ceea94a98cf99709694 and a total number of 946 components indexed
2024-01-25 21:49:44,082 - modelscope - INFO - No subset_name specified, defaulting to the default
2024-01-25 21:49:44,608 - modelscope - INFO - Generating dataset dataset_builder (/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files)
2024-01-25 21:49:44,608 - modelscope - INFO - Loading meta-data file ...
3513it [00:01, 2601.70it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
/opt/conda/lib/python3.10/site-packages/datasets/download/streaming_download_manager.py:765: ResourceWarning: unclosed file <_io.BufferedReader name='/root/.cache/modelscope/hub/datasets/DAMO_ConvAI/FrDoc2BotRerank/master/data_files/167d5ff1e9ecd650c2550b1a1501d09c'>
  return pd.read_csv(xopen(filepath_or_buffer, "rb", download_config=download_config), **kwargs)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
2024-01-25 21:49:47,286 - modelscope - INFO - Use user-specified model revision: v1.0.0
Downloading: 100%|█████████████████████████████████████████████| 721/721 [00:00<00:00, 7.71MB/s]
Downloading: 100%|█████████████████████████████████████████████| 721/721 [00:00<00:00, 8.31MB/s]
Downloading: 100%|█████████████████████████████████████████████| 279/279 [00:00<00:00, 3.38MB/s]
Downloading: 100%|████████████████████████████████████████▉| 1.04G/1.04G [01:28<00:00, 12.6MB/s]
Downloading: 100%|█████████████████████████████████████████| 8.39k/8.39k [00:00<00:00, 40.9MB/s]
Downloading: 100%|█████████████████████████████████████████| 4.83M/4.83M [00:00<00:00, 15.8MB/s]
Downloading: 100%|█████████████████████████████████████████████| 279/279 [00:00<00:00, 3.07MB/s]
Downloading: 100%|█████████████████████████████████████████████| 476/476 [00:00<00:00, 5.64MB/s]
2024-01-25 21:51:21,323 - modelscope - INFO - initialize model from /mnt/workspace/.cache/modelscope/DAMO_ConvAI/nlp_convai_ranking_pretrain
2024-01-25 21:51:32,455 - modelscope - INFO - gathered positive pids for 10 instances
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2024-01-25 21:51:32,463 - modelscope - INFO - ***** Running training *****
2024-01-25 21:51:32,463 - modelscope - INFO -   Instantaneous batch size per GPU = 1
2024-01-25 21:51:32,463 - modelscope - INFO -   Total train batch size (w. parallel, distributed & accumulation) = 32
2024-01-25 21:51:32,463 - modelscope - INFO -   Gradient Accumulation steps = 32
2024-01-25 21:51:32,463 - modelscope - INFO -   Total optimization steps = 0
2024-01-25 21:51:32,463 - modelscope - INFO -   Num Epochs = 1
2024-01-25 21:51:32,464 - modelscope - INFO - loss_history = []
2024-01-25 21:51:32,464 - modelscope - INFO - truncated to max length (128) 0 times
2024-01-25 21:51:32,464 - modelscope - INFO - Saving model checkpoint to ./model
.
----------------------------------------------------------------------
Ran 1 test in 111.888s

OK
Comment 1 zhongling 2024-02-21 10:07:40 UTC
Resolved by installing libnccl in the image by default.