Become Iron Man with a Single RTX 3090: Microsoft Open-Sources the J.A.R.V.I.S. AI Assistant System
Source: cnblogs  Author: 刘悦的技术博客 (Liu Yue's Tech Blog)  Date: 2023/4/7 8:56:10

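Dreams meet reality: Microsoft, true to form, has open-sourced the J.A.R.V.I.S. AI assistant system. In the Iron Man films, Jarvis stands for "Just A Rather Very Intelligent System", and it helps Tony Stark with all kinds of tasks and challenges, including controlling and managing his armor, providing real-time intelligence and data analysis, and supporting his decisions.

Now we can have a Jarvis AI assistant of our own, and the cost is just one RTX 3090 graphics card.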

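Setting up the environment for Jarvis

Generally speaking, the mainstream entry-level cards for deep learning are the RTX 2070 or 3070, while the 3090 is pretty much the ceiling for consumer-grade deep learning cards.

Above that are the industrial-grade A-series and V-series cards. VRAM is the hard requirement here, because the large models have to be loaded locally. You can edit the code to trim which models get loaded, but some functionality will be lost. If you don't have a 3090, you could run two 12 GB RTX 3060 cards in parallel: the combined VRAM matches the 3090's 24 GB, but raw compute and overall performance fall short of a 3090.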
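To see how an existing machine compares with that 24 GB bar, the memory of each installed card can be read from nvidia-smi and summed. The sketch below is illustrative only: the function names are made up for this example, the 24 GiB threshold simply mirrors the RTX 3090 rather than any official requirement, and the sample text stands in for real nvidia-smi output.

```python
import subprocess

# Assumption: the bar is the RTX 3090's 24 GiB, as in the article;
# the project itself does not publish this number.
REQUIRED_MIB = 24 * 1024

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Turn 'name, MiB' lines into a list of (name, mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mib = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mib)))
    return gpus


def query_gpus():
    """Ask the NVIDIA driver for the installed cards (needs nvidia-smi on PATH)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def total_vram_mib(gpus):
    """Sum the memory of all cards, in MiB."""
    return sum(mib for _, mib in gpus)


# Illustrative sample: two 12 GB RTX 3060 cards.
sample = "NVIDIA GeForce RTX 3060, 12288\nNVIDIA GeForce RTX 3060, 12288\n"
cards = parse_gpu_memory(sample)
print(total_vram_mib(cards) >= REQUIRED_MIB)  # prints True
```

On a real machine, `query_gpus()` would replace the sample string. Summing memory only answers the capacity question: the model table in models_server.py pins its entries to "cuda:0", so spreading models over two cards would also mean editing those device strings.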

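Once you have confirmed the local hardware is up to running Jarvis, clone the project as usual:

    git clone https://github.com/microsoft/JARVIS.git

Then enter the project directory:

    cd JARVIS

Edit the project's configuration file, server/config.yaml: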

    openai:
      key: your_personal_key # gradio, your_personal_key
    huggingface:
      cookie: # required for huggingface inference
    local: # ignore: just for development
      endpoint: http://localhost:8003
    dev: false
    debug: false
    log_file: logs/debug.log
    model: text-davinci-003 # text-davinci-003
    use_completion: true
    inference_mode: hybrid # local, huggingface or hybrid
    local_deployment: minimal # no, minimal, standard or full
    num_candidate_models: 5
    max_description_length: 100
    proxy:
    httpserver:
      host: localhost
      port: 8004
    modelserver:
      host: localhost
      port: 8005
    logit_bias:
      parse_task: 0.1
      choose_model: 5

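Only three settings really need changing here: the OpenAI key (openai.key), the cookie token from the Hugging Face website (huggingface.cookie), and the OpenAI model, which defaults to text-davinci-003.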
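Before starting anything, it can save time to confirm those three settings were actually filled in. The following is a minimal sketch, not part of the project: `check_config` and its messages are invented for this example, and it assumes the cookie is needed whenever inference_mode is huggingface or hybrid, which is inferred from the comments in the file.

```python
PLACEHOLDER = "your_personal_key"
VALID_MODES = ("local", "huggingface", "hybrid")


def check_config(cfg):
    """Return a list of problems found in an already-parsed config.yaml dict."""
    problems = []
    key = (cfg.get("openai") or {}).get("key")
    if not key or key == PLACEHOLDER:
        problems.append("openai.key is empty or still the placeholder")
    if not cfg.get("model"):
        problems.append("model is not set")
    mode = cfg.get("inference_mode")
    if mode not in VALID_MODES:
        problems.append(f"inference_mode must be one of {VALID_MODES}")
    # Assumption: the cookie matters whenever Hugging Face inference may be used.
    cookie = (cfg.get("huggingface") or {}).get("cookie")
    if mode in ("huggingface", "hybrid") and not cookie:
        problems.append("huggingface.cookie is empty")
    return problems


# The values below mirror the unedited template from the article.
template = {
    "openai": {"key": "your_personal_key"},
    "huggingface": {"cookie": None},
    "model": "text-davinci-003",
    "inference_mode": "hybrid",
}
for problem in check_config(template):
    print(problem)
```

The function takes a plain dict so that it stays independent of how the file is read; with PyYAML installed, the dict would typically come from `yaml.safe_load` on the opened server/config.yaml.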

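With the config edited, the official docs recommend a conda virtual environment with Python 3.8. Personally I see no need whatsoever for a virtual environment here and went straight to Python 3.10. Next, install the dependencies:

    pip3 install -r requirements.txt

The project's dependencies are: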

    git+https://github.com/huggingface/diffusers.git@8c530fc2f6a76a2aefb6b285dce6df1675092ac6#egg=diffusers
    git+https://github.com/huggingface/transformers@c612628045822f909020f7eb6784c79700813eda#egg=transformers
    git+https://github.com/patrickvonplaten/controlnet_aux@78efc716868a7f5669c288233d65b471f542ce40#egg=controlnet_aux
    tiktoken==0.3.3
    pydub==0.25.1
    espnet==202301
    espnet_model_zoo==0.1.7
    flask==2.2.3
    flask_cors==3.0.10
    waitress==2.1.2
    datasets==2.11.0
    asteroid==0.6.0
    speechbrain==0.5.14
    timm==0.6.13
    typeguard==2.13.3
    accelerate==0.18.0
    pytesseract==0.3.10
    gradio==3.24.1

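The web API is built on a recent Flask release (2.2), yet oddly Microsoft does not use the async support available in newer Flask versions.

Once installation finishes, enter the models directory:

    cd models

Download the models and datasets:

    sh download.sh

Brace yourself: the models alone take up an enormous amount of disk space, never mind the datasets. All files come from Hugging Face: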

    models="
    nlpconnect/vit-gpt2-image-captioning
    lllyasviel/ControlNet
    runwayml/stable-diffusion-v1-5
    CompVis/stable-diffusion-v1-4
    stabilityai/stable-diffusion-2-1
    Salesforce/blip-image-captioning-large
    damo-vilab/text-to-video-ms-1.7b
    microsoft/speecht5_asr
    facebook/maskformer-swin-large-ade
    microsoft/biogpt
    facebook/esm2_t12_35M_UR50D
    microsoft/trocr-base-printed
    microsoft/trocr-base-handwritten
    JorisCos/DCCRNet_Libri1Mix_enhsingle_16k
    espnet/kan-bayashi_ljspeech_vits
    facebook/detr-resnet-101
    microsoft/speecht5_tts
    microsoft/speecht5_hifigan
    microsoft/speecht5_vc
    facebook/timesformer-base-finetuned-k400
    runwayml/stable-diffusion-v1-5
    superb/wav2vec2-base-superb-ks
    openai/whisper-base
    Intel/dpt-large
    microsoft/beit-base-patch16-224-pt22k-ft22k
    facebook/detr-resnet-50-panoptic
    facebook/detr-resnet-50
    openai/clip-vit-large-patch14
    google/owlvit-base-patch32
    microsoft/DialoGPT-medium
    bert-base-uncased
    Jean-Baptiste/camembert-ner
    deepset/roberta-base-squad2
    facebook/bart-large-cnn
    google/tapas-base-finetuned-wtq
    distilbert-base-uncased-finetuned-sst-2-english
    gpt2
    mrm8488/t5-base-finetuned-question-generation-ap
    Jean-Baptiste/camembert-ner
    t5-base
    impira/layoutlm-document-qa
    ydshieh/vit-gpt2-coco-en
    dandelin/vilt-b32-finetuned-vqa
    lambdalabs/sd-image-variations-diffusers
    facebook/timesformer-base-finetuned-k400
    facebook/maskformer-swin-base-coco
    Intel/dpt-hybrid-midas
    lllyasviel/sd-controlnet-canny
    lllyasviel/sd-controlnet-depth
    lllyasviel/sd-controlnet-hed
    lllyasviel/sd-controlnet-mlsd
    lllyasviel/sd-controlnet-openpose
    lllyasviel/sd-controlnet-scribble
    lllyasviel/sd-controlnet-seg
    "
    # CURRENT_DIR=$(cd `dirname $0`; pwd)
    CURRENT_DIR=$(pwd)
    for model in $models;
    do
        echo "----- Downloading from https://huggingface.co/"$model" -----"
        if [ -d "$model" ]; then
            # cd $model && git reset --hard && git pull && git lfs pull
            cd $model && git pull && git lfs pull
            cd $CURRENT_DIR
        else
            # git clone already fetches the LFS files
            git clone https://huggingface.co/$model $model
        fi
    done
    datasets="Matthijs/cmu-arctic-xvectors"
    for dataset in $datasets;
    do
        echo "----- Downloading from https://huggingface.co/datasets/"$dataset" -----"
        if [ -d "$dataset" ]; then
            cd $dataset && git pull && git lfs pull
            cd $CURRENT_DIR
        else
            git clone https://huggingface.co/datasets/$dataset $dataset
        fi
    done
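Given how much disk space is involved, a quick pre-flight check of the target drive may be worthwhile before launching the script. This is only a sketch: the helper names are invented here, and `required_gib` is a value you must choose yourself, since the article gives no figure for the total download size.

```python
import shutil


def free_gib(path="."):
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_room(path, required_gib):
    """True if at least `required_gib` GiB are free under `path`."""
    return free_gib(path) >= required_gib


# Report how much room the current directory (e.g. models/) has left.
print(f"{free_gib('.'):.1f} GiB free")
```

Calling `has_room(".", n)` from inside the models directory, with `n` set to whatever budget you are willing to spend, gives a simple go/no-go answer before the clones begin.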

You could also split the download script into two shell scripts and download in multiple processes, which is much faster.
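The same idea can be sketched in Python instead of two shell scripts: each `git clone` is its own process, and a small thread pool just keeps several of them running at once. This is an assumption-laden illustration rather than anything shipped with the project; `clone_repo` and `download_all` are hypothetical names, and it only covers model repos (datasets live under the /datasets/ URL prefix in the original script).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def clone_repo(repo):
    """Clone one Hugging Face model repo into ./<repo>; return git's exit code."""
    url = f"https://huggingface.co/{repo}"
    return subprocess.run(["git", "clone", url, repo]).returncode


def download_all(repos, fetch=clone_repo, workers=2):
    """Run `fetch` for every distinct repo, `workers` at a time."""
    # The original list repeats a few names; drop duplicates (keeping order)
    # so two workers never clone into the same directory.
    unique = list(dict.fromkeys(repos))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(fetch, unique))
    return dict(zip(unique, codes))
```

A call such as `download_all(["openai/whisper-base", "gpt2"], workers=2)` would return a mapping from repo name to exit code, so failed clones (non-zero codes) can be retried. The `fetch` parameter exists mainly so the scheduling logic can be exercised without touching the network.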

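But honestly, really, don't download it all. The files are simply enormous, and this is not something an ordinary person can get running. Jarvis can still run without the local models and datasets, as described further down.

Once the long download finishes, Jarvis is configured.

Running Jarvis

If you did download all the models and datasets (respect, you are a brave soul), start the service in a terminal:

    python models_server.py --config config.yaml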

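This starts a Flask service process. Note that in the config.yaml shown earlier, port 8004 belongs to httpserver while modelserver is assigned port 8005; the /hugginggpt endpoint used below is on port 8004, which the upstream README serves by also running awesome_chat.py with --mode server. With that service up, an HTTP request is all it takes to run Jarvis: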

    curl --location 'http://localhost:8004/hugginggpt' \
    --header 'Content-Type: application/json' \
    --data '{
        "messages": [
            {
                "role": "user",
                "content": "please generate a video based on \"Spiderman is surfing\""
            }
        ]
    }'

This asks Jarvis to generate a video of "Spiderman is surfing".
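For scripting, the same request can be issued from Python with only the standard library. The sketch below mirrors the curl call; the helper names are invented for this example, and `ask_jarvis` assumes the service replies with a JSON body, which the article does not show.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
JARVIS_URL = "http://localhost:8004/hugginggpt"


def build_request(prompt, url=JARVIS_URL):
    """Build the POST request carrying a single user message."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_jarvis(prompt, timeout=600):
    """Send the prompt and return the decoded reply (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, `ask_jarvis('please generate a video based on "Spiderman is surfing"')` sends the same payload as the curl command. The generous timeout is a guess; generation tasks can take a while.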

Of course, my own hardware could never run all of this, so the set of loaded models can be trimmed down, at around line 81 of models_server.py:

    other_pipes = {
        "nlpconnect/vit-gpt2-image-captioning":{
            "model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
            "feature_extractor": ViTImageProcessor.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
            "tokenizer": AutoTokenizer.from_pretrained(f"{local_fold}/nlpconnect/vit-gpt2-image-captioning"),
            "device": "cuda:0"
        },
        "Salesforce/blip-image-captioning-large": {
            "model": BlipForConditionalGeneration.from_pretrained(f"{local_fold}/Salesforce/blip-image-captioning-large"),
            "processor": BlipProcessor.from_pretrained(f"{local_fold}/Salesforce/blip-image-captioning-large"),
            "device": "cuda:0"
        },
        "damo-vilab/text-to-video-ms-1.7b": {
            "model": DiffusionPipeline.from_pretrained(f"{local_fold}/damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"),
            "device": "cuda:0"
        },
        "facebook/maskformer-swin-large-ade": {
            "model": MaskFormerForInstanceSegmentation.from_pretrained(f"{local_fold}/facebook/maskformer-swin-large-ade"),
            "feature_extractor" : AutoFeatureExtractor.from_pretrained("facebook/maskformer-swin-large-ade"),
            "device": "cuda:0"
        },
        "microsoft/trocr-base-printed": {
            "processor": TrOCRProcessor.from_pretrained(f"{local_fold}/microsoft/trocr-base-printed"),
            "model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/microsoft/trocr-base-printed"),
            "device": "cuda:0"
        },
        "microsoft/trocr-base-handwritten": {
            "processor": TrOCRProcessor.from_pretrained(f"{local_fold}/microsoft/trocr-base-handwritten"),
            "model": VisionEncoderDecoderModel.from_pretrained(f"{local_fold}/microsoft/trocr-base-handwritten"),
            "device": "cuda:0"
        },
        "JorisCos/DCCRNet_Libri1Mix_enhsingle_16k": {
            "model": BaseModel.from_pretrained("JorisCos/DCCRNet_Libri1Mix_enhsingle_16k"),
            "device": "cuda:0"
        },
        "espnet/kan-bayashi_ljspeech_vits": {
            "model": Text2Speech.from_pretrained(f"espnet/kan-bayashi_ljspeech_vits"),
            "device": "cuda:0"
        },
        "lambdalabs/sd-image-variations-diffusers": {
            "model": DiffusionPipeline.from_pretrained(f"{local_fold}/lambdalabs/sd-image-variations-diffusers"), #torch_dtype=torch.float16
            "device": "cuda:0"
        },
        "CompVis/stable-diffusion-v1-4": {
            "model": DiffusionPipeline.from_pretrained(f"{local_fold}/CompVis/stable-diffusion-v1-4"),
            "device": "cuda:0"
        },
        "stabilityai/stable-diffusion-2-1": {
            "model": DiffusionPipeline.from_pretrained(f"{local_fold}/stabilityai/stable-diffusion-2-1"),
            "device": "cuda:0"
        },
        "runwayml/stable-diffusion-v1-5": {
            "model": DiffusionPipeline.from_pretrained(f"{local_fold}/runwayml/stable-diffusion-v1-5"),
            "device": "cuda:0"
        },
        "microsoft/speecht5_tts":{
            "processor": SpeechT5Processor.from_pretrained(f"{local_fold}/microsoft/speecht5_tts"),
            "model": SpeechT5ForTextToSpeech.from_pretrained(f"{local_fold}/microsoft/speecht5_tts"),
            "vocoder": SpeechT5HifiGan.from_pretrained(f"{local_fold}/microsoft/speecht5_hifigan"),
            "embeddings_dataset": load_dataset(f"{local_fold}/Matthijs/cmu-arctic-xvectors", split="validation"),
            "device": "cuda:0"
        },
        "speechbrain/mtl-mimic-voicebank": {
            "model": WaveformEnhancement.from_hparams(source="speechbrain/mtl-mimic-voicebank", savedir="models/mtl-mimic-voicebank"),
            "device": "cuda:0"
        },
        "microsoft/speecht5_vc":{
            "processor": SpeechT5Processor.from_pretrained(f"{local_fold}/microsoft/speecht5_vc"),
            "model": SpeechT5ForSpeechToSpeech.from_pretrained(f"{local_fold}/microsoft/speecht5_vc"),
            "vocoder": SpeechT5HifiGan.from_pretrained(f"{local_fold}/microsoft/speecht5_hifigan"),
            "embeddings_dataset": load_dataset(f"{local_fold}/Matthijs/cmu-arctic-xvectors", split="validation"),
            "device": "cuda:0"
        },
        "julien-c/wine-quality": {
            "model": joblib.load(cached_download(hf_hub_url("julien-c/wine-quality", "sklearn_model.joblib")))
        },
        "facebook/timesformer-base-finetuned-k400": {
            "processor": AutoImageProcessor.from_pretrained(f"{local_fold}/facebook/timesformer-base-finetuned-k400"),
            "model": TimesformerForVideoClassification.from_pretrained(f"{local_fold}/facebook/timesformer-base-finetuned-k400"),
            "device": "cuda:0"
        },
        "facebook/maskformer-swin-base-coco": {
            "feature_extractor": MaskFormerFeatureExtractor.from_pretrained(f"{local_fold}/facebook/maskformer-swin-base-coco"),
            "model": MaskFormerForInstanceSegmentation.from_pretrained(f"{local_fold}/facebook/maskformer-swin-base-coco"),
            "device": "cuda:0"
        },
        "Intel/dpt-hybrid-midas": {
            "model": DPTForDepthEstimation.from_pretrained(f"{local_fold}/Intel/dpt-hybrid-midas", low_cpu_mem_usage=True),
            "feature_extractor": DPTFeatureExtractor.from_pretrained(f"{local_fold}/Intel/dpt-hybrid-midas"),
            "device": "cuda:0"
        }
    }

Just comment out the models you don't need.
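Commenting entries out works, but every entry that remains is still loaded at startup because the dictionary calls `from_pretrained` eagerly. One alternative, which is not how upstream does it, is to register loader functions and only build a model the first time it is requested. The `LazyPipes` class below is a hypothetical sketch of that pattern, with stub loaders standing in for the real `from_pretrained` calls so it runs without a GPU.

```python
class LazyPipes:
    """Map model names to loader callables; load each model on first access."""

    def __init__(self, loaders, enabled=None):
        # `enabled` is an optional allow-list of names; None keeps everything.
        self._loaders = {
            name: loader for name, loader in loaders.items()
            if enabled is None or name in enabled
        }
        self._loaded = {}

    def __contains__(self, name):
        return name in self._loaders

    def __getitem__(self, name):
        if name not in self._loaded:
            # Raises KeyError for names that were filtered out or never registered.
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]


# Stub loaders; real ones would wrap the from_pretrained calls shown above.
loaders = {
    "openai/whisper-base": lambda: {"model": "stub-whisper", "device": "cpu"},
    "runwayml/stable-diffusion-v1-5": lambda: {"model": "stub-sd", "device": "cpu"},
}
pipes = LazyPipes(loaders, enabled={"openai/whisper-base"})
print("runwayml/stable-diffusion-v1-5" in pipes)  # prints False
print(pipes["openai/whisper-base"]["model"])      # prints stub-whisper
```

The allow-list plays the role of the commented-out entries, while lazy loading means VRAM is only spent on models a request actually touches. Whether that trade (slower first request, lower idle memory) suits your setup is a judgment call.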

Alternatively, you can skip the local model server altogether by setting inference_mode: huggingface in config.yaml.
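If you switch modes often, the edit can be scripted. The helper below is a rough sketch under stated assumptions: `set_top_level_scalar` is an invented name, it works on the raw text with a regular expression so that trailing comments survive, and it only handles simple unindented `key: value` lines without `#` inside the value.

```python
import re


def set_top_level_scalar(yaml_text, key, value):
    """Replace the value of an unindented `key: value` line, keeping its comment."""
    pattern = re.compile(
        rf"^{re.escape(key)}:[ \t]*(?P<val>[^#\n]*?)(?P<tail>[ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )

    def repl(match):
        tail = match.group("tail")
        if tail.startswith("#"):
            tail = " " + tail  # keep a space before the comment
        return f"{key}: {value}{tail}"

    new_text, count = pattern.subn(repl, yaml_text, count=1)
    if count == 0:
        raise KeyError(key)
    return new_text


snippet = (
    "model: text-davinci-003 # text-davinci-003\n"
    "inference_mode: hybrid # local, huggingface or hybrid\n"
    "local_deployment: minimal # no, minimal, standard or full\n"
)
print(set_top_level_scalar(snippet, "inference_mode", "huggingface"))
```

A plain text substitution is used here instead of loading and re-dumping the YAML because a round trip through a parser would typically drop the explanatory comments in the file.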

Then run the ChatGPT version of Jarvis directly in the terminal:

    cd server
    python3 awesome_chat.py --config config.yaml --mode cli

Sample session:

    Welcome to Jarvis! A collaborative system that consists of an LLM as the controller and numerous expert models as collaborative executors. Jarvis can plan tasks, schedule Hugging Face models, generate friendly responses based on your requests, and help you with many things. Please enter your request (`exit` to exit).
    [ User ]: Given a collection of image A: /examples/cat.jpg, B: /examples/z3.jpg, C: /examples/z1.jpg, please tell me how many zebras in these picture?
    [ Jarvis ]: Based on your request, I have used several models to process the images you provided. For the image A: /examples/cat.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a cat sitting on a window sill looking out". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/a3dc.jpg', which contains two objects with labels 'potted plant' and 'cat'. For the image B: /examples/z3.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "two zebras standing in a field of dry grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/f5df.jpg', which contains three objects with labels 'zebra'. For the image C: /examples/z1.jpg, I used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text "a zebra standing in a field of tall grass". Then I used the object-detection model facebook/detr-resnet-50 to generate the image with predicted box '/images/69c7.jpg', which contains one object with label 'zebra'. Finally, I used the canny-control model canny-control to generate the image '/images/7e18.png' based on the image /examples/savanna.jpg. Therefore, there are four zebras in these pictures. Is there anything else I can help you with?
    [ User ]: Please answer all the named entities in the sentence: Iron Man is a superhero appearing in American comic books published by Marvel Comics. The character was co-created by writer and editor Stan Lee, developed by scripter Larry Lieber, and designed by artists Don Heck and Jack Kirby.
    [ Jarvis ]: Sure, I understand your request. Based on the inference results of the models, I will answer your request step by step. Firstly, I used the token-classification model dslim/bert-base-NER to recognize the named entities in the sentence. The inference result is that the named entities are Iron Man, Stan Lee, Larry Lieber, Don Heck and Jack Kirby. Then, I used the question-answering model deepset/roberta-base-squad2 to answer your request. The inference result is that there is no output since the context cannot be empty. Therefore, I cannot make it. I hope this answer your request. Is there anything else I can help you with?
    [ User ]: Please generate a canny image based on /examples/savanna.jpg
    [ Jarvis ]: Sure. I have generated a canny image based on /examples/savanna.jpg. To do this, I first used the image-to-text model nlpconnect/vit-gpt2-image-captioning to generate the text description of the image, which is "a herd of giraffes and zebras grazing in a field". Then I used the canny-control model to generate the canny image of the original image, which is stored in the path "/images/1ea2.png". Finally, I used the canny-text-to-image model lllyasviel/sd-controlnet-canny to generate the canny image based on the text description and the canny image, which is stored in the path "/images/ba66.png". Do you need anything else?

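You can think of it as a ChatGPT flavor of text-to-image generation wrapped in a Jarvis shell; full marks for showmanship.

Conclusion

All in all, as with "Visual ChatGPT", which Microsoft open-sourced earlier, the symbolic significance of projects like this far outweighs their practical value. Jarvis represents a vision shared by most people in tech, and the progress of this kind of AI technology deserves recognition, but with the hardware bar so high, it is best not to expect too much in the short term.

Original article: https://www.cnblogs.com/v3ucn/p/17291634.html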
