Rasa can be extended with custom NLU components and policies. This article is a guide to developing your own custom graph components.
Rasa provides a variety of NLU components and policies out of the box. You can customize them, or create your own components from scratch, using custom graph components.
To use a custom graph component with Rasa, it must meet the following requirement:
- It must use type annotations. Rasa uses type annotations to validate the model configuration. Forward references are not allowed. If you are using Python 3.7, you can use from __future__ import annotations to get rid of forward references; a minimal sketch follows.
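A minimal sketch (with a hypothetical component name) of what the import buys you:
- from __future__ import annotations
-
- class MyComponent:
-     @classmethod
-     def create(cls) -> MyComponent:
-         # Without the `__future__` import, Python 3.7 would require the
-         # forward reference "MyComponent" (a string) as the return annotation.
-         return cls()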
I. Graph Components
Rasa uses the model configuration you provide (config.yml) to build a DAG that describes the dependencies between the components in config.yml and how data flows between them. This has two major benefits:
- Rasa can use the computational graph to optimize the execution of your model. Examples of this are efficient caching of training steps and parallel execution of independent steps.
- Rasa can represent different model architectures flexibly. As long as the graph remains acyclic, Rasa can in theory pass any data to any graph component based on the model configuration, without having to tie the underlying software architecture to the model architecture used.
When translating config.yml into a computational graph, the policies and NLU components become nodes in that graph. While the model configuration distinguishes between policies and NLU components, that distinction is abstracted away once they are placed in the graph. At that point, policies and NLU components become abstract graph components. In practice, this is represented by the GraphComponent interface: both policies and NLU components must inherit from this interface to be compatible with and executable by Rasa's graph.

II. Getting Started
Before you get started, you have to decide whether to implement a custom NLU component or a policy. If you are implementing a custom policy, it is recommended to extend the existing rasa.core.policies.policy.Policy class, which already implements the GraphComponent interface, as shown below:
- from rasa.core.policies.policy import Policy
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
-
- # TODO: Correctly register your graph component
- @DefaultV1Recipe.register(
-     [DefaultV1Recipe.ComponentType.POLICY_WITHOUT_END_TO_END_SUPPORT], is_trainable=True
- )
- class MyPolicy(Policy):
-     ...
If you are implementing a custom NLU component, start from the following skeleton:
- from typing import Dict, Text, Any, List
-
- from rasa.engine.graph import GraphComponent, ExecutionContext
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
- from rasa.shared.nlu.training_data.message import Message
- from rasa.shared.nlu.training_data.training_data import TrainingData
-
- # TODO: Correctly register your component with its type
- @DefaultV1Recipe.register(
-     [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER], is_trainable=True
- )
- class CustomNLUComponent(GraphComponent):
-     @classmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> GraphComponent:
-         # TODO: Implement this
-         ...
-
-     def train(self, training_data: TrainingData) -> Resource:
-         # TODO: Implement this if your component requires training
-         ...
-
-     def process_training_data(self, training_data: TrainingData) -> TrainingData:
-         # TODO: Implement this if your component augments the training data with
-         #       tokens or message features which are used by other components
-         #       during training.
-         ...
-
-         return training_data
-
-     def process(self, messages: List[Message]) -> List[Message]:
-         # TODO: This is the method which Rasa Open Source will call during inference.
-         ...
-         return messages
The sections below explain how to solve the TODOs in the example above and which other methods you need to implement in your custom component.
Custom tokenizers: If you create a custom tokenizer, you should extend the rasa.nlu.tokenizers.tokenizer.Tokenizer class. The train and process methods are already implemented, so you only need to override the tokenize method. A minimal sketch follows.
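A minimal sketch of such a tokenizer, assuming a simple whitespace split. The class name is hypothetical, and the config defaults mirror those of Rasa's own WhitespaceTokenizer, which the Tokenizer base class expects:
- from __future__ import annotations
- from typing import Any, Dict, List, Text
-
- from rasa.engine.graph import ExecutionContext
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
- from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
- from rasa.shared.nlu.training_data.message import Message
-
- @DefaultV1Recipe.register(
-     [DefaultV1Recipe.ComponentType.MESSAGE_TOKENIZER], is_trainable=False
- )
- class MyWhitespaceTokenizer(Tokenizer):
-     @staticmethod
-     def get_default_config() -> Dict[Text, Any]:
-         # Defaults expected by the `Tokenizer` base class.
-         return {
-             "intent_tokenization_flag": False,
-             "intent_split_symbol": "_",
-             "token_pattern": None,
-         }
-
-     @classmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> MyWhitespaceTokenizer:
-         return cls(config)
-
-     def tokenize(self, message: Message, attribute: Text) -> List[Token]:
-         # Split the attribute's text on whitespace, keeping each token's
-         # start offset in the original string.
-         text = message.get(attribute)
-         tokens = []
-         offset = 0
-         for word in text.split():
-             start = text.index(word, offset)
-             tokens.append(Token(word, start))
-             offset = start + len(word)
-         return tokens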
III. The GraphComponent Interface
To run a custom NLU component or policy with Rasa, it must implement the GraphComponent interface, shown below:
- from __future__ import annotations
- from abc import ABC, abstractmethod
- from typing import List, Type, Dict, Text, Any, Optional
-
- from rasa.engine.graph import ExecutionContext
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
-
-
- class GraphComponent(ABC):
-     """Interface for any component which will run in a graph."""
-
-     @classmethod
-     def required_components(cls) -> List[Type]:
-         """Components that should be included in the pipeline before this component."""
-         return []
-
-     @classmethod
-     @abstractmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> GraphComponent:
-         """Creates a new `GraphComponent`.
-
-         Args:
-             config: This config overrides the `default_config`.
-             model_storage: Storage which graph components can use to persist and load
-                 themselves.
-             resource: Resource locator for this component which can be used to persist
-                 and load itself from the `model_storage`.
-             execution_context: Information about the current graph run.
-
-         Returns: An instantiated `GraphComponent`.
-         """
-         ...
-
-     @classmethod
-     def load(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-         **kwargs: Any,
-     ) -> GraphComponent:
-         """Creates a component using a persisted version of itself.
-
-         If not overridden this method merely calls `create`.
-
-         Args:
-             config: The config for this graph component. This is the default config of
-                 the component merged with config specified by the user.
-             model_storage: Storage which graph components can use to persist and load
-                 themselves.
-             resource: Resource locator for this component which can be used to persist
-                 and load itself from the `model_storage`.
-             execution_context: Information about the current graph run.
-             kwargs: Output values from previous nodes might be passed in as `kwargs`.
-
-         Returns:
-             An instantiated, loaded `GraphComponent`.
-         """
-         return cls.create(config, model_storage, resource, execution_context)
-
-     @staticmethod
-     def get_default_config() -> Dict[Text, Any]:
-         """Returns the component's default config.
-
-         Default config and user config are merged by the `GraphNode` before the
-         config is passed to the `create` and `load` method of the component.
-
-         Returns:
-             The default config of the component.
-         """
-         return {}
-
-     @staticmethod
-     def supported_languages() -> Optional[List[Text]]:
-         """Determines which languages this component can work with.
-
-         Returns: A list of supported languages, or `None` to signify all are supported.
-         """
-         return None
-
-     @staticmethod
-     def not_supported_languages() -> Optional[List[Text]]:
-         """Determines which languages this component cannot work with.
-
-         Returns: A list of not supported languages, or
-             `None` to signify all are supported.
-         """
-         return None
-
-     @staticmethod
-     def required_packages() -> List[Text]:
-         """Any extra python dependencies required for this component to run."""
-         return []
-
-     @classmethod
-     def fingerprint_addon(cls, config: Dict[str, Any]) -> Optional[str]:
-         """Adds additional data to the fingerprint calculation.
-
-         This is useful if a component uses external data that is not provided
-         by the graph.
-         """
-         return None
1. The create method
The create method is used to instantiate the graph component during training and must be overridden. Rasa passes the following arguments when calling it:
(1) config: The component's default configuration merged with the configuration provided for this graph component in the model configuration file.
(2) model_storage: You can use this to persist and load your graph component. See the model persistence section for more details on its usage.
(3) resource: The unique identifier of your component within the model storage. See the model persistence section for more details on its usage.
(4) execution_context: Provides additional information about the current execution mode (a short sketch of its use follows this list):
- model_id: A unique identifier for the model used during inference. This parameter is None during training.
- should_add_diagnostic_data: If True, additional diagnostic metadata should be added to the graph component's predictions on top of the actual prediction.
- is_finetuning: If True, the graph component can be trained using finetuning.
- graph_schema: The graph_schema describes the computational graph that is used to train your assistant or to make predictions with it.
- node_name: The node_name is a unique identifier for the step in the graph schema that is fulfilled by the called graph component.
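A minimal, hypothetical sketch of how create might consult this context (MyComponent and its constructor are placeholders):
- @classmethod
- def create(
-     cls,
-     config: Dict[Text, Any],
-     model_storage: ModelStorage,
-     resource: Resource,
-     execution_context: ExecutionContext,
- ) -> MyComponent:
-     if execution_context.is_finetuning:
-         # Continue training from previously persisted state instead of
-         # starting from scratch (component-specific logic would go here).
-         ...
-     # `node_name` uniquely identifies this step in the graph schema.
-     print(f"Creating component for graph node '{execution_context.node_name}'")
-     return cls(config, model_storage, resource)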
2. The load method
The load method is used to instantiate the graph component during inference. The default implementation of this method calls create. You should override this method if your graph component persists data as part of training. For a description of the individual parameters, see the create method.
3. The get_default_config method
The get_default_config method returns the default configuration of the graph component. Its default implementation returns an empty dictionary, which means the graph component has no configuration. Rasa updates the default configuration at runtime with the values given in the configuration file (config.yml). For example:
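A sketch with hypothetical settings: if a component declares the defaults below and config.yml sets epochs: 200 for it, create receives {"alpha": 0.5, "epochs": 200}.
- @staticmethod
- def get_default_config() -> Dict[Text, Any]:
-     # Hypothetical defaults; any key can be overridden from config.yml.
-     return {"alpha": 0.5, "epochs": 100}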
4. The supported_languages method
The supported_languages method specifies which languages the graph component supports. Rasa uses the language key in the model configuration file to validate that the graph component is usable for the specified language. If the graph component returns None (the default implementation), it supports all languages that are not listed in not_supported_languages. Possible values:
- None: supports all languages except those defined in not_supported_languages
5. The not_supported_languages method
The not_supported_languages method specifies which languages the graph component does not support. Rasa uses the language key in the model configuration file to validate that the graph component is usable for the specified language. If the graph component returns None (the default implementation), it supports all languages specified in supported_languages. Possible values (a short sketch of overriding both methods follows this list):
- None or []: supports all languages specified in supported_languages.
- ["en"]: the graph component can be used with any language except English.
6. The required_packages method
The required_packages method indicates which extra Python packages need to be installed to use this graph component. If the required libraries are not found at runtime, Rasa raises an error during execution. By default, this method returns an empty list, which means the graph component has no extra dependencies. An example:
- ["spacy"]: the Python package spacy needs to be installed to use this graph component.
IV. Model Persistence
Some graph components need to persist data during training that should be available to the graph component at inference time. A typical use case is storing model weights. For this purpose, Rasa provides the model_storage and resource parameters to your graph component's create and load methods, as shown in the snippet below. The model_storage provides access to the data of all graph components. The resource uniquely identifies the location of your graph component within the model storage.
- from __future__ import annotations
-
- from typing import Any, Dict, Text
-
- from rasa.engine.graph import GraphComponent, ExecutionContext
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
-
- class MyComponent(GraphComponent):
-     @classmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> MyComponent:
-         ...
-
-     @classmethod
-     def load(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-         **kwargs: Any
-     ) -> MyComponent:
-         ...
1. Writing to the model storage
The snippet below illustrates how to write a graph component's data to the model storage. To persist your graph component after training, the train method needs access to the values of model_storage and resource, so you should store them at initialization time.
Your graph component's train method must return the value of resource so that Rasa can cache the training results between trainings. The self._model_storage.write_to(self._resource) context manager provides the path to a directory where you can persist any data your graph component requires.
- from __future__ import annotations
- import json
- from typing import Optional, Dict, Any, Text
-
- from rasa.engine.graph import GraphComponent, ExecutionContext
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
- from rasa.shared.nlu.training_data.training_data import TrainingData
-
- class MyComponent(GraphComponent):
-
-     def __init__(
-         self,
-         model_storage: ModelStorage,
-         resource: Resource,
-         training_artifact: Optional[Dict],
-     ) -> None:
-         # Store both `model_storage` and `resource` as object attributes to be able
-         # to utilize them at the end of the training
-         self._model_storage = model_storage
-         self._resource = resource
-
-     @classmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> MyComponent:
-         return cls(model_storage, resource, training_artifact=None)
-
-     def train(self, training_data: TrainingData) -> Resource:
-         # Train your graph component
-         ...
-
-         # Persist your graph component
-         with self._model_storage.write_to(self._resource) as directory_path:
-             with open(directory_path / "artifact.json", "w") as file:
-                 json.dump({"my": "training artifact"}, file)
-
-         # Return resource to make sure the training artifacts
-         # can be cached.
-         return self._resource
2. Reading from the model storage
Rasa calls your graph component's load method to instantiate it for inference. You can use the context manager self._model_storage.read_from(resource) to get the path to the directory where your graph component's data was saved. Using the provided path, you can then load the saved data and initialize your graph component with it. Note that model_storage raises a ValueError if no persisted data is found for the given resource.
- from __future__ import annotations
- import json
- from typing import Optional, Dict, Any, Text
-
- from rasa.engine.graph import GraphComponent, ExecutionContext
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
-
- class MyComponent(GraphComponent):
-
-     def __init__(
-         self,
-         model_storage: ModelStorage,
-         resource: Resource,
-         training_artifact: Optional[Dict],
-     ) -> None:
-         self._model_storage = model_storage
-         self._resource = resource
-
-     @classmethod
-     def load(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-         **kwargs: Any,
-     ) -> MyComponent:
-         try:
-             with model_storage.read_from(resource) as directory_path:
-                 with open(directory_path / "artifact.json", "r") as file:
-                     training_artifact = json.load(file)
-                     return cls(
-                         model_storage, resource, training_artifact=training_artifact
-                     )
-         except ValueError:
-             # This allows you to handle the case if there was no
-             # persisted data for your component
-             ...
V. Registering Graph Components with the Model Configuration
To make your graph component available to Rasa, you may have to register it with a recipe. Rasa uses recipes to translate the content of the model configuration into an executable graph. Currently, Rasa supports the default.v1 recipe and the experimental graph.v1 recipe. For the default.v1 recipe, you need to register your graph component with the DefaultV1Recipe.register decorator:
- from rasa.engine.graph import GraphComponent
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
-
- @DefaultV1Recipe.register(
-     component_types=[DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER],
-     is_trainable=True,
-     model_from="SpacyNLP",
- )
- class MyComponent(GraphComponent):
-     ...
Rasa uses the information provided in the register decorator, together with the position of the graph component in the configuration file, to schedule the execution of the graph component with its required data. The DefaultV1Recipe.register decorator lets you specify the following details:
1. component_types
Specifies what purpose the graph component fulfills within the assistant. It is possible to specify multiple types (e.g. if the graph component is both an intent classifier and an entity extractor):
(1) ComponentType.MODEL_LOADER
Component type for language models. Graph components of this type provide a pretrained model to the train, process_training_data and process methods of other graph components that have specified model_from=<model loader name>. This graph component runs during training and inference. Rasa uses the graph component's provide method to retrieve the model that should be passed to the dependent graph components. A rough sketch follows.
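A rough sketch of this pattern, under the assumption that the loader exposes the model through a provide method (as Rasa's own SpacyNLP loader does); the class name and model object here are hypothetical:
- from typing import Any
-
- from rasa.engine.graph import GraphComponent
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
-
- @DefaultV1Recipe.register(
-     [DefaultV1Recipe.ComponentType.MODEL_LOADER], is_trainable=False
- )
- class MyModelLoader(GraphComponent):
-     # The usual `create` classmethod is omitted for brevity.
-     def provide(self) -> Any:
-         # Return the pretrained model object. Components registered with
-         # `model_from="MyModelLoader"` receive it in their `train`,
-         # `process_training_data` and `process` methods.
-         ...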
VI. Using Custom Components in the Model Configuration
You can use custom graph components in your model configuration just like any other NLU component or policy. The only change is that you must specify the full module name instead of only the class name. The full module name depends on the module's location relative to the specified PYTHONPATH. By default, Rasa adds the directory from which you run the CLI to the PYTHONPATH. For example, if you run the CLI from /Users/<user>/my-rasa-project and the module MyComponent is in /Users/<user>/my-rasa-project/custom_components/my_component.py, the module path is custom_components.my_component.MyComponent. Everything except the name entry is passed to the component as config. The config.yml file would look like this:
- recipe: default.v1
- language: en
- pipeline:
-   # other NLU components
-   - name: your.custom.NLUComponent
-     setting_a: 0.01
-     setting_b: string_value
-
- policies:
-   # other dialogue policies
-   - name: your.custom.Policy
VII. Implementation Hints
1. Message metadata
When you define metadata for your intent examples in the training data, your NLU component can access both the intent metadata and the intent example metadata during processing, as shown below. An example of how this metadata is defined in the training data follows the snippet.
- # in your component class
- def process(self, message: Message, **kwargs: Any) -> None:
-     metadata = message.get("metadata")
-     print(metadata.get("intent"))
-     print(metadata.get("example"))
2. Sparse and dense message features
If you create a custom message featurizer, you can return two different kinds of features: sequence features and sentence features. Sequence features are a matrix of size (number-of-tokens x feature-dimension), i.e. the matrix contains a feature vector for every token in the sequence. Sentence features are represented by a matrix of size (1 x feature-dimension). A concrete sketch of the two shapes follows.
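As a sketch: for a message with 2 tokens and a feature dimension of 4, the two matrices would have the following shapes (dummy numpy arrays for illustration):
- import numpy as np
-
- # Sequence features: one row per token -> shape (2, 4)
- sequence_features = np.zeros((2, 4))
- # Sentence features: a single row for the whole message -> shape (1, 4)
- sentence_features = np.zeros((1, 4))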
VIII. Examples of Custom Components
1. Dense message featurizer
The following is an example of a dense message featurizer that uses a pretrained model:
- import numpy as np
- import logging
- from bpemb import BPEmb
- from typing import Any, Text, Dict, List, Type
-
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
- from rasa.engine.graph import ExecutionContext, GraphComponent
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
- from rasa.nlu.featurizers.dense_featurizer.dense_featurizer import DenseFeaturizer
- from rasa.nlu.tokenizers.tokenizer import Tokenizer
- from rasa.shared.nlu.training_data.training_data import TrainingData
- from rasa.shared.nlu.training_data.features import Features
- from rasa.shared.nlu.training_data.message import Message
- from rasa.nlu.constants import (
-     DENSE_FEATURIZABLE_ATTRIBUTES,
-     FEATURIZER_CLASS_ALIAS,
- )
- from rasa.shared.nlu.constants import (
-     TEXT,
-     TEXT_TOKENS,
-     FEATURE_TYPE_SENTENCE,
-     FEATURE_TYPE_SEQUENCE,
- )
-
-
- logger = logging.getLogger(__name__)
-
-
- @DefaultV1Recipe.register(
-     DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER, is_trainable=False
- )
- class BytePairFeaturizer(DenseFeaturizer, GraphComponent):
-     @classmethod
-     def required_components(cls) -> List[Type]:
-         """Components that should be included in the pipeline before this component."""
-         return [Tokenizer]
-
-     @staticmethod
-     def required_packages() -> List[Text]:
-         """Any extra python dependencies required for this component to run."""
-         return ["bpemb"]
-
-     @staticmethod
-     def get_default_config() -> Dict[Text, Any]:
-         """Returns the component's default config."""
-         return {
-             **DenseFeaturizer.get_default_config(),
-             # specifies the language of the subword segmentation model
-             "lang": None,
-             # specifies the dimension of the subword embeddings
-             "dim": None,
-             # specifies the vocabulary size of the segmentation model
-             "vs": None,
-             # if set to True and the given vocabulary size can't be loaded for the given
-             # model, the closest size is chosen
-             "vs_fallback": True,
-         }
-
-     def __init__(
-         self,
-         config: Dict[Text, Any],
-         name: Text,
-     ) -> None:
-         """Constructs a new byte pair vectorizer."""
-         super().__init__(name, config)
-         # The configuration dictionary is saved in `self._config` for reference.
-         self.model = BPEmb(
-             lang=self._config["lang"],
-             dim=self._config["dim"],
-             vs=self._config["vs"],
-             vs_fallback=self._config["vs_fallback"],
-         )
-
-     @classmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> GraphComponent:
-         """Creates a new component (see parent class for full docstring)."""
-         return cls(config, execution_context.node_name)
-
-     def process(self, messages: List[Message]) -> List[Message]:
-         """Processes incoming messages and computes and sets features."""
-         for message in messages:
-             for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
-                 self._set_features(message, attribute)
-         return messages
-
-     def process_training_data(self, training_data: TrainingData) -> TrainingData:
-         """Processes the training examples in the given training data in-place."""
-         self.process(training_data.training_examples)
-         return training_data
-
-     def _create_word_vector(self, document: Text) -> np.ndarray:
-         """Creates a word vector from a text. Utility method."""
-         encoded_ids = self.model.encode_ids(document)
-         if encoded_ids:
-             return self.model.vectors[encoded_ids[0]]
-
-         # Use `self._config` (where the merged config is stored) for the dimension.
-         return np.zeros((self._config["dim"],), dtype=np.float32)
-
- def _set_features(self, message: Message, attribute: Text = TEXT) -> None:
- """Sets the features on a single message. Utility method."""
- tokens = message.get(TEXT_TOKENS)
-
- # If the message doesn't have tokens, we can't create features.
- if not tokens:
- return None
-
- # We need to reshape here such that the shape is equivalent to that of sparsely
- # generated features. Without it, it'd be a 1D tensor. We need 2D (n_utterance, n_dim).
- text_vector = self._create_word_vector(document=message.get(TEXT)).reshape(
- 1, -1
- )
- word_vectors = np.array(
- [self._create_word_vector(document=t.text) for t in tokens]
- )
-
- final_sequence_features = Features(
- word_vectors,
- FEATURE_TYPE_SEQUENCE,
- attribute,
- self._config[FEATURIZER_CLASS_ALIAS],
- )
- message.add_features(final_sequence_features)
- final_sentence_features = Features(
- text_vector,
- FEATURE_TYPE_SENTENCE,
- attribute,
- self._config[FEATURIZER_CLASS_ALIAS],
- )
- message.add_features(final_sentence_features)
-
- @classmethod
- def validate_config(cls, config: Dict[Text, Any]) -> None:
- """Validates that the component is configured properly."""
- if not config["lang"]:
- raise ValueError("BytePairFeaturizer needs language setting via `lang`.")
- if not config["dim"]:
- raise ValueError(
- "BytePairFeaturizer needs dimensionality setting via `dim`."
- )
- if not config["vs"]:
- raise ValueError("BytePairFeaturizer needs a vector size setting via `vs`.")
2. Sparse message featurizer
The following is an example of a sparse message featurizer that trains a new model:
- import logging
- from typing import Any, Text, Dict, List, Type
-
- from sklearn.feature_extraction.text import TfidfVectorizer
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
- from rasa.engine.graph import ExecutionContext, GraphComponent
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
- from rasa.nlu.featurizers.sparse_featurizer.sparse_featurizer import SparseFeaturizer
- from rasa.nlu.tokenizers.tokenizer import Tokenizer
- from rasa.shared.nlu.training_data.training_data import TrainingData
- from rasa.shared.nlu.training_data.features import Features
- from rasa.shared.nlu.training_data.message import Message
- from rasa.nlu.constants import (
-     DENSE_FEATURIZABLE_ATTRIBUTES,
-     FEATURIZER_CLASS_ALIAS,
- )
- from joblib import dump, load
- from rasa.shared.nlu.constants import (
-     TEXT,
-     TEXT_TOKENS,
-     FEATURE_TYPE_SENTENCE,
-     FEATURE_TYPE_SEQUENCE,
- )
-
- logger = logging.getLogger(__name__)
-
-
- @DefaultV1Recipe.register(
-     DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER, is_trainable=True
- )
- class TfIdfFeaturizer(SparseFeaturizer, GraphComponent):
-     @classmethod
-     def required_components(cls) -> List[Type]:
-         """Components that should be included in the pipeline before this component."""
-         return [Tokenizer]
-
-     @staticmethod
-     def required_packages() -> List[Text]:
-         """Any extra python dependencies required for this component to run."""
-         return ["sklearn"]
-
-     @staticmethod
-     def get_default_config() -> Dict[Text, Any]:
-         """Returns the component's default config."""
-         return {
-             **SparseFeaturizer.get_default_config(),
-             "analyzer": "word",
-             "min_ngram": 1,
-             "max_ngram": 1,
-         }
-
-     def __init__(
-         self,
-         config: Dict[Text, Any],
-         name: Text,
-         model_storage: ModelStorage,
-         resource: Resource,
-     ) -> None:
-         """Constructs a new tf/idf vectorizer using the sklearn framework."""
-         super().__init__(name, config)
-         # Initialize the tfidf sklearn component
-         self.tfm = TfidfVectorizer(
-             analyzer=config["analyzer"],
-             ngram_range=(config["min_ngram"], config["max_ngram"]),
-         )
-
-         # We need to use these later when saving the trained component.
-         self._model_storage = model_storage
-         self._resource = resource
-
-     def train(self, training_data: TrainingData) -> Resource:
-         """Trains the component from training data."""
-         texts = [e.get(TEXT) for e in training_data.training_examples if e.get(TEXT)]
-         self.tfm.fit(texts)
-         self.persist()
-         return self._resource
-
-     @classmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> GraphComponent:
-         """Creates a new untrained component (see parent class for full docstring)."""
-         return cls(config, execution_context.node_name, model_storage, resource)
-
-     def _set_features(self, message: Message, attribute: Text = TEXT) -> None:
-         """Sets the features on a single message. Utility method."""
-         tokens = message.get(TEXT_TOKENS)
-
-         # If the message doesn't have tokens, we can't create features.
-         if not tokens:
-             return None
-
-         # Make distinction between sentence and sequence features
-         text_vector = self.tfm.transform([message.get(TEXT)])
-         word_vectors = self.tfm.transform([t.text for t in tokens])
-
-         final_sequence_features = Features(
-             word_vectors,
-             FEATURE_TYPE_SEQUENCE,
-             attribute,
-             self._config[FEATURIZER_CLASS_ALIAS],
-         )
-         message.add_features(final_sequence_features)
-         final_sentence_features = Features(
-             text_vector,
-             FEATURE_TYPE_SENTENCE,
-             attribute,
-             self._config[FEATURIZER_CLASS_ALIAS],
-         )
-         message.add_features(final_sentence_features)
-     def process(self, messages: List[Message]) -> List[Message]:
-         """Processes incoming messages and computes and sets features."""
-         for message in messages:
-             for attribute in DENSE_FEATURIZABLE_ATTRIBUTES:
-                 self._set_features(message, attribute)
-         return messages
-
-     def process_training_data(self, training_data: TrainingData) -> TrainingData:
-         """Processes the training examples in the given training data in-place."""
-         self.process(training_data.training_examples)
-         return training_data
-
-     def persist(self) -> None:
-         """
-         Persist this model into the passed directory.
-
-         Returns the metadata necessary to load the model again. In this case; `None`.
-         """
-         with self._model_storage.write_to(self._resource) as model_dir:
-             dump(self.tfm, model_dir / "tfidfvectorizer.joblib")
-
-     @classmethod
-     def load(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> GraphComponent:
-         """Loads the trained component from disk."""
-         # Construct the component first so it can be returned even if no
-         # persisted data is found (it then falls back to an untrained state).
-         component = cls(
-             config, execution_context.node_name, model_storage, resource
-         )
-         try:
-             with model_storage.read_from(resource) as model_dir:
-                 component.tfm = load(model_dir / "tfidfvectorizer.joblib")
-         except (ValueError, FileNotFoundError):
-             logger.debug(
-                 f"Couldn't load metadata for component '{cls.__name__}' as the persisted "
-                 f"model data couldn't be loaded."
-             )
-         return component
-
-     @classmethod
-     def validate_config(cls, config: Dict[Text, Any]) -> None:
-         """Validates that the component is configured properly."""
-         pass
IX. NLU Meta-learners
NLU meta-learners are an advanced use case. The following section is only relevant if you have a component that learns parameters based on the output of previous classifiers. For components with manually set parameters or logic, you can create a component with is_trainable=False and not worry about the preceding classifiers.
NLU meta-learners are intent classifiers or entity extractors that use the predictions of other trained intent classifiers or entity extractors and try to improve upon their results. An example of a meta-learner would be a component that averages the output of two previous intent classifiers, or a fallback classifier that sets its threshold according to the confidence of the intent classifier on the training examples.
Conceptually, to build a trainable fallback classifier, you first need to create that fallback classifier as a custom component:
- from typing import Dict, Text, Any, List
-
- from rasa.engine.graph import GraphComponent, ExecutionContext
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
- from rasa.engine.storage.resource import Resource
- from rasa.engine.storage.storage import ModelStorage
- from rasa.shared.nlu.training_data.message import Message
- from rasa.shared.nlu.training_data.training_data import TrainingData
- from rasa.nlu.classifiers.fallback_classifier import FallbackClassifier
-
- @DefaultV1Recipe.register(
-     [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER], is_trainable=True
- )
- class MetaFallback(FallbackClassifier):
-
-     def __init__(
-         self,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> None:
-         super().__init__(config)
-
-         self._model_storage = model_storage
-         self._resource = resource
-
-     @classmethod
-     def create(
-         cls,
-         config: Dict[Text, Any],
-         model_storage: ModelStorage,
-         resource: Resource,
-         execution_context: ExecutionContext,
-     ) -> FallbackClassifier:
-         """Creates a new untrained component (see parent class for full docstring)."""
-         return cls(config, model_storage, resource, execution_context)
-
-     def train(self, training_data: TrainingData) -> Resource:
-         # Do something here with the messages
-         return self._resource
Next, you need to create a custom intent classifier that is also a featurizer, since the classifier's output needs to be consumed by another component downstream. For the custom intent classifier component, you also need to define how its predictions should be added to the message data in the process_training_data method. Make sure not to overwrite the true intent labels. Here is a template that shows how to subclass DIET for this purpose:
- from rasa.engine.recipes.default_recipe import DefaultV1Recipe
- from rasa.shared.nlu.training_data.training_data import TrainingData
- from rasa.nlu.classifiers.diet_classifier import DIETClassifier
-
- @DefaultV1Recipe.register(
-     [DefaultV1Recipe.ComponentType.INTENT_CLASSIFIER,
-      DefaultV1Recipe.ComponentType.ENTITY_EXTRACTOR,
-      DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER], is_trainable=True
- )
- class DIETFeaturizer(DIETClassifier):
-
-     def process_training_data(self, training_data: TrainingData) -> TrainingData:
-         # classify and add the attributes to the messages on the training data
-         return training_data
References:
[1] Custom Graph Components: https://rasa.com/docs/rasa/custom-graph-components