ActiveFence Integration

NeMo Guardrails supports using the ActiveFence ActiveScore API as an input and output rail out of the box (you need to have the `ACTIVEFENCE_API_KEY` environment variable set).

```yaml
rails:
  input:
    flows:
      # The simplified version
      - activefence moderation on input

      # The detailed version with individual risk scores
      # - activefence moderation on input detailed
```

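The `activefence moderation on input` flow uses the maximum risk score across all violation types, together with a threshold of 0.85, to decide whether the text should be allowed (i.e., if the risk score is above the threshold, the text is considered a violation). The `activefence moderation on input detailed` flow works with the individual score of each violation category instead.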
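The decision rule of the simplified flow can be illustrated in plain Python. This is a minimal sketch, not the NeMo Guardrails implementation: it assumes the per-category risk scores are available as a dictionary mapping each violation type to a score, and the helper names `highest_risk_score`, `is_blocked`, and `DEFAULT_THRESHOLD`, as well as the sample type `example.other_violation`, are hypothetical.

```python
# Hypothetical sketch of the "maximum risk score vs. threshold" rule.
# Not the NeMo Guardrails source; names here are illustrative only.
from typing import Mapping

DEFAULT_THRESHOLD = 0.85  # threshold used by the simplified flow


def highest_risk_score(violations: Mapping[str, float]) -> float:
    """Return the largest risk score across all violation types (0.0 if none)."""
    return max(violations.values(), default=0.0)


def is_blocked(violations: Mapping[str, float], threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Treat the text as a violation when the maximum score is strictly above the threshold."""
    return highest_risk_score(violations) > threshold


if __name__ == "__main__":
    # Sample scores; "example.other_violation" is a made-up violation type.
    scores = {"abusive_or_harmful.hate_speech": 0.91, "example.other_violation": 0.10}
    print(highest_risk_score(scores))  # 0.91
    print(is_blocked(scores))  # True
    print(is_blocked({"abusive_or_harmful.hate_speech": 0.85}))  # False: equal, not above
```

The comparison is strictly "greater than", so a score exactly equal to the threshold is allowed, matching the `>` used in the Colang flow definitions.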

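To customize the thresholds, you have to override the default flows in your config. For example, to change the threshold for `activefence moderation on input`, you can add the following flow to your config: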

```colang
define subflow activefence moderation on input
  """Guardrail based on the maximum risk score."""
  $result = execute call activefence api

  # Change 0.85 to the threshold you want to enforce.
  if $result.max_risk_score > 0.85
    bot inform cannot answer
    stop
```

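ActiveFence's ActiveScore API gives you the flexibility to control the behavior of each supported violation type individually. To take advantage of this, you can use the violations dictionary (`violations_dict`, exposed to flows as `$result.violations`), one of the outputs of the API call, to set different thresholds for different violations. Below is an example of one such input moderation flow: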

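```colang
# Custom flow: list it under rails.input.flows in your config to enable it.
define flow activefence input moderation detailed
  $result = execute call activefence api

  # Block hate speech above its own threshold (0.8); a missing type counts as 0.
  if $result.violations.get("abusive_or_harmful.hate_speech", 0) > 0.8
    bot inform cannot engage in abusive or harmful behavior
    stop

# Message returned when the flow above blocks the input.
define bot inform cannot engage in abusive or harmful behavior
  "I will not engage in any abusive or harmful behavior."
```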
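The per-violation check in the flow above can be generalized to a table of thresholds. The following is a minimal Python sketch of that idea, not part of NeMo Guardrails: it assumes the violations dictionary maps violation types to risk scores, and the names `THRESHOLDS` and `exceeded_violations` are hypothetical. Only the hate-speech entry and its 0.8 threshold come from the example flow.

```python
# Hypothetical sketch: per-violation thresholds applied to a violations dictionary.
# Mirrors the `.get(violation_type, 0) > threshold` check in the Colang flow.
from typing import Dict, List, Mapping

# Threshold table; extend it with other violation types as needed.
THRESHOLDS: Dict[str, float] = {
    "abusive_or_harmful.hate_speech": 0.8,
}


def exceeded_violations(
    violations: Mapping[str, float],
    thresholds: Mapping[str, float] = THRESHOLDS,  # read-only, never mutated
) -> List[str]:
    """Return the violation types whose score is strictly above their own threshold.

    A type missing from `violations` is treated as score 0, like `.get(..., 0)`.
    """
    return [
        name
        for name, limit in thresholds.items()
        if violations.get(name, 0) > limit
    ]


if __name__ == "__main__":
    print(exceeded_violations({"abusive_or_harmful.hate_speech": 0.83}))
    # ['abusive_or_harmful.hate_speech']
    print(exceeded_violations({}))  # []
```

Keeping the thresholds in one mapping makes them data-driven, whereas the Colang version expresses each threshold as its own `if` block; either way, a non-empty result corresponds to the flow informing the user and stopping.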