Sitemap

Schema-driven source documentation.

SITEMAP | 31 fields | 3 examples

Commonly Asked Questions

Assistant knowledge mapped to this source type from assistant_knowledge.json.

Required

Fields required for a valid configuration payload under `config.required`.

Path	Type	Required	Description	Default	Constraints
required	object	Yes	—	—	no extra properties
required.sitemap_url	string	Yes	Website sitemap URL (for example, https://example.com/sitemap.xml)	—	format uri
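
The constraints in the table above can be checked on the client before a payload is submitted. The following is a minimal sketch, not the service's own validator: the function name `validate_required` is hypothetical, and the scheme-plus-host test is only an approximation of the `format uri` constraint.

```python
from urllib.parse import urlparse


def validate_required(required: dict) -> list[str]:
    """Return a list of problems found in the `required` object (empty if valid)."""
    errors = []

    # "no extra properties": sitemap_url is the only key listed in the table.
    extra = sorted(set(required) - {"sitemap_url"})
    if extra:
        errors.append(f"unexpected properties: {extra}")

    url = required.get("sitemap_url")
    if not isinstance(url, str):
        errors.append("sitemap_url is required and must be a string")
    else:
        parsed = urlparse(url)
        # Approximation of "format uri" (an assumption): need a scheme and a host.
        if not (parsed.scheme and parsed.netloc):
            errors.append("sitemap_url must be an absolute URI")

    return errors


print(validate_required({"sitemap_url": "https://example.com/sitemap.xml"}))
```

An empty list means the object passed every check in this sketch; the service may apply stricter rules.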

Masked

Sensitive fields under `config.masked` (secrets/credentials).

No fields in this section.

Optional

Optional configuration fields under `config.optional`.

Path	Type	Required	Description	Default	Constraints
optional	object	No	—	—	no extra properties
optional.assets	object	No	Related asset extraction and lineage controls.	—	no extra properties
optional.assets.fetch_related_assets	boolean	No	Fetch linked media/doc assets and persist them as lineage-linked assets	true	—
optional.assets.include_external_links	boolean	No	Include external links in page lineage/hashes	false	—
optional.assets.max_asset_bytes	integer	No	Maximum number of bytes to download per linked asset when inferring type/text	5242880	min 1024
optional.assets.max_related_assets_per_page	integer	No	Maximum number of non-HTML linked assets to materialize per crawled page	40	min 0
optional.crawl	object	No	Crawl behaviour and request-level limits.	—	no extra properties
optional.crawl.crawl_page_timeout_ms	integer	No	Per-page browser timeout in milliseconds	120000	min 1000
optional.crawl.max_nested_sitemaps	integer	No	Maximum number of nested sitemap documents to traverse	100	min 1
optional.crawl.request_timeout_seconds	number	No	HTTP timeout, in seconds, for sitemap and linked asset requests	30	min 1
optional.crawl.user_agent	string	No	Optional user agent for sitemap and asset HTTP requests	—	—
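
To see how the defaults and minimums in the table above combine with a partial `optional` object, here is a rough sketch. The helper name `resolve_optional` is hypothetical, and the assumption that omitted fields fall back to the listed defaults on the client side is illustrative; the service may resolve defaults differently.

```python
import copy

# Defaults and minimums copied from the table above; user_agent has no default.
DEFAULTS = {
    "assets": {
        "fetch_related_assets": True,
        "include_external_links": False,
        "max_asset_bytes": 5242880,
        "max_related_assets_per_page": 40,
    },
    "crawl": {
        "crawl_page_timeout_ms": 120000,
        "max_nested_sitemaps": 100,
        "request_timeout_seconds": 30,
    },
}
MINIMUMS = {
    ("assets", "max_asset_bytes"): 1024,
    ("assets", "max_related_assets_per_page"): 0,
    ("crawl", "crawl_page_timeout_ms"): 1000,
    ("crawl", "max_nested_sitemaps"): 1,
    ("crawl", "request_timeout_seconds"): 1,
}
# Fields that are allowed but carry no default value.
NO_DEFAULT = {("crawl", "user_agent")}


def resolve_optional(optional: dict) -> dict:
    """Overlay user-supplied values on the defaults and enforce the minimums."""
    resolved = copy.deepcopy(DEFAULTS)
    for group, values in optional.items():
        if group not in DEFAULTS:
            raise ValueError(f"unknown property: optional.{group}")
        for key, value in values.items():
            if key not in DEFAULTS[group] and (group, key) not in NO_DEFAULT:
                raise ValueError(f"unknown property: optional.{group}.{key}")
            resolved[group][key] = value
    for (group, key), minimum in MINIMUMS.items():
        if resolved[group][key] < minimum:
            raise ValueError(f"optional.{group}.{key} must be >= {minimum}")
    return resolved


print(resolve_optional({"crawl": {"max_nested_sitemaps": 50}}))
```

Rejecting unknown keys mirrors the "no extra properties" constraint on `optional`, `optional.assets`, and `optional.crawl`.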

Examples

Reference payloads generated from shared source examples JSON.

Website content index scan

Ingest website pages from a sitemap with related media lineage

Schedule

{
  "enabled": true,
  "preset": "nightly",
  "cron": "28 0 * * *",
  "timezone": "UTC"
}
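
The `cron` value above reads more easily once it is split into named fields. This sketch assumes the scheduler uses standard five-field cron syntax (minute, hour, day of month, month, day of week), which the source does not state explicitly; `describe_cron` is a hypothetical helper.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr: str) -> dict:
    """Map a five-field cron expression onto named fields."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# The nightly schedule above: minute 28 of hour 0, every day.
print(describe_cron("28 0 * * *"))
```

Under that assumption, `28 0 * * *` with `"timezone": "UTC"` fires once a day at 00:28 UTC.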

Config Payload

{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://example.com/sitemap.xml"
  },
  "optional": {
    "crawl": {
      "max_nested_sitemaps": 50
    },
    "assets": {
      "fetch_related_assets": true,
      "include_external_links": false
    }
  },
  "sampling": {
    "strategy": "LATEST",
    "limit": 25
  }
}

Website index scan (validation run)

Start with a 10-page cap to validate parsing and extraction quality

Schedule

{
  "enabled": true,
  "preset": "weekly",
  "cron": "41 1 * * 0",
  "timezone": "UTC"
}

Config Payload

{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://www.example.com/sitemap.xml"
  },
  "optional": {
    "crawl": {
      "crawl_page_timeout_ms": 120000,
      "request_timeout_seconds": 30
    },
    "assets": {
      "fetch_related_assets": true
    }
  },
  "sampling": {
    "strategy": "RANDOM",
    "limit": 10
  }
}

Full website crawl with link integrity detectors

Run broken-link checks across all sitemap pages and their extracted links

Schedule

{
  "enabled": true,
  "preset": "nightly",
  "cron": "7 1 * * *",
  "timezone": "UTC"
}

Config Payload

{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://example.com/sitemap.xml"
  },
  "sampling": {
    "strategy": "ALL"
  },
  "detectors": [
    {
      "type": "BROKEN_LINKS",
      "enabled": true
    }
  ]
}
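
The three payloads above share one shape, so they can be assembled programmatically. The following sketch is illustrative only: `build_sitemap_payload` is a hypothetical helper, and the only sampling strategies and detector types it has been checked against are the ones that appear in the examples (`LATEST`, `RANDOM`, `ALL`, and `BROKEN_LINKS`).

```python
import json


def build_sitemap_payload(sitemap_url, strategy, limit=None, optional=None, detectors=()):
    """Assemble a SITEMAP config payload shaped like the examples above."""
    payload = {
        "type": "SITEMAP",
        "required": {"sitemap_url": sitemap_url},
    }
    if optional:
        payload["optional"] = optional
    payload["sampling"] = {"strategy": strategy}
    # The "ALL" example carries no limit, so the key is omitted when unset.
    if limit is not None:
        payload["sampling"]["limit"] = limit
    if detectors:
        payload["detectors"] = [{"type": name, "enabled": True} for name in detectors]
    return payload


full_crawl = build_sitemap_payload(
    "https://example.com/sitemap.xml",
    strategy="ALL",
    detectors=["BROKEN_LINKS"],
)
print(json.dumps(full_crawl, indent=2))
```

Calling it with `strategy="RANDOM"` and `limit=10` produces the sampling block of the validation-run example.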
