Sitemap
Schema-driven source documentation.
SITEMAP: 31 fields, 3 examples
Commonly Asked Questions
Assistant knowledge mapped to this source type from assistant_knowledge.json.
Required
Fields required for a valid configuration payload under `config.required`.
| Path | Type | Required | Description | Default | Constraints |
|---|---|---|---|---|---|
| required | object | Yes | — | — | no extra properties |
| required.sitemap_url | string | Yes | Website sitemap URL (for example, https://example.com/sitemap.xml) | — | format uri |
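The constraints in the table above can be checked client-side before a payload is submitted. The following is a minimal sketch, not the service's own validator: the function name `validate_required` is hypothetical, and "format uri" is approximated here as "has a scheme and a host", which may be looser or stricter than the real schema check.

```python
from urllib.parse import urlparse


def validate_required(required: dict) -> list:
    """Return a list of problems found in the `required` object (empty if none)."""
    problems = []

    # "no extra properties": sitemap_url is the only documented key.
    extra = set(required) - {"sitemap_url"}
    if extra:
        problems.append(f"unexpected properties: {sorted(extra)}")

    url = required.get("sitemap_url")
    if not isinstance(url, str):
        problems.append("sitemap_url must be a string")
    else:
        parsed = urlparse(url)
        # Assumption: "format uri" is treated as "absolute URL with scheme and host".
        if not parsed.scheme or not parsed.netloc:
            problems.append("sitemap_url must be an absolute URL")

    return problems
```

Calling `validate_required({"sitemap_url": "https://example.com/sitemap.xml"})` returns an empty list, while a value without a scheme, such as `example.com/sitemap.xml`, is reported as a problem.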
Masked
Sensitive fields under `config.masked` (secrets/credentials).
No fields in this section.
Optional
Optional configuration fields under `config.optional`.
| Path | Type | Required | Description | Default | Constraints |
|---|---|---|---|---|---|
| optional | object | No | — | — | no extra properties |
| optional.assets | object | No | Related asset extraction and lineage controls. | — | no extra properties |
| optional.assets.fetch_related_assets | boolean | No | Fetch linked media/doc assets and persist them as lineage-linked assets | true | — |
| optional.assets.include_external_links | boolean | No | Include external links in page lineage/hashes | false | — |
| optional.assets.max_asset_bytes | integer | No | Maximum number of bytes to download per linked asset when inferring type/text | 5242880 | min 1024 |
| optional.assets.max_related_assets_per_page | integer | No | Maximum number of non-HTML linked assets to materialize per crawled page | 40 | min 0 |
| optional.crawl | object | No | Crawl behaviour and request-level limits. | — | no extra properties |
| optional.crawl.crawl_page_timeout_ms | integer | No | Per-page browser timeout in milliseconds | 120000 | min 1000 |
| optional.crawl.max_nested_sitemaps | integer | No | Maximum number of nested sitemap documents to traverse | 100 | min 1 |
| optional.crawl.request_timeout_seconds | number | No | HTTP timeout, in seconds, for sitemap and linked asset requests | 30 | min 1 |
| optional.crawl.user_agent | string | No | Optional user agent for sitemap and asset HTTP requests | — | — |
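To see how the defaults and minimums in the table above interact, the sketch below merges a partial `optional` object over the documented defaults and rejects values below the documented minimums. The names `DEFAULTS`, `MINIMUMS`, `ALLOWED`, and `resolve_optional` are hypothetical, and the merge behaviour is an assumption for illustration; the service may resolve defaults differently.

```python
# Defaults and minimums copied from the Optional table.
DEFAULTS = {
    "assets": {
        "fetch_related_assets": True,
        "include_external_links": False,
        "max_asset_bytes": 5242880,
        "max_related_assets_per_page": 40,
    },
    "crawl": {
        "crawl_page_timeout_ms": 120000,
        "max_nested_sitemaps": 100,
        "request_timeout_seconds": 30,
        # user_agent has no documented default, so it is left unset.
    },
}

MINIMUMS = {
    ("assets", "max_asset_bytes"): 1024,
    ("assets", "max_related_assets_per_page"): 0,
    ("crawl", "crawl_page_timeout_ms"): 1000,
    ("crawl", "max_nested_sitemaps"): 1,
    ("crawl", "request_timeout_seconds"): 1,
}

# "no extra properties": only the documented keys are accepted.
ALLOWED = {
    "assets": set(DEFAULTS["assets"]),
    "crawl": set(DEFAULTS["crawl"]) | {"user_agent"},
}


def resolve_optional(optional=None):
    """Merge supplied optional fields over the defaults and check minimums."""
    optional = optional or {}
    unknown_groups = set(optional) - set(ALLOWED)
    if unknown_groups:
        raise ValueError(f"unknown optional groups: {sorted(unknown_groups)}")

    resolved = {}
    for group, defaults in DEFAULTS.items():
        supplied = optional.get(group, {})
        unknown = set(supplied) - ALLOWED[group]
        if unknown:
            raise ValueError(f"unknown optional.{group} fields: {sorted(unknown)}")
        resolved[group] = {**defaults, **supplied}

    for (group, field), minimum in MINIMUMS.items():
        value = resolved[group][field]
        if value < minimum:
            raise ValueError(
                f"optional.{group}.{field} must be >= {minimum}, got {value}"
            )
    return resolved
```

For example, `resolve_optional({"crawl": {"max_nested_sitemaps": 50}})` keeps the other crawl and asset fields at their documented defaults, while `resolve_optional({"assets": {"max_asset_bytes": 100}})` raises `ValueError` because the value is below the 1024-byte minimum.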
Examples
Reference payloads generated from shared source examples JSON.
Website content index scan
Ingest website pages from a sitemap with related media lineage
Schedule
{
  "enabled": true,
  "preset": "nightly",
  "cron": "28 0 * * *",
  "timezone": "UTC"
}
Config Payload
{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://example.com/sitemap.xml"
  },
  "optional": {
    "crawl": {
      "max_nested_sitemaps": 50
    },
    "assets": {
      "fetch_related_assets": true,
      "include_external_links": false
    }
  },
  "sampling": {
    "strategy": "LATEST",
    "limit": 25
  }
}
Website index scan (validation run)
Start with a 10-page cap to validate parsing and extraction quality
Schedule
{
  "enabled": true,
  "preset": "weekly",
  "cron": "41 1 * * 0",
  "timezone": "UTC"
}
Config Payload
{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://www.example.com/sitemap.xml"
  },
  "optional": {
    "crawl": {
      "crawl_page_timeout_ms": 120000,
      "request_timeout_seconds": 30
    },
    "assets": {
      "fetch_related_assets": true
    }
  },
  "sampling": {
    "strategy": "RANDOM",
    "limit": 10
  }
}
Full website crawl with link integrity detectors
Run broken-link checks across all sitemap pages and their extracted links
Schedule
{
  "enabled": true,
  "preset": "nightly",
  "cron": "7 1 * * *",
  "timezone": "UTC"
}
Config Payload
{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://example.com/sitemap.xml"
  },
  "sampling": {
    "strategy": "ALL"
  },
  "detectors": [
    {
      "type": "BROKEN_LINKS",
      "enabled": true
    }
  ]
}
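The `cron` values in the three Schedule objects above are five-field expressions. As a reading aid, the sketch below splits such an expression into named fields. It assumes the scheduler follows the conventional five-field cron layout (minute, hour, day of month, month, day of week), which this page does not state explicitly; `CRON_FIELDS` and `split_cron` are hypothetical names.

```python
# Assumption: conventional five-field cron order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict:
    """Map each whitespace-separated cron field to its conventional name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))
```

Under that assumption, `"28 0 * * *"` reads as 00:28 every day, `"7 1 * * *"` as 01:07 every day, and `"41 1 * * 0"` as 01:41 on day-of-week 0 (Sunday in conventional cron), all in the schedule's `timezone` (UTC in these examples). This matches the `nightly` and `weekly` presets shown alongside them.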