
Sitemap

Schema-driven source documentation.

SITEMAP · 31 fields · 3 examples
Commonly Asked Questions
Assistant knowledge mapped to this source type from assistant_knowledge.json.

Required
Fields required for a valid configuration payload under `config.required`.
| Path | Type | Required | Description | Default | Constraints |
| --- | --- | --- | --- | --- | --- |
| required | object | Yes | | | no extra properties |
| required.sitemap_url | string | Yes | Website sitemap URL (for example, https://example.com/sitemap.xml) | | format uri |
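
The constraints in the Required table can be checked before a payload is submitted. The sketch below is an illustration only: `validate_required` is a hypothetical helper, not part of Classifyre, and the scheme-plus-host test is a rough, stricter stand-in for the schema's `format uri` rule.

```python
from urllib.parse import urlparse

# The only property the Required table lists.
ALLOWED_REQUIRED_KEYS = {"sitemap_url"}


def validate_required(required):
    """Return a list of problems found in the `required` object (empty if valid)."""
    errors = []
    # "no extra properties": reject any key the table does not list.
    extra = set(required) - ALLOWED_REQUIRED_KEYS
    if extra:
        errors.append(f"unexpected properties: {sorted(extra)}")
    url = required.get("sitemap_url")
    if not isinstance(url, str):
        errors.append("sitemap_url is required and must be a string")
    else:
        parsed = urlparse(url)
        # Assumption: approximate "format uri" by requiring a scheme and a host.
        if not parsed.scheme or not parsed.netloc:
            errors.append("sitemap_url must be an absolute URI")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.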
Masked
Sensitive fields under `config.masked` (secrets/credentials).

No fields in this section.

Optional
Optional configuration fields under `config.optional`.
| Path | Type | Required | Description | Default | Constraints |
| --- | --- | --- | --- | --- | --- |
| optional | object | No | | | no extra properties |
| optional.assets | object | No | Related asset extraction and lineage controls. | | no extra properties |
| optional.assets.fetch_related_assets | boolean | No | Fetch linked media/doc assets and persist them as lineage-linked assets | true | |
| optional.assets.include_external_links | boolean | No | Include external links in page lineage/hashes | false | |
| optional.assets.max_asset_bytes | integer | No | Maximum number of bytes to download per linked asset when inferring type/text | 5242880 | min 1024 |
| optional.assets.max_related_assets_per_page | integer | No | Maximum number of non-HTML linked assets to materialize per crawled page | 40 | min 0 |
| optional.crawl | object | No | Crawl behaviour and request-level limits. | | no extra properties |
| optional.crawl.crawl_page_timeout_ms | integer | No | Per-page browser timeout in milliseconds | 120000 | min 1000 |
| optional.crawl.max_nested_sitemaps | integer | No | Maximum number of nested sitemap documents to traverse | 100 | min 1 |
| optional.crawl.request_timeout_seconds | number | No | HTTP timeout for sitemap and linked asset requests | 30 | min 1 |
| optional.crawl.user_agent | string | No | Optional user agent for sitemap and asset HTTP requests | | |
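
To see how the defaults and minimums in the Optional table interact with a partial payload, the sketch below merges user-supplied values over the documented defaults and then checks each `min` constraint. `resolve_optional` and the constant names are hypothetical; how Classifyre applies defaults internally is not stated in the source.

```python
import copy

# Defaults and minimums transcribed from the Optional table.
DEFAULTS = {
    "assets": {
        "fetch_related_assets": True,
        "include_external_links": False,
        "max_asset_bytes": 5242880,
        "max_related_assets_per_page": 40,
    },
    "crawl": {
        "crawl_page_timeout_ms": 120000,
        "max_nested_sitemaps": 100,
        "request_timeout_seconds": 30,
    },
}
MINIMUMS = {
    ("assets", "max_asset_bytes"): 1024,
    ("assets", "max_related_assets_per_page"): 0,
    ("crawl", "crawl_page_timeout_ms"): 1000,
    ("crawl", "max_nested_sitemaps"): 1,
    ("crawl", "request_timeout_seconds"): 1,
}
# user_agent has no default in the table, so it is allowed but never filled in.
ALLOWED = {
    "assets": set(DEFAULTS["assets"]),
    "crawl": set(DEFAULTS["crawl"]) | {"user_agent"},
}


def resolve_optional(optional=None):
    """Merge user-supplied optional settings over the defaults and check constraints."""
    optional = optional or {}
    unknown_groups = set(optional) - set(ALLOWED)
    if unknown_groups:
        raise ValueError(f"unexpected properties in optional: {sorted(unknown_groups)}")
    resolved = copy.deepcopy(DEFAULTS)
    for group, values in optional.items():
        unknown = set(values) - ALLOWED[group]
        if unknown:
            raise ValueError(f"unexpected properties in optional.{group}: {sorted(unknown)}")
        resolved[group].update(values)
    for (group, key), minimum in MINIMUMS.items():
        if resolved[group][key] < minimum:
            raise ValueError(f"optional.{group}.{key} must be >= {minimum}")
    return resolved
```

For instance, `resolve_optional({"crawl": {"max_nested_sitemaps": 50}})` keeps the override and fills every other documented default, while a `max_asset_bytes` of 10 is rejected because it is below the 1024 minimum.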
Examples
Reference payloads generated from shared source examples JSON.
Website content index scan
Ingest website pages from a sitemap with related media lineage

Schedule

{
  "enabled": true,
  "preset": "nightly",
  "cron": "28 0 * * *",
  "timezone": "UTC"
}

Config Payload

{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://example.com/sitemap.xml"
  },
  "optional": {
    "crawl": {
      "max_nested_sitemaps": 50
    },
    "assets": {
      "fetch_related_assets": true,
      "include_external_links": false
    }
  },
  "sampling": {
    "strategy": "LATEST",
    "limit": 25
  }
}
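
The `max_nested_sitemaps` value in the payload above caps how many nested sitemap documents are followed. As a rough picture of what such a cap means, the sketch below walks a sitemap index breadth-first using the standard sitemaps.org XML namespace. `collect_page_urls` and its `fetch` parameter are hypothetical, and stopping silently at the cap is an assumption; the source does not describe the crawler's actual behaviour.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_page_urls(root_url, fetch, max_nested_sitemaps=100):
    """Walk a sitemap tree breadth-first and return the page URLs it lists.

    `fetch` is a caller-supplied function mapping a URL to XML text, so the
    sketch stays independent of any particular HTTP client.
    """
    pages, seen = [], {root_url}
    queue = deque([root_url])
    nested_queued = 0
    while queue:
        doc = ET.fromstring(fetch(queue.popleft()))
        # <urlset> documents list pages directly.
        pages.extend(loc.text.strip() for loc in doc.findall("sm:url/sm:loc", NS))
        # <sitemapindex> documents point at further sitemap documents.
        for loc in doc.findall("sm:sitemap/sm:loc", NS):
            child = loc.text.strip()
            if child in seen:
                continue
            if nested_queued >= max_nested_sitemaps:
                break  # Assumed behaviour: stop following once the cap is hit.
            seen.add(child)
            queue.append(child)
            nested_queued += 1
    return pages
```

Tracking `seen` URLs keeps a sitemap index that references itself from looping forever, independently of the cap.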
Website index scan (validation run)
Start with a 10-page cap to validate parsing and extraction quality

Schedule

{
  "enabled": true,
  "preset": "weekly",
  "cron": "41 1 * * 0",
  "timezone": "UTC"
}

Config Payload

{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://www.example.com/sitemap.xml"
  },
  "optional": {
    "crawl": {
      "crawl_page_timeout_ms": 120000,
      "request_timeout_seconds": 30
    },
    "assets": {
      "fetch_related_assets": true
    }
  },
  "sampling": {
    "strategy": "RANDOM",
    "limit": 10
  }
}
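
The examples on this page use three `sampling.strategy` values (`LATEST`, `RANDOM`, `ALL`) together with an optional `limit`. The source does not define their semantics, so the sketch below is only one plausible reading: `sample_pages` is a hypothetical helper, and the meaning given to each strategy is an assumption spelled out in the docstring.

```python
import random


def sample_pages(pages, strategy, limit=None, seed=None):
    """Pick pages to crawl. `pages` is a list of (url, lastmod) tuples.

    Assumed semantics (not confirmed by the source):
      ALL    -> every page, `limit` ignored
      LATEST -> the `limit` pages with the most recent lastmod
      RANDOM -> `limit` pages chosen uniformly at random
    """
    if strategy == "ALL":
        return list(pages)
    if limit is None:
        raise ValueError(f"strategy {strategy} needs a limit")
    if strategy == "LATEST":
        # ISO 8601 dates in one format sort chronologically as plain strings.
        return sorted(pages, key=lambda p: p[1], reverse=True)[:limit]
    if strategy == "RANDOM":
        rng = random.Random(seed)  # seed makes a validation run repeatable
        return rng.sample(list(pages), min(limit, len(pages)))
    raise ValueError(f"unknown strategy: {strategy}")
```

Under this reading, the validation run above (`RANDOM`, `limit` 10) would crawl at most ten pages per run, which keeps a first check of parsing quality cheap.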
Full website crawl with link integrity detectors
Run broken-link checks across all sitemap pages and their extracted links

Schedule

{
  "enabled": true,
  "preset": "nightly",
  "cron": "7 1 * * *",
  "timezone": "UTC"
}

Config Payload

{
  "type": "SITEMAP",
  "required": {
    "sitemap_url": "https://example.com/sitemap.xml"
  },
  "sampling": {
    "strategy": "ALL"
  },
  "detectors": [
    {
      "type": "BROKEN_LINKS",
      "enabled": true
    }
  ]
}
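
The payload above enables a `BROKEN_LINKS` detector, but the source does not say how a link is judged broken. As an illustration of the general idea only, the sketch below flags links whose request fails or returns a 4xx/5xx status. `find_broken_links` and `get_status` are hypothetical names, and the status threshold is an assumption.

```python
def find_broken_links(links, get_status):
    """Return (url, reason) pairs for links that look broken.

    `get_status` is a caller-supplied function returning an HTTP status code
    for a URL, or raising OSError when the request fails outright.
    """
    broken = []
    for url in links:
        try:
            status = get_status(url)
        except OSError as exc:
            broken.append((url, f"request failed: {exc}"))
            continue
        # Assumption: any 4xx or 5xx response counts as broken.
        if status >= 400:
            broken.append((url, f"HTTP {status}"))
    return broken
```

Injecting `get_status` keeps the check testable without network access; a real run would wrap an HTTP client and apply a timeout such as `request_timeout_seconds`.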