Scheduling Automated Website Crawls

FlowHunt’s Schedule feature allows you to automate the crawling and indexing of websites, sitemaps, domains, and YouTube channels. This ensures your AI Agent’s knowledge base stays current with fresh content without manual intervention.

How Scheduling Works

  • Automated crawling:
    Set up recurring crawls that run daily, weekly, monthly, or yearly to keep your knowledge base updated.

  • Multiple crawl types:
    Choose from Domain crawl, Sitemap crawl, URL crawl, or YouTube channel crawl based on your content source.

  • Advanced options:
    Configure browser rendering, link following, screenshots, proxy rotation, and URL filtering for optimal results.

Schedule Configuration Options

Basic Settings

Type: Choose your crawl method:

  • Domain crawl: Crawl an entire domain systematically
  • Sitemap crawl: Use the website’s sitemap.xml for efficient crawling
  • URL crawl: Target specific URLs or pages
  • YouTube channel crawl: Index video content from YouTube channels

Frequency: Set how often the crawl runs:

  • Daily, Weekly, Monthly, or Yearly

URL: Enter the target URL, domain, or YouTube channel to crawl

Advanced Crawling Options

With Browser (extra credits): Enable when crawling JavaScript-heavy websites that require full browser rendering. This option is slower and more expensive but necessary for sites that load content dynamically.
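
To see why rendering matters, compare a plain HTTP fetch with a headless-browser fetch. Below is a minimal Python sketch of the difference, using the requests and Playwright packages against a placeholder URL; it is an illustration only, not FlowHunt's implementation, and FlowHunt does the rendering for you when this option is enabled:

# Illustrative comparison only -- FlowHunt handles rendering for you.
# "https://example.com/spa-page" is a placeholder for a JS-heavy page.
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com/spa-page"

static_html = requests.get(url, timeout=10).text   # raw HTML only; JavaScript never runs

with sync_playwright() as p:                       # full browser rendering
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    rendered_html = page.content()                 # includes dynamically loaded content
    browser.close()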

Follow links (extra credits): Process additional URLs found within pages. Useful when sitemaps don’t contain all URLs, but can consume significant credits as it crawls discovered links.
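
For a rough sense of why following links multiplies the crawled pages (and the credits), here is a minimal Python sketch of the idea, using the requests package and the standard-library HTML parser; it is an illustration only, not FlowHunt's actual crawler:

# Illustrative sketch only -- not FlowHunt's crawler. Each discovered
# same-domain link becomes another page to fetch (and more credits).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import requests

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    domain = urlparse(seed).netloc
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                queue.append(absolute)   # every new link is one more page to crawl
    return seen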

Take screenshot (extra credits): Capture screenshots of pages during crawling. Helpful for websites without og:image tags or for pages that need visual context for AI processing.

With Proxy Rotation (extra credits): Rotate IP addresses for each request to avoid detection by Web Application Firewalls (WAF) or anti-bot systems.
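
Conceptually, proxy rotation routes each request through a different outbound IP address. A minimal Python sketch of the idea, with placeholder proxy addresses (FlowHunt manages its own proxy pool when this option is enabled):

# Conceptual sketch only -- FlowHunt handles proxy rotation for you.
# The proxy URLs below are placeholders.
from itertools import cycle
import requests

proxies = cycle([
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
])

def fetch(url):
    proxy = next(proxies)  # a different outbound IP for each request
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)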

URL Filtering

Skip matching URLs: Enter strings (one per line) to exclude URLs containing these patterns from crawling. Example:

/admin/
/login
.pdf
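
The check is a simple "URL contains pattern" match, one pattern per line. A minimal Python sketch of the idea (illustrative only, not FlowHunt's internal code):

# Illustrative substring check, not FlowHunt's internal code.
skip_patterns = ["/admin/", "/login", ".pdf"]

def should_skip(url):
    return any(pattern in url for pattern in skip_patterns)

should_skip("https://example.com/admin/users")   # True  -> excluded from the crawl
should_skip("https://example.com/pricing")       # False -> crawled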

Example: Crawling flowhunt.io with /blog Skipped

This example walks through what happens when you use FlowHunt’s Schedule feature to crawl the flowhunt.io domain with /blog entered under Skip matching URLs.

Configuration Settings

  • Type: Domain crawl
  • URL: flowhunt.io
  • Frequency: Weekly
  • URL Filtering (Skip matching URLs): /blog
  • Other settings: Default (no browser rendering, no link following, no screenshots, no proxy rotation)

What Happens

  1. Crawl Initiation:

    • FlowHunt starts a domain crawl of flowhunt.io, targeting all accessible pages on the domain (e.g., flowhunt.io, flowhunt.io/features, flowhunt.io/pricing, etc.).
  2. URL Filtering Applied:

    • The crawler evaluates each discovered URL against the skip pattern /blog.
    • Any URL containing /blog (e.g., flowhunt.io/blog, flowhunt.io/blog/post1, flowhunt.io/blog/category) is excluded from the crawl.
    • Other URLs, such as flowhunt.io/about, flowhunt.io/contact, or flowhunt.io/docs, are crawled as they don’t match the /blog pattern.
  3. Crawl Execution:

    • The crawler systematically processes the remaining URLs on flowhunt.io, indexing their content for your AI Agent’s knowledge base.
    • Since browser rendering, link following, screenshots, and proxy rotation are disabled, the crawl is lightweight, focusing only on static content from non-excluded URLs.
  4. Outcome:

    • Your AI Agent’s knowledge base is updated with fresh content from flowhunt.io, excluding anything under the /blog path.
    • The crawl runs weekly, ensuring the knowledge base stays current with new or updated pages (outside of /blog) without manual intervention.

Index just matching URLs: Enter strings (one per line) to crawl only URLs containing these patterns. Example:

/blog/
/articles/
/knowledge/

Example of Including Matching URLs

Configuration Settings

  • Type: Domain crawl
  • URL: flowhunt.io
  • Frequency: Weekly
  • URL Filtering (Index just matching URLs):
    /blog/
    /articles/
    /knowledge/
    
  • Other settings: Default (no browser rendering, no link following, no screenshots, no proxy rotation)

What Happens

  1. Crawl Initiation:

    • FlowHunt starts a domain crawl of flowhunt.io, targeting all accessible pages on the domain (e.g., flowhunt.io, flowhunt.io/blog, flowhunt.io/articles, etc.).
  2. URL Filtering Applied:

    • The crawler evaluates each discovered URL against the index patterns /blog/, /articles/, and /knowledge/.
    • Only URLs containing these patterns (e.g., flowhunt.io/blog/post1, flowhunt.io/articles/news, flowhunt.io/knowledge/guide) are included in the crawl.
    • Other URLs, such as flowhunt.io/about, flowhunt.io/pricing, or flowhunt.io/contact, are excluded because they don’t match the specified patterns.
  3. Crawl Execution:

    • The crawler processes only the URLs matching /blog/, /articles/, or /knowledge/, indexing their content for your AI Agent’s knowledge base.
    • Since browser rendering, link following, screenshots, and proxy rotation are disabled, the crawl is lightweight, focusing only on static content from the included URLs.
  4. Outcome:

    • Your AI Agent’s knowledge base is updated with fresh content from flowhunt.io pages under the /blog/, /articles/, and /knowledge/ paths.
    • The crawl runs weekly, ensuring the knowledge base stays current with new or updated pages within these sections without manual intervention.
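
Applied to the example above, the inclusion filter keeps a URL only if it contains at least one of the listed patterns. A minimal Python sketch of the idea (illustrative only, not FlowHunt's internal code):

# Illustrative inclusion filter, not FlowHunt's internal code.
index_patterns = ["/blog/", "/articles/", "/knowledge/"]

discovered = [
    "https://flowhunt.io/blog/post1",
    "https://flowhunt.io/articles/news",
    "https://flowhunt.io/knowledge/guide",
    "https://flowhunt.io/about",
    "https://flowhunt.io/pricing",
]

crawled = [url for url in discovered
           if any(pattern in url for pattern in index_patterns)]
# crawled -> only the /blog/, /articles/ and /knowledge/ URLs;
# /about and /pricing are dropped.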

Custom Headers: Add custom HTTP headers to crawling requests, formatted as HEADER=Value (one per line). Custom headers let you tailor crawls to a website’s requirements: authenticate requests to reach restricted content, mimic specific browser behavior, or comply with a site’s API or access policies. For example, an Authorization header can grant access to protected pages, while a custom User-Agent can help avoid bot detection or satisfy sites that restrict certain crawlers. This makes crawls more accurate and complete, so relevant content can be indexed for your AI Agent’s knowledge base while respecting the site’s security and access rules. Example:

MYHEADER=Any value
Authorization=Bearer token123
User-Agent=Custom crawler
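
Each line presumably becomes one request header, split on the first =. A minimal Python sketch of how those lines would translate into an HTTP request (illustrative only; FlowHunt applies the headers for you, and the target URL below is a placeholder):

# Illustrative only -- FlowHunt applies these headers for you.
import requests

raw = """MYHEADER=Any value
Authorization=Bearer token123
User-Agent=Custom crawler"""

headers = dict(line.split("=", 1) for line in raw.splitlines())
# {'MYHEADER': 'Any value', 'Authorization': 'Bearer token123',
#  'User-Agent': 'Custom crawler'}

response = requests.get("https://example.com/protected-page",  # placeholder URL
                        headers=headers, timeout=10)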

How to Create a Schedule

  1. Navigate to Schedules in your FlowHunt dashboard

  2. Click “Add new Schedule”

  3. Configure basic settings:

    • Select crawl type (Domain/Sitemap/URL/YouTube)
    • Set frequency (Daily/Weekly/Monthly/Yearly)
    • Enter target URL
  4. Expand Advanced options if needed:

    • Enable browser rendering for JS-heavy sites
    • Configure link following for comprehensive crawling
    • Set up URL filtering rules
    • Add custom headers if required
  5. Click “Add new Schedule” to activate

Best Practices

For Most Websites:

  • Start with basic Sitemap or Domain crawl
  • Use default settings initially
  • Add advanced options only if needed

For JavaScript-Heavy Sites:

  • Enable “With Browser” option
  • Consider taking screenshots for visual content
  • May require proxy rotation if blocked

For Large Sites:

  • Use URL filtering to focus on relevant content
  • Set appropriate frequency to balance freshness with credit usage
  • Monitor credit consumption with advanced features

For E-commerce or Dynamic Content:

  • Use Daily or Weekly frequency
  • Enable link following for product pages
  • Consider custom headers for authenticated content

Credit Usage

Advanced features consume additional credits:

  • Browser rendering increases processing time and cost
  • Following links multiplies the number of crawled pages
  • Screenshots add visual processing overhead
  • Proxy rotation adds network overhead

Monitor your credit usage and adjust schedules based on your needs and budget.

Troubleshooting Common Issues

Crawl Failures:

  • Enable “With Browser” for JavaScript-dependent sites
  • Add “With Proxy Rotation” if blocked by WAF
  • Check custom headers for authentication

Too Many/Few Pages:

  • Use “Skip matching URLs” to exclude unwanted content
  • Use “Index just matching URLs” to focus on specific sections
  • Adjust link following settings

Missing Content:

  • Enable “Follow links” if sitemap is incomplete
  • Check that URL filtering rules aren’t too restrictive
  • Verify target URL is accessible