Scheduling Automated Website Crawls

FlowHunt’s Schedule feature allows you to automate the crawling and indexing of websites, sitemaps, domains, and YouTube channels. This ensures your AI Agent’s knowledge base stays current with fresh content without manual intervention.

How Scheduling Works

  • Automated crawling:
    Set up recurring crawls that run daily, weekly, monthly, or yearly to keep your knowledge base updated.

  • Multiple crawl types:
    Choose from Domain crawl, Sitemap crawl, URL crawl, or YouTube channel crawl based on your content source.

  • Advanced options:
    Configure browser rendering, link following, screenshots, proxy rotation, and URL filtering for optimal results.

Schedule Configuration Options

Basic Settings

Type: Choose your crawl method:

  • Domain crawl: Crawl an entire domain systematically
  • Sitemap crawl: Use the website’s sitemap.xml for efficient crawling
  • URL crawl: Target specific URLs or pages
  • YouTube channel crawl: Index video content from YouTube channels

Frequency: Set how often the crawl runs:

  • Daily, Weekly, Monthly, or Yearly

URL: Enter the target URL, domain, or YouTube channel to crawl

Advanced Crawling Options

With Browser (extra credits): Enable when crawling JavaScript-heavy websites that require full browser rendering. This option is slower and more expensive but necessary for sites that load content dynamically.
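
To see why rendering matters, compare a plain HTTP fetch with a headless-browser fetch. Below is a minimal Python sketch of the difference, using the requests and Playwright packages against a placeholder URL; it is an illustration only, not FlowHunt's implementation, and FlowHunt does the rendering for you when this option is enabled:

# Illustrative comparison only -- FlowHunt handles rendering for you.
# "https://example.com/spa-page" is a placeholder for a JS-heavy page.
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com/spa-page"

static_html = requests.get(url, timeout=10).text   # raw HTML only; JavaScript never runs

with sync_playwright() as p:                       # full browser rendering
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    rendered_html = page.content()                 # includes dynamically loaded content
    browser.close()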

Follow links (extra credits): Process additional URLs found within pages. Useful when sitemaps don’t contain all URLs, but can consume significant credits as it crawls discovered links.
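
For a rough sense of why following links multiplies the crawled pages (and the credits), here is a minimal Python sketch of the idea, using the requests package and the standard-library HTML parser; it is an illustration only, not FlowHunt's actual crawler:

# Illustrative sketch only -- not FlowHunt's crawler. Each discovered
# same-domain link becomes another page to fetch (and more credits).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import requests

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    domain = urlparse(seed).netloc
    queue, seen = [seed], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                queue.append(absolute)   # every new link is one more page to crawl
    return seen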

Take screenshot (extra credits): Capture screenshots of pages during crawling. Helpful for websites without og:image tags or for pages that need visual context for AI processing.

With Proxy Rotation (extra credits): Rotate IP addresses for each request to avoid detection by Web Application Firewalls (WAF) or anti-bot systems.
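
Conceptually, proxy rotation routes each request through a different outbound IP address. A minimal Python sketch of the idea, with placeholder proxy addresses (FlowHunt manages its own proxy pool when this option is enabled):

# Conceptual sketch only -- FlowHunt handles proxy rotation for you.
# The proxy URLs below are placeholders.
from itertools import cycle
import requests

proxies = cycle([
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
])

def fetch(url):
    proxy = next(proxies)  # a different outbound IP for each request
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)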

URL Filtering

Skip matching URLs: Enter strings (one per line) to exclude URLs containing these patterns from crawling. Example:

/admin/
/login
.pdf
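
The check is a simple "URL contains pattern" match, one pattern per line. A minimal Python sketch of the idea (illustrative only, not FlowHunt's internal code):

# Illustrative substring check, not FlowHunt's internal code.
skip_patterns = ["/admin/", "/login", ".pdf"]

def should_skip(url):
    return any(pattern in url for pattern in skip_patterns)

should_skip("https://example.com/admin/users")   # True  -> excluded from the crawl
should_skip("https://example.com/pricing")       # False -> crawled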

Example: Crawling flowhunt.io with /blog Skipped

This example walks through what happens when you use FlowHunt’s Schedule feature to crawl the flowhunt.io domain with /blog entered under Skip matching URLs.

Configuration Settings

  • Type: Domain crawl
  • URL: flowhunt.io
  • Frequency: Weekly
  • URL Filtering (Skip matching URLs): /blog
  • Other settings: Default (no browser rendering, no link following, no screenshots, no proxy rotation)

What Happens

  1. Crawl Initiation:

    • FlowHunt starts a domain crawl of flowhunt.io, targeting all accessible pages on the domain (e.g., flowhunt.io, flowhunt.io/features, flowhunt.io/pricing, etc.).
  2. URL Filtering Applied:

    • The crawler evaluates each discovered URL against the skip pattern /blog.
    • Any URL containing /blog (e.g., flowhunt.io/blog, flowhunt.io/blog/post1, flowhunt.io/blog/category) is excluded from the crawl.
    • Other URLs, such as flowhunt.io/about, flowhunt.io/contact, or flowhunt.io/docs, are crawled as they don’t match the /blog pattern.
  3. Crawl Execution:

    • The crawler systematically processes the remaining URLs on flowhunt.io, indexing their content for your AI Agent’s knowledge base.
    • Since browser rendering, link following, screenshots, and proxy rotation are disabled, the crawl is lightweight, focusing only on static content from non-excluded URLs.
  4. Outcome:

    • Your AI Agent’s knowledge base is updated with fresh content from flowhunt.io, excluding anything under the /blog path.
    • The crawl runs weekly, ensuring the knowledge base stays current with new or updated pages (outside of /blog) without manual intervention.

Index just matching URLs: Enter strings (one per line) to crawl only URLs containing these patterns. Example:

/blog/
/articles/
/knowledge/

Example of Including Matching URLs

Configuration Settings

  • Type: Domain crawl
  • URL: flowhunt.io
  • Frequency: Weekly
  • URL Filtering (Index just matching URLs):
    /blog/
    /articles/
    /knowledge/
    
  • Other settings: Default (no browser rendering, no link following, no screenshots, no proxy rotation)

What Happens

  1. Crawl Initiation:

    • FlowHunt starts a domain crawl of flowhunt.io, targeting all accessible pages on the domain (e.g., flowhunt.io, flowhunt.io/blog, flowhunt.io/articles, etc.).
  2. URL Filtering Applied:

    • The crawler evaluates each discovered URL against the index patterns /blog/, /articles/, and /knowledge/.
    • Only URLs containing these patterns (e.g., flowhunt.io/blog/post1, flowhunt.io/articles/news, flowhunt.io/knowledge/guide) are included in the crawl.
    • Other URLs, such as flowhunt.io/about, flowhunt.io/pricing, or flowhunt.io/contact, are excluded because they don’t match the specified patterns.
  3. Crawl Execution:

    • The crawler processes only the URLs matching /blog/, /articles/, or /knowledge/, indexing their content for your AI Agent’s knowledge base.
    • Since browser rendering, link following, screenshots, and proxy rotation are disabled, the crawl is lightweight, focusing only on static content from the included URLs.
  4. Outcome:

    • Your AI Agent’s knowledge base is updated with fresh content from flowhunt.io pages under the /blog/, /articles/, and /knowledge/ paths.
    • The crawl runs weekly, ensuring the knowledge base stays current with new or updated pages within these sections without manual intervention.
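
Applied to the example above, the inclusion filter keeps a URL only if it contains at least one of the listed patterns. A minimal Python sketch of the idea (illustrative only, not FlowHunt's internal code):

# Illustrative inclusion filter, not FlowHunt's internal code.
index_patterns = ["/blog/", "/articles/", "/knowledge/"]

discovered = [
    "https://flowhunt.io/blog/post1",
    "https://flowhunt.io/articles/news",
    "https://flowhunt.io/knowledge/guide",
    "https://flowhunt.io/about",
    "https://flowhunt.io/pricing",
]

crawled = [url for url in discovered
           if any(pattern in url for pattern in index_patterns)]
# crawled -> only the /blog/, /articles/ and /knowledge/ URLs;
# /about and /pricing are dropped.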

Custom Headers: Add custom HTTP headers to crawling requests, formatted as HEADER=Value (one per line). Custom headers let you tailor crawls to a website’s requirements: authenticate requests to reach restricted content, mimic specific browser behavior, or comply with a site’s API or access policies. For example, an Authorization header can grant access to protected pages, while a custom User-Agent can help avoid bot detection or satisfy sites that restrict certain crawlers. This makes crawls more accurate and complete, so relevant content can be indexed for your AI Agent’s knowledge base while respecting the site’s security and access rules. Example:

MYHEADER=Any value
Authorization=Bearer token123
User-Agent=Custom crawler
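
Each line presumably becomes one request header, split on the first =. A minimal Python sketch of how those lines would translate into an HTTP request (illustrative only; FlowHunt applies the headers for you, and the target URL below is a placeholder):

# Illustrative only -- FlowHunt applies these headers for you.
import requests

raw = """MYHEADER=Any value
Authorization=Bearer token123
User-Agent=Custom crawler"""

headers = dict(line.split("=", 1) for line in raw.splitlines())
# {'MYHEADER': 'Any value', 'Authorization': 'Bearer token123',
#  'User-Agent': 'Custom crawler'}

response = requests.get("https://example.com/protected-page",  # placeholder URL
                        headers=headers, timeout=10)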

How to Create a Schedule

  1. Navigate to Schedules in your FlowHunt dashboard

  2. Click “Add new Schedule”

  3. Configure basic settings:

    • Select crawl type (Domain/Sitemap/URL/YouTube)
    • Set frequency (Daily/Weekly/Monthly/Yearly)
    • Enter target URL
  4. Expand Advanced options if needed:

    • Enable browser rendering for JS-heavy sites
    • Configure link following for comprehensive crawling
    • Set up URL filtering rules
    • Add custom headers if required
  5. Click “Add new Schedule” to activate

Best Practices

For Most Websites:

  • Start with basic Sitemap or Domain crawl
  • Use default settings initially
  • Add advanced options only if needed

For JavaScript-Heavy Sites:

  • Enable “With Browser” option
  • Consider taking screenshots for visual content
  • May require proxy rotation if blocked

For Large Sites:

  • Use URL filtering to focus on relevant content
  • Set appropriate frequency to balance freshness with credit usage
  • Monitor credit consumption with advanced features

For E-commerce or Dynamic Content:

  • Use Daily or Weekly frequency
  • Enable link following for product pages
  • Consider custom headers for authenticated content

Credit Usage

Advanced features consume additional credits:

  • Browser rendering increases processing time and cost
  • Following links multiplies the number of crawled pages
  • Screenshots add visual processing overhead
  • Proxy rotation adds network overhead

Monitor your credit usage and adjust schedules based on your needs and budget.

Troubleshooting Common Issues

Crawl Failures:

  • Enable “With Browser” for JavaScript-dependent sites
  • Add “With Proxy Rotation” if blocked by WAF
  • Check custom headers for authentication

Too Many/Few Pages:

  • Use “Skip matching URLs” to exclude unwanted content
  • Use “Index just matching URLs” to focus on specific sections
  • Adjust link following settings

Missing Content:

  • Enable “Follow links” if sitemap is incomplete
  • Check that URL filtering rules aren’t too restrictive
  • Verify target URL is accessible