
Scheduling Automated Website Crawls
Learn how to set up automated schedules for crawling websites, sitemaps, domains, and YouTube channels to keep your AI Agent’s knowledge base up to date.
FlowHunt’s Schedule feature allows you to automate the crawling and indexing of websites, sitemaps, domains, and YouTube channels. This ensures your AI Agent’s knowledge base stays current with fresh content without manual intervention.
How Scheduling Works
Automated crawling: Set up recurring crawls that run daily, weekly, monthly, or yearly to keep your knowledge base updated.
Multiple crawl types: Choose from Domain crawl, Sitemap crawl, URL crawl, or YouTube channel crawl based on your content source.
Advanced options: Configure browser rendering, link following, screenshots, proxy rotation, and URL filtering for optimal results.
Schedule Configuration Options
Basic Settings
Type: Choose your crawl method:
- Domain crawl: Crawl an entire domain systematically
- Sitemap crawl: Use the website’s sitemap.xml for efficient crawling
- URL crawl: Target specific URLs or pages
- YouTube channel crawl: Index video content from YouTube channels
Frequency: Set how often the crawl runs:
- Daily, Weekly, Monthly, or Yearly
URL: Enter the target URL, domain, or YouTube channel to crawl
Advanced Crawling Options
With Browser (extra credits): Enable when crawling JavaScript-heavy websites that require full browser rendering. This option is slower and more expensive but necessary for sites that load content dynamically.
Follow links (extra credits): Process additional URLs found within pages. Useful when sitemaps don’t contain all URLs, but can consume significant credits as it crawls discovered links.
Take screenshot (extra credits): Capture visual screenshots during crawling. Helpful for websites without og:images or those requiring visual context for AI processing.
With Proxy Rotation (extra credits): Rotate IP addresses for each request to avoid detection by Web Application Firewalls (WAF) or anti-bot systems.
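Taken together, a schedule is simply these settings combined. The sketch below is purely illustrative (the field names and values are assumptions chosen for readability, not FlowHunt’s actual API or export format); it shows how a weekly sitemap crawl with browser rendering enabled might be described:

```python
# Hypothetical schedule definition -- field names are illustrative only,
# not FlowHunt's actual configuration format or API.
schedule = {
    "type": "sitemap",                         # Domain | Sitemap | URL | YouTube channel
    "url": "https://example.com/sitemap.xml",  # target URL, domain, or channel
    "frequency": "weekly",                     # daily | weekly | monthly | yearly
    "with_browser": True,                      # extra credits: full browser rendering for JS-heavy sites
    "follow_links": False,                     # extra credits: also crawl URLs discovered in pages
    "take_screenshot": False,                  # extra credits: capture screenshots while crawling
    "with_proxy_rotation": False,              # extra credits: rotate IPs to avoid WAF/anti-bot blocks
}
```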
URL Filtering
Skip matching URLs: Enter strings (one per line) to exclude URLs containing these patterns from crawling. Example:
/admin/
/login
.pdf
Example: Crawling flowhunt.io with /blog Skipped
This example explains what happens when you use FlowHunt’s Schedule feature to crawl the flowhunt.io domain while setting /blog as a matching URL to skip in the URL filtering settings.
Configuration Settings
- Type: Domain crawl
- URL: flowhunt.io
- Frequency: Weekly
- URL Filtering (Skip matching URLs): /blog
- Other settings: Default (no browser rendering, no link following, no screenshots, no proxy rotation)
What Happens
Crawl Initiation:
- FlowHunt starts a domain crawl of flowhunt.io, targeting all accessible pages on the domain (e.g., flowhunt.io, flowhunt.io/features, flowhunt.io/pricing, etc.).
URL Filtering Applied:
- The crawler evaluates each discovered URL against the skip pattern /blog.
- Any URL containing /blog (e.g., flowhunt.io/blog, flowhunt.io/blog/post1, flowhunt.io/blog/category) is excluded from the crawl.
- Other URLs, such as flowhunt.io/about, flowhunt.io/contact, or flowhunt.io/docs, are crawled because they don’t match the /blog pattern.
Crawl Execution:
- The crawler systematically processes the remaining URLs on flowhunt.io, indexing their content for your AI Agent’s knowledge base.
- Since browser rendering, link following, screenshots, and proxy rotation are disabled, the crawl is lightweight, focusing only on static content from non-excluded URLs.
Outcome:
- Your AI Agent’s knowledge base is updated with fresh content from flowhunt.io, excluding anything under the /blog path.
- The crawl runs weekly, ensuring the knowledge base stays current with new or updated pages (outside of /blog) without manual intervention.
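Conceptually, the skip filter is a substring check against each discovered URL. Here is a minimal sketch of that behaviour, assuming plain substring matching (the exact matching rules FlowHunt applies internally are not spelled out here):

```python
# Skip patterns as entered in "Skip matching URLs" (one per line).
skip_patterns = ["/blog"]

def should_crawl(url: str) -> bool:
    """Keep a URL only if it contains none of the skip patterns."""
    return not any(pattern in url for pattern in skip_patterns)

discovered = [
    "https://flowhunt.io/about",
    "https://flowhunt.io/blog/post1",
    "https://flowhunt.io/docs",
]
print([url for url in discovered if should_crawl(url)])
# ['https://flowhunt.io/about', 'https://flowhunt.io/docs']
```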
Index just matching URLs: Enter strings (one per line) to only crawl URLs containing these patterns. Example:
/blog/
/articles/
/knowledge/
Example of Including Matching URLs
Configuration Settings
- Type: Domain crawl
- URL: flowhunt.io
- Frequency: Weekly
- URL Filtering (Index just matching URLs): /blog/, /articles/, /knowledge/
- Other settings: Default (no browser rendering, no link following, no screenshots, no proxy rotation)
What Happens
Crawl Initiation:
- FlowHunt starts a domain crawl of flowhunt.io, targeting all accessible pages on the domain (e.g., flowhunt.io, flowhunt.io/blog, flowhunt.io/articles, etc.).
URL Filtering Applied:
- The crawler evaluates each discovered URL against the index patterns /blog/, /articles/, and /knowledge/.
- Only URLs containing these patterns (e.g., flowhunt.io/blog/post1, flowhunt.io/articles/news, flowhunt.io/knowledge/guide) are included in the crawl.
- Other URLs, such as flowhunt.io/about, flowhunt.io/pricing, or flowhunt.io/contact, are excluded because they don’t match the specified patterns.
Crawl Execution:
- The crawler processes only the URLs matching /blog/, /articles/, or /knowledge/, indexing their content for your AI Agent’s knowledge base.
- Since browser rendering, link following, screenshots, and proxy rotation are disabled, the crawl is lightweight, focusing only on static content from the included URLs.
Outcome:
- Your AI Agent’s knowledge base is updated with fresh content from flowhunt.io pages under the /blog/, /articles/, and /knowledge/ paths.
- The crawl runs weekly, ensuring the knowledge base stays current with new or updated pages within these sections without manual intervention.
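The include filter mirrors the skip filter sketched earlier, but inverted: only URLs containing at least one of the listed patterns are kept. A minimal sketch under the same substring-matching assumption:

```python
# Patterns as entered in "Index just matching URLs" (one per line).
index_patterns = ["/blog/", "/articles/", "/knowledge/"]

def should_index(url: str) -> bool:
    """Keep a URL only if it contains at least one index pattern."""
    return any(pattern in url for pattern in index_patterns)

discovered = [
    "https://flowhunt.io/pricing",
    "https://flowhunt.io/articles/news",
    "https://flowhunt.io/knowledge/guide",
]
print([url for url in discovered if should_index(url)])
# ['https://flowhunt.io/articles/news', 'https://flowhunt.io/knowledge/guide']
```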
Custom Headers: Add custom HTTP headers to crawling requests, formatted as HEADER=Value (one per line):
MYHEADER=Any value
Authorization=Bearer token123
User-Agent=Custom crawler
This feature is highly useful for tailoring crawls to specific website requirements. Custom headers let you authenticate requests to access restricted content, mimic specific browser behaviors, or comply with a website’s API or access policies. For example, setting an Authorization header can grant access to protected pages, while a custom User-Agent can help avoid bot detection or ensure compatibility with sites that restrict certain crawlers. This flexibility ensures more accurate and comprehensive data collection, making it easier to index relevant content for your AI Agent’s knowledge base while adhering to a website’s security or access protocols.
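To make the HEADER=Value format concrete, the sketch below parses such lines into a standard header dictionary and attaches it to a request. It uses the Python requests library purely as a stand-in; FlowHunt sends the headers for you during the crawl, and the parsing shown is an assumption about the format, not FlowHunt’s internal code:

```python
import requests

# The same lines you would enter in the Custom Headers field.
raw_headers = """\
MYHEADER=Any value
Authorization=Bearer token123
User-Agent=Custom crawler"""

# Split each line on the first '=' to get the header name and value.
headers = dict(line.split("=", 1) for line in raw_headers.splitlines())

# Illustrative request only -- during a scheduled crawl, FlowHunt attaches
# these headers to its own requests.
response = requests.get("https://example.com/protected-page", headers=headers)
print(response.status_code)
```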
How to Create a Schedule
Navigate to Schedules in your FlowHunt dashboard
Click “Add new Schedule”
Configure basic settings:
- Select crawl type (Domain/Sitemap/URL/YouTube)
- Set frequency (Daily/Weekly/Monthly/Yearly)
- Enter target URL
Expand Advanced options if needed:
- Enable browser rendering for JS-heavy sites
- Configure link following for comprehensive crawling
- Set up URL filtering rules
- Add custom headers if required
Click “Add new Schedule” to activate
Best Practices
For Most Websites:
- Start with basic Sitemap or Domain crawl
- Use default settings initially
- Add advanced options only if needed
For JavaScript-Heavy Sites:
- Enable “With Browser” option
- Consider taking screenshots for visual content
- May require proxy rotation if blocked
For Large Sites:
- Use URL filtering to focus on relevant content
- Set appropriate frequency to balance freshness with credit usage
- Monitor credit consumption with advanced features
For E-commerce or Dynamic Content:
- Use Daily or Weekly frequency
- Enable link following for product pages
- Consider custom headers for authenticated content
Credit Usage
Advanced features consume additional credits:
- Browser rendering increases processing time and cost
- Following links multiplies crawled pages
- Screenshots add visual processing overhead
- Proxy rotation adds network overhead
Monitor your credit usage and adjust schedules based on your needs and budget.
Troubleshooting Common Issues
Crawl Failures:
- Enable “With Browser” for JavaScript-dependent sites
- Add “With Proxy Rotation” if blocked by WAF
- Check custom headers for authentication
Too Many/Few Pages:
- Use “Skip matching URLs” to exclude unwanted content
- Use “Index just matching URLs” to focus on specific sections
- Adjust link following settings
Missing Content:
- Enable “Follow links” if sitemap is incomplete
- Check URL filtering rules aren’t too restrictive
- Verify target URL is accessible