Blocking Bots in Sitecore Site

This is a challenging topic because bots improve every day. Apart from limiting connection counts or bandwidth at the firewall or web server, a few other measures can block bots on a Sitecore site. Let’s look at a few approaches for denying bots access to a Sitecore site.

1. Denying access based on User Agent for popular bots/web crawlers:

We can use URL Rewrite rules or IIS request filtering rules to allow or disallow certain user agents. This should be at least 50 percent effective, because most people who try to ping or crawl the site do NOT write their own bot; they simply use an existing one. Nowadays some bots send requests that look like Firefox or Chrome, but most bots still send their own user agent with their requests.

<rule name="RequestBlockingRule1" patternSyntax="Wildcard" stopProcessing="true">
    <match url="*" />
    <conditions>
        <!-- Wildcard patterns must match the whole header value, so the token is wrapped in asterisks -->
        <add input="{HTTP_USER_AGENT}" pattern="*YandexBot*" />
    </conditions>
    <action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="Get Lost." />
</rule>

These are some popular user agents that are identified as bots:

EasouSpider|Add Catalog|PaperLiBot|Spiceworks|ZumBot|RU_Bot|Wget|Java/1.7.0_25|Slurp|FunWebProducts|80legs|Aboundex|AcoiRobot|Acoon Robot|AhrefsBot|aihit|AlkalineBOT|AnzwersCrawl|Arachnoidea|ArchitextSpider|archive|Autonomy Spider|Baiduspider|BecomeBot|benderthewebrobot|BlackWidow|Bork-edition|Bot mailto:craftbot@yahoo.com|botje|catchbot|changedetection|Charlotte|ChinaClaw|commoncrawl|ConveraCrawler|Covario|crawler|curl|Custo|data mining development project|DigExt|DISCo|discobot|discoveryengine|DOC|DoCoMo|DotBot|Download Demon|Download Ninja|eCatch|EirGrabber|EmailSiphon|EmailWolf|eurobot|Exabot|Express WebPictures|ExtractorPro|EyeNetIE|Ezooms|Fetch|Fetch API|filterdb|findfiles|findlinks|FlashGet|flightdeckreports|FollowSite Bot|Gaisbot|genieBot|GetRight|GetWeb!|gigablast|Gigabot|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|GT::WWW|hailoo|heritrix|HMView|houxou|HTTP::Lite|HTTrack|ia_archiver|IBM EVV|id-search|IDBot|Image Stripper|Image Sucker|Indy Library|InterGET|Internet Ninja|internetmemory|ISC Systems iRc Search 2.1|JetCar|JOC Web Spider|k2spider|larbin|larbin|LeechFTP|libghttp|libwww|libwww-perl|linko|LinkWalker|lwp-trivial|Mass Downloader|metadatalabs|MFC_Tear_Sample|Microsoft URL Control|MIDown tool|Missigua|Missigua Locator|Mister PiX|MJ12bot|MOREnet|MSIECrawler|msnbot|naver|Navroad|NearSite|Net Vampire|NetAnts|NetSpider|NetZIP|NextGenSearchBot|NPBot|Nutch|Octopus|Offline Explorer|Offline Navigator|omni-explorer|PageGrabber|panscient|panscient.com|Papa Foto|pavuk|pcBrowser|PECL::HTTP|PHP/|PHPCrawl|picsearch|pipl|pmoz|PredictYourBabySearchToolbar|RealDownload|Referrer Karma|ReGet|reverseget|rogerbot|ScoutJet|SearchBot|seexie|seoprofiler|Servage Robot|SeznamBot|shopwiki|sindice|sistrix|SiteSnagger|SiteSnagger|smart.apnoti.com|SmartDownload|Snoopy|Sosospider|spbot|suggybot|SuperBot|SuperHTTP|SuperPagesUrlVerifyBot|Surfbot|SurveyBot|SurveyBot|swebot|Synapse|Tagoobot|tAkeOut|Teleport|Teleport Pro|TeleportPro|TweetmemeBot|TwengaBot|twiceler|UbiCrawler|uptimerobot|URI::Fetch|urllib|User-Agent|VoidEYE|VoilaBot|WBSearchBot|Web Image Collector|Web Sucker|WebAuto|WebCopier|WebCopier|WebFetch|WebGo IS|WebLeacher|WebReaper|WebSauger|Website eXtractor|Website Quester|WebStripper|WebStripper|WebWhacker|WebZIP|WebZIP|Wells Search II|WEP Search|Widow|winHTTP|WWWOFFLE|Xaldon WebSpider|Xenu|yacybot|yandex|YandexBot|YandexImages|yBot|YesupBot|YodaoBot|yolinkBot|youdao|Zao|Zealbot|Zeus|ZyBORG
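
Before putting a long list like this into a blocking rule, it can help to check offline which User-Agent strings it would actually catch. The following is a minimal Python sketch, not part of the IIS setup: `BLOCKED_AGENTS` is a small illustrative subset of the list above, and `is_blocked` is a hypothetical helper name. It assumes a case-insensitive substring match, which may differ from how a given rewrite rule is configured.

```python
import re

# Illustrative subset of the list above (an assumption; extend as needed).
BLOCKED_AGENTS = ["YandexBot", "AhrefsBot", "MJ12bot", "Wget", "curl", "HTTrack", "libwww-perl"]

# re.escape keeps characters such as "." or "/" literal, so each entry is
# treated as plain text rather than as a regular expression.
BLOCKED_PATTERN = re.compile(
    "|".join(re.escape(agent) for agent in BLOCKED_AGENTS),
    re.IGNORECASE,
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains any blocked token."""
    return BLOCKED_PATTERN.search(user_agent or "") is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "curl/7.68.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]
    for ua in samples:
        print(is_blocked(ua), ua)
```

Short or generic tokens in the list (for example DOC, Fetch or archive) are worth testing this way, since a substring match on them could also catch legitimate clients.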

Do not confuse this with the Sitecore.Analytics.ExcludeRobots.config file, which contains a list of excluded IP addresses and user agents. That configuration only ensures that genuine contacts are registered in xDB, whereas the IIS setup above blocks the requests coming from bots.

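2. Updating allowed user agents in robots.txt:

This works only for bots that RESPECT robots.txt (search engine crawlers such as googlebot, msnbot, slurp and so on). A compliant crawler reads the file and skips whatever it is disallowed from; nothing is enforced on the server.

Note that the snippet below is not robots.txt syntax. It is a UrlScan rule (UrlScan.ini) that rejects, on the server, any request whose User-Agent header contains Yandex: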

[Options]
RuleList=DenyYandex
[DenyYandex]
DenyDataSection=Agents
ScanHeaders=User-Agent
[Agents]
Yandex
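
The block above uses UrlScan rule syntax rather than robots.txt syntax. For comparison, here is a hedged Python sketch showing what a robots.txt rule for the same agent might look like and how a well-behaved crawler would interpret it. The robots.txt content in `ROBOTS_TXT`, the helper name `can_crawl` and the example.com URLs are assumptions for illustration. The parser is `urllib.robotparser` from the Python standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask Yandex to stay away, allow everyone else.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /

User-agent: *
Disallow:
"""


def can_crawl(agent_token: str, url: str) -> bool:
    """Return True if robots.txt permits the given crawler token to fetch url.

    agent_token is the crawler's product token (for example "YandexBot"),
    not the full User-Agent header.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    for token in ("YandexBot", "Googlebot"):
        print(token, can_crawl(token, "https://example.com/"))
```

This only describes what a compliant crawler will do voluntarily. A bot that ignores robots.txt is unaffected, which is why server-side rules like the ones in sections 1 and 3 are still needed.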

3. Rate limiting (effective in blocking constant POST requests, ESPECIALLY for forms without captcha):

This is to avoid too many requests at a time. Bots can send out a huge number of form POSTs or GET requests, and this strategy limits the number of requests a client can make.

This approach is commonly used by banks, financial institutions and other secured applications, which stop responding if there are multiple requests within a certain time frame. The same applies to a bot: when it tries to BOMBARD the site with a lot of POSTs, the server can identify the burst and reject it, for example with a 429 (Too Many Requests) response code. Note that IIS Dynamic IP Restrictions, shown below, responds with 403 Forbidden by default; the response can be changed through the denyAction attribute.

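<system.webServer>
   <security>
      <!-- enableLoggingOnlyMode="true" only logs requests that would be denied;
           set it to "false" to actually reject them -->
      <dynamicIpSecurity enableLoggingOnlyMode="true">
         <!-- Deny a client IP with more than 10 concurrent requests -->
         <denyByConcurrentRequests enabled="true" maxConcurrentRequests="10" />
         <!-- Deny a client IP sending more than 30 requests within 300 ms -->
         <denyByRequestRate enabled="true" maxRequests="30" 
            requestIntervalInMilliseconds="300" />
      </dynamicIpSecurity>
   </security>
</system.webServer>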
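
To make the request-rate idea concrete, here is a small Python sketch of a per-IP sliding-window limiter. It only illustrates the logic of "at most N requests per interval"; it is not how IIS implements denyByRequestRate, and the class name `RequestRateLimiter` and its parameters are hypothetical. The defaults mirror the 30 requests per 300 ms from the configuration above.

```python
from collections import defaultdict, deque


class RequestRateLimiter:
    """Allow at most max_requests per client IP within a sliding interval."""

    def __init__(self, max_requests: int = 30, interval_ms: int = 300):
        self.max_requests = max_requests
        self.interval = interval_ms / 1000.0  # seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed requests

    def allow(self, client_ip: str, now: float) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        window = self._hits[client_ip]
        # Forget requests that have fallen out of the interval.
        while window and now - window[0] >= self.interval:
            window.popleft()
        if len(window) >= self.max_requests:
            return False  # the caller would answer with 429 or 403 here
        window.append(now)
        return True


if __name__ == "__main__":
    limiter = RequestRateLimiter(max_requests=3, interval_ms=300)
    for t in (0.00, 0.01, 0.02, 0.03, 0.50):
        print(t, limiter.allow("203.0.113.7", now=t))
```

Rejected requests are deliberately not recorded in the window, so a client that backs off regains access once the interval has passed. Counting rejected requests as well would be a stricter alternative.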


4. Implementing Captcha in Forms:

This is mostly aimed at bots that try to automatically harvest email addresses, or to automatically sign up for or make use of websites, blogs or forums. All forms should have this feature in order to block bots from posting them.

Sitecore Forms Extensions already has a feature that allows you to integrate Google’s reCAPTCHA.

