MehmetAK07
(Mehmet AK)
November 15, 2016, 3:19am
1
Hi guys,
I want to block all web crawlers except Google, Yahoo, etc.
Is this sample robots.txt correct?
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /galleries/
Disallow: /assets/
Disallow: /protected/
Disallow: /images/
Disallow: /template/
Disallow: /themes/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
User-agent: googlebot-image
Disallow: /galleries/
Disallow: /assets/
Disallow: /protected/
Disallow: /images/
Disallow: /template/
Disallow: /themes/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
User-agent: googlebot-mobile
Disallow: /galleries/
Disallow: /assets/
Disallow: /protected/
Disallow: /images/
Disallow: /template/
Disallow: /themes/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
User-agent: MSNBot
Disallow: /galleries/
Disallow: /assets/
Disallow: /protected/
Disallow: /images/
Disallow: /template/
Disallow: /themes/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
User-agent: yahoobot
Disallow: /galleries/
Disallow: /assets/
Disallow: /protected/
Disallow: /images/
Disallow: /template/
Disallow: /themes/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
User-agent: yahoo-blogs/v3.9
Disallow: /galleries/
Disallow: /assets/
Disallow: /protected/
Disallow: /images/
Disallow: /template/
Disallow: /themes/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
User-Agent: bingbot
Disallow: /galleries/
Disallow: /assets/
Disallow: /protected/
Disallow: /images/
Disallow: /template/
Disallow: /themes/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
mdomba
(Maurizio Domba Cerin)
November 15, 2016, 7:28am
2
I always put the “all bots” section last. If it’s at the top, some bots could stop right there at the first line without reading anything after it.
Also, you can group all the bots you want to allow, like this:
User-agent: Googlebot
User-agent: googlebot-image
User-agent: googlebot-mobile
User-agent: Mediapartners-Google*
User-agent: <any other bot you want to allow>
Disallow:
User-agent: *
Disallow: /
You probably already know this, but just to mention it: this works only for "polite" bots, so there can still be bots that scan your site regardless of the robots.txt file.
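One way to check how a grouped file like this is read is to run it through Python's standard urllib.robotparser module. This is only a sketch: the bot name SomeRandomBot and the path /index.html are made up for illustration, the Mediapartners-Google* line is left out, and real crawlers may match user agents differently than Python's parser does.

```python
from urllib.robotparser import RobotFileParser

# Grouped rules as suggested above: the listed bots may crawl everything,
# every other bot is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: googlebot-image
User-agent: googlebot-mobile
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # SomeRandomBot is a hypothetical crawler name used only for this check.
    for bot in ("Googlebot", "googlebot-image", "SomeRandomBot"):
        print(bot, is_allowed(bot, "/index.html"))
```

A check like this only tells you what a well-behaved parser concludes from the file. It says nothing about bots that never read robots.txt.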
MehmetAK07
(Mehmet AK)
November 16, 2016, 7:22am
3
Thanks, Maurizio Domba Cerin.
MehmetAK07
(Mehmet AK)
November 16, 2016, 10:57am
4
Is this true?
Can I block just bad robots?
In theory yes, in practice, no. If the bad robot obeys /robots.txt, and you know the name it scans for in the User-Agent field, then you can create a section in your /robots.txt to exclude it specifically. But almost all bad robots ignore /robots.txt, making that pointless.
If the bad robot operates from a single IP address, you can block its access to your web server through server configuration or with a network firewall.
If copies of the robot operate at lots of different IP addresses, such as hijacked PCs that are part of a large botnet, then it becomes more difficult. The best option then is to use advanced firewall rules configuration that automatically blocks access to IP addresses that make many connections; but that can hit good robots as well as your bad robots.
http://www.robotstxt.org/faq/blockjustbad.html
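The idea of automatically blocking IP addresses that make many connections can be sketched as a sliding-window counter. This is a rough illustration in Python, not a firewall configuration: the class name ConnectionLimiter, the thresholds, and the example IP address are all assumptions made up for the sketch, and in practice this logic would normally live in the firewall or web server rather than in application code.

```python
from collections import defaultdict, deque


class ConnectionLimiter:
    """Flag IPs that make more than max_hits connections within window seconds."""

    def __init__(self, max_hits: int = 100, window: float = 60.0) -> None:
        self.max_hits = max_hits          # arbitrary example threshold
        self.window = window              # sliding window length in seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent connections
        self.blocked = set()

    def record(self, ip: str, now: float) -> bool:
        """Record one connection at time now; return True if the IP is blocked."""
        if ip in self.blocked:
            return True
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) > self.max_hits:
            self.blocked.add(ip)
            return True
        return False


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_hits=3, window=10.0)
    # 203.0.113.5 is a documentation-only address used as a stand-in.
    for second in range(5):
        print(second, limiter.record("203.0.113.5", float(second)))
```

As the quoted FAQ warns, a counter like this cannot tell a busy legitimate crawler from a bad one, so the threshold has to be chosen with the good robots in mind.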
mdomba
(Maurizio Domba Cerin)
November 16, 2016, 11:14am
5
Yes it is, that’s why I mentioned above “polite” bots.
MehmetAK07
(Mehmet AK)
November 16, 2016, 12:20pm
6
Is there another solution to block bad robots?
mdomba
(Maurizio Domba Cerin)
November 16, 2016, 12:32pm
7
Just google a bit, there is a lot of information out there. Here is one article to start with: http://www.blogtips.org/web-crawlers-love-the-good-but-kill-the-bad-and-the-ugly/