Each crawler has a specific name that you declare in the robots.txt file, for example: Google - Googlebot, Bing - Bingbot, Yahoo - Slurp, Twitter - Twitterbot, Facebook - Facebot, and so on. You can also target all crawlers at once with the wildcard (*).
Each bot declaration is then followed by two kinds of rules: Allow and Disallow.
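For example, a minimal sketch of a rule group that targets every crawler at once via the wildcard (the paths here are only hypothetical placeholders):
User-agent: *
Disallow: /search
Allow: /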
How to create a robots.txt file
The structure of a robots.txt file reads as follows:
User-agent: the name of the search engine bot
Disallow: links that are blocked
Allow: links that are allowed
Sitemap: <domain>/sitemap.xml
An illustrated example:
Suppose I allow the Google, Twitter, Facebook, and Google partner (AdSense) bots to collect data as follows:
User-agent: Googlebot
User-agent: Twitterbot
User-agent: Facebot
Disallow: /p
Disallow: /search
Allow: /
User-agent: Mediapartners-Google
Allow: /
Sitemap: https://www.cuongbv.com/sitemap.xml
When reading this file, the Google, Twitter, and Facebook bots understand that all static page links (/p) and search page links (/search) are blocked, while the Google partner bot (Mediapartners-Google) is allowed to collect all links.
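To make this concrete, here is how a few sample URLs (hypothetical pages, except the domain) would be treated under the file above:
https://www.cuongbv.com/p/about-us.html - blocked, matches Disallow: /p
https://www.cuongbv.com/search/label/blogspot-seo - blocked, matches Disallow: /search
https://www.cuongbv.com/2018/11/any-post.html - allowed, matches only Allow: /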
Adding filter rules to blocked links, or blocking a link from allowed links
Suppose that, within the two rules Disallow: /p and Disallow: /search, we add Allow rules to release specific links contained under those blocked paths, and also block one specific link out of the allowed Allow: / scope, for example:
User-agent: Googlebot
User-agent: Twitterbot
User-agent: Facebot
Disallow: /p
Disallow: /search
Disallow: /2018/11/cach-chia-se-bai-viet-len-facebook-an-toan-va-tuong-tac-cao.html
Allow: /
Allow: /p/about-us.html
Allow: /search/label/blogspot-seo
User-agent: Mediapartners-Google
Allow: /
Sitemap: https://www.cuongbv.com/sitemap.xml
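Assuming Google's documented matching rule that the longest (most specific) matching path takes precedence, the file above now works out like this for a few sample URLs (paths not shown in the file are hypothetical):
/p/about-us.html - allowed, because Allow: /p/about-us.html is more specific than Disallow: /p
/p/contact.html - still blocked, only Disallow: /p matches
/search/label/blogspot-seo - allowed, the longer Allow rule overrides Disallow: /search
/2018/11/cach-chia-se-bai-viet-len-facebook-an-toan-va-tuong-tac-cao.html - blocked, the explicit Disallow overrides Allow: /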
Adding advanced filter rules with the wildcard (*)
For example, adding the following rules to the bot group above blocks comment permalinks, share-referral links (spref), and tracking-parameter (utm_source) links:
Disallow: *?showComment=*
Disallow: *?spref=fb
Disallow: *?spref=tw
Disallow: *?spref=gp
Disallow: *?spref=pi
Disallow: *?utm_source=*
With these wildcard (*) rules you do not need to list every link in advance: any URL containing the parameter named around the asterisk (*) will be blocked.
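For instance, under the patterns above the following hypothetical URLs would be handled as shown:
/2018/11/some-post.html?showComment=1541234567890 - blocked by *?showComment=*
/2018/11/some-post.html?spref=fb - blocked by *?spref=fb
/2018/11/some-post.html?utm_source=newsletter - blocked by *?utm_source=*
/2018/11/some-post.html - allowed, no pattern matches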