The robots.txt records supported by all three engines (Google, Yahoo! and MSN) include:
Disallow – tells the spider not to crawl certain files or directories. For example, the record Disallow: / placed under User-agent: * prevents spiders from crawling any file on the site.
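As a minimal sketch, a site-wide Disallow rule can be checked with Python's standard urllib.robotparser module. The spider name "ExampleBot" and the www.example.com URLs are placeholders chosen for illustration, not taken from the article.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content: keep every spider out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and the URLs below are hypothetical.
can_fetch_home = parser.can_fetch("ExampleBot", "http://www.example.com/")
can_fetch_page = parser.can_fetch("ExampleBot", "http://www.example.com/dir/page.htm")
print(can_fetch_home, can_fetch_page)  # False False
```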
Allow – tells the spider that it may crawl certain files. Allow and Disallow work together to tell the spider that most of a directory should not be crawled and only part of it should be. For example, combining Disallow: /ab/ with Allow: /ab/cd/ keeps the spider out of the other files in the ab directory while still letting it crawl the files under cd.
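The ab/cd case can be sketched the same way with urllib.robotparser. One assumption to note: Python's parser applies the first rule that matches, in file order, so the Allow line is listed first here. Some search engines instead pick the most specific matching rule regardless of order. The spider name and URLs are again placeholders.

```python
from urllib.robotparser import RobotFileParser

# Allow is listed first because Python's parser uses the first matching rule.
ROBOTS_TXT = """\
User-agent: *
Allow: /ab/cd/
Disallow: /ab/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical spider name and URLs.
allowed_cd = parser.can_fetch("ExampleBot", "http://www.example.com/ab/cd/page.htm")
blocked_ab = parser.can_fetch("ExampleBot", "http://www.example.com/ab/other.htm")
allowed_elsewhere = parser.can_fetch("ExampleBot", "http://www.example.com/index.htm")
print(allowed_cd, blocked_ab, allowed_elsewhere)  # True False True
```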
$ wildcard – matches the end of the URL. For example, an Allow rule whose pattern ends in .htm$ lets spiders access URLs that end in .htm.
* wildcard – matches any sequence of characters. For example, Disallow: /*.htm prohibits spiders from crawling all .htm files.
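To make the two wildcards concrete, here is a small sketch that translates a robots.txt path pattern into a regular expression. It is an illustration of the matching idea only, not any search engine's actual implementation. The helper names robots_pattern_to_regex and pattern_matches are made up for this example. Python's own urllib.robotparser does plain prefix matching and does not interpret these wildcards, which is why a hand-written matcher is used.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def pattern_matches(pattern: str, path: str) -> bool:
    # Rules apply from the start of the path, so re.match is used.
    return robots_pattern_to_regex(pattern).match(path) is not None


# '/*.htm$' matches only paths that end in .htm.
print(pattern_matches("/*.htm$", "/dir/page.htm"))   # True
print(pattern_matches("/*.htm$", "/dir/page.html"))  # False

# '/*.htm' matches any path containing .htm, including .html pages.
print(pattern_matches("/*.htm", "/dir/page.html"))   # True
print(pattern_matches("/*.htm", "/dir/page.php"))    # False
```

Without the trailing $, the pattern behaves as a prefix match, so /*.htm also catches .html pages and URLs with query strings after .htm.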
Sitemap location – tells spiders where your sitemap is, in the format Sitemap: followed by the full URL of the sitemap file.
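As a rough illustration, the Sitemap record can be read back with urllib.robotparser, whose site_maps() method is available from Python 3.8 onward. The sitemap URL below is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that only declares a sitemap location.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.example.com/sitemap.xml']
```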
Meta tags supported by all three include:
NOINDEX – tells spiders not to index a web page.
NOFOLLOW – tells the spider not to follow the links on the page.
NOSNIPPET – tells the spider not to display a snippet (the descriptive text) in the search results.
NOARCHIVE – tells the spider not to display a cached snapshot of the page.
NOODP – tells spiders not to use the title and description from the Open Directory.
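These values go in the content attribute of a robots Meta tag in the page head. The sketch below, which assumes a made-up sample page and a hypothetical RobotsMetaParser class, uses Python's standard html.parser to list the directives a page declares. This can be handy when checking why a page is or is not indexed.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute holds a comma-separated list.
            for item in attr.get("content", "").split(","):
                item = item.strip().upper()
                if item:
                    self.directives.add(item)


# Made-up sample page for illustration.
PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Example</title>
</head><body>Hello</body></html>"""

meta_parser = RobotsMetaParser()
meta_parser.feed(PAGE)
directives = sorted(meta_parser.directives)
print(directives)  # ['NOFOLLOW', 'NOINDEX']
```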
All three now support these records and tags, although Yahoo! did not seem to support the wildcards before. Baidu now also supports Disallow, Allow and both wildcards. As for Meta tags, I have not found an official statement on whether Baidu supports them.
Meta tags supported only by Google are:
UNAVAILABLE_AFTER – tells the spider when the page expires. After that date, the page should no longer appear in search results.
NOIMAGEINDEX – tells the spider not to index the images on the page.
NOTRANSLATE – tells the spider not to translate the page content.
Yahoo! also supports the following records and tags:
Crawl-Delay – a robots.txt record that tells the spider how long to wait between requests, which lowers the crawl frequency.
NOYDIR – similar to the NOODP tag, but refers to the Yahoo! Directory rather than the Open Directory.
Robots-nocontent – a class attribute value that tells the spider the marked part of the HTML is not part of the page's content or, seen from the other side, tells the spider which part of the page is the main content (the part to be retrieved).
MSN also supports the Meta tag:
One more reminder: the robots.txt file does not have to exist. If the request returns a 404 error, the spider is allowed to crawl all content. However, if the request for robots.txt times out, the search engine may not index the website at all, because the spider cannot tell whether a robots.txt file exists or what it contains. That is different from confirming that the file does not exist.
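The distinction between a missing file and a timeout can be sketched as a small decision function. This is only an illustration of the reasoning above, not how any particular search engine is implemented. The function name robots_txt_outcome and its return strings are invented for the example.

```python
from typing import Optional


def robots_txt_outcome(status: Optional[int]) -> str:
    """Map the result of fetching robots.txt to a crawl decision.

    status is the HTTP status code, or None when the request timed out.
    """
    if status is None:
        # Timeout: the spider cannot tell whether rules exist, so it may
        # hold off on the site rather than risk crawling blocked content.
        return "hold off"
    if status == 404:
        # The file is confirmed absent, so nothing is restricted.
        return "crawl everything"
    if status == 200:
        return "apply the rules in the file"
    # Other status codes are outside what the text describes.
    return "not covered here"


print(robots_txt_outcome(404))   # crawl everything
print(robots_txt_outcome(None))  # hold off
```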
Knowing how to use and write a robots.txt file is a basic SEO skill. The robots.txt file is also the first thing to check when a page that should be indexed is not, or when a page that should not be indexed is.