Search robots: what the Yandex and Google search engine robots are, in simple words, and what work the spiders perform

  • Definitions and terminology
  • Robot names
  • A bit of history
  • What search engine robots do
  • The behavior of robots on the site
  • Robot Management
  • Conclusions

What are search engine robots? What functions do they perform? What are the features of how search robots work? Here we will try to answer these and some other questions related to the work of robots.

Definitions and terminology

In English there are several terms for search robots: robots, web bots, crawlers, spiders; in Russian, one term has effectively taken hold - robots, or bots for short.

The website www.robotstxt.org gives the following definition of a robot:

"The web robot is a program that is bypassing the hypertext structure of WWW, recursively requesting and removing documents."

The key word in this definition is recursively: it means that after retrieving a document, the robot will request the documents it links to, and so on.
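As an illustration, here is a minimal sketch of such recursive traversal using only Python's standard library; the start URL and depth limit are placeholders, and a real robot would add politeness delays, robots.txt checks, and large-scale deduplication:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects the href attributes of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(url, seen, depth=2):
        """Recursively request a document, then the documents it links to."""
        if depth == 0 or url in seen:
            return
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            return  # unreachable document: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            crawl(urljoin(url, link), seen, depth - 1)

    crawl("http://example.com/", seen=set())  # placeholder start URL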

Robot names

Most search robots have their own unique name (except those robots that, for some reason, masquerade as user browsers).

The robot's name can be seen in the User-Agent field of server log files, in the reports of server statistics systems, and on the help pages of search engines.

Thus, Yandex's robots are collectively called Yandex, Rambler's robot is StackRambler, Yahoo!'s robot is Slurp, and so on. Even user programs that collect content for later offline viewing can identify themselves by means of information in the User-Agent field.

In addition to the robot's name, the User-Agent field may contain more information: the robot's version, its purpose, and the address of a page with additional information.
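For example, a line in an access log in the combined format might look like this (the IP address, date, and page are made up; the User-Agent value is Googlebot's well-known signature):

    66.249.66.1 - - [01/Mar/2010:12:00:00 +0300] "GET /page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"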

A bit of history

Back in the first half of the 1990s, during the early development of the Internet, there was a problem with web robots: some of the first robots could put a significant load on a web server, up to the point of failure, because they made a large number of requests to a site in too short a time. System administrators and webmasters had no way to manage a robot's behavior within their sites and could only block the robot's access entirely, not just to the site but to the whole server.

In 1994, the robots.txt protocol was developed, setting out exclusion rules for robots and allowing site owners to manage the behavior of search robots within their sites. You can read about these capabilities in Chapter 6, "How to make a site available for search engines."

Later, as the network grew, the number of search robots increased and their functionality kept expanding. Some search robots did not survive to the present day and remain only in the archives of server log files from the late 1990s. Who now remembers T-Rex, the robot that collected information for the Lycos system? Extinct, like the dinosaur after which it was named. And where can you find Scooter, the AltaVista robot? Nowhere! Yet in 2002 it was still actively indexing documents.

Even in the name of Yandex's main robot you can find an echo of days gone by: the fragment of its full name "compatible; Win16;" was added for compatibility with some old web servers.

What search engine robots do

What functions can robots perform?

A search engine has several different robots, and each has its own purpose. Let us list some of the tasks performed by robots:

  • requesting and retrieving documents;
  • link checking;
  • monitoring updates; checking the availability of a site or server;
  • analyzing page content for subsequent placement of contextual advertising;
  • collecting content in alternative formats (graphics, data in RSS and Atom formats).

As an example, here is a list of Yandex robots. Yandex uses several types of robots with different functions. You can identify them by the User-Agent string.

  1. Yandex/1.01.001 (compatible; Win16; I) - the main indexing robot.
  2. Yandex/1.01.001 (compatible; Win16; P) - image indexer.
  3. Yandex/1.01.001 (compatible; Win16; H) - a robot that detects site mirrors.
  4. Yandex/1.03.003 (compatible; Win16; D) - a robot that visits a page when it is added via the "Add URL" form.
  5. Yandex/1.03.000 (compatible; Win16; m) - a robot that visits a page when it is opened via the "Found words" link.
  6. YandexBlog/0.99.101 (compatible; DOS3.30; Mozilla/5.0; B; robot) - a robot that indexes XML files for blog search.
  7. YandexSomething/1.0 - a robot that indexes news feeds of Yandex.News partners and robots.txt files for the blog search robot.

In addition, Yandex runs several "pinger" robots that only check the availability of documents but do not index them.

  1. Yandex/2.01.000 (compatible; Win16; Dyatel; C) - the Yandex.Catalog "pinger". If a site is unavailable for several days, it is removed from publication. As soon as the site starts responding again, it automatically reappears in the catalog.
  2. Yandex/2.01.000 (compatible; Win16; Dyatel; Z) - the Yandex.Bookmarks "pinger". Links to inaccessible sites are grayed out.
  3. Yandex/2.01.000 (compatible; Win16; Dyatel; D) - the Yandex.Direct "pinger". It checks the correctness of links in ads before moderation.

Nevertheless, the most common robots are those that request, retrieve, and archive documents for subsequent processing by other search engine mechanisms. Here it is appropriate to distinguish the robot from the indexer.

The search robot traverses sites and retrieves documents according to its internal address list. In some cases, the robot can perform basic analysis of documents to replenish the address list. Further processing of the documents and construction of the search engine index is handled by the search engine's indexer. In this scheme, the robot is just a "courier" collecting data.
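This division of labor can be sketched in a few lines of Python; this is a toy model under my own assumptions, not any real engine's code:

    from collections import deque

    def robot(frontier, fetch, storage):
        """The 'courier': downloads documents from its address list into storage."""
        while frontier:
            url = frontier.popleft()
            storage[url] = fetch(url)  # save the raw document, nothing more

    def indexer(storage, index):
        """Separate component: builds an inverted index from stored documents."""
        for url, text in storage.items():
            for word in text.lower().split():
                index.setdefault(word, set()).add(url)

    storage, index = {}, {}
    robot(deque(["http://example.com/"]), lambda u: "example page text", storage)
    indexer(storage, index)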

The behavior of robots on the site

How does a robot's behavior on a site differ from that of a regular user?

  1. Controllability. First of all, a well-behaved robot requests the robots.txt file, with its indexing instructions, from the server.
  2. Selective downloading. When requesting a document, the robot explicitly indicates the types of data it wants, unlike a regular browser, which is ready to accept anything. The main robots of popular search engines will primarily request hypertext and plain-text documents, leaving aside CSS style files, images, video, ZIP archives, and so on. Information in PDF, Rich Text, MS Word, MS Excel, and some other formats is currently also in demand.
  3. Unpredictability. It is impossible to track or predict the robot's path through the site, because it leaves no information in the Referer field (the address of the page it came from); the robot simply requests documents in seemingly random order, but actually according to its internal list or indexing queue.
  4. Speed. Short intervals between requests for different documents: a matter of seconds, or even fractions of a second, between two requests. Some robots even honor special instructions, specified in the robots.txt file, that limit the rate of document requests so as not to overload the site (a sketch of reading these rules programmatically follows this list).
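Points 1 and 4 can be modeled with Python's standard urllib.robotparser module, which reads robots.txt and reports both what may be fetched and how fast; the site and robot name below are placeholders:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()  # request robots.txt first, as a polite robot does

    print(rp.can_fetch("YandexBot", "http://example.com/private/page.html"))
    print(rp.crawl_delay("YandexBot"))  # Crawl-delay for this robot, if set (Python 3.6+)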

We cannot know exactly how an HTML page looks through the eyes of a robot, but we can try to imagine it by turning off the display of graphics and styling in the browser.

Thus, we can conclude that search robots take the HTML code of a page into their index, but without design elements and without pictures.

Robot Management

How can the webmaster control the behavior of search robots on his site?

As mentioned above, the robots exclusion protocol was developed in 1994 as a result of public debate among webmasters. To this day, the protocol has not become a standard that all robots without exception are obliged to observe; it remains merely a strong recommendation. There is no authority to which you can complain about a robot that does not comply with the exclusion rules; you can only block access to the site by means of web server settings or network firewalls for the IP addresses from which the "impolite" robot sent its requests.

However, the robots of the large search engines do comply with the exclusion rules and, moreover, contribute their own extensions to them.

The instructions of the special robots.txt file and the special Robots meta tag are described in detail in Chapter 6, "How to make a site available for search engines."

With the help of additional instructions in robots.txt that are not in the standard, some search engines allow you to control the behavior of their robots more flexibly. Thus, using the Crawl-delay instruction, the webmaster can set the time interval between successive requests for two documents by the Yahoo! and MSN robots, and using the Host instruction you can specify the address of the site's main mirror for Yandex. However, you should be very careful when working with non-standard instructions in robots.txt, because the robot of another search engine may ignore not only an instruction it does not understand, but also the entire set of rules associated with it.
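For instance, a robots.txt combining such non-standard instructions might look like this (the five-second delay and the mirror name are purely illustrative):

    User-agent: Slurp
    Crawl-delay: 5

    User-agent: Yandex
    Host: www.example.com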

You can also influence the visits of search robots indirectly: for example, the Google search engine's robot will re-fetch more often those documents to which many other sites link.

Search engine spiders are online bots whose task is the systematic scanning of pages on the World Wide Web to support web indexing. Traditionally, the scanning of the WWW is carried out to keep information about content posted on the network up to date, in order to provide users with current data about the contents of a given resource. This article discusses the types of search robots and their features.

Search spiders may also be called different things: robots, web spiders, crawlers. Regardless of the name, they are all engaged in constant, continuous study of the contents of the virtual space. The robot keeps a list of URLs whose documents are downloaded on a regular basis. If the robot finds a new link during indexing, it is added to this list.

Thus, a crawler's actions can be compared to those of an ordinary person using a browser, with one difference: we open only the links that interest us, while the robot opens everything it knows about. In addition, having read the contents of an indexed page, the robot transmits data about it in a special form to the search engine's storage servers, where it is kept until requested by a user.

Each robot performs its own specific task: some index text content, some index graphics, others save content in an archive, and so on.

The main task of search engines is to create an algorithm that allows information about pages to be obtained quickly and as fully as possible, because even the search giants cannot scan everything comprehensively. Therefore, each company gives its robots unique mathematical formulas which the bot obeys when selecting the page to visit next. This, together with ranking algorithms, is one of the most important criteria by which users choose a search engine: the one where information about sites is more complete, fresh, and useful.
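How a bot might "select the page to visit next" can be illustrated with a priority queue; the scoring here (a count of known inbound links) is a deliberately naive stand-in for the secret formulas mentioned above:

    import heapq

    def crawl_order(pages):
        """Yield pages in order of a simple priority score (higher = sooner)."""
        heap = [(-score, url) for url, score in pages.items()]
        heapq.heapify(heap)
        while heap:
            neg_score, url = heapq.heappop(heap)
            yield url

    # Hypothetical scores: e.g. how many known sites link to each page.
    pages = {"http://a.example/": 120, "http://b.example/": 3, "http://c.example/": 40}
    print(list(crawl_order(pages)))  # a, then c, then b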

A search robot may not know about your site if no links point to it (which is rare: today, mentions of a domain name appear on the network soon after it is registered). If there are no links, you need to tell the search engine about the site. For this, as a rule, the webmasters' "personal accounts" are used.

What is the main task of search robots

However much we might wish otherwise, the main task of a search robot is not at all to tell the world about the existence of our site. It is difficult to formulate, but still: given that search engines work only because of their customers, that is, users, the robot must provide rapid search and indexing of the data posted on the network. Only this allows the search engine to satisfy the audience's need for current and relevant search results.

Of course, robots cannot index 100% of websites. According to studies, the number of pages captured by the market leaders does not exceed 70% of all URLs on the Internet. However, how fully the bot studies your resource will affect the number of users arriving via search queries. This is why optimizers take such pains to "court" the robot and acquaint it with new pages as quickly as possible.

In RuNet, Yandex moved to second place in monthly audience coverage only in 2016, yielding to Google. So it is not surprising that, among domestic search engines, it has the largest number of spiders studying the space. Listing them in full is pointless: the list can be found in the help section "Webmaster help > Managing the search robot > How to verify that a robot belongs to Yandex."

All search engine crawlers have a strictly regulated User-Agent. Among those a site builder will certainly encounter:

  • Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) - the main indexing bot;
  • Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexBot/3.0; +http://yandex.com/bots) - the mobile indexing spider;
  • Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots) - the Yandex.Images bot;
  • Mozilla/5.0 (compatible; YandexMedia/3.0; +http://yandex.com/bots) - indexes multimedia materials;
  • Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots) - indexes site favicons.
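Since any client can put these strings into its User-Agent field, the help section mentioned above recommends verifying a robot by a reverse DNS lookup with forward confirmation. A sketch of that check in Python follows; the list of Yandex domain suffixes is my assumption, so consult the help section for the authoritative list:

    import socket

    def verify_yandex_bot(ip: str) -> bool:
        # Reverse DNS: the hostname must end in an official Yandex domain.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith((".yandex.ru", ".yandex.net", ".yandex.com")):
            return False
        # Forward confirmation: the hostname must resolve back to the same IP.
        try:
            return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
        except OSError:
            return False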

To attract Yandex spiders to your website, it is recommended to perform a few simple actions:

  • properly configure robots.txt;
  • create an RSS feed;
  • place a sitemap with a complete list of indexable pages;
  • create a page (or pages) containing links to all of the resource's documents;
  • configure HTTP statuses;
  • ensure social activity after publishing materials (not only comments but also sharing of the document);
  • intensively publish new unique texts.

In favor of the last point speaks the bots' ability to remember how quickly content is updated and to visit the site at the detected frequency of new material.

If you want to prohibit Yandex crawlers from accessing certain pages (for example, technical sections), configure the robots.txt file. Search engine spiders understand the robots exclusion standard, so creating the file usually presents no difficulties.

User-Agent: Yandex

Disallow: /

prohibits the search engine from indexing the entire site.
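And to close only hypothetical technical sections while leaving the rest of the site open, the file might look like this (the paths are examples):

    User-agent: Yandex
    Disallow: /admin/
    Disallow: /cgi-bin/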

In addition, Yandex robots take into account recommendations specified in meta tags. For example, the <meta name="robots" content="noarchive"> tag disables showing a link to the saved copy of the document in search results, and adding the <meta name="robots" content="noindex"> tag to a page indicates that this document should not be indexed.

A complete list of permissible values can be found in the "Using HTML elements" section of the webmaster help.

Google search robots

Google's main mechanism for indexing WWW content is called Googlebot. It is configured to examine billions of pages daily in search of new or modified documents. The bot itself determines which pages to scan and which to ignore.

For this crawler, the Sitemap file placed on the site by the resource owner is important. The network of computers that keeps Googlebot running is so powerful that the bot can make requests to your site's pages once every couple of seconds. At the same time, the bot is tuned to analyze more pages in a single visit so as not to load the server. If the site slows down because of frequent spider requests, the scan rate can be lowered via the settings in Search Console. Unfortunately, it is not possible to increase the scan rate.

The Google bot can be asked to re-scan the site. To do this, open Search Console and find the "Add to index" feature, available to users of the "Fetch as Googlebot" tool. After scanning, the "Add" button appears. Note that Google does not guarantee that all changes will be indexed, since the process depends on the work of "complex algorithms."

Useful tools

It is difficult to list all the tools that help optimizers work with bots, as there are a great many. Besides the "Fetch as Googlebot" feature mentioned above, it is worth noting the Google and Yandex robots.txt analyzers, Sitemap file analyzers, and the server response check service from the Russian search engine. With their capabilities, you can picture how your site looks through a spider's eyes, which helps you avoid mistakes and ensures the fastest scanning of the site.

Hello everyone! Today I will tell you about how a search robot works. You will also learn what kinds of search robots there are, their purpose, and their features.

Let's start, perhaps, with a definition.

A search robot is a kind of program that visits hypertext links, retrieving from a given resource all subsequent documents into the search engine index.

Each search robot has its own unique name - crawler, spider, etc.

What does the search robot do

As I said, each robot has its own unique name and, accordingly, each performs its own particular job, or, let's say, purpose.

Let's look at what functions they perform:

  • Requesting access to the site;
  • Requesting pages for processing and retrieval;
  • Requesting content for analysis;
  • Searching for links;
  • Monitoring updates;
  • Requesting RSS data (content collection);
  • Indexing.

For example, Yandex has several robots that separately index, analyze, and collect information on the following kinds of data:

  • Video;
  • Pictures;
  • Site mirror;
  • XML files;
  • File robots.txt;
  • Comments;

In general, in fact, the search robot simply visits Internet resources, collecting the necessary data, which it then passes on to the search engine's indexer.

It is the search engine's indexer that processes the collected data and builds the search engine index properly. I would even say that the robot is a "courier" that merely collects information.

How robots behave and how to manage them

The differences between a robot's behavior and that of a simple user on the site are as follows:

1. First, there's manageability. First of all, the robot requests the robots.txt file from your hosting, which indicates what may be indexed and what may not.

2. A particular distinction of the robot is its speed. Between requests for two different documents, there may be mere seconds, or even fractions of a second.

There is even a special rule for this, which you can specify in the robots.txt file so that a search engine's robot limits its requests, thereby reducing the load on the blog.
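For example, the rule might look like this (five seconds is an arbitrary value, and not every robot honors this directive):

    User-agent: *
    Crawl-delay: 5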

3. I would also like to mention their unpredictability. When a robot visits your blog, its actions cannot be tracked; it is impossible to know where it came from. It operates by its own principles, in the order in which the indexing queue is built.

4. And one more thing: the robot pays attention first of all to hypertext and text documents, not to all sorts of files related to the design, such as CSS and the like.

Want to see what a page of your blog looks like through the eyes of a search robot? Simply disable the display of Flash, images, and styling in your browser.

And you will see that any search robot puts into the index only the HTML code of the page, without any pictures or other content.

And now it's time to talk about how to manage them. As I said earlier, you can manage robots through the special robots.txt file, in which you can write the instructions and exceptions we need in order to control their behavior on your blog.

A search robot is an integral part of a search engine, intended for traversing Internet pages in order to add information about them to the search engine's database. In its principle of operation, the spider resembles an ordinary browser. It analyzes the contents of a page, saves it in a special form on the server of the search engine it belongs to, and then follows links to subsequent pages. Search engine owners often limit the depth of the spider's penetration into a site and the maximum size of scanned text, so overly large sites may not be fully indexed by the search engine. In addition to ordinary spiders, there are so-called "dyatly" (woodpeckers): robots that "tap" an indexed site to determine whether it is available.

The order in which pages are visited, the frequency of visits, and protection against looping, as well as the criteria for extracting meaningful information, are determined by information retrieval algorithms.

In most cases, the transition from one page to another follows the links contained on the first and subsequent pages.

Many search engines also give the user the ability to add a site to the queue for indexing manually. This usually speeds up the site's indexing considerably, and in cases where no external links lead to the site, it is practically the only way to announce its existence. Another way to get a site indexed quickly is to add the web analytics systems belonging to search services to the site, such as Google Analytics, Yandex.Metrica, and Рейтинг@Mail.ru from Google, Yandex, and Mail.ru, respectively.

You can limit site indexing using the robots.txt file. Full protection from indexing can be provided by other mechanisms, such as setting a password on the page or requiring a registration form to be filled in before accessing the content.


Looking through server logs, you can sometimes observe excessive interest in your sites from search robots. If the bots are useful (for example, the indexing bots of search engines), all you can do is observe, even if the load on the server increases. But there is also a mass of secondary robots whose access to the site is not needed. For myself and for you, dear reader, I have collected this information and arranged it in a convenient table.

What search robots are

A search bot, or as they are also called - robot, crawler, spider - is nothing other than a program that searches and scans the contents of sites by following the links on their pages. Search robots are used not only by search engines. For example, the Ahrefs service uses spiders to collect data about backlinks, and Facebook performs web scraping of page code to display link reposts with titles, pictures, and descriptions. Web scraping is the collection of information from various resources.

Using the names of spiders in robots.txt

As you can see, any serious project associated with searching content has its own spiders. And sometimes there is a pressing need to restrict the access of certain spiders to the site or to some of its sections. This can be done through the robots.txt file in the root directory of the site. I wrote more about configuring robots earlier; I recommend reading it.

Note: the robots.txt file and its directives may be ignored by search robots. Directives are only recommendations for bots.

You can set directives for a particular search robot using a section: an appeal to that robot's User-agent. Sections for different spiders are separated by one empty line.

User-agent: Googlebot

Allow: /

Above is an example of addressing Google's main search robot.
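To address several robots differently, the sections are simply stacked with an empty line between them; AhrefsBot here just stands in for any bot you might want to keep out:

    User-agent: Googlebot
    Allow: /

    User-agent: AhrefsBot
    Disallow: /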

Initially, I planned to add to the table records of how search bots identify themselves in server logs. But since this data has little value for SEO, and there may be several record types for each agent token, I decided to limit the table to the bots' names and their purpose.

Google search robots

User-Agent - Functions
Googlebot - the main crawler, indexing pages for desktop and for smartphone-optimized search
Mediapartners-Google - AdSense advertising network robot
APIs-Google - user agent of Google APIs
AdsBot-Google - checks the quality of advertising on web pages intended for desktop
AdsBot-Google-Mobile - checks the quality of advertising on web pages intended for mobile devices
Googlebot-Image (Googlebot) - indexes images on site pages
Googlebot-News (Googlebot) - looks for pages to add to Google News
Googlebot-Video (Googlebot) - indexes video materials
AdsBot-Google-Mobile-Apps - checks the quality of advertising in apps for Android devices; works on the same principles as the regular AdsBot

Yandex search robots

User-Agent - Functions
Yandex - specifying this agent token in robots.txt addresses all Yandex bots
YandexBot - the main indexing robot
YandexDirect - downloads information about the content of partner sites
YandexImages - indexes images on sites
YandexMetrika - Yandex.Metrica robot
YandexMobileBot - downloads documents to analyze whether the layout suits mobile devices
YandexMedia - robot indexing multimedia data
YandexNews - Yandex.News indexer
YandexPageChecker - micro-markup validator
YandexMarket - Yandex.Market robot
YandexCalendar - Yandex.Calendar robot
YandexDirectDyn - generates dynamic banners (Direct)
YaDirectFetcher - downloads pages with advertisements to check their availability and clarify their topics
YandexAccessibilityBot - downloads pages to check their accessibility for users
YandexScreenshotBot - takes a snapshot (screenshot) of pages
YandexVideoParser - spider of the Yandex.Video service
YandexSearchShop - downloads YML files of product catalogs
YandexOntoDBAPI - object answer robot that downloads dynamic data

Other popular search bots

User-Agent - Functions
Baiduspider - spider of the Chinese search engine Baidu
Cliqzbot - robot of the anonymous search engine Cliqz
AhrefsBot - bot of the Ahrefs service (link analysis)
Genieo - robot of the Genieo service
Bingbot - crawler of the Bing search engine
Slurp - crawler of the Yahoo search engine
DuckDuckBot - web crawler of the DuckDuckGo search engine
facebot - Facebook's robot for web crawling
WebAlta (WebAlta Crawler/2.0) - search crawler of the WebAlta search engine
BomboraBot - scans pages involved in the Bombora project
CCBot - Nutch-based crawler that uses the Apache Hadoop project
msnbot - bot of the MSN search engine
Mail.Ru - crawler of the Mail.Ru search engine
ia_archiver - collects data for the Alexa service
Teoma - bot of the Ask service

There are a great many search bots; I have selected only the most popular and well-known. If there are bots you have encountered because of aggressive and persistent site scanning, please mention them in the comments and I will add them to the table as well.