Search robots: how they work and what they do. Search robots of Google, Yandex, and other search engines and services

Thematic link collections are lists compiled by a group of professionals or even by individual enthusiasts. Very often a highly specialized topic is covered better by one specialist than by the staff of a large catalog. There are so many thematic collections on the Web that it makes no sense to give specific addresses.

Domain name selection

A catalog is a convenient search tool; however, to reach the server of Microsoft or IBM there is hardly any point in consulting a catalog. It is not difficult to guess the name of the corresponding site: www.microsoft.com, www.ibm.com, or www.microsoft.ru, www.ibm.ru - the sites of the Russian representative offices of these companies.

Similarly, if a user needs a site dedicated to the world's weather, it is logical to look for it on the www.weather.com server. In most cases, searching for a site with a keyword in its name is more efficient than searching for a document whose text merely uses that word. If a Western commercial company (or project) has a one-word name and runs its own server on the Web, its address is highly likely to fit the www.name.com format, and for Runet (the Russian part of the Web) www.name.ru, where name is the name of the company or project. Address guessing can successfully compete with other search methods, because it can establish a connection with a server that is not registered with any search engine. However, if you cannot guess the name you are looking for, you will have to turn to a search engine.
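As a rough illustration of address guessing, the Python sketch below builds the typical www.name.com and www.name.ru candidates for a given name and reports which of them answer. The name "ibm" and the five-second timeout are illustrative choices only.

from urllib.request import urlopen
from urllib.error import URLError

def guess_site(name):
    """Try the typical www.name.com / www.name.ru addresses and report which respond."""
    candidates = [f"http://www.{name}.com", f"http://www.{name}.ru"]
    found = []
    for url in candidates:
        try:
            with urlopen(url, timeout=5) as response:
                if response.status < 400:
                    found.append(url)
        except (URLError, OSError):
            pass  # the host does not exist or does not answer
    return found

print(guess_site("ibm"))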

Search engines

Tell me what you are looking for on the Internet, and I will tell you who you are

If the computer were a highly intelligent system that could easily understand what you are looking for, it would return two or three documents - exactly the ones you need. Unfortunately, this is not the case, and in response to a query the user usually receives a long list of documents, many of which have nothing to do with what was asked. A document that contains the information you are looking for is called relevant; documents that do not are called irrelevant (from the English relevant - pertinent, to the point). Obviously, the percentage of relevant documents received depends on the ability to formulate the query competently. The proportion of relevant documents in the list of all documents found by the search engine is called search accuracy (precision). Irrelevant documents are called noise. If all found documents are relevant (there is no noise), the search accuracy is 100%. If all relevant documents are found, the search completeness (recall) is 100%.

Thus, the quality of a search is determined by two interdependent parameters: search accuracy and search completeness. Increasing the completeness of a search reduces its accuracy, and vice versa.
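Both measures are easy to express in code. Below is a minimal Python sketch, assuming each document is identified by a URL string; retrieved is what the search engine returned, relevant is the full set of documents that actually answer the query.

def precision(retrieved, relevant):
    # share of returned documents that are relevant ("search accuracy")
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    # share of all relevant documents that were actually returned ("search completeness")
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

found = ["url1", "url2", "url3", "url4"]
needed = ["url2", "url4", "url5"]
print(precision(found, needed), recall(found, needed))  # 0.5 and about 0.67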

How a search engine works

A search engine can be compared to a help desk whose agents go around enterprises collecting information into a database (Fig. 4.21). When the service is contacted, information is issued from this database. The data in the database becomes outdated, so the agents update it periodically. Some enterprises send data about themselves, and the agents do not have to visit them. In other words, the help desk has two functions: creating and constantly updating the data in the database, and searching the database at the client's request.


Fig. 4.21.

Likewise, a search engine consists of two parts: the so-called robot (or spider), which crawls Web servers and builds the search engine's database, and the search mechanism, which answers user queries from that database.

The database is formed mainly by the robot itself (the robot finds links to new resources) and, to a much lesser extent, by resource owners who register their sites with the search engine. In addition to the robot (also called a network agent, spider, or worm) that builds the database, there is a program that determines the ranking of the links found.

The principle of operation of a search engine is that it queries its internal catalog (database) for the keywords the user specifies in the query field and returns a list of links ranked by relevance.
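A minimal Python sketch of this idea, assuming the internal catalog is an in-memory dictionary that maps each keyword to weighted links (the URLs and weights below are invented for illustration):

# keyword -> {url: weight accumulated during indexing}
index = {
    "dog": {"https://example.com/dogs": 3.2, "https://example.org/pets": 1.1},
    "breed": {"https://example.com/dogs": 2.0},
}

def search(query):
    scores = {}
    for word in query.lower().split():
        for url, weight in index.get(word, {}).items():
            scores[url] = scores.get(url, 0.0) + weight
    # rank links by accumulated relevance score, highest first
    return sorted(scores, key=scores.get, reverse=True)

print(search("dog breed"))  # the page matching both words comes first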

It should be noted that when processing a specific user query, the search engine operates on precisely these internal resources (it does not set off on a journey across the Web, as inexperienced users often believe), and internal resources are naturally limited. Although the search engine database is constantly updated, it cannot index all Web documents: their number is too large. Therefore, there is always a possibility that the resource you are looking for is simply unknown to a particular search engine.

This idea is clearly illustrated in Fig. 4.22. Ellipse 1 bounds the set of all Web documents that exist at a given moment, ellipse 2 all documents indexed by the given search engine, and ellipse 3 the documents being sought. With this search engine, you can find only the part of the sought documents that it has indexed.


Fig. 4.22.

The problem of insufficient search completeness lies not only in the limited internal resources of the search engine but also in the fact that the robot's speed is finite while the number of new Web documents is constantly growing. Increasing the search engine's internal resources cannot completely solve the problem, since the rate at which the robot crawls resources is finite.

At the same time, it would be wrong to assume that a search engine contains a copy of every original Internet resource. Complete information (the source documents) is by no means always stored; more often only a part of it is stored - the so-called indexed list, or index, which is much more compact than the text of the documents and allows search queries to be answered quickly.

To build an index, the source data is transformed so that the database volume is minimal while the search is very fast and yields the maximum useful information. To explain what an indexed list is, one can draw a parallel with its paper counterpart - the so-called concordance, i.e. a dictionary that lists in alphabetical order the words used by a particular writer, together with references to them and the frequency of their use in his works.

Obviously, a concordance (dictionary) is much more compact than the original texts of the works, and finding the right word in it is much easier than leafing through a book hoping to stumble upon it.
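A toy concordance takes only a few lines of Python. The sketch below uses two invented one-sentence "documents" and records, for every word, its references (document and position), from which the frequency of use follows directly:

import re
from collections import defaultdict

def build_concordance(documents):
    """word -> list of (doc_id, position) references."""
    concordance = defaultdict(list)
    for doc_id, text in documents.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            concordance[word].append((doc_id, pos))
    return concordance

docs = {"page1": "The dog chased the cat", "page2": "A cat and a dog"}
conc = build_concordance(docs)
print(len(conc["dog"]), conc["dog"])  # frequency of "dog" and where it occurs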

Index building

The index construction scheme is shown in Fig. 4.23. Web agents, or spider robots, "crawl" the Web, analyze the content of Web pages, and collect information about what was found on which page.


Fig. 4.23.

When it finds the next HTML page, most search engines record the words, pictures, links, and other elements contained on it (different search engines do this differently). Moreover, when tracking the words on a page, not only their presence is recorded but also their location, i.e. where these words appear: in the title, in subtitles, in meta tags (service tags that allow developers to place auxiliary information on Web pages, including information intended to orient the search engine), or elsewhere. In this case, usually only meaningful words are recorded, while conjunctions and particles like "a", "but", and "or" are ignored. Meta tags allow page owners to define the keywords and topics under which the page is indexed. This can be relevant when keywords have multiple meanings: meta tags can guide the search engine toward the single correct meaning of a word. However, meta tags work reliably only when they are filled in by honest site owners. Unscrupulous Web site owners put the most popular words on the Web into their meta tags even when they have nothing to do with the topic of the site. As a result, visitors land on unsolicited sites, thereby increasing their ranking. That is why many modern search engines either ignore meta tags or treat them as merely supplementary to the page text. Each robot maintains its own list of resources penalized for unfair advertising.

Obviously, if you search for sites using the keyword "dog", the search engine should find not just all pages where the word "dog" is mentioned, but those where the word is related to the topic of the site. To determine the extent to which a particular word is relevant to the profile of a certain Web page, it is necessary to assess how often it occurs on the page and whether or not there are links to other pages for this word. In short, the words found on the page must be ranked in order of importance. Words are assigned weights depending on how many times and where they occur (in the page title, at the beginning or end of the page, in a link, in a meta tag, etc.). Each search engine has its own weighting algorithm - this is one of the reasons why search engines return different lists of resources for the same keyword. Because pages are constantly updated, the indexing process must be continuous. Spider robots traverse links and build a file containing the index, which can be quite large. To reduce its size, they minimize the amount of information stored and compress the file. With multiple robots, a search engine can process hundreds of pages per second. Today, powerful search engines store hundreds of millions of pages and receive tens of millions of queries daily.
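A much simplified weighting scheme might look like the Python sketch below. The zone names, numeric weights, and stop-word list are invented for illustration; real engines use far more elaborate (and secret) formulas.

# illustrative zone weights: title and meta words count more than body text
ZONE_WEIGHTS = {"title": 5.0, "meta": 3.0, "heading": 2.0, "body": 1.0}
STOP_WORDS = {"a", "an", "and", "or", "but", "the", "to"}

def word_weights(zones):
    """zones: dict mapping zone name -> text; returns word -> accumulated weight."""
    weights = {}
    for zone, text in zones.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue  # conjunctions and the like are ignored
            weights[word] = weights.get(word, 0.0) + ZONE_WEIGHTS.get(zone, 1.0)
    return weights

page = {"title": "Dog breeds", "body": "a guide to popular dog breeds and their care"}
print(word_weights(page))  # "dog" and "breeds" score highest: they also appear in the title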

When building the index, the problem of reducing the number of duplicates is also solved - a non-trivial task, given that a correct comparison first requires determining the document encoding. An even harder task is separating very similar documents (called "near duplicates"), such as documents in which only the heading differs while the text is duplicated. There are a great many such documents on the Web - for example, someone copies an essay and publishes it on a site under his own name. Modern search engines can solve such problems.
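One common way to spot near duplicates (not necessarily the method any particular engine uses) is to compare documents by overlapping word shingles. The sketch below treats two texts as near duplicates when their Jaccard similarity exceeds an arbitrarily chosen threshold:

def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

doc1 = "Essay on dog breeds. The body of the essay follows and is identical..."
doc2 = "My term paper on dog breeds. The body of the essay follows and is identical..."
print(similarity(doc1, doc2) > 0.5)  # True: only the heading differs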

Looking through server logs, you can sometimes observe excessive interest in a site from search robots. If the bots are useful (for example, the indexing bots of search engines), all that remains is to observe, even if the load on the server increases. But there are also many secondary robots whose access to the site is not required. For myself and for you, dear reader, I have collected this information and compiled it into a convenient table.

What search robots are

A search bot (also called a robot, crawler, or spider) is nothing more than a program that searches and scans the content of sites by following the links on their pages. Search robots are used not only by search engines. For example, the Ahrefs service uses spiders to improve its data on backlinks, and Facebook performs web scraping of page code to display link reposts with titles, pictures, and descriptions. Web scraping is the collection of information from various resources.
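If you want to see which bots visit your own site, a small script can tally hits by user agent in the server access log. The Python sketch below assumes the common "combined" log format, where the user agent is the last quoted field; the file name and the list of tokens are only examples.

import re
from collections import Counter

BOT_TOKENS = ("Googlebot", "YandexBot", "bingbot", "AhrefsBot")  # tokens to watch for

def count_bot_hits(log_path):
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            quoted = re.findall(r'"([^"]*)"', line)
            agent = quoted[-1] if quoted else ""  # user agent is the last quoted field
            for token in BOT_TOKENS:
                if token.lower() in agent.lower():
                    counts[token] += 1
    return counts

print(count_bot_hits("access.log"))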

Using spider names in robots.txt

As you can see, any serious project related to content search has its own spiders. And sometimes there is an urgent need to restrict some spiders' access to the site or to individual sections of it. This can be done through the robots.txt file in the root directory of the site. I wrote about setting up robots.txt earlier, and I recommend that you read it.

Please note that search robots may ignore the robots.txt file and its directives. Directives are only guidelines for bots.

You can set a directive for a particular search robot using a section - an appeal to that robot's user agent. Sections for different spiders are separated by a single blank line.

User-agent: Googlebot
Allow: /

The above is an example of a call to the main Google crawler.
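Python's standard library ships urllib.robotparser, which reads a robots.txt file and answers whether a given agent may fetch a given URL. The example below uses a placeholder site address:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder; substitute a real site
rp.read()
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))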

Initially, I planned to add entries to the table about how search bots identify themselves in server logs. But since this data is of little importance for SEO and there can be several types of records for each agent token, I decided to limit the table to the bots' names and their purpose.

Google search robots

User agent - Function
Googlebot - the main crawler-indexer for PC and smartphone-optimized pages
Mediapartners-Google - AdSense ad network robot
APIs-Google - user agent of Google APIs
AdsBot-Google - checks the quality of ads on web pages designed for PC
AdsBot-Google-Mobile - checks the quality of ads on web pages designed for mobile devices
Googlebot Image (Googlebot) - indexes images on site pages
Googlebot News (Googlebot) - looks for pages to add to Google News
Googlebot Video (Googlebot) - indexes video content
AdsBot-Google-Mobile-Apps - checks the quality of ads in apps for Android devices; works on the same principles as the regular AdsBot

Yandex search robots

User agent - Function
Yandex - when this agent token is specified in robots.txt, the directives apply to all Yandex bots
YandexBot - the main indexing robot
YandexDirect - downloads information about the content of YAN (Yandex Advertising Network) partner sites
YandexImages - indexes site images
YandexMetrika - Yandex.Metrica robot
YandexMobileBot - downloads documents to check whether they have a layout suitable for mobile devices
YandexMedia - robot that indexes multimedia data
YandexNews - Yandex.News indexer
YandexPagechecker - microdata validator
YandexMarket - Yandex.Market robot
YandexCalendar - Yandex.Calendar robot
YandexDirectDyn - generates dynamic banners (Direct)
YaDirectFetcher - downloads pages from advertisements to check their availability and clarify their subject matter (YAN)
YandexAccessibilityBot - downloads pages to check their accessibility for users
YandexScreenshotBot - takes a snapshot (screenshot) of the page
YandexVideoParser - Yandex.Video service spider
YandexSearchShop - downloads YML files of product catalogs
YandexOntoDBAPI - Object Answer robot that downloads dynamic data

Other popular search bots

User agent - Function
Baiduspider - spider of the Chinese search engine Baidu
cliqzbot - robot of the Cliqz anonymous search engine
AhrefsBot - Ahrefs bot (link analysis)
Genieo - Genieo service robot
bingbot - Bing search engine crawler
Slurp - Yahoo search engine crawler
DuckDuckBot - DuckDuckGo search engine web crawler
facebot - Facebook robot for web crawling
WebAlta (WebAlta Crawler/2.0) - WebAlta search engine crawler
BomboraBot - scans pages involved in the Bombora project
CCBot - Nutch-based crawler that uses the Apache Hadoop project
MSNBot - MSN search engine bot
Mail.Ru - Mail.Ru search engine crawler
ia_archiver - collects data for the Alexa service
Teoma - Ask service bot

There are a great many search bots; I have selected only the most popular and well-known ones. If there are bots you have encountered through aggressive and persistent site crawling, please mention them in the comments and I will add them to the table as well.

Usually, a search engine is a site that specializes in finding information matching the user's query criteria. The main task of such sites is to organize and structure information on the network.

Most people who use the services of a search engine never wonder exactly how the machine works when it looks for the necessary information in the depths of the Internet.

For an ordinary network user, understanding the principles of search engine operation is not critical, since the algorithms that guide the system are able to satisfy the needs of a person who does not know how to compose an optimized query for the information they need. But for a web developer and for specialists involved in website optimization, it is simply necessary to have at least a basic understanding of the structure and principles of search engines.

Each search engine operates on precise algorithms that are kept in the strictest confidence and known only to a small circle of employees. But when designing or optimizing a site, you must take into account the general rules of search engine operation, which are discussed in this article.

Although each search engine has its own structure, after careful study they can be reduced to basic, generalized components:

Indexing module

Indexing module - this element includes three auxiliary components (robot programs):

1. Spider - downloads pages, filters the text stream, and extracts all internal hyperlinks from it. In addition, the spider saves the download date, the server response header, and the URL - the page address.

2. Crawler (the "traveling" spider robot) - analyzes all the links on the page and, based on this analysis, determines which page to visit and which is not worth visiting. In the same way, the crawler finds new resources that the search engine should process.

3. Indexer (the indexing robot) - analyzes the Internet pages downloaded by the spider. The page is divided into blocks and analyzed by the indexer using morphological and lexical algorithms. Various parts of a web page fall under the indexer's analysis: headings, text, and other service information.

All documents processed by this module are stored in the search engine's database, called the system index. In addition to the documents themselves, the database contains the necessary service data - the result of careful processing of those documents, which the search engine relies on when fulfilling user requests.
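The division of labor between spider, crawler, and indexer can be sketched with the Python standard library alone: download a page, pull out its hyperlinks for the crawl queue, and reduce the visible text to words for indexing. This is only an illustration of the idea, not how any production system is built.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Collects hyperlinks (the crawler's input) and visible text (the indexer's input)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def fetch_and_parse(url):
    # spider: download the page and record the server response
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
        status = response.status
    parser = PageParser(url)
    parser.feed(html)
    words = " ".join(parser.text).lower().split()
    return {"url": url, "status": status, "links": parser.links, "words": words}

print(fetch_and_parse("https://example.com/"))  # placeholder address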

Search server

The next very important component of the system is the search server, whose task is to process the user's request and generate the search results page.

When processing the user's query, the search server calculates the relevance of the selected documents to the query. This ranking determines the position that a web page will occupy in the search results. Each document matching the search criteria is displayed on the results page as a snippet.

A snippet is a short description of a page, including its title, a link, keywords, and brief text information. From the snippet, the user can evaluate how relevant the pages selected by the search engine are to his query.
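Generating a snippet can be as simple as cutting a window of text around the first occurrence of the query. The Python sketch below (with an invented title, URL, and page text) shows the idea:

def make_snippet(title, url, text, query, width=60):
    """Build a short result entry: title, link, and a text fragment around the query."""
    pos = text.lower().find(query.lower())
    if pos == -1:
        fragment = text[:width * 2]
    else:
        start = max(pos - width, 0)
        fragment = text[start:pos + len(query) + width]
    return f"{title}\n{url}\n...{fragment.strip()}..."

print(make_snippet("Dog breeds", "https://example.com/dogs",
                   "A complete guide to popular dog breeds and their care.", "dog breeds"))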

The most important criterion the search server uses when ranking query results is the TCI (thematic citation index) indicator, already familiar to us.

All the described components of a search engine are expensive and very resource-intensive. The performance of a search engine directly depends on how effectively these components interact.


A search robot is a special program of a search engine designed to add to its database (to index) the sites found on the Internet and their pages. Other names are also used: crawler, spider, bot, automatic indexer, ant, webcrawler, webscutter, webrobot, webspider.

Principle of operation

A search robot is a browser-type program. It constantly scans the network: it visits indexed (already known) sites, follows links from them, and finds new resources. When a new resource is discovered, the robot adds it to the search engine's index. The search robot also indexes updates on sites, and the frequency of updates is recorded. For example, a site that is updated once a week will be visited by a spider at that frequency, while content on news sites can be indexed within minutes of publication. If no links from other resources lead to a site, then to attract search robots the resource must be added through a special form (Google Webmaster Center, Yandex Webmaster panel, etc.).
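A naive way to decide when to come back to a page (purely illustrative, not how Yandex or Google actually schedule crawls) is to derive the next visit time from the average interval between the changes observed so far:

from datetime import datetime, timedelta

def next_visit(change_times):
    """Estimate when to recrawl a page from the timestamps of its observed changes."""
    if len(change_times) < 2:
        return datetime.now() + timedelta(days=1)  # default interval for new pages
    gaps = [(b - a).total_seconds() for a, b in zip(change_times, change_times[1:])]
    average_gap = timedelta(seconds=sum(gaps) / len(gaps))
    return change_times[-1] + average_gap

changes = [datetime(2023, 5, 1), datetime(2023, 5, 8), datetime(2023, 5, 15)]
print(next_visit(changes))  # a weekly-updated site gets a roughly weekly revisit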

Types of search robots

Yandex spiders:

• Yandex/1.01.001 (I) - the main indexing bot,
• Yandex/1.01.001 (P) - indexes pictures,
• Yandex/1.01.001 (H) - finds site mirrors,
• Yandex/1.03.003 (D) - determines whether a page added via the webmaster panel matches the indexing parameters,
• YaDirectBot/1.0 (I) - indexes resources of the Yandex advertising network,
• Yandex/1.02.000 (F) - indexes site favicons.

Google spiders:

• Googlebot - the main robot,
• Googlebot News - crawls and indexes news,
• Google Mobile - indexes sites for mobile devices,
• Googlebot Images - searches for and indexes images,
• Googlebot Video - indexes videos,
• Google AdsBot - checks the quality of landing pages,
• Google Mobile AdSense and Google AdSense - index the sites of the Google advertising network.

Other search engines also use several types of robots that are functionally similar to those listed.