What Is R: An Introductory Guide. The R Statistical Computing Environment, the R Language and Its Packages

Programming on R. Level 1. Basics

The R language is one of the world's most popular tools for statistical data analysis. It offers a very wide range of capabilities for data analysis, visualization, and the creation of documents and web applications. Would you like to master this powerful language under the guidance of an experienced mentor? We invite you to the course "Programming in the R language. Level 1. Basic knowledge".

This course is intended for a wide range of professionals who need to look for patterns in large amounts of data, visualize them and draw statistically sound conclusions: sociologists, clinical trial managers and pharmacologists, researchers (astronomy, physics, biology, genetics, medicine, etc.), IT analysts, business analysts, financial analysts, and marketers. The course will also appeal to specialists who are dissatisfied with the functionality (or the price) of the tools they currently use.

In the classroom, you will gain basic skills in data analysis and visualization in the R environment. Most of the time is devoted to practical tasks and work with real data sets. You will get to know the current tools for working with data and learn how to apply them in your work.

On completing the course, you receive the center's certificate of advanced training.

Let's talk a little about a programming language called R. Our blog has recently covered areas where you simply need a powerful language for statistics and graphics at hand, and R is one of those languages. If you are new to the world of programming, this may be hard to believe, but by some rankings R is already more popular than SQL, and it is actively used in commercial organizations, research and universities.

Without delving into the rules, syntax, and specific areas of application, let's just look at the main books and resources that will help you learn R from scratch.

What the R language is, why you need it and how to use it wisely are all covered in the wonderful talk that Ruslan Kuptsov gave a little less than a year ago as part of GeekWeek-2015.

Books

Now that things are a little more orderly in your head, you can start reading the literature, of which there is more than enough. Let's start with Russian authors:


Internet resources

Anyone who wants to learn a programming language should visit two resources in search of knowledge: the official website of its developers and the largest online community. Well, let's not make an exception for R:

Out of concern for those who have not yet had time to learn English but really want to learn R, let's also mention a few Russian resources:

In the meantime, let's complete the picture with a short list of English-language, but no less informative, sites:

CRAN - the place where you can download R to your computer, along with manuals, examples and other useful reading;

Quick-R - a brief and clear guide to statistics, statistical methods and the R language;

Burns-Stat - about R and its predecessor S, with a huge number of examples;

R for Data Science - another book by Garrett Grolemund, published as an online textbook;

Awesome R - a curated selection of the best R code and packages, hosted on our favorite GitHub;

MRAN - the R language from Microsoft;

Tutorial R - another well-organized resource from the official site.


The beauty of R is this:

  1. This program is free (distributed under the GPL license),
  2. Many packages have been written for this program to solve a wide range of tasks. All of them are also free.
  3. The program is very flexible: the sizes of any vectors and matrices can be changed at the request of the user, the data does not have a rigid structure. This property turns out to be extremely useful in the case of forecasting, when the researcher needs to make a forecast for an arbitrary period.

The latter property is especially relevant, since other statistical packages (such as SPSS, Eviews, Stata) assume that we are only interested in analyzing data that has a fixed structure (for example, all data in a work file must be of the same frequency, with the same start and end).
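As a rough illustration of this flexibility, the sketch below resizes a vector and a matrix after they have been created. The object names (sales, quarters) and the numbers are made up for the example.

```r
# A vector can be extended at any time; the values are illustrative only
sales <- c(10, 12, 15)
sales <- c(sales, 18)          # append one more observation
length(sales) <- 6             # make room for two future values (filled with NA)

# A matrix can grow by rows or by columns in the same way
quarters <- matrix(1:4, nrow = 2, ncol = 2)
quarters <- rbind(quarters, c(5L, 6L))   # add a third row
quarters <- cbind(quarters, 7:9)         # add a third column
dim(quarters)
```

The NA placeholders at the end of sales are one simple way to reserve space for a forecast of arbitrary length.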

However, R is not the friendliest program. While working with it, forget about the mouse: almost all important actions are performed from the command line. To make life a little easier, and the program itself a little friendlier, there is a front end (an external interface) called RStudio. It is downloaded separately and installed after R itself has been installed. RStudio has many convenient tools and a nice interface, but analysis and forecasting in it are still carried out from the command line.

Let's try to take a look at this wonderful program.

Introduction to RStudio

The RStudio interface looks like this:

The upper right corner of RStudio shows the name of the project (so far it reads "None", that is, there is no project). If you click on this label and select "New Project", you will be prompted to create a project. For basic forecasting purposes, it is enough to select "New Directory" (a new folder for the project), then "Empty Project", and then enter the name of the project and choose the directory in which to save it. Use your imagination and come up with a name yourself :).

Working with one project, you can always access the data, commands and scripts stored in it.

The console is located on the left side of the RStudio window. It is in it that we will enter various commands. For example, let's write the following:

x <- rnorm(100, 0, 1)

This command generates 100 random values from a normal distribution with zero mean and unit standard deviation (and therefore unit variance), then creates a vector called "x" and writes the 100 values into it. The symbol "<-" is equivalent to "=" and shows which value to assign to the variable on its left. Sometimes it is more convenient to use the symbol "->" instead, although in that case our variable must be on the right. For example, the following code creates an object "y" identical to the object "x":

x -> y
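The three assignment forms can be compared side by side in a minimal sketch. The seed value and the extra object z are assumptions added only to make the example repeatable and complete.

```r
set.seed(42)                          # assumed seed, only to make the draw repeatable
x <- rnorm(100, mean = 0, sd = 1)     # leftward assignment
x -> y                                # rightward assignment: y is a copy of x
z = x                                 # "=" also assigns at the top level

identical(x, y)   # TRUE
identical(x, z)   # TRUE
ls()              # names of the objects created so far in the session
```

Note that the third argument of rnorm() is the standard deviation, not the variance; the two coincide only because the value is 1.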

These vectors now appear at the top right of the screen, under a tab that I have labeled "Environment":

Changes in the "Environment" tab

This part of the screen will display all the objects that we save during the session. For example, if we create a matrix like this:

\(A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\)

with the command:

A <- matrix(c(1, 0, 1, 1), 2, 2)

then it will appear in the "Environment" tab:

Any function we use requires us to give values to certain parameters. The function matrix() has the following parameters:

  • data is a vector with the data to be written to the matrix,
  • nrow is the number of rows in the matrix,
  • ncol is the number of columns in the matrix,
  • byrow is a logical parameter. If "TRUE", the matrix is filled row by row (from left to right, line by line). By default, this parameter is set to "FALSE", so the matrix is filled column by column,
  • dimnames is a list with row and column names.

Some of these parameters have default values (for example, byrow = FALSE), while others can be omitted (for example, dimnames).
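The effect of byrow and dimnames can be seen in a short sketch. The row and column names used here are invented for the illustration.

```r
values <- c(1, 0, 1, 1)

# Default: the matrix is filled column by column (byrow = FALSE)
A_col <- matrix(values, nrow = 2, ncol = 2)

# The same data filled row by row
A_row <- matrix(values, nrow = 2, ncol = 2, byrow = TRUE)

# dimnames takes a list: row names first, then column names (names are made up)
A_named <- matrix(values, nrow = 2, ncol = 2,
                  dimnames = list(c("r1", "r2"), c("c1", "c2")))

A_col
A_row
A_named["r1", "c2"]   # elements can now be addressed by name
```

With these values, A_row is the transpose of A_col, which makes the difference between the two filling orders easy to spot.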

One of the features of R is that any function (for example, our matrix()) can be called by passing the values directly, in the order in which its parameters are defined.
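A small sketch of the two calling styles, positional and named, is given here for comparison. The object names A1 and A2 are arbitrary.

```r
# Positional: values are matched to data, nrow, ncol in that order
A1 <- matrix(c(1, 0, 1, 1), 2, 2)

# Named: the order of the arguments no longer matters
A2 <- matrix(ncol = 2, nrow = 2, data = c(1, 0, 1, 1))

identical(A1, A2)   # TRUE
```

Named arguments are longer to type, but they make scripts easier to read and protect against mixing up nrow and ncol.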

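To look at the contents of an object, you can type its name in the console. Another option is to click on the name of the object in the "Environment" tab.

To get help on a function, type a question mark followed by its name in the console:

?matrix

where matrix is the name of the function we are interested in. In this case, RStudio will open the Help panel with the description of the function.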
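Besides the Help panel, the parameters of a function can be inspected from code. The sketch below is one possible way to do it; the variable name param_names is an assumption made for the example.

```r
# Open the help page only in an interactive session (e.g. the RStudio console)
if (interactive()) {
  help("matrix")   # same as ?matrix
}

# These calls work anywhere and show the parameters without opening the Help panel
args(matrix)
param_names <- names(formals(matrix))
param_names
```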

You can also find help for a function by typing its name in the search box (the magnifying glass icon) in the "Help" tab.

If you do not remember exactly how the name of a function is spelled or which parameters it uses, just start typing its name in the console and press the "Tab" key.

In addition to all this, you can write scripts in RStudio. You may need them if you need to write a program or call a sequence of functions. Scripts are created using the button with a plus sign in the upper left corner (in the drop-down menu, select "R Script"). In the window that opens after that, you can write any functions and comments. For example, if we want to plot a line graph over the x series, this can be done like this:

plot(x)

lines(x)

The first function builds a simple scatter plot, and the second adds lines on top of the points, connecting them in sequence. Selecting these two commands and pressing Ctrl+Enter will execute them, causing RStudio to open the "Plots" tab in the lower right corner and display the plot in it.
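As a hedged variation on the two commands above, the same picture can be drawn with a single plot() call and written to a file instead of the "Plots" tab. The seed and the file name x_series.pdf are assumptions made for this sketch.

```r
set.seed(1)                      # assumed seed for a repeatable example
x <- rnorm(100, 0, 1)

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "x_series.pdf")

pdf(out_file)                    # send the plot to a PDF file instead of the Plots tab
plot(x, type = "o")              # "o" draws points with lines over them
dev.off()                        # close the file so that it is written to disk

file.exists(out_file)
```

Saving plots from a script in this way is convenient when the same figure has to be regenerated after the data change.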

If we are going to need the typed commands in the future, the script can be saved (the floppy disk icon in the upper left corner).

If you need to return to a command that you typed at some point in the past, there is a "History" tab in the upper right part of the screen. In it, you can find any command you are interested in and double-click it to insert it into the console. In the console itself, you can step through previous commands using the "Up" and "Down" keys. The keyboard shortcut "Ctrl+Up" displays a list of all recent commands in the console.

In general, RStudio has many useful keyboard shortcuts that make working with the program much easier.

As I mentioned earlier, there are many packages for R. All of them are located on the CRAN server, and to install any of them you need to know its name. Packages are installed and updated from the "Packages" tab. By going to it and clicking the "Install" button, we will see something like the following menu:

In the window that opens, let's type forecast, the name of a package written by Rob J. Hyndman that contains many functions useful to us. Click the "Install" button, and the "forecast" package will be installed.

Alternatively, we can install any package, knowing its name, using the command in the console:

install.packages("smooth")

provided, of course, that it is in the CRAN repository. smooth is a package whose functions I develop and maintain.
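In scripts it is often convenient to install a package only when it is missing. The helper below, install_if_missing(), is a hypothetical function written for this sketch and is not part of R or of any package mentioned here.

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)   # needs Internet access and a CRAN mirror
  }
  invisible(missing_pkgs)            # return the names that had to be installed
}

# "stats" ships with R, so nothing is downloaded here
install_if_missing("stats")

# For the packages mentioned in the text one would call:
# install_if_missing(c("forecast", "smooth"))
```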

Some packages are only available as source on sites like github.com and require you to build them first. To build packages under Windows, you may need the Rtools program.

To use any of the installed packages, you need to load it. To do this, find it in the list and tick it, or use the command in the console:

library(forecast)

On Windows, one annoying problem can occur: some packages download and build without trouble, but then refuse to install. In this case R writes something like: "Warning: unable to move temporary installation ...". All you need to do is add the R folder to the exceptions in the antivirus (or turn the antivirus off while installing the packages).

After the package is loaded, all the functions included in it become available to us. For example, the function tsdisplay(), which can be used like this:

tsdisplay(x)

It will generate three graphs for us, which we will discuss in the Forecaster's Toolkit chapter.

Besides the forecast package, I quite often use the Mcomp package for various examples. It contains the data series from the M-Competition database, so I recommend that you install it too.

Very often, we will need not just data sets, but data of the “ts” class (time series). In order to make a time series from any variable, you need to run the following command:

x <- ts(x, start = c(1984, 1), frequency = 12)

Here the parameter start specifies the date from which our time series starts, and frequency sets the frequency of the data. The number 12 in our example indicates that we are dealing with monthly data. As a result of executing this command, we transform our vector "x" into a time series of monthly data starting from January 1984.
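The resulting object can be checked with a few base R functions. This is a minimal sketch; the seed and the choice of the year 1990 for the sub-series are assumptions made for the example.

```r
set.seed(123)                    # assumed seed for a repeatable example
x <- rnorm(100, 0, 1)
x <- ts(x, start = c(1984, 1), frequency = 12)

start(x)        # first period: year 1984, month 1
end(x)          # last period of the 100 monthly observations
frequency(x)    # number of observations per year

# Extract one calendar year as a shorter time series
x_1990 <- window(x, start = c(1990, 1), end = c(1990, 12))
length(x_1990)
```

With 100 monthly observations starting in January 1984, the series ends in April 1992, which is what end(x) reports.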

Statistical analysis is an integral part of scientific research. High-quality data processing increases the chances of publishing an article in a reputable journal and taking research to the international level. There are many programs that can provide high-quality analysis, but most of them are paid, and a license often costs several hundred dollars or more. Today we will talk about a statistical environment that you do not have to pay for, and whose reliability and popularity rival the best commercial statistical packages: let's get to know R!

What is R?

Before giving a clear definition, it should be noted that R is more than just a program: it is an environment, a language, and even a movement! We will look at R from different angles.

R is a computing environment developed by scientists for data processing, mathematical modeling and graphics. R can be used as a simple calculator; you can perform simple statistical analyses (such as ANOVA or regression analysis) as well as more complex, long-running calculations, test hypotheses, and build vector graphics and maps. This is not a complete list of what can be done in this environment. It is worth noting that it is distributed free of charge and can be installed on both Windows and UNIX-class operating systems (Linux and MacOS X). In other words, R is free and cross-platform.
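A few of these uses can be tried straight away on data sets that ship with R. The sketch below is only an illustration; the choice of the built-in cars and PlantGrowth data sets is an assumption, not something taken from the text.

```r
# R as a calculator
(2 + 3) * 4 / 5
sqrt(16)

# Simple linear regression on the built-in "cars" data set
fit <- lm(dist ~ speed, data = cars)
coef(fit)

# One-way ANOVA on the built-in "PlantGrowth" data set
aov_fit <- aov(weight ~ group, data = PlantGrowth)
summary(aov_fit)
```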

R is a programming language, so you can write your own programs (scripts) with it, and use and create specialized extensions (packages). A package is a set of functions, help files and examples collected together in one archive. Packages play an important role, as they serve as extensions to base R. Each package is usually dedicated to a specific topic: for example, the "ggplot2" package is used to produce attractive vector plots in a particular style, while the "qtl" package is well suited to genetic mapping. There are currently more than 7000 such packages in the R repository! All of them are checked for errors and are freely available.


R is a community/movement.
Since R is a free and open source product, its development, testing and debugging are carried out not by a single company with hired staff, but by the users themselves. Over two decades, a huge community has formed around the core of developers and enthusiasts. According to the latest data, more than 2 million people have volunteered to help develop and promote R in one way or another, from translating documentation and creating training courses to developing new applications for science and industry. There are a huge number of forums on the Internet where you can find answers to most questions related to R.

What does the R environment look like?

There are many "shells" for R, which can vary greatly in appearance and functionality. We will briefly look at just three of the most popular options: Rgui, RStudio, and R running on the command line in a Linux/UNIX terminal.


R language in the world of statistical programs

At the moment, there are dozens of high-quality statistical packages, among which SPSS, SAS and MatLab are the clear leaders. However, in 2013, despite strong competition, R became the most used software product for statistical analysis in scientific publications (http://r4stats.com/articles/popularity/). In addition, over the past decade R has become increasingly in demand in the business sector: giant companies such as Google, Facebook, Ford and the New York Times actively use it to collect, analyze and visualize data (http://www.revolutionanalytics.com/companies-using-r). To understand the reasons for the growing popularity of the R language, let's look at its similarities to and differences from other statistical products.

In general, most statistical tools can be divided into three types:

  1. GUI programs, based on the principle "click here, then here, and get the finished result";
  2. statistical programming languages, which require basic programming skills;
  3. "mixed" tools, which have both a graphical interface (GUI) and the ability to create scripted programs (for example: SAS, STATA, Rcmdr).

Features of programs with GUI

Programs with a graphical interface look familiar to the average user and are easy to learn. But they are not suitable for solving non-trivial tasks, since they have a limited set of statistical methods and it is impossible to write your own algorithms in them. The mixed type combines the convenience of a GUI shell with the power of programming languages. However, in a detailed comparison of statistical capabilities, SAS and STATA lose to the programming languages R and MatLab (comparison of statistical methods R, MatLab, STATA, SAS, SPSS). In addition, you will have to pay a considerable amount of money for a license for these programs, and the only free alternative is Rcmdr: a wrapper for R with a GUI (Rcommander).

Comparison of R with programming languages MatLab, Python and Julia

Among the programming languages used in statistical calculations, the leading positions are occupied by R and MatLab. They are similar to each other in both appearance and functionality, but they have different user bases, which determines their specifics. Historically, MatLab has been focused on applied sciences and engineering, so its strengths are mathematical modeling and calculations, and it is much faster than R! But since R was developed as a specialized language for statistical data processing, many experimental statistical methods first appeared and took hold in it. This fact, together with zero cost, made R an ideal platform for developing and using new packages in the basic sciences.

Other "competing" languages are Python and Julia. In my opinion, Python, being a general-purpose programming language, is more suitable for data processing and for gathering information using web technologies than for statistical analysis and visualization (the main differences between R and Python are well described). The statistical language Julia, in turn, is a rather young and ambitious project. The main feature of this language is the speed of calculations, which in some tests exceeds that of R by a factor of 100! So far, Julia is in the early stages of development and has few additional packages and followers, but in the long term Julia is perhaps the only potential competitor to R.

Conclusion

Thus, the R language is currently one of the leading statistical tools in the world. It is actively used in genetics, molecular biology and bioinformatics, environmental sciences (ecology, meteorology) and agricultural disciplines. R is also increasingly used in medical data processing, pushing commercial packages such as SAS and SPSS out of the market.

Advantages of the R environment:

  • free and cross-platform;
  • rich arsenal of statistical methods;
  • high-quality vector graphics;
  • over 7000 tested packages;
  • flexible in use:
    - allows you to create/edit scripts and packages,
    - interacts with other languages such as C, Java and Python,
    - can work with data formats for SAS, SPSS and STATA;
  • active community of users and developers;
  • regular updates, good documentation and technical support.

Disadvantages:

  • a small amount of information in Russian (although several training courses and interesting books have appeared over the past five years);
  • relative difficulty in use for a user unfamiliar with programming languages. Partially, this can be smoothed out by working in the Rcmdr GUI shell, which I wrote about above, but for non-standard solutions, you still need to use the command line.

List of useful sources

  1. Official site: http://www.r-project.org/
  2. Site for beginners: http://www.statmethods.net/
  3. One of the best reference books: The R Book, 2nd Edition by Michael J. Crawley, 2012
  4. List of available literature in Russian + good blog

In August 1993, two young New Zealand scientists from the University of Auckland announced their new development, which they called R. As conceived by its creators, Robert Gentleman and Ross Ihaka, it was supposed to be a new implementation of the S language, differing from S-PLUS in some details, for example, in the handling of global and local variables and in memory management. In fact, they created not a complete analogue of S-PLUS but a new "branch" on the "S tree". Many of the things that distinguish R from S-PLUS are due to the influence of the Scheme language (a functional programming language, one of the more popular dialects of Lisp).

In mid-2016, R overtook SAS and SPSS (which are paid) and entered the top three most widely used systems for processing statistical information. It should also be noted that R ranks among the top 10 programming languages in general-purpose rankings.

Opportunities

Many statistical methods are implemented in the R environment: linear and non-linear models, statistical hypothesis testing, time series analysis, classification, clustering, and graphical visualization. The R language allows you to define your own functions, and many R functions are written in R itself. For computationally intensive tasks, functions can be implemented in C, C++ and Fortran. Advanced users can access R objects directly from C code. R is more strictly object-oriented than most statistical computing languages. Its graphics functions produce plots of publication quality, with the ability to include mathematical symbols. R has its own LaTeX-like documentation format.

Although R is most commonly used for statistical computing, it can also be used as a matrix computing tool. Like MATLAB, R treats the result of any operation on numbers as a vector of length one. Generally speaking, there are no scalars in R.
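The absence of scalars and the matrix capabilities can be checked directly in the console. The object names below are arbitrary and serve only this illustration.

```r
# A single number is a numeric vector of length one
a <- 2 + 3
is.vector(a)
length(a)

# Arithmetic is applied element by element
v <- c(1, 2, 3)
v * 2

# Matrices support both element-wise and true matrix multiplication
M <- matrix(1:4, 2, 2)
M * M      # element-wise product
M %*% M    # matrix product
```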

Scripts

Simply opening an R session and typing commands into the program window, one after another, is just one of the possible ways to work. A much more productive method, and at the same time one of the most serious advantages of R, is the creation of scripts (programs), which are then loaded into R and interpreted by it. You should create scripts from the very beginning, even for tasks that seem trivial; this will save a lot of time in the future. Scripting for any reason, and even without a special reason, is one of the foundations of the work culture in R.
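Loading a script into R is done with source(). The sketch below writes a tiny script to a temporary file purely for demonstration; the script contents and the names values and values_mean are invented for the example.

```r
# Write a tiny script to a temporary file (in practice it would be saved from an editor)
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Hypothetical script: summarise a small sample",
  "values <- c(4, 8, 15, 16, 23, 42)",
  "values_mean <- mean(values)"
), script_file)

# Load and run the script in the current session
source(script_file)
values_mean
```

Everything the script defines is available in the session afterwards, so the same analysis can be repeated later by running a single source() call.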

Packages

Another important advantage of R is the availability of numerous extensions, or packages, for it. Several base packages are present immediately after R is installed, and without them the system simply does not work (for example, the base package, or the grDevices package, which controls graphics output). There are also "recommended" packages (cluster for specialized cluster analysis, nlme for nonlinear mixed-effects models, and others). In addition, you can install any of the almost eight thousand (as of mid-2016) packages available on CRAN. If you have Internet access, this can be done directly from R with the install.packages() command.
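Which packages of each kind are present on a particular machine can be listed from R itself. This is a small sketch; the variable name base_pkgs is an assumption, and the list of recommended packages depends on how R was installed.

```r
# Packages that ship with R itself
base_pkgs <- rownames(installed.packages(priority = "base"))
base_pkgs

# "Recommended" packages, if they are installed on this machine
rownames(installed.packages(priority = "recommended"))

# Packages currently attached to the session
search()
```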

Links

  • CRAN (Comprehensive R Archive Network) is a central storage and distribution system for R and its packages.