2.2 Exploring Data

You can view the dataset directly in several ways. For a small dataset, typing pups or print(pups), or clicking the small spreadsheet icon to the right of the variable in the Environment pane, shows the entire dataset. For larger datasets this becomes unwieldy, so it is often more useful to look at only the first few rows of the dataset:

head(pups)
## # A tibble: 6 × 5
##      id weight length   age clutch
##   <int>  <dbl>  <dbl> <int>  <int>
## 1     1   20.3     97    14      1
## 2     2   28.0    104    16      2
## 3     3   31.5    106    16      3
## 4     4   31.5    108    17      1
## 5     5   32.5    109    18      2
## 6     6   33.5    110    18      3

Similarly, the last few rows can be useful:

tail(pups)
## # A tibble: 6 × 5
##      id weight length   age clutch
##   <int>  <dbl>  <dbl> <int>  <int>
## 1    19   39.2    115    26      1
## 2    20   39.5    115    29      2
## 3    21   40.7    116    30      3
## 4    22   41.5    117    31      1
## 5    23   43.0    118    34      2
## 6    24   46.0    123    34      3
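Both head() and tail() accept an n argument that controls how many rows are returned. The following is a minimal sketch, assuming a plain data frame rebuilt from the six rows printed by head(pups) above; the name pups_head is hypothetical and is used only so the example runs on its own.

```r
# Rebuild the six rows shown by head(pups) as a plain data frame.
# `pups_head` is a hypothetical stand-in, not an object from the text.
pups_head <- data.frame(
  id     = 1:6,
  weight = c(20.3, 28.0, 31.5, 31.5, 32.5, 33.5),
  length = c(97, 104, 106, 108, 109, 110),
  age    = c(14L, 16L, 16L, 17L, 18L, 18L),
  clutch = c(1L, 2L, 3L, 1L, 2L, 3L)
)

first_three <- head(pups_head, n = 3)  # first three rows only
last_two    <- tail(pups_head, n = 2)  # last two rows only

nrow(first_three)  # 3
last_two$id        # 5 6
dim(pups_head)     # 6 rows, 5 columns
```

The same calls should work unchanged on the full pups dataset, since head() and tail() have methods for data frames and tibbles.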

You can access a particular column of the dataset by typing dataset_name$column_name, for instance:

pups$age
##  [1] 14 16 16 17 18 18 19 19 20 21 21 21 21 22 22 23 24 25 26 29 30 31 34
## [24] 34

If we do not want to type this out every time, we can store just the column we want in its own variable:

pupage <- pups$age
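As a small illustration of column access and assignment, the sketch below uses a hypothetical two-column data frame (pups_head, built from the id and age values printed by head(pups)). It shows that [["age"]] is an equivalent way to extract a column, and that a variable created from a column is an independent copy: changing it does not change the dataset.

```r
# Hypothetical stand-in built from the first six ids and ages shown earlier.
pups_head <- data.frame(
  id  = 1:6,
  age = c(14L, 16L, 16L, 17L, 18L, 18L)
)

age_dollar  <- pups_head$age       # extraction with $
age_bracket <- pups_head[["age"]]  # extraction with [[ ]] and a column name
identical(age_dollar, age_bracket) # TRUE: both return the same vector

# A variable made from a column is a copy of that column.
age_copy <- pups_head$age
age_copy[1] <- 99L                 # modify the copy only
pups_head$age[1]                   # still 14 in the data frame
```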

2.2.1 Summary Functions

For the moment, let us focus on this variable. We can start exploring it with several useful R functions.

The length() function returns how many data points there are.

length(pupage)
## [1] 24

The sort() function returns the data, sorted from smallest to largest.

sort(pupage)
##  [1] 14 16 16 17 18 18 19 19 20 21 21 21 21 22 22 23 24 25 26 29 30 31 34
## [24] 34
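Because the ages happen to be stored in increasing order already, the output of sort() looks the same as the input. The sketch below, which retypes the 24 ages from the printed output as a plain numeric vector (an assumption made so the example is self-contained), reverses the data first to make the effect visible and also shows the decreasing argument.

```r
# Ages copied from the printed output of pups$age (typed here as numerics).
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

reversed <- rev(pupage)             # largest value first
resorted <- sort(reversed)          # back to smallest-to-largest
identical(resorted, pupage)         # TRUE

largest_first <- sort(pupage, decreasing = TRUE)
largest_first[1:3]                  # 34 34 31

# sort() returns a new vector; the input is left unchanged.
reversed[1]                         # still 34
```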

The mean() function returns the arithmetic mean of the data.

mean(pupage)
## [1] 22.54167

The median() function returns the median of the data.

median(pupage)
## [1] 21
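To see what mean() and median() compute, both can be reproduced by hand. This is a sketch only, again assuming the 24 ages retyped from the printed output; the names manual_mean and manual_median are illustrative.

```r
# Ages copied from the printed output of pups$age (typed here as numerics).
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

n <- length(pupage)

# Arithmetic mean: the sum of the values divided by how many there are.
manual_mean <- sum(pupage) / n      # 541 / 24

# Median for an even n: the average of the two middle sorted values.
sorted <- sort(pupage)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

all.equal(manual_mean, mean(pupage))  # TRUE
manual_median == median(pupage)       # TRUE
```

The mean (about 22.5) is larger than the median (21) because the few oldest pups pull the mean upward while leaving the middle of the sorted data unchanged.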

The range() function returns the smallest and largest values in the data.

range(pupage)
## [1] 14 34
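Note that range() returns the two endpoints rather than a single number. A short sketch, assuming the same retyped ages, shows how it relates to min() and max() and how to obtain the width of the range with diff().

```r
# Ages copied from the printed output of pups$age (typed here as numerics).
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

age_range <- range(pupage)       # vector of length two: smallest, largest

age_range[1] == min(pupage)      # TRUE
age_range[2] == max(pupage)      # TRUE

age_spread <- diff(age_range)    # largest minus smallest
age_spread                       # 20
```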

The var() function returns the sample variance of the data.

var(pupage)
## [1] 31.47645

The sd() function returns the sample standard deviation of the data.

sd(pupage)
## [1] 5.610388
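The word "sample" matters here: var() divides the sum of squared deviations by n - 1 rather than n, and sd() is the square root of that variance. The sketch below checks both by hand, assuming the same retyped ages; manual_var and manual_sd are illustrative names.

```r
# Ages copied from the printed output of pups$age (typed here as numerics).
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

n <- length(pupage)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((pupage - mean(pupage))^2) / (n - 1)

# Sample standard deviation: square root of the sample variance.
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(pupage))  # TRUE
all.equal(manual_sd, sd(pupage))    # TRUE
```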

Note that these data take only discrete, integer values (1, 2, 3, 4, …), so one convenient way to summarize them is to make a table of counts. The function for this is table().

table(pupage)
## pupage
## 14 16 17 18 19 20 21 22 23 24 25 26 29 30 31 34 
##  1  2  1  2  2  1  4  2  1  1  1  1  1  1  1  2
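The object returned by table() can be queried like a named vector. The following sketch, assuming the same retyped ages, looks up one count, finds the most common age, and converts counts to proportions with prop.table(); the variable names are illustrative.

```r
# Ages copied from the printed output of pups$age (typed here as numerics).
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

counts <- table(pupage)

# Counts are indexed by the value's name, which is a character string.
count_21 <- as.vector(counts["21"])           # 4 pups are aged 21

# The counts add up to the number of data points.
sum(counts) == length(pupage)                 # TRUE

# Most frequent age: the name of the largest count.
most_common <- names(counts)[which.max(counts)]
most_common                                   # "21"

# Proportions instead of counts.
props <- prop.table(counts)
as.vector(props["21"])                        # 4 / 24
```

Note that which.max() returns only the first maximum, so it would not report ties if two ages were equally common.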