2.2 Exploring Data
You can view the dataset directly in several ways. For a small dataset, typing pups or print(pups), or clicking the small spreadsheet icon to the right of the variable in the Environment pane, will show you the entire dataset. For larger datasets this becomes unwieldy, so it is often more useful to look at only the first few rows:
head(pups)
## # A tibble: 6 × 5
##      id weight length   age clutch
##   <int>  <dbl>  <dbl> <int>  <int>
## 1     1   20.3     97    14      1
## 2     2   28.0    104    16      2
## 3     3   31.5    106    16      3
## 4     4   31.5    108    17      1
## 5     5   32.5    109    18      2
## 6     6   33.5    110    18      3
Similarly, the last few rows can be useful:
tail(pups)
## # A tibble: 6 × 5
##      id weight length   age clutch
##   <int>  <dbl>  <dbl> <int>  <int>
## 1    19   39.2    115    26      1
## 2    20   39.5    115    29      2
## 3    21   40.7    116    30      3
## 4    22   41.5    117    31      1
## 5    23   43.0    118    34      2
## 6    24   46.0    123    34      3
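Both head() and tail() accept an n argument that controls how many rows are returned (the default is 6). The following is a minimal sketch, assuming a small stand-in data frame built from the six rows printed by head(pups) above; the name pups_demo is hypothetical and is used only so that the example runs without the real dataset.

```r
# Stand-in for the real dataset: the six rows shown by head(pups) above.
# pups_demo is a hypothetical name, not part of the original data.
pups_demo <- data.frame(
  id     = 1:6,
  weight = c(20.3, 28.0, 31.5, 31.5, 32.5, 33.5),
  length = c(97, 104, 106, 108, 109, 110),
  age    = c(14L, 16L, 16L, 17L, 18L, 18L),
  clutch = c(1L, 2L, 3L, 1L, 2L, 3L)
)

first_three <- head(pups_demo, n = 3)  # first 3 rows instead of the default 6
last_two    <- tail(pups_demo, n = 2)  # last 2 rows

print(first_three)
print(last_two)
```

With the real data, the same calls would be head(pups, n = 3) and tail(pups, n = 2).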
You can access a particular column in the dataset by typing dataset_name$column_name, for instance:
pups$age
## [1] 14 16 16 17 18 18 19 19 20 21 21 21 21 22 22 23 24 25 26 29 30 31 34
## [24] 34
If we do not feel like typing this out every time, we can store just the column we want in its own variable:
pupage <- pups$age
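The $ operator is not the only way to pull out a column: base R's [[ ]] operator returns the same vector and also works when the column name is held in a variable. The sketch below illustrates this on a tiny stand-in data frame (pups_demo and col_name are hypothetical names, assumed here only for illustration).

```r
# Tiny stand-in data frame; pups_demo is a hypothetical name.
pups_demo <- data.frame(
  id  = 1:4,
  age = c(14L, 16L, 16L, 17L)
)

age_dollar  <- pups_demo$age       # column selected by name with $
age_bracket <- pups_demo[["age"]]  # the same column selected with [[ ]]

# [[ ]] also accepts a column name stored in a variable, which $ does not.
col_name <- "age"
age_by_variable <- pups_demo[[col_name]]

print(identical(age_dollar, age_bracket))
```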
2.2.1 Summary Functions
For the moment, let us focus on this variable. We can start exploring it with several useful R functions.
The length() function returns how many data points there are.
length(pupage)
## [1] 24
The sort() function returns the data sorted from smallest to largest.
sort(pupage)
## [1] 14 16 16 17 18 18 19 19 20 21 21 21 21 22 22 23 24 25 26 29 30 31 34
## [24] 34
The mean() function returns the arithmetic mean of the data.
mean(pupage)
## [1] 22.54167
The median() function returns the median of the data.
median(pupage)
## [1] 21
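To see what mean() and median() compute, both can be reproduced by hand. This is a sketch that re-enters the ages from the printed output of pups$age so that it runs on its own; the names manual_mean and manual_median are introduced here only for illustration.

```r
# Ages copied from the printed output of pups$age.
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

n <- length(pupage)

# Arithmetic mean: the sum of the values divided by how many there are.
manual_mean <- sum(pupage) / n

# Median: with an even number of points, the average of the two middle
# values of the sorted data.
sorted_age <- sort(pupage)
manual_median <- mean(sorted_age[c(n / 2, n / 2 + 1)])

print(manual_mean)
print(manual_median)
```

Here the mean is larger than the median because a few older pups (ages 29 to 34) pull the average upward.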
The range() function returns the smallest and largest values in the data.
range(pupage)
## [1] 14 34
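Note that range() returns the two endpoints rather than a single number. A short sketch, again re-entering the ages from the printed output, shows how the endpoints relate to min() and max() and how the width of the range can be obtained with diff().

```r
# Ages copied from the printed output of pups$age.
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

age_min <- min(pupage)              # first element of range(pupage)
age_max <- max(pupage)              # second element of range(pupage)
age_spread <- diff(range(pupage))   # largest minus smallest: 34 - 14

print(c(age_min, age_max))
print(age_spread)
```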
The var() function returns the sample variance of the data.
var(pupage)
## [1] 31.47645
The sd() function returns the sample standard deviation of the data.
sd(pupage)
## [1] 5.610388
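"Sample" here means that the sum of squared deviations is divided by n - 1 rather than n. The sketch below recomputes both quantities from that definition, using the ages re-entered from the printed output; manual_var and manual_sd are illustrative names.

```r
# Ages copied from the printed output of pups$age.
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

n <- length(pupage)
deviations <- pupage - mean(pupage)   # distance of each point from the mean

# Sample variance: sum of squared deviations divided by n - 1.
manual_var <- sum(deviations^2) / (n - 1)

# Sample standard deviation: square root of the sample variance.
manual_sd <- sqrt(manual_var)

print(manual_var)
print(manual_sd)
```

Because the standard deviation is in the same units as the data, it is usually the easier of the two to interpret.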
We note that these data take only discrete, integer values (1, 2, 3, 4, …), so one convenient way to summarize them is to make a table. The function for this is table().
table(pupage)
## pupage
## 14 16 17 18 19 20 21 22 23 24 25 26 29 30 31 34
##  1  2  1  2  2  1  4  2  1  1  1  1  1  1  1  2
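The object returned by table() can itself be explored further. As a sketch, the code below finds the most common age and converts the counts to proportions; it re-enters the ages from the printed output, and the names age_counts, most_common and age_props are introduced only for illustration.

```r
# Ages copied from the printed output of pups$age.
pupage <- c(14, 16, 16, 17, 18, 18, 19, 19, 20, 21, 21, 21,
            21, 22, 22, 23, 24, 25, 26, 29, 30, 31, 34, 34)

age_counts <- table(pupage)

# The names of the table are the distinct ages (as character strings).
# which.max() gives the position of the largest count (the first one if tied).
most_common <- names(age_counts)[which.max(age_counts)]

# prop.table() turns counts into proportions that sum to 1.
age_props <- prop.table(age_counts)

print(most_common)
print(round(age_props, 3))
```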