-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy path01-Step-into-R-program-Analyzing-iris-flower-data-set.Rmd
648 lines (503 loc) · 28.7 KB
/
01-Step-into-R-program-Analyzing-iris-flower-data-set.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
# Step into R programming--the iris flower dataset
Learning objectives:
- Use R from [Rstudio Cloud](https://rstudio.cloud/) or local install.
- RStudio interface: Console, Script, Environment, Plots.
- Data frames and vectors.
- The basics workflow: import data, check data types, define vectors from columns, plot and analyze.
## Getting started
1. You can use R and Rstudio through a web browser via [Rstudio
Cloud](https://rstudio.cloud/). Create a free account, or just log-in with your Google account. Click the **New Project** button to get started.
2. Alternatively, you can install R on your laptop from [www.R-project.org](https://cloud.r-project.org/). Then install RStudio Desktop from
[www.RStudio.com](https://www.rstudio.com/products/rstudio/download/#download). If you have an
older version of R or Rstudio, we recommend uninstall them, delete the folder that holds the R packges, and reinstall the latest versions.
```{r Rinterface, echo=FALSE, out.width='80%', fig.cap='Interface of RStudio.', fig.align='center'}
knitr::include_graphics("images/img0100_Rstudio.png")
```
RStudio uses R
in the background but provides a more user-friendly interface. R
commands can be typed directly into the "Console" window. See Figure \@ref(fig:Rinterface).
As a quick sample session, try all of these commands below in the Console window.
R is case-sensitive; "Species" and "species" are two different variables.
If you understand all of these, you probably do not need this book;
consider more advanced books like [R for Data Science](https://r4ds.had.co.nz/).
If it takes you a few months to type these 248 characters, try [www.RapidTyping.com](www.RapidTyping.com) first.
```{r eval = FALSE}
head(iris)
str(iris)
summary(iris)
df <- iris[, 1:4]
boxplot(df)
pairs(df)
stars(df)
PL <- df$Petal.Length
barplot(PL)
hist(PL)
SP <- iris$Species
pie(table(SP))
boxplot(PL ~ SP)
summary(aov(PL ~ SP))
PW <- df$Petal.Width
plot(PL, PW, col = SP)
abline(lm(PW ~ PL))
```
![Animated GIF for screen shot.](https://github.com/gexijin/learnR/raw/master/images/demo.gif)
## The proper way of using RStudio
1. Create a project. An RStudio project is a nice way to organize and compartmentalize different tasks. The idea is, for a new research project,
we put all related files (source code, data, results) into a designated folder.
On RStudio.cloud, you have to do this to initiate an R session.
Click on the **Untitled Project** on the top of the window to rename it as something meaningful, such as "Workshop". If you are running R locally, click **File** from the main menu, and select **New Project**.
Then follow directions to choose or create a folder.
2. Create a script file by going to **File > New File > R Script**. Or click on the green plus sign.
Instead of typing our commands directly into the console window, we will type in the **R Script** window, which could be saved as a record.
To execute one line of code, click anywhere in the line and click the **Run** button. You can also highlight and run a block of code.
Type these two lines in the Script window, save the script as Workshop_script.R. Then click on each line and click **Run** from the icon.
After that, select both lines and click **Run**.
```{r eval = FALSE}
head(iris)
plot(iris)
```
3. Copy file into project folder. Download this dataset from
[GitHub](https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv) and save it as iris.csv.
## Data frames contain rows and columns: the iris flower dataset
In 1936, Edgar Anderson collected data to quantify the geographic variations of
[*iris* flowers](https://en.wikipedia.org/wiki/iris_flower_data_set).
The data set consists of 50 samples from each of the three
sub-species ( *iris setosa*, *iris virginica,* and *iris versicolor*). Four
features were measured in centimeters (cm): the lengths and the widths of both sepals
and petals. This is one of many built-in datasets in R. Download
this dataset from
[GitHub](https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv),
and open it in Excel. Have a quick look at the data, think about what
distinguishes the three species. If we have a flower with sepals of 6.5cm long
and 3.0cm wide, petals of 6.2cm long, and 2.2cm wide, which species does it most
likely belong to? **Think** (!) for a few minutes while eyeballing the data.
Write down your guessed species.
Getting familiar with this dataset is vital for this and next chapters.
```{r echo=FALSE, out.width='50%', fig.cap='iris flower. Photo from Wikipedia.', fig.align='center' }
knitr::include_graphics("images/img0101_iris.png")
```
If you are using [Rstudio Cloud](https://rstudio.cloud/) through a web browser,
you are using a computer on the cloud, which cannot directly access your local files.
Therefore you need to first upload
the data file to the cloud using the **Upload** button in
the **Files** tab in the lower right panel.
The data file can be imported into R by selecting **Import Dataset**
in the **File** menu, and then choose **"From Text(base)..."**. As the default
settings work well in this case, we just need to click **Import** in the pop-up
window. That's it! There is no need to spend half of a semester to learn how to
import data, like some people used to complain when learning SAS. If you have trouble
importing this data file, you can skip this step as this data set is included in R.
```{r echo=FALSE, out.width='80%', fig.cap='Example of a data frame.', fig.align='center'}
knitr::include_graphics("images/img0102_dataframe.png")
```
To answer these questions, let us visualize and analyze the data with R. **Type these commands without the comments marked by “#”.**
```{r echo=TRUE, eval=FALSE}
View(iris) # show as a spreadsheet
```
```{r echo=TRUE}
head(iris) # first few rows
class(iris) # show the data type
str(iris) # show the structure
```
The imported data is stored in a **data frame** called *iris*, which contains
information of 5 variables for 150 observations (flowers). While the first 4 variables/columns,
*Sepal.Length*, *Sepal.Width*, *Petal.Length*, and *Petal.Width*, contain **numeric** values.
The 5^th^ variable/column, *Species*, contains
**characters** indicating which sub-species a sample belongs to. This is an
important distinction, as they are treated differently in R.
Sometimes we need to overwrite data types guessed by R, which can be done during the data
importing process. If you are using the built-in iris data, the variable/column *Species* is a **factor** with three levels.
```{r echo=TRUE}
summary(iris) # summary statistics
```
The *summary* function provides us with information on the distribution of each variables.
There are 50 observations for each of the three sub-species.
Individual values in a **data frame** can be accessed using row and column indices.
```{r echo=TRUE, results='hide'}
iris[3, 4] # returns the value lies at the intersection of row 3 and column 4
iris[3, ] # returns values of row 3
iris[, 4] # returns values of column 4
iris[3, 1:4] # returns values at the intersections of row 3 and columns from 1 to 4
```
**To do the exercises, you need to create a blank Word document.
Copy-and-paste the R code and corresponding outputs to the Word document,
and then save it as a PDF file. Submit both the Word and PDF files to the Dropbox.**
>
```{exercise }
>
Type **data()** in the R Console window. The pop-up window - R data sets - contain all built-in data sets in package **datasets**. Choose a data set whose structure is data frame, then answer the following questions:
>
a. Display the first few rows of the data set. **NOT** all values in your data set.
b. Show the dimension of the data set.
c. Extract a subset which contains values in the 2nd through the 5th rows and the 1st through the 4th columns. If your data set contains fewer rows or columns, please choose another data set.
```
You can replace the **iris** data set with any of the built-in data sets as long as its data structure is **data frame**.
Make sure you check the type of the data set by using **class()** or **str()** function.
Note: You will need to change the variable names and ranges of rows and columns.
```{r}
colnames(iris) # Column names
```
Remember these column names, as we are going to use them in our analysis. Note that sepal
length information is contained in the column named **Sepal.Length**.
We can retrieve an entire column by using the data frame name followed by the column name.
R is case-sensitive. “iris$petal.length” will not be recognized.
```{r echo=TRUE, results='hide'}
iris$Petal.Length # 150 petal lengths
```
If this is too much typing, we can store the data in a new **vector** called PL.
```{r}
PL <- iris$Petal.Length # define a new variable which contains only the Petal.Length values
```
Some might complain that nothing happens after you type in this command.
Although nothing is printed out, you have created a new **vector** called PL in
the memory. You can find it in the top right panel in the **Environment** tab.
The process of programming is
mostly creating and modifying data objects in the memory.
```{r echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("images/environment.png")
```
Like money saved in a bank, these data can be retrieved and used:
```{r}
PL #print out 150 numbers stored in PL
mean(PL) #mean, duh
```
>
```{exercise}
>
a. Compute the mean of the sepal length in the data set **iris**.
b. Compute the mean of speed in the built-in data set **cars**.
```
The best way to find and learn R functions is Google search. The R community is uniquely supportive.
There are many example codes and answers to even the dumbest questions on sites like [StackOverflow.com](https://stackoverflow.com/).
>
```{exercise}
>
a. Google “R square root function” to find the R code, then compute the value of $\sqrt{56.7}$.
b. Use R to find the maximal value of the variable *mpg* in the data set **mtcars**.
```
## Analyzing one set of numbers
A **vector** holds a series of numbers or characters. Using various **functions**, R made it
very easy to do the same calculations (add, subtract, multiply,...) on all of
them at once or collectively.
```{r}
summary(PL)
```
This produces a lot of useful information about the distribution of petal length:
- The minimum is 1.000, and the maximum is 6.900.
- Average petal length is 3.758.
- The mid-point, or median, is 4.350, as about half of the numbers are smaller
than 4.350. Why the median is different from the mean? What happens if there
is a typo and one number is entered 340cm instead of 3.40cm?
- The 3rd quartile, or 75^th^ percentile, is 5.100, as 75% of the flowers have
petals shorter than 5.100. If a student's GPA ranks 5^th^ in a class of 25,
he/she is at 80^th^ percentile. The minimum is the 0^th^ percentile and the
maximum is the 100^th^ percentile.
- The 1st quartile, or 25^th^ percentile, is 1.600. Only 25% of the flowers have
petals shorter than 1.600. These summary statistics are graphically
represented as a boxplot in the Figure \@ref(fig:1-3)A. Boxplots are more
useful when multiple sets of numbers are compared.
(ref:1-3) Boxplot of petal length (A) and of all 4 columns (B).
```{r 1-3, echo=c(1, 5), fig.show='hold', out.width='50%', fig.cap='(ref:1-3)', fig.align='center'}
boxplot(PL) # boxplot of Petal Length.
text(x = 1, y = c(5.2, 4.5, 1.75), labels = c("75th pencentile", "Median", "25th pencentile"))
text(1.45, 6.6, labels = "boxplot(PL)", col = "red")
legend("topleft", inset = .02, legend = "A", box.lty = 0)
boxplot(iris[, 1:4]) # boxplots of four columns of iris.
text(4.1, 7.5, labels = "boxplot(iris[, 1:4])", col = "red")
legend("topleft", inset = .001, legend = "B", box.lty = 0)
```
In RStudio, you can copy a plot to the clipboard using the **Export** button on
top of the plot area. Or you can click **zoom**, right click on the pop-up plot
and select **"Copy Image"**. Then you can paste the plot into Word.
>
```{exercise}
>
a. Check the data structure of the built-in data set **mtcars**.
b. Get the boxplot of Mile Per Gallon (*mpg*) in the data set **mtcars**.
```
To quantify the variance, we can compute the **standard deviation** σ:
\begin{align}
σ=\sqrt{\frac{1}{N-1}[(x_{1}-\bar x)^2+(x_{2}-\bar x)^2+...+(x_{N}-\bar x)^2]}
\end{align}
where N is the sample size and
\begin{align}
\bar x=\frac{1}{N}(x_{1}+x_{2}+\cdots+x_{N})
\end{align}
If all the measurements are close to the mean µ, then standard deviation should be small.
```{r}
sd(PL) # standard deviation of Petal Length
SW <- iris$Sepal.Width
sd(SW) # standard deviation of Sepal Width
```
As we can see, these flowers have similar sepal width, but they differ widely in
petal length. This is consistent with the boxplot above. Perhaps changes in
petal length lead to better survival in different habitats.
With R, it is easy to generate graphs.
```{r out.width='80%', fig.cap='Barplot of petal length.', fig.align='center'}
barplot(PL) # bar plot of Petal Length
```
This figure suggests that the first 50 flowers (iris setosa) have much shorter
petals than the other two species. The last 50 flowers (iris virginica) have
slightly longer petals than those of iris versicolor.
(ref:1-4) Scatter plot, histogram, lag plot and normal Q-Q plot.
```{r 1-4, echo=c(1, 4, 6, 9, 10), fig.show='hold', out.width='50%', fig.cap='(ref:1-4)', fig.align='center'}
plot(PL) # Run scatter plot
legend("topleft", inset = .02, cex = 2, legend = "plot(PL)", box.lty = 0)
title("Scatter plot of Petal Length")
hist(PL) # histogram
legend(1.6, 36, cex = 2, legend = "hist(PL)", box.lty = 0)
lag.plot(PL) # lag plot
legend("topleft", inset = .02, cex = 2, legend = "lag.plot(PL)", box.lty = 0)
title("Lag plot of Pental Length")
qqnorm(PL) # Q-Q plot for normal distribution
qqline(PL) # add the regression line
legend("topleft", inset = .02, cex = 2, legend = "qqnorm(PL)", box.lty = 0)
```
There are three different groups in the scatter plot. The petal length of one group is much smaller than the other two groups.
Histogram shows the distribution of data by plotting the frequency of data in bins.
The histogram top right of Figure \@ref(fig:1-4) shows that there are more flowers
with Petal Length between 1 cm and 1.5 cm. It also shows that the data does not show a bell-curved distribution.
The lag plot is a scatter plot against the same set of number with an offset of
1. Any structure in a lag plot indicates non-randomness in the order in which
the data is presented. We can clearly see three clusters, indicating that
values are centered around three levels sequentially.
Q-Q (quantile-quantile) plots can help check if data follows a Normal distribution, which is widely observed in many situations.
It is the pre-requisite for many statistical methods. See Figure \@ref(fig:1-5) for an example of normal distribution.
Quantiles of the data is compared against those in a normal distribution. If the data points on a Q-Q plot form a straight line,
the data has a normal distribution.
>
```{exercise}
>
a. Run x = rnorm(500) to generate 500 random numbers following the Standard Normal distribution
b. Generate scatter plot, histogram, lag plot, and Q-Q plot. Your plots should like those in Figure \@ref(fig:1-5).
```
(ref:1-5) Plots for randomly generated numbers following a normal distribution.
```{r 1-5, echo=FALSE, fig.show='hold', out.width='50%', out.length='50%', fig.cap='(ref:1-5)', fig.align='center'}
x <- rnorm(500)
plot(x) # Run sequence plot
title("Scatter plot of x")
hist(x) # histogram
lag.plot(x)
title("Lag plot of x")
qqnorm(x) # Q-Q plot for normal distribution
qqline(x)
```
>
```{exercise}
>
Generate scatter plot, histogram, lag plot, and Q-Q plot for the Septal Length in the iris dataset.
```
We can do a one-sample t-test for the mean of a normally distributed data.
Suppose the pental length of the species *setosa*, the first 50 observations of *PL*,
follows a normal distribution. We want to test if its mean is different from 1.5 cm at the significance level $\alpha = 0.05$.
Below is the code for the testing.
```{r}
t.test(PL[1:50], mu = 1.5) # two sided t-test for the mean of petal length of setosa
```
In this case, our null hypothesis is that the mean of petal length of *setosa* is 1.5 cm.
Since p-value 0.1282 is greater than the significance level 0.05,
there is no strong evidence to reject the null hypothesis.
This function also tells us the 95% confidence interval for the mean.
Based on our sample of the 50 observations, we have 95% confidence to conclude
that the mean of petal length of setosa (if we were to measure all setosa in the world)
is between 1.412645 cm and 1.511355 cm.
>
```{exercise}
>
Compute 95% confidence interval of sepal length of setosa.
```
We can perform a hypothesis test on whether a set of numbers are derived from normal distribution.
When interpreting results from hypothesis testing, it is important to state
the null hypothesis is. Here the null hypothesis is that the data is from a normal distribution.
```{r}
# normality test
shapiro.test(PL)
```
If petal length is normally distributed, there is only 7.412×10^-10^ chance of getting the observed test statistic of 0.87627. In other words, it is highly unlikely that petal length follows a normal distribution. We reject the normal distribution hypothesis, which could be corroborated by our plots above.
```{r}
# normality test of the first 50 observations of PL
shapiro.test(PL[1:50])
```
Given the significance level $\alpha=0.05$, the p-value of normality test for the pental length of setosa is 0.05481 which is greater than 0.05.
Then we fail to reject the null hypothesis and conclude that we don't have evidence to reject the hypthoses
that the pental length of setosa follows a normal distribution.
In statistics, we rely on both charts and statistical models to draw a conclusion.
>
```{exercise}
>
a. Run Shapiro’s test on sepal width. Does it follow a normal distribution given the significant level $\alpha = 0.05$?
b. Generate histogram and Q-Q plot for sepal width. Do the plots show a Normal approximation?
```
## Analyzing a categorical variable
In the iris dataset, the last column contains the species information. We call this a **categorical variable**.
Bar and Pie charts are very effective in showing proportions. We can see that the three species are each represented with 50 observations.
```{r echo=c(1, 2, 3, 5), fig.show='hold', out.width='50%', fig.cap='Frequencies of categorical values visualized by Pie chart (A) and bar chart (B).', fig.align='center'}
counts <- table(iris$Species) # tabulate the frequencies
counts
pie(counts) # pie chart
legend("topleft", inset = .02, legend = "A", box.lty = 0)
barplot(counts) # bar plot
text(3.71, 48, labels = "B")
```
## The relationship between two numerical variables
Scatter plot is very effective in visualizing the correlation between two columns of numbers.
(ref:2-1) Scatter plot of petal width and petal length.
```{r 2-1, message=FALSE, out.width='80%', fig.cap='(ref:2-1)', fig.align='center'}
PW <- iris$Petal.Width # just lazy
plot(PW, PL) # scatterplot of pental width vs pental length
```
Figure \@ref(fig:2-1) shows that there is a positive correlation between petal
length and petal width. In other words, flowers with longer petals are often
wider. So the petals are getting bigger substantially when both dimensions
increase.
Another unusual feature is that there seem to be two clusters of points. Do the
points in the small cluster represent one particular species of iris? We need to
further investigate this. The following will produce a plot with the species
information color-coded. The resultant Figure \@ref(fig:2-2) clearly shows that
indeed one particular species, *setosa* constitutes the smaller cluster in
the low left. The other two species also show a difference in this plot, even
though they are not easily separated. This is a very important insight into this
dataset.
(ref:2-2) Scatter plot shows the correlation of petal width and petal length.
```{r 2-2, echo = c(1,2,3),out.width='80%', fig.cap='(ref:2-2)', fig.align='center'}
SP <- as.factor(iris$Species)
plot(PW, PL, col = SP) # change colors based on Species
legend("topleft", levels(SP), fill = 1:3) # add legends on topleft.
abline(lm(PL ~ PW))
```
To change the color for a specific data point, we defined a new variable SP
by converting the last column into a **factor**.
The function **str(SP)** shows that the SP is a
factor with 3 levels which are internally coded as 1, 2, and 3.
This enables us to use these as color codes. This kind of base plotting has been simplified by
modern systems such as ggplot2, which we will discuss later. So you do not have to remember any of this.
```{r}
str(SP) # show the structure of Species.
```
Perhaps due to adaption to the environment, change in petal length leads to
better survival. With the smallest petals, *setosa* is found in Arctic
regions. *versicolor* is often found in the Eastern United States and
Eastern Canada. *virginica* "is common along the coastal plain from Florida
to Georgia in the Southeastern United States--*Wikipedia*." It appears the iris
flowers in warmer places are much larger than those in colder ones. With R, it
is very easy to generate lots of graphics. But we still have to do the thinking.
It requires us to interpret the charts in the context.
>
```{exercise}
>
The **mtcars** data description can be found here: [https://stat.ethz.ch/R-manual/R-patched/RHOME/library/datasets/html/mtcars.html](https://stat.ethz.ch/R-manual/R-patched/RHOME/library/datasets/html/mtcars.html).
>
Based on mtcars data set, plot a scatter plot which shows the correlation between Miles/(US) gallon and Displacement (cu.in.). In this data set, the type of cyl is numeric. You will need to use function newcyl -> as.factor(cyl) to transfer the type to factor. Then replace all mtcars$cyl with newcyl.
>
a. Color the scatter plot by Number of cylinders;
b. Add legend to the top right.
```
We can quantitatively characterize the strength of the correlation using several types of correlation coefficients,
such as Pearson’s correlation coefficient. It ranges from -1 to 1. Zero means no linear correlation.
```{r}
cor(PW, PL)
```
This means the petal width and petal length are strongly and positively correlated. we can add this information to the plot as text:
```{r eval=FALSE}
text(1.5, 1.5, paste("R=0.96"))
```
```{r}
cor.test(PW, PL)
```
Through hypothesis testing, we reject the null hypothesis that the true correlation is zero (no correlation).
That means the correlation is statistically significant.
Note that Pearson’s correlation coefficient is not robust against outliers
and other methods such as Spearman’s exists. See help info:
```{r eval=FALSE, echo=TRUE, message=FALSE}
?cor # show help info on the function of cor()
```
We can also determine the equation that links petal length and petal width. This is so called regression analysis. We assume
$Petal.Length = \alpha × Petal.Width + c + \epsilon$,
where $\alpha$ is the slope parameter, $c$ is a constant, and $\epsilon$ is some random error. This linear model can be determined by a method that minimizes the least squared-error:
```{r}
model <- lm(PL ~ PW) # Linear Model (lm)
summary(model) # shows the details
```
Note that in the regression model, the tilde(~) links a response variable
on the left with one or more independent variables on the right side.
It is a shortcut for an operator. We are defining a statistical model where
the PL is modeled as a function of PW. In addition to regression models, we also use formula in plots.
As we can see, we estimated that $\alpha=2.22944$ and $c=1.08356$. Both parameters are significantly
different from zero as the p-values are <2×10^-16^ in both cases. In other words, we can reliably predict
$Petal.Length = 2.22944 × Petal.Width + 1.08356$.
This model can be put on the scatter plot as a line.
```{r fig.keep='none', eval = FALSE}
abline(model) # add regression line to plot.
```
And we can get the diagnostic plots for regression analysis:
```{r fig.keep='none', eval = FALSE}
plot(model) # diagnostic plots for regression
```
Sometimes, we use this type of regression analysis to investigate whether variables are significantly associated.
>
```{exercise}
>
Investigate the relationship between sepal length and sepal width using scatter plots,
correlation coefficients, test of correlation, and linear regression.
Again interpret all your results in PLAIN and proper English.
```
## Testing the differences between two groups
Are boys taller than girls of the same age? Such situations are common.
We have measurements of a random **sample** from a **population**. From the sample we want
to know if the observed differences among the two groups
reflect real differences in the population or due to random sampling error.
(ref:2-4) Boxplot of petal length, grouped by species.
```{r message=FALSE, out.width='80%', fig.cap='(ref:2-4)', fig.align='center'}
Species <- iris$Species
# Comparing distributions of petal length of the three species by boxplot
boxplot(PL ~ Species)
```
Here, R automatically split the dataset into three groups according to Species and created a boxplot for each.
From the boxplot, it is obvious that *Setosa* has much shorter petals. Are there differences between *versicolor* and *virginica* significant?
We only had a small sample of 50 flowers for each species. But we want to draw some conclusion about the two species in general. We could measure all the iris flowers across the world; Or we could use statistics to make inference. Let's extract corresponding data.
```{r results='hide'}
PL1 <- PL[51:100] # extract Petal Length of versicolor
PL1 # a vector with 50 numbers
PL2 <- PL[101:150] # extract Petal length of virginica
PL2 # y contain 50 measurements
```
```{r fig.keep='none'}
boxplot(PL1, PL2) # a boxplot of the two groups of values
t.test(PL1, PL2) # two sample t-test
```
In this two sample t-test, our null hypothesis is that the mean of petal length for *versicolor* is the same as *virginica*. A small p-value of 2.2x10^-16^ indicates under this hypothesis, it is extremely unlikely to observe the difference of 1.292cm through random sampling. Hence we reject that hypothesis and conclude that the means for the two species are significantly different. If we measure all *versicolor* and *virginica* flowers in the world and compute their average petal lengths, it is very likely that the two averages are different. On the other hand, if the p-value is larger than a given significance level, let's say $\alpha = 0.05$, we fail to reject the null hypothesis and conclude that we don't have strong evidence to reject the means are same.
We actually do not need to separate two set of numbers into two data objects in order to do a t-test. We can do it right within the data frame. R can separate data points by columns.
```{r results='hide', fig.keep='none'}
# Extract observations of versicolor which are lined at rows from 51 to 150
df <- iris[51:150, ]
t.test(Petal.Length ~ Species, data = df) # t-test
boxplot(Petal.Length ~ Species, data = droplevels(df)) # removes empty levels in Species
```
>
```{exercise}
>
Use boxplot and t-test to investigate whether the means of
sepal width between versicolor and virginica are different. Interpret your results.
```
## Testing the difference among multiple groups (ANOVA)
As indicated by Figure \@ref(fig:2-5), sepal width has small variation across species. We want to know if the means of sepal width for the 3 species are same or not. This can be done through Analysis of Variance (ANOVA).
(ref:2-5) Boxplot of sepal width across 3 species.
```{r 2-5, echo=c(1:3), fig.cap='(ref:2-5)', fig.align='center'}
SW <- iris$Sepal.Width
boxplot(SW ~ SP)
summary(aov(SW ~ SP))
```
Since p-value is much smaller than 0.05, we reject the null hypothesis. We can conclude that not all the 3 species have the same mean of sepal width. In other words, there are at least two of the species have different means of sepal width. This is the only thing we can conclude from this, since this is how the hypothesis was set up. The boxplot in Figure \@ref(fig:2-5) seems to indicate that *setosa* has wider sepals.
>
```{exercise}
>
Use boxplot and ANOVA to investigate whether the means of spetal width are the same for the 3 species.
```
I hope this give you some idea about what it is like to use R to visualize and analyze data. It is interactive and graphical.
If I made you feel it is complicated, I failed as a instructor.
Most people can learn R themselves by working on a small dataseet of their interest.
Just Google everything and keep trying.
![From Tenor.com](https://github.com/gexijin/learnR/raw/master/images/tenor_typing_cat.gif)
```{r echo=FALSE,eval=FALSE}
# https://www.google.com/url?sa=i&url=https%3A%2F%2Ftenor.com%2Fsearch%2Ftyping-cat-gifs&psig=AOvVaw1CPM5qGGZWAikE70wzGepI&ust=1623775293141000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCNi1wMHIl_ECFQAAAAAdAAAAABAJ
```