# Import data
In this chapter, we will download some datasets and import them. You will need the following packages to follow along.
```{r import setup, echo = FALSE}
options(max.print=15)
```
```{r import setup 2, message = FALSE, warning = FALSE}
library(dplyr) # For manipulating data
library(readr) # For reading flat files
library(DBI) # For database connections
# For API connections
library(httr)
library(jsonlite)
```
## Delimited files
The most common method of obtaining data is via flat files, usually in the form of comma-separated values (CSV) files. Delimited files are the most convenient option when a direct data connection is otherwise unobtainable, but they are not well suited for long-term sustainability and automation.
The base package, installed with every instance of R, includes `read.table()`, a convenient built-in function for importing CSV files. Another package, [readr](https://readr.tidyverse.org), includes a similar function called `read_delim()`, which is faster and allows you to easily alter the column specification, which directs the data type each column is imported as (for example, overriding an employee's identification number to be a character rather than a numeric).
Before importing the file, let's download it from the repository that contains the Vendor Master. This dataset contains the vendor system ID number, the date it was added, and other traits.
```{r import download}
dir.create("data", showWarnings = FALSE) # Creates a directory in your project
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/vendor_master.csv",
destfile = "data/vendor_master.csv", mode = "wb") # Downloads this csv file into the data folder
```
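For comparison with the base approach mentioned earlier, `read.csv()` (a wrapper around `read.table()`) can import the same file; a quick sketch only, as we will use readr's `read_delim()` for the rest of the chapter:
```{r import base}
base_vendors <- read.csv("data/vendor_master.csv", stringsAsFactors = FALSE) # Base R import of the same file
head(base_vendors)
```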
When importing delimited files, there are always a few aspects to consider. The *delimiter* is the character that separates each field - the most common delimiters are commas (,), pipes (|) and tabs. Using an uncommon character was a typical workaround when exporting data from legacy systems, as commas within text fields were incorrectly parsed as extra columns. If possible, *qualifiers* should be used to enclose text within a field, typically single or double quotes. This will indicate to the system that everything within those quotes belongs to a single field.
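As a small illustration of qualifiers, here is a sketch using a literal comma-delimited string (this assumes readr 2.0 or later, which accepts literal data wrapped in `I()`). The name field of the first record contains a comma, but the double-quote qualifiers keep it together as one field:
```{r import qualifier}
csv_extract <- I('id,name,amount\n1,"Acme, Inc.",100\n2,Globex,250') # A made-up extract for illustration
read_delim(csv_extract, delim = ",", quote = "\"")
```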
In this case, if you view the file (on the Files panel on the right, go to the data folder, click on vendor_master.csv, and select View File), you will see the data is separated by commas.
```{r import raw}
raw_vendors <- read_delim('data/vendor_master.csv', delim = ",")
```
The message indicates that default column types were assigned to each field imported. If all the fields imported as expected, this message can be ignored.
However, ID numbers, while presented as numbers, have no real meaning in calculations. As a result, you can supply a column specification via the `col_types` argument, copying and pasting the framework from the message and changing fields as needed:
```{r import cols}
cols <- cols(
  id = col_character(), # Changed from col_double()
  name = col_character(),
  date_added = col_date(format = ""),
  `2015` = col_double(),
  `2016` = col_double(),
  `2017` = col_double(),
  `2018` = col_double()
)
raw_vendors <- read_delim('data/vendor_master.csv', delim = ",", col_types = cols)
glimpse(raw_vendors)
```
You may have noticed that while our normal columns are specified as-is, we had to wrap the year columns with backticks (`): while a column name can technically be a number, asking R to reference a bare number will interpret it as a value, not a field name.
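The same backticks are needed whenever these columns are referenced later; a quick sketch with dplyr:
```{r import backtick}
raw_vendors %>%
  select(name, `2015`) # Without backticks, dplyr would read 2015 as a column position
```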
### Handling problematic delimited files
While well-exported delimited files can be useful, they often contain hidden surprises. Consider this rather innocuous CSV file from Active Directory (the controller for Windows authentication), with username and manager fields:
```{r import ad}
# Active directory file, with just a username and manager field.
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/active_directory.csv",
destfile = "data/active_directory.csv", mode = "wb") # Downloads this csv file into the data folder
raw_ad <- read_delim('data/active_directory.csv', delim = ";")
```
These warnings indicate that columns were expected (as dictated by the first line of column headers) but missing in one or more lines. You can inspect the CSV file in the RStudio interface by clicking on the file in the navigation pane to the right and selecting 'View File'. You will notice that the location for both accounts is on a new line, but belongs to the prior record. The raw characters can be confirmed within R by reading the file directly, as-is (i.e. raw):
```{r import char}
ad_char <- readChar('data/active_directory.csv', file.info('data/active_directory.csv')$size)
print(ad_char)
```
These special characters are hidden within the seemingly innocuous delimited file, and are typical of the systems data is extracted from, especially Windows. `\r` represents a carriage return, and `\n` represents a line feed. Together, `\r\n` represents a new line and a new record, while a lone `\n` can appear in a file when a new line is made by pressing Shift+Enter.
In this common yet inconvenient case, these can be substituted out with regular expressions. Regular expressions are a standard, cryptic yet powerful way to match text in strings. We will cover specific use cases of these in [Searching Text](#test-searchtext).
In this case, the regular expression below only replaces `\n` when there is no `\r` preceding it. The `gsub()` function will try to match the regular expression given in its first argument, replacing each match with the value in its second argument.
```{r import clean ad}
gsub("(?<!\\r)\\n"," ", ad_char, perl = TRUE)
```
The `gsub()` output shows that the manager's name and location no longer have a `\n` between them. As a result, the file can now be imported cleanly.
```{r import clean ad 2}
ad_raw <- read_delim(gsub("(?<!\\r)\\n"," ", ad_char, perl = TRUE), delim = ";")
print(ad_raw)
```
## Databases {#import-db}
It is likely that the company you are auditing will have its data stored in a database. While having SQL skills is recommended, having R skills means you are able to perform basic queries on databases. There are many different database brands and vendors in the world, and thus many subtleties in how SQL works for each vendor, but they all operate on the same standard.
The Open Database Connectivity (ODBC) standard allows different vendors to write drivers, or the technical back-end methods, to connect to their databases. Generally, you will need a driver that matches the vendor and version of the database you're using.
The database interface (DBI) handles the interaction between R and the driver. Practically speaking, it enables R to send queries to the database via the driver defined in ODBC.
The most common way to connect to a database on your network is to install the vendor drivers, and then create a Data Source Name (DSN). To properly create this DSN, you'll need the name of your database, as well as read-only credentials. Alternatively, you may specify the server name, database schema and credentials explicitly, which offers some advantages from a portability perspective, as your teammates will not need to create DSNs and only need to install the drivers themselves. A sketch of both approaches follows.
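As a sketch only, here is what the two approaches might look like with the **odbc** package (not loaded above); the DSN name, driver, server, database and credentials are placeholders for whatever your own environment uses:
```{r import-db dsn, eval = FALSE}
library(odbc)

# Option 1: connect through a pre-configured Data Source Name
con <- dbConnect(odbc::odbc(), dsn = "my_dsn", uid = "readonly_user", pwd = "password")

# Option 2: specify the driver, server and database explicitly (placeholder values)
con <- dbConnect(odbc::odbc(),
                 driver = "SQL Server",
                 server = "dbserver.example.com",
                 database = "finance",
                 uid = "readonly_user",
                 pwd = "password",
                 port = 1433)
```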
For this example, we will use a SQLite database. Unlike commercial databases that you have to install, configure and hire database administrators to manage, a SQLite database is a small, self-contained file. SQLite files can be kept in your normal documents folders, and are perfect for lightweight applications (including training!).
Again, let's start by downloading the file, or in this case, a database:
```{r import-db}
dir.create("data", showWarnings = FALSE)
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
destfile = "data/rauditanalytics.sqlite", mode = "wb")
```
One thing that is different about connecting to databases is that you need to set up a database connection within R. This will usually consist of a driver (in this case, `RSQLite::SQLite()`) and a DSN or a file location (`data/rauditanalytics.sqlite`). The help file for `dbConnect()` lists other fields you may need as well: user, password, host, etc. At your company, you will also want to know the port, the schema names, and whether or not it is a trusted connection (authenticating automatically with your own Active Directory credentials). If you don't already have one, request a read-only account along with this information.
We're going to establish a connection with the database, creating the `con` connection object:
```{r import-db sqllite}
con <- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")
```
You can confirm the connection works correctly by listing the tables in the database. One important thing to remember is that you'll be passing this connection object each time you perform a command. For example, if you want to see what tables exist in the database and schema, you still need to tell the command which database connection to use.
```{r import-db listtables}
dbListTables(con)
```
To get data out of a database, you'll need to communicate to it with SQL (Structured Query Language). Here is an example of how to select all the records in a table:
```{r import-db gl}
dbGetQuery(con, 'select * from gl')
```
While knowing SQL is advantageous (and eventually critical), sometimes switching between languages is a hassle, especially when performing basic tasks within a database. Using the **dplyr** package, you can generate many of the same SQL queries using R syntax. All you need to do is create a reference to the table through the connection object - in this case, the `con` connection object contains the `gl` table:
```{r import-db gl2}
db_gl <- tbl(con, 'gl')
db_gl
```
Not only is it pointing to the same table, but it's also performing the same query, even though it was written using R and dplyr:
```{r import-db 3}
db_gl %>%
  show_query()
```
There are a few differences though between using these approaches.
* `dbGetQuery()` returns a data.frame that is immediately usable, while creating the table reference via `tbl()` gives a preview of the query results that hasn't yet been formally downloaded from the database. In order to use the data within R with `tbl()`, a further `collect()` is needed. Rather than performing a full data download every time, it is advantageous to preview the results before collecting them.
* `dbGetQuery()` requires the entire SQL query to be pre-written, and the result cannot be further leveraged within the database. However, the `tbl` object can still be further manipulated. This is especially useful when creating new fields or performing joins in the same database, as shown in the sketch below.
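For instance, a new field can be computed entirely in the database before anything is downloaded; here is a minimal sketch using the `gl` table reference we already created (the `abs_amount` field name is just for illustration):
```{r import-db mutate}
db_gl %>%
  mutate(abs_amount = abs(amount)) %>% # The new field is calculated by the database, not by R
  show_query()
```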
Let's say you wanted to filter for amounts greater than $75,000 in the database. With the SQL method, you would need to type a whole new SQL query:
```{r}
dbGetQuery(con, 'select * from gl where amount > 75000')
```
With the dplyr method, you only need to build on the table reference already established. The queries, and the results, are identical:
```{r}
db_gl_threshold <- db_gl %>%
  dplyr::filter(amount > 75000)
db_gl_threshold
show_query(db_gl_threshold)
```
And when you're happy with this, simply `collect()` to save the query results as a data frame. To be nice to your database administrators, you should also disconnect from the database to free up a connection.
```{r}
gl_threshold <- db_gl_threshold %>%
  collect()
dbDisconnect(con)
```
We will go through an advanced application of setting up database queries in [Audit Package Creation](#auditpackage), which builds upon leveraging dplyr to create the pieces that enable powerful queries.
## APIs
As more applications are hosted in the cloud, being able to obtain information from them without resorting to manually triggered reports is an important skill. Data can be accessed from these systems via an **Application Programming Interface** (API), typically exposed using Representational State Transfer (REST). An API allows one application to talk to another. Some examples of API usage are a website's search function, loading a list of comments other users have published, or determining who is friends with whom. REST advises how the endpoint is structured, and also suggests how the data is returned.
While APIs may appear to be daunting, retrieving data from them is not difficult at all. A user needs to know what they want, then:
- match what is desired to the 'endpoint', a pre-defined url or path where the data resides,
- send a valid request to `GET` the data, and
- receive the response.
Here is an example. Let's say you want to see what repositories I have available to the public on GitHub.
- You look up the GitHub API documentation, and discover you can [list what repositories a user has](https://developer.github.com/v3/repos/#list-repositories-for-a-user).
- You formulate your request, and the request URL becomes `https://api.github.com/users/jonlinca/repos`
- Then you send the above URL using the httr package and the `GET` command.
```{r import api 1}
library(httr)
library(jsonlite)
response <- GET('https://api.github.com/users/jonlinca/repos') # Endpoint
response
```
The structure of the data returned by an API looks different from a data frame. Typically there is an overall response status (indicating success or the nature of the failure), the type of content (in this case, 'application/json'), and the data itself. The entire download is usually stored as a list with multiple elements:
```{r import api 2}
names(response)
```
And the data you want will be in the response's `$content`:
```{r import api 3}
head(response$content)
```
This raw data is generally uninterpretable because it is stored in a different format - in this case, the Content-Type above indicates it is JSON. It's quite easy to convert data in this format to something more understandable with httr's `content()` and jsonlite's `fromJSON()` and `toJSON()` functions:
```{r import api 4}
content <- httr::content(response)
cleancontent <- jsonlite::fromJSON(jsonlite::toJSON(content), flatten = TRUE)
# Print it out
cleancontent[,c('id', 'name')]
```
Some data sources may be difficult to obtain data from, or perhaps you're not yet at the technical skill level to develop your own data connectivity for APIs. One alternative is the use of a third-party tool - for example, [CData](https://www.cdata.com/kb/tech/office365-jdbc-r.rst) supports an interface that allows you to interact with programs like Office 365 (including emails and file audit logs) directly. Office 365 also offers APIs that expose SharePoint and Outlook, although they take more time to understand.
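To give a flavour of what such a call could look like, here is a sketch only: the Microsoft Graph endpoint is real, but the `token` value is a placeholder, and a real call requires registering an application and completing OAuth authentication with your organization first.
```{r import api office, eval = FALSE}
token <- "your-access-token" # Placeholder; obtained through your organization's app registration
graph_response <- GET("https://graph.microsoft.com/v1.0/me/messages",
                      add_headers(Authorization = paste("Bearer", token)))
graph_content <- jsonlite::fromJSON(jsonlite::toJSON(httr::content(graph_response)), flatten = TRUE)
```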