---
title: "Landscape Phase 1 Data Collection"
author: "Meng Liu"
date: "`r Sys.Date()`"
output: html_document
editor_options:
  chunk_output_type: console
---
# Load libraries
```{r}
# install.packages("pacman")  # uncomment if pacman is not yet installed
pacman::p_load(bibliometrix, tidyverse, rio)  # p_load installs any missing packages
```
# Note on the search
Search string:
TI = ("open science" OR "open research" OR "open scholarship") OR AK = ("open science" OR "open research" OR "open scholarship") OR KP = ("open science" OR "open research" OR "open scholarship")
Editions = A&HCI, BKCI-SSH, BKCI-S, CCR-EXPANDED, ESCI, IC, CPCI-SSH, CPCI-S, SCI-EXPANDED, SSCI
A screenshot of the search was saved as "Search screenshot 20220626.png".
Link to query:
https://www.webofscience.com/wos/woscc/summary/02f9842c-7756-41a7-9229-0798510200fb-401bcfdc/relevance/1
The search returned 2,355 hits.
All fields were then exported as BibTeX (in five batches, as the maximum export per batch is 500 records).
# Load raw data
```{r}
# the search results were exported in five batches of at most 500 records
files <- c("raw data/savedrecs(1).bib",
           "raw data/savedrecs(2).bib",
           "raw data/savedrecs(3).bib",
           "raw data/savedrecs(4).bib",
           "raw data/savedrecs(5).bib")
# merge all batches into a single bibliometrix data frame
raw_data <- convert2df(files, dbsource = "wos", format = "bibtex")
```
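A quick sanity check on the merged data: the row count should match the 2,355 hits from the search, and the WoS identifier (UT) should be unique across batches.
```{r}
# number of records should match the 2,355 hits returned by the search
nrow(raw_data)
# duplicated WoS identifiers would indicate overlapping export batches
sum(duplicated(raw_data$UT))
```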
Field tags in the converted data frame:

| Tag | Description |
|-----|-------------|
| AU | Authors |
| TI | Document Title |
| SO | Publication Name (or Source) |
| JI | ISO Source Abbreviation |
| DT | Document Type |
| DE | Authors' Keywords |
| ID | Keywords associated by the SCOPUS or WoS database |
| AB | Abstract |
| C1 | Author Address |
| RP | Reprint Address |
| CR | Cited References |
| TC | Times Cited |
| PY | Year |
| SC | Subject Category |
| UT | Unique Article Identifier |
| DB | Database |
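To confirm these fields survived the conversion, a minimal check (the tag vector below simply restates the list above):
```{r}
# expected WoS field tags (from the table above); setdiff() flags any that
# are missing from the converted data frame -- should return character(0)
expected_tags <- c("AU", "TI", "SO", "JI", "DT", "DE", "ID", "AB",
                   "C1", "RP", "CR", "TC", "PY", "SC", "UT", "DB")
setdiff(expected_tags, colnames(raw_data))
```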
# Preprocessing
```{r}
# preprocess data
data <- raw_data %>%
  as_tibble() %>%
  # assign a unique ID to each record
  mutate(unique_id = row_number()) %>%
  # rename columns with descriptive labels
  rename(title = TI,
         author_keywords = DE,
         abstract = AB,
         keywords_plus = ID,
         doi = DI)
# save the full preprocessed dataset
rio::export(data, "data.csv")
```
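The tagging subset below hinges on which records lack author keywords, so it is worth summarising missingness in the screening fields first:
```{r}
# count missing values in each field used for screening and tagging
data %>%
  summarise(across(c(title, author_keywords, abstract, keywords_plus, doi),
                   ~ sum(is.na(.x))))
```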
# Subset pilot sample for screening
```{r}
set.seed(123)  # for reproducible sampling
pilot_screening <- data %>%
  select(unique_id, title, author_keywords, abstract, keywords_plus, doi) %>%
  # lower-case all fields for consistent matching during screening
  mutate(across(everything(), str_to_lower)) %>%
  # draw a random pilot sample of 100 records
  sample_n(size = 100)
rio::export(pilot_screening, "pilot_screening.csv")
```
# Subset pilot sample for tagging
## Check total n for tagging
```{r}
set.seed(123)
total_n <- data %>%
  summarise(sum(is.na(author_keywords))) %>%
  pull()
# pilot sample size: 10% of the records lacking author keywords
# (named n_tagging to avoid shadowing dplyr::sample_n)
n_tagging <- round(total_n * 0.1, 0)
```
At the moment there are 594 records without author keywords, which is on the large side (especially for double coding). We'll pilot a random sample (n_tagging = 10% of total_n, i.e. round(594 * 0.1) = 59 records) to estimate the time and manpower required for tagging.
Below is the code for generating the tagging sample. Note that this sample MAY contain irrelevant papers (i.e., it is drawn from the full dataset rather than from the screened dataset). This allows us to pilot the tagging process during the hackathon without having to wait for the screening dataset to be processed and generated.
```{r}
set.seed(123)  # for reproducible sampling
pilot_tagging <- data %>%
  # restrict to records without author keywords (the tagging candidates)
  filter(is.na(author_keywords)) %>%
  select(unique_id, title, author_keywords, abstract, keywords_plus, doi) %>%
  mutate(across(everything(), str_to_lower)) %>%
  sample_n(size = n_tagging)
rio::export(pilot_tagging, "pilot_tagging.csv")
```
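Since the pilot is intended to be double coded, one possible sketch for assigning each record to two coders (the coder names below are placeholders, not part of the pipeline):
```{r}
# hypothetical double-coding assignment: each record goes to two coders
set.seed(123)
coders <- c("coder_1", "coder_2", "coder_3")  # placeholder names
assignments <- pilot_tagging %>%
  rowwise() %>%
  mutate(coder_pair = paste(sort(sample(coders, 2)), collapse = " & ")) %>%
  ungroup()
count(assignments, coder_pair)  # check workloads are roughly balanced
```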