---
Some calculations
do intersection of two postings lists & - (see the merge sketch after this list)
pseudocode for skip-list intersection &
Positional index usage & example p38
“Proximity” intersection algo &
calculate Soundex & - Example :: Soundex of HERMAN H650 p41
Successor Variety & calculation - - e.g. READABLE p28
simple calculation from the Ubung &
edit distance calculation &
Jaccard coefficient for scoring, how? & Query q [ides of March], doc1: "Caesar died in March": JACCARD(q, d1) = 1/6 (worked sketch after this list)
tf-idf calculation & - p28
tf-idf calculation & - start calculating from p41
tf-idf algo p46 & - NEXT
Heaps’ law: meaning? formula & (formulas after this list)
Zipf’s law: meaning? formula &
Byte-wise compression & calculation - p34
Blocked Sort-Based Indexing (BSBI): idea & - algo & -
Single-Pass In-Memory Indexing (SPIMI): algo & -
ch08 p27 calculation & -
ch08 my online-photos example & calculation - NEXT
Precision and Recall
calculation p10 & -
Accuracy :: proportion of TP and TN
formula & -
calculation &
p28 calculation & - there is a corresponding exercise in the Ubung &
calculation on the slides & p37 - given two judges' assessments, compute the Kappa statistic
the calculation from the web page & - NEXT
calculate the Retrieval Status Value (Retrievalstatuswert) & calculation
calculation example & - p165, Example 12-2
the calculation with input "Tasse Kanne" & - NEXT
in the Ubung &
ch13 second-price auction & and calculation, p47
calculate PageRank weights & - p322, p23 formula & -
Multinomial Naïve Bayes: idea & formula & calculation &
ch19 calculation p16 &
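A minimal sketch of the postings-list intersection referenced above: a linear merge of two sorted docID lists (function names are mine, not from the lecture; the skip-pointer variant would only add a shortcut in the advance steps):

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists of docIDs in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # with skip pointers we could jump several positions here
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]
```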
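A worked check of the Jaccard scoring example above, treating query and document as sets of lowercased terms (a sketch; the lecture's exact tokenization may differ):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

q  = "ides of march".lower().split()
d1 = "Caesar died in March".lower().split()
print(jaccard(q, d1))  # only "march" is shared: 1 / 6 ≈ 0.1667
```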
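The usual textbook forms of the two laws asked for above (k, b and c are fitted constants):

```latex
% Heaps' law: vocabulary size M as a function of the number of tokens T
M = k\,T^{b}, \qquad \text{typically } 30 \le k \le 100,\; b \approx 0.5

% Zipf's law: collection frequency of the i-th most frequent term
\mathrm{cf}_i \propto \frac{1}{i}, \qquad \text{i.e. } \mathrm{cf}_i = \frac{c}{i} \text{ for some constant } c
```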
---
Ubungen (to be merged)
important concepts from the Ubungen
u1
intersection algorithm x AND y and x OR y algo &
How should the Boolean query x AND NOT y be handled &
Why is the naive evaluation of this query normally very expensive
what can we achieve p6 &
x AND NOT y: O(x+y); x OR NOT y: O(N)
For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? &
u2
The tokenization of a document is a trivial task
stemming can lower precision
stemming increases or changes recall
// how does the stemmer affect precision, recall and vocabulary size &
Stemming decreases the size of the vocabulary.
Stemming should be applied to the documents, but not to the queries
The postings list of a stop word is usually longer than the postings list of a non-stop word
// what other approaches to stemming x3 &
What are the respective Levenshtein editing operations x3 types of operations
u3
The Jaccard coefficient works well for ranking documents
Rare terms are less informative than frequent terms.
// which words are more informative? &
The inverse document frequency (idf) has no effect on the ranking for one-term queries ! &
There is exactly one way to calculate the tf-idf weights
What minimal and maximal values can the following variables have ? -
Consider the case of a query term that is not in the set of indexed terms, what do we do ?
What is the idf of a term that occurs in every document?
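For the last question: with the standard definition, a term that occurs in every document gets idf 0 and therefore contributes nothing to the score:

```latex
\mathrm{idf}_t = \log_{10}\frac{N}{\mathrm{df}_t}
\;\xrightarrow{\;\mathrm{df}_t = N\;}\; \log_{10} 1 = 0
```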
u4
Zipf's Law states that the ith most frequent term has a collection frequency proportional to 1/i
Heaps' Law assumes that the vocabulary can grow infinitely !!!
Elias Gamma Coding is a technique for dictionary compression
Front Coding is a technique for dictionary compression
Large document collections contain many frequent and few rare terms !!!
// contains few frequent items and many rare items &
Gamma coding cannot encode the number zero. Is this a problem for compressing postings lists
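A minimal sketch of Elias gamma encoding for the gap items above (standard algorithm: unary length prefix followed by the binary offset without its leading 1; names are mine). It only encodes integers ≥ 1, which is unproblematic for gap lists since gaps are always ≥ 1:

```python
def gamma_encode(n):
    """Elias gamma code of an integer n >= 1, returned as a bit string."""
    assert n >= 1, "gamma coding cannot encode 0"
    offset = bin(n)[3:]               # binary representation without the leading 1
    length = "1" * len(offset) + "0"  # unary code for len(offset)
    return length + offset

for gap in [1, 2, 3, 4, 9, 13]:
    print(gap, gamma_encode(gap))
# 1 -> 0, 2 -> 100, 3 -> 101, 4 -> 11000, 9 -> 1110001, 13 -> 1110101
```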
u5
External sorting algorithms are used to sort lists which do not fit in main memory
It is impossible to calculate the top-k documents for a query without completely reading
the postings list of each query term
// the idea of NRA? + calculate the top-k docs without completely reading the postings lists
Why can the MapReduce algorithm as described in the lecture not be run in this case? &?? next
For optimal load balancing, the inverters in MapReduce must get segmented postings
files of similar sizes. For a new collection, the distribution is unknown , what should we do ? &
---
u6 &
a) The key measure for a search engine is user happiness.
b) The F_β measure combines both precision and recall into one number.
c) The 11-point interpolated average precision projects the precision-recall curve to a single number.
d) The Mean Average Precision (MAP) is yet another measure for evaluating the result of one query.
e) Pooling means experts manually judge the relevance of each document in the collection.
f) The kappa value is 1 if two judges always agree and 0 if they never agree.
g) Dynamic result summaries can be constructed efficiently from the positional inverted index
How many relevant documents does the
result contain and at which ranks are they?
kappa measure formula - (see below)
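The kappa measure in its usual form (P(A) = observed proportion of agreement, P(E) = agreement expected by chance):

```latex
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
```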
u7 &
a) The main motivation for relevance feedback and query expansion is to decrease recall.
b) The basic idea of the Rocchio algorithm is to move the query vector towards the vectors of
relevant documents and away from the vectors of irrelevant documents.
c) Relevance feedback is widely available in web search engines, because users can easily
understand it and are willing to “tune” their query.
d) Pseudo-relevance feedback can lead to query drift. !!!
e) In thesaurus-based query expansion, we add terms to a query which are semantically related to
the query terms specified by the user.
f) Thesauri can only be built manually, there is no way to build them automatically.
In Rocchio's algorithm, what weight setting for α, β, γ does a “Find pages like this one” search
correspond to? &
Why is positive feedback likely to be more useful than negative feedback to an IR system? &
Why might only using one non-relevant document be more effective than using several? !!! &
u8
a) Probabilistic IR is about estimating the probability that a document is relevant to a query.
b) The odds of an event A are defined as O(A) = P(A) / P(¬A).
c) The classical Binary Independence Retrieval (BIR) model takes term frequencies into account.
d) The BIR model assumes that terms appear independently from each other in the documents.
e) The cluster-hypothesis states, term distributions differ between relevant and irrelevant documents.
f) The system Okapi was a much simpler predecessor of the BIR Model.
in the RSV calculation:
what can be omitted in order to save time? &
Which of the calculations done so far have to be done anew for the new query? &
The classical BIR model does not consider term frequencies.
Is there a way to take these into account
as well?
we derived the formulas for the Retrieval Status Value:
We made the assumption that terms appear independently of each other in a document.
Provide an example of two terms for which this assumption is likely to hold, and one for which it is not.
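For the RSV questions above, the standard BIR retrieval status value (as in the IIR book, with p_t = P(term present | relevant) and u_t = P(term present | non-relevant); the per-term weights c_t are query-dependent but document-independent, so they can be precomputed):

```latex
RSV_d \;=\; \sum_{t \,\in\, q \cap d} \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}
      \;=\; \sum_{t \,\in\, q \cap d} c_t
```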
u9 &
a) A statistical language model (LM) assigns probabilities to strings of symbols from some alphabet.
b) For every unigram LM, cats hunt mice and mice hunt cats have the same probability.
c) Language models are completely unrelated to Markov chains.
d) We use LMs in IR like this: (step 1) we derive a LM from each document, (step 2) we rank all
documents by the probability that their LM generates the query.
e) From a vocabulary containing V terms we can construct approximately V^n n-grams.
f) A document containing d tokens contains approximately d^n n-grams.
// contains d vocabularies not tokens
g) In practice, language models in IR consider 𝑛-grams with 𝑛 ≥ 3, i.e., at least tri-grams.
h) The shorter the query, the more important is smoothing
// longer ? or no relationship ??
u10
a) The size of the web can easily be determined by crawling the web and counting the pages.
b) In the context of web search, information needs are the only subclass of user needs.
c) Shingling is a technique for detecting near duplicates of web pages.
d) The average out-degree of all web pages is higher than their average in-degree.
// why -- they may be equal
remember !!! &
e) The PageRank algorithm ranks web pages by the number of occurrences of the query terms.
f) The PageRank of a web page is query-dependent.
g) The HITS-algorithm provides two scores per web page: an authority score and a hub score.
h) The HITS-scores of a web page are query-dependent
what happens as α becomes close to 1? &
u11
a) Classification is about assigning a document to one (or more) out of several predefined categories.
b) The accuracy of a classifier is always one minus its error rate.
c) When using supervised learning for classification, we first train a classifier on unlabeled
documents.
d) Overfitting means that a classifier is too general.
e) With n-times k-fold cross-validation, we partition a set of n labeled documents into k (nearly)
equally-sized subsets and for each subset S_i we train a classifier on the documents in S_i and
evaluate its performance by applying it to the documents in S_i.
f) Given a set of labeled documents, the sequential covering algorithm determines a set of rules for
rule-based text classification
g) With probabilistic classification, a document d is assigned to category c_i if the probability P(c_i | d)
is the maximum of the probabilities P(c_k | d) for all categories c_k.
h) Feature selection can reduce the training time of a classifier, but it cannot improve its quality.
i) The Rocchio approach for vector-based classification assumes that the document vectors in each
category are close to each other, but distant from the document vectors in the other categories.
j) The kNN classifier assigns a doc. d to the k categories whose centroid vectors are closest to the vector of d.
k) The idea of support vector machines (SVMs) is to separate the vector space using an optimal
hyperplane and to assign a document d to one of two classes depending on whether its vector lies on the
one or on the other side of the hyperplane.
l) The SVM approach can only be applied to linearly separable datasets
difference between Bernoulli and multinomial ? &
answers to the Ubungen // or refer to the answer documents
u1
calculate (NOT y) first as a new postings list, which takes O(N) time.
where N is the total number of documents in the collection
bounded by the number of documents
not optimal, just a greedy method
u2
t t t t f t
u3
f f t f
u4
see ubung pdf
u5
partition by docid as well as term for very frequent terms
use one machine first to find the distribution of terms starting with the various letters of the alphabet
u6
u7
But, in practice, it has been shown to be most useful for increasing recall
might not be conveyed properly to the IR system which results in low precision output
u8
u9
u10
u11
the multinomial model uses term frequencies whereas the Bernoulli model uses document frequencies to estimate the probabilities.
The multinomial model estimates the probability from the number of occurrences of the term in the collection;
the Bernoulli model, on the other hand, estimates the probability by the number of
documents which contain the word
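A compact sketch contrasting the two estimators described above, with add-one smoothing; the toy documents and function names are illustrative, not taken from the Übung:

```python
# toy training data: (class, tokens)
docs = [("china", "chinese beijing chinese".split()),
        ("china", "chinese chinese shanghai".split()),
        ("china", "chinese macao".split()),
        ("other", "tokyo japan chinese".split())]
vocab = {t for _, toks in docs for t in toks}

def multinomial_cond_prob(term, cls):
    """P(term | cls) from term occurrences (multinomial model, add-one smoothing)."""
    cls_tokens = [t for c, toks in docs if c == cls for t in toks]
    return (cls_tokens.count(term) + 1) / (len(cls_tokens) + len(vocab))

def bernoulli_cond_prob(term, cls):
    """P(term | cls) from document frequency (Bernoulli model, add-one smoothing)."""
    cls_docs = [toks for c, toks in docs if c == cls]
    df = sum(1 for toks in cls_docs if term in toks)
    return (df + 1) / (len(cls_docs) + 2)   # +2 in the denominator: term present / absent

print(multinomial_cond_prob("chinese", "china"))  # (5+1)/(8+6) = 3/7
print(bernoulli_cond_prob("chinese", "china"))    # (3+1)/(3+2) = 4/5
```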
---
key concepts
----lev 1 Basic Part + Boolean
frequent errors
Stemming, wildcard queries, phrase queries, Spelling Correction x3
ch1 -
what is IR? &
what is unstructured data (usually text)? & -
what are data, information and knowledge &
IR system architecture :: structure of an IR system, diagram p15 & -
usage of IR Systems &
compare IR and databases & - x6
ch2 -
Why is grep not the solution & p6 -- x3
answer the query with the help of the term-document incidence matrix? example & - p8
why not used? & -
why inverted ? Inverted Index &
what is a postings list? & - p11
postings list construction, 3 steps &
how are the dictionary and postings stored &
What is the best order for processing this query? &
how to skip ? &
why skip lists are not used now ? & x2
Westlaw characteristics & x2
how to use Bi-word index to do Phrase Queries &
Phrase Queries methods x3
key idea of Extended Bi-words &
why Bi words not used ? &
what is Positional index &
how can a positional index be used for proximity search &
advantages of the positional index ::
1. doesn't require much space (compared to indexing all biwords/phrases)
2. supports proximity search
what is the Combination scheme of those two & -
What are “good” bi-words & -
ch3
how to determine the language of a text? & -
what is &
Term, Morpheme, Inflection,
Derivation, Compound (Kompositum), Noun Phrase (NP)
problems when reading words in IR? & *5
what is &
Grammatical markings(POS Tags),Stemming,
Lemmatization,Case Folding,Stop words
key idea of Soundex &
ways of Stemming & x3
Dice’s coefficient and Jaccard diff. &
Successor Variety &
advantages of it & x2
n-gram stemmers: & ::statistics statistical 0.8
Porter's Stemmer explain & -
Stemming?or not &
Dangers of stemming &
Part-of-Speech Tagging what is &
how can we do Part-of-Speech Tagging & x2
edit distance &
ch4
what data structures can be used for dictionaries &
Wildcard queries x3 & - example :: look up [hel*o]
Wildcard queries vs Phrase Queries &
why not used
compare permuterm index and k-gram index & -
why not used ?
Spelling correction, three kinds &
it is expensive, so how to solve this?
Issues when using Spelling correction & Problems
----lev 2 Basic Rank
ch5
compare: rank or not & -
Why rank & x1
Jaccard coefficient to Scoring,how? &
problems: & -
https://datascience.stackexchange.com/questions/15862/how-to-compute-the-jaccard-similarity-in-this-example-jaccard-vs-cosine
Cosine similarity and Jaccard similarity differ.
what is tf-idf &
tf-idf weighting, how &
tf-idf formula & -
tf-idf compare & - those two
under what conditions is the weight 0? & x2
The vector space models before and now & x3
why Euclidean distance is a bad idea & -
different weightings &
lnc.ltn
ltn.bnn
what each of these means; write down the formulas (see the sketch below)
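A minimal sketch of the tf-idf pieces listed above: log term frequency ('l'), idf ('t') and cosine normalization ('c') as they appear in SMART codes such as lnc.ltn (function names and the example numbers are mine):

```python
import math

def log_tf(tf):
    """Logarithmic term-frequency weight: 1 + log10(tf), or 0 if tf == 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(df, N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def cosine_normalize(weights):
    """'c' normalization: divide each weight by the vector's Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else weights

# example 'ltn' weight: tf = 3 in the document, df = 100, N = 1,000,000
print(log_tf(3) * idf(100, 1_000_000))   # (1 + log10 3) * 4 ≈ 5.91
```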
ch6
why compression & x2 -
what is Lossy vs. lossless compression & - -
Heaps' law: meaning?
its meaning for IR systems &
Zipf's law: meaning?
Dictionary as a string & where the main space saving comes from
why are 3 bytes needed here &
Dictionary as a string with blocking & where the main space saving comes from
why can't the blocks be too large &
how to trade off &
Front coding &
how to store gaps & x2
Variable length code &
Variable byte code &
Gamma codes for gap encoding & calculation - p39, 13, four steps &
Gamma: some advantages & - explain - x2
Gamma is seldom used in practice & - explain
ch7
External Memory Sort &
meaning of the abbreviation &
Blocked Sort-Based Indexing (BSBI)
idea & - algo & -
Single-Pass In-Memory Indexing (SPIMI)
algo & -
Caching strategies x2 &
two kinds of compute nodes (zwei Arten von Rechner-Knoten): & -
advantages and disadvantages of distributed index construction & - NEXT
methods of distributing &
why can it be parallelized? & -
MapReduce Architecture & - NEXT
Delta index: how does it work? & when is the delta index written to disk: &
ch8
Term-at-a-Time (TAAT) and Document-at-a-Time (DAAT) &
NRA: No Random Accesses, explain how it works & -
termination condition &
p29 formulas & - x5: min-k, unseen, worst, best, high(t)
ch9
how to measure happiness? & - p5 explain x4 2+2
Precision and Recall
formula p9 & - (see the sketch at the end of this chapter)
F-Measure Precision/recall tradeoff
formula & - p13
Cost-based measure & calculation
Accuracy :: proportion of TP and TN
formula & -
calculation &
Macro average (precision) and Micro average (precision)
p19 calculation & is there a problem with the formula? apparently not
reason why the PR curve is sawtooth-shaped? & -
what is included in benchmarks & x3
Kappa statistic: why? &
Pooling strategy -- judging every document is too much work; only the retrieved (pooled) documents are judged &
A/B Testing &
what to describe in summaries &
the goal of dynamic summary generation is to pick snippets that satisfy the following conditions & p49
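A small sketch of the basic evaluation measures above, computed from the contingency counts TP, FP, FN, TN; the F formula is the general F_β with β = 1 by default (the example numbers are made up):

```python
def precision(tp, fp):        return tp / (tp + fp)
def recall(tp, fn):           return tp / (tp + fn)
def accuracy(tp, tn, fp, fn): return (tp + tn) / (tp + tn + fp + fn)

def f_measure(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the harmonic mean."""
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# example: 40 docs retrieved, 20 of them relevant, 30 relevant overall, 10000 docs in total
tp, fp, fn = 20, 20, 10
tn = 10000 - tp - fp - fn
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r), accuracy(tp, tn, fp, fn))  # 0.5, ~0.667, ~0.571, 0.997
```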
----lev 3 Ranking system
ch10
four different examples with picture & - for user feedback -next
idea of Rocchio Algorithm to do feedback &
centroid, optimal query formula & (see the formula after this chapter's list)
there is a figure explaining this formula & -:
Used in practice &
what effect each parameter has & -
Positive versus negative feedback & -
When Does Relevance Feedback Work? & ??
Evaluation of feedback: how many iterations are enough &
problems, reasons why it does not work & - x3
pseudo-relevance feedback problems &
but does it work well? &
what can we do for indirect (implicit) relevance feedback &
Query expansion methods & x3
based on statistics: the term correlation matrix
calculation & - NEXT p48
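The Rocchio feedback formula referenced above, in its standard form (D_r / D_nr are the sets of known relevant / non-relevant document vectors):

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}_0
          \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
          \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```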
ch11
odds of an event & f
Retrieval Status Value (Retrievalstatuswert) & NEXT TUI
formula
meaning &
test of independence (Test auf Unabhängigkeit) formula &
probabilistic retrieval model: idea &
Binary Independence Retrieval model (BIR) & full name
Okapi formula & calculation &
document scoring compared across models: & -
advantages and disadvantages of probabilistic IR & - NEXT explain x2 x2
ch12
language model: idea & -
what is the language of an LM &
what does the double-circle node correspond to &
describe the finite automaton (EA) & - explain
calculate P(Katzen fangen Mäuse) & -
why are bigram language models not used & - explain
can LMs be represented by Markov chains (kann man LM durch Markow-Ketten darstellen) & - horizon (how far we look ahead)?
retrieval function, p20 formula & -
Jelinek-Mercer Smoothing & -
multiplying small values leads to numerical inaccuracies (Multiplikation kleiner Werte führt zu Rechenungenauigkeiten) & - how to handle & (see the sketch below)
LMs vs. BIR x1
similarities & - NEXT
differences & - NEXT
LMs vs. the vector space model (Vektorraummodell)
similarities & - NEXT
differences & - NEXT
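A sketch of query-likelihood scoring with Jelinek-Mercer smoothing, summing logs to avoid the underflow mentioned above; λ and the toy texts are illustrative, not the lecture's example:

```python
import math

def lm_score(query_terms, doc_terms, collection_terms, lam=0.5):
    """log P(q | d) with Jelinek-Mercer smoothing:
       P(t | d) = lam * tf_td / |d| + (1 - lam) * cf_t / |C|; summing logs avoids underflow."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_terms.count(t) / len(doc_terms)
        p_col = collection_terms.count(t) / len(collection_terms)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

doc = "Katzen fangen Mäuse gern".split()
collection = doc + "Hunde fangen Katzen nicht".split()
print(lm_score("Katzen fangen Mäuse".split(), doc, collection))
```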
ch13 web basics, +++++++++
why is the Web so messy and fast-changing & x1
characteristics of web IR & x3
static vs. dynamic pages & - explain
measure the size of the web & web size estimation &
the directed-graph structure of the whole Web? & - x6
spam problem: which methods are used to cheat the ranking? & - x2
characteristics of Google's user experience: & - x3
into which three broad classes can ordinary web search queries be divided? & -
the Mercator crawler, five modules &
example in the slides & - understand it and draw the path in the figure
the duplicate problem on the web & x2
specific characteristics of web search &
characteristics of web documents, the four main properties & -
Crawling, two characteristics? &
How are ads ranked & - explain x2+n
second-price auction & and calculation, p47
Anybody can participate and bid on keywords
the two main problems:
Keyword arbitrage
Violation of trademarks
common spam techniques & x2
ch14
the significance of link analysis & -
assumptions of the PageRank process: & - x2
topic-sensitive PageRank: the simple approach &
how to handle mixed (topic) cases? &
PageRank characteristics & - x2
HITS iterative update formulas & -
computational optimization? & - hyperlink-induced topic search
idea & -
problems: & - x1
HITS characteristics & - x2
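A minimal power-iteration sketch for the PageRank formula items above; the tiny graph is illustrative, and note that some slides write the teleport probability as α = 1 − d rather than the damping factor d used here:

```python
def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PR(p) = (1 - d)/N + d * sum over q linking to p of PR(q)/outdeg(q)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# toy graph: A -> B, A -> C, B -> C, C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```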
----lev 4 ML classification
ch15
what are the basic approaches to handling spam e-mail & p7 x2
machine learning (Maschinelles Lernen), how it works & draw the diagram
*overall: evaluation of classification & x3 + 2
basic
Accuracy and Error Rate
Recall, Precision and F1
formula & -
extensions
Confusion Matrix p27 pic, draw it
cross validation x3
*Undersampling, oversampling, hybrid methods (Hybrid-Verfahren) & -- deal with data skew
for undersampling methods it is enough to name them & x4
Oversampling & x2: name them, abbreviation and full name
Feature scaling x3
*Feature selection (Feature-Auswahl) x3 &
ch16 - algorithm details, how to optimize further
form of the rules (Form der Regeln) &
*Evaluation of rules
basic measures (Grundlegende Maße), the two metrics, formulas &, calculation & p15
too large or too small?
three quality measures (Drei Qualitätsmaße), formula &, p21 ?? NEXT
combine the two values
e.g. & p26
*Sequential Covering,
idea &,
formula &, e.g. p33 &&&,
*algorithm p30
post-processing (Nachbearbeitung) x2 + x2
ch17 - the formulas take some effort
*Bayesian classification:
why Naïve & x1
why not Naïve & x2
Multinomial Naïve Bayes: idea & formula & calculation &
the formulas at three levels
*first level -
basics: P(c_j | d), the prior P(c_j), estimating P(t_k | c_j), gamma(d)
how to do smoothing?
+ the corresponding code & arrays ...
+ what does the ML model actually store, what does it learn
+ is it Bernoulli Naïve Bayes when term freq = 1? false! why
Bernoulli Naïve Bayes: idea & formula &
also three levels
smoothing? why + 2 in the denominator
remarks (Anmerkungen) & x2 advantages
current applications & x1
ch18
*Classification in the vector space (Klassifikation im Vektorraum)
Vektorraum: vector space
explain &
Rocchio classifier (Rocchio-Klassifikator)
idea &
formula &
*Problems with Rocchio & x2
kNN
process & x2
*choice of k (Wahl von k)? & x2-3
variants & x2
ch19 - SVM details
representation of the hyperplane via the plane equation (Darstellung Hyperebene mit Ebenengleichung) :: p9
procedure for more than two categories? Multiclass SVMs?, explain & p10 x3 steps
:: when there are more than two classes x2
*Training phase: perceptron (Trainingsphase: Perzeptron)
what are R and eta (the learning rate) &
procedure, explain & x3 steps
what if the learning rate is large? &
formula of the perceptron &
what is the alpha factor in the formula &
dual form &
what have we learned? &
SVM
*SVM, linear case (Linearer Fall)
expression for the distance (margin) &
simple procedure (Einfaches Verfahren) &
formula p30 & -
it can be solved with Lagrange multipliers or quadratic programming
- well-studied solution algorithms
why larger than 1? (constraint formula)
// dual formulation (Duale Formulierung) & -
example p33 &
*extended optimization procedure (Erweitertes Optimierungsverfahren)
using slack variables in the SVM :: introduce slack variables
formula & p38
influence of C, explain & -
tradeoff parameter (chosen by cross-validation)
what is m :: the number of slack variables?
condition? &
*name them - Kernel trick -- non-linear decision boundaries (Nichtlineare Entscheidungsgrenzen)
key idea &
kernel types & x4
RBF kernel -- Gaussian (Gauß'scher) kernel
Polynomial kernel
Sigmoid kernel
Cosine similarity kernel
Chi-squared kernel
how to choose? 10k
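A sketch of the primal perceptron training loop referenced above (η is the learning rate; the data and names are illustrative toy values, not from the slides):

```python
def perceptron_train(X, y, eta=1.0, epochs=100):
    """Train w, b so that sign(w·x + b) matches y in {-1, +1}; update only on mistakes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
                b += eta * label
                mistakes += 1
        if mistakes == 0:          # converged: the data is linearly separable
            break
    return w, b

# toy linearly separable data
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]]
y = [1, 1, -1, -1]
print(perceptron_train(X, y))
```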
---
Language training
base::
generic storage...
similar. general
math
calculus..
Introduction to Calculus
regular polyhedra
Sets, Functions, Graphs and Limits
Elementary Set Theory, Subsets, Set Operations, Coordinate Systems
The Exponential and Logarithmic Functions
The Hyperbolic Functions
Symmetry of Functions, Translation and Scaling of Axes,
Equations of Lines
Parabola, Ellipse, Hyperbola
Conic Sections in Polar Coordinates
Differential Calculus
The Derivative of
take the derivative of both sides
And then we take the derivative.!!!!
-r now, we are going to take the derivative
Integral Calculus
Integrate
Table of Integrals
I will integrate the function on that region !!!!
-r now, we are going to integrate this function on the region
Method of Partial Fractions
The Definite Integral
Improper Integrals
Cylindrical Coordinates
applications of
Spring-mass System
Damping Forces
Mechanical Resonance
ch1
Information retrieval
is the task,
given a set of documents and a user query,
finding the relevant documents
// find the relevant docs , according to the collection and the queries
Automated information retrieval systems are used to reduce
what has been called information overload
// why should we build IR systems: there is a huge amount of information that we have to manage
An IR system is software that provides access to books,
journals and other documents, stores them and manages them.
// we can access these books using the system
Web search engines are the most visible IR applications.
// Medical Imaging is the most visible application in CG
the information retrieval process begins when a user enters a query into the system
Queries are formal statements of information needs,
search strings in web search engines
with different degrees of relevancy.
opposed to classical SQL queries of a database
compute a numeric score on how well each object in the database matches the query
// the key idea of our system is to compute a numeric score to rank the docs.
so as to provide a very quick access
The process may then be iterated if the user wishes to refine the query
Now the world has changed, and hundreds of millions of people engage
in information retrieval every day when they use a web search engine or search their email
web search engine :: pronunciation
unstructured text
does not have a pre-defined data model
// has no data model , when it compared to data in databank
typically text-heavy
an ambiguous word/term/statement
This results in irregularities and ambiguities that make it
difficult to understand using traditional programs
80-90% of all potentially usable business information may originate in unstructured form
Other sources have reported similar or higher percentages of unstructured data
majority of that will be unstructured
// the majority of data in our daily life is unstructured
The earliest research into business intelligence focused in on unstructured textual data
// at the very beginning, IR was used in business intelligence
the extraction and classification of unstructured text
Approaches in medicine and biomedical research
biomedical documents include self-organizing map
approaches for identifying topics among documents
ch2
what is a model
Mathematical models are used to study the properties of
the process, draw conclusions, make predictions
derived from ..
Boolean retrieval is something that is very precise
Boolean model of information retrieval (BIR)
The BIR is based on Boolean logic and classical set theory
both the documents and the user's query are conceived as sets of terms.
// bag of words model
Retrieval is based on whether or not the documents contain the query terms.
// it ignores term frequency and just checks whether the term is contained in the doc.
why inverted
postings file or inverted file
a mapping from content (such as words) to the documents containing it
named in contrast to a forward index, which maps from documents to content
//
it is named in contrast to a forward index
a forward index maps from documents to content
an inverted index maps from words to documents
The purpose of an inverted index is to allow fast full-text searches
//
to enable fast full-text search, we generate an inverted index for the collection
It is the most popular data structure used in document retrieval systems
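A tiny sketch of building an inverted index (term → sorted list of docIDs) from a toy collection, matching the word-to-documents mapping described above (the documents are made up):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2, 3]
print(index["july"])   # [2, 3]
```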
Phrase Queries
A word query is a query on a word or phrase
// a phrase can be 2 or more words :: the WTO, black Alabama ...
search for documents containing an exact sentence or phrase
searching for a certain string of text
using quotation marks (") around a specific phrase to
indicate that you want to search for instances of that search query
//
there should be quotation marks around this string
ch3
a morpheme is the smallest meaningful unit in a language
// meaningful unit
stemmer
handling of word endings by reducing the words to their word roots
// find the root of words
Stemming broadens our results to include both word roots and word derivatives
to improve recall
// stemmers are used to increase recall in IR systems
A stemming algorithm is an algorithm that converts a word to a related form
conversion of plurals to singulars.
It is the most effective and widely used
it can only be applied to text in the English Language
// the porter stemmer
the user specifies a word in a query but only a variant of this word is present in a relevant document
This problem can be partially overcome with the substitution of stems for the words
A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes)
improving retrieval performance
affix removal, table lookup, successor variety
determination of morpheme boundaries
and n-grams.
might require considerable storage space
reduce the size of indexing files
In summary, the successor variety stemming process has 3 parts:
develop a stemmer that required little or no human processing.
// the system is supposed to require little human processing
// the system should need little interaction with humans
it facilitates the construction of an attractive user interface
to test the correctness of our work
// to check if it is correct
we have achieved an 80% level of correctness
Several advantages of the Successor Variety Algorithm can be observed
-- without the need to use a dictionary
-- it is basically (domain independent)
Another advantage is that it can be used in several domains; it is basically (domain independent).
National Computer Conference and Exhibition
Part-of-Speech
The part of speech indicates how the word functions in
meaning as well as grammatically within the sentence.
// find the function of the words we are trying to tag in the sentence.
ch4
encyclopedia
Wildcard character
a wildcard character is a kind of placeholder represented by a single character
an asterisk
can be interpreted as a number of literal characters or an empty string
used in Regular expressions
doc* matches doc and document but not dodo
k gram
an n-gram is a contiguous sequence of n items from a given sample of text or speech
n-grams may also be called shingles
a type of probabilistic language model
n-gram models are widely used in statistical natural language processing.
Space–time tradeoff
an algorithm or program trades increased space usage with decreased time
Using stored knowledge or encoding stimuli reactions as "instincts"
in the DNA avoids the need for "calculation" in time-critical situations
More specific to computers, look-up tables have been implemented since the very earliest operating systems
Spelling Correction
spell checker :: to check whether the spelling of a word is correct
Levenshtein distance
k gram
Context-sensitive process
a software feature that checks for misspellings in a text.
n-grams, to recognize errors instead of correctly-spelled words
with phonetic information
Gorin wrote SPELL in assembly language, for faster action
available on mainframe computers
PCs with sufficient memory.
For example, the concatenation of "snow" and "ball" is "snowball
string theory, string concatenation is a primitive notion.
Recreational mathematics
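A standard dynamic-programming sketch of the Levenshtein distance used for spelling correction above; the three operations are insertion, deletion and substitution:

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                     # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```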
ch5
Jaccard with k-Grams
Jaccard similarity
//
k-grams are just a simple concept and can be used in many algorithms.
Jaccard measures the similarity of two sets, e.g. sets of k-grams
is a statistic used for comparing the similarity and diversity of sample sets.
the size of the intersection divided by the size of the union of the sample sets
If A and B are both empty, we define J(A,B) = 1.
Intersection over Union
ratio of the size of the symmetric difference
Weighted Jaccard similarity and distance
Probability Jaccard similarity and distance
Tanimoto similarity and distance
Paul Jaccard
Cosine similarity is for comparing two real-valued vectors,
Jaccard similarity is for comparing two binary vectors (sets).
ch6
Zipf's law
ch7
ch8
ch9
Kappa statistic
a statistic which measures inter-rater agreement
it is conceptually simpler to evaluate disagreement between two raters
The seminal paper introducing kappa as a new technique was published by Jacob Cohen
between two raters only; it cannot measure the agreement among more than two raters
Fleiss' kappa
The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic
Kappa is also used to compare performance in machine learning
Statistical significance
type of intraclass correlation