Merge pull request #544 from shatu/dataless-classifier
Initial commit for the Dataless Classifier (closes #556)
Showing 58 changed files with 6,586 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
# CogComp-DatalessClassifier | ||
Given a label ontology and textual descriptions of its labels, Dataless-Classifier classifies arbitrary text into that ontology. | ||
|
||
It is particularly useful when gathering enough training data for a supervised text classifier is difficult or expensive. Dataless-Classifier uses the semantic meaning of the labels to bypass the need for explicit supervision. For more information, please visit our main project [page](http://cogcomp.org/page/project_view/6). | ||
|
||
|
||
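The idea of classifying by label semantics can be illustrated with a minimal sketch. It assumes each text is represented by the average of its word vectors and that the predicted label is the one whose description is closest in cosine similarity. The class `DatalessSketch`, its methods, and the toy vectors used with it are hypothetical and are not the annotators' actual API or inference procedure.

```java
import java.util.Map;

// Hypothetical illustration only; not part of the Dataless Classifier code base.
class DatalessSketch {

    /** Averages the vectors of all known words in a whitespace-tokenized text. */
    static double[] embed(String text, Map<String, double[]> wordVectors, int dim) {
        double[] sum = new double[dim];
        int known = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            double[] v = wordVectors.get(token);
            if (v == null) {
                continue; // out-of-vocabulary words are ignored
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += v[i];
            }
            known++;
        }
        if (known > 0) {
            for (int i = 0; i < dim; i++) {
                sum[i] /= known;
            }
        }
        return sum; // all zeros when no word is known
    }

    /** Cosine similarity; defined as 0 when either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the label ID whose description is most similar to the document. */
    static String classify(String doc, Map<String, String> labelDescriptions,
                           Map<String, double[]> wordVectors, int dim) {
        double[] docVec = embed(doc, wordVectors, dim);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, String> e : labelDescriptions.entrySet()) {
            double score = cosine(docVec, embed(e.getValue(), wordVectors, dim));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

No labeled training documents appear anywhere in this sketch: the only supervision is the label descriptions, which is why their quality matters so much.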
Some key points: | ||
- The main classes for the Dataless Annotators are: | ||
* **ESADatalessAnnotator** for the ESA-based Dataless Annotator | ||
* **W2VDatalessAnnotator** for the Word2Vec-based Dataless Annotator | ||
- The annotators add the **DATALESS_ESA** and **DATALESS_W2V** views, respectively, to the input `TextAnnotation`; both require the presence of a **TOKENS** view with the end-user's desired tokenization. | ||
- Since labels/topics are inferred at the document level, all topic annotations span the entire document. | ||
- A sample invocation is provided in the main function of each annotator. | ||
- Both annotators load embeddings into memory and can therefore easily consume up to **10GB RAM**. | ||
|
||
|
||
## Label Hierarchy | ||
Dataless Classification requires the end-user to specify a label hierarchy (with label descriptions) to classify into. The label hierarchy must be provided in a very specific format: | ||
* **labelNamePath**: Specify your label ID to label name mapping here in the `labelID \t labelName` format | ||
(the label ID can be any ID specific to your system; however, we use the label name itself as the ID in our sample hierarchy for readability) | ||
* **labelHierarchyPath**: The first line of this file should contain a tab-separated list of the top-level nodes in the hierarchy (i.e. the ones directly connected to the root). Every following line should specify the connections in the hierarchy in the `parentLabelID \t childLabelID1 \t childLabelID2 \t ...` format. | ||
* **labelDescPath**: Dataless' performance hinges on good label descriptions, which you specify in this file in the `labelID \t labelDescription` format. | ||
|
||
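The three file formats above could be read with a sketch along the following lines. The class `LabelHierarchySketch`, its method names, and the `ROOT` placeholder key are hypothetical and only illustrate the formats; they are not the parser the annotators use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Dataless Classifier code base.
class LabelHierarchySketch {

    /** Placeholder key under which the top-level nodes are stored (an assumption). */
    static final String ROOT = "ROOT";

    /**
     * Parses `labelID \t value` lines, as used by both labelNamePath and
     * labelDescPath. Only the first tab splits the line, so the value may
     * itself contain spaces.
     */
    static Map<String, String> parseIdMap(List<String> lines) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                throw new IllegalArgumentException("Expected 'labelID<TAB>value': " + line);
            }
            map.put(parts[0].trim(), parts[1].trim());
        }
        return map;
    }

    /**
     * Parses the labelHierarchyPath format: the first non-blank line lists the
     * top-level nodes; each later line is `parentLabelID \t child1 \t child2 ...`.
     */
    static Map<String, List<String>> parseHierarchy(List<String> lines) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        boolean firstLine = true;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            List<String> ids = Arrays.asList(line.trim().split("\t"));
            if (firstLine) {
                children.put(ROOT, new ArrayList<>(ids));
                firstLine = false;
            } else {
                children.put(ids.get(0), new ArrayList<>(ids.subList(1, ids.size())));
            }
        }
        return children;
    }
}
```

To read real files, the lines can be obtained with `java.nio.file.Files.readAllLines(java.nio.file.Paths.get(path))` and passed to these methods.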
We provide a sample 20newsgroups hierarchy with label descriptions inside data/hierarchies/20newsgroups, where: | ||
* idToLabelNameMap.txt should be used as labelNamePath | ||
* parentChildIdMap.txt should be used as labelHierarchyPath | ||
* labelDesc\_Kws\_simple.txt should be used as labelDescPath | ||
|
||
We also provide improved 20newsgroups label descriptions in *labelDesc\_Kws\_embellished.txt*, which correspond to the label descriptions used in [2], whereas *labelDesc\_Kws\_simple.txt* corresponds to those used in [1]. | ||
|
||
## Embeddings | ||
ESA and Word2Vec Embeddings are fetched from the DataStore on demand. | ||
|
||
## Config | ||
A sample config file with the default values is provided in the config folder: *config/project.properties* | ||
|
||
To check that the project is set up correctly, run: | ||
* `mvn -Dtest=ESADatalessTest#testPredictions test` to test the ESADatalessAnnotator. | ||
* `mvn -Dtest=W2VDatalessTest#testPredictions test` to test the W2VDatalessAnnotator. | ||
|
||
If you use this software for research, please cite the following papers: | ||
|
||
[1] Chang, Ming-Wei, et al. "Importance of Semantic Representation: Dataless Classification." AAAI. Vol. 2. 2008. | ||
|
||
[2] Song, Yangqiu, and Dan Roth. "On Dataless Hierarchical Text Classification." AAAI. Vol. 7. 2014. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
## Use ResourceManager to read these properties | ||
# curatorHost = trollope.cs.illinois.edu | ||
# curatorPort = 9010 | ||
|
||
## Target Label Hierarchy | ||
labelHierarchyPath = data/hierarchies/20newsgroups/parentChildIdMap.txt | ||
labelNamePath = data/hierarchies/20newsgroups/idToLabelNameMap.txt | ||
labelDescPath = data/hierarchies/20newsgroups/labelDesc_Kws_simple.txt | ||
# labelDescPath = data/hierarchies/20newsgroups/labelDesc_Kws_embellished.txt | ||
|
||
## Classifier configuration | ||
inferenceBottomUp = True | ||
classifierThreshold = 0.99 | ||
classifierLeastK = 1 | ||
classifierMaxK = 3 | ||
|
||
## ESA Configuration | ||
#esaPath = data/embeddings/esaEmbedding/esa_vectors.txt | ||
#esaMapPath = data/embeddings/esaEmbedding/idToConceptMap.txt | ||
#esaDimension = 100 | ||
|
||
## W2V Configuration | ||
#w2vPath = data/embeddings/w2vEmbedding-100/w2v_vectors.txt | ||
#w2vDimension = 200 | ||
|
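The properties above use the standard `key = value` syntax with `#` comments, so they can also be read with plain `java.util.Properties`. The sketch below is one possible way to do that and is not the project's `ResourceManager`; the class `DatalessConfigSketch` is hypothetical, and falling back to the sample file's values when a key is missing is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration only; the project itself reads these via ResourceManager.
class DatalessConfigSketch {
    final String labelHierarchyPath;
    final String labelNamePath;
    final String labelDescPath;
    final boolean inferenceBottomUp;
    final double classifierThreshold;
    final int classifierLeastK;
    final int classifierMaxK;

    private DatalessConfigSketch(Properties p) {
        // Fallback values mirror the sample project.properties (an assumption).
        labelHierarchyPath = get(p, "labelHierarchyPath",
                "data/hierarchies/20newsgroups/parentChildIdMap.txt");
        labelNamePath = get(p, "labelNamePath",
                "data/hierarchies/20newsgroups/idToLabelNameMap.txt");
        labelDescPath = get(p, "labelDescPath",
                "data/hierarchies/20newsgroups/labelDesc_Kws_simple.txt");
        // Boolean.parseBoolean ignores case, so "True" is read as true.
        inferenceBottomUp = Boolean.parseBoolean(get(p, "inferenceBottomUp", "True"));
        classifierThreshold = Double.parseDouble(get(p, "classifierThreshold", "0.99"));
        classifierLeastK = Integer.parseInt(get(p, "classifierLeastK", "1"));
        classifierMaxK = Integer.parseInt(get(p, "classifierMaxK", "3"));
    }

    /** Looks up a key and trims the value, since Properties keeps trailing spaces. */
    private static String get(Properties p, String key, String fallback) {
        return p.getProperty(key, fallback).trim();
    }

    /** Loads the configuration from any Reader, e.g. a FileReader on project.properties. */
    static DatalessConfigSketch load(Reader reader) {
        Properties p = new Properties();
        try {
            p.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new DatalessConfigSketch(p);
    }
}
```

Lines beginning with `#`, including the commented-out ESA and W2V entries, are skipped by `Properties.load`, so those keys stay unset unless they are uncommented.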
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
yes i know it s nowhere near christmas time but i m gonna loose net access in a few days maybe a week or if i m lucky and wanted to post this for interested people to save till xmas note bell labs is a good place if you have a phd and a good boss i have neither subject xmas light set with levels of brightness another version of a variable brightness xmas light set this set starts with a blinker bulb string diagram orginal way set 0v b b 0rtn modified set for level brightness string 0v 0k w string b 0v rtn note no mods to wiring to the right of this point only one blinker is used note that the blinker would not have as much current thru it as the string bulbs because of the second string of bulbs in parallel with it that s why the use of the 0k w resistor here to add extra current thru the blinker to make up for the current shunted thru the second string while the blinker is glowing and the second string is not glowing when the blinker goes open this resistor has only a slight effect on the brightness of the strings s slightly dimmer s slightly brighter or use a w 0v bulb in place of the 0k resistor if you can get one caution do not replace with a standard c bulb as these draw too much current and burn out the blinker c approx w what you ll see when it s working powerup string will light at full brightness and b will be lit bypassing most of the current from the second string making them not light b will open placing both strings in series making the string that was out to glow at a low brightness and the other string that was on before to glow at reduced brightness be sure to wire and insulate the splices resistor leads and cut wires in a safe manner level brightness xmas light set for easter |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
i m looking for some recommendations for screen capture programs a couple of issues ago pc mag listed as editor s choices both conversion artist and hijaak for windows anyone have any experience with those or some others i m trying to get an alpha manual in the next few days and i m not making much progress with the screen shots i m currently using dodot and i m about to burn it and the disks it rode it on it s got a lot of freaky bugs and oversights that are driving me crazy tonight it decided that for any graphic it writes out as a tiff file that s under a certain arbitrary size it will swap the left and right sides of the picture usually it confines itself to not copying things to the clipboard so i have to save and load pix for editing in paintbrush or crashing every hour or so the one nice thing it has though is it s dither option you d think that this would turn colors into dots which it does if you go from say colors to colors but if you go from or colors to b w you can set a threshold level for which colors turn to black and which turn to white for me this is useful because i can turn light grays on buttons to white and the dark grays to black and thereby preserve the d effect on buttons and other parts of the window if you understood my description can you tell me if another less buggy program can do this as well much thanks for any help signature david delgreco what lies behind us and what lies technically a writer before us are tiny matters compared delgreco rahul net to what lies within us oliver wendell holmes david f delgreco delgreco rahul net recommendation for screen capture program |
26 changes: 26 additions & 0 deletions
dataless-classifier/data/hierarchies/20newsgroups/idToLabelNameMap.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
politics politics | ||
religion religion | ||
computer computer | ||
autos.sports autos.sports | ||
science science | ||
sales sales | ||
talk.politics.guns talk.politics.guns | ||
talk.politics.mideast talk.politics.mideast | ||
talk.politics.misc talk.politics.misc | ||
alt.atheism alt.atheism | ||
soc.religion.christian soc.religion.christian | ||
talk.religion.misc talk.religion.misc | ||
comp.sys.ibm.pc.hardware comp.sys.ibm.pc.hardware | ||
comp.sys.mac.hardware comp.sys.mac.hardware | ||
comp.graphics comp.graphics | ||
comp.windows.x comp.windows.x | ||
comp.os.ms.windows.misc comp.os.ms.windows.misc | ||
rec.autos rec.autos | ||
rec.motorcycles rec.motorcycles | ||
rec.sport.baseball rec.sport.baseball | ||
rec.sport.hockey rec.sport.hockey | ||
sci.electronics sci.electronics | ||
sci.crypt sci.crypt | ||
sci.med sci.med | ||
sci.space sci.space | ||
misc.forsale misc.forsale |
26 changes: 26 additions & 0 deletions
dataless-classifier/data/hierarchies/20newsgroups/labelDesc_Kws_embellished.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
politics politics gun fbi guns weapon compound israel arab jews jewish muslim gay homosexual sexual | ||
religion religion atheist christian atheism god islamic christian god christ church bible jesus christian morality jesus god religion horus | ||
computer computer bus pc motherboard bios board computer dos mac apple powerbook graphics image gif animation tiff window motif xterm sun windows windows dos microsoft ms driver drivers card printer | ||
autos.sports autos.sports car ford auto toyota honda nissan bmw bike motorcycle yamaha baseball ball hitter hockey wings espn | ||
science science circuit electronics radio signal battery encryption key crypto algorithm security doctor medical disease medicine patient space orbit moon earth sky solar | ||
sales sales sale offer shipping forsale sell price brand obo | ||
talk.politics.guns gun fbi guns weapon compound | ||
talk.politics.mideast israel arab jews jewish muslim | ||
talk.politics.misc gay homosexual sexual | ||
alt.atheism atheist christian atheism god islamic | ||
soc.religion.christian christian god christ church bible jesus | ||
talk.religion.misc christian morality jesus god religion horus | ||
comp.sys.ibm.pc.hardware bus pc motherboard bios board computer dos | ||
comp.sys.mac.hardware mac apple powerbook | ||
comp.graphics graphics image gif animation tiff | ||
comp.windows.x window motif xterm sun windows | ||
comp.os.ms.windows.misc windows dos microsoft ms driver drivers card printer | ||
rec.autos car ford auto toyota honda nissan bmw | ||
rec.motorcycles bike motorcycle yamaha | ||
rec.sport.baseball baseball ball hitter | ||
rec.sport.hockey hockey wings espn | ||
sci.electronics circuit electronics radio signal battery | ||
sci.crypt encryption key crypto algorithm security | ||
sci.med doctor medical disease medicine patient | ||
sci.space space orbit moon earth sky solar | ||
misc.forsale sale offer shipping forsale sell price brand obo |
26 changes: 26 additions & 0 deletions
dataless-classifier/data/hierarchies/20newsgroups/labelDesc_Kws_simple.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
politics politics politics guns politics mideast politics | ||
religion religion atheism society religion christianity christian religion | ||
computer computer computer systems ibm pc hardware computer systems mac macintosh apple hardware computer graphics computer windows x windowsx computer os operating system microsoft windows | ||
autos.sports autos.sports cars motorcycles baseball hockey | ||
science science science electronics science cryptography medicine science space | ||
sales sales for sale discount | ||
talk.politics.guns politics guns | ||
talk.politics.mideast politics mideast | ||
talk.politics.misc politics | ||
alt.atheism atheism | ||
soc.religion.christian society religion christianity christian | ||
talk.religion.misc religion | ||
comp.sys.ibm.pc.hardware computer systems ibm pc hardware | ||
comp.sys.mac.hardware computer systems mac macintosh apple hardware | ||
comp.graphics computer graphics | ||
comp.windows.x computer windows x windowsx | ||
comp.os.ms.windows.misc computer os operating system microsoft windows | ||
rec.autos cars | ||
rec.motorcycles motorcycles | ||
rec.sport.baseball baseball | ||
rec.sport.hockey hockey | ||
sci.electronics science electronics | ||
sci.crypt science cryptography | ||
sci.med science medicine | ||
sci.space science space | ||
misc.forsale for sale discount |
7 changes: 7 additions & 0 deletions
dataless-classifier/data/hierarchies/20newsgroups/parentChildIdMap.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
politics religion computer autos.sports science sales | ||
politics talk.politics.guns talk.politics.mideast talk.politics.misc | ||
religion alt.atheism soc.religion.christian talk.religion.misc | ||
computer comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.graphics comp.windows.x comp.os.ms.windows.misc | ||
autos.sports rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey | ||
science sci.electronics sci.crypt sci.med sci.space | ||
sales misc.forsale |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
<?xml version="1.0" encoding="UTF-8"?> | ||
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> | ||
<parent> | ||
<artifactId>illinois-cogcomp-nlp</artifactId> | ||
<groupId>edu.illinois.cs.cogcomp</groupId> | ||
<version>4.0.0</version> | ||
</parent> | ||
|
||
<modelVersion>4.0.0</modelVersion> | ||
|
||
<artifactId>illinois-datalessclassification</artifactId> | ||
<name>Illinois Dataless Classifier</name> | ||
<description>Classifies Text into the given label hierarchy from just the textual label descriptions</description> | ||
|
||
<dependencies> | ||
<dependency> | ||
<groupId>org.cogcomp</groupId> | ||
<artifactId>cogcomp-datastore</artifactId> | ||
<version>1.9.10</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>edu.illinois.cs.cogcomp</groupId> | ||
<artifactId>illinois-core-utilities</artifactId> | ||
<version>4.0.0</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>edu.illinois.cs.cogcomp</groupId> | ||
<artifactId>illinois-tokenizer</artifactId> | ||
<version>4.0.0</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>org.slf4j</groupId> | ||
<artifactId>slf4j-log4j12</artifactId> | ||
<version>1.7.12</version> | ||
<optional>true</optional> | ||
</dependency> | ||
<dependency> | ||
<groupId>net.sf.jung</groupId> | ||
<artifactId>jung-api</artifactId> | ||
<version>2.0.1</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>net.sf.jung</groupId> | ||
<artifactId>jung-graph-impl</artifactId> | ||
<version>2.0.1</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>commons-cli</groupId> | ||
<artifactId>commons-cli</artifactId> | ||
<version>1.4</version> | ||
</dependency> | ||
</dependencies> | ||
|
||
<build> | ||
<pluginManagement> | ||
<plugins> | ||
<plugin> | ||
<groupId>org.apache.maven.plugins</groupId> | ||
<artifactId>maven-surefire-plugin</artifactId> | ||
<version>2.20.1</version> | ||
<configuration> | ||
<!--increase the memory requirements if you need more space--> | ||
<argLine>-Xmx15g</argLine> | ||
</configuration> | ||
</plugin> | ||
</plugins> | ||
</pluginManagement> | ||
</build> | ||
|
||
|
||
</project> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
#mvn compile | ||
#mvn dependency:copy-dependencies | ||
nice java -Xmx10g -cp "./target/*:./target/dependency/*" edu.illinois.cs.cogcomp.datalessclassification.ta.ESADatalessAnnotator "$@" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
#mvn compile | ||
#mvn dependency:copy-dependencies | ||
nice java -Xmx10g -cp "./target/*:./target/dependency/*" edu.illinois.cs.cogcomp.datalessclassification.ta.W2VDatalessAnnotator "$@" |