
Hierarchical sentiment classifier, single feature classification, erroneous probabilities? #27

Open
bwbaugh opened this issue Mar 25, 2013 · 2 comments


bwbaugh commented Mar 25, 2013

Part of the web interface is supposed to show how each feature would be classified if it were a document of length one. Why does the hierarchical sentiment classifier label these individual features only as neutral or positive, even when the confidence value is less than 0.5?

As an example:

<span style="color: #808080" title="neutral: 48.01%">('__start__', u'This')</span> 
<span style="color: #98c000" title="positive: 60.35%">(u'This',)</span> 
<span style="color: #808080" title="neutral: 45.32%">(u'This', u'is')</span> 
<span style="color: #b3c000" title="positive: 53.17%">(u'is',)</span> 
<span style="color: #808080" title="neutral: 38.86%">(u'is', u'only')</span> 
<span style="color: #c07e00" title="positive: 32.82%">(u'only',)</span> 
<span style="color: #808080" title="neutral: 67.93%">(u'only', u'a')</span> 
<span style="color: #9bc000" title="positive: 59.42%">(u'a',)</span> 
<span style="color: #808080" title="neutral: 51.44%">(u'a', u'test')</span> 
<span style="color: #c0a100" title="positive: 42.09%">(u'test',)</span> 
<span style="color: #808080" title="neutral: 34.62%">(u'test', '__end__')</span> <br>

Current hash: 5fd9baa

bwbaugh added a commit that referenced this issue Mar 25, 2013
For some reason, when classifying a document that consists only of one
feature, the hierarchical classifier only labels the document as neutral
or positive, even when the confidence values are less than 0.5.

I'm still not sure why this is the case; however, I am addressing the
issue for the part of the web interface that shows the influence of each
of the individual features that make up the query by using the
conditional probability of the feature across the labels instead of
trying to classify it. In the meantime, this addresses issue gh-27.

bwbaugh commented Mar 25, 2013

Now, using conditional probabilities only (instead of trying to classify each feature as its own document):

<span style="color: #808080" title="neutral: 51.99%">('__start__', u'This')</span> 
<span style="color: #808080" title="neutral: 52.04%">(u'This',)</span> 
<span style="color: #808080" title="neutral: 54.68%">(u'This', u'is')</span> 
<span style="color: #808080" title="neutral: 56.23%">(u'is',)</span> 
<span style="color: #808080" title="neutral: 61.14%">(u'is', u'only')</span> 
<span style="color: #808080" title="neutral: 56.40%">(u'only',)</span> 
<span style="color: #c0ad00" title="negative: 54.75%">(u'only', u'a')</span> 
<span style="color: #808080" title="neutral: 54.63%">(u'a',)</span> 
<span style="color: #c06500" title="negative: 73.62%">(u'a', u'test')</span> 
<span style="color: #808080" title="neutral: 52.74%">(u'test',)</span> 
<span style="color: #808080" title="neutral: 65.38%">(u'test', '__end__')</span> <br>

Perhaps the prior probabilities skew the overall classification so much that a single feature isn't capable of overcoming them. Now that I think about it, why are we throwing away the confidence value from the classification process and recalculating it from the conditionals? Which is the correct approach?
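The priors-overwhelming-the-feature idea can be checked with a minimal naive Bayes posterior. In the sketch below (all numbers made up for illustration), `'only'` is twice as likely under negative as under neutral, yet a skewed prior still makes neutral win:

```python
# Minimal naive Bayes posterior for a one-feature document, showing how
# a skewed prior can outweigh the feature's own likelihood.
def posterior(priors, likelihoods, features):
    scores = {}
    for label, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods[label].get(f, 1e-9)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

priors = {'neutral': 0.7, 'positive': 0.2, 'negative': 0.1}
likelihoods = {
    'neutral': {'only': 0.01},
    'positive': {'only': 0.005},
    'negative': {'only': 0.02},  # twice as likely as under neutral
}
post = posterior(priors, likelihoods, ['only'])
# neutral: 0.7 * 0.01 = 0.007 beats negative: 0.1 * 0.02 = 0.002,
# so the prior dominates and neutral wins at 0.007 / 0.010 = 0.7
```

With a full document the likelihood terms multiply across many features and can overcome the prior, but a single feature usually cannot.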


bwbaugh commented Mar 25, 2013

When we use the original confidence value from the classification process, we get:

<span style="color: #808080" title="neutral: 50.56%">('__start__', u'This')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'This',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'This', u'is')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'is',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'is', u'only')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'only',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'only', u'a')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'a',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'a', u'test')</span> 
<span style="color: #a5c000" title="positive: 56.90%">(u'test',)</span> 
<span style="color: #808080" title="neutral: 50.56%">(u'test', '__end__')</span> <br>

Why are there only two unique confidence values across all features? Shouldn't the individual conditional probabilities cause at least some variation?
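One explanation consistent with the output above, where every bigram gets exactly 50.56% neutral and every unigram exactly 56.90% positive, is that the feature lookup is failing for single-feature documents, so every label receives the same default likelihood. That default then cancels on normalization and the posterior collapses to a constant prior-like distribution per n-gram order. A hypothetical sketch of that failure mode (the numbers and the empty-lookup cause are assumptions, not confirmed):

```python
# If every feature lookup falls back to the same default likelihood
# under every label, the likelihood term cancels on normalization and
# each single-feature document collapses to the prior distribution:
# one constant confidence, regardless of which feature it is.
def posterior(priors, likelihoods, features, default=1e-9):
    scores = {}
    for label, prior in priors.items():
        score = prior
        for f in features:
            score *= likelihoods.get(label, {}).get(f, default)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

priors = {'neutral': 0.5056, 'positive': 0.3, 'negative': 0.1944}
empty = {}  # simulates a lookup that never finds the feature
for feature in ['This', 'is', 'only', 'a', 'test']:
    post = posterior(priors, empty, [feature])
    # identical posterior for every feature: the priors themselves
```

If something like this is happening, it would also explain why the conditional-probability display (previous comment) shows variation while the classifier's own confidence does not.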
