Readability Measures
These are mainly based on word length or syllable counts (and Dale-Chall also uses a reference word list).
There's a lot of redundancy among them. Maybe worth using Flesch Reading Ease, Coleman-Liau, and New Dale-Chall (and consider removing any that don't add anything to the model).
Also consider cooking up our own custom Dale-Chall model.
Developed under a US Navy contract, specifically to evaluate technical documents.
Higher score = better readability; it therefore penalizes a high words-per-sentence ratio and high-syllable words.
The resulting numerical score can be converted to a grade level (apparently established using existing reference documents/classifications?) or stand alone (better for our purposes).
This is an especially ubiquitous test.
[https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests]
# pip install textstat
import textstat
textstat.flesch_reading_ease(text)
## For reference:
# 90-100 : Very Easy
# 80-89 : Easy
# 70-79 : Fairly Easy
# 60-69 : Standard
# 50-59 : Fairly Difficult
# 30-49 : Difficult
# 0-29 : Very Confusing
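The formula behind this score is simple enough to sketch in plain Python. This is a minimal illustration, not textstat's implementation; the vowel-run syllable counter is a rough stand-in for a real one.

```python
import re

def naive_syllables(word):
    # Very rough: count runs of vowels. Real syllable counters handle
    # silent 'e', diphthongs, etc.; this is only for illustration.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(naive_syllables(w) for w in words)
    # Penalizes long sentences (words/sentence) and long words (syllables/word)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Short sentences made of short words land near the top of the 0-100 scale, as expected.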
Sort of a shortcut version of the above, using US grade levels as the output scale.
(We should just use the Reading Ease score, since including this too would be redundant.)
# textstat package again
textstat.flesch_kincaid_grade(text)
Estimates the years of "formal" education a person would need to understand a text (range is generally 6-17).
Specified to need around 100 words of text (probably similar for the others).
"Complex" words are defined as those with 3 or more syllables.
Similar to the above, but with a subtle difference in how to interpret it.
[https://en.wikipedia.org/wiki/Gunning_fog_index]
# textstat
textstat.gunning_fog(text)
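A minimal sketch of the fog formula, assuming the same kind of rough vowel-run syllable counting as above (textstat's version will differ in the details):

```python
import re

def naive_syllables(word):
    # Rough stand-in for a real syllable counter
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    # "Complex" words = 3+ syllables, per the definition above
    complex_words = sum(1 for w in words if naive_syllables(w) >= 3)
    return 0.4 * (len(words) / sentences + 100 * complex_words / len(words))
```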
Also approximates a US grade level.
The difference is that it uses characters instead of syllables (characters can be counted more reliably than syllables).
So far I would guess all these tests to be very highly correlated, and thus probably not of additional use in a predictive model.
[https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index]
textstat.coleman_liau_index(text)
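Since it needs only character, word, and sentence counts, the index is easy to sketch (illustrative only; tokenization choices will shift the result slightly):

```python
import re

def coleman_liau(text):
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = 100 * letters / len(words)    # avg letters per 100 words
    S = 100 * sentences / len(words)  # avg sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8
```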
Simple Measure of Gobbledygook
Designed to be a more accurate substitute for the others that is also easy to calculate mentally.
Another that gives US grade level.
Requires 30+ sentences.
Polysyllables are the same as the "complex" words in Gunning fog, i.e. 3+ syllables.
The main idea is to count the number of polysyllables in 30 sentences, take the square root, and add 3 (this is an approximation of the full formula).
textstat.smog_index(text)
[https://en.wikipedia.org/wiki/SMOG]
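The mental-math approximation described above can be sketched as follows. The vowel-run syllable counter is a rough stand-in for a real one, and the polysyllable count is scaled to a 30-sentence sample:

```python
import math
import re

def naive_syllables(word):
    # Rough stand-in for a real syllable counter
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_estimate(text):
    # Approximation from the notes: sqrt(polysyllables in 30 sentences) + 3
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    poly = sum(1 for w in re.findall(r"[A-Za-z]+", text)
               if naive_syllables(w) >= 3)
    return math.sqrt(poly * 30 / sentences) + 3
```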
Another grade-level estimate; really just a differently weighted version of the Flesch-Kincaid, using characters instead of syllables, like Coleman-Liau.
textstat.automated_readability_index(text)
[https://en.wikipedia.org/wiki/Automated_readability_index]
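A sketch of the ARI formula (illustrative; textstat may count characters slightly differently):

```python
import re

def automated_readability_index(text):
    words = re.findall(r"[A-Za-z]+", text)
    chars = sum(len(w) for w in words)  # letters only, for simplicity
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43
```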
Designed specifically for younger children's texts (under 4th grade)
Also gives grade level
Uses a list of "everyday" words as a reference; if a word isn't on this list, it's considered "unfamiliar".
textstat in Python doesn't do this, but I don't think we'll make use of it anyway.
[https://en.wikipedia.org/wiki/Spache_readability_formula]
Like the last, but not just for younger kids. Still gives a grade level.
Also uses a list of familiar words - the 3,000 most commonly used English words. (We could potentially get the 3,000 most used words in the Stack Exchange group and use that instead?)
textstat.dale_chall_readability_score(text)
[https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula]
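The custom-list idea could be prototyped along these lines. The tiny FAMILIAR set here is a hypothetical stand-in for the real ~3,000-word list (or a Stack Exchange-derived one); the constants are those of the New Dale-Chall formula:

```python
import re

# Hypothetical stand-in for the ~3,000-word familiar list; the idea above
# would swap in the most frequent words from the Stack Exchange corpus.
FAMILIAR = {"the", "cat", "sat", "on", "mat", "a", "is"}

def dale_chall(text, familiar=FAMILIAR):
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    pct_difficult = 100 * sum(1 for w in words if w not in familiar) / len(words)
    score = 0.1579 * pct_difficult + 0.0496 * (len(words) / sentences)
    if pct_difficult > 5:
        score += 3.6365  # adjustment in the New Dale-Chall formula
    return score
```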
The measures above are simple. There may be other, more machine-learning-based methods to quantify readability.
Apparently a promising measure (mentioned in Bailin Book)
Appears to be a function to do this more complicated procedure here: [https://github.com/gsaqui/automate-coh-metrix]
(I glanced at the code and am not really sure what it's doing - it may just be putting data into a spreadsheet for further analysis.)
It apparently outputs a spreadsheet... not sure how we would implement this in a predictive model. Need to test some output.
This would be a possibility, but actually, instead of fitting a readability model, we might as well just fit our predictive model on the same items.
Schwarm and Ostendorf used n-gram language models as a "low-cost automatic approximation of syntactic and semantic analysis." (Continue investigating their process - adapting it is probably a good bet for incorporating readability.)
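One way to sketch their idea: train a simple smoothed bigram model per difficulty level and score new text by average log-probability, assigning the level whose model fits best. This is only a toy illustration of the general approach, not their actual setup:

```python
import math
from collections import Counter

def train(tokens):
    # Bigram and unigram counts from a training corpus (one difficulty level)
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def avg_logprob(tokens, bigrams, unigrams):
    # Average per-bigram log-probability with add-one smoothing; a text would
    # be assigned the difficulty level whose model scores it highest.
    V = len(unigrams)
    lp = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
             for a, b in zip(tokens, tokens[1:]))
    return lp / max(1, len(tokens) - 1)
```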