Readability Measures
These are mainly based on word length or syllable counts (and Dale-Chall also uses a reference word list).
There's a lot of redundancy among them. Maybe worth using Flesch Reading Ease, Coleman-Liau, and New Dale-Chall (and consider removing any that don't add anything to the model).
Also consider cooking up our own custom Dale-Chall model.
Developed under a US Navy contract, specifically to evaluate technical documents.
Higher score = better readability; it therefore penalizes a high words-per-sentence ratio and high-syllable words.
The resulting numerical score can be converted to a grade level (apparently established using existing reference documents/classifications?) or stand alone (better for our purposes).
This is an especially ubiquitous test.
[https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests]
# pip install textstat
import textstat
textstat.flesch_reading_ease(text)
## For reference:
# 90-100 : Very Easy
# 80-89 : Easy
# 70-79 : Fairly Easy
# 60-69 : Standard
# 50-59 : Fairly Difficult
# 30-49 : Difficult
# 0-29 : Very Confusing
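The formula behind this score is simple enough to sketch in plain Python. This is a minimal illustration, not textstat's implementation; the vowel-run syllable counter is a rough stand-in for a real one.

```python
import re

def naive_syllables(word):
    # Very rough: count runs of vowels. Real syllable counters handle
    # silent 'e', diphthongs, etc.; this is only for illustration.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(naive_syllables(w) for w in words)
    # Penalizes long sentences (words/sentence) and long words (syllables/word)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Short sentences made of short words land near the top of the 0-100 scale, as expected.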
Sort of a shortcut version of the above, using US grade levels as the output scale.
(We should just use the Reading Ease score, since including this too would be redundant.)
# textstat package again
textstat.flesch_kincaid_grade(text)
Estimates the years of "formal" education a person would need to understand a text (range is generally 6-17).
Specified to need around 100 words of text (probably similar for the others).
"Complex" words are defined as those with 3 or more syllables.
Similar to the above, but with a subtle difference in how to interpret it.
[https://en.wikipedia.org/wiki/Gunning_fog_index]
# textstat
textstat.gunning_fog(text)
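A minimal sketch of the fog formula, assuming the same kind of rough vowel-run syllable counting as above (textstat's version will differ in the details):

```python
import re

def naive_syllables(word):
    # Rough stand-in for a real syllable counter
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    # "Complex" words = 3+ syllables, per the definition above
    complex_words = sum(1 for w in words if naive_syllables(w) >= 3)
    return 0.4 * (len(words) / sentences + 100 * complex_words / len(words))
```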
Also approximates a US grade level.
The difference is that it uses characters instead of syllables (characters can be counted more reliably than syllables).
So far I would guess all these tests to be very highly correlated, and thus probably not of additional use in a predictive model.
[https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index]
textstat.coleman_liau_index(text)
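Since it needs only character, word, and sentence counts, the index is easy to sketch (illustrative only; tokenization choices will shift the result slightly):

```python
import re

def coleman_liau(text):
    words = re.findall(r"[A-Za-z]+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = 100 * letters / len(words)    # avg letters per 100 words
    S = 100 * sentences / len(words)  # avg sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8
```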
Simple Measure of Gobbledygook
Designed to be a more accurate substitute for the others that is also easy to calculate mentally.
Another that gives US grade level.
Requires 30+ sentences.
Polysyllables are the same as the "complex" words in Gunning fog, i.e. 3+ syllables.
The main idea is to count the number of polysyllables in 30 sentences, take the square root, and add 3 (this is an approximation of the full formula).
textstat.smog_index(text)
[https://en.wikipedia.org/wiki/SMOG]
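The mental-math approximation described above can be sketched as follows. The vowel-run syllable counter is a rough stand-in for a real one, and the polysyllable count is scaled to a 30-sentence sample:

```python
import math
import re

def naive_syllables(word):
    # Rough stand-in for a real syllable counter
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_estimate(text):
    # Approximation from the notes: sqrt(polysyllables in 30 sentences) + 3
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    poly = sum(1 for w in re.findall(r"[A-Za-z]+", text)
               if naive_syllables(w) >= 3)
    return math.sqrt(poly * 30 / sentences) + 3
```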
Another grade-level estimate; really just a differently weighted version of the Flesch-Kincaid, using characters instead of syllables, like Coleman-Liau.
textstat.automated_readability_index(text)
[https://en.wikipedia.org/wiki/Automated_readability_index]
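A sketch of the ARI formula (illustrative; textstat may count characters slightly differently):

```python
import re

def automated_readability_index(text):
    words = re.findall(r"[A-Za-z]+", text)
    chars = sum(len(w) for w in words)  # letters only, for simplicity
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / sentences) - 21.43
```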
Designed specifically for younger children's texts (under 4th grade)
Also gives grade level
Uses a list of "everyday" words as a reference; if a word isn't on this list, it's considered "unfamiliar".
textstat in Python doesn't do this, but I don't think we'll make use of it anyway.
[https://en.wikipedia.org/wiki/Spache_readability_formula]
Like the last, but not just for younger kids. Still gives a grade level.
Also uses a list of familiar words - the 3,000 most commonly used English words. (We could potentially get the 3,000 most used words in the Stack Exchange group and use that instead?)
textstat.dale_chall_readability_score(text)
[https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula]
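The custom-list idea could be prototyped along these lines. The tiny FAMILIAR set here is a hypothetical stand-in for the real ~3,000-word list (or a Stack Exchange-derived one); the constants are those of the New Dale-Chall formula:

```python
import re

# Hypothetical stand-in for the ~3,000-word familiar list; the idea above
# would swap in the most frequent words from the Stack Exchange corpus.
FAMILIAR = {"the", "cat", "sat", "on", "mat", "a", "is"}

def dale_chall(text, familiar=FAMILIAR):
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    pct_difficult = 100 * sum(1 for w in words if w not in familiar) / len(words)
    score = 0.1579 * pct_difficult + 0.0496 * (len(words) / sentences)
    if pct_difficult > 5:
        score += 3.6365  # adjustment in the New Dale-Chall formula
    return score
```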
The measures above are simple. There may be other, more machine-learning-based methods to quantify readability.
Apparently a promising measure (mentioned in Bailin Book)
Appears to be a function to do this more complicated procedure here: [https://github.com/gsaqui/automate-coh-metrix]
(I glanced at the code and am not really sure what it's doing - it may just be putting data into a spreadsheet for further analysis.)
It apparently outputs a spreadsheet... not sure how we would implement this in a predictive model. Need to test some output.
This would be a possibility, but actually, instead of fitting a readability model, we might as well just fit our predictive model on the same items.
Schwarm and Ostendorf used n-gram language models as a "low-cost automatic approximation of syntactic and semantic analysis." (Continue investigating their process - adapting it is probably a good bet for incorporating readability.)
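One way to sketch their idea: train a simple smoothed bigram model per difficulty level and score new text by average log-probability, assigning the level whose model fits best. This is only a toy illustration of the general approach, not their actual setup:

```python
import math
from collections import Counter

def train(tokens):
    # Bigram and unigram counts from a training corpus (one difficulty level)
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def avg_logprob(tokens, bigrams, unigrams):
    # Average per-bigram log-probability with add-one smoothing; a text would
    # be assigned the difficulty level whose model scores it highest.
    V = len(unigrams)
    lp = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
             for a, b in zip(tokens, tokens[1:]))
    return lp / max(1, len(tokens) - 1)
```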