7 Readability Features for Your Next Machine Learning Model

In this article, you'll learn how to extract seven useful readability and text-complexity features from raw text using the Textstat Python library.
Topics we will cover include:
- How Textstat can measure the readability and complexity of text for downstream machine learning tasks.
- How to calculate seven commonly used metrics in Python.
- How to interpret these metrics when using them as classification or regression features.
Let's get started.
Introduction
Unlike fully structured tabular data, preparing text data for machine learning models usually involves tasks such as tokenization, embedding, or sentiment analysis. While these undoubtedly yield useful features, the complexity of a text's structure, or its readability for that matter, can also make an incredibly instructive feature for predictive tasks such as classification or regression.
Textstat, as its name suggests, is a lightweight and intuitive Python library that helps you derive statistics from raw text. Its readability scores provide input features that can help a model distinguish between, say, a typical social media post, a children's fable, or a philosophical manuscript.
This article presents seven useful text features that can be easily computed with the Textstat library.
Before we begin, make sure you have installed Textstat:
pip install textstat
Although the analyses described here scale up to large text collections, we will illustrate them with a toy dataset containing a small number of labeled texts. Keep in mind, however, that for downstream machine learning you'll need a dataset large enough for model training.
import pandas as pd
import textstat

# Create a toy dataset with three distinctly different texts
data = {
    'Category': ['Simple', 'Standard', 'Complex'],
    'Text': [
        "The cat sat on the mat. It was a sunny day. The dog played outside.",
        "Machine learning algorithms build a model based on sample data, known as training data, to make predictions.",
        "The thermodynamic properties of the system dictate the spontaneous progression of the chemical reaction, contingent upon the activation energy threshold."
    ]
}
df = pd.DataFrame(data)
print("Environment setup and dataset ready!")
1. Using the Flesch Reading Ease Formula
The first text analysis metric we will examine is the Flesch Reading Ease formula, one of the oldest and most widely used metrics for measuring text readability. It evaluates text based on average sentence length and average number of syllables per word. Although it is conceptually intended to take values in the 0 to 100 range, with 0 meaning unreadable and 100 meaning very easy to read, its formula is not strictly bounded, as the examples below show:
df['Flesch_Ease'] = df['Text'].apply(textstat.flesch_reading_ease)
print("Flesch Reading Ease Scores:")
print(df[['Category', 'Flesch_Ease']])
Output:
Flesch Reading Ease Scores:
   Category  Flesch_Ease
0    Simple   105.880000
1  Standard    45.262353
2   Complex    -8.045000
Here is what the actual formula looks like:
$$ 206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right) $$
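To make the formula concrete, here is a minimal sketch that computes the score from pre-computed counts. Counting syllables reliably is the hard part, which is exactly what Textstat handles for you; the counts below are supplied by hand for illustration.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease computed directly from raw counts."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# "The cat sat on the mat." -> 6 one-syllable words, 1 sentence
print(flesch_reading_ease(6, 1, 6))  # roughly 116.1: a very easy sentence
```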
Unbounded scores like the Flesch Reading Ease can interfere with the proper training of a machine learning model, so rescaling them should be considered among subsequent feature engineering tasks.
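One simple option, sketched below in plain Python, is min-max scaling the column of scores into the [0, 1] range before feeding it to a model; libraries such as scikit-learn offer equivalent transformers.

```python
def min_max_scale(values):
    """Rescale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero on a constant column
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scores = [105.88, 45.262353, -8.045]  # the Flesch scores from above
print(min_max_scale(scores))  # 'Simple' maps to 1.0, 'Complex' to 0.0
```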
2. Computing Flesch-Kincaid Grade Levels
Unlike the Reading Ease score, which assigns a single readability value, the Flesch-Kincaid Grade Level assesses text complexity using a scale similar to US school grade levels. In this case, higher values indicate greater difficulty. However, be warned: this metric is also unbounded, just like the Flesch Reading Ease score, so very simple or very complex texts may yield scores below zero or arbitrarily high values, respectively.
df['Flesch_Grade'] = df['Text'].apply(textstat.flesch_kincaid_grade)
print("Flesch-Kincaid Grade Levels:")
print(df[['Category', 'Flesch_Grade']])
Output:
Flesch-Kincaid Grade Levels:
   Category  Flesch_Grade
0    Simple     -0.266667
1  Standard     11.169412
2   Complex     19.350000
3. Using the SMOG Index
Another measure rooted in assessing text complexity is the SMOG Index, which estimates the years of formal education required to understand a text. It takes into account factors such as the number of polysyllabic words, that is, words with three or more syllables. This formula is somewhat tighter than the previous ones, as it has a mathematical floor of just over 3; even our simplest example text lands on that floor rather than below it.
df['SMOG_Index'] = df['Text'].apply(textstat.smog_index)
print("SMOG Index Scores:")
print(df[['Category', 'SMOG_Index']])
Output:
SMOG Index Scores:
   Category  SMOG_Index
0    Simple    3.129100
1  Standard   11.208143
2   Complex   20.267339
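The standard SMOG formula is simple enough to sketch from raw counts; with zero polysyllabic words it bottoms out at exactly 3.1291, the score our "Simple" text received above. The helper below is an illustrative re-implementation, not Textstat's internal code.

```python
import math

def smog_index(polysyllable_count, sentence_count):
    """Standard SMOG grade formula, computed from raw counts."""
    return 1.0430 * math.sqrt(polysyllable_count * (30 / sentence_count)) + 3.1291

print(smog_index(0, 3))  # -> 3.1291, the metric's floor
```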
4. Calculating the Gunning Fog Index
Like the SMOG Index, the Gunning Fog Index also has a hard floor, this time equal to zero. The reason is straightforward: it measures the percentage of complex words and the average sentence length, both of which are non-negative. It is a popular metric for analyzing business documents and for ensuring that technical or domain-specific content is accessible to a wide audience.
df['Gunning_Fog'] = df['Text'].apply(textstat.gunning_fog)
print("Gunning Fog Index:")
print(df[['Category', 'Gunning_Fog']])
Output:
Gunning Fog Index:
   Category  Gunning_Fog
0    Simple     2.000000
1  Standard    11.505882
2   Complex    26.000000
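The hard floor of zero follows directly from the formula, which combines two non-negative quantities. Here is a minimal sketch from raw counts, with Textstat doing the actual counting in practice; the counts in the example are hypothetical.

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning Fog grade: average sentence length plus percent complex words."""
    avg_sentence_length = total_words / total_sentences
    percent_complex = 100 * complex_words / total_words
    return 0.4 * (avg_sentence_length + percent_complex)

# 20 words per sentence and 10% complex words -> grade 12.0
print(gunning_fog(100, 5, 10))
```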
5. Calculating the Automated Readability Index
The formulas seen so far take into account the number of syllables in each word. In contrast, the Automated Readability Index (ARI) estimates grade levels based on the number of characters in each word. This makes it computationally faster and therefore a better option when handling large text datasets or analyzing streaming text in real time. It is unbounded, so feature scaling is often recommended after calculation.
# Calculate the Automated Readability Index
df['ARI'] = df['Text'].apply(textstat.automated_readability_index)
print("Automated Readability Index:")
print(df[['Category', 'ARI']])
Output:
Automated Readability Index:
   Category        ARI
0    Simple  -2.288000
1  Standard  12.559412
2   Complex  20.127000
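Because ARI needs only character, word, and sentence counts, it can be sketched in a few lines. The function below is an illustrative re-implementation of the standard formula, not Textstat's internal code, and the counts in the example are hypothetical.

```python
def automated_readability_index(characters, words, sentences):
    """ARI from raw counts: characters per word and words per sentence."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# 5 characters per word, 10 words per sentence
print(automated_readability_index(500, 100, 10))
```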
6. Calculating the Dale-Chall Readability Score
Similar to the Gunning Fog Index, the Dale-Chall readability score has a hard floor of zero, as the metric also relies on ratios and percentages. A unique feature of this metric is its vocabulary-driven approach: it works by cross-checking the entire text against a pre-built list of thousands of words familiar to fourth graders. Any word not included in that list is labeled as difficult. If you want to analyze text aimed at children or a wider audience, this metric may be a good reference point.
df['Dale_Chall'] = df['Text'].apply(textstat.dale_chall_readability_score)
print("Dale-Chall Scores:")
print(df[['Category', 'Dale_Chall']])
Output:
Dale-Chall Scores:
   Category  Dale_Chall
0    Simple    4.937167
1  Standard   12.839112
2   Complex   14.102500
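The scoring step itself is lightweight once the vocabulary cross-check has produced a percentage of difficult words. The sketch below uses the standard Dale-Chall formula, including its upward adjustment for texts where more than 5% of words are difficult; the percentages in the example are hypothetical.

```python
def dale_chall(percent_difficult, avg_sentence_length):
    """Dale-Chall score; adjusted upward when >5% of words are difficult."""
    score = 0.1579 * percent_difficult + 0.0496 * avg_sentence_length
    if percent_difficult > 5:
        score += 3.6365  # adjustment for harder vocabulary
    return score

print(dale_chall(0, 10))   # all words on the familiar list
print(dale_chall(20, 10))  # 20% difficult words triggers the adjustment
```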
7. Using the Text Standard as a Consensus Metric
What if you're not sure which specific formula to use? Textstat provides an interpretable consensus metric that combines several of them. The text_standard() function applies multiple readability formulas to the text and returns a consensus grade level. As with most of these metrics, the higher the value, the lower the readability. This is an excellent option for a quick, balanced summary feature to include in downstream modeling tasks.
df['Consensus_Grade'] = df['Text'].apply(lambda x: textstat.text_standard(x, float_output=True))
print("Consensus Grade Levels:")
print(df[['Category', 'Consensus_Grade']])
Output:
Consensus Grade Levels:
   Category  Consensus_Grade
0    Simple              2.0
1  Standard             11.0
2   Complex             18.0
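Conceptually, a consensus like this can be as simple as taking the most common rounded grade across the individual formulas. The sketch below is an illustrative majority vote under that assumption, not Textstat's exact aggregation or tie-breaking logic.

```python
from collections import Counter

def consensus_grade(grades):
    """Most common grade after rounding: an illustrative majority vote."""
    rounded = [round(g) for g in grades]
    return float(Counter(rounded).most_common(1)[0][0])

# Hypothetical grade levels from five different formulas
print(consensus_grade([11.2, 11.4, 12.6, 11.0, 10.9]))  # -> 11.0
```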
Wrapping up
We explored seven metrics for analyzing the readability and complexity of text using the Textstat Python library. Although many of these metrics behave in similar ways, understanding their distinct characteristics and behaviors is key to choosing the right one for your analysis or for downstream machine learning models.


