normalization()
Normalize Term-Document Matrix for Temporal Analysis

The normalization() function standardizes raw keyword frequencies to enable meaningful comparisons across time periods with varying corpus sizes. It applies one of five normalization strategies, each suited to different analytical goals.
🔹 Function Definition

```r
normalization(
  data,
  normty = "nc",
  sc = 1000,
  nnlty = "V",
  p_asy = TRUE
)
```
🎯 Purpose

Raw frequency counts are influenced by corpus size variations across time periods. A term appearing 100 times in a corpus of 10,000 words (10 occurrences per 1,000 tokens) is far more significant than the same count in a corpus of 1,000,000 words (0.1 occurrences per 1,000 tokens).
The normalization() function addresses this by:
- Adjusting for corpus size – accounts for different document/token counts per year
- Enabling fair comparisons – makes frequencies comparable across time periods
- Highlighting relative importance – emphasizes terms' relative prominence
- Preparing for modeling – creates standardized data for temporal smoothing and clustering
This function is typically applied after importData() and before temporal modeling and visualization.
⚙️ Arguments

Argument | Type | Default | Description |
---|---|---|---|
data | List | required | A list object returned by importData(), containing the TDM and corpus metadata. |
normty | Character | "nc" | Normalization method to apply. Options: • "nc": column normalization by corpus size • "nchi": chi-square-like normalization • "nM": maximum frequency normalization • "nmM": min-max normalization • "nnl": non-linear normalization |
sc | Numeric | 1000 | Scaling factor applied after normalization. Default is 1000 for "nc" and "nM", otherwise 1. |
nnlty | Character | "V" | Asymmetry measure for non-linear normalization (only used when normty = "nnl"): • "V": variance-based asymmetry • "M": mean-median-based asymmetry |
p_asy | Logical | TRUE | If TRUE and normty = "nnl", includes asymmetry coefficients in the output. |
📊 Normalization Methods

1. Column Normalization ("nc")
Formula: normalized frequency = (raw frequency / total tokens in year) × scaling factor

Use when:
- You want to account for varying corpus sizes across time periods
- You are comparing frequencies across years with different document counts
- You need the standard normalization for most temporal analyses
Example:

```r
corpus_norm <- normalization(corpus_data, normty = "nc", sc = 1000)
```

This converts raw frequencies to "per 1,000 tokens" rates.
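To see what the scaling does, here is the arithmetic from the Purpose example worked out by hand (plain R, no package functions involved):

```r
# The same raw count becomes a very different per-1,000-token rate
# once corpus size is taken into account:
100 / 10000 * 1000    # corpus of 10,000 tokens   -> 10 per 1,000
100 / 1000000 * 1000  # corpus of 1,000,000 tokens -> 0.1 per 1,000
```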
2. Chi-square Normalization ("nchi")

Formula: based on a chi-square decomposition using row masses and expected frequencies

Use when:
- You want to emphasize deviations from expected frequencies
- You are performing correspondence-analysis-style normalization
- You want to highlight terms that appear more or less often than expected
Example:

```r
corpus_norm <- normalization(corpus_data, normty = "nchi")
```

This method is rooted in correspondence analysis theory and emphasizes relative contributions.
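For intuition, the sketch below computes standardized residuals on a toy matrix; this is the classic correspondence-analysis building block the method is rooted in, though the exact decomposition normalization() applies may differ:

```r
# Toy 2-term x 2-year matrix (illustrative values only)
m <- matrix(c(20, 30, 20, 130), nrow = 2,
            dimnames = list(c("term_a", "term_b"), c("y2001", "y2002")))

# Expected frequencies under independence:
# row total x column total / grand total
expected <- outer(rowSums(m), colSums(m)) / sum(m)

# Standardized residuals: positive = more frequent than expected
(m - expected) / sqrt(expected)
```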
3. Maximum Frequency Normalization ("nM")

Formula: normalized frequency = (raw frequency / maximum frequency in row) × scaling factor

Use when:
- You want to focus on relative peaks within each term's trajectory
- You are comparing temporal patterns regardless of absolute frequency
- You want all terms scaled to the same maximum (useful for clustering)
Example:

```r
corpus_norm <- normalization(corpus_data, normty = "nM", sc = 1000)
```

Each term's maximum frequency becomes the scaling reference point.
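A quick base-R illustration of the row-max scaling on a single term's trajectory (toy numbers, not package code):

```r
x <- c(2, 7, 4, 12, 9)  # one term's yearly frequencies
x / max(x) * 1000       # the peak year maps exactly to the scaling factor
```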
4. Min-Max Normalization ("nmM")

Formula: normalized frequency = (raw frequency - min) / (max - min)

Use when:
- You need all values scaled to the [0, 1] range
- You are comparing terms with vastly different frequency ranges
- You are preparing data for specific clustering algorithms
Example:

```r
corpus_norm <- normalization(corpus_data, normty = "nmM")
```

All normalized values fall between 0 (minimum) and 1 (maximum) for each term.
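The same toy trajectory from above under min-max scaling (base R, for illustration):

```r
x <- c(2, 7, 4, 12, 9)            # one term's yearly frequencies
(x - min(x)) / (max(x) - min(x))  # 0.0 0.5 0.2 1.0 0.7
```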
5. Non-linear Normalization ("nnl")

Formula: incorporates asymmetry coefficients based on distribution shape

Use when:
- Term frequencies show strong skewness or asymmetry
- You want to account for variance or mean-median differences
- Advanced modeling requires distribution-aware normalization
Asymmetry types:
- "V": variance-based (accounts for spread)
- "M": mean-median-based (accounts for skewness)
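As a rough guide to what these two families of measures capture, the base-R sketch below computes the variance and Pearson's mean-median skewness for a skewed trajectory; these are standard textbook measures, not necessarily the exact coefficients cccc computes internally:

```r
x <- c(1, 2, 2, 3, 15)             # a strongly right-skewed trajectory
var(x)                             # "V"-style: spread around the mean
3 * (mean(x) - median(x)) / sd(x)  # "M"-style: Pearson mean-median skewness
```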
Example:

```r
# Variance-based
corpus_norm <- normalization(corpus_data, normty = "nnl", nnlty = "V")

# Mean-median-based
corpus_norm <- normalization(corpus_data, normty = "nnl", nnlty = "M", p_asy = TRUE)
```
📦 Output

Returns a list with the same structure as the input, with these updates:

Element | Type | Description |
---|---|---|
tdm | tibble | Term-document matrix with normalized frequencies |
corpus_info | tibble | Unchanged corpus metadata |
norm | logical | Set to TRUE, indicating normalization has been applied |
normty | character | The normalization method used |
year_cols | numeric | Unchanged column indices for yearly data |
zone | character | Unchanged frequency zones |
colors | character | Unchanged color palette |
p_asy | numeric | (Optional) Asymmetry coefficients (only for the "nnl" method with p_asy = TRUE) |
💡 Usage Examples

Basic Column Normalization

```r
library(cccc)

# Import data
corpus <- importData("tdm.csv", "corpus_info.csv")

# Apply standard column normalization (per 1,000 tokens)
corpus_norm <- normalization(corpus, normty = "nc", sc = 1000)

# Check normalization status
corpus_norm$norm    # Should be TRUE
corpus_norm$normty  # Should be "nc"
```
Comparing Different Methods

```r
# Column normalization
corpus_nc <- normalization(corpus, normty = "nc")

# Chi-square normalization
corpus_nchi <- normalization(corpus, normty = "nchi")

# Max normalization
corpus_nM <- normalization(corpus, normty = "nM")

# Compare visualizations
curvePlot(corpus_nc, keywords = c("algorithm", "data"))
curvePlot(corpus_nchi, keywords = c("algorithm", "data"))
```
Non-linear Normalization with Asymmetry

```r
# Apply non-linear normalization with variance-based asymmetry
corpus_nnl <- normalization(
  corpus,
  normty = "nnl",
  nnlty = "V",
  p_asy = TRUE
)

# View asymmetry coefficients
head(corpus_nnl$p_asy)
```
Custom Scaling Factor

```r
# Normalize to "per 10,000 tokens" instead of per 1,000
corpus_norm <- normalization(corpus, normty = "nc", sc = 10000)
```
📊 Choosing the Right Method

Method | Best For | Pros | Cons |
---|---|---|---|
nc | General temporal analysis | Simple, interpretable | May not handle extreme skewness well |
nchi | Deviation-focused analysis | Emphasizes unexpected patterns | Less intuitive interpretation |
nM | Pattern comparison | Good for clustering | Loses absolute frequency information |
nmM | Algorithm preparation | Bounded [0, 1] range | Sensitive to outliers |
nnl | Asymmetric distributions | Accounts for skewness | More complex; requires understanding of asymmetry measures |
Recommendation: Start with "nc" for most analyses. Use "nchi" for correspondence-style analysis, "nM" for clustering, and "nnl" for distributions with strong asymmetry.
🔄 Typical Workflow

```r
# 1. Import data
corpus <- importData("tdm.csv", "corpus_info.csv")

# 2. Explore raw frequencies
rowMassPlot(corpus)
colMassPlot(corpus)

# 3. Normalize
corpus_norm <- normalization(corpus, normty = "nc", sc = 1000)

# 4. Visualize normalized trajectories
curvePlot(corpus_norm, keywords = c("algorithm", "network", "data"))
facetPlot(corpus_norm, zone = "all")

# 5. Proceed with smoothing
smooth_params <- smoothingSelection(corpus_norm)
```
⚠️ Important Notes

Re-normalization

Once a dataset is normalized (norm = TRUE), applying normalization() again will normalize the already-normalized data. Always start from the raw data if you need to try different normalization methods.
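One way to guard against this in a script is to check the norm flag documented in the Output section before normalizing again (a minimal sketch):

```r
# Refuse to normalize twice; the norm flag is set by normalization()
if (isTRUE(corpus$norm)) {
  stop("Data already normalized; re-import the raw TDM first.")
}
corpus_norm <- normalization(corpus, normty = "nc")
```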
Zero Frequencies
Terms with zero frequencies in all time periods will remain zero after normalization. Consider filtering these out before normalization if needed.
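A minimal filtering sketch, assuming the yearly frequency columns of corpus$tdm are the ones indexed by corpus$year_cols (as described in the Output section):

```r
yearly <- corpus$tdm[, corpus$year_cols]  # yearly frequency columns only
keep   <- rowSums(yearly) > 0             # terms with at least one non-zero year
corpus$tdm <- corpus$tdm[keep, ]
```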
Interpretation
After normalization, frequency values no longer represent raw counts. Always specify the normalization method and scaling factor when reporting results.
🔗 See Also

- importData() – Import and prepare data (required before normalization)
- curvePlot() – Visualize normalized temporal curves
- facetPlot() – Compare normalized trajectories across zones
- smoothingSelection() – Next step: select smoothing parameters