normalization()

Normalize Term-Document Matrix for Temporal Analysis

The normalization() function standardizes raw keyword frequencies to enable meaningful comparisons across time periods with varying corpus sizes. It applies one of five normalization strategies, each suited to different analytical goals.


🔹 Function Definition

normalization(
  data,
  normty = "nc",
  sc = 1000,
  nnlty = "V",
  p_asy = TRUE
)

🎯 Purpose

Raw frequency counts are influenced by corpus size variations across time periods. A term appearing 100 times in a corpus of 10,000 words is much more significant than the same count in a corpus of 1,000,000 words.
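The arithmetic behind this comparison can be checked directly in base R (plain arithmetic, no package functions involved):

```r
# Same raw count, very different corpus sizes
rate_small <- 100 / 10000 * 1000   # 10 occurrences per 1000 tokens
rate_large <- 100 / 1e6 * 1000     # 0.1 occurrences per 1000 tokens
```

Expressed as rates per 1000 tokens, the first term is a hundred times more prominent than the second, even though the raw counts are identical.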

The normalization() function addresses this by:

  1. Adjusting for corpus size — Accounts for different document/token counts per year
  2. Enabling fair comparisons — Makes frequencies comparable across time periods
  3. Highlighting relative importance — Emphasizes terms’ relative prominence
  4. Preparing for modeling — Creates standardized data for temporal smoothing and clustering

This function is typically applied after importData() and before temporal modeling and visualization.


βš™οΈ Arguments

  • data (List; required) — A list object returned by importData(), containing the TDM and corpus metadata.
  • normty (Character; default "nc") — Normalization method to apply. Options:
    • "nc": Column normalization by corpus size
    • "nchi": Chi-square normalization
    • "nM": Maximum frequency normalization
    • "nmM": Min-max normalization
    • "nnl": Non-linear normalization
  • sc (Numeric; default 1000) — Scaling factor applied after normalization. The default is 1000 for "nc" and "nM", otherwise 1.
  • nnlty (Character; default "V") — Asymmetry measure for non-linear normalization (used only when normty = "nnl"):
    • "V": Variance-based asymmetry
    • "M": Mean-median-based asymmetry
  • p_asy (Logical; default TRUE) — If TRUE and normty = "nnl", includes the asymmetry coefficients in the output.

📊 Normalization Methods

1. Column Normalization ("nc")

Formula: Normalized frequency = (raw frequency / total tokens in year) × scaling factor

Use when:
  • You want to account for varying corpus sizes across time periods
  • You are comparing frequencies across years with different document counts
  • You need a standard normalization for most temporal analyses

Example:

corpus_norm <- normalization(corpus_data, normty = "nc", sc = 1000)

This converts raw frequencies to “per 1000 tokens” rates.
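The formula can be sketched in base R on a toy matrix (illustrative only; the actual function operates on the list returned by importData()):

```r
# Toy term-document matrix: terms in rows, years in columns
tdm <- matrix(c(10, 40, 5, 50), nrow = 2,
              dimnames = list(c("algorithm", "data"), c("2020", "2021")))
tokens <- c("2020" = 5000, "2021" = 20000)  # total tokens per year

# Divide each column by its corpus size, then scale to "per 1000 tokens"
tdm_nc <- sweep(tdm, 2, tokens, "/") * 1000
```

Here "algorithm" drops from 2 to 0.25 per 1000 tokens, a decline that its raw counts (10 vs. 5) understate because the 2021 corpus is four times larger.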


2. Chi-square Normalization ("nchi")

Formula: Based on chi-square decomposition using row masses and expected frequencies

Use when:
  • You want to emphasize deviations from expected frequencies
  • You are performing correspondence analysis-style normalization
  • You want to highlight terms that appear more or less often than expected

Example:

corpus_norm <- normalization(corpus_data, normty = "nchi")

This method is rooted in correspondence analysis theory and emphasizes relative contributions.
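The underlying idea can be illustrated with the standard correspondence-analysis decomposition in base R (a sketch of the general approach; the package's exact formula may differ):

```r
# Toy term-document matrix: terms in rows, years in columns
tdm <- matrix(c(10, 40, 5, 50), nrow = 2,
              dimnames = list(c("algorithm", "data"), c("2020", "2021")))

P <- tdm / sum(tdm)              # correspondence matrix (proportions)
row_mass <- rowSums(P)
col_mass <- colSums(P)
E <- outer(row_mass, col_mass)   # expected proportions under independence
res <- (P - E) / sqrt(E)         # standardized (chi-square) residuals

# Sanity check: the residuals recover the Pearson chi-square statistic
sum(res^2) * sum(tdm)
```

Positive residuals mark term-year cells that occur more often than independence would predict, negative residuals the reverse.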


3. Maximum Frequency Normalization ("nM")

Formula: Normalized frequency = (raw frequency / maximum frequency in row) × scaling factor

Use when:
  • You want to focus on relative peaks within each term’s trajectory
  • You are comparing temporal patterns regardless of absolute frequency
  • You want all terms scaled to the same maximum (useful for clustering)

Example:

corpus_norm <- normalization(corpus_data, normty = "nM", sc = 1000)

Each term’s maximum frequency becomes the scaling reference point.
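In base R the formula amounts to the following (illustrative sketch only):

```r
# Toy term-document matrix: terms in rows, years in columns
tdm <- matrix(c(10, 40, 5, 50), nrow = 2,
              dimnames = list(c("algorithm", "data"), c("2020", "2021")))

row_max <- apply(tdm, 1, max)                 # each term's peak frequency
tdm_nM <- sweep(tdm, 1, row_max, "/") * 1000  # every term now peaks at 1000
```

Because every trajectory peaks at the same value, terms with very different absolute frequencies become directly comparable in shape.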


4. Min-Max Normalization ("nmM")

Formula: Normalized frequency = (raw frequency - min) / (max - min)

Use when:
  • You need all values scaled to the [0, 1] range
  • You are comparing terms with vastly different frequency ranges
  • You are preparing data for specific clustering algorithms

Example:

corpus_norm <- normalization(corpus_data, normty = "nmM")

All normalized values fall between 0 (minimum) and 1 (maximum) for each term.
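A base-R sketch of the formula (note that apply() over rows returns its results in columns, hence the transpose):

```r
# Toy term-document matrix: terms in rows, years in columns
tdm <- matrix(c(10, 40, 30, 50), nrow = 2,
              dimnames = list(c("algorithm", "data"), c("2020", "2021")))

# Min-max per term (row); a constant row would give 0/0, so filter those first
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
tdm_nmM <- t(apply(tdm, 1, minmax))
```

Each row now runs from 0 at the term's minimum year to 1 at its maximum year.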


5. Non-linear Normalization ("nnl")

Formula: Incorporates asymmetry coefficients based on distribution shape

Use when:
  • Term frequencies show strong skewness or asymmetry
  • You want to account for variance or mean-median differences
  • Advanced modeling requires distribution-aware normalization

Asymmetry types:
  • "V": Variance-based (accounts for spread)
  • "M": Mean-median-based (accounts for skewness)

Example:

# Variance-based
corpus_norm <- normalization(corpus_data, normty = "nnl", nnlty = "V")

# Mean-median-based
corpus_norm <- normalization(corpus_data, normty = "nnl", nnlty = "M", p_asy = TRUE)
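The exact coefficients are internal to the package. As an illustration of the mean-median idea behind "M", Pearson's second skewness coefficient measures asymmetry through the gap between mean and median (a hypothetical example, not the package's formula):

```r
# A right-skewed frequency trajectory for one term
x <- c(1, 1, 2, 2, 3, 20)

# Pearson's second skewness coefficient: positive for right skew
skew_M <- 3 * (mean(x) - median(x)) / sd(x)
skew_M > 0   # TRUE: the spike at 20 pulls the mean above the median
```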

📦 Output

Returns a list with the same structure as input, with these updates:

  • tdm (tibble) — Term-document matrix with normalized frequencies
  • corpus_info (tibble) — Unchanged corpus metadata
  • norm (logical) — Set to TRUE to indicate normalization has been applied
  • normty (character) — The normalization method used
  • year_cols (numeric) — Unchanged column indices for yearly data
  • zone (character) — Unchanged frequency zones
  • colors (character) — Unchanged color palette
  • p_asy (numeric, optional) — Asymmetry coefficients (only for the "nnl" method with p_asy = TRUE)

💡 Usage Examples

Basic Column Normalization

library(cccc)

# Import data
corpus <- importData("tdm.csv", "corpus_info.csv")

# Apply standard column normalization (per 1000 tokens)
corpus_norm <- normalization(corpus, normty = "nc", sc = 1000)

# Check normalization status
corpus_norm$norm  # Should be TRUE
corpus_norm$normty  # Should be "nc"

Comparing Different Methods

# Column normalization
corpus_nc <- normalization(corpus, normty = "nc")

# Chi-square normalization
corpus_nchi <- normalization(corpus, normty = "nchi")

# Max normalization
corpus_nM <- normalization(corpus, normty = "nM")

# Compare visualizations
curvePlot(corpus_nc, keywords = c("algorithm", "data"))
curvePlot(corpus_nchi, keywords = c("algorithm", "data"))

Non-linear Normalization with Asymmetry

# Apply non-linear normalization with variance-based asymmetry
corpus_nnl <- normalization(
  corpus, 
  normty = "nnl", 
  nnlty = "V", 
  p_asy = TRUE
)

# View asymmetry coefficients
head(corpus_nnl$p_asy)

Custom Scaling Factor

# Normalize to "per 10,000 tokens" instead of per 1000
corpus_norm <- normalization(corpus, normty = "nc", sc = 10000)

📊 Choosing the Right Method

  • "nc" — Best for: general temporal analysis. Pros: simple and interpretable. Cons: may not handle extreme skewness well.
  • "nchi" — Best for: deviation-focused analysis. Pros: emphasizes unexpected patterns. Cons: less intuitive to interpret.
  • "nM" — Best for: pattern comparison. Pros: well suited to clustering. Cons: loses absolute frequency information.
  • "nmM" — Best for: preparing data for algorithms. Pros: bounded [0, 1] range. Cons: sensitive to outliers.
  • "nnl" — Best for: asymmetric distributions. Pros: accounts for skewness. Cons: more complex to apply and interpret.

Recommendation: Start with "nc" for most analyses. Use "nchi" for correspondence-style analysis, "nM" for clustering, and "nnl" for distributions with strong asymmetry.


🔗 Typical Workflow

# 1. Import data
corpus <- importData("tdm.csv", "corpus_info.csv")

# 2. Explore raw frequencies
rowMassPlot(corpus)
colMassPlot(corpus)

# 3. Normalize
corpus_norm <- normalization(corpus, normty = "nc", sc = 1000)

# 4. Visualize normalized trajectories
curvePlot(corpus_norm, keywords = c("algorithm", "network", "data"))
facetPlot(corpus_norm, zone = "all")

# 5. Proceed with smoothing
smooth_params <- smoothingSelection(corpus_norm)

⚠️ Important Notes

Re-normalization

Once a dataset is normalized (norm = TRUE), applying normalization() again will normalize the already-normalized data. Always start from raw data if you need to try different normalization methods.
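A small guard based on the norm flag described in the Output section makes this mistake hard to commit (the helper name is hypothetical):

```r
# Hypothetical guard; `norm` is the flag that normalization() stores in its output
check_raw <- function(corpus) {
  if (isTRUE(corpus$norm))
    stop("Data already normalized; re-import the raw data first.")
  invisible(corpus)
}

check_raw(list(norm = FALSE))       # raw data: passes silently
try(check_raw(list(norm = TRUE)))   # already normalized: signals an error
```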

Zero Frequencies

Terms with zero frequencies in all time periods will remain zero after normalization. Consider filtering these out before normalization if needed.
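Such filtering can be sketched on a plain matrix (in practice you would subset the tdm element of the imported list before normalizing):

```r
# Toy TDM with one term that never occurs
tdm <- matrix(c(0, 3, 0, 5, 0, 0), nrow = 3,
              dimnames = list(c("network", "data", "unused"), c("2020", "2021")))

keep <- rowSums(tdm) > 0        # TRUE for terms seen at least once
tdm[keep, , drop = FALSE]       # drops the all-zero "unused" row
```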

Interpretation

After normalization, frequency values no longer represent raw counts. Always specify the normalization method and scaling factor when reporting results.


📚 See Also

  • importData() — Import and prepare data (required before normalization)
  • curvePlot() — Visualize normalized temporal curves
  • facetPlot() — Compare normalized trajectories across zones
  • smoothingSelection() — Next step: select smoothing parameters
 

© 2025 The cccc Team | Developed within the RIND Project