smoothingSelection()
Select Optimal Smoothing Parameters for Temporal Curves
The smoothingSelection() function identifies the best smoothing parameters for modeling keyword frequency trajectories over time. It systematically evaluates combinations of spline degrees and smoothing penalties to find the optimal balance between smoothness and fidelity to the data.
🔹 Function Definition
smoothingSelection(
  data,
  lambda_seq = NULL,
  degrees = NULL,
  penalty_type = "m-2",
  normty = NULL,
  plot = TRUE,
  verbose = TRUE
)
🎯 Purpose
Raw keyword frequency curves are often noisy and irregular, making it difficult to identify underlying temporal trends. Smoothing helps reveal the true signal by reducing noise while preserving meaningful patterns.
The smoothingSelection() function helps you:
- Find optimal smoothing level — Determine how much smoothing is appropriate for your data
- Select spline complexity — Choose the right degree of B-spline for modeling
- Balance fit vs. smoothness — Avoid both overfitting (too wiggly) and oversmoothing (too flat)
- Compare penalty strategies — Evaluate different derivative-based penalization approaches
- Make informed decisions — Use diagnostic criteria (GCV, OCV) for objective parameter selection
- Visualize trade-offs — See how different parameters affect model quality
This function uses penalized B-spline regression with cross-validation to find parameters that generalize well to unseen data.
🧮 Statistical Background
B-splines
B-splines (basis splines) are piecewise polynomial functions that provide flexible curve fitting:
- Degree (m): Controls the polynomial order (linear, quadratic, cubic, etc.)
- Higher degrees = smoother, more flexible curves
- Lower degrees = simpler, more constrained curves
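As a minimal illustration, here is a cubic B-spline basis built with base R's `splines` package (this is the general idea only, not the cccc internals):

```r
# Minimal sketch with base R's splines package (not the cccc internals):
# a cubic B-spline basis evaluated over 20 yearly time points.
library(splines)

t <- 1:20                                          # time points (e.g. 20 years)
B <- bs(t, df = 8, degree = 3, intercept = TRUE)   # cubic basis, 8 functions

dim(B)       # 20 x 8: one row per time point, one column per basis function
rowSums(B)   # each row sums to 1 (B-splines form a partition of unity)
```

A fitted curve is then a weighted sum of these basis columns, so more columns (or a higher degree) means a more flexible curve.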
Smoothing Parameter (λ)
Controls the trade-off between:
- Fidelity to data (small λ): Curve follows data points closely but may be noisy
- Smoothness (large λ): Curve is smooth but may miss important features
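A quick sketch of the trade-off on toy data, using base R's smooth.spline() (its spar argument is smooth.spline's own log-scale smoothing knob, whereas cccc works with lambda_seq directly):

```r
# Sketch of the lambda trade-off on simulated data using base R's
# smooth.spline() (spar is its own log-scale smoothing knob).
set.seed(3)
x <- seq(0, 1, length.out = 40)
y <- sin(2 * pi * x) + rnorm(40, sd = 0.2)

rough_fit  <- smooth.spline(x, y, spar = 0.1)   # small penalty: wiggly, high df
smooth_fit <- smooth.spline(x, y, spar = 1.2)   # large penalty: flat, low df

c(rough = rough_fit$df, smooth = smooth_fit$df) # df shrinks as smoothing grows
```

The effective degrees of freedom (df) drop as the penalty grows, which is exactly the df diagnostic smoothingSelection() reports across its λ grid.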
Penalty Types
- "m-2": Penalizes the (m-2)th derivative (default, adaptive to the spline degree)
- "2": Penalizes the second derivative (curvature penalty)
- "3": Penalizes the third derivative (wiggliness penalty)
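In the P-spline view, penalizing the k-th derivative corresponds (approximately) to a penalty matrix t(D) %*% D, where D is the k-th order difference matrix acting on the basis coefficients. A small sketch (variable names here are illustrative, not cccc internals):

```r
# Derivative penalties via finite differences (P-spline style sketch).
K  <- 8                                   # number of basis coefficients
D2 <- diff(diag(K), differences = 2)      # "2": curvature penalty
D3 <- diff(diag(K), differences = 3)      # "3": wiggliness penalty

P2 <- crossprod(D2)                       # K x K penalty matrix in lambda * t(b) %*% P2 %*% b
D2 %*% (1:K)                              # all zeros: straight-line coefficients go unpenalized
```

This is why the second-derivative penalty leaves linear trends untouched while damping curvature, and the third-derivative penalty allows curvature but damps rapid wiggles.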
Diagnostic Criteria
GCV (Generalized Cross-Validation):
- Estimates prediction error without actually splitting the data
- Lower GCV = better model performance
- Preferred when computational efficiency matters

OCV (Ordinary Cross-Validation):
- Leave-one-out cross-validation
- More computationally intensive but more accurate
- Lower OCV = better generalization
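Both criteria can be seen side by side on toy data via base R's smooth.spline(), whose cv argument switches the reported cv.crit between GCV and leave-one-out CV (again a sketch, not the cccc computation):

```r
# GCV vs leave-one-out CV on simulated data with base R's smooth.spline().
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)

fit_gcv <- smooth.spline(x, y, cv = FALSE)  # cv.crit holds the GCV score
fit_ocv <- smooth.spline(x, y, cv = TRUE)   # cv.crit holds the leave-one-out score

c(gcv = fit_gcv$cv.crit, ocv = fit_ocv$cv.crit)  # lower = better, for both
```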
⚙️ Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| data | List | required | A list object returned by importData() or normalization(), containing the TDM and corpus metadata. |
| lambda_seq | Numeric vector | seq(-6, 9, 0.25) | Sequence of log₁₀(λ) values to evaluate, spanning λ from 10⁻⁶ to 10⁹. Finer sequences provide more precision but take longer. |
| degrees | Integer vector | 1:8 | Range of B-spline degrees (m) to test. Common choices: 3 (cubic), 4 (quartic), 5 (quintic). |
| penalty_type | Character | "m-2" | Penalty applied to derivatives: "m-2" = adaptive penalty (default); "2" = second derivative (curvature); "3" = third derivative (wiggliness). |
| normty | Character | NULL | Label for the normalization method used (for documentation purposes). Automatically detected if the data was normalized. |
| plot | Logical | TRUE | If TRUE, produces diagnostic plots showing df, SSE, GCV, and OCV across λ values. |
| verbose | Logical | TRUE | If TRUE, prints progress messages for each spline degree being evaluated. |
📦 Output
Returns a list with comprehensive smoothing diagnostics:
| Element | Type | Description |
|---|---|---|
| results | data.frame | Complete grid of all (m, λ) combinations with diagnostic measures (df, sse, gcv, ocv). |
| summary_optimal | data.frame | Summary table showing optimal GCV and OCV values for each tested spline degree. |
| optimal_gcv | data.frame | Subset of results containing the λ values that minimize GCV for each degree. |
| optimal_ocv | data.frame | Subset of results containing the λ values that minimize OCV for each degree. |
| plots | list | List of ggplot2 objects showing the evolution of df, sse, gcv, and ocv across λ (if plot = TRUE). |
| summary_panel | grob | Combined graphical summary of optimal smoothing parameters. |
| degree | numeric | Current spline degree m. |
| penalty_type | character | Penalization type used. |
| call | call | Function call, for reproducibility. |
💡 Usage Examples
Basic Usage
library(cccc)
# Import and normalize data
corpus <- importData("tdm.csv", "corpus_info.csv")
corpus_norm <- normalization(corpus, normty = "nc")

# Find optimal smoothing parameters (default settings)
smooth_params <- smoothingSelection(corpus_norm)

# View optimal parameters
smooth_params$summary_optimal
Custom Lambda Sequence
# Use finer lambda sequence for more precision
smooth_params <- smoothingSelection(
  corpus_norm,
  lambda_seq = seq(-6, 9, 0.1)  # Finer steps
)
Specific Degree Range
# Test only cubic and quartic splines
smooth_params <- smoothingSelection(
  corpus_norm,
  degrees = 3:4
)
Different Penalty Type
# Use second derivative penalty
smooth_params <- smoothingSelection(
  corpus_norm,
  penalty_type = "2"
)
📊 Interpreting the Output
1. Summary Optimal Table
smooth_params$summary_optimal
Shows for each degree:
- degree: Spline degree (m)
- optimal_lambda_gcv: λ that minimizes GCV
- min_gcv: Minimum GCV value
- optimal_lambda_ocv: λ that minimizes OCV
- min_ocv: Minimum OCV value
How to use: Choose the degree with the lowest GCV or OCV value.
2. Diagnostic Plots
Four plots are generated (if plot = TRUE):
Degrees of Freedom (df) Plot
Shows effective model complexity across λ values.
- Low df: Highly smoothed (simple model)
- High df: Less smoothed (complex model)

Sum of Squared Errors (SSE) Plot
Shows fit to the data.
- Low SSE: Better fit to observed data
- High SSE: Poorer fit (oversmoothed)

GCV Plot
Shows estimated prediction error.
- Minimum: Optimal balance point
- U-shaped curve: Too much or too little smoothing both increase error

OCV Plot
Shows true cross-validation error.
- Similar interpretation to GCV
- More reliable but computationally intensive
3. Typical Pattern
GCV/OCV
 high |  \              /
      |   \            /
      |    \          /
      |     \________/
 low  |          ↑ optimal
      +-----------------------
        small λ   →   large λ
        (rough)       (smooth)
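This U-shape can be reproduced on toy data by computing GCV(λ) = n · SSE / (n − df)² by hand over a log₁₀(λ) grid of smooth.spline() fits (a sketch of the general phenomenon, not the cccc grid search):

```r
# Trace the GCV curve over a log10(lambda) grid on simulated data.
set.seed(2)
x <- seq(0, 1, length.out = 60)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.25)
n <- length(y)

loglam <- seq(-6, 2, by = 0.5)
gcv <- sapply(loglam, function(ll) {
  fit <- smooth.spline(x, y, lambda = 10^ll, all.knots = TRUE)
  sse <- sum((y - fit$y)^2)          # fit$y: fitted values at the sorted x
  n * sse / (n - fit$df)^2           # GCV(lambda)
})

loglam[which.min(gcv)]               # the minimum sits away from the oversmoothed end
```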
🎯 Choosing Parameters
Step 1: Compare Degrees
# Look at summary table
smooth_params$summary_optimal
# Find degree with minimum GCV
best_idx <- which.min(smooth_params$summary_optimal$min_gcv)
best_degree <- smooth_params$summary_optimal$degree[best_idx]
Step 2: Extract Optimal Lambda
# Get optimal lambda for best degree
library(dplyr)
optimal_params <- smooth_params$optimal_gcv %>%
  filter(degree == best_degree)
Step 3: Validate with Plots
- Check that GCV curve has clear minimum (U-shape)
- Verify df values are reasonable (not too high or too low)
- Compare GCV and OCV to ensure consistency
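One of these checks can be automated: if the optimal λ sits on the boundary of the searched grid, the true minimum may lie outside it. A small helper sketch (check_interior is hypothetical, not part of cccc):

```r
# Warn when the selected log10(lambda) lies on the boundary of the grid,
# which suggests widening lambda_seq and re-running smoothingSelection().
check_interior <- function(opt_loglambda, lambda_seq) {
  rng <- range(lambda_seq)
  if (opt_loglambda <= rng[1] || opt_loglambda >= rng[2])
    warning("Optimal lambda lies on the grid boundary; widen lambda_seq.")
  opt_loglambda > rng[1] && opt_loglambda < rng[2]
}

check_interior(2.25, seq(-6, 9, 0.25))   # TRUE: an interior, trustworthy minimum
```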
📈 Use Cases
1. Initial Analysis
First time analyzing a corpus—need to find appropriate smoothing level.
2. Method Comparison
Testing different penalty strategies to see which suits your data.
3. Parameter Sensitivity
Understanding how robust your results are to smoothing choices.
4. Publication Preparation
Documenting systematic parameter selection for research papers.
5. Multi-Corpus Studies
Finding consistent parameters across different corpora.
💡 Tips & Best Practices
- Start with defaults — They work well for most corpora
- Always check plots — Visual inspection catches issues that numbers don’t
- Compare GCV and OCV — Consistency indicates reliable results
- Don’t oversearch — Finer lambda sequences rarely improve results significantly
- Consider computation time — Balance precision with practical constraints
- Document your choice — Save summary_optimal table for methods section
- Prefer simpler models — If two degrees are similar, choose the lower one
- Use normalized data — Always normalize before smoothing selection
📚 See Also
normalization()
— Apply before smoothing selectionoptimalSmoothing()
— Next step: select final degree and penaltyplotSuboptimalFits()
— Visualize different smoothing optionscurvePlot()
— Visualize raw trajectories