smoothingSelection()
Select Optimal Smoothing Parameters for Temporal Curves
The smoothingSelection() function identifies the best smoothing parameters for modeling keyword frequency trajectories over time. It systematically evaluates combinations of spline degrees and smoothing penalties to find the optimal balance between smoothness and fidelity to the data.
🔹 Function Definition
smoothingSelection(
  data,
  lambda_seq = NULL,
  degrees = NULL,
  penalty_type = "m-2",
  normty = NULL,
  plot = TRUE,
  verbose = TRUE
)
🎯 Purpose
Raw keyword frequency curves are often noisy and irregular, making it difficult to identify underlying temporal trends. Smoothing helps reveal the true signal by reducing noise while preserving meaningful patterns.
The smoothingSelection() function helps you:
- Find optimal smoothing level — Determine how much smoothing is appropriate for your data
- Select spline complexity — Choose the right degree of B-spline for modeling
- Balance fit vs. smoothness — Avoid both overfitting (too wiggly) and oversmoothing (too flat)
- Compare penalty strategies — Evaluate different derivative-based penalization approaches
- Make informed decisions — Use diagnostic criteria (GCV, OCV) for objective parameter selection
- Visualize trade-offs — See how different parameters affect model quality
This function uses penalized B-spline regression with cross-validation to find parameters that generalize well to unseen data.
🧮 Statistical Background
B-splines
B-splines (basis splines) are piecewise polynomial functions that provide flexible curve fitting:
- Degree (m): Controls the polynomial order (linear, quadratic, cubic, etc.)
- Higher degrees = smoother, more flexible curves
- Lower degrees = simpler, more constrained curves
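As a minimal illustration, here is a cubic B-spline basis built with base R's `splines` package (this is the general idea only, not the cccc internals):

```r
# Minimal sketch with base R's splines package (not the cccc internals):
# a cubic B-spline basis evaluated over 20 yearly time points.
library(splines)

t <- 1:20                                          # time points (e.g. 20 years)
B <- bs(t, df = 8, degree = 3, intercept = TRUE)   # cubic basis, 8 functions

dim(B)       # 20 x 8: one row per time point, one column per basis function
rowSums(B)   # each row sums to 1 (B-splines form a partition of unity)
```

A fitted curve is then a weighted sum of these basis columns, so more columns (or a higher degree) means a more flexible curve.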
Smoothing Parameter (λ)
Controls the trade-off between:
- Fidelity to data (small λ): Curve follows data points closely but may be noisy
- Smoothness (large λ): Curve is smooth but may miss important features
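A quick sketch of the trade-off on toy data, using base R's smooth.spline() (its spar argument is smooth.spline's own log-scale smoothing knob, whereas cccc works with lambda_seq directly):

```r
# Sketch of the lambda trade-off on simulated data using base R's
# smooth.spline() (spar is its own log-scale smoothing knob).
set.seed(3)
x <- seq(0, 1, length.out = 40)
y <- sin(2 * pi * x) + rnorm(40, sd = 0.2)

rough_fit  <- smooth.spline(x, y, spar = 0.1)   # small penalty: wiggly, high df
smooth_fit <- smooth.spline(x, y, spar = 1.2)   # large penalty: flat, low df

c(rough = rough_fit$df, smooth = smooth_fit$df) # df shrinks as smoothing grows
```

The effective degrees of freedom (df) drop as the penalty grows, which is exactly the df diagnostic smoothingSelection() reports across its λ grid.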
Penalty Types
- "m-2": Penalizes the (m-2)th derivative (default, adaptive to the spline degree)
- "2": Penalizes the second derivative (curvature penalty)
- "3": Penalizes the third derivative (wiggliness penalty)
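In the P-spline view, penalizing the k-th derivative corresponds (approximately) to a penalty matrix t(D) %*% D, where D is the k-th order difference matrix acting on the basis coefficients. A small sketch (variable names here are illustrative, not cccc internals):

```r
# Derivative penalties via finite differences (P-spline style sketch).
K  <- 8                                   # number of basis coefficients
D2 <- diff(diag(K), differences = 2)      # "2": curvature penalty
D3 <- diff(diag(K), differences = 3)      # "3": wiggliness penalty

P2 <- crossprod(D2)                       # K x K penalty matrix in lambda * t(b) %*% P2 %*% b
D2 %*% (1:K)                              # all zeros: straight-line coefficients go unpenalized
```

This is why the second-derivative penalty leaves linear trends untouched while damping curvature, and the third-derivative penalty allows curvature but damps rapid wiggles.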
Diagnostic Criteria
GCV (Generalized Cross-Validation):
- Estimates prediction error without actually splitting the data
- Lower GCV = better model performance
- Preferred when computational efficiency matters

OCV (Ordinary Cross-Validation):
- Leave-one-out cross-validation
- More computationally intensive but more accurate
- Lower OCV = better generalization
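Both criteria can be seen side by side on toy data via base R's smooth.spline(), whose cv argument switches the reported cv.crit between GCV and leave-one-out CV (again a sketch, not the cccc computation):

```r
# GCV vs leave-one-out CV on simulated data with base R's smooth.spline().
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)

fit_gcv <- smooth.spline(x, y, cv = FALSE)  # cv.crit holds the GCV score
fit_ocv <- smooth.spline(x, y, cv = TRUE)   # cv.crit holds the leave-one-out score

c(gcv = fit_gcv$cv.crit, ocv = fit_ocv$cv.crit)  # lower = better, for both
```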
⚙️ Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| data | List | required | A list object returned by importData() or normalization(), containing the TDM and corpus metadata. |
| lambda_seq | Numeric vector | seq(-6, 9, 0.25) | Sequence of log₁₀(λ) values to evaluate, spanning λ from 10⁻⁶ to 10⁹. Finer sequences provide more precision but take longer. |
| degrees | Integer vector | 1:8 | Range of B-spline degrees (m) to test. Common choices: 3 (cubic), 4 (quartic), 5 (quintic). |
| penalty_type | Character | "m-2" | Penalty applied to derivatives: "m-2" = adaptive penalty (default); "2" = second derivative (curvature); "3" = third derivative (wiggliness). |
| normty | Character | NULL | Label for the normalization method used (for documentation purposes). Automatically detected if the data was normalized. |
| plot | Logical | TRUE | If TRUE, produces diagnostic plots showing df, SSE, GCV, and OCV across λ values. |
| verbose | Logical | TRUE | If TRUE, prints progress messages for each spline degree being evaluated. |
📦 Output
Returns a list with comprehensive smoothing diagnostics:
| Element | Type | Description |
|---|---|---|
| results | data.frame | Complete grid of all (m, λ) combinations with diagnostic measures (df, sse, gcv, ocv). |
| summary_optimal | data.frame | Summary table showing optimal GCV and OCV values for each tested spline degree. |
| optimal_gcv | data.frame | Subset of results containing the λ values that minimize GCV for each degree. |
| optimal_ocv | data.frame | Subset of results containing the λ values that minimize OCV for each degree. |
| plots | list | List of ggplot2 objects showing the evolution of df, sse, gcv, and ocv across λ (if plot = TRUE). |
| summary_panel | grob | Combined graphical summary of optimal smoothing parameters. |
| degree | numeric | Current spline degree m. |
| penalty_type | character | Penalization type used. |
| call | call | Function call, for reproducibility. |
💡 Usage Examples
Basic Usage
library(cccc)
# Import and normalize data
corpus <- importData("tdm.csv", "corpus_info.csv")
corpus_norm <- normalization(corpus, normty = "nc")

# Find optimal smoothing parameters (default settings)
smooth_params <- smoothingSelection(corpus_norm)

# View optimal parameters
smooth_params$summary_optimal
Custom Lambda Sequence
# Use finer lambda sequence for more precision
smooth_params <- smoothingSelection(
  corpus_norm,
  lambda_seq = seq(-6, 9, 0.1)  # Finer steps
)
Specific Degree Range
# Test only cubic and quartic splines
smooth_params <- smoothingSelection(
  corpus_norm,
  degrees = 3:4
)
Different Penalty Type
# Use second derivative penalty
smooth_params <- smoothingSelection(
  corpus_norm,
  penalty_type = "2"
)
📊 Interpreting the Output
1. Summary Optimal Table
smooth_params$summary_optimal
Shows for each degree:
- degree: Spline degree (m)
- optimal_lambda_gcv: λ that minimizes GCV
- min_gcv: Minimum GCV value
- optimal_lambda_ocv: λ that minimizes OCV
- min_ocv: Minimum OCV value
How to use: Choose the degree with the lowest GCV or OCV value.
2. Diagnostic Plots
Four plots are generated (if plot = TRUE):
Degrees of Freedom (df) Plot
Shows effective model complexity across λ values.
- Low df: Highly smoothed (simple model)
- High df: Less smoothed (complex model)

Sum of Squared Errors (SSE) Plot
Shows fit to the data.
- Low SSE: Better fit to observed data
- High SSE: Poorer fit (oversmoothed)

GCV Plot
Shows estimated prediction error.
- Minimum: Optimal balance point
- U-shaped curve: Too much or too little smoothing both increase error

OCV Plot
Shows true cross-validation error.
- Similar interpretation to GCV
- More reliable but computationally intensive
3. Typical Pattern
GCV/OCV
 high |  \              /
      |   \            /
      |    \          /
      |     \________/
 low  |          ↑ optimal
      +-----------------------
        small λ   →   large λ
        (rough)       (smooth)
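This U-shape can be reproduced on toy data by computing GCV(λ) = n · SSE / (n − df)² by hand over a log₁₀(λ) grid of smooth.spline() fits (a sketch of the general phenomenon, not the cccc grid search):

```r
# Trace the GCV curve over a log10(lambda) grid on simulated data.
set.seed(2)
x <- seq(0, 1, length.out = 60)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.25)
n <- length(y)

loglam <- seq(-6, 2, by = 0.5)
gcv <- sapply(loglam, function(ll) {
  fit <- smooth.spline(x, y, lambda = 10^ll, all.knots = TRUE)
  sse <- sum((y - fit$y)^2)          # fit$y: fitted values at the sorted x
  n * sse / (n - fit$df)^2           # GCV(lambda)
})

loglam[which.min(gcv)]               # the minimum sits away from the oversmoothed end
```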
🎯 Choosing Parameters
Step 1: Compare Degrees
# Look at summary table
smooth_params$summary_optimal
# Find degree with minimum GCV
best_idx <- which.min(smooth_params$summary_optimal$min_gcv)
best_degree <- smooth_params$summary_optimal$degree[best_idx]
Step 2: Extract Optimal Lambda
# Get optimal lambda for best degree
library(dplyr)
optimal_params <- smooth_params$optimal_gcv %>%
  filter(degree == best_degree)
Step 3: Validate with Plots
- Check that GCV curve has clear minimum (U-shape)
- Verify df values are reasonable (not too high or too low)
- Compare GCV and OCV to ensure consistency
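One of these checks can be automated: if the optimal λ sits on the boundary of the searched grid, the true minimum may lie outside it. A small helper sketch (check_interior is hypothetical, not part of cccc):

```r
# Warn when the selected log10(lambda) lies on the boundary of the grid,
# which suggests widening lambda_seq and re-running smoothingSelection().
check_interior <- function(opt_loglambda, lambda_seq) {
  rng <- range(lambda_seq)
  if (opt_loglambda <= rng[1] || opt_loglambda >= rng[2])
    warning("Optimal lambda lies on the grid boundary; widen lambda_seq.")
  opt_loglambda > rng[1] && opt_loglambda < rng[2]
}

check_interior(2.25, seq(-6, 9, 0.25))   # TRUE: an interior, trustworthy minimum
```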
📈 Use Cases
1. Initial Analysis
First time analyzing a corpus—need to find appropriate smoothing level.
2. Method Comparison
Testing different penalty strategies to see which suits your data.
3. Parameter Sensitivity
Understanding how robust your results are to smoothing choices.
4. Publication Preparation
Documenting systematic parameter selection for research papers.
5. Multi-Corpus Studies
Finding consistent parameters across different corpora.
💡 Tips & Best Practices
- Start with defaults — They work well for most corpora
- Always check plots — Visual inspection catches issues that numbers don’t
- Compare GCV and OCV — Consistency indicates reliable results
- Don’t oversearch — Finer lambda sequences rarely improve results significantly
- Consider computation time — Balance precision with practical constraints
- Document your choice — Save summary_optimal table for methods section
- Prefer simpler models — If two degrees are similar, choose the lower one
- Use normalized data — Always normalize before smoothing selection
📚 See Also
normalization()
— Apply before smoothing selectionoptimalSmoothing()
— Next step: select final degree and penaltyplotSuboptimalFits()
— Visualize different smoothing optionscurvePlot()
— Visualize raw trajectories