plotSuboptimalFits()

Visualize Smoothed Curves for Quality Assessment

The plotSuboptimalFits() function creates visual comparisons of raw and smoothed keyword frequency trajectories. It helps you assess how well your chosen smoothing parameters perform across different types of keywords in your corpus.

🔹 Function Definition

plotSuboptimalFits(
  data,
  opt_res,
  n_curves = 9,
  show_zone = FALSE,
  graph = FALSE
)

🎯 Purpose

After selecting optimal smoothing parameters with optimalSmoothing(), it’s crucial to visually validate that the smoothing works well across your entire corpus. This function helps you:

Assess smoothing quality — See how well smoothed curves capture underlying trends
Detect oversmoothing/undersmoothing — Identify cases where smoothing is too aggressive or too weak
Evaluate representativeness — Check performance across different keyword types
Compare raw vs. smoothed — Understand what information is retained vs. filtered
Identify problematic cases — Find keywords that may need special treatment
Build confidence — Validate that parameters work well before applying to full corpus
Create publication figures — Generate high-quality visualizations of smoothing results

The function samples keywords at evenly spaced positions across the RMS residual distribution, so the selected plots show a representative range of smoothing performance.

🧮 How It Works

RMS Residual Sampling

The function computes Root Mean Square (RMS) residuals for all keywords:

RMS = √(Σ(observed - smoothed)² / n)

Then selects n_curves keywords distributed across the RMS range:
- Low RMS: Keywords where smoothing fits very well
- Medium RMS: Typical smoothing performance
- High RMS: Keywords with more complex patterns or poor fits

This ensures you see the full spectrum of smoothing behavior, not just the best cases.
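
As a rough illustration of the RMS formula above, the following base R sketch computes one RMS residual per keyword. The helper `rms_residual()` and the toy matrices are invented for this example; they are not part of the package, and the package's internal computation may differ in detail.

```r
# Hypothetical helper (not a package function): RMS residual of one trajectory.
rms_residual <- function(observed, smoothed) {
  stopifnot(length(observed) == length(smoothed))
  sqrt(mean((observed - smoothed)^2))
}

# Toy data invented for illustration: rows are keywords, columns are time points.
observed <- rbind(
  stable   = c(2, 2, 3, 3, 4),
  volatile = c(1, 6, 2, 7, 3)
)
smoothed <- rbind(
  stable   = c(2, 2.5, 3, 3.5, 4),
  volatile = c(3, 3.5, 4, 4.5, 5)
)

# One RMS value per keyword, named by keyword.
rms <- sapply(rownames(observed), function(k) {
  rms_residual(observed[k, ], smoothed[k, ])
})
rms
```

In this toy case the "volatile" keyword ends up with the larger RMS, so it would sit toward the high end of the distribution that the function samples from.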

⚙️ Arguments

Argument	Type	Default	Description
data	List	required	A list object returned by `importData()` or `normalization()`, containing the TDM and corpus metadata.
opt_res	List	required	The optimal smoothing configuration returned by `optimalSmoothing()`, including spline degree (`m_opt`), penalty type (`penalty_opt`), and lambda (`lambda_opt`).
n_curves	Integer	`9`	Number of keywords to visualize. A perfect square (e.g., 4, 9, 16, 25) gives the cleanest grid layout.
show_zone	Logical	`FALSE`	If `TRUE`, includes the keyword’s frequency zone in plot titles (e.g., “algorithm [Zone 4]”).
graph	Logical	`FALSE`	If `TRUE`, displays plots immediately in the R graphics device. If `FALSE` (default), plots are returned invisibly and can be accessed from the output list.

📦 Output

Returns (invisibly) a list containing visualization objects:

Element	Type	Description
singleKeywordPlot	list	A list of individual `ggplot2` objects, one for each selected keyword. Each plot shows raw (dashed) and smoothed (solid) curves with keyword name in title.
combinedKeywordPlot	patchwork	A combined grid layout displaying all selected keyword plots together. Uses `patchwork` package for arrangement.

Plot characteristics:
- Grey dashed line: Raw frequency trajectory
- Red solid line: Smoothed spline fit
- X-axis: Time periods (years)
- Y-axis: Frequency (raw or normalized, depending on input data)
- Title: Keyword name (and zone if show_zone = TRUE)

💡 Usage Examples

Basic Usage

library(cccc)

# Complete workflow
corpus <- importData("tdm.csv", "corpus_info.csv")
corpus_norm <- normalization(corpus, normty = "nc")

# Find optimal parameters
smooth_m2 <- smoothingSelection(corpus_norm, penalty_type = "m-2", plot = FALSE)
smooth_2 <- smoothingSelection(corpus_norm, penalty_type = "2", plot = FALSE)
optimal <- optimalSmoothing(list("m-2" = smooth_m2, "2" = smooth_2))

# Visualize smoothing quality
fits <- plotSuboptimalFits(corpus_norm, optimal)

# Display combined plot
fits$combinedKeywordPlot

Show Individual Plots

# Create plots
fits <- plotSuboptimalFits(corpus_norm, optimal, n_curves = 9)

# View first individual plot
fits$singleKeywordPlot[[1]]

# View specific keyword plot
fits$singleKeywordPlot[[5]]

# Save individual plots
library(ggplot2)
ggsave("keyword1_fit.png", fits$singleKeywordPlot[[1]], width = 8, height = 5)

Include Zone Information

# Add frequency zone to titles
fits <- plotSuboptimalFits(
  corpus_norm, 
  optimal, 
  n_curves = 9,
  show_zone = TRUE
)

# Now titles show: "algorithm [Zone 4]"
fits$combinedKeywordPlot

More/Fewer Keywords

# Show 4 keywords (2×2 grid)
fits_small <- plotSuboptimalFits(corpus_norm, optimal, n_curves = 4)

# Show 16 keywords (4×4 grid)
fits_large <- plotSuboptimalFits(corpus_norm, optimal, n_curves = 16)

# Show 25 keywords (5×5 grid)
fits_xlarge <- plotSuboptimalFits(corpus_norm, optimal, n_curves = 25)
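
To see why perfect squares tend to give tidy grids, here is a small sketch that assumes a near-square layout rule (columns = ceiling of the square root of the panel count). This rule and the helper `grid_dims()` are assumptions made for illustration; the actual arrangement is handled by `patchwork` and may differ.

```r
# Hypothetical helper: near-square grid dimensions for n panels.
# Assumes columns = ceiling(sqrt(n)); the real layout rule may differ.
grid_dims <- function(n) {
  n_col <- ceiling(sqrt(n))
  n_row <- ceiling(n / n_col)
  c(rows = n_row, cols = n_col, empty = n_row * n_col - n)
}

grid_dims(9)   # 3 rows, 3 cols, 0 empty cells
grid_dims(10)  # 3 rows, 4 cols, 2 empty cells
```

Under this assumption, a non-square count such as 10 leaves unused cells in the last row, which is why 4, 9, 16, or 25 are the suggested values.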

🔍 Interpreting the Plots

What to Look For

✅ Good Smoothing

Raw:     • • •  •  •
           ╱╲  ╱╲
Smooth: ─────────────  (captures trend, reduces noise)

Smoothed curve follows general trend
Reduces noise without losing important features
No systematic bias (doesn’t consistently over/underestimate)

⚠️ Oversmoothing

Raw:     • • •  •  •
           ╱╲╱╲╱╲
Smooth: ───────────── (too flat)

Smoothed curve misses important peaks or valleys
Trajectory appears unnaturally flat
Real patterns are suppressed

⚠️ Undersmoothing

Raw:     • • •  •  •
           ╱╲╱╲╱╲
Smooth:   ╱╲╱╲╱╲  (too wiggly)

Smoothed curve follows noise too closely
Trajectory has spurious fluctuations
Fails to reveal underlying trend

⚠️ Systematic Bias

Raw:     • • •  •  •
         ╱╲╱╲╱╲
Smooth: ─────────── (consistently below/above)

Smoothed curve consistently over- or underestimates
May indicate an inappropriate penalty or normalization
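
The contrast between oversmoothing and undersmoothing can be reproduced on synthetic data. The sketch below uses base R's `stats::smooth.spline()` as a stand-in smoother; it is not the smoother used by this package, and the data, the two lambda values, and the helper names are invented for illustration.

```r
set.seed(42)
years  <- 1:30
signal <- 10 + 5 * sin(years / 5)                  # invented smooth trend
raw    <- signal + rnorm(length(years), sd = 1)    # trend plus noise

# RMS difference between two curves of equal length.
rms <- function(a, b) sqrt(mean((a - b)^2))

# Fit a smoothing spline with a given penalty weight and evaluate it at each year.
# smooth.spline() is only a stand-in for the package's own smoother.
fit_at <- function(lambda) {
  predict(smooth.spline(years, raw, lambda = lambda), years)$y
}

wiggly <- fit_at(1e-10)  # weak penalty: the fit is free to follow the noise
flat   <- fit_at(10)     # heavy penalty: the fit is pulled toward a straight line

c(
  wiggly_vs_raw = rms(raw, wiggly),
  flat_vs_raw   = rms(raw, flat)
)
```

The weakly penalized fit always has the smaller RMS against the raw data, because a weaker penalty lets the curve track the observations more closely. A low RMS alone therefore does not prove a good fit, which is one reason to inspect the plots rather than rely on the number.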

📊 Understanding the Selection

RMS Distribution Sampling

If you have 1000 keywords and request n_curves = 9, the function:

Computes RMS for all 1000 keywords
Sorts keywords by RMS (low to high)
Samples 9 keywords evenly across the distribution:
- Keywords ~1, 125, 250, 375, 500, 625, 750, 875, 1000

This gives you:
- Low RMS keywords: Best-fit cases (smooth, predictable)
- Medium RMS keywords: Typical cases (moderate complexity)
- High RMS keywords: Challenging cases (noisy, volatile)
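
The sampling steps above can be sketched in base R as follows. The helper `pick_evenly()` and the random RMS values are hypothetical; the package may round or break ties differently, so the exact positions it selects could differ slightly.

```r
# Hypothetical helper: pick n_curves keywords at evenly spaced positions
# along the list of keywords sorted by RMS (low to high).
pick_evenly <- function(rms, n_curves = 9) {
  ranked <- order(rms)                                   # keyword indices, low to high RMS
  pos <- round(seq(1, length(ranked), length.out = n_curves))
  ranked[pos]
}

set.seed(1)
rms <- runif(1000)                  # stand-in RMS values for 1000 keywords
picked <- pick_evenly(rms, n_curves = 9)

rank(rms)[picked]                   # positions of the picked keywords in the sorted list
rms[picked]                         # their RMS values, from lowest to highest
```

The first and last picks are the keywords with the lowest and highest RMS, and the remaining picks fall at roughly equal steps in between.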

Why This Matters

Seeing only low-RMS keywords would give false confidence. Seeing only high-RMS keywords would be unnecessarily discouraging. The representative sample shows you the realistic range of smoothing performance.

📈 Use Cases

1. Quality Assurance

Before applying smoothing to the full corpus, verify that it works well.

2. Parameter Validation

Confirm visually that the parameters chosen by optimalSmoothing() produce sensible fits.

3. Method Comparison

Compare smoothing with different parameters side-by-side.

4. Publication Figures

Create figures showing smoothing effectiveness for methods sections.

5. Identifying Outliers

Find keywords with unusual temporal patterns that need special attention.

6. Training Examples

Show collaborators/reviewers how smoothing works on your data.

💡 Tips & Best Practices

Always run this function — Don’t skip visual validation
Use perfect squares for n_curves (4, 9, 16, 25) for clean grid layouts
Start with 9 — Good balance between coverage and readability
Check high-RMS cases — If they look terrible, reconsider parameters
Save the plots — Include in supplementary materials or methods sections
Show to colleagues — Get feedback on whether smoothing looks reasonable
Don’t expect perfection — Some keywords will always be noisy
Compare normalizations — Try different normalization methods if smoothing looks poor

📚 See Also

optimalSmoothing() — Select parameters (prerequisite for this function)
smoothingSelection() — Find optimal λ for penalties
curvePlot() — Visualize specific keyword trajectories
facetPlot() — Create faceted visualizations by zone