colMassPlot()
Visualize Temporal Dimensions of Corpus Structure
The colMassPlot() function creates a multi-line plot showing how four key corpus metrics evolve over time. The plot reveals corpus growth patterns, data collection trends, and temporal coverage.
Function Definition
colMassPlot(
data,
sc = c(1, 10, 10, 1),
r = 1,
textty = "text",
themety = "light",
size_b = 2.5,
x_lab = "years"
)
Purpose
Understanding the temporal structure of your corpus is essential for interpreting frequency trends. The colMassPlot() function helps you:
- Visualize corpus growth: See how the corpus expands over time
- Identify collection patterns: Detect periods of intensive or sparse data collection
- Assess temporal balance: Evaluate whether time periods are comparably represented
- Detect anomalies: Spot unusual spikes or drops in corpus size
- Contextualize term frequencies: Understand how corpus size affects frequency patterns
- Validate data quality: Ensure corpus metadata is consistent and complete
This function is typically used after importData() and alongside rowMassPlot() for comprehensive data exploration before normalization.
Arguments
Argument | Type | Default | Description |
---|---|---|---|
data | List | required | A list object returned by importData(), containing the term-document matrix (TDM) and corpus metadata. |
sc | Numeric vector | c(1, 10, 10, 1) | Scaling factors for the four metrics, in order: (1) nDoc (number of documents); (2) dimCorpus (total tokens); (3) Csum (sum of keyword frequencies); (4) Mcf (maximum keyword frequency). Adjust to make the lines visually comparable. |
r | Integer | 1 | Interval for x-axis label thinning. r = 2 shows every 2nd year, r = 5 shows every 5th year, etc. Useful for long time series. |
textty | Character | "text" | Label for the unit of analysis in the legend (e.g., "text", "document", "paper", "article"). |
themety | Character | "light" | Visual theme for the plot: "light" for a light background (default) or "dark" for a dark background. |
size_b | Numeric | 2.5 | Line thickness for the plot. Increase for bolder lines, decrease for finer lines. |
x_lab | Character | "years" | Label for the x-axis (e.g., "Years", "Time Period", "Publication Year"). |
Corpus Metrics Explained
1. nDoc: Number of Documents
The count of documents/texts in the corpus for each time period.
Insights:
- Shows data collection intensity
- Reveals archival coverage patterns
- Indicates periods of high/low publication activity
2. dimCorpus: Total Tokens
The total number of words/tokens in all documents for each time period.
Insights:
- Represents overall corpus size
- Important for normalization decisions
- Shows writing/publication volume trends
3. Csum: Sum of Keyword Frequencies
The total frequency of all keywords across all documents in each time period.
Insights:
- Indicates keyword density in the corpus
- Shows overall vocabulary coverage
- Helps assess keyword selection adequacy
4. Mcf: Maximum Keyword Frequency
The highest frequency of any single keyword in each time period.
Insights:
- Identifies periods dominated by specific terms
- Shows potential outliers or dominant topics
- Indicates vocabulary concentration patterns
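The four definitions above can be made concrete with a small base R sketch. The toy term-document matrix, the metadata table, and all numbers are invented for illustration. The layout (keywords in rows, years in columns) is an assumption, not necessarily the structure that importData() returns.

```r
# Toy term-document matrix: rows are keywords, columns are years.
# All names and numbers are invented for illustration.
tdm <- matrix(
  c(12, 30, 45,
     5,  8, 20,
     0,  2,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("network", "cluster", "corpus"),
                  c("2001", "2002", "2003"))
)

# Toy corpus metadata: documents and total tokens per year (assumed layout).
corpus_info <- data.frame(
  year      = c(2001, 2002, 2003),
  nDoc      = c(10, 14, 25),
  dimCorpus = c(12000, 18500, 40100)
)

metrics <- data.frame(
  year      = corpus_info$year,
  nDoc      = corpus_info$nDoc,           # documents per year
  dimCorpus = corpus_info$dimCorpus,      # total tokens per year
  Csum      = unname(colSums(tdm)),       # sum of keyword frequencies per year
  Mcf       = unname(apply(tdm, 2, max))  # highest single-keyword frequency per year
)
print(metrics)
```

In this sketch nDoc and dimCorpus come straight from the metadata, while Csum and Mcf are column-wise summaries of the keyword matrix.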
Understanding Scaling Factors
Different metrics have vastly different scales:
- nDoc might range from 10 to 100
- dimCorpus might range from 10,000 to 1,000,000
- Csum might range from 5,000 to 500,000
- Mcf might range from 50 to 5,000
Scaling makes visual comparison possible by bringing all metrics to similar ranges.
Default scaling: c(1, 10, 10, 1)
- nDoc: ×1 (no scaling)
- dimCorpus: ×10 (reduced by a factor of 10)
- Csum: ×10 (reduced by a factor of 10)
- Mcf: ×1 (no scaling)
Adjust scaling if lines are too far apart or overlapping too much.
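The effect of the scaling vector can be sketched in base R. This assumes that each metric is divided by its factor, as the "reduced by a factor of 10" wording suggests; the actual internals of colMassPlot() may differ. The data are invented, and the power-of-ten heuristic for choosing factors is only one possible approach, not part of the package.

```r
# Hypothetical yearly values for the four metrics (invented numbers).
metrics <- data.frame(
  nDoc      = c(10, 14, 25),
  dimCorpus = c(120, 185, 401),
  Csum      = c(170, 400, 740),
  Mcf       = c(12, 30, 45)
)
sc <- c(1, 10, 10, 1)

# Assumption: column j is divided by sc[j] before plotting.
rescaled <- sweep(as.matrix(metrics), 2, sc, "/")
print(rescaled)

# One possible heuristic (not from the package): pick the power of ten
# that brings each metric's maximum closest to the maximum of nDoc.
col_max   <- apply(metrics, 2, max)
suggested <- 10^round(log10(col_max / col_max["nDoc"]))
print(suggested)
```

For these toy values the heuristic happens to reproduce the default c(1, 10, 10, 1). With real corpora the token counts are usually much larger, so larger factors such as 100 or 1000 may be needed.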
Output
Returns a ggplot2 object with the following characteristics:
Element | Description |
---|---|
X-axis | Time periods (years) from the corpus metadata |
Y-axis | Rescaled values of the four corpus metrics |
Lines | Four colored lines representing each metric's temporal evolution |
Legend | Shows metric names with scaling factors (e.g., "nDoc (×1)", "dimCorpus (×10)") |
Theme | Light or dark background based on the themety parameter |
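Because the return value is a ggplot2 object, it should accept the usual ggplot2 layers and can be saved with ggsave(). The sketch below uses a stand-in plot built from invented data rather than real colMassPlot() output. The title text and file name are illustrative.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Stand-in for the object returned by colMassPlot(); invented data.
  toy <- data.frame(year = 2001:2005, value = c(10, 14, 25, 22, 30))
  p <- ggplot(toy, aes(x = year, y = value)) + geom_line()

  # Add a title, then save a high-resolution copy to a temporary folder.
  p2 <- p + labs(title = "Temporal structure of the corpus")
  out_file <- file.path(tempdir(), "colMassPlot.png")
  ggsave(out_file, plot = p2, width = 8, height = 5, dpi = 300)
}
```

With a real corpus, p would be replaced by the result of colMassPlot(corpus).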
Usage Examples
Basic Usage
library(cccc)
# Import data
corpus <- importData("tdm.csv", "corpus_info.csv")
# Create temporal plot with default settings
colMassPlot(corpus)
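A customized call might look like the following sketch, which uses only the arguments documented in the table above. The argument values are illustrative, and the file names are the same placeholders as in the basic example. The guard lets the snippet run without error when the package or files are absent.

```r
# Illustrative argument values; adjust them to your own corpus.
plot_args <- list(
  sc      = c(1, 100, 100, 1),   # stronger scaling for token-based metrics
  r       = 5,                   # label every 5th year
  textty  = "article",           # unit of analysis shown in the legend
  themety = "dark",              # dark background
  size_b  = 1.5,                 # thinner lines than the default 2.5
  x_lab   = "Publication Year"   # x-axis label
)

if (requireNamespace("cccc", quietly = TRUE) &&
    all(file.exists("tdm.csv", "corpus_info.csv"))) {
  library(cccc)
  corpus <- importData("tdm.csv", "corpus_info.csv")
  p <- do.call(colMassPlot, c(list(corpus), plot_args))
  print(p)
}
```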
Interpreting the Plot
Common Patterns
1. Parallel Growth
All four lines increase proportionally over time.
Pattern: all metrics rise together
Interpretation: Consistent corpus expansion with stable composition
2. Diverging Trends
Lines separate or converge over time.
Pattern: nDoc and dimCorpus rise but Csum does not keep pace
Interpretation: Corpus growing but keyword density decreasing (vocabulary diversification)
3. Spikes or Drops
Sudden changes in one or more metrics.
Pattern: nDoc shows a sudden spike in a specific year
Interpretation: Intensive data collection period or archival event
4. Plateau Patterns
Metrics level off after initial growth.
Pattern: early growth, then flat
Interpretation: Complete historical coverage or saturation
What to Look For
1. Temporal Coverage
- Are all time periods equally represented?
- Are there gaps or sparse periods?
- Does coverage increase toward recent years?
2. Growth Patterns
- Linear growth (steady increase)
- Exponential growth (accelerating increase)
- Irregular patterns (varying collection intensity)
3. Proportionality
- Do nDoc and dimCorpus grow together? (expected)
- Does Csum follow dimCorpus? (indicates keyword coverage)
- Are there periods where Mcf spikes? (term dominance)
4. Anomalies
- Sudden drops (potential data quality issues)
- Isolated spikes (special events or archival additions)
- Plateaus (collection boundaries)
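Some of these checks can also be run numerically alongside the plot. The base R sketch below uses an invented table of yearly metrics. The column layout and the 100% change threshold are assumptions chosen for illustration, not values used by the package.

```r
# Invented yearly metrics with a missing year (2004) and a spike (2005).
metrics <- data.frame(
  year      = c(2001, 2002, 2003, 2005, 2006),
  nDoc      = c(10, 12, 11, 40, 13),
  dimCorpus = c(12000, 15000, 13500, 52000, 16000),
  Csum      = c(1700, 2100, 1900, 5200, 2200)
)

# Temporal coverage: years absent from the sequence.
missing_years <- setdiff(seq(min(metrics$year), max(metrics$year)), metrics$year)

# Proportionality: tokens per document and keyword share of all tokens.
metrics$tokens_per_doc <- metrics$dimCorpus / metrics$nDoc
metrics$keyword_share  <- metrics$Csum / metrics$dimCorpus

# Anomalies: relative change in nDoc from the previous available period.
metrics$nDoc_change <- c(NA, diff(metrics$nDoc) / head(metrics$nDoc, -1))
spike_years <- metrics$year[which(abs(metrics$nDoc_change) > 1)]  # arbitrary threshold

print(missing_years)
print(spike_years)
print(metrics)
```

A falling keyword_share while dimCorpus rises corresponds to the diverging-trends pattern described earlier.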
Use Cases
1. Data Quality Assessment
Verify that corpus metadata is complete and consistent across time periods.
2. Normalization Decision
Determine whether normalization is necessary based on corpus size variation.
3. Historical Context
Understand how corpus collection reflects historical publication/archival patterns.
4. Methodology Documentation
Create figures for research papers showing corpus characteristics.
5. Comparative Studies
Compare temporal structures of different corpora or subcorpora.
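For the normalization decision in particular, a rough numeric summary of corpus size variation can complement the plot. The sketch below uses invented numbers. It shows a plain relative frequency per 1,000 tokens, which is only one simple form of normalization and not necessarily what normalization() computes.

```r
# Invented total tokens per year and raw counts for one hypothetical keyword.
dimCorpus   <- c(12000, 15000, 13500, 52000, 16000)
keyword_raw <- c(24, 30, 27, 104, 32)

# How much does corpus size vary across periods?
size_ratio <- max(dimCorpus) / min(dimCorpus)   # largest vs smallest period
size_cv    <- sd(dimCorpus) / mean(dimCorpus)   # coefficient of variation

# Relative frequency per 1,000 tokens removes the corpus-size effect.
keyword_per_1000 <- keyword_raw / dimCorpus * 1000

print(round(c(size_ratio = size_ratio, size_cv = size_cv), 2))
print(keyword_per_1000)
```

Here the raw counts suggest a peak in the fourth period, but the relative frequency is constant, so the peak only reflects a larger corpus in that period.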
Tips & Best Practices
- Always check this plot before normalization to understand corpus growth patterns
- Experiment with scaling to find the most informative visualization
- Compare with rowMassPlot() for comprehensive data exploration
- Document temporal patterns in your methodology section
- Use r parameter for long time series (>30 years) to avoid label clutter
- Save high-resolution versions for publications
- Consider normalization if you see large variations in corpus size across time
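The label thinning controlled by r can be pictured with a short base R sketch. It assumes that the first label is kept and then every r-th one after it; the exact rule used inside colMassPlot() may differ.

```r
years <- 1990:2024   # a 35-year series, long enough to clutter the axis
r <- 5

# Assumption: keep position 1, then every r-th position.
keep   <- seq(1, length(years), by = r)
labels <- ifelse(seq_along(years) %in% keep, as.character(years), "")
shown  <- years[keep]

print(shown)
```

With r = 5, only 7 of the 35 years carry a label, which keeps the axis readable.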
See Also
- importData(): Import corpus data and metadata
- rowMassPlot(): Visualize keyword frequency distribution
- normalization(): Normalize frequencies (often needed when corpus size varies)
- curvePlot(): Visualize individual keyword trajectories