importData()

Import and Validate Corpus Data

The importData() function is the entry point for the cccc analytical pipeline. It imports, validates, and structures your corpus data and metadata into a standardized format ready for temporal analysis.


🔹 Function Definition

importData(
  tdm_file,
  corpus_file,
  sep_tdm = ";",
  sep_corpus_info = ";",
  zone = "stat",
  verbose = TRUE
)

🎯 Purpose

This function performs several critical operations:

  1. Reads data files — Imports term-document matrix (TDM) and corpus metadata from CSV or Excel files
  2. Validates structure — Ensures data format consistency and completeness
  3. Cleans keywords — Standardizes term formatting and removes duplicates
  4. Computes frequencies — Calculates total frequency per term across all time periods
  5. Assigns zones — Classifies terms into frequency zones for stratified analysis
  6. Structures output — Returns a standardized list object for downstream analysis

This function is essential for ensuring data quality and consistency before any temporal modeling or clustering operations.
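
For example, the computed totals can be checked against the yearly columns after import. This is a minimal sanity check, assuming corpus_data is the object returned by importData():

# Sanity check (illustrative): tot_freq should equal the row sum
# of the yearly frequency columns identified by year_cols
yearly <- corpus_data$tdm[, corpus_data$year_cols]
all.equal(corpus_data$tdm$tot_freq, unname(rowSums(yearly)))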


⚙️ Arguments

  • tdm_file (Character, required): Path to the term-document matrix file (CSV or Excel). First column: keywords/terms; remaining columns: frequencies per year.
  • corpus_file (Character, required): Path to the corpus information file (CSV or Excel). Must include columns for years, tokens, and number of documents per year.
  • sep_tdm (Character, default ";"): Column separator for the TDM CSV file. Ignored if the file is in Excel format.
  • sep_corpus_info (Character, default ";"): Column separator for the corpus info CSV file. Ignored if the file is in Excel format.
  • zone (Character, default "stat"): Frequency zone classification strategy. "stat" uses statistical quartiles; "ling" uses linguistic frequency-based zones.
  • verbose (Logical, default TRUE): If TRUE, prints progress messages during import and processing.
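
For reference, here is a call that spells out every argument with its default value; the file paths are placeholders to be replaced with your own data:

corpus_data <- importData(
  tdm_file        = "data/term_document_matrix.csv",  # keywords + yearly frequencies
  corpus_file     = "data/corpus_info.csv",           # year, dimCorpus, nDoc
  sep_tdm         = ";",     # separator of the TDM CSV (ignored for Excel)
  sep_corpus_info = ";",     # separator of the corpus info CSV (ignored for Excel)
  zone            = "stat",  # quartile-based frequency zones
  verbose         = TRUE     # print progress messages
)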

📊 Input File Requirements

Term-Document Matrix (TDM) File

The TDM file must have the following structure:

keyword 2000 2001 2002 …
algorithm 145 178 203 …
data 892 945 1021 …
network 234 267 289 …

Requirements:

  • First column contains keywords/terms
  • Subsequent columns represent years (column names should be year values)
  • Cell values are raw frequencies for each term in each year
  • Supported formats: CSV (with customizable separator) or Excel (.xlsx, .xls)
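
As a concrete illustration, the following sketch builds a small file with this layout from within R (the toy values and the file name tdm_toy.csv are made up for the example):

# Toy term-document matrix: first column = keyword, remaining columns = years
tdm_toy <- data.frame(
  keyword = c("algorithm", "data", "network"),
  `2000`  = c(145, 892, 234),
  `2001`  = c(178, 945, 267),
  `2002`  = c(203, 1021, 289),
  check.names = FALSE  # keep the year values as column names
)

# Write with the default ";" separator expected by importData()
write.table(tdm_toy, "tdm_toy.csv", sep = ";", row.names = FALSE, quote = FALSE)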

Corpus Information File

The corpus metadata file must include:

year dimCorpus nDoc
2000 1500000 450
2001 1650000 478
2002 1820000 502

Requirements:

  • year: Year identifier (must match TDM column names)
  • dimCorpus: Total number of tokens in the corpus for that year
  • nDoc: Number of documents in the corpus for that year
  • Additional metadata columns are preserved but not required
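
A matching corpus information file can be written in the same way (values taken from the example table above; the file name is again just for illustration):

# Toy corpus metadata: one row per year, matching the TDM column names
corpus_info_toy <- data.frame(
  year      = c(2000, 2001, 2002),
  dimCorpus = c(1500000, 1650000, 1820000),  # total tokens per year
  nDoc      = c(450, 478, 502)               # documents per year
)

write.table(corpus_info_toy, "corpus_info_toy.csv",
            sep = ";", row.names = FALSE, quote = FALSE)

# Both toy files can then be passed to importData()
corpus_data <- importData("tdm_toy.csv", "corpus_info_toy.csv")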


📦 Output

Returns a list object with the following components:

  • tdm (tibble): Processed term-document matrix including:
    • keyword: Cleaned term
    • tot_freq: Total frequency across all years
    • int_freq: Frequency interval label
    • zone: Assigned frequency zone
    • Year columns: Frequencies per year
  • corpus_info (tibble): Corpus metadata with years, dimCorpus, nDoc, and any additional metadata columns
  • norm (logical): Normalization status (initially FALSE, updated after normalization())
  • year_cols (numeric): Vector of column indices corresponding to the yearly frequency columns in the TDM
  • zone (character): Vector of unique frequency zone labels
  • colors (character): Default color palette for the zones (used in visualization functions)
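
In practice, the individual components can be inspected directly (corpus_data stands for the object returned by importData()):

names(corpus_data)        # element names listed in the table above
head(corpus_data$tdm)     # processed term-document matrix
corpus_data$corpus_info   # corpus metadata per year
corpus_data$norm          # FALSE until normalization() is applied
corpus_data$year_cols     # indices of the yearly frequency columns
corpus_data$zone          # unique frequency zone labels
corpus_data$colors        # default zone color palette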

🔍 Frequency Zones

Statistical Zones (zone = "stat")

Terms are classified into quartile-based zones:

  • Zone 1 (Q1, 0-25th percentile): Low frequency terms
  • Zone 2 (Q2, 25-50th percentile): Lower-medium frequency terms
  • Zone 3 (Q3, 50-75th percentile): Upper-medium frequency terms
  • Zone 4 (Q4, 75-100th percentile): High frequency terms
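
For intuition, the sketch below reproduces the idea of quartile-based zoning with base R; it is illustrative only and not necessarily the package's exact implementation:

# Illustrative quartile zoning based on total term frequencies
tot_freq <- corpus_data$tdm$tot_freq

breaks <- quantile(tot_freq, probs = c(0, 0.25, 0.50, 0.75, 1))
zones  <- cut(tot_freq, breaks = breaks,
              labels = c("Zone 1", "Zone 2", "Zone 3", "Zone 4"),
              include.lowest = TRUE)

table(zones)  # number of terms in each quartile zone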

Linguistic Zones (zone = "ling")

Terms are classified based on linguistic frequency theory:

  • Different thresholds based on corpus-linguistic principles
  • Zones reflect natural language frequency distributions
  • More aligned with Zipf’s law and lexical stratification
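
Whichever strategy is chosen, the assigned zones can be inspected on the imported object, for example:

# Count how many terms fall into each frequency zone
table(corpus_data$tdm$zone)

# Zone labels and their default plotting colors
corpus_data$zone
corpus_data$colors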


💡 Usage Examples

Basic Usage

library(cccc)

# Import data with default settings
corpus_data <- importData(
  tdm_file = "data/term_document_matrix.csv",
  corpus_file = "data/corpus_info.csv"
)

# Check structure
str(corpus_data)
names(corpus_data$tdm)

Using Excel Files

# Import from Excel files (separator arguments are ignored)
corpus_data <- importData(
  tdm_file = "data/tdm.xlsx",
  corpus_file = "data/corpus_metadata.xlsx",
  zone = "ling"  # Use linguistic zones
)

Custom Separator for CSV

# Import CSV files with comma separator
corpus_data <- importData(
  tdm_file = "data/tdm.csv",
  corpus_file = "data/corpus.csv",
  sep_tdm = ",",
  sep_corpus_info = ",",
  verbose = TRUE
)

🔗 Typical Workflow

After importing data with importData(), the typical next steps are (a code sketch is given after the list):

  1. Explore the data → Use rowMassPlot() and colMassPlot() to visualize frequency distributions
  2. Normalize frequencies → Apply normalization() to account for corpus size variations
  3. Visualize trajectories → Use curvePlot() or facetPlot() to examine temporal patterns
  4. Model and smooth → Apply smoothing functions to identify trends
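
A compact sketch of this pipeline, assuming each downstream function takes the imported object as its first argument (see the individual function pages for the exact signatures):

library(cccc)

# 1. Import and validate
corpus_data <- importData("data/tdm.csv", "data/corpus_info.csv")

# 2. Explore frequency distributions
rowMassPlot(corpus_data)
colMassPlot(corpus_data)

# 3. Normalize for corpus size variations
corpus_data <- normalization(corpus_data)

# 4. Visualize temporal trajectories
curvePlot(corpus_data)
facetPlot(corpus_data)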

📚 See Also

  • normalization() — Normalize the imported TDM
  • rowMassPlot() — Visualize keyword frequency distribution
  • colMassPlot() — Visualize temporal corpus dimensions

 

© 2025 The cccc Team | Developed within the RIND Project