# importData()

Import and Validate Corpus Data

The `importData()` function is the entry point for the cccc analytical pipeline. It imports, validates, and structures your corpus data and metadata into a standardized format ready for temporal analysis.
## 🔹 Function Definition

```r
importData(
  tdm_file,
  corpus_file,
  sep_tdm = ";",
  sep_corpus_info = ";",
  zone = "stat",
  verbose = TRUE
)
```
## 🎯 Purpose
This function performs several critical operations:
- Reads data files — Imports term-document matrix (TDM) and corpus metadata from CSV or Excel files
- Validates structure — Ensures data format consistency and completeness
- Cleans keywords — Standardizes term formatting and removes duplicates
- Computes frequencies — Calculates total frequency per term across all time periods
- Assigns zones — Classifies terms into frequency zones for stratified analysis
- Structures output — Returns a standardized list object for downstream analysis
This function is essential for ensuring data quality and consistency before any temporal modeling or clustering operations.
## ⚙️ Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `tdm_file` | Character | required | Path to the term-document matrix file (CSV or Excel). First column: keywords/terms. Remaining columns: frequencies per year. |
| `corpus_file` | Character | required | Path to the corpus information file (CSV or Excel). Must include columns for years, tokens, and number of documents per year. |
| `sep_tdm` | Character | `";"` | Column separator for the TDM CSV file. Ignored if the file is in Excel format. |
| `sep_corpus_info` | Character | `";"` | Column separator for the corpus info CSV file. Ignored if the file is in Excel format. |
| `zone` | Character | `"stat"` | Frequency zone classification strategy: `"stat"` for statistical quartiles, `"ling"` for linguistic frequency-based zones. |
| `verbose` | Logical | `TRUE` | If `TRUE`, prints progress messages during import and processing. |
## 📊 Input File Requirements
### Term-Document Matrix (TDM) File
The TDM file must have the following structure:
| keyword | 2000 | 2001 | 2002 | … |
|---|---|---|---|---|
| algorithm | 145 | 178 | 203 | … |
| data | 892 | 945 | 1021 | … |
| network | 234 | 267 | 289 | … |
Requirements:

- First column contains keywords/terms
- Subsequent columns represent years (column names should be year values)
- Cell values are raw frequencies for each term in each year
- Supported formats: CSV (with customizable separator) or Excel (.xlsx, .xls)
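As a quick way to produce a file matching this layout, a small data frame can be written with base R's `write.csv2()`, which uses the semicolon separator matching `sep_tdm`'s default. The data frame below is purely illustrative and not part of cccc:

```r
# Illustrative TDM: first column = keyword, remaining columns = yearly
# frequencies, with bare year values as column headers.
tdm <- data.frame(
  keyword = c("algorithm", "data", "network"),
  "2000"  = c(145, 892, 234),
  "2001"  = c(178, 945, 267),
  "2002"  = c(203, 1021, 289),
  check.names = FALSE  # keep plain years ("2000") as column names
)

# write.csv2() separates columns with ";", matching sep_tdm's default
write.csv2(tdm, "term_document_matrix.csv", row.names = FALSE)
```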
### Corpus Information File
The corpus metadata file must include:
| year | dimCorpus | nDoc |
|---|---|---|
| 2000 | 1500000 | 450 |
| 2001 | 1650000 | 478 |
| 2002 | 1820000 | 502 |
Requirements:

- `year`: Year identifier (must match TDM column names)
- `dimCorpus`: Total number of tokens in the corpus for that year
- `nDoc`: Number of documents in the corpus for that year
- Additional metadata columns are preserved but not required
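A matching corpus-info file can be produced the same way; the values below are illustrative and mirror the example table above:

```r
# Illustrative corpus-info table with the three required columns;
# the year values must match the TDM's column names.
corpus_info <- data.frame(
  year      = c(2000, 2001, 2002),
  dimCorpus = c(1500000, 1650000, 1820000),  # tokens per year
  nDoc      = c(450, 478, 502)               # documents per year
)
write.csv2(corpus_info, "corpus_info.csv", row.names = FALSE)
```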
## 📦 Output
Returns a list object with the following components:
| Element | Type | Description |
|---|---|---|
| `tdm` | tibble | Processed term-document matrix including: `keyword` (cleaned term), `tot_freq` (total frequency across all years), `int_freq` (frequency interval label), `zone` (assigned frequency zone), and one frequency column per year |
| `corpus_info` | tibble | Corpus metadata with `years`, `dimCorpus`, `nDoc`, and any additional metadata columns |
| `norm` | logical | Normalization status (initially `FALSE`, updated after `normalization()`) |
| `year_cols` | numeric | Vector of column indices corresponding to yearly frequencies in the TDM |
| `zone` | character | Vector of unique frequency zone labels |
| `colors` | character | Default color palette for zones (used in visualization functions) |
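To make the shape of the return value concrete, here is a hand-built mock with the same field names. The values are invented for illustration; the real object is produced by `importData()` and carries additional fields such as `int_freq`:

```r
# Mock of the output structure (illustrative values only)
corpus_data <- list(
  tdm = data.frame(
    keyword  = c("algorithm", "data"),
    tot_freq = c(526, 2858),          # row sums of the yearly columns
    zone     = c("Zone 3", "Zone 4"),
    "2000" = c(145, 892),
    "2001" = c(178, 945),
    "2002" = c(203, 1021),
    check.names = FALSE
  ),
  corpus_info = data.frame(
    year      = 2000:2002,
    dimCorpus = c(1500000, 1650000, 1820000),
    nDoc      = c(450, 478, 502)
  ),
  norm      = FALSE,  # flipped to TRUE by normalization()
  year_cols = 4:6     # positions of the yearly columns in tdm
)

corpus_data$norm
corpus_data$tdm$tot_freq
```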
## 🔍 Frequency Zones
### Statistical Zones (`zone = "stat"`)

Terms are classified into quartile-based zones:

- Zone 1: Q1 (0–25th percentile) — Low frequency terms
- Zone 2: Q2 (25–50th percentile) — Lower-medium frequency terms
- Zone 3: Q3 (50–75th percentile) — Upper-medium frequency terms
- Zone 4: Q4 (75–100th percentile) — High frequency terms
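The quartile scheme can be reproduced in base R with `quantile()` and `cut()`. This is a sketch of the idea, not the package's internal code:

```r
# Toy total-frequency vector (one value per term)
tot_freq <- c(3, 12, 47, 145, 234, 892, 1021, 2858)

# Breaks at the 0th, 25th, 50th, 75th, and 100th percentiles
breaks <- quantile(tot_freq, probs = seq(0, 1, 0.25))

# Assign each term to its quartile zone; include.lowest keeps the
# minimum value inside Zone 1
zone <- cut(tot_freq, breaks = breaks, include.lowest = TRUE,
            labels = paste("Zone", 1:4))

table(zone)  # two terms per zone with this toy vector
```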
### Linguistic Zones (`zone = "ling"`)

Terms are classified based on linguistic frequency theory:

- Thresholds follow corpus-linguistic principles rather than sample quartiles
- Zones reflect natural language frequency distributions
- More closely aligned with Zipf's law and lexical stratification
## 💡 Usage Examples

### Basic Usage
```r
library(cccc)

# Import data with default settings
corpus_data <- importData(
  tdm_file = "data/term_document_matrix.csv",
  corpus_file = "data/corpus_info.csv"
)

# Check structure
str(corpus_data)
names(corpus_data$tdm)
```
### Using Excel Files

```r
# Import from Excel files (separator arguments are ignored)
corpus_data <- importData(
  tdm_file = "data/tdm.xlsx",
  corpus_file = "data/corpus_metadata.xlsx",
  zone = "ling"  # Use linguistic zones
)
```
### Custom Separator for CSV

```r
# Import CSV files with comma separator
corpus_data <- importData(
  tdm_file = "data/tdm.csv",
  corpus_file = "data/corpus.csv",
  sep_tdm = ",",
  sep_corpus_info = ",",
  verbose = TRUE
)
```
## 🔗 Typical Workflow

After importing data with `importData()`, the typical next steps are:

1. **Explore the data** — Use `rowMassPlot()` and `colMassPlot()` to visualize frequency distributions
2. **Normalize frequencies** — Apply `normalization()` to account for corpus size variations
3. **Visualize trajectories** — Use `curvePlot()` or `facetPlot()` to examine temporal patterns
4. **Model and smooth** — Apply smoothing functions to identify trends
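Assuming each of these steps consumes and returns the list produced by `importData()` (the exact signatures may differ; consult each function's own reference page), the pipeline reads roughly as follows:

```r
library(cccc)

corpus_data <- importData(
  tdm_file    = "data/term_document_matrix.csv",
  corpus_file = "data/corpus_info.csv"
)

rowMassPlot(corpus_data)   # keyword frequency distribution
colMassPlot(corpus_data)   # corpus size per year

corpus_data <- normalization(corpus_data)  # sets the norm flag

curvePlot(corpus_data)     # temporal trajectories per term
```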
## 📚 See Also
📚 See Also
normalization()
— Normalize the imported TDMrowMassPlot()
— Visualize keyword frequency distributioncolMassPlot()
— Visualize temporal corpus dimensions