Computational Modeling of Identity Protective Cognition in Text Data: A Python-Based Framework for Detection, Quantification, and Visualization
A research article and implementation framework for computational social science
Prepared for Word (.docx) delivery
Author: OpenAI research drafting assistant
Date: 12 March 2026
Note: research prototype; the simulated empirical section uses a mock dataset for demonstration.
Abstract
Identity Protective Cognition (IPC) refers
to the tendency of individuals to process information in ways that protect
identities tied to valued groups, communities, and status systems. While the
concept emerged in cultural cognition research, it now has direct relevance for
computational social science because identity-protective dynamics leave
measurable traces in language. This article develops a practical framework for
operationalizing IPC from text data such as social-media posts, political
exchanges, blog comments, and religious controversies. The proposed approach
integrates lexical, semantic, sentiment, stance, and network features into a
composite IPC score that estimates the degree to which discourse is organized
around identity defense rather than open-ended evidence evaluation. The paper
reviews the theoretical foundations of IPC, motivated reasoning, cultural
cognition, and affective polarization; translates these ideas into measurable
textual indicators; and maps each stage of the pipeline to widely used Python
libraries, including pandas, NumPy, NLTK, spaCy, scikit-learn, gensim,
transformers, NetworkX, matplotlib, and seaborn. A mock dataset and simulated
analysis illustrate how IPC can be quantified across speakers and interactions.
The article argues that IPC is not reducible to sentiment alone: it is better
understood as a patterned conjunction of identity signals, outgroup derogation,
moral framing, argumentative rigidity, and polarized interaction structure. The
framework is intended as a transparent research prototype rather than a final
psychometric instrument, and the article closes by outlining validity
challenges, limitations, and future directions involving large language models,
argument mining, and agent-based simulation.
Keywords: Identity Protective Cognition;
motivated reasoning; cultural cognition; natural language processing;
polarization; argument mining; computational social science
1. Introduction
Identity Protective Cognition (IPC) is a
useful construct for explaining why the same factual record can generate
radically different interpretations across social groups. In the original
cultural-cognition literature, the key claim was not merely that citizens lack
information, but that information itself is filtered through identity-relevant
commitments. Kahan and colleagues argued that people are often motivated to
form beliefs that align with the values and affiliations of groups important to
their social standing and self-conception. In this view, cognition is not
simply a tool for discovering external reality; it is also a mechanism for
preserving membership, status, and solidarity within a social world. That
insight helps explain why increased sophistication can sometimes amplify rather
than reduce polarization when contested issues become identity markers (Kahan
et al., 2007; Kahan, 2013).
IPC matters for political polarization
because modern disputes rarely remain limited to the literal issue at hand.
Questions about climate, vaccines, immigration, religion, taxation, education,
or gender can become signals of “who is with us” and “who is against us.” Once
this happens, participants often treat evidence as socially loaded. Accepting a
claim may feel like yielding status to an outgroup, while rejecting a claim may
function as a public badge of loyalty. Recent political-psychology research continues
to show that partisan bias and motivated interpretation are structured by
ingroup favoritism and identity-linked goals rather than simple ignorance
(Ditto, 2025). IPC is therefore relevant not only to attitude formation but
also to discussion dynamics, where language becomes a vehicle for identity
alignment, norm enforcement, and outgroup exclusion.
The rise of digital text corpora makes it
possible to study these processes computationally. Social-media threads, blog
comment sections, parliamentary debates, podcasts, and online forums generate
vast language traces that reveal how identities are invoked, defended,
moralized, and weaponized. Computational social science has matured to the
point where lexical analysis, topic modeling, transformer-based inference, and
graph methods can be combined into a unified pipeline. Yet IPC itself has not
often been converted into a transparent, modular measurement framework suitable
for large-scale text analysis. Existing work frequently focuses on adjacent
constructs such as ideological bias, stance, affective polarization, or
misinformation susceptibility. Those constructs matter, but IPC adds a distinct
emphasis: the protection of identity as a cognitive objective.
This article develops a research-grade but
practically implementable framework for modeling IPC in text data. The
contribution is threefold. First, it synthesizes literature from cultural
cognition, motivated reasoning, political psychology, and argumentation
research into a measurable concept. Second, it translates IPC into textual
indicators that can be extracted with mainstream Python tools. Third, it
proposes an IPC score that combines identity signal density, sentiment
polarization, argumentative rigidity, and outgroup delegitimization,
supplemented by moral framing and network structure where data permit. The goal
is not to claim that any single score fully captures the psychological
phenomenon, but to provide a transparent starting point for empirical work.
Because the method uses explicit features rather than a black-box label alone,
it supports interpretability, replication, and iterative validation.
The paper is designed for researchers who
work at the intersection of psychology, computational social science, and
natural language processing (NLP). It presents the theory, the computational
pipeline, a mathematical formulation, Python implementation examples, a mock
dataset, simulated results, and a full project template. The broader ambition
is methodological: to create a bridge between psychologically rich theory and
scalable text analysis.
2. Theoretical Framework
IPC sits at the intersection of several
theoretical traditions. The first is motivated reasoning. In the broadest
sense, motivated reasoning refers to information processing shaped by
directional goals rather than accuracy goals alone. People do not merely ask
what is true; they also ask, often implicitly, what belief will preserve
coherence, belonging, esteem, or desired emotion. Recent reviews of partisan
bias emphasize that political judgment often reflects ingroup favoritism in
what people seek out, believe, and remember (Ditto, 2025). IPC can be
understood as a subtype of motivated reasoning in which the relevant motive is
identity protection.
The second tradition is cultural cognition.
Kahan’s work argued that individuals appraise risks, evidence, and expertise
partly through cultural worldviews tied to preferred forms of social
organization. The important move here is from “beliefs as private opinions” to
“beliefs as social signals.” Under this perspective, apparently factual
disputes can become entangled with deeper cultural meanings. The famous “white
male effect” paper framed identity-protective cognition as a mechanism through
which people dismiss asserted dangers that threaten identities linked to
hierarchical or individualistic values (Kahan et al., 2007). Later work on
scientific consensus extended this argument by showing that deference to
expertise itself can become culturally coded (Kahan et al., 2011).
A third relevant concept is epistemic
identity. Epistemic identity refers to the way individuals come to define
themselves, and to be recognized by others, through styles of knowing:
skeptical, orthodox, anti-elite, scientific, contrarian, patriotic, traditionalist,
and so forth. In many online environments, these identities are publicly
performed through repeated rhetorical choices. Identity-protective cognition is
easier to detect when epistemic identity is linguistically salient: speakers
say not only “this is false,” but “people like us know this is false,” or “only
gullible elites believe that.” Such expressions connect content to community
membership.
A fourth distinction concerns normative
versus descriptive rationality. Normative accounts ask how agents should reason
under standards such as coherence, probability, or evidence integration.
Descriptive accounts ask how agents actually reason under cognitive and social
constraints. IPC belongs primarily to descriptive explanation. It does not
imply that people are irrational in every practical sense. From a local social
perspective, preserving group belonging may be instrumentally rational.
Publicly endorsing outgroup-favored claims can carry status costs, relationship
costs, or identity dissonance. Thus, a computational model of IPC should not
treat every strong opinion as a defect. Instead, it should identify patterns
where discourse appears more oriented toward identity maintenance than toward
flexible evidence updating.
Finally, IPC must be distinguished from
mere negative sentiment. A speaker can be angry without engaging in identity
defense, and a speaker can exhibit IPC in relatively polite language. The
distinguishing signature is patterned alignment between group identity,
selective evaluation, and the treatment of disagreement as a threat to the
social self. This is why a useful model must combine multiple dimensions rather
than depend on a single classifier.
3. Literature Review
The literature on IPC is dispersed across
psychology, law, risk perception, political behavior, and public communication.
Foundational work by Kahan and colleagues positioned identity protection as a
central mechanism behind disagreements over contested facts, especially when
issues threaten values embedded in group life (Kahan et al., 2007). In later
work on scientific consensus, Kahan and colleagues argued that deference to
expertise is itself filtered by cultural meanings, so that appeals to “what science
says” do not operate in a social vacuum (Kahan et al., 2011). These
contributions established a crucial theoretical point: cognition is socially
situated, and identity defense can shape what counts as credible evidence.
The broader motivated-reasoning literature
supports this general picture while also debating mechanisms. Some work
emphasizes directional goals and selective information processing; other work
stresses emotional regulation, selective exposure, or memory biases. Recent
syntheses conclude that partisan judgment robustly exhibits ingroup favoritism
and directional bias, even though scholars continue to debate boundary
conditions and causal pathways (Ditto, 2025). Experimental work has also
examined whether motivated political reasoning is better explained by
emotion-regulation goals or by broader identity and belief-maintenance dynamics
(Kiil, 2025). For present purposes, the main implication is that the
measurement of IPC should allow for affect, but not collapse identity
protection into affect alone.
Political psychology contributes the
concept of affective polarization: citizens increasingly dislike, distrust, and
morally condemn opposing camps. Newer work conceptualizes affective
polarization as multidimensional, involving othering, aversion, and moralization
rather than a single feeling thermometer (Campos et al., 2025). This is highly
relevant to IPC because outgroup delegitimization is one of the clearest
textual manifestations of identity defense. When opponents are described as
evil, brainwashed, traitorous, or subhuman, disagreement is re-coded as a moral
and identity conflict rather than a disagreement over evidence.
Argumentation research and NLP add
methodological tools for operationalization. Recent overviews of dialogical
argument mining show increasing interest in extracting claims, premises,
relations, and dialogic functions from debates and conversations (Ruiz-Dolz et
al., 2024; Lapesa et al., 2024). Although argument-mining systems rarely target
IPC directly, they are valuable because argumentative rigidity and low
responsiveness to counter-evidence are detectable through discourse structure.
If a participant repeatedly restates identity-congruent claims, ignores
objections, and escalates delegitimization, those patterns can be encoded as
features.
Moral-framing research also matters.
Computational studies using moral-foundation lexicons have shown that
ideological discourse differs not only in topic but in moral-emotional emphasis
and framing (Fulgoni et al., 2016; Takikawa & Sakamoto, 2017). IPC is often
intensified by moralization because once an issue is tied to sanctity,
betrayal, oppression, purity, or loyalty, evidence evaluation becomes harder to
separate from moral self-positioning. A language model of IPC should therefore
attend to moral frames, particularly when they track ingroup defense and
outgroup suspicion.
Recent empirical work complicates any
simplistic assumption that more knowledge necessarily leads to more
polarization. A 2025 study in Nature Communications reported that factual
knowledge can reduce, rather than increase, attitude polarization under some
conditions (Stagnaro et al., 2025). This is a valuable caution. IPC is not
destiny, and epistemic interventions can sometimes soften polarization. The
implication for measurement is methodological humility: an IPC score should be
treated as a conditional estimate of identity-protective language use in a
given context, not as a stable essence of a person or ideology.
Overall, the literature supports a layered
measurement strategy. Foundational psychology explains why identity protection
should be expected. Political-psychology research identifies its links to
partisanship, affective polarization, and disagreement. NLP and argument mining
provide operational tools. The remaining challenge is to convert this
interdisciplinary synthesis into a reproducible analytic pipeline.
4. Operationalizing IPC in Text Data
Operationalizing IPC in text data requires
translating a psychologically rich construct into observable linguistic and
relational indicators. Because IPC is multidimensional, no single feature is
sufficient. The most defensible strategy is a composite model that captures
several partially independent signals.
The first family of indicators concerns
identity signals. These are lexical or semantic markers that explicitly invoke
group belonging, status, loyalty, tradition, ideology, or shared epistemic
style. Examples include expressions such as “people like us,” “real
conservatives,” “true believers,” “our side,” “their agenda,” “the scientific
elite,” or “faithful Christians.” Identity signals may also be indirect, such
as repeated references to in-group media brands, authorities, or slogans. In
computational terms, identity signal density can be estimated through
dictionaries, pattern matching, contextual embeddings, or supervised
classifiers.
The second family concerns group references
and boundary language. IPC often appears when speakers draw strong boundaries
between “us” and “them,” accompanied by trust asymmetries: our side is honest,
their side is corrupt; our group sees reality, their group is deluded. Pronoun
patterns, named-group mentions, and co-occurrence structures can reveal this
boundary work. Entity recognition and dependency parsing are useful here
because they help distinguish generic nouns from identity-bearing references.
The third family concerns moral framing.
Many identity conflicts are not argued in purely empirical terms. Instead,
speakers mobilize moralized language about betrayal, corruption, oppression,
purity, duty, loyalty, or freedom. Moral framing is important because it
transforms disagreement into a threat to what the speaker considers sacred or
non-negotiable. Lexicon-based approaches can provide transparent baselines,
while transformer-based classifiers can detect broader framing beyond
handcrafted dictionaries.
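As a transparent baseline for this family, moral framing intensity can be approximated with a small category lexicon. The mini-lexicon below is a hypothetical stand-in for a validated resource such as a Moral Foundations dictionary; the categories and terms are illustrative only.

```python
import re

# Hypothetical mini-lexicon for illustration; a real study would substitute
# a validated resource such as a Moral Foundations dictionary.
MORAL_LEXICON = {
    "loyalty": {"loyal", "betray", "traitor", "patriot"},
    "purity": {"pure", "corrupt", "degenerate", "sacred"},
    "fairness": {"unfair", "cheat", "justice", "oppression"},
}

def moral_framing_intensity(tokens):
    """Share of (lowercased) tokens matching any moral-foundation category."""
    hits = sum(1 for t in tokens if any(t in cat for cat in MORAL_LEXICON.values()))
    return hits / max(1, len(tokens))

tokens = [t.lower() for t in re.findall(r"\w+", "They betray our sacred values")]
print(moral_framing_intensity(tokens))  # 0.4
```

Because the lexicon is explicit, false positives can be inspected directly, which is the main advantage of this baseline over an opaque classifier.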
The fourth family concerns argumentative
rigidity. Rigidity is visible when speakers show low responsiveness to
counter-arguments, repeat slogans, rely on certainty markers, and dismiss
alternative evidence without engagement. In threaded discussions, rigidity can
be estimated through low lexical uptake from opponents, repeated reuse of the
speaker’s own talking points, elevated certainty language, low hedge frequency,
and weak claim-revision over time. Argument-mining tools can detect
claim-premise structure, but simpler proxies—such as repetition rates and
counter-evidence rejection markers—are often good starting points.
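A minimal sketch of such proxies, using hypothetical certainty- and hedge-marker lists and Jaccard overlap between consecutive turns as a self-repetition measure:

```python
# Illustrative rigidity proxies. The marker sets are hypothetical stand-ins
# for validated certainty/hedge lists.
CERTAINTY = {"obviously", "clearly", "undeniably", "always", "never", "fact"}
HEDGES = {"maybe", "perhaps", "might", "possibly", "seems", "arguably"}

def rigidity_proxies(turns):
    """Return simple per-speaker rigidity indicators from a list of turns."""
    tokens = [t.lower() for turn in turns for t in turn.split()]
    n = max(1, len(tokens))
    # Jaccard overlap between consecutive turns as a self-repetition proxy.
    overlaps = []
    for a, b in zip(turns, turns[1:]):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        if sa | sb:
            overlaps.append(len(sa & sb) / len(sa | sb))
    return {
        "certainty_rate": sum(t in CERTAINTY for t in tokens) / n,
        "hedge_rate": sum(t in HEDGES for t in tokens) / n,
        "self_repetition": sum(overlaps) / len(overlaps) if overlaps else 0.0,
    }

print(rigidity_proxies(["Obviously they never change", "They never change obviously"]))
```

Keeping the three indicators separate, rather than collapsing them immediately, lets the analyst inspect which proxy drives a high rigidity estimate.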
The fifth family concerns outgroup
delegitimization. This is especially diagnostic because it links disagreement
to identity threat. Delegitimizing language includes insults, epistemic
dismissal (“brainwashed,” “NPC,” “sheep”), moral condemnation (“evil,”
“traitors”), and accusations of bad faith (“paid shills,” “propaganda bots”).
This dimension is closely related to affective polarization, but its relevance
to IPC lies in the way it converts disagreement into grounds for exclusion or
contempt.
A sixth, optional family concerns
interaction structure. IPC does not only reside within isolated texts; it also
appears in networks. Reply graphs, endorsement networks, or co-mention networks
can show segregation, clustering, and asymmetric hostility. Network features
such as modularity, assortativity, local echo chambers, and bridge scarcity can
be incorporated when metadata are available.
The result is a layered construct model. At
minimum, IPC in text can be approximated through identity signal density,
sentiment and stance polarization, argumentative rigidity, and outgroup
delegitimization. Moral framing and network segregation strengthen the estimate
when available. This formulation is intentionally modular so that researchers
can adapt it to different corpora, languages, and platform constraints.
5. Computational Methodology
The computational pipeline begins with data
collection. Suitable sources include public social-media posts, threaded forum
discussions, parliamentary speeches, debate transcripts, podcast transcripts,
and blog comments. The key requirement is that texts are linked either to
speakers or to interaction structure. IPC is most informative when we can
observe not just content but alignment, opposition, and repetition across
participants. Researchers should preserve metadata such as speaker ID,
timestamp, platform, thread ID, and reply targets whenever ethically and
legally permissible.
The second step is text preprocessing.
Standard cleaning includes lowercasing, URL removal, emoji handling,
punctuation normalization, and language filtering. However, preprocessing
should not be overly destructive. Group labels, hashtags, capitalization
patterns, and certain punctuation cues can be identity-relevant. It is often
better to preserve them in parallel columns rather than delete them outright.
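One way to keep identity-relevant surface cues while still cleaning the text is to copy them into parallel columns before normalization. The column names and regex patterns below are illustrative, not prescriptive:

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["Our people see the TRUTH #patriots https://example.org"]})

# Preserve identity-relevant surface cues before destructive cleaning.
df["hashtags"] = df["text"].str.findall(r"#\w+")
df["caps_tokens"] = df["text"].str.findall(r"\b[A-Z]{2,}\b")

def clean(text):
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"#", " ", text)             # strip '#' but keep the word
    return re.sub(r"\s+", " ", text).strip().lower()

df["text_clean"] = df["text"].map(clean)
print(df[["text_clean", "hashtags", "caps_tokens"]])
```

The cleaned column feeds tokenization, while the parallel columns remain available as identity-signal features later in the pipeline.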
The third step is tokenization, followed by
lemmatization. Tokenization divides text into units suitable for downstream
counting and modeling. Lemmatization reduces inflected forms to base forms,
improving dictionary matching and sparsity control. spaCy offers reliable
tokenization, part-of-speech tagging, and lemmatization; NLTK remains useful
for stopword resources, lexical preprocessing, and lightweight sentiment
baselines. For multilingual settings, researchers must carefully choose
language models and may need custom lexicons.
The fourth step is feature extraction. At
this stage the researcher computes document- or utterance-level features.
Examples include word and character n-grams, pronoun ratios, identity
dictionary counts, moral lexicon counts, certainty markers, hedge markers,
insult lexicon hits, sentiment scores, stance predictions, topic distributions,
and embedding-based semantic similarity measures. Scikit-learn is useful for
sparse vectorization and classical machine-learning baselines; transformers
provide contextual embeddings and zero-shot or fine-tuned classifiers for
sentiment, stance, or NLI-style contradiction detection.
The fifth step is identity signal
detection. A transparent starting point is a curated dictionary of identity
markers: ingroup terms, outgroup labels, worldview references, and
loyalty-coded phrases. More advanced options include sentence embeddings with
cosine similarity to identity prototypes or supervised learning from
human-annotated examples. The detection module should distinguish
self-identification, group attribution, and adversarial labeling because these
play different roles.
The sixth step is sentiment and stance
detection. Sentiment alone is insufficient, but it helps when directed toward
groups or disputed topics. A sentence may express negative affect toward an
outgroup, admiration toward the ingroup, or contempt toward “neutral”
institutions coded as hostile. Stance detection is especially valuable because
it captures whether the speaker aligns with, opposes, or questions a
proposition relevant to group identity. Transformer pipelines from Hugging Face
provide a convenient inference layer for such tasks.
The seventh step is polarization
measurement. At the document level, one can calculate lexical divergence
between camps, disagreement rates, or sentiment asymmetries. At the interaction
level, network features become relevant: who replies to whom, whether
communities cluster by shared frames, and whether bridging nodes are rare.
NetworkX is appropriate for constructing reply graphs and computing centrality,
assortativity, clustering, and community structure proxies.
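A toy reply graph in NetworkX, with modularity-based community detection as one segregation proxy (the edge list is fabricated for illustration):

```python
import networkx as nx

# Toy reply graph: an edge (u, v) means speaker/message u replied to v.
replies = [("a1", "b1"), ("a2", "a1"), ("a3", "a1"),
           ("b2", "b1"), ("b3", "b2"), ("c1", "a1"), ("c1", "b1")]
G = nx.Graph(replies)  # undirected view for community and clustering metrics

# Greedy modularity maximization as a simple community-structure proxy.
communities = nx.algorithms.community.greedy_modularity_communities(G)
modularity = nx.algorithms.community.modularity(G, communities)
print(len(communities), round(modularity, 3))
```

High modularity with few cross-community edges is one rough indicator of the clustered, low-bridging interaction structure discussed above; it should be interpreted alongside the text-level components, not in isolation.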
The eighth step is IPC score calculation.
Here the researcher combines normalized component scores into a composite
measure. The combination can be theory-driven, data-driven, or hybrid.
Theory-driven weighting improves interpretability; supervised weighting based
on human labels improves predictive validity. The score should be retained
alongside its subcomponents so that analysts can inspect why a text was
assigned a high or low IPC estimate.
The ninth step is validation. At minimum,
researchers should perform face validation, inter-annotator agreement on
labeled subsets, sensitivity analyses for thresholds and lexicons, and
robustness checks across platforms or topics. Without validation, the score
remains a heuristic. With validation, it can become a useful research
instrument.
6. Python Libraries
A practical IPC workflow benefits from a
mature and interoperable Python stack. pandas serves as the central tabular
backbone for loading, merging, filtering, and exporting text corpora. Its
DataFrame abstraction is ideal for managing documents, speakers, timestamps,
and feature columns, and it integrates smoothly with plotting and
machine-learning tools (pandas documentation).
NumPy provides efficient numerical arrays
and vectorized operations. It becomes especially important when combining
standardized feature matrices, computing weighted sums, and performing fast
transformations on large datasets.
NLTK remains valuable for foundational NLP
tasks, including tokenization utilities, stopword lists, lexical resources, and
simple sentiment methods such as VADER. While many advanced pipelines now rely
on transformer models, NLTK is still useful for transparent preprocessing and
educational baselines.
spaCy is particularly strong for
production-quality linguistic preprocessing. Its industrial-strength
tokenization, lemmatization, part-of-speech tagging, dependency parsing, and
named entity recognition make it well suited for extracting structured linguistic
cues relevant to IPC, including pronouns, identity nouns, and group entities
(spaCy documentation).
scikit-learn supplies core machine-learning
infrastructure: vectorizers, feature extraction, train-test pipelines,
dimensionality reduction, clustering, and classical classifiers. For IPC
research, scikit-learn is especially useful for TF-IDF features, calibration
baselines, interpretable linear models, and evaluation workflows (scikit-learn
documentation).
gensim is the most natural library in this
stack for topic modeling. Latent Dirichlet Allocation (LDA) can help identify
whether identity-protective discourse clusters around recurring issue frames or
grievance narratives. Topic mixtures are not themselves IPC, but they offer
contextual features that improve interpretation (gensim documentation).
transformers, especially through Hugging
Face pipelines, provides access to contextual language models for sentiment
analysis, sequence classification, feature extraction, and embeddings. This is
crucial when identity language is subtle, sarcastic, or context-dependent.
Transformer features can also support stance detection and entailment-style
reasoning over argumentative claims (Transformers documentation).
NetworkX is essential once discourse is
represented as a graph. Nodes can correspond to speakers or messages, and edges
can represent replies, endorsements, mentions, or quotations. Network structure
is useful because IPC often intensifies in clustered communities with low
cross-cutting interaction and high internal reinforcement (NetworkX
documentation).
matplotlib and seaborn support
visualization. Histograms of IPC scores, scatterplots of moral framing versus
delegitimization, and network diagrams all help make results interpretable.
Seaborn is particularly useful for statistical visualizations and grouped
comparisons, while matplotlib offers full control for publication-ready figures
(matplotlib documentation).
Together, these libraries support a modular
workflow from raw text to interpretable measurement. The key principle is not
to use every library merely because it exists, but to map each library to a
clearly defined analytical task.
7. Mathematical Model of IPC
Let each text unit i denote a post,
comment, speech turn, or document. We define a composite IPC score on the unit
interval:
IPC_i = w1*I_i + w2*S_i + w3*R_i + w4*D_i + w5*M_i + w6*N_i,
subject to w_k ≥ 0 and Σ w_k = 1.
Here:
I_i = Identity Signal Density. This
captures the relative presence of lexical or semantic markers of in-group
identity, out-group identity, loyalty, worldview references, and boundary talk.
A simple formulation is the normalized count of identity terms per 100 tokens,
optionally adjusted by contextual confidence.
S_i = Sentiment/Stance Polarization. This
captures polarized evaluative direction, especially when negative sentiment
targets the outgroup or when stance sharply aligns with group-coded positions.
A practical version can combine absolute sentiment intensity with target
directionality or contradiction scores across camps.
R_i = Argumentative Rigidity. This measures
low responsiveness to counter-arguments, repetition of slogans, certainty
language, and low hedging. One simple implementation is a weighted combination
of self-repetition, certainty-marker frequency, and inverse uptake of opponent
vocabulary.
D_i = Outgroup Delegitimization. This
measures moral or epistemic disqualification of opponents. It includes insults,
attributions of bad faith, dehumanizing labels, and accusations of corruption
or stupidity directed at an identifiable outgroup.
M_i = Moral Framing Intensity. This
estimates the extent to which the discourse is moralized through loyalty,
betrayal, oppression, purity, authority, sanctity, and similar frames.
N_i = Network Polarization Context. This
optional component applies when interaction metadata are available. It can
include whether the speaker is embedded in a highly clustered, homogeneous, or
antagonistic local neighborhood.
In a minimal implementation, M_i and N_i
can be omitted and the remaining weights rescaled:
IPC_i = w1*I_i + w2*S_i + w3*R_i + w4*D_i.
All component scores should be normalized
to [0,1] before combination. Z-score standardization followed by min-max
scaling is a common approach. The weights can be fixed theoretically—for
example, w = (0.30, 0.20, 0.25, 0.25)—or estimated from labeled data using
regression or probabilistic calibration. For interpretability, the composite
score should always be reported together with its component profile.
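The normalization and theory-driven weighting described above can be sketched with simulated component scores (the data here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated raw component scores for 200 text units: columns (I, S, R, D).
raw = rng.normal(size=(200, 4))

# Z-score standardization followed by min-max scaling to [0, 1].
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
scaled = (z - z.min(axis=0)) / (z.max(axis=0) - z.min(axis=0))

# Theory-driven weights from the text; they sum to 1.
w = np.array([0.30, 0.25, 0.25, 0.20])[[0, 3, 1, 2]]  # (0.30, 0.20, 0.25, 0.25)
ipc = scaled @ w
print(ipc.shape)
```

Retaining `scaled` alongside `ipc` preserves the component profile, so each composite score remains decomposable into its contributing signals.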
At the speaker level, IPC can be aggregated over that speaker's n_s texts:
IPC_speaker = (1/n_s) * Σ_{i ∈ s} IPC_i.
At the thread level, one can compute mean
IPC, variance of IPC, or the proportion of high-IPC turns. These aggregate
statistics are useful for comparing platforms, topics, or interventions.
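These aggregations map directly onto pandas groupby operations; the scores and the 0.6 high-IPC threshold below are fabricated for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "speaker": ["aino", "aino", "mika", "mika", "mika"],
    "thread": [1, 1, 1, 2, 2],
    "ipc": [0.72, 0.64, 0.31, 0.55, 0.48],
})

# Speaker-level mean IPC (the IPC_speaker formula above).
speaker_ipc = df.groupby("speaker")["ipc"].mean()

# Thread-level summary: mean, variance, and share of high-IPC turns.
thread_stats = df.groupby("thread")["ipc"].agg(
    mean="mean", var="var", high_share=lambda s: (s > 0.6).mean()
)
print(speaker_ipc)
print(thread_stats)
```

The same pattern extends to platform- or topic-level comparisons by changing the grouping key.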
Table 1. IPC component definitions

| Component | Meaning | Typical signal |
| --- | --- | --- |
| I | Identity signal density | ingroup labels, boundary talk, worldview markers |
| S | Sentiment/stance polarization | negative target-directed affect, categorical stance |
| R | Argumentative rigidity | certainty markers, repetition, low uptake of counter-arguments |
| D | Outgroup delegitimization | insults, bad-faith accusations, epistemic dismissal |
| M | Moral framing intensity | loyalty, betrayal, purity, corruption, duty |
| N | Network polarization context | clustered reply patterns, low bridging, assortative ties |
8. Python Implementation
The implementation below is designed as a
transparent prototype. It reads a tabular corpus, preprocesses text, computes
lexical features, estimates sentiment, infers coarse topics, builds an
interaction graph, and calculates IPC scores. The code is modular rather than
fully optimized because research transparency is prioritized over
platform-specific engineering.
A typical workflow begins by loading a CSV
file with columns such as message_id, speaker, text, reply_to, and camp.
Preprocessing uses regex cleaning and spaCy lemmatization. Identity features
are computed from hand-built dictionaries containing ingroup terms, outgroup
labels, delegitimizing expressions, moral markers, and rigidity markers.
Sentiment is estimated using a lightweight method for demonstration, although
transformer-based sentiment analysis can be swapped in with minimal changes.
Topic modeling uses gensim’s LDA implementation over tokenized text. The
interaction graph uses NetworkX to connect replies between speakers or
messages.
The resulting DataFrame contains component
columns and a final IPC score. Because all intermediate variables remain
visible, the analyst can inspect false positives, adjust lexicons, or replace
modules with better models. This makes the framework suitable for iterative
research design.
8.1 Reading tabular text data
import pandas as pd

# Load the mock corpus (columns: message_id, speaker, text, reply_to, camp).
df = pd.read_csv('data/mock_discussion.csv')
print(df[['message_id', 'speaker', 'text']].head())
8.2 Tokenization and lemmatization with spaCy
import spacy

# Requires the model once: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['ner'])
doc = nlp('People like us are tired of elite propaganda.')
lemmas = [tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct]
print(lemmas)
8.3 Sentiment analysis
from nltk.sentiment import SentimentIntensityAnalyzer

# Requires once: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores('Only brainwashed elites think this helps real people.')
print(score['neg'])
8.4 Topic modeling with gensim
from gensim import corpora, models

# Tiny illustrative token lists; real input comes from the preprocessing step.
tokens = [['energy', 'transition', 'cost'], ['elite', 'nation', 'freedom']]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())
8.5 Identity-term detection
# Hand-built starter lexicon; expand and validate for real studies.
IDENTITY_TERMS = {'our', 'us', 'we', 'real', 'patriots', 'nation'}

def identity_density(tokens):
    """Share of tokens that are identity markers (expects lowercased tokens)."""
    return sum(1 for tok in tokens if tok in IDENTITY_TERMS) / max(1, len(tokens))
8.6 Polarization metric
def polarization_metric(ingroup_score, outgroup_score):
    return abs(ingroup_score - outgroup_score)
8.7 IPC score calculation
# Each component is min-max normalized to [0, 1] before weighting.
ipc_score = (
    0.30 * identity_component
    + 0.20 * sentiment_component
    + 0.25 * rigidity_component
    + 0.25 * delegitimization_component
)
9. Mock Dataset Example
To illustrate the framework, consider a
small mock corpus representing a polarized political discussion about energy
and immigration. Each row includes a speaker, a camp label, a short text, and a
reply target. Some comments are relatively evidence-focused; others explicitly
invoke group identity and delegitimize opponents. The dataset is intentionally
tiny and stylized, so the results should be read as a demonstration of method
rather than as substantive evidence about real groups.
A useful design principle for mock data is
heterogeneity. At least some messages should contain identity-laden phrases
(“our people,” “their elite agenda”), some should contain moralized
condemnation (“traitors,” “corrupt”), and some should remain lower in identity
content even if they express disagreement. That variation allows the IPC score
to demonstrate discriminant behavior.
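A heterogeneous mock corpus of this kind can be sketched directly in pandas. The rows below are invented illustrations spanning the three message types (identity-laden, moralized, evidence-focused), not data from any real group:

```python
import pandas as pd

# Illustrative mock rows covering identity-laden, moralized, and
# evidence-focused messages.
rows = [
    {"message_id": 1, "speaker": "A", "camp": "Green",
     "text": "Our people see the evidence; their elite agenda ignores it.",
     "reply_to": None},
    {"message_id": 2, "speaker": "B", "camp": "National",
     "text": "These corrupt traitors sold out ordinary workers.",
     "reply_to": 1},
    {"message_id": 3, "speaker": "C", "camp": "Green",
     "text": "The emissions data suggest the subsidy had little effect.",
     "reply_to": 2},
]
mock_df = pd.DataFrame(rows)
print(mock_df[["message_id", "camp", "text"]])
```

The third row should score low on identity density even though it disagrees, which is exactly the discriminant behavior the mock corpus is designed to exercise.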
In the simulated analysis, comments with
dense identity references and explicit outgroup derogation receive the highest
IPC values. Comments that criticize policy with little identity language score
lower, even when their sentiment is negative. This distinction is important
because it shows how the model separates adversarial policy critique from
identity-protective discourse.
Table 2. Excerpt from mock discussion dataset
| ID | Speaker | Camp | Text excerpt |
|----|---------|------|--------------|
| 1 | Aino | Green | Our side keeps looking at evidence, but the fossil lobby and their media allies keep lying... |
| 2 | Mika | National | People like us are tired of elite propaganda. Patriotic citizens know this climate panic i... |
| 3 | Sara | Green | That is not evidence. You are repeating slogans instead of answering the emissions data. |
| 4 | Jari | National | The globalist crowd always says data, data, data, but they never care about our workers or... |
| 5 | Leena | Green | When you call everyone globalists and traitors you avoid the policy question. |
| 6 | Olli | National | Only brainwashed elites think mass immigration and net-zero fantasies help real Finns. |
10. Visualization
Visualization has analytical and
communicative value. A histogram of IPC scores shows whether the corpus
contains a long high-IPC tail or a more uniformly moderate distribution. A
camp-level boxplot can reveal whether one discussion cluster uses more identity-protective
language under a specific topic, although such comparisons must be interpreted
cautiously and contextualized.
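A camp-level comparison of this kind can be sketched with matplotlib. The scores and camp labels below are invented for illustration only:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from pathlib import Path

# Invented IPC scores for two discussion camps.
scores = {
    "Green": [0.21, 0.35, 0.44, 0.52, 0.30],
    "National": [0.48, 0.61, 0.72, 0.55, 0.66],
}
plt.figure(figsize=(6, 4))
plt.boxplot(list(scores.values()))
plt.xticks([1, 2], list(scores.keys()))
plt.ylabel("IPC score")
plt.title("IPC scores by camp (illustrative)")
plt.tight_layout()
out_path = Path("ipc_boxplot_demo.png")
plt.savefig(out_path, dpi=120)
plt.close()
```

With real data, the per-camp lists would be built by grouping the scored DataFrame on the camp column.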
A second useful figure is a two-dimensional
polarization map. For example, the x-axis may represent sentiment polarity
toward the outgroup, while the y-axis represents identity signal density.
High-IPC texts would tend to occupy the upper-left or upper-right corners
depending on coding, making it easier to identify clusters of identity-defense
discourse.
A third figure is a conversation network.
Nodes can represent speakers and be sized by average IPC score; edge thickness
can represent reply frequency. When combined with community colors, the network
can reveal whether high-IPC discourse is concentrated in segregated
subcommunities or diffused across the conversation. These structural patterns
cannot by themselves prove psychological IPC, but they enrich interpretation.
All three figures are included in the
accompanying project template: a histogram of IPC scores, a scatterplot showing
identity density versus delegitimization, and a speaker reply network.
Figure 1. Distribution of IPC scores in the mock
dataset.
Figure 2. Identity density versus outgroup
delegitimization.
Figure 3. Speaker reply network sized by mean IPC.
11. Results
In the mock analysis, the highest IPC
values were assigned to comments combining three properties: explicit identity
markers, hostile outgroup attributions, and rigid evaluative certainty. For
example, utterances that portrayed opponents as corrupt outsiders while
simultaneously affirming the speaker’s moral community scored substantially
higher than issue-focused disagreements. By contrast, comments expressing
criticism through evidence claims or policy trade-off language without clear
boundary work tended to receive moderate or low IPC scores.
Topic modeling suggested that the mock
discussion clustered around two latent issue bundles: an energy-security frame
and an immigration-culture frame. IPC scores were not evenly distributed across
these topics. The immigration-culture cluster exhibited greater identity
density and more delegitimizing labels, while the energy-security cluster
included both high-IPC and lower-IPC turns, suggesting that the same topic can
host different discourse modes.
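The topic-level comparison can be sketched as follows. The document-topic distributions and IPC scores here are invented stand-ins for real gensim LDA output:

```python
import numpy as np
import pandas as pd

# Invented doc-topic distributions (rows: documents, cols: 2 latent topics),
# standing in for gensim LDA output, plus invented IPC scores.
doc_topics = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
])
ipc = pd.Series([0.30, 0.70, 0.40, 0.65])

dominant = doc_topics.argmax(axis=1)  # dominant topic per document
topic_ipc = ipc.groupby(dominant).mean()
print(topic_ipc)  # mean IPC per dominant topic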
The simulated reply network also displayed
local concentration. Speakers with higher average IPC tended to participate in
denser intragroup exchanges and to use more hostile cross-group replies when
they did engage opponents. In a real dataset this pattern would need stronger
causal interpretation, but as a proof of concept it demonstrates how textual
and relational features can be integrated.
Overall, the prototype behaves in a
theoretically plausible way. It does not simply reward negativity; rather, it
elevates texts where negativity is fused with identity signaling, moralized
boundary work, and delegitimization. That is precisely the pattern one would
expect if IPC is being approximated rather than replaced by generic sentiment
analysis.
12. Discussion
The main value of an IPC measurement
framework is analytical differentiation. Public discourse research often
conflates disagreement, incivility, moralization, and polarization. These are
related but distinct phenomena. A speaker may be strongly negative without
framing disagreement as an identity threat, and a speaker may invoke identity
with relatively restrained sentiment. By decomposing IPC into components,
researchers can separate these layers.
This matters for the study of epistemic
polarization. When citizens or communities treat beliefs as badges of
membership, disagreement becomes self-involving. Counter-evidence is then
experienced not just as informational friction but as social danger. The IPC
framework helps identify when discourse has crossed that threshold. That makes
it useful for comparing platforms, topics, moderation regimes, and intervention
designs.
The framework also contributes to
discussion-dynamics research. IPC is not merely a trait; it can intensify or
de-intensify through interaction. A conversation may begin with issue
disagreement, escalate through boundary language, and culminate in delegitimization.
Time-series analysis of IPC components could therefore reveal escalation
pathways, tipping points, or de-escalation patterns in debates. In contexts
such as religious controversy or partisan conflict, that dynamic perspective
may be more valuable than static classification.
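One minimal way to sketch such escalation tracking is a rolling mean of IPC over conversational order; the scores below are invented:

```python
import pandas as pd

# Invented IPC scores in conversational (time) order.
ipc_by_turn = pd.Series([0.20, 0.25, 0.40, 0.55, 0.70, 0.75])

# A 3-turn rolling mean smooths noise and highlights escalation trends.
rolling_ipc = ipc_by_turn.rolling(window=3).mean()
print(rolling_ipc.round(3).tolist())
```

In a real analysis the same rolling window could be applied per component (identity, delegitimization, rigidity) to see which layer escalates first.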
From a methodological standpoint, the
proposed score is intentionally transparent. Many recent NLP systems rely on
powerful black-box models whose outputs are difficult to interpret. In
sensitive political or social analyses, opacity is a problem. A transparent IPC
framework enables researchers to inspect component scores, explain model
decisions, and adapt features to local language communities. This is especially
important when studying normative or high-stakes discourse, where overclaiming
can damage both scholarship and public trust.
At the same time, transparent systems
should not be romanticized. Dictionary-based features can miss irony, metaphor,
coded language, and sarcasm. Contextual models can help, but they introduce
their own opacity and domain-shift problems. The best path is likely hybrid:
interpretable feature engineering combined with selective use of contextual
models and human validation.
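A hybrid design can be sketched as a weighted blend of an interpretable lexicon score and a contextual estimate. In the sketch below, `contextual_negativity` is a hypothetical placeholder for any transformer-based scorer, not a real API:

```python
def lexicon_negativity(tokens, negative_terms):
    """Transparent lexicon feature: fraction of negative tokens."""
    return sum(1 for t in tokens if t in negative_terms) / max(1, len(tokens))

def contextual_negativity(text):
    """Hypothetical stand-in for a transformer model's negativity probability."""
    return 0.6  # placeholder constant for illustration only

def hybrid_negativity(text, tokens, negative_terms, alpha=0.5):
    """Blend the two estimates; alpha weights the interpretable component."""
    lex = lexicon_negativity(tokens, negative_terms)
    ctx = contextual_negativity(text)
    return alpha * lex + (1 - alpha) * ctx

NEG = {"corrupt", "lying"}
print(hybrid_negativity("they are corrupt", ["they", "are", "corrupt"], NEG))
```

The blend weight `alpha` makes the interpretability trade-off explicit: raising it favors the auditable lexicon component, lowering it favors the contextual model.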
13. Limitations
Several limitations should be emphasized.
First, the model measures language associated with IPC, not the full internal
cognitive process. A text can be strategically identity-laden without
reflecting sincere identity protection, and a person can engage in IPC
privately without expressing it overtly. The score is therefore an observable
proxy, not a direct psychological readout.
Second, context matters. The same phrase
may be delegitimizing in one setting and ironic or quoted in another. Semantic
ambiguity is a major challenge, especially in social-media discourse where
users rely on memes, sarcasm, and insider references. Misclassification is
therefore inevitable unless models are adapted to domain and community.
Third, lexicon quality strongly affects
results. Hand-built dictionaries improve interpretability but may encode
researcher bias or miss minority dialects, multilingual variants, and
platform-specific slang. Continuous revision and annotated validation sets are
necessary.
Fourth, interaction metadata are not always
available. Without reply structure, network polarization must be approximated
or omitted. This limits the ability to distinguish isolated identity
expressions from self-reinforcing conversational clusters.
Fifth, cross-platform and cross-cultural
generalization is uncertain. Identity markers on X, Reddit, YouTube, Finnish
Facebook groups, or parliamentary transcripts differ substantially. A model
trained in one environment may perform poorly in another. IPC measurement
should therefore be treated as domain-adaptive.
Finally, ethical concerns matter. Labeling
discourse as identity-protective can itself become a political act. Researchers
should avoid using IPC scores as tools for moral condemnation or ideological
profiling. The instrument is most defensible when used comparatively,
transparently, and with explicit uncertainty.
14. Future Research
Future research can extend the framework in
at least three directions. The first involves large language models (LLMs).
Transformer encoders and instruction-tuned models may help detect subtle
identity cues, infer stance toward group-relevant propositions, and summarize
escalation patterns in threads. However, their outputs must be anchored to
transparent validation because LLM judgments can drift and may import
training-data biases.
The second direction is richer
argumentation analysis. Current prototypes often use lexical proxies for
rigidity and delegitimization. A next-generation system could integrate
argument-mining modules that identify claims, premises, rebuttals, concessions,
and fallacious moves. This would allow IPC to be modeled not only as a style of
language but as a structure of interaction with evidence and counterargument.
The third direction is agent-based
simulation. Once IPC components are estimated empirically, they can be used to
parameterize models of discussion dynamics. Agents could vary in identity
centrality, openness to evidence, network exposure, and moralization tendency.
Simulations would then help study when mixed networks reduce IPC, when they
intensify it, and how platform design shapes escalation.
Additional work should also address
multilingual adaptation, domain-specific lexicon induction, and psychometric
validation against survey measures of identity centrality or affective
polarization. The long-term goal is a family of interoperable IPC instruments
rather than a single universal score.
15. References
Campos, N., Kinder, D. R., & Orr, L. V.
(2025). A new measure of affective polarization. American Political Science
Review.
Christensen, J., Baekgaard, M., Dahlmann,
C. M., Mathiasen, A., Moynihan, D. P., & Petersen, M. B. (2024). Motivated
reasoning and policy information: Politicians are more resistant to debiasing
interventions than the general public. Behavioural Public Policy, 8(4),
845–867.
Ditto, P. H., Liu, B. S., Clark, C. J.,
Wojcik, S. P., Chen, E. E., Grady, R. H., ... & Zinger, J. F. (2025).
Partisan bias in political judgment. Annual Review of Psychology, 76, 347–375.
Fulgoni, D., Carpenter, J., Ungar, L.,
& Preoţiuc-Pietro, D. (2016). An empirical exploration of moral foundations
theory in partisan news sources. In Proceedings of LREC 2016.
Kahan, D. M. (2013). Ideology, motivated
reasoning, and cognitive reflection. Judgment and Decision Making, 8(4),
407–424.
Kahan, D. M., Braman, D., Gastil, J.,
Slovic, P., & Mertz, C. K. (2007). Culture and identity-protective
cognition: Explaining the white-male effect in risk perception. Journal of
Empirical Legal Studies, 4(3), 465–505.
Kahan, D. M., Jenkins-Smith, H., &
Braman, D. (2011). Cultural cognition of scientific consensus. Journal of Risk
Research, 14(2), 147–174.
Kiil, F. (2025). Motivated political
reasoning: Testing the emotion regulation account in the case of perceptual
divides over politically relevant facts. Politics and the Life Sciences, 44(1).
Lapesa, G., Al Khatib, K., Ajjour, Y.,
Daxenberger, J., & Wachsmuth, H. (2024). Mining, assessing, and improving
arguments in NLP and the social sciences. In Proceedings of LREC-COLING 2024
Tutorials.
Ruiz-Dolz, R., Fuentes, C., Pardo, T. A. S., Cabrio, E., Villata, S., et al. (2024). Overview of DialAM-2024: Argument mining in natural language discussions. In Proceedings of the 11th Workshop on Argument Mining.
Stagnaro, M. N., Suzuki, A., Yudkin, D.,
Christakis, N. A., & Rand, D. G. (2025). Factual knowledge can reduce
attitude polarization. Nature Communications, 16, Article 58697.
Takikawa, H., & Sakamoto, H. (2017).
Moral foundations of political discourse: Comparative analysis of the speech
records of the US Congress and the Japanese Diet. arXiv preprint
arXiv:1704.06903.
pandas development team. (2026). pandas
documentation. PyData.
spaCy team. (2026). spaCy usage and API
documentation.
scikit-learn developers. (2026).
scikit-learn documentation.
Rehurek, R., & Sojka, P. (2026). gensim
documentation.
Wolf, T., Debut, L., Sanh, V., et al.
(2026). Transformers documentation. Hugging Face.
Hagberg, A., Schult, D., & Swart, P.
(2026). NetworkX documentation.
Hunter, J. D., & matplotlib development
team. (2026). Matplotlib documentation.
Appendix A. Complete Python Script for IPC Analysis
The appendix reproduces the project code used for the research prototype, organized into five modules (preprocessing, feature extraction, scoring, visualization, and the main pipeline) totaling roughly 350 lines.
# ===== preprocessing.py =====
"""Preprocessing utilities
for the IPC analyzer."""
from __future__ import annotations
import re
from typing import Iterable, List
import pandas as pd
try:
import spacy
except ImportError: # pragma: no cover
spacy = None
URL_RE =
re.compile(r"https?://\S+|www\.\S+")
NONWORD_RE =
re.compile(r"[^a-zA-ZäöåÄÖÅ0-9\s']+")
def normalize_text(text: str) -> str:
"""Normalize whitespace, remove URLs, and lowercase
text."""
if not isinstance(text, str):
return ""
text = URL_RE.sub(" ", text)
text = NONWORD_RE.sub(" ", text)
text = re.sub(r"\s+", " ", text).strip().lower()
return text
def load_spacy_model(model_name: str =
"en_core_web_sm"):
"""Load a spaCy model lazily."""
if spacy is None:
raise ImportError("spaCy is not installed.")
return spacy.load(model_name, disable=["ner"])
def tokenize_and_lemmatize(texts:
Iterable[str], nlp=None) -> List[List[str]]:
"""Tokenize and lemmatize texts using
spaCy."""
created = False
if nlp is None:
nlp = load_spacy_model()
created = True
docs = nlp.pipe(texts, batch_size=32)
output: List[List[str]] = []
for doc in docs:
lemmas = [
tok.lemma_.lower().strip()
for tok in doc
if not tok.is_space and not tok.is_punct and not tok.is_stop
]
output.append([t for t in lemmas if t])
if created:
del nlp
return output
def prepare_dataframe(df: pd.DataFrame,
text_col: str = "text") -> pd.DataFrame:
"""Add normalized text column."""
out = df.copy()
out["text_clean"] =
out[text_col].fillna("").map(normalize_text)
return out
# ===== features.py =====
"""Feature extraction for
IPC measurement."""
from __future__ import annotations
from collections import Counter
from typing import Dict, Iterable, List
import numpy as np
import pandas as pd
IDENTITY_TERMS = {
"our", "us", "we", "real",
"people", "patriots", "patriotic",
"citizens",
"camp", "side", "nation",
"finns", "freedom"
}
OUTGROUP_TERMS = {
"elite", "elites", "globalist",
"globalists", "lobby", "followers",
"their",
"them", "you", "your", "experts"
}
DELEGITIMIZERS = {
"brainwashed", "corrupt", "traitors",
"naive", "propaganda", "lying",
"lies", "scheme", "obedient"
}
MORAL_TERMS = {
"betrayal", "freedom", "duty",
"patriotic", "corrupt", "truth",
"shame",
"traitors", "ordinary"
}
RIGIDITY_TERMS = {
"always", "never", "only",
"done", "know", "obvious", "everyone"
}
HEDGE_TERMS = {"maybe",
"perhaps", "might", "could",
"sometimes", "possibly"}
def token_counter(tokens: Iterable[str])
-> Counter:
return Counter(tokens)
def count_lexicon_hits(tokens: List[str],
lexicon: set[str]) -> int:
return sum(1 for tok in tokens if tok in lexicon)
def lexical_overlap(tokens_a: List[str],
tokens_b: List[str]) -> float:
if not tokens_a or not tokens_b:
return 0.0
a, b = set(tokens_a), set(tokens_b)
return len(a & b) / max(1, len(a | b))
def compute_feature_row(tokens: List[str],
reply_tokens: List[str] | None = None) -> Dict[str, float]:
"""Compute lexical IPC features for a tokenized
document."""
reply_tokens = reply_tokens or []
n_tokens = max(1, len(tokens))
identity_density = count_lexicon_hits(tokens, IDENTITY_TERMS) / n_tokens
outgroup_density = count_lexicon_hits(tokens, OUTGROUP_TERMS) / n_tokens
delegit_density = count_lexicon_hits(tokens, DELEGITIMIZERS) / n_tokens
moral_density = count_lexicon_hits(tokens, MORAL_TERMS) / n_tokens
rigidity_density = count_lexicon_hits(tokens, RIGIDITY_TERMS) / n_tokens
hedge_density = count_lexicon_hits(tokens, HEDGE_TERMS) / n_tokens
overlap = lexical_overlap(tokens, reply_tokens)
argument_rigidity = np.clip((rigidity_density * 1.2) + (1.0 - overlap) *
0.4 - hedge_density * 0.5, 0, 1)
return {
"n_tokens": float(n_tokens),
"identity_density": float(identity_density),
"outgroup_density": float(outgroup_density),
"delegit_density": float(delegit_density),
"moral_density": float(moral_density),
"rigidity_density": float(rigidity_density),
"hedge_density": float(hedge_density),
"reply_overlap": float(overlap),
"argument_rigidity": float(argument_rigidity),
}
def normalize_columns(df: pd.DataFrame,
cols: List[str]) -> pd.DataFrame:
out = df.copy()
for col in cols:
values = out[col].astype(float).to_numpy()
vmin, vmax = values.min(), values.max()
if vmax == vmin:
out[col + "_norm"] = 0.0
else:
out[col + "_norm"] = (values - vmin) / (vmax - vmin)
return out
# ===== ipc_score.py =====
"""Compute IPC scores from
extracted features."""
from __future__ import annotations
from dataclasses import dataclass
from typing import Dict
import numpy as np
import pandas as pd
@dataclass
class IPCWeights:
identity: float = 0.30
sentiment: float = 0.20
rigidity: float = 0.25
delegitimization: float = 0.25
def as_dict(self) -> Dict[str, float]:
return {
"identity": self.identity,
"sentiment": self.sentiment,
"rigidity": self.rigidity,
"delegitimization": self.delegitimization,
}
def rescale(series: pd.Series) ->
pd.Series:
vals = series.astype(float)
vmin, vmax = vals.min(), vals.max()
if vmax == vmin:
return pd.Series(np.zeros(len(vals)), index=series.index)
return (vals - vmin) / (vmax - vmin)
def compute_ipc_score(df: pd.DataFrame,
weights: IPCWeights | None = None) -> pd.DataFrame:
"""Combine component features into a composite IPC
score."""
weights = weights or IPCWeights()
out = df.copy()
out["identity_component"] =
rescale(out["identity_density"] + 0.5 *
out["outgroup_density"])
out["sentiment_component"] =
rescale(out["sentiment_negativity"])
out["rigidity_component"] =
rescale(out["argument_rigidity"])
out["delegitimization_component"] =
rescale(out["delegit_density"] + 0.35 *
out["moral_density"])
out["ipc_score"] = (
weights.identity * out["identity_component"]
+ weights.sentiment * out["sentiment_component"]
+ weights.rigidity * out["rigidity_component"]
+ weights.delegitimization * out["delegitimization_component"]
).round(4)
return out
def aggregate_by_speaker(df: pd.DataFrame)
-> pd.DataFrame:
return (
df.groupby(["speaker", "camp"], as_index=False)
.agg(
mean_ipc=("ipc_score", "mean"),
max_ipc=("ipc_score", "max"),
messages=("message_id", "count"),
)
.sort_values("mean_ipc", ascending=False)
)
# ===== visualization.py =====
"""Visualization helpers for
the IPC analyzer."""
from __future__ import annotations
from pathlib import Path
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import seaborn as sns
def save_histogram(df: pd.DataFrame,
out_path: str | Path) -> None:
plt.figure(figsize=(8, 5))
plt.hist(df["ipc_score"], bins=8)
plt.xlabel("IPC score")
plt.ylabel("Count")
plt.title("Distribution of IPC scores")
plt.tight_layout()
plt.savefig(out_path, dpi=160)
plt.close()
def save_scatter(df: pd.DataFrame,
out_path: str | Path) -> None:
plt.figure(figsize=(8, 5))
sns.scatterplot(
data=df,
x="identity_density",
y="delegit_density",
hue="camp",
size="ipc_score",
sizes=(50, 250),
)
plt.title("Identity density vs. outgroup delegitimization")
plt.tight_layout()
plt.savefig(out_path, dpi=160)
plt.close()
def save_network(df: pd.DataFrame,
out_path: str | Path) -> None:
g
= nx.DiGraph()
speaker_ipc =
df.groupby("speaker")["ipc_score"].mean().to_dict()
for _, row in df.iterrows():
g.add_node(row["speaker"],
ipc=speaker_ipc.get(row["speaker"], 0.0))
lookup =
df.set_index("message_id")["speaker"].to_dict()
for _, row in df.iterrows():
if row["reply_to"]:
src = row["speaker"]
dst = lookup.get(int(row["reply_to"]))
if dst:
g.add_edge(src, dst)
plt.figure(figsize=(7, 6))
pos = nx.spring_layout(g, seed=42)
node_sizes = [900 + 1800 * g.nodes[n]["ipc"] for n in g.nodes]
nx.draw_networkx(g, pos=pos, with_labels=True, node_size=node_sizes,
arrows=True)
plt.title("Speaker reply network sized by mean IPC")
plt.axis("off")
plt.tight_layout()
plt.savefig(out_path, dpi=160)
plt.close()
# ===== main.py =====
"""Main pipeline for the IPC
analyzer prototype."""
from __future__ import annotations
from pathlib import Path
import pandas as pd
from features import compute_feature_row
from ipc_score import aggregate_by_speaker,
compute_ipc_score
from preprocessing import
prepare_dataframe, tokenize_and_lemmatize
from visualization import save_histogram,
save_network, save_scatter
try:
from nltk.sentiment import SentimentIntensityAnalyzer
except ImportError: # pragma: no cover
SentimentIntensityAnalyzer = None
PROJECT_DIR =
Path(__file__).resolve().parent
DATA_PATH = PROJECT_DIR / "data"
/ "mock_discussion.csv"
OUTPUT_PATH = PROJECT_DIR /
"data" / "mock_discussion_scored.csv"
FIG_DIR = PROJECT_DIR / "data" /
"figures"
FIG_DIR.mkdir(exist_ok=True)
NEGATIVE_CUES = {"lying",
"propaganda", "brainwashed", "corrupt",
"traitors", "naive", "shame",
"despises"}
def fallback_negativity(text: str) ->
float:
tokens = text.split()
return sum(1 for t in tokens if t in NEGATIVE_CUES) / max(1,
len(tokens))
def estimate_sentiment(df: pd.DataFrame)
-> pd.Series:
if SentimentIntensityAnalyzer is None:
return df["text_clean"].map(fallback_negativity)
try:
sia = SentimentIntensityAnalyzer()
neg = df["text_clean"].map(lambda t:
sia.polarity_scores(t)["neg"])
return neg
except Exception:
return df["text_clean"].map(fallback_negativity)
def main() -> None:
df = pd.read_csv(DATA_PATH)
df = prepare_dataframe(df, "text")
token_lists =
tokenize_and_lemmatize(df["text_clean"].tolist())
reply_lookup = {int(row.message_id): token_lists[i] for i, row in
enumerate(df.itertuples())}
features = []
for i, row in enumerate(df.itertuples()):
reply_tokens = []
if getattr(row, "reply_to"):
try:
reply_tokens =
reply_lookup.get(int(row.reply_to), [])
except Exception:
reply_tokens = []
f = compute_feature_row(token_lists[i], reply_tokens)
features.append(f)
feature_df = pd.DataFrame(features)
df = pd.concat([df, feature_df], axis=1)
df["sentiment_negativity"] = estimate_sentiment(df)
df = compute_ipc_score(df)
speaker_table = aggregate_by_speaker(df)
df.to_csv(OUTPUT_PATH, index=False)
speaker_table.to_csv(PROJECT_DIR / "data" /
"speaker_summary.csv", index=False)
save_histogram(df, FIG_DIR / "ipc_histogram.png")
save_scatter(df, FIG_DIR / "ipc_scatter.png")
save_network(df, FIG_DIR / "ipc_network.png")
print("Saved scored discussion to:", OUTPUT_PATH)
print("Top speaker means:")
print(speaker_table.to_string(index=False))
if __name__ == "__main__":
main()
Python Project for IPC Analyzer
Recommended project directory:
ipc_analyzer/
├── data/
├── scripts/
├── models/
├── notebooks/
├── main.py
├── ipc_score.py
├── preprocessing.py
├── features.py
├── visualization.py
└── requirements.txt
File responsibilities
| File | Purpose |
|------|---------|
| main.py | Runs the full analysis pipeline, scores the mock discussion, saves CSV outputs, and renders figures. |
| preprocessing.py | Text normalization plus tokenization and lemmatization support via spaCy. |
| features.py | Lexical feature extraction for identity signals, moralization, delegitimization, and rigidity. |
| ipc_score.py | Weight specification, normalization helpers, and final IPC score computation. |
| visualization.py | Histogram, scatterplot, and reply-network visualization functions. |
| requirements.txt | Lists Python dependencies for environment setup. |
requirements.txt
pandas>=2.2
numpy>=2.0
nltk>=3.9
spacy>=3.7
scikit-learn>=1.5
gensim>=4.3
transformers>=4.49
torch>=2.6
networkx>=3.4
matplotlib>=3.10
seaborn>=0.13
Running the project in Visual Studio Code
1. Open the ipc_analyzer folder in Visual
Studio Code.
2. Create a virtual environment: python -m
venv .venv
3. Activate it. Windows PowerShell: .\.venv\Scripts\Activate.ps1; macOS/Linux: source .venv/bin/activate
4. Install dependencies: pip install -r
requirements.txt
5. Download the spaCy English model: python
-m spacy download en_core_web_sm
6. Run the pipeline: python main.py
7. Inspect outputs in
data/mock_discussion_scored.csv, data/speaker_summary.csv, and data/figures/.
Using the analyzer with social-media data
Replace the mock CSV with exported posts or
comments containing at minimum message_id, speaker, text, and reply_to columns.
Add platform-specific lexicons for party
labels, identity slogans, hashtags, and insult vocabularies.
Where available, preserve reply structure
and timestamps so IPC can be studied dynamically across threads.
Validate the score on a manually annotated
subset before scaling to a large corpus.
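Validation against a hand-annotated subset can be sketched with pandas' built-in rank correlation; the model scores and human labels below are invented:

```python
import pandas as pd

# Invented model scores and human annotations (0-2 IPC intensity)
# for six messages.
validation = pd.DataFrame({
    "ipc_score": [0.12, 0.25, 0.40, 0.55, 0.63, 0.80],
    "human_label": [0, 0, 1, 1, 2, 2],
})

# Spearman rank correlation is robust to the score's arbitrary scale.
rho = validation["ipc_score"].corr(validation["human_label"], method="spearman")
print(round(rho, 3))
```

A strong rank correlation on the annotated subset gives some license to apply the score at scale; a weak one signals that lexicons or weights need revision first.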
Saving IPC scores to CSV
df_scored.to_csv('data/my_corpus_scored.csv', index=False)
speaker_summary.to_csv('data/my_speaker_summary.csv', index=False)