Computational Modeling of Identity Protective Cognition in Text Data: A Python-Based Framework for Detection, Quantification, and Visualization

 


A research article and implementation framework for computational social science

Prepared for Word (.docx) delivery

Author: OpenAI research drafting assistant

Date: 12 March 2026

 

This article is a research prototype. The simulated empirical section uses a mock dataset for demonstration.


 

Abstract

Identity Protective Cognition (IPC) refers to the tendency of individuals to process information in ways that protect identities tied to valued groups, communities, and status systems. While the concept emerged in cultural cognition research, it now has direct relevance for computational social science because identity-protective dynamics leave measurable traces in language. This article develops a practical framework for operationalizing IPC from text data such as social-media posts, political exchanges, blog comments, and religious controversies. The proposed approach integrates lexical, semantic, sentiment, stance, and network features into a composite IPC score that estimates the degree to which discourse is organized around identity defense rather than open-ended evidence evaluation. The paper reviews the theoretical foundations of IPC, motivated reasoning, cultural cognition, and affective polarization; translates these ideas into measurable textual indicators; and maps each stage of the pipeline to widely used Python libraries, including pandas, NumPy, NLTK, spaCy, scikit-learn, gensim, transformers, NetworkX, matplotlib, and seaborn. A mock dataset and simulated analysis illustrate how IPC can be quantified across speakers and interactions. The article argues that IPC is not reducible to sentiment alone: it is better understood as a patterned conjunction of identity signals, outgroup derogation, moral framing, argumentative rigidity, and polarized interaction structure. The framework is intended as a transparent research prototype rather than a final psychometric instrument, and the article closes by outlining validity challenges, limitations, and future directions involving large language models, argument mining, and agent-based simulation.

Keywords: Identity Protective Cognition; motivated reasoning; cultural cognition; natural language processing; polarization; argument mining; computational social science

1. Introduction

Identity Protective Cognition (IPC) is a useful construct for explaining why the same factual record can generate radically different interpretations across social groups. In the original cultural-cognition literature, the key claim was not merely that citizens lack information, but that information itself is filtered through identity-relevant commitments. Kahan and colleagues argued that people are often motivated to form beliefs that align with the values and affiliations of groups important to their social standing and self-conception. In this view, cognition is not simply a tool for discovering external reality; it is also a mechanism for preserving membership, status, and solidarity within a social world. That insight helps explain why increased sophistication can sometimes amplify rather than reduce polarization when contested issues become identity markers (Kahan et al., 2007; Kahan, 2013).

IPC matters for political polarization because modern disputes rarely remain limited to the literal issue at hand. Questions about climate, vaccines, immigration, religion, taxation, education, or gender can become signals of “who is with us” and “who is against us.” Once this happens, participants often treat evidence as socially loaded. Accepting a claim may feel like yielding status to an outgroup, while rejecting a claim may function as a public badge of loyalty. Recent political-psychology research continues to show that partisan bias and motivated interpretation are structured by ingroup favoritism and identity-linked goals rather than simple ignorance (Ditto, 2025). IPC is therefore relevant not only to attitude formation but also to discussion dynamics, where language becomes a vehicle for identity alignment, norm enforcement, and outgroup exclusion.

The rise of digital text corpora makes it possible to study these processes computationally. Social-media threads, blog comment sections, parliamentary debates, podcasts, and online forums generate vast language traces that reveal how identities are invoked, defended, moralized, and weaponized. Computational social science has matured to the point where lexical analysis, topic modeling, transformer-based inference, and graph methods can be combined into a unified pipeline. Yet IPC itself has not often been converted into a transparent, modular measurement framework suitable for large-scale text analysis. Existing work frequently focuses on adjacent constructs such as ideological bias, stance, affective polarization, or misinformation susceptibility. Those constructs matter, but IPC adds a distinct emphasis: the protection of identity as a cognitive objective.

This article develops a research-grade but practically implementable framework for modeling IPC in text data. The contribution is threefold. First, it synthesizes literature from cultural cognition, motivated reasoning, political psychology, and argumentation research into a measurable concept. Second, it translates IPC into textual indicators that can be extracted with mainstream Python tools. Third, it proposes an IPC score that combines identity signal density, sentiment polarization, argumentative rigidity, and outgroup delegitimization, supplemented by moral framing and network structure where data permit. The goal is not to claim that any single score fully captures the psychological phenomenon, but to provide a transparent starting point for empirical work. Because the method uses explicit features rather than a black-box label alone, it supports interpretability, replication, and iterative validation.

The paper is designed for researchers who work at the intersection of psychology, computational social science, and natural language processing (NLP). It presents the theory, the computational pipeline, a mathematical formulation, Python implementation examples, a mock dataset, simulated results, and a full project template. The broader ambition is methodological: to create a bridge between psychologically rich theory and scalable text analysis.

2. Theoretical Framework

IPC sits at the intersection of several theoretical traditions. The first is motivated reasoning. In the broadest sense, motivated reasoning refers to information processing shaped by directional goals rather than accuracy goals alone. People do not merely ask what is true; they also ask, often implicitly, what belief will preserve coherence, belonging, esteem, or desired emotion. Recent reviews of partisan bias emphasize that political judgment often reflects ingroup favoritism in what people seek out, believe, and remember (Ditto, 2025). IPC can be understood as a subtype of motivated reasoning in which the relevant motive is identity protection.

The second tradition is cultural cognition. Kahan’s work argued that individuals appraise risks, evidence, and expertise partly through cultural worldviews tied to preferred forms of social organization. The important move here is from “beliefs as private opinions” to “beliefs as social signals.” Under this perspective, apparently factual disputes can become entangled with deeper cultural meanings. The famous “white male effect” paper framed identity-protective cognition as a mechanism through which people dismiss asserted dangers that threaten identities linked to hierarchical or individualistic values (Kahan et al., 2007). Later work on scientific consensus extended this argument by showing that deference to expertise itself can become culturally coded (Kahan et al., 2011).

A third relevant concept is epistemic identity. Epistemic identity refers to the way individuals come to define themselves, and to be recognized by others, through styles of knowing: skeptical, orthodox, anti-elite, scientific, contrarian, patriotic, traditionalist, and so forth. In many online environments, these identities are publicly performed through repeated rhetorical choices. Identity-protective cognition is easier to detect when epistemic identity is linguistically salient: speakers say not only “this is false,” but “people like us know this is false,” or “only gullible elites believe that.” Such expressions connect content to community membership.

A fourth distinction concerns normative versus descriptive rationality. Normative accounts ask how agents should reason under standards such as coherence, probability, or evidence integration. Descriptive accounts ask how agents actually reason under cognitive and social constraints. IPC belongs primarily to descriptive explanation. It does not imply that people are irrational in every practical sense. From a local social perspective, preserving group belonging may be instrumentally rational. Publicly endorsing outgroup-favored claims can carry status costs, relationship costs, or identity dissonance. Thus, a computational model of IPC should not treat every strong opinion as a defect. Instead, it should identify patterns where discourse appears more oriented toward identity maintenance than toward flexible evidence updating.

Finally, IPC must be distinguished from mere negative sentiment. A speaker can be angry without engaging in identity defense, and a speaker can exhibit IPC in relatively polite language. The distinguishing signature is patterned alignment between group identity, selective evaluation, and the treatment of disagreement as a threat to the social self. This is why a useful model must combine multiple dimensions rather than depend on a single classifier.

3. Literature Review

The literature on IPC is dispersed across psychology, law, risk perception, political behavior, and public communication. Foundational work by Kahan and colleagues positioned identity protection as a central mechanism behind disagreements over contested facts, especially when issues threaten values embedded in group life (Kahan et al., 2007). In later work on scientific consensus, Kahan and colleagues argued that deference to expertise is itself filtered by cultural meanings, so that appeals to “what science says” do not operate in a social vacuum (Kahan et al., 2011). These contributions established a crucial theoretical point: cognition is socially situated, and identity defense can shape what counts as credible evidence.

The broader motivated-reasoning literature supports this general picture while also debating mechanisms. Some work emphasizes directional goals and selective information processing; other work stresses emotional regulation, selective exposure, or memory biases. Recent syntheses conclude that partisan judgment robustly exhibits ingroup favoritism and directional bias, even though scholars continue to debate boundary conditions and causal pathways (Ditto, 2025). Experimental work has also examined whether motivated political reasoning is better explained by emotion-regulation goals or by broader identity and belief-maintenance dynamics (Kiil, 2025). For present purposes, the main implication is that the measurement of IPC should allow for affect, but not collapse identity protection into affect alone.

Political psychology contributes the concept of affective polarization: citizens increasingly dislike, distrust, and morally condemn opposing camps. Newer work conceptualizes affective polarization as multidimensional, involving othering, aversion, and moralization rather than a single feeling thermometer (Campos et al., 2025). This is highly relevant to IPC because outgroup delegitimization is one of the clearest textual manifestations of identity defense. When opponents are described as evil, brainwashed, traitorous, or subhuman, disagreement is re-coded as a moral and identity conflict rather than a disagreement over evidence.

Argumentation research and NLP add methodological tools for operationalization. Recent overviews of dialogical argument mining show increasing interest in extracting claims, premises, relations, and dialogic functions from debates and conversations (Ruiz-Dolz et al., 2024; Lapesa et al., 2024). Although argument-mining systems rarely target IPC directly, they are valuable because argumentative rigidity and low responsiveness to counter-evidence are detectable through discourse structure. If a participant repeatedly restates identity-congruent claims, ignores objections, and escalates delegitimization, those patterns can be encoded as features.

Moral-framing research also matters. Computational studies using moral-foundation lexicons have shown that ideological discourse differs not only in topic but in moral-emotional emphasis and framing (Fulgoni et al., 2016; Takikawa & Sakamoto, 2017). IPC is often intensified by moralization because once an issue is tied to sanctity, betrayal, oppression, purity, or loyalty, evidence evaluation becomes harder to separate from moral self-positioning. A language model of IPC should therefore attend to moral frames, particularly when they track ingroup defense and outgroup suspicion.

Recent empirical work complicates any simplistic assumption that more knowledge necessarily leads to more polarization. A 2025 study in Nature Communications reported that factual knowledge can reduce, rather than increase, attitude polarization under some conditions (Stagnaro et al., 2025). This is a valuable caution. IPC is not destiny, and epistemic interventions can sometimes soften polarization. The implication for measurement is methodological humility: an IPC score should be treated as a conditional estimate of identity-protective language use in a given context, not as a stable essence of a person or ideology.

Overall, the literature supports a layered measurement strategy. Foundational psychology explains why identity protection should be expected. Political-psychology research identifies its links to partisanship, affective polarization, and disagreement. NLP and argument mining provide operational tools. The remaining challenge is to convert this interdisciplinary synthesis into a reproducible analytic pipeline.

4. Operationalizing IPC in Text Data

Operationalizing IPC in text data requires translating a psychologically rich construct into observable linguistic and relational indicators. Because IPC is multidimensional, no single feature is sufficient. The most defensible strategy is a composite model that captures several partially independent signals.

The first family of indicators concerns identity signals. These are lexical or semantic markers that explicitly invoke group belonging, status, loyalty, tradition, ideology, or shared epistemic style. Examples include expressions such as “people like us,” “real conservatives,” “true believers,” “our side,” “their agenda,” “the scientific elite,” or “faithful Christians.” Identity signals may also be indirect, such as repeated references to in-group media brands, authorities, or slogans. In computational terms, identity signal density can be estimated through dictionaries, pattern matching, contextual embeddings, or supervised classifiers.
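As a minimal sketch of the dictionary and pattern-matching option, identity signal density can be estimated with a small regex lexicon. The phrases below are illustrative placeholders, not a validated lexicon; a real study would curate and test lists for its corpus and language.

```python
import re

# Illustrative identity patterns (hypothetical; curate and validate in practice).
IDENTITY_PATTERNS = [
    r"\bpeople like us\b",
    r"\breal (?:conservatives|patriots|believers)\b",
    r"\bour (?:side|people|values)\b",
    r"\btheir (?:agenda|elite|media)\b",
]

def identity_signal_density(text: str, per_tokens: int = 100) -> float:
    """Matched identity phrases per `per_tokens` whitespace tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    lowered = text.lower()
    hits = sum(len(re.findall(pattern, lowered)) for pattern in IDENTITY_PATTERNS)
    return per_tokens * hits / len(tokens)
```

Contextual embeddings or a supervised classifier can replace the regex layer without changing this interface.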

The second family concerns group references and boundary language. IPC often appears when speakers draw strong boundaries between “us” and “them,” accompanied by trust asymmetries: our side is honest, their side is corrupt; our group sees reality, their group is deluded. Pronoun patterns, named-group mentions, and co-occurrence structures can reveal this boundary work. Named entity recognition and dependency parsing are useful here because they help distinguish generic nouns from identity-bearing references.

The third family concerns moral framing. Many identity conflicts are not argued in purely empirical terms. Instead, speakers mobilize moralized language about betrayal, corruption, oppression, purity, duty, loyalty, or freedom. Moral framing is important because it transforms disagreement into a threat to what the speaker considers sacred or non-negotiable. Lexicon-based approaches can provide transparent baselines, while transformer-based classifiers can detect broader framing beyond handcrafted dictionaries.

The fourth family concerns argumentative rigidity. Rigidity is visible when speakers show low responsiveness to counter-arguments, repeat slogans, rely on certainty markers, and dismiss alternative evidence without engagement. In threaded discussions, rigidity can be estimated through low lexical uptake from opponents, repeated reuse of the speaker’s own talking points, elevated certainty language, low hedge frequency, and weak claim-revision over time. Argument-mining tools can detect claim-premise structure, but simpler proxies—such as repetition rates and counter-evidence rejection markers—are often good starting points.
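One of the simpler proxies mentioned above can be sketched as the balance of certainty markers against hedges. The marker lists here are hypothetical; repetition rates and opponent-vocabulary uptake would be layered on in a fuller implementation.

```python
# Hypothetical marker lists; a real study would use validated lexicons.
CERTAINTY_MARKERS = {"obviously", "undeniably", "always", "never", "certainly"}
HEDGE_MARKERS = {"maybe", "perhaps", "might", "possibly", "arguably"}

def rigidity_proxy(tokens):
    """Share of certainty markers among all certainty + hedge markers.
    Returns the neutral midpoint 0.5 when neither kind appears."""
    certainty = sum(t in CERTAINTY_MARKERS for t in tokens)
    hedges = sum(t in HEDGE_MARKERS for t in tokens)
    if certainty + hedges == 0:
        return 0.5
    return certainty / (certainty + hedges)
```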

The fifth family concerns outgroup delegitimization. This is especially diagnostic because it links disagreement to identity threat. Delegitimizing language includes insults, epistemic dismissal (“brainwashed,” “NPC,” “sheep”), moral condemnation (“evil,” “traitors”), and accusations of bad faith (“paid shills,” “propaganda bots”). This dimension is closely related to affective polarization, but its relevance to IPC lies in the way it converts disagreement into grounds for exclusion or contempt.
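Because delegitimization is most diagnostic when it targets an identifiable outgroup, a sketch can require a delegitimizing term to occur near an outgroup label. Both term sets below are hypothetical illustrations.

```python
OUTGROUP_LABELS = {"they", "them", "their", "elites"}           # hypothetical
DELEGIT_TERMS = {"brainwashed", "sheep", "traitors", "shills"}  # hypothetical

def delegitimization_score(tokens, window=5):
    """Count delegitimizing terms within `window` tokens of an outgroup
    label, normalized by text length, as a crude target-directedness check."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok in DELEGIT_TERMS:
            neighborhood = tokens[max(0, i - window): i + window + 1]
            if any(t in OUTGROUP_LABELS for t in neighborhood):
                hits += 1
    return hits / max(1, len(tokens))
```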

A sixth, optional family concerns interaction structure. IPC does not only reside within isolated texts; it also appears in networks. Reply graphs, endorsement networks, or co-mention networks can show segregation, clustering, and asymmetric hostility. Network features such as modularity, assortativity, local echo chambers, and bridge scarcity can be incorporated when metadata are available.

The result is a layered construct model. At minimum, IPC in text can be approximated through identity signal density, sentiment and stance polarization, argumentative rigidity, and outgroup delegitimization. Moral framing and network segregation strengthen the estimate when available. This formulation is intentionally modular so that researchers can adapt it to different corpora, languages, and platform constraints.

5. Computational Methodology

The computational pipeline begins with data collection. Suitable sources include public social-media posts, threaded forum discussions, parliamentary speeches, debate transcripts, podcast transcripts, and blog comments. The key requirement is that texts are linked either to speakers or to interaction structure. IPC is most informative when we can observe not just content but alignment, opposition, and repetition across participants. Researchers should preserve metadata such as speaker ID, timestamp, platform, thread ID, and reply targets whenever ethically and legally permissible.

The second step is text preprocessing. Standard cleaning includes lowercasing, URL removal, emoji handling, punctuation normalization, and language filtering. However, preprocessing should not be overly destructive. Group labels, hashtags, capitalization patterns, and certain punctuation cues can be identity-relevant. It is often better to preserve them in parallel columns rather than delete them outright.
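The non-destructive principle can be sketched with pandas: identity-relevant cues are copied into parallel columns before cleaning. The column names are illustrative.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Clean text while preserving identity-relevant cues in parallel columns."""
    out = df.copy()
    # Capture cues before destructive cleaning removes or flattens them.
    out["hashtags"] = out[text_col].str.findall(r"#\w+")
    out["all_caps"] = out[text_col].str.findall(r"\b[A-Z]{3,}\b")
    cleaned = out[text_col].str.replace(r"https?://\S+", "", regex=True)
    out["clean_text"] = (cleaned.str.lower()
                                .str.replace(r"\s+", " ", regex=True)
                                .str.strip())
    return out

df = pd.DataFrame({"text": ["WAKE UP! #patriots see https://example.com"]})
processed = preprocess(df)
```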

The third step is tokenization, followed by lemmatization. Tokenization divides text into units suitable for downstream counting and modeling. Lemmatization reduces inflected forms to base forms, improving dictionary matching and sparsity control. spaCy offers reliable tokenization, part-of-speech tagging, and lemmatization; NLTK remains useful for stopword resources, lexical preprocessing, and lightweight sentiment baselines. For multilingual settings, researchers must carefully choose language models and may need custom lexicons.

The fourth step is feature extraction. At this stage the researcher computes document- or utterance-level features. Examples include word and character n-grams, pronoun ratios, identity dictionary counts, moral lexicon counts, certainty markers, hedge markers, insult lexicon hits, sentiment scores, stance predictions, topic distributions, and embedding-based semantic similarity measures. Scikit-learn is useful for sparse vectorization and classical machine-learning baselines; transformers provide contextual embeddings and zero-shot or fine-tuned classifiers for sentiment, stance, or NLI-style contradiction detection.

The fifth step is identity signal detection. A transparent starting point is a curated dictionary of identity markers: ingroup terms, outgroup labels, worldview references, and loyalty-coded phrases. More advanced options include sentence embeddings with cosine similarity to identity prototypes or supervised learning from human-annotated examples. The detection module should distinguish self-identification, group attribution, and adversarial labeling because these play different roles.

The sixth step is sentiment and stance detection. Sentiment alone is insufficient, but it helps when directed toward groups or disputed topics. A sentence may express negative affect toward an outgroup, admiration toward the ingroup, or contempt toward “neutral” institutions coded as hostile. Stance detection is especially valuable because it captures whether the speaker aligns with, opposes, or questions a proposition relevant to group identity. Transformer pipelines from Hugging Face provide a convenient inference layer for such tasks.

The seventh step is polarization measurement. At the document level, one can calculate lexical divergence between camps, disagreement rates, or sentiment asymmetries. At the interaction level, network features become relevant: who replies to whom, whether communities cluster by shared frames, and whether bridging nodes are rare. NetworkX is appropriate for constructing reply graphs and computing centrality, assortativity, clustering, and community structure proxies.
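A minimal NetworkX sketch of this step, using toy reply edges and camp labels: attribute assortativity above zero suggests that replies stay within camps.

```python
import networkx as nx

# Toy reply graph: nodes are speakers, edges are reply relations.
G = nx.Graph()
G.add_nodes_from([
    ("A", {"camp": "green"}), ("B", {"camp": "green"}),
    ("C", {"camp": "blue"}), ("D", {"camp": "blue"}),
])
G.add_edges_from([("A", "B"), ("C", "D"), ("A", "C")])

# Positive values indicate camp homophily in the reply structure.
assortativity = nx.attribute_assortativity_coefficient(G, "camp")
density = nx.density(G)
```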

The eighth step is IPC score calculation. Here the researcher combines normalized component scores into a composite measure. The combination can be theory-driven, data-driven, or hybrid. Theory-driven weighting improves interpretability; supervised weighting based on human labels improves predictive validity. The score should be retained alongside its subcomponents so that analysts can inspect why a text was assigned a high or low IPC estimate.

The ninth step is validation. At minimum, researchers should perform face validation, inter-annotator agreement on labeled subsets, sensitivity analyses for thresholds and lexicons, and robustness checks across platforms or topics. Without validation, the score remains a heuristic. With validation, it can become a useful research instrument.
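Inter-annotator agreement on a labeled subset can be checked with Cohen's kappa from scikit-learn. The binary high/low-IPC labels below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels (1 = high IPC) from two annotators on ten texts.
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa_score(annotator_a, annotator_b)  # chance-corrected agreement
```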

6. Python Libraries

A practical IPC workflow benefits from a mature and interoperable Python stack. pandas serves as the central tabular backbone for loading, merging, filtering, and exporting text corpora. Its DataFrame abstraction is ideal for managing documents, speakers, timestamps, and feature columns, and it integrates smoothly with plotting and machine-learning tools (pandas documentation).

NumPy provides efficient numerical arrays and vectorized operations. It becomes especially important when combining standardized feature matrices, computing weighted sums, and performing fast transformations on large datasets.

NLTK remains valuable for foundational NLP tasks, including tokenization utilities, stopword lists, lexical resources, and simple sentiment methods such as VADER. While many advanced pipelines now rely on transformer models, NLTK is still useful for transparent preprocessing and educational baselines.

spaCy is particularly strong for production-quality linguistic preprocessing. Its industrial-strength tokenization, lemmatization, part-of-speech tagging, dependency parsing, and named entity recognition make it well suited for extracting structured linguistic cues relevant to IPC, including pronouns, identity nouns, and group entities (spaCy documentation).

scikit-learn supplies core machine-learning infrastructure: vectorizers, feature extraction, train-test pipelines, dimensionality reduction, clustering, and classical classifiers. For IPC research, scikit-learn is especially useful for TF-IDF features, calibration baselines, interpretable linear models, and evaluation workflows (scikit-learn documentation).

gensim is the most natural library in this stack for topic modeling. Latent Dirichlet Allocation (LDA) can help identify whether identity-protective discourse clusters around recurring issue frames or grievance narratives. Topic mixtures are not themselves IPC, but they offer contextual features that improve interpretation (gensim documentation).

transformers, especially through Hugging Face pipelines, provides access to contextual language models for sentiment analysis, sequence classification, feature extraction, and embeddings. This is crucial when identity language is subtle, sarcastic, or context-dependent. Transformer features can also support stance detection and entailment-style reasoning over argumentative claims (Transformers documentation).

NetworkX is essential once discourse is represented as a graph. Nodes can correspond to speakers or messages, and edges can represent replies, endorsements, mentions, or quotations. Network structure is useful because IPC often intensifies in clustered communities with low cross-cutting interaction and high internal reinforcement (NetworkX documentation).

matplotlib and seaborn support visualization. Histograms of IPC scores, scatterplots of moral framing versus delegitimization, and network diagrams all help make results interpretable. Seaborn is particularly useful for statistical visualizations and grouped comparisons, while matplotlib offers full control for publication-ready figures (matplotlib documentation).

Together, these libraries support a modular workflow from raw text to interpretable measurement. The key principle is not to use every library merely because it exists, but to map each library to a clearly defined analytical task.

7. Mathematical Model of IPC

Let each text unit i denote a post, comment, speech turn, or document. We define a composite IPC score on the unit interval:

IPC_i = w1*I_i + w2*S_i + w3*R_i + w4*D_i + w5*M_i + w6*N_i,

subject to w_k ≥ 0 and Σw_k = 1.

Here:

I_i = Identity Signal Density. This captures the relative presence of lexical or semantic markers of in-group identity, out-group identity, loyalty, worldview references, and boundary talk. A simple formulation is the normalized count of identity terms per 100 tokens, optionally adjusted by contextual confidence.

S_i = Sentiment/Stance Polarization. This captures polarized evaluative direction, especially when negative sentiment targets the outgroup or when stance sharply aligns with group-coded positions. A practical version can combine absolute sentiment intensity with target directionality or contradiction scores across camps.

R_i = Argumentative Rigidity. This measures low responsiveness to counter-arguments, repetition of slogans, certainty language, and low hedging. One simple implementation is a weighted combination of self-repetition, certainty-marker frequency, and inverse uptake of opponent vocabulary.

D_i = Outgroup Delegitimization. This measures moral or epistemic disqualification of opponents. It includes insults, attributions of bad faith, dehumanizing labels, and accusations of corruption or stupidity directed at an identifiable outgroup.

M_i = Moral Framing Intensity. This estimates the extent to which the discourse is moralized through loyalty, betrayal, oppression, purity, authority, sanctity, and similar frames.

N_i = Network Polarization Context. This optional component applies when interaction metadata are available. It can include whether the speaker is embedded in a highly clustered, homogeneous, or antagonistic local neighborhood.

In a minimal implementation, M_i and N_i can be omitted and the remaining weights rescaled:

IPC_i = w1*I_i + w2*S_i + w3*R_i + w4*D_i.

All component scores should be normalized to [0,1] before combination. Z-score standardization followed by min-max scaling is a common approach. The weights can be fixed theoretically—for example, w = (0.30, 0.20, 0.25, 0.25)—or estimated from labeled data using regression or probabilistic calibration. For interpretability, the composite score should always be reported together with its component profile.
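The normalization and weighted combination can be sketched with NumPy. The raw component arrays are hypothetical, and the weights follow the theory-driven default stated above.

```python
import numpy as np

def normalize_01(x: np.ndarray) -> np.ndarray:
    """Z-score standardization followed by min-max scaling to [0, 1]."""
    z = (x - x.mean()) / (x.std() + 1e-12)
    rng = z.max() - z.min()
    return (z - z.min()) / rng if rng > 0 else np.zeros_like(z)

# Hypothetical raw component scores for five texts.
components = {
    "I": np.array([0.2, 3.1, 1.0, 4.5, 0.1]),
    "S": np.array([0.5, 0.9, 0.2, 0.8, 0.1]),
    "R": np.array([1.0, 2.0, 0.5, 3.0, 0.0]),
    "D": np.array([0.0, 1.0, 0.2, 2.0, 0.0]),
}
weights = {"I": 0.30, "S": 0.20, "R": 0.25, "D": 0.25}  # theory-driven default
ipc = sum(w * normalize_01(components[k]) for k, w in weights.items())
```

Because the weights sum to one and each normalized component lies in [0, 1], the composite score also lies in [0, 1].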

At the speaker level, IPC can be aggregated over that speaker’s texts:

IPC_s = (1/n_s) * Σ_{i ∈ T_s} IPC_i, where T_s is the set of texts produced by speaker s and n_s = |T_s|.

At the thread level, one can compute mean IPC, variance of IPC, or the proportion of high-IPC turns. These aggregate statistics are useful for comparing platforms, topics, or interventions.
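The speaker- and thread-level aggregations above reduce to pandas groupby operations. The scores and metadata below are hypothetical.

```python
import pandas as pd

# Hypothetical per-message IPC scores with speaker and thread metadata.
df = pd.DataFrame({
    "speaker": ["Aino", "Aino", "Ben", "Ben"],
    "thread_id": [1, 1, 1, 2],
    "ipc": [0.8, 0.6, 0.3, 0.2],
})

speaker_ipc = df.groupby("speaker")["ipc"].mean()     # IPC_s per speaker
thread_stats = df.groupby("thread_id")["ipc"].agg(    # thread-level summaries
    mean_ipc="mean",
    var_ipc="var",
    high_share=lambda s: (s > 0.5).mean(),            # proportion of high-IPC turns
)
```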


Table 1. IPC component definitions

Component | Meaning | Typical signal
I | Identity signal density | ingroup labels, boundary talk, worldview markers
S | Sentiment/stance polarization | negative target-directed affect, categorical stance
R | Argumentative rigidity | certainty markers, repetition, low uptake of counter-arguments
D | Outgroup delegitimization | insults, bad-faith accusations, epistemic dismissal
M | Moral framing intensity | loyalty, betrayal, purity, corruption, duty
N | Network polarization context | clustered reply patterns, low bridging, assortative ties

 

8. Python Implementation

The implementation below is designed as a transparent prototype. It reads a tabular corpus, preprocesses text, computes lexical features, estimates sentiment, infers coarse topics, builds an interaction graph, and calculates IPC scores. The code is modular rather than fully optimized because research transparency is prioritized over platform-specific engineering.

A typical workflow begins by loading a CSV file with columns such as message_id, speaker, text, reply_to, and camp. Preprocessing uses regex cleaning and spaCy lemmatization. Identity features are computed from hand-built dictionaries containing ingroup terms, outgroup labels, delegitimizing expressions, moral markers, and rigidity markers. Sentiment is estimated using a lightweight method for demonstration, although transformer-based sentiment analysis can be swapped in with minimal changes. Topic modeling uses gensim’s LDA implementation over tokenized text. The interaction graph uses NetworkX to connect replies between speakers or messages.

The resulting DataFrame contains component columns and a final IPC score. Because all intermediate variables remain visible, the analyst can inspect false positives, adjust lexicons, or replace modules with better models. This makes the framework suitable for iterative research design.

8.1 Reading tabular text data

import pandas as pd

 

df = pd.read_csv('data/mock_discussion.csv')

print(df[['message_id', 'speaker', 'text']].head())

8.2 Tokenization and lemmatization with spaCy

import spacy

 

# NER is disabled here for speed; enable it when group-entity extraction is needed.
nlp = spacy.load('en_core_web_sm', disable=['ner'])

doc = nlp('People like us are tired of elite propaganda.')

lemmas = [tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct]

print(lemmas)

8.3 Sentiment analysis

import nltk

from nltk.sentiment import SentimentIntensityAnalyzer

 

nltk.download('vader_lexicon', quiet=True)  # VADER lexicon must be downloaded once

sia = SentimentIntensityAnalyzer()

score = sia.polarity_scores('Only brainwashed elites think this helps real people.')

print(score['neg'])

8.4 Topic modeling with gensim

from gensim import corpora, models

 

tokens = [['energy', 'transition', 'cost'], ['elite', 'nation', 'freedom']]

dictionary = corpora.Dictionary(tokens)

corpus = [dictionary.doc2bow(t) for t in tokens]

lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

print(lda.print_topics())

8.5 Identity-term detection

IDENTITY_TERMS = {'our', 'us', 'we', 'real', 'patriots', 'nation'}

 

def identity_density(tokens):

    return sum(1 for tok in tokens if tok in IDENTITY_TERMS) / max(1, len(tokens))
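Applied across a corpus, the same density function can be mapped over a column of token lists. The sketch below redefines identity_density so it is self-contained and uses a two-row toy DataFrame with hypothetical tokens, not rows from the mock corpus:

```python
import pandas as pd

IDENTITY_TERMS = {'our', 'us', 'we', 'real', 'patriots', 'nation'}

def identity_density(tokens):
    # Share of tokens that match the identity lexicon; max() guards empty lists.
    return sum(1 for tok in tokens if tok in IDENTITY_TERMS) / max(1, len(tokens))

df = pd.DataFrame({'tokens': [['our', 'nation', 'need', 'jobs'],
                              ['emission', 'datum', 'show', 'decline']]})
df['identity_density'] = df['tokens'].map(identity_density)
print(df['identity_density'].tolist())  # [0.5, 0.0]
```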

8.6 Polarization metric

def polarization_metric(ingroup_score, outgroup_score):

    return abs(ingroup_score - outgroup_score)

8.7 IPC score calculation

ipc_score = (

    0.30 * identity_component

    + 0.20 * sentiment_component

    + 0.25 * rigidity_component

    + 0.25 * delegitimization_component

)
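The weighted sum above assumes the four components have already been rescaled to the unit interval. A self-contained worked example with hypothetical component values:

```python
# Hypothetical component values, each assumed rescaled to [0, 1].
identity_component = 0.8
sentiment_component = 0.5
rigidity_component = 0.6
delegitimization_component = 0.7

# Weighted combination using the prototype's default weights.
ipc_score = (
    0.30 * identity_component
    + 0.20 * sentiment_component
    + 0.25 * rigidity_component
    + 0.25 * delegitimization_component
)
print(round(ipc_score, 3))  # 0.665
```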

9. Mock Dataset Example

To illustrate the framework, consider a small mock corpus representing a polarized political discussion about energy and immigration. Each row includes a speaker, a camp label, a short text, and a reply target. Some comments are relatively evidence-focused; others explicitly invoke group identity and delegitimize opponents. The dataset is intentionally tiny and stylized, so the results should be read as a demonstration of method rather than as substantive evidence about real groups.

A useful design principle for mock data is heterogeneity. At least some messages should contain identity-laden phrases (“our people,” “their elite agenda”), some should contain moralized condemnation (“traitors,” “corrupt”), and some should remain lower in identity content even if they express disagreement. That variation allows the IPC score to demonstrate discriminant behavior.

In the simulated analysis, comments with dense identity references and explicit outgroup derogation receive the highest IPC values. Comments that criticize policy with little identity language score lower, even when their sentiment is negative. This distinction is important because it shows how the model separates adversarial policy critique from identity-protective discourse.

Table 2. Excerpt from mock discussion dataset

ID   Speaker   Camp       Text excerpt

1    Aino      Green      Our side keeps looking at evidence, but the fossil lobby and their media allies keep lying...

2    Mika      National   People like us are tired of elite propaganda. Patriotic citizens know this climate panic i...

3    Sara      Green      That is not evidence. You are repeating slogans instead of answering the emissions data.

4    Jari      National   The globalist crowd always says data, data, data, but they never care about our workers or...

5    Leena     Green      When you call everyone globalists and traitors you avoid the policy question.

6    Olli      National   Only brainwashed elites think mass immigration and net-zero fantasies help real Finns.

 
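For readers who want to reproduce the demonstration, a mock corpus of this shape can be constructed directly in pandas. The rows and reply_to links below are illustrative stand-ins rather than the exact dataset used in the simulated analysis:

```python
import pandas as pd

# Hypothetical mock-corpus rows in the schema expected by the pipeline.
# reply_to is None for top-level messages and a message_id otherwise.
rows = [
    (1, 'Sara', 'Green',
     'That is not evidence. You are repeating slogans instead of answering the emissions data.',
     None),
    (2, 'Leena', 'Green',
     'When you call everyone globalists and traitors you avoid the policy question.',
     1),
    (3, 'Olli', 'National',
     'Only brainwashed elites think mass immigration and net-zero fantasies help real Finns.',
     2),
]
df = pd.DataFrame(rows, columns=['message_id', 'speaker', 'camp', 'text', 'reply_to'])
print(df[['message_id', 'speaker', 'camp']])
```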

10. Visualization

Visualization has analytical and communicative value. A histogram of IPC scores shows whether the corpus contains a long high-IPC tail or a more uniformly moderate distribution. A camp-level boxplot can reveal whether one discussion cluster uses more identity-protective language under a specific topic, although such comparisons must be interpreted cautiously and contextualized.
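A camp-level comparison of this kind can be sketched with seaborn's boxplot. The camp labels and IPC values below are hypothetical placeholders, and the headless Agg backend is selected so the figure renders without a display:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted figure export
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical scored corpus: camp labels with per-message IPC scores.
df = pd.DataFrame({'camp': ['Green', 'Green', 'Green', 'National', 'National', 'National'],
                   'ipc_score': [0.21, 0.35, 0.18, 0.62, 0.48, 0.55]})

plt.figure(figsize=(6, 4))
sns.boxplot(data=df, x='camp', y='ipc_score')
plt.title('IPC score by camp (mock data)')
plt.tight_layout()
plt.savefig('camp_boxplot.png', dpi=160)
plt.close()
```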

A second useful figure is a two-dimensional polarization map. For example, the x-axis may represent sentiment polarity toward the outgroup, while the y-axis represents identity signal density. High-IPC texts would tend to occupy the upper-left or upper-right corners depending on coding, making it easier to identify clusters of identity-defense discourse.

A third figure is a conversation network. Nodes can represent speakers and be sized by average IPC score; edge thickness can represent reply frequency. When combined with community colors, the network can reveal whether high-IPC discourse is concentrated in segregated subcommunities or diffused across the conversation. These structural patterns cannot by themselves prove psychological IPC, but they enrich interpretation.

All three figures are included in the accompanying project template: a histogram of IPC scores, a scatterplot showing identity density versus delegitimization, and a speaker reply network.

Figure 1. Distribution of IPC scores in the mock dataset.

Figure 2. Identity density versus outgroup delegitimization.

Figure 3. Speaker reply network sized by mean IPC.

11. Results

In the mock analysis, the highest IPC values were assigned to comments combining three properties: explicit identity markers, hostile outgroup attributions, and rigid evaluative certainty. For example, utterances that portrayed opponents as corrupt outsiders while simultaneously affirming the speaker’s moral community scored substantially higher than issue-focused disagreements. By contrast, comments expressing criticism through evidence claims or policy trade-off language without clear boundary work tended to receive moderate or low IPC scores.

Topic modeling suggested that the mock discussion clustered around two latent issue bundles: an energy-security frame and an immigration-culture frame. IPC scores were not evenly distributed across these topics. The immigration-culture cluster exhibited greater identity density and more delegitimizing labels, while the energy-security cluster included both high-IPC and lower-IPC turns, suggesting that the same topic can host different discourse modes.

The simulated reply network also displayed local concentration. Speakers with higher average IPC tended to participate in denser intragroup exchanges and to use more hostile cross-group replies when they did engage opponents. In a real dataset this pattern would need stronger causal interpretation, but as a proof of concept it demonstrates how textual and relational features can be integrated.

Overall, the prototype behaves in a theoretically plausible way. It does not simply reward negativity; rather, it elevates texts where negativity is fused with identity signaling, moralized boundary work, and delegitimization. That is precisely the pattern one would expect if IPC is being approximated rather than replaced by generic sentiment analysis.

12. Discussion

The main value of an IPC measurement framework is analytical differentiation. Public discourse research often conflates disagreement, incivility, moralization, and polarization. These are related but distinct phenomena. A speaker may be strongly negative without framing disagreement as an identity threat, and a speaker may invoke identity with relatively restrained sentiment. By decomposing IPC into components, researchers can separate these layers.

This matters for the study of epistemic polarization. When citizens or communities treat beliefs as badges of membership, disagreement becomes self-involving. Counter-evidence is then experienced not just as informational friction but as social danger. The IPC framework helps identify when discourse has crossed that threshold. That makes it useful for comparing platforms, topics, moderation regimes, and intervention designs.

The framework also contributes to discussion-dynamics research. IPC is not merely a trait; it can intensify or de-intensify through interaction. A conversation may begin with issue disagreement, escalate through boundary language, and culminate in delegitimization. Time-series analysis of IPC components could therefore reveal escalation pathways, tipping points, or de-escalation patterns in debates. In contexts such as religious controversy or partisan conflict, that dynamic perspective may be more valuable than static classification.
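One minimal way to operationalize such escalation analysis is a rolling mean of IPC over turn order. The sketch below assumes a scored DataFrame sorted by message sequence; the values are illustrative:

```python
import pandas as pd

# Hypothetical conversation scored turn by turn, ordered by sequence.
turns = pd.DataFrame({'turn': range(1, 9),
                      'ipc_score': [0.10, 0.15, 0.20, 0.35, 0.50, 0.55, 0.60, 0.70]})

# Three-turn rolling mean smooths local noise; min_periods=1 keeps early turns.
turns['ipc_rolling'] = turns['ipc_score'].rolling(window=3, min_periods=1).mean()
print(turns[['turn', 'ipc_rolling']].round(2))
```

A sustained rise in the rolling series would flag a candidate escalation pathway for closer qualitative inspection.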

From a methodological standpoint, the proposed score is intentionally transparent. Many recent NLP systems rely on powerful black-box models whose outputs are difficult to interpret. In sensitive political or social analyses, opacity is a problem. A transparent IPC framework enables researchers to inspect component scores, explain model decisions, and adapt features to local language communities. This is especially important when studying normative or high-stakes discourse, where overclaiming can damage both scholarship and public trust.

At the same time, transparent systems should not be romanticized. Dictionary-based features can miss irony, metaphor, coded language, and sarcasm. Contextual models can help, but they introduce their own opacity and domain-shift problems. The best path is likely hybrid: interpretable feature engineering combined with selective use of contextual models and human validation.

13. Limitations

Several limitations should be emphasized. First, the model measures language associated with IPC, not the full internal cognitive process. A text can be strategically identity-laden without reflecting sincere identity protection, and a person can engage in IPC privately without expressing it overtly. The score is therefore an observable proxy, not a direct psychological readout.

Second, context matters. The same phrase may be delegitimizing in one setting and ironic or quoted in another. Semantic ambiguity is a major challenge, especially in social-media discourse where users rely on memes, sarcasm, and insider references. Misclassification is therefore inevitable unless models are adapted to domain and community.

Third, lexicon quality strongly affects results. Hand-built dictionaries improve interpretability but may encode researcher bias or miss minority dialects, multilingual variants, and platform-specific slang. Continuous revision and annotated validation sets are necessary.

Fourth, interaction metadata are not always available. Without reply structure, network polarization must be approximated or omitted. This limits the ability to distinguish isolated identity expressions from self-reinforcing conversational clusters.

Fifth, cross-platform and cross-cultural generalization is uncertain. Identity markers on X, Reddit, YouTube, Finnish Facebook groups, or parliamentary transcripts differ substantially. A model trained in one environment may perform poorly in another. IPC measurement should therefore be treated as domain-adaptive.

Finally, ethical concerns matter. Labeling discourse as identity-protective can itself become a political act. Researchers should avoid using IPC scores as tools for moral condemnation or ideological profiling. The instrument is most defensible when used comparatively, transparently, and with explicit uncertainty.

14. Future Research

Future research can extend the framework in at least three directions. The first involves large language models (LLMs). Transformer encoders and instruction-tuned models may help detect subtle identity cues, infer stance toward group-relevant propositions, and summarize escalation patterns in threads. However, their outputs must be anchored to transparent validation because LLM judgments can drift and may import training-data biases.

The second direction is richer argumentation analysis. Current prototypes often use lexical proxies for rigidity and delegitimization. A next-generation system could integrate argument-mining modules that identify claims, premises, rebuttals, concessions, and fallacious moves. This would allow IPC to be modeled not only as a style of language but as a structure of interaction with evidence and counterargument.

The third direction is agent-based simulation. Once IPC components are estimated empirically, they can be used to parameterize models of discussion dynamics. Agents could vary in identity centrality, openness to evidence, network exposure, and moralization tendency. Simulations would then help study when mixed networks reduce IPC, when they intensify it, and how platform design shapes escalation.

Additional work should also address multilingual adaptation, domain-specific lexicon induction, and psychometric validation against survey measures of identity centrality or affective polarization. The long-term goal is a family of interoperable IPC instruments rather than a single universal score.

15. References

Campos, N., Kinder, D. R., & Orr, L. V. (2025). A new measure of affective polarization. American Political Science Review.

Christensen, J., Baekgaard, M., Dahlmann, C. M., Mathiasen, A., Moynihan, D. P., & Petersen, M. B. (2024). Motivated reasoning and policy information: Politicians are more resistant to debiasing interventions than the general public. Behavioural Public Policy, 8(4), 845–867.

Ditto, P. H., Liu, B. S., Clark, C. J., Wojcik, S. P., Chen, E. E., Grady, R. H., ... & Zinger, J. F. (2025). Partisan bias in political judgment. Annual Review of Psychology, 76, 347–375.

Fulgoni, D., Carpenter, J., Ungar, L., & Preoţiuc-Pietro, D. (2016). An empirical exploration of moral foundations theory in partisan news sources. In Proceedings of LREC 2016.

Kahan, D. M. (2013). Ideology, motivated reasoning, and cognitive reflection. Judgment and Decision Making, 8(4), 407–424.

Kahan, D. M., Braman, D., Gastil, J., Slovic, P., & Mertz, C. K. (2007). Culture and identity-protective cognition: Explaining the white-male effect in risk perception. Journal of Empirical Legal Studies, 4(3), 465–505.

Kahan, D. M., Jenkins-Smith, H., & Braman, D. (2011). Cultural cognition of scientific consensus. Journal of Risk Research, 14(2), 147–174.

Kiil, F. (2025). Motivated political reasoning: Testing the emotion regulation account in the case of perceptual divides over politically relevant facts. Politics and the Life Sciences, 44(1).

Lapesa, G., Al Khatib, K., Ajjour, Y., Daxenberger, J., & Wachsmuth, H. (2024). Mining, assessing, and improving arguments in NLP and the social sciences. In Proceedings of LREC-COLING 2024 Tutorials.

Ruiz-Dolz, R., Fuentes, C., Pardo, T. A. S., Cabrio, E., Villata, S., & others. (2024). Overview of DialAM-2024: Argument mining in natural language discussions. In Proceedings of the 11th Workshop on Argument Mining.

Stagnaro, M. N., Suzuki, A., Yudkin, D., Christakis, N. A., & Rand, D. G. (2025). Factual knowledge can reduce attitude polarization. Nature Communications, 16, Article 58697.

Takikawa, H., & Sakamoto, H. (2017). Moral foundations of political discourse: Comparative analysis of the speech records of the US Congress and the Japanese Diet. arXiv preprint arXiv:1704.06903.

pandas development team. (2026). pandas documentation. PyData.

spaCy team. (2026). spaCy usage and API documentation.

scikit-learn developers. (2026). scikit-learn documentation.

Rehurek, R., & Sojka, P. (2026). gensim documentation.

Wolf, T., Debut, L., Sanh, V., et al. (2026). Transformers documentation. Hugging Face.

Hagberg, A., Schult, D., & Swart, P. (2026). NetworkX documentation.

Hunter, J. D., & matplotlib development team. (2026). Matplotlib documentation.


 

Appendix A. Complete Python Script for IPC Analysis

The appendix reproduces the project code used for the research prototype; the modules total roughly 350 lines.

# ===== preprocessing.py =====

 

"""Preprocessing utilities for the IPC analyzer."""

from __future__ import annotations

 

import re

from typing import Iterable, List

 

import pandas as pd

 

try:

    import spacy

except ImportError:  # pragma: no cover

    spacy = None

 

 

URL_RE = re.compile(r"https?://\S+|www\.\S+")

NONWORD_RE = re.compile(r"[^a-zA-ZäöåÄÖÅ0-9\s']+")

 

 

def normalize_text(text: str) -> str:

    """Normalize whitespace, remove URLs, and lowercase text."""

    if not isinstance(text, str):

        return ""

    text = URL_RE.sub(" ", text)

    text = NONWORD_RE.sub(" ", text)

    text = re.sub(r"\s+", " ", text).strip().lower()

    return text

 

 

def load_spacy_model(model_name: str = "en_core_web_sm"):

    """Load a spaCy model lazily."""

    if spacy is None:

        raise ImportError("spaCy is not installed.")

    return spacy.load(model_name, disable=["ner"])

 

 

def tokenize_and_lemmatize(texts: Iterable[str], nlp=None) -> List[List[str]]:

    """Tokenize and lemmatize texts using spaCy."""

    created = False

    if nlp is None:

        nlp = load_spacy_model()

        created = True

    docs = nlp.pipe(texts, batch_size=32)

    output: List[List[str]] = []

    for doc in docs:

        lemmas = [

            tok.lemma_.lower().strip()

            for tok in doc

            if not tok.is_space and not tok.is_punct and not tok.is_stop

        ]

        output.append([t for t in lemmas if t])

    if created:

        del nlp

    return output

 

 

def prepare_dataframe(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:

    """Add normalized text column."""

    out = df.copy()

    out["text_clean"] = out[text_col].fillna("").map(normalize_text)

    return out

 

 

# ===== features.py =====

 

"""Feature extraction for IPC measurement."""

from __future__ import annotations

 

from collections import Counter

from typing import Dict, Iterable, List

 

import numpy as np

import pandas as pd

 

IDENTITY_TERMS = {

    "our", "us", "we", "real", "people", "patriots", "patriotic", "citizens",

    "camp", "side", "nation", "finns", "freedom"

}

OUTGROUP_TERMS = {

    "elite", "elites", "globalist", "globalists", "lobby", "followers", "their",

    "them", "you", "your", "experts"

}

DELEGITIMIZERS = {

    "brainwashed", "corrupt", "traitors", "naive", "propaganda", "lying",

    "lies", "scheme", "obedient"

}

MORAL_TERMS = {

    "betrayal", "freedom", "duty", "patriotic", "corrupt", "truth", "shame",

    "traitors", "ordinary"

}

RIGIDITY_TERMS = {

    "always", "never", "only", "done", "know", "obvious", "everyone"

}

HEDGE_TERMS = {"maybe", "perhaps", "might", "could", "sometimes", "possibly"}

 

 

def token_counter(tokens: Iterable[str]) -> Counter:

    return Counter(tokens)

 

 

def count_lexicon_hits(tokens: List[str], lexicon: set[str]) -> int:

    return sum(1 for tok in tokens if tok in lexicon)

 

 

def lexical_overlap(tokens_a: List[str], tokens_b: List[str]) -> float:

    if not tokens_a or not tokens_b:

        return 0.0

    a, b = set(tokens_a), set(tokens_b)

    return len(a & b) / max(1, len(a | b))

 

 

def compute_feature_row(tokens: List[str], reply_tokens: List[str] | None = None) -> Dict[str, float]:

    """Compute lexical IPC features for a tokenized document."""

    reply_tokens = reply_tokens or []

    n_tokens = max(1, len(tokens))

    identity_density = count_lexicon_hits(tokens, IDENTITY_TERMS) / n_tokens

    outgroup_density = count_lexicon_hits(tokens, OUTGROUP_TERMS) / n_tokens

    delegit_density = count_lexicon_hits(tokens, DELEGITIMIZERS) / n_tokens

    moral_density = count_lexicon_hits(tokens, MORAL_TERMS) / n_tokens

    rigidity_density = count_lexicon_hits(tokens, RIGIDITY_TERMS) / n_tokens

    hedge_density = count_lexicon_hits(tokens, HEDGE_TERMS) / n_tokens

    overlap = lexical_overlap(tokens, reply_tokens)

    argument_rigidity = np.clip((rigidity_density * 1.2) + (1.0 - overlap) * 0.4 - hedge_density * 0.5, 0, 1)

    return {

        "n_tokens": float(n_tokens),

        "identity_density": float(identity_density),

        "outgroup_density": float(outgroup_density),

        "delegit_density": float(delegit_density),

        "moral_density": float(moral_density),

        "rigidity_density": float(rigidity_density),

        "hedge_density": float(hedge_density),

        "reply_overlap": float(overlap),

        "argument_rigidity": float(argument_rigidity),

    }

 

 

def normalize_columns(df: pd.DataFrame, cols: List[str]) -> pd.DataFrame:

    out = df.copy()

    for col in cols:

        values = out[col].astype(float).to_numpy()

        vmin, vmax = values.min(), values.max()

        if vmax == vmin:

            out[col + "_norm"] = 0.0

        else:

            out[col + "_norm"] = (values - vmin) / (vmax - vmin)

    return out

 

 

# ===== ipc_score.py =====

 

"""Compute IPC scores from extracted features."""

from __future__ import annotations

 

from dataclasses import dataclass

from typing import Dict

 

import numpy as np

import pandas as pd

 

 

@dataclass

class IPCWeights:

    identity: float = 0.30

    sentiment: float = 0.20

    rigidity: float = 0.25

    delegitimization: float = 0.25

 

    def as_dict(self) -> Dict[str, float]:

        return {

            "identity": self.identity,

            "sentiment": self.sentiment,

            "rigidity": self.rigidity,

            "delegitimization": self.delegitimization,

        }

 

 

def rescale(series: pd.Series) -> pd.Series:

    vals = series.astype(float)

    vmin, vmax = vals.min(), vals.max()

    if vmax == vmin:

        return pd.Series(np.zeros(len(vals)), index=series.index)

    return (vals - vmin) / (vmax - vmin)

 

 

def compute_ipc_score(df: pd.DataFrame, weights: IPCWeights | None = None) -> pd.DataFrame:

    """Combine component features into a composite IPC score."""

    weights = weights or IPCWeights()

    out = df.copy()

    out["identity_component"] = rescale(out["identity_density"] + 0.5 * out["outgroup_density"])

    out["sentiment_component"] = rescale(out["sentiment_negativity"])

    out["rigidity_component"] = rescale(out["argument_rigidity"])

    out["delegitimization_component"] = rescale(out["delegit_density"] + 0.35 * out["moral_density"])

 

    out["ipc_score"] = (

        weights.identity * out["identity_component"]

        + weights.sentiment * out["sentiment_component"]

        + weights.rigidity * out["rigidity_component"]

        + weights.delegitimization * out["delegitimization_component"]

    ).round(4)

    return out

 

 

def aggregate_by_speaker(df: pd.DataFrame) -> pd.DataFrame:

    return (

        df.groupby(["speaker", "camp"], as_index=False)

        .agg(

            mean_ipc=("ipc_score", "mean"),

            max_ipc=("ipc_score", "max"),

            messages=("message_id", "count"),

        )

        .sort_values("mean_ipc", ascending=False)

    )

 

 

# ===== visualization.py =====

 

"""Visualization helpers for the IPC analyzer."""

from __future__ import annotations

 

from pathlib import Path

 

import matplotlib.pyplot as plt

import networkx as nx

import pandas as pd

import seaborn as sns

 

 

def save_histogram(df: pd.DataFrame, out_path: str | Path) -> None:

    plt.figure(figsize=(8, 5))

    plt.hist(df["ipc_score"], bins=8)

    plt.xlabel("IPC score")

    plt.ylabel("Count")

    plt.title("Distribution of IPC scores")

    plt.tight_layout()

    plt.savefig(out_path, dpi=160)

    plt.close()

 

 

def save_scatter(df: pd.DataFrame, out_path: str | Path) -> None:

    plt.figure(figsize=(8, 5))

    sns.scatterplot(

        data=df,

        x="identity_density",

        y="delegit_density",

        hue="camp",

        size="ipc_score",

        sizes=(50, 250),

    )

    plt.title("Identity density vs. outgroup delegitimization")

    plt.tight_layout()

    plt.savefig(out_path, dpi=160)

    plt.close()

 

 

def save_network(df: pd.DataFrame, out_path: str | Path) -> None:

    g = nx.DiGraph()

    speaker_ipc = df.groupby("speaker")["ipc_score"].mean().to_dict()

    for _, row in df.iterrows():

        g.add_node(row["speaker"], ipc=speaker_ipc.get(row["speaker"], 0.0))

    lookup = df.set_index("message_id")["speaker"].to_dict()

    for _, row in df.iterrows():

        if pd.notna(row["reply_to"]):  # reply_to is NaN for top-level messages

            src = row["speaker"]

            dst = lookup.get(int(row["reply_to"]))

            if dst:

                g.add_edge(src, dst)

 

    plt.figure(figsize=(7, 6))

    pos = nx.spring_layout(g, seed=42)

    node_sizes = [900 + 1800 * g.nodes[n]["ipc"] for n in g.nodes]

    nx.draw_networkx(g, pos=pos, with_labels=True, node_size=node_sizes, arrows=True)

    plt.title("Speaker reply network sized by mean IPC")

    plt.axis("off")

    plt.tight_layout()

    plt.savefig(out_path, dpi=160)

    plt.close()

 

 

# ===== main.py =====

 

"""Main pipeline for the IPC analyzer prototype."""

from __future__ import annotations

 

from pathlib import Path

 

import pandas as pd

 

from features import compute_feature_row

from ipc_score import aggregate_by_speaker, compute_ipc_score

from preprocessing import prepare_dataframe, tokenize_and_lemmatize

from visualization import save_histogram, save_network, save_scatter

 

try:

    from nltk.sentiment import SentimentIntensityAnalyzer

except ImportError:  # pragma: no cover

    SentimentIntensityAnalyzer = None

 

 

PROJECT_DIR = Path(__file__).resolve().parent

DATA_PATH = PROJECT_DIR / "data" / "mock_discussion.csv"

OUTPUT_PATH = PROJECT_DIR / "data" / "mock_discussion_scored.csv"

FIG_DIR = PROJECT_DIR / "data" / "figures"

FIG_DIR.mkdir(parents=True, exist_ok=True)

 

 

NEGATIVE_CUES = {"lying", "propaganda", "brainwashed", "corrupt", "traitors", "naive", "shame", "despises"}

 

 

def fallback_negativity(text: str) -> float:

    tokens = text.split()

    return sum(1 for t in tokens if t in NEGATIVE_CUES) / max(1, len(tokens))

 

 

def estimate_sentiment(df: pd.DataFrame) -> pd.Series:

    if SentimentIntensityAnalyzer is None:

        return df["text_clean"].map(fallback_negativity)

    try:

        sia = SentimentIntensityAnalyzer()

        neg = df["text_clean"].map(lambda t: sia.polarity_scores(t)["neg"])

        return neg

    except Exception:

        return df["text_clean"].map(fallback_negativity)

 

 

def main() -> None:

    df = pd.read_csv(DATA_PATH)

    df = prepare_dataframe(df, "text")

    token_lists = tokenize_and_lemmatize(df["text_clean"].tolist())

 

    reply_lookup = {int(row.message_id): token_lists[i] for i, row in enumerate(df.itertuples())}

    features = []

    for i, row in enumerate(df.itertuples()):

        reply_tokens = []

        if pd.notna(getattr(row, "reply_to", None)):

            try:

                reply_tokens = reply_lookup.get(int(row.reply_to), [])

            except Exception:

                reply_tokens = []

        f = compute_feature_row(token_lists[i], reply_tokens)

        features.append(f)

 

    feature_df = pd.DataFrame(features)

    df = pd.concat([df, feature_df], axis=1)

    df["sentiment_negativity"] = estimate_sentiment(df)

    df = compute_ipc_score(df)

 

    speaker_table = aggregate_by_speaker(df)

    df.to_csv(OUTPUT_PATH, index=False)

    speaker_table.to_csv(PROJECT_DIR / "data" / "speaker_summary.csv", index=False)

 

    save_histogram(df, FIG_DIR / "ipc_histogram.png")

    save_scatter(df, FIG_DIR / "ipc_scatter.png")

    save_network(df, FIG_DIR / "ipc_network.png")

 

    print("Saved scored discussion to:", OUTPUT_PATH)

    print("Top speaker means:")

    print(speaker_table.to_string(index=False))

 

 

if __name__ == "__main__":

    main()

 

 


 

Python Project for IPC Analyzer

Recommended project directory:

ipc_analyzer/

├── data/

├── scripts/

├── models/

├── notebooks/

├── main.py

├── ipc_score.py

├── preprocessing.py

├── features.py

├── visualization.py

└── requirements.txt

File responsibilities

File              Purpose

main.py           Runs the full analysis pipeline, scores the mock discussion, saves CSV outputs, and renders figures.

preprocessing.py  Text normalization plus tokenization and lemmatization support via spaCy.

features.py       Lexical feature extraction for identity signals, moralization, delegitimization, and rigidity.

ipc_score.py      Weight specification, normalization helpers, and final IPC score computation.

visualization.py  Histogram, scatterplot, and reply-network visualization functions.

requirements.txt  Lists Python dependencies for environment setup.

requirements.txt

pandas>=2.2

numpy>=2.0

nltk>=3.9

spacy>=3.7

scikit-learn>=1.5

gensim>=4.3

transformers>=4.49

torch>=2.6

networkx>=3.4

matplotlib>=3.10

seaborn>=0.13

Running the project in Visual Studio Code

1. Open the ipc_analyzer folder in Visual Studio Code.

2. Create a virtual environment: python -m venv .venv

3. Activate it (Windows PowerShell): .\.venv\Scripts\Activate.ps1

4. Install dependencies: pip install -r requirements.txt

5. Download the spaCy English model: python -m spacy download en_core_web_sm

6. Run the pipeline: python main.py

7. Inspect outputs in data/mock_discussion_scored.csv, data/speaker_summary.csv, and data/figures/.

Using the analyzer with social-media data

Replace the mock CSV with exported posts or comments containing at minimum message_id, speaker, text, and reply_to columns.

Add platform-specific lexicons for party labels, identity slogans, hashtags, and insult vocabularies.

Where available, preserve reply structure and timestamps so IPC can be studied dynamically across threads.

Validate the score on a manually annotated subset before scaling to a large corpus.
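Validation against a manually annotated subset can start with a simple rank correlation between model scores and human ratings. The sketch below assumes hypothetical ipc_score and human_rating columns (e.g., 1–5 identity-threat ratings from coders):

```python
import pandas as pd

# Hypothetical validation subset: model IPC scores paired with coder ratings.
scores = pd.DataFrame({'ipc_score': [0.12, 0.30, 0.55, 0.80, 0.65],
                       'human_rating': [1, 2, 4, 5, 4]})

# Spearman correlation is rank-based, so it tolerates the score's arbitrary scale.
rho = scores['ipc_score'].corr(scores['human_rating'], method='spearman')
print(round(rho, 3))  # ≈ 0.975 for these illustrative values
```

A low correlation on real annotations would signal that lexicons or weights need revision before scaling up.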

Saving IPC scores to CSV

df_scored.to_csv('data/my_corpus_scored.csv', index=False)

speaker_summary.to_csv('data/my_speaker_summary.csv', index=False)
