Text Mining - Wissensrohstoff Text
(Text Mining: Text as a Raw Material for Knowledge)

Gerhard Heyer

Institut für Informatik, Universität Leipzig
heyer@informatik.uni-leipzig.de

Sentiment Analysis

Background in Communication Theory

Bühler's Organon Model (Sprachtheorie)

• Linguistic communication is based on linguistic expressions
• Linguistic expressions have three dimensions:
– the sender (speaker, writer)
– the receiver (hearer, reader)
– the subject matter referred to (objects and events, properties, facts, …)
• With respect to a sender, an (intended) receiver and the subject matter referred to, linguistic expressions therefore have a threefold function:
– symptom (expression of the sender)
– appeal (directed at the receiver)
– symbol (representation of the subject matter)

Prof. Dr. G. Heyer, Text Mining - Wissensrohstoff Text

Organon Model: Schematic Representation

Linguistic Aspects

• We therefore find linguistic expressions in the form of
– exclamations (symptom)
– evaluations (appeal)
– statements (symbol)
• Example

ECKHARD BERGER (haltingly)
I don't know what is going to happen… but I am afraid… afraid of my colleagues: Jürgen Wiesehöfer… Michael Nauen… and Sven Lienecke. If anything happens to me, then… (an agonizing pause, then) these three men are dangerous… (quietly) possibly murderers.

Screenplay, SoKo Leipzig, episode 6

Further Example

Tasks

• Identifying sentiment targets
• Identifying sentiment expressions (words, sentences) and their modifiers
• Computing a sentiment index for sentiment expressions
• Computing a sentiment index for complex sentiment expressions (passages, texts)
• Identifying and parameterizing the factors that influence the interpretation of sentiment expressions, e.g.
– context (subject domain, interests of the evaluator, perspective, …)
– medium (social networks, email, film, …)
– language register (politeness, channel restrictions, competence restrictions, …)

Applications

• Text- and interview-based marketing (brand building and brand change, customer expectations)
• Text-based market analyses
• Text- and interview-based social science surveys (European Social Science Monitor)
• An important complement to the analysis of trends and social networks
• eHumanities

Projects and Tools

Reuters Sentiment Analysis Workflow

• News stories are collected in real time
• Stories are standardised
• Linguistic analysis is performed on each story to produce sentiment
• Word sense disambiguation is performed on each story
• A sentiment feature vector is produced to describe each document
• The feature vector is matched against a machine learning vector in order to classify story sentiment
• Analysis results are delivered to clients

Dow Jones/RavenPack Sentiment Analysis Workflow

• Live news feeds: thousands of news stories are collected
• Data is standardised with timestamps, headlines, text, ...
• News stories are serialised to storage devices
• Stories are machine-read and analysed to detect sentiment
• Results are used to produce time-series trends
• Overall results are analysed for trends, repeating patterns and algorithmic patterns
• Analysed data is sent to clients
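The step from per-story sentiment to time-series trends can be illustrated with a minimal Python sketch. It assumes each story has already been reduced to a (publication date, sentiment score) pair; the function name, the score scale and the sample values are illustrative assumptions, not part of the workflow described on the slide.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def daily_sentiment_trend(scored_stories):
    """Average the per-story sentiment scores for each publication date."""
    by_day = defaultdict(list)
    for published, score in scored_stories:
        by_day[published].append(score)
    # Sort by date so that the result reads as a time series.
    return [(day, mean(scores)) for day, scores in sorted(by_day.items())]

# Invented sample scores, for illustration only.
stories = [
    (date(2010, 6, 1), 0.5),
    (date(2010, 6, 1), -0.1),
    (date(2010, 6, 2), -0.4),
]
trend = daily_sentiment_trend(stories)
```

Averaging per day is only one possible aggregation; a production system might instead use story counts, moving averages or relevance weights.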

Sentiment Analysis

The following slides are based on slides by Robert Remus and Khurshid Ahmad

Introduction

Introduction --- Definition of Sentiment Analysis

Sentiment Analysis refers to a broad area of NLP, CL

and TM. Generally speaking, it aims to determine the

attitude of a speaker or a writer with respect to some

topic. The attitude may be their judgment or

evaluation, their affective state or their intended

emotional communication.

http://en.wikipedia.org/wiki/Sentimentanalysis

Overview --- The Literature Landscape

What do the term sentiment analysis and the often synonymously used term opinion mining cover in the literature?

Subjectivity Analysis:
Is a textual unit, e.g. a word, a phrase, a sentence or a document, subjective or objective in character?

Polarity Analysis:
Does a textual unit express a positive, negative or neutral sentiment?

Both questions are predominantly treated as instances of a (text) classification problem.

Subjectivity Analysis

Most influential work:
Wiebe et al. (2004), Wiebe & Riloff (2005)

These classify sentences using, among other things, so-called subjectivity clues:

„Ich glaube, die Qualität ist minderwertig.“ ("I believe the quality is inferior.")

Wiebe et al. (2004) "learn" such I-you co-occurrences and other subjectivity clues, e.g. low-frequency word forms, from a large corpus.

Polarity Analysis

Polarity Analysis --- Typical Methods I

Early influential study: Pang et al. (2002)

It classifies short texts using typical statistical methods, including support vector machines, that were trained on a hand-annotated training set.
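To illustrate what training a statistical classifier on a hand-annotated set means, here is a minimal sketch in plain Python. It deliberately swaps in a bag-of-words Naive Bayes classifier (one of the methods Pang et al. compared) instead of a support vector machine, because it fits in a few lines without external libraries. The training sentences are invented toy data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_texts):
    """Collect per-class word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_texts:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented, hand-labelled toy training set.
training_set = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("a boring and predictable plot", "negative"),
    ("terrible acting and a dull story", "negative"),
]
model = train_naive_bayes(training_set)
prediction = classify("a great film", model)
```

With real data the training set would contain thousands of annotated texts, and the feature set and classifier would be chosen by evaluation on held-out data.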

Rule-based approaches, e.g. Kennedy & Inkpen (2006), search sentences for polar words and include modifiers such as negations, diminishers and intensifiers in the analysis:

„Das ist keine schöne Vorstellung“ ("That is not a pleasant thought")
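The rule-based idea can be sketched as follows. This is a simplified illustration, not the procedure of Kennedy & Inkpen: the tiny lexicon, the modifier lists and the multipliers are hypothetical, and only modifiers directly preceding a polar word are considered.

```python
# Hypothetical toy lexicon and modifier lists (not taken from any real resource).
POLAR_WORDS = {"schöne": 1.0, "schön": 1.0, "gut": 1.0, "schlecht": -1.0}
NEGATIONS = {"nicht", "kein", "keine"}
INTENSIFIERS = {"sehr": 1.5}
DIMINISHERS = {"etwas": 0.5}

def sentence_polarity(sentence):
    """Sum polar word scores, adjusted by the modifiers directly preceding them."""
    tokens = sentence.lower().strip(".!?").split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in POLAR_WORDS:
            continue
        score = POLAR_WORDS[token]
        # Walk backwards over the run of modifiers in front of the polar word.
        j = i - 1
        while j >= 0:
            modifier = tokens[j]
            if modifier in NEGATIONS:
                score = -score
            elif modifier in INTENSIFIERS:
                score *= INTENSIFIERS[modifier]
            elif modifier in DIMINISHERS:
                score *= DIMINISHERS[modifier]
            else:
                break
            j -= 1
        total += score
    return total

example_score = sentence_polarity("Das ist keine schöne Vorstellung")
```

For the example sentence, the negation "keine" flips the positive score of "schöne", so the sentence as a whole is scored negative.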

Polarity Analysis --- Typical Methods II

Nasukawa & Yi (2003):

[Polarity] analysis involves identification of
- sentiment expressions,
- polarity strength of the expressions, and
- their relationship to the subject.

Nasukawa & Yi (2003) were the first to use the term sentiment analysis in this form.

Polarity Analysis --- Required Resources

Rule-based studies require dictionaries that list terms with positive and negative connotations.

Such resources are not freely available a priori for all languages. How can we create them?
- manual compilation
- transfer of existing resources from other languages
- automatic learning, e.g. by bootstrapping

Lexical Resources

‘Modern’ day dictionaries of affect:
• emotion as dimension and
• emotion as ‘finite category’
– good–bad axis (termed the dimension of valence, evaluation or pleasantness)
– active–passive axis (termed the dimension of arousal, activation or intensity)
– strong–weak axis (termed the dimension of dominance or submissiveness)

Dictionaries of Affect

‘Modern’ day dictionaries of affect are used to compute the frequency of sentiment words in a text; the attempt is usually to ensure that one picks up only sentences that use the ‘correct’/unambiguous sense of the sentiment word
– General Inquirer [Stone et al. 1966];
– Dictionary of Affect [Whissell 1989];
– WordNet Affect [Strapparava and Valitutti 2004];
– SentiWordNet [Esuli and Sebastiani 2006].
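Computing the frequency of sentiment words with such a dictionary can be sketched as below. The four-entry dictionary and its category names are hypothetical stand-ins for a real resource, and word sense disambiguation is left out entirely.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; real dictionaries of affect are far larger.
AFFECT_DICTIONARY = {
    "good": "positive",
    "happy": "positive",
    "bad": "negative",
    "fear": "negative",
}

def sentiment_word_frequencies(text):
    """Return the share of tokens that fall into each affect category."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {}
    category_counts = Counter(
        AFFECT_DICTIONARY[token] for token in tokens if token in AFFECT_DICTIONARY
    )
    return {category: count / len(tokens) for category, count in category_counts.items()}

frequencies = sentiment_word_frequencies("Good news: no fear, no bad loans.")
```

Note that the plain count treats "no fear" and "no bad loans" as negative evidence, which is exactly the kind of misreading that disambiguation and modifier handling are meant to prevent.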

General Inquirer

The General Inquirer is a software system that analyses texts to ascertain the psychological attitude/orientation/behaviour of the writer as implicit in his or her writing.

The system has a large database of words, and each word is tagged primarily according to whether it is generally used positively or negatively.

But there are many fine gradations within the tags, ranging from tags that describe active/passive orientation to tags that mark whether a word belongs to a specific subject category such as economics, is typically used by academics, or is found in legal documents.

General Inquirer Categories

Name      No. of Words   Meaning
Positiv   1,915          positive outlook
Negativ   2,291          negative outlook
Pstv      1,045          positive outlook
Affil     557            affiliation or supportiveness
Ngtv      1,160          negative outlook
Hostile   833            an attitude or concern with hostility or aggressiveness
Strong    1,902          implying strength
Power     689            Positive
Weak      755            Negative
Submit    284            submission to authority or power, dependence on others, vulnerability to others, or withdrawal

SentiWS

SentiWS [Remus et al. 2010]

SentiWS, short for SentimentWortschatz, is a German-language dictionary that
- lists 1650 words with positive and 1818 words with negative connotations
- gives their part of speech and their inflected forms
- weights each entry by its strength of expression in the interval [-1, +1]

SentiWS has been freely available since June 2010 at
http://wortschatz.informatik.uni-leipzig.de/download/
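A dictionary of this kind is typically loaded into a lookup table from surface forms to weights. The sketch below assumes one entry per line in the form `lemma|POS<TAB>weight<TAB>form1,form2,...`; this layout should be checked against the downloaded files. The two sample lines reuse weights quoted in these slides, while the part-of-speech tags and inflected forms are illustrative.

```python
def parse_sentiws_line(line):
    """Parse one line of the assumed form 'lemma|POS<TAB>weight<TAB>form1,form2,...'."""
    fields = line.rstrip("\n").split("\t")
    lemma, pos = fields[0].split("|")
    weight = float(fields[1])
    # The inflection column may be missing for entries without listed forms.
    inflections = fields[2].split(",") if len(fields) > 2 and fields[2] else []
    return {"lemma": lemma, "pos": pos, "weight": weight, "inflections": inflections}

def build_lookup(lines):
    """Map the lemma and every inflected form (lower-cased) to the entry's weight."""
    lookup = {}
    for line in lines:
        if not line.strip():
            continue
        entry = parse_sentiws_line(line)
        for form in [entry["lemma"], *entry["inflections"]]:
            lookup[form.lower()] = entry["weight"]
    return lookup

# Illustrative sample lines in the assumed format.
sample_lines = [
    "Panne|NN\t-0.9010\tPannen",
    "hervorragend|ADJX\t0.5891\thervorragende,hervorragender",
]
lookup = build_lookup(sample_lines)
```

Indexing the inflected forms as well means that running text can be scored without a separate lemmatisation step.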

SentiWS --- Sources

SentiWS is based on a study of the interactions between sentiment in newspaper texts and blog posts and the movements of the DAX 30 (Remus et al. 2009).

The categories Positiv and Negativ of the English-language General Inquirer were translated automatically with Google Translate, and domain-specific terms such as Finanzkrise (financial crisis) were added.

SentiWS is extended with significant co-occurrences in customer reviews.

SentiWS is further extended with overlaps between the terms already identified and semantic groups of the Deutsches Kollokationswörterbuch (Quasthoff 2010).

SentiWS --- Weighting I

Semi-supervised weighting of the semantic orientation of a word w:

\[
\mathrm{SO\text{-}A}(w) = \sum_{pword \in Pwords} A(w, pword) - \sum_{nword \in Nwords} A(w, nword)
\]

where Pwords is a set of words with positive connotations and Nwords is a set of words with negative connotations.

w is marked as having a positive semantic orientation if SO-A(w) is positive, and a negative semantic orientation if SO-A(w) is negative. The absolute value of SO-A(w) indicates the strength of the semantic orientation.

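SentiWS --- Weighting II

A(w1, w2) is determined by Pointwise Mutual Information (PMI):

\[
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1 \,\&\, w_2)}{p(w_1)\, p(w_2)}
\]

Paradigm (seed) words following [Turney & Littman, 2003] (translated into German):

Pwords = gut (good), schön (beautiful), richtig (correct), glücklich (happy), erstklassig (first-class), positiv (positive), großartig (great), ausgezeichnet (excellent), lieb (kind), exzellent (excellent), phantastisch (fantastic)

Nwords = schlecht (bad), unschön (unsightly), falsch (wrong), unglücklich (unhappy), zweitklassig (second-rate), negativ (negative), scheiße (crap), minderwertig (inferior), böse (evil), armselig (pathetic), mies (lousy)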
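The weighting scheme, SO-A(w) as the sum of associations with positive seed words minus the sum of associations with negative seed words, with PMI as the association measure, can be sketched as follows. The sketch assumes that word and pair frequencies have been counted over `n_units` text units (e.g. sentences). The counts are invented, and returning 0 for unseen pairs is a simplification that the source does not prescribe.

```python
import math

def pmi(word1, word2, word_counts, pair_counts, n_units):
    """PMI(w1, w2) = log2( p(w1 & w2) / (p(w1) * p(w2)) ), computed from raw counts."""
    joint = pair_counts.get(frozenset((word1, word2)), 0)
    if joint == 0:
        return 0.0  # simplification: unseen pairs contribute nothing
    # p(w1 & w2) / (p(w1) * p(w2)) rewritten in terms of counts.
    ratio = joint * n_units / (word_counts[word1] * word_counts[word2])
    return math.log2(ratio)

def so_a(word, pwords, nwords, word_counts, pair_counts, n_units):
    """Semantic orientation: association with Pwords minus association with Nwords."""
    positive = sum(pmi(word, p, word_counts, pair_counts, n_units) for p in pwords)
    negative = sum(pmi(word, n, word_counts, pair_counts, n_units) for n in nwords)
    return positive - negative

# Invented counts over 100 text units, for illustration only.
word_counts = {"panne": 10, "gut": 20, "schlecht": 20}
pair_counts = {
    frozenset(("panne", "gut")): 1,
    frozenset(("panne", "schlecht")): 8,
}
score = so_a("panne", ["gut"], ["schlecht"], word_counts, pair_counts, n_units=100)
```

With these counts, PMI(panne, gut) = log2(0.5) = -1 and PMI(panne, schlecht) = log2(4) = 2, so SO-A(panne) = -1 - 2 = -3, a clearly negative orientation. Mapping such raw values into the interval [-1, +1] would require an additional normalisation step.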

SentiWS --- Weighting III

Examples of some words together with the weighting of their strength of expression, i.e. their polarity (general vocabulary and automotive forums):

Word                          Weight
Panne (breakdown)             -0.9010
Schaden (damage)              -0.5299
fehlerhaft (faulty)           -0.3581
Vertrauen (trust)             +0.3512
Zufriedenheit (satisfaction)  +0.2207
hervorragend (outstanding)    +0.5891
Schulden (debts)              -0.8905
betrügen (to cheat)           -0.8368
freundlich (friendly)         +0.9273
Freude (joy)                  +1.0000

SentiWS --- Applications

• Analysis of automotive forums and blogs in cooperation with Daimler Forschung in Ulm (slides by R. Remus)
• Simple polarity analysis, illustrated by an evaluation of in-depth interviews in marketing (in cooperation with Uni HH)

SentiWS --- Evaluation

Evaluating the weights turns out to be difficult. Why?

Example:

5 words with positive and 5 words with negative connotations were selected at random.

7 test subjects were asked to produce two rankings that order the words by their strength of expression.

The measured agreement (Cohen's κ) between the subjects' rankings is 0.314.

Cohen's kappa is a measure of the reliability of annotations made by several annotators; it expresses how far their agreement deviates from the agreement expected by chance.

In the literature (e.g. [Landis & Koch, 1977]), a value of 0.314 is rated as only fair, i.e. low, agreement.
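Cohen's κ compares the observed agreement p_o with the agreement p_e expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators and averages it over all pairs for more than two. How the seven subjects' rankings were combined in the study is not stated, so the pairwise averaging and the encoding of a ranking as one rank label per word are assumptions. The rankings are invented.

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators (an assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)

# Invented rankings: position i holds the rank a subject gave to word i.
subject_1 = [1, 2, 3, 4, 5]
subject_2 = [1, 2, 3, 5, 4]
subject_3 = [1, 2, 3, 4, 5]
kappa_12 = cohens_kappa(subject_1, subject_2)
kappa_all = mean_pairwise_kappa([subject_1, subject_2, subject_3])
```

Subjects 1 and 2 agree on three of five ranks (p_o = 0.6) while chance agreement is p_e = 0.2, which gives κ = 0.5. Swapping just two adjacent ranks already lowers κ considerably, which helps explain why agreement on rankings tends to be low.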

Conclusion:

Since people find it hard to agree on uniform rankings, it is hard to define a gold standard for evaluating sentiments.

Nevertheless, assigning different weights seems intuitive (compare the strength of expression of „unklug“ (unwise) vs. „bescheuert“ (idiotic)).

Practical Example: Analysis of Customer Expectations

Cf. Torsten Teichert, Gerhard Heyer, Katja Schöntag and Patrick Mairif: Co-Word Analysis for assessing consumer associations: A case study in market research. In: Affective Computing and Sentiment Analysis, Springer Science+Business Media B.V., 2011


Background

• Interviews for specific marketing tasks

• Manually rated and evaluated

– Concept features (metaphors)

– Emotional rating

– Clustering of concepts

• For each interview, features are manually counted

and fed into SPSS for factorial analysis

• Purpose of the present work: Test and evaluate the

efficiency of NLP for detecting and clustering

concept features

Sentiment analysis in marketing

• Goal of marketing: gain insight into consumers’ thoughts and feelings regarding specific brands and products
• Widespread use of elicitation and analysis techniques
• As opposed to many text analysis applications, data are not obtained from secondary (internet) sources but from 30 personal in-depth interviews with female consumers
• However, interviews yield a large amount of qualitative data that is hard to handle and needs to be structured in order to be analyzed

Methods and assumptions

• Product categories are assumed to be emotionally laden, reaching far beyond mere functional aspects
• Data elicitation and processing techniques are based on methods derived from human associative memory models and network analysis
• Human Associative Memory (HAM) is a widely accepted model with an increasing number of studies based upon it
– information is stored in nodes which are linked (associated) with each other, forming a complex network of associations
– mental activity spreads from active concepts to all related concepts

HAM in marketing

• In the case of brands, the stimulating element can be a brand’s logo: the brand’s associative network is activated and becomes accessible and retrievable
• Activation then spreads to adjacent nodes
• This spread of activation produces a chain, or flow, of thoughts
• A representation of this flow of thoughts can be obtained from the flow of speech, for example when eliciting brand or product associations during an interview
• Elicitation techniques help to access subconscious memory of an episodic, autobiographic, visual and sensory nature as well as a metaphoric description of thoughts, sentiments, and emotions

Problems of evaluating questionnaires

• Researchers manually interpret the interviewees’ statements
– Ambiguity of statements and expressions
– Subjective rating of the elicited data results
• Low replicability of the results
• By convention, inter-rater-reliabilities of 70 percent and above are acceptable
• Inter-rater-reliability is comparatively low for emotional aspects as opposed to more rational expressions

Goals and the role of NLP

• Text analysis tools offer a solution to this problem
• Reduce the level of subjectivity (to a minimum)
– feature extraction
– categorization processes
• Raise replicability level
• Reach higher level of reliability
• The concept of Human Associative Memory guides the data processing and evaluation process


Goals and the role of NLP

Four main assumptions:

1. Words or concepts mentioned together are linked in the mind.

2. The more salient a concept is, the more often it is mentioned during the course of an interview.

3. The stronger the association between two concepts, the more often they are mentioned together.

4. Valence of a concept is indicated by positive or negative

Specific requirements

1. Extraction of features and consolidation of extracted features into meaningful categories.
2. Processing of the data using a co-word analysis on a paragraph level as basis for the development of associative networks.
3. Consideration of valence expressions for the weighting of individual features.
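A co-word analysis on paragraph level boils down to counting how often each feature occurs and how often two features occur in the same paragraph. The following sketch assumes the features have already been extracted and reduced to lower-case base forms. The paragraphs and the feature set are invented.

```python
from collections import Counter
from itertools import combinations

def co_word_counts(paragraphs, features):
    """Count feature occurrences and paragraph-level co-occurrences."""
    occurrence = Counter()
    co_occurrence = Counter()
    for paragraph in paragraphs:
        tokens = set(paragraph.lower().split())
        present = sorted(tokens & features)
        occurrence.update(present)
        # Each unordered pair of features in the same paragraph counts once.
        co_occurrence.update(combinations(present, 2))
    return occurrence, co_occurrence

# Invented interview snippets, one string per paragraph.
paragraphs = [
    "i enjoy shopping for shoes",
    "new shoes make me happy",
    "i enjoy being happy in new shoes",
]
features = {"shoes", "enjoy", "happy"}
occurrence, co_occurrence = co_word_counts(paragraphs, features)
```

The occurrence counts reflect the salience of a concept and the co-occurrence counts the strength of an association. The latter can serve directly as edge weights of an associative network.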

Architecture

Processing steps:
1. Definition of text sources
2. Extraction of features and definition of value concepts within sentences
3. Base form per feature (lemmatisation)
4. Adding synonym vectors to features
5. Clustering of features with similar synonym vectors
6. Statistical processing of clusters

Results: frequency information per feature; clusters of features
Resources (updated during processing): base forms, stop words, synonyms, concepts

Architecture: adding synonym vectors to features

Synonyms are computed as the similarity of the global contexts of word forms
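One common way to compare the global contexts of two word forms is to represent each word by the counts of the words it co-occurs with and to take the cosine of the two vectors. The slide does not say which similarity measure is used, so both the sentence-level context window and the cosine measure are assumptions. The sentences are invented.

```python
import math
from collections import Counter

def context_vector(target, sentences):
    """Count all words that share a sentence with the target word."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target in tokens:
            vector.update(t for t in tokens if t != target)
    return vector

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a if w in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented sentences, for illustration only.
sentences = [
    "i wear boots in winter",
    "i wear shoes in winter",
    "the bill was expensive",
]
boots = context_vector("boots", sentences)
shoes = context_vector("shoes", sentences)
bill = context_vector("bill", sentences)
similar = cosine_similarity(boots, shoes)
dissimilar = cosine_similarity(boots, bill)
```

Here "boots" and "shoes" occur in identical contexts and therefore receive the maximum similarity, while "boots" and "bill" share no context word at all.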

Architecture: clustering of features

Clustering with graph-based methods (Chinese Whispers algorithm)
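Chinese Whispers is a simple label-propagation procedure. Every node starts in its own class, and the nodes are then visited repeatedly in random order, each adopting the class with the highest total edge weight among its neighbours. The sketch below is a minimal version under two simplifications: a fixed number of iterations and deterministic tie-breaking (the original algorithm breaks ties randomly). The example graph is invented.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph given as {(u, v): weight}."""
    rng = random.Random(seed)
    neighbours = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbours[u][v] = weight
        neighbours[v][u] = weight
    # Every node starts in its own class.
    labels = {node: node for node in neighbours}
    nodes = sorted(neighbours)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for node in nodes:
            # Adopt the class with the highest total edge weight among neighbours.
            votes = defaultdict(float)
            for neighbour, weight in neighbours[node].items():
                votes[labels[neighbour]] += weight
            labels[node] = max(sorted(votes), key=votes.get)
    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return sorted(sorted(cluster) for cluster in clusters.values())

# Invented similarity graph with two unconnected groups of features.
edges = {
    ("enjoy", "happy"): 1.0, ("happy", "bliss"): 1.0, ("enjoy", "bliss"): 1.0,
    ("price", "cost"): 1.0, ("cost", "bill"): 1.0, ("price", "bill"): 1.0,
}
clusters = chinese_whispers(edges)
```

The number of clusters is not specified in advance but emerges from the graph structure, which suits a setting where the number of concept categories is unknown.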

[Figure] Example: World of shoes

[Figure] Example: Absolute counts

[Figure] Example: Synonyms (similar contexts)

[Figure] Example: Clusters of similar features


Processing of clusters

• For each piece of text, the occurrence and

co-occurrence of clusters of similar features

is counted

• For each piece of text, a factorial analysis is

carried out

• The result is visualized using NetDraw

Example: Final conceptual graph for shoes


Example: Specific findings

• The product category of shoes activates a number of

highly emotional associations in the female

consumers’ minds.

• The purchasing process (marked in blue):

– “satisfy/please”, “wear/try on”, “spend time”, “discover”,

“examine”, “watch/perceive”, “satisfaction/gratification”, “enjoy”,

and “bliss.”

– Simply put: the process of selecting and buying shoes makes

female consumers happy and gives them a feeling of deep

satisfaction.

• Service quality and store ambience can therefore be

strong differentiating factors for a shoe or shoe

store brand.


Conclusions

• A total of 1,938 different features could be

extracted from the transcribed interviews.

• Manual coding resulted in 133 and 112

categories for the two raters respectively.

• Inter-rater-reliability was 65.3 percent

– inter-rater-reliability was 60.6 percent for emotional

aspects

– while for rational aspects, it was 66.7 percent.


Conclusions – The impact of NLP

• The automatic categorization resulted in 185 categories or clusters

• 100 of the 148 manually developed categories, i.e. 67.6 percent, were identical or similar to the automatically developed categories

• Results are comparable to manual coding

• But take only a fraction of time and effort

• The network representation of the main concepts offers a quick yet comprehensive overview of the complete pool of concepts

References

Church, K. W. & Hanks, P. (1990). Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1), 22--29.

Kennedy, A. & Inkpen, D. (2006). Sentiment Classification of Movie Reviews Using Contextual Valence Shifters. Computational Intelligence, 22(2), 110--125.

Nasukawa, T. & Yi, J. (2003). Sentiment Analysis: Capturing Favorability Using Natural Language Processing. In Proceedings of the 2nd International Conference on Knowledge Capture (pp. 70--77).

Pang, B., Lee, L., Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 40th Annual Meeting of the ACL (pp. 79--86).

Quasthoff, U. (2010). Deutsches Kollokationswörterbuch. Berlin, New York: deGruyter.

Remus, R., Ahmad, K., Heyer, G. (2009). Sentiment in German-language News and Blogs, and the DAX. In Proceedings of the Conference on Text Mining Services (TMS), volume XIV of Leipziger Beiträge zur Informatik (pp. 149--158).

Remus, R., Quasthoff, U., Heyer, G. (2010). SentiWS -- a German-language Resource for Sentiment Analysis. In Proceedings of LREC 2010.

Teichert, T., Heyer, G., Schöntag, K., Mairif, P. (2011). Co-Word Analysis for assessing consumer associations: A case study in market research. In Affective Computing and Sentiment Analysis, Springer Science+Business Media B.V.

Turney, P. (2002). Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In Proceedings of the 40th Annual Meeting of the ACL (pp. 417--424).

Turney, P. & Littman, M. (2003). Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems (TOIS), 21(4), 315--346.

Wiebe, J. & Riloff, E. (2005). Creating Subjective and Objective Sentence Classifiers from Unannotated Texts. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pp. 486--

Wiebe, J., Wilson, T., Bruce, R., Bell, M., Martin, M. (2004). Learning Subjective Language. Computational Linguistics, 30(3), 277--308.
