Bodong Chen

Crisscross Landscapes

Notes: Carley. (2012). Data-to-model: a mixed initiative approach for rapid ethnographic assessment



Citekey: @Carley2012-xn

Carley, K. M., Bigrigg, M. W., & Diallo, B. (2012). Data-to-model: a mixed initiative approach for rapid ethnographic assessment. Computational & Mathematical Organization Theory, 18(3), 300–327.






Rapid ethnographic assessment is used when there is a need to quickly create a socio-cultural profile of a group or region. While there are many forms such an assessment can take, we view it as providing insight into who are the key actors, what are the key issues, sentiments, resources, activities and locations, how have these changed in recent times, and what roles do the various actors play. We propose a mixed initiative rapid ethnographic approach that supports socio-cultural assess- ment through a network analysis lens. (p. 1)

We refer to this as the data-to-model (D2M) process. In D2M, semi-automated computer-based text-mining and machine learning techniques are used to extract networks linking people, groups, issues, sentiments, resources, activities and locations from vast quantities of texts. Human-in-the-loop procedures are then used to tune and correct the extracted data and refine the compu- tational extraction. Computational post-processing is then used to refine the extracted data and augment it with other information, such as the latitude and longitude of particular cities. (p. 1)

in many environments, there is a need to address these questions remotely; i.e., without actually going to the region of concern or meeting with the members of the group. (p. 2)

Rapid ethnographic assessment is an approach for rapidly characterizing the socio- cultural landscape (Beebe 1995). Rapid ethnographic assessment provides insight into the socio-cultural nature of a region by identifying the current key actors, issues, sentiments, resources, activities and locations, and any recent changes. (p. 2)

Traditional ethnography requires massive amounts of painstaking detailed daily investigation into a region or group and can take years. The result is a detailed rich understanding that can support theory development and model testing. (p. 2)

However the approach does not scale and the specific methods used are often context specific. In contrast, rapid ethnographic assessment though it derives from anthropological fieldwork fa- vors reproducibility, speed, and general understanding over exquisite detail. Rapid ethnographic assessment is increasingly widely used in a number of fields from hu- man computer interaction to medicine to counter-terrorism. (p. 2)

However, the set of tech- niques vary widely. Common themes are key informants, rapid assessment, and field data. (p. 2)

we propose, and demonstrate, a computer-assisted data-to-model approach for rapid ethnographic assessment. The proposed approach is a complement to, not a replacement for, traditional ethnography; and serves to fill a key gap—rapid re- mote assessment. (p. 2)

The heart of this approach is open-source text exploitation. The tremendous growth in the amount of digital information available, largely in unstructured or text form makes this approach possible. (p. 3)

Semi-automated approaches are efficient when dealing with vast quantities of unstructured texts which cannot be analyzed individually by Subject Matter Experts (SME) in a timely, unbiased and systematic fashion (Corman et al. 2002). (p. 3)

The overall text analysis approach is often referred to as Network Text Analy- sis (NTA) which encodes a meta-network based on the relationships of words in a text (Carley 1997; Popping 2000) and then comparing the network of texts (Batagelj et al. 2002). (p. 3)

We note that within the field of text-mining there exist a number of techniques each operating independently and researched separately (Manning et al. 2008). These include, but are not limited to entity extraction (Chakrabarti 2002), content analysis (Holsti 1969; Krippendorff 2004), topic modeling (Hofmann 1999; Blei et al. 2004), entity resolution and event discovery (Roth and Yih 2007), theme analysis (Landauer et al. 1998), role discovery (Wang et al. 2010), link extrac- tion (Ramakrishnan et al. 2006), multi-dimensional extraction (Lin et al. 2010; Zhang et al. 2009;Dingetal.2010) and meta-network extraction (Carley et al. 2011b; Diesner and Carley 2008). Our approach employs a number of these individual tech- niques, and other techniques not heretofore mentioned such as deduplication and anaphor resolution. (p. 3)

Overall, the data-to- model process employs a vast number of distinct methodologies from web-scrapers and sensor feeds to text-mining techniques to network analysis tools to agent-based simulations. (p. 3)

Central to the data-to-model approach is network analysis. Specifically, this ap- proach involves the extraction, analysis, and then simulation of the impact of various interventions of a network based model of the complex socio-cultural system being analyzed. The system is represented as a meta-network (Carley 2002, 2006) (p. 3)

meta-network: an ecology of networks linking the who, what, when, where, why and how are collectively incorporated into a multi-mode, multi-link, multi-level network representation system. (p. 3)

Human involvement often that of subject matter experts, is critical to the over- all process. From a development perspective, humans played the following roles: (a) developed specialized generalization thesauri, (b) vetted computer generated the- sauri, © vetted extract networks for people to people, ethnic-group/tribe to ethnic- group/tribe, people to ethnic-group/tribe, organization to organization, people to or- ganization, (d) vetted time line, and (e) confirmed results. (p. 4)

2Data (p. 4)

Sudan: Social change and situation assessment. 71,000 texts from news-sources, websites, books, and writings by scholars. Project-based thesauri contains 38,552 gazetteer locations and 16,001 entries. Afghanistan: primary and secondary actors and region of influence analysis: 247 large texts from news-sources and websites. Project-based thesauri contains 146,573 entries. Catnet: competitive adaptation analysis: 1,109 texts from news-sources and web- sites. Project-based thesauri contains 28,727 entries. (p. 4)

3 Data-to-model (D2M) process (p. 5)

The D2M process mirrors a manual approach that is taken by the analyst for devel- oping a model from unstructured text data: downloading and fixing data, cleaning and preparing text, finding important concepts, forming relationships between con- cepts, attempting initial analysis, further refinement of concepts and relationships with associated re-analysis, and then finally after the output is correct the addition of attributes to the concepts, and final analysis results. Using D2M the analyst converts raw texts to networks; specifically to meta-networks. (p. 5)

A semantic network is a set of interlinked concepts. (p. 5)

A meta-network is a geo-temporal set of interlinked networks connecting nodes in multiple ontological classes. The ontological classes we use are: agents, organi- zations, locations, knowledge, resources, beliefs, tasks and events. Using D2M, con- cepts are extracted, linked into a semantic network, cross-classified into their ontolog- ical class, and attributes of nodes are added. (p. 5)

The D2M process is designed to minimize the strain on the analyst of using net- work text analysis to generate this meta-network model and the difficulty in using a multi-tool cycle to clean the meta-network model. D2M uses various component tools and routines in AutoMap (Carley et al. 2011b) and ORA (Carley et al. 2011a)in predefined sequences. (p. 5)

3.1 Component technologies (p. 6)

At the core of the data-to-model process are two basic sub-tools, AutoMap and ORA. AutoMap and ORA were designed to work together, i.e., the output of AutoMap is directly readable by ORA. (p. 6)

AutoMap (Carley et al. 2011b) is used for text-mining and data extraction. Au- toMap is a toolbox of techniques and approaches for network text analysis. (p. 7)

ORA (Carley et al. 2011a) is used for network analytics and simulation. ORA is a toolbox of techniques for network analysis. (p. 7)

3.2 High level overview of D2M steps (p. 7)

  1. Corpus Collection 2. Corpus Cleaning 3. Meta-Network Extraction 4. Refinement (Thesauri Construction and Cleaning) 5. Final Meta-Network Extraction 6. Data Augmentation 7. Network Analytics and Forecasting (p. 7)

Script 1—Meta-network extraction component For data coding, we use machine learning techniques in support of advanced network text analysis. At an abstract and very high level, a network text analysis involves four-steps for converting unstruc- tured texts to meta-networks (Diesner and Carley 2005): (p. 8)

  1. Concept Identification: The analyst identifies what concepts are or are not of in- terest. 2. Entity Identification: The analyst defines an ontology; i.e., the set of entities into which the data will be coded. Typical entity classes are people, organizations, places, resources, tasks, events, and beliefs. Note that step 1 and 2 are temporally interchangeable. 3. Concept Classification: The analyst maps the identified concepts of interest into the entity classes using a meta-network or ontology thesaurus. 4. Meta-network extraction: The analyst using an automated tool, applies the afore- mentioned thesauri and delete lists to the texts. This tool then processes the texts and extracts concepts, cross-classifies them into their ontological category, and then extracts the relation among the identified concepts. The result is a structured dataset containing a meta-network. (p. 9)

On the surface this sounds straight forward. It turns out, however, that there are many coding choices that the analyst must make along the way that influence the results (Carley 1993). (p. 9)

3.3 Detailed description of D2M steps (p. 9)

3.3.1 Step 1: Corpus collection (p. 9)

The analyst begins with a corpus of raw texts. These texts can be drawn from any number of sources. For the three data sets we used, data was collected via web- scrapers and/or provided by external sources. Meta-information was removed. Du- plicate texts were removed. (p. 9)

including emails, news articles, web pages, blogs, twitter or RSS feeds (p. 9)

Depending on the source of the data, deduplication is necessary. (p. 10)

Both deduplication and the removal of extraction related information do impact the results. (p. 11)

3.3.2 Step 2: Corpus cleaning (p. 11)

Corpus cleaning is designed to take the body of the text that the analyst wants to process and augment or alter it to remove various typesetting and linguistic style vagaries. It includes generic preprocessing activities: e.g., typo correction and the ex- pansion of contractions and abbreviations. In addition, during this phase pronouns are resolved and unidentified pronouns removed. This is a completely automated process. (p. 11)

two phases: syntax-preserving clean- ing and syntax-destroying cleaning. (p. 11)

The syntax-preserving operations include: removal of extra spaces, expansion of contractions and abbreviations, typo correction, pronoun resolution, and British to (p. 11)

American spelling conversion. (p. 12)

The syntax-destroying operations of the preparation phase include: removing sin- gle letters and noise words, and the replacement of common phrases with its n-gram equivalent. (p. 12)

An additional step, context-sensitive stemming, is done at the end of the cleaning that uses information generated from the syntax-preserved text and uses it for the last step of the cleaning process. (p. 13)

Stemming, using a Porter stemmer (Porter 1980), is only performed on common nouns and common verbs. (p. 13)

All of these cleaning processes are automated. (p. 14)

3.3.3 Step 3: Meta-network extraction (p. 14)

Using advanced text-mining techniques the set of texts are processed to generate the meta-network, the semantic network and an initial project thesauri. Concepts are categorized into their appropriate meta-network ontological class, which includes agents, organizations, locations, events, knowledge, resources, beliefs and tasks (Car- ley 2002). (p. 14)

Script 1: Choices and process in creating the automated script To create the script that is now script 1 we went through various non-automated and semi-automated pro- cesses and identified best practices. (p. 14)

The first over-arching key issue is degree of generalization. Coding choices can result in the extracted concept set being over- or under-generalized relative to the an- alysts needs. (p. 15)

The second overarching key issue is the degree of connection. We found that link extraction was improved by using multiple link-extraction techniques and fusing the results. One technique is proximity based. (p. 15)

We then identified a number of preliminary steps to standardize the extraction process and to eliminate non-bearing concepts from across the texts (Carley 1993). (p. 16)

Concept identification is a very involved process that includes deleting uninter- esting terms and generalizing others concepts into the set of interest. (p. 16)

Entity Identification is a theoretically based step; i.e., it depends on the research questions of interests. The complexity of the data processing and analysis increases nonlinearly as the number of entities increases. Thus, there is a trade-off that one needs to consider when deciding the number of entity classes. (p. 17)

Concept classification: This step involves creation of another type of thesaurus, the meta-network or ontology. In general, high quality classification requires employing many classification techniques to create these thesauri. We employ: historical meta- network thesauri developed on other projects including location and world-leader lists, machine learning techniques, parts-of-speech tagging, and specialized meta- network thesauri by SMEs. (p. 17)

Meta-network extraction: This step is very computer-intensive and results in the extracted networks. We use AutoMap to process the set of texts, applying the thesauri, and generate out a meta-network. (p. 18)

Graph and analyze data: At this point, the text has been turned into a structured data set, specifically a meta-network, that can be graphed and analyzed. (p. 18)

We have one such meta-network per text. Since there are thousands of texts there are thousands of networks. The next decision the analyst needs to make is which results to combine. This decision is dependent on the research question. (p. 18)

if the issue is how different is the view from the newspaper and the scholars then you would create two fused meta-networks (p. 18)

A yearly compression enabled us to take a longitudinal perspective. (p. 19)

Our final step was to take all of these findings and solidify them into a workflow that can be used again and on different data sets. (p. 19)

To re-iterate the output of this step is a semantic network, a meta-network, and a project thesauri with both the generalization and the ontology classification compo- nents. These are then reviewed by the analyst in the next cleaning step. (p. 19)

3.3.4 Step 4: Refinement (thesauri construction and cleaning) (p. 19)

The refinement step uses both human-in-the-loop cleaning and then application of Script 2. (p. 19)

The refinement step involves: 1. Merging any concepts that need to be merged; e.g., when there are newly identified alias. 2. Deleting concepts that are not of interest to the study. 3. Checking and fixing the classification of concepts into entity classes. 4. Checking and fixing the classification of concepts as specific or generic. (p. 20)

Script 2 and refinement: choices and process (p. 21)

One of the first choices made had to do with locations. While locations of interest, such as cities and countries (p. 21)

A second choice is centered around how to treat n-grams as a single concept. (p. 21)

A third choice centered on the provision of n-grams to the analyst for review. (p. 21)

The D2M process will extract a meta-network (nee social network) as it appears in the text (Diesner and Carley 2005). Concepts that do not contribute to the final meta- network can be removed early in the text processing, and then will not appear in future codings. (p. 22)

3.3.5 Step 5: Final meta-network extraction (p. 23)

Once the analyst is satisfied with the refinement of the concepts the final meta- network is extracted. This is a completely automated process. All thesauri and coding choices selected earlier are applied. The resultant network can then be read into ORA for assessment or it can be augmented with additional data. (p. 23)

meta-network will include both the semantic network— as a network of all concepts to all concepts and a set of subnetworks based on the ontologically cross-classified concepts. This meta-network will include both forms of the cross-class networks. A cross-class network is a bipartite network; i.e., a net- work formed from two entity classes, such as agent x organization and organization x agent. (p. 23)

3.3.6 Step 6: Data augmentation (p. 24)

Data augmentation uses additional data sets and rules of inference to augment the extracted networks. Three types of additional data are typically used: geo-spatial, attribute files and meta-data. (p. 24)

Geo-Spatial Augmentation: Using post-processing geo-information, latitude and longitude, are added to facilitate analysis and visualization. For this purpose we used GeoNames (, which is an open source gazetteer site. Alter- natively NGIA GeoNET ( could be used. (p. 24)

Attribute files augmentation: Attributes are information about nodes that are of routine important (p. 24)

Meta-data: Meta-data is that information in the structured part of a semi-structured data. For example, in email, this is the header information. (p. 24)

The final approach to augmentation is inference. Using findings from anthropol- ogy and organization science, ORA has a series of procedures for inferring beliefs and group membership. (p. 24)

3.3.7 Step 7: Network analytics and forecasting (p. 25)

The resulting meta-network can then be processed using network analysis routines in ORA (Carley et al. 2011a). (p. 25)

4 Summary (p. 25)

Carley KM (1997) Network text analysis: the network position of concepts. In: Roberts CW (ed) Text analysis for the social sciences. Lawrence Erlbaum, Mahwah (p. 27)

Carley KM, Reminga J, Storrick J, Columbus D (2011a) ORA user’s guide 2011. Carnegie Mellon Uni- versity, School of Computer Science, Institute for Software Research, Technical Report, CMU-ISR- 11-107 Carley KM, Columbus D, Bigrigg M, Kunkel F (2011b) AutoMap user’s guide 2011. Carnegie Mellon University, School of Computer Science, Institute for Software Research, Technical Report, CMU- ISR-11-108 (p. 27)

Diesner J, Carley KM (2005) Revealing social structure from texts: meta-matrix text analysis as a novel method for network text analysis In: Causal mapping for information systems and technology re- search: approaches, advances, and illustrations. Idea Group Publishing, Harrisburg (p. 27)