Genre Program Name Developer Citation Description from website Languages URL
Semantic, Syntactic Coh-Metrix Graesser et al. McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge, MA: Cambridge University Press. Coh-Metrix is a system for computing cohesion and coherence metrics for written and spoken texts. Coh-Metrix allows readers, writers, educators, and researchers to instantly gauge the difficulty of written text for the target audience. English, Chinese http://cohmetrix.com/
Sentiment LIWC Pennebaker et al. Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2001). Linguistic inquiry and word count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates. LIWC2015 is the gold standard in computerized text analysis. Learn how the words we use in everyday language reveal our thoughts, feelings, personality, and motivations. Based on years of scientific research, LIWC2015 is more accurate, easier to use, and provides a broader range of social and psychological insights compared to earlier LIWC versions. English, Chinese, Arabic, Spanish, Dutch, French, German, Italian, Russian, and Turkish https://liwc.wpengine.com/
Personality IBM Watson IBM High, R. (2012). The era of cognitive systems: An inside look at IBM Watson and how it works. IBM Corporation, Redbooks. Go beyond artificial intelligence with Watson. With Watson APIs and solutions, businesses are already achieving outcomes – from improving customer engagement, to scaling expertise, to driving innovation and growth. Arabic, Catalan, Chinese, Danish, Dutch, English, Farsi, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish https://www.ibm.com/watson/developercloud/personality-insights.html
Visual Gephi Bastian et al. Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. ICWSM, 8, 361-362. Gephi is the leading visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free. Runs on Windows, Mac OS X and Linux. N/A https://gephi.org/
Semantic, Syntactic Semilar Vasile Rus Rus, V., Lintean, M., Banjade, R., Niraula, N., and Stefanescu, D. (2013). SEMILAR: The Semantic Similarity Toolkit. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August 4-9, 2013, Sofia, Bulgaria. The SEMILAR software environment offers users, researchers, and developers easy access to fully-implemented semantic similarity methods in one place through both a GUI-based interface and a library. English http://www.semanticsimilarity.org/
Sentiment SentiStrength Thelwall et al. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544–2558. SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths: -1 (not negative) to -5 (extremely negative) and 1 (not positive) to 5 (extremely positive). English, Finnish, German, Dutch, Spanish, Russian, Portuguese, French, Arabic, Polish, Persian, Swedish, Greek, Welsh, Italian, Turkish http://sentistrength.wlv.ac.uk/
Personality Profiler Plus Michael Young Young, M. D. (2001). Building worldview(s) with Profiler+. Progress in Communication Sciences, 17-32. Social Science Automation works with clients in government, business, and academia who benefit from the speed and accuracy of automated text analysis. By using the Profiler Plus Text Coding Platform and applying various coding schemes, they are able to answer real world questions for their business or industry. Profiler Plus can help you: Track sentiment in traditional/social media for brands, companies, people, or nations. Make more fully informed decisions about new athletic talent for your team. Use language analysis to evaluate threats. Mitigate risks by tracking and evaluating event data. Understand different cultures and make multi-lingual comparisons about them. Profile leaders of countries, corporations, or any other individuals in positions of power. English, Arabic, Spanish, Russian, Chinese https://profilerplus.org/
LDA Stanford Daniel Ramage & Evan Rosen Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 (pp. 248-256). Association for Computational Linguistics. The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features the ability to: Import and manipulate text from cells in Excel and other spreadsheets. Train topic models (LDA, Labeled LDA, and PLDA) to create summaries of the text. Select parameters (such as the number of topics) via a data-driven process. Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data. All https://nlp.stanford.edu/software/tmt/tmt-0.4/
LDA R Packages Jonathan Chang Chang, J. (2010). Package ‘lda’. R package. Implements latent Dirichlet allocation (LDA) and related models. This includes (but is not limited to) sLDA, corrLDA, and the mixed-membership stochastic blockmodel. Inference for all of these models is implemented via a fast collapsed Gibbs sampler written in C. Utility functions for reading/writing data typically used in topic models, as well as tools for examining posterior distributions, are also included. (An illustrative LDA sketch in Python appears after the table.) All https://cran.r-project.org/web/packages/lda/lda.pdf
LDA Mallet Andrew McCallum McCallum, Andrew Kachites. “MALLET: A Machine Learning for Language Toolkit.” http://mallet.cs.umass.edu. 2002. Topic models provide a simple way to analyze large volumes of unlabeled text. A “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. For a general introduction to topic modeling, see for example Probabilistic Topic Models by Steyvers and Griffiths (2007). All http://mallet.cs.umass.edu/topics.php
LDA Sprout Zhiqiang Cai Cai, Z., Graesser, A.C., Windsor, L., & Hu, X. (2017). Sprout: A semantic analysis tool for small corpora. Paper presented at the Forty-seventh Annual Meeting of the Society for Computers in Psychology, Vancouver.; Cai, Z., Pennebaker, J. W., Eagan, B., Shaffer, D. W., Dowell, N. M., & Graesser, A. C. (2017). Epistemic network analysis and topic modeling for chat data from a collaborative learning environment. In X. Hu, T. Barnes, A. Hershkovitz, & L. Paquette (Eds.), Proceedings of the 10th International Conference on Educational Data Mining (pp. 104-111). Wuhan, China: EDM Society. A seed-word-based topic modeling tool for semantic analysis of small corpora. English, Chinese N/A
Epistemic Network Analysis ENA David Shaffer Shaffer, D. W., Collier, W., & Ruis, A. R. (2016). A tutorial on epistemic network analysis: Analyzing the structure of connections in cognitive, social, and interaction data. Journal of Learning Analytics, 3(3), 9–45. ENA quantifies and visualizes the structure of connections among coded elements in cognitive, social, and interaction data. English http://www.epistemicnetwork.org/
Text-as-Data tools Plot Mapper Beauchamp N/A This maps the key words and the progression of the text (its “plot”) in a two-dimensional space. N/A http://www.nickbeauchamp.com/projects/plotmapper.php
Text-as-Data tools AutoScale Beauchamp N/A This automatically scales a set of documents and their words according to the major dimensions of variation, some of which are often ideological. N/A http://www.nickbeauchamp.com/projects/pcascale.php
Text-as-Data tools QuickTopics Beauchamp N/A This characterizes a set of documents by finding a latent set of topics, along with the words associated with those topics. N/A http://www.nickbeauchamp.com/projects/topicmodel.php
Text-as-Data tools Gscale Beauchamp N/A This uses Google search results to scale anything. You choose the search terms to scale (e.g., senator names) and two reference terms to scale them between (e.g., democrat, republican), and Gscale uses the Google text results from the latter two to scale the former. N/A http://www.nickbeauchamp.com/projects/gscale/gscale.php
Text-as-Data tools Summarizer Beauchamp N/A This “summarizes” a text by finding the N most representative sentences. N/A http://www.nickbeauchamp.com/projects/summarizer.php
Text-as-Data tools Motifator Beauchamp N/A This finds small clusters of related words (motifs) in the text. N/A http://www.nickbeauchamp.com/projects/motifator2.php
Text-as-Data tools Text Predictor Beauchamp N/A This generates more text in the style of the original text, although the syntax is a bit rough. N/A http://www.nickbeauchamp.com/projects/textpredictor3.php
Text-as-Data tools Choose Your Own Beauchamp N/A This is like Text Predictor, except you can choose which of the probable next words you want. N/A http://www.nickbeauchamp.com/projects/chooseyourown.php
Text-as-Data tools Text Compare Beauchamp N/A This allows you to compare a set of different texts, ranking their overall similarities in various ways. N/A http://www.nickbeauchamp.com/projects/textcompare.php
Text-as-Data tools Cliché Score Beauchamp N/A This counts “cliches,” loosely speaking. It tracks the appearance of common 4-word sequences in your text. English http://www.nickbeauchamp.com/projects/clichescore.php
Text-as-Data tools Word Counter Beauchamp N/A Counts words. N/A http://www.nickbeauchamp.com/projects/wordcounter.php
Text-as-Data tools Democratic Writing Beauchamp N/A This allows a group of participants to collectively write a text using a nomination and voting system that ensures strictly democratic equality. (Due to a change in PHP since this was written in 2002, existing projects may be viewed, but new ones cannot be created or modified.) N/A http://www.nickbeauchamp.com/DW_index.php
Text-as-Data tools Events Will Lowe Lowe W. (2012) ‘events: Store and manipulate event data’. R package version 0.5, URL http://cran.r-project.org/web/packages/events/ Events is an R package to make life a bit easier for people who analyse event data (the kind of output that KEDS/TABARI generates). There’s no fancy stuff, just a logical interface to all the data massaging we do to event data before any actual analysis. The package also bundles the CAMEO and WEIS event code schemes, plus their Goldstein numerical scalings, and some event data from the Bosnian conflict. N/A http://conjugateprior.org/software/events/
Text-as-Data tools Austin Will Lowe Will Lowe. 2015. Austin: Do things with words. Version 0.2.2 URL http://github.com/conjugateprior/austin Austin is an R package for doing things with words. Right now that means scaling them in the style of Wordscores and Wordfish. N/A http://conjugateprior.org/software/austin/
Text-as-Data tools Yoshikoder Will Lowe Lowe W. (2015) ‘Yoshikoder: Cross-platform multilingual content analysis’. Java software version 0.6.5, URL http://www.yoshikoder.org The Yoshikoder is a cross-platform multilingual content analysis program developed as part of the Identity Project at Harvard‘s Weatherhead Center for International Affairs. You can load documents, construct and apply content analysis dictionaries, examine keywords-in-context, and perform basic content analyses, in any language. N/A http://conjugateprior.org/software/yoshikoder/
Text-as-Data tools Jfreq Will Lowe Lowe W. (2011) ‘JFreq: Count words, quickly’. Java software version 0.5.4, URL http://www.conjugateprior.org/software/jfreq/ JFreq takes plain text documents and turns them into a word frequency matrix. It tries hard to be a) quick, and b) not take up much memory. It could be better at both, but it’s quite usable. A graphical version with online help is also available for the Mac. N/A http://conjugateprior.org/software/jfreq/
Text-as-Data tools YKConverter Will Lowe Lowe W. (2010) ‘YKConverter: Turn documents into texts’. Java software version 0.5, URL http://www.conjugateprior.org/software/ykconverter/ The YKConverter is a utility that tries to extract the text from documents in various formats (HTML, Word, PDF, Powerpoint, Excel) and save it as UTF-8 encoded text. You might do this to prepare for a subsequent content analysis. N/A http://conjugateprior.org/software/ykconverter/
Text-as-Data tools Re-encoder Will Lowe Lowe W. (2012) ‘Re-encoder: Switch encodings’. Java software version 0.2, URL http://conjugateprior.org/software/reencoder/ Re-encoder takes a folder full of files, assumes that they are encoded plain text, and saves each one into a different encoding. You specify the ‘from’ and ‘to’ encodings. If you’re not sure, there’s a preview function where you can experiment with different encodings until your file looks right. N/A http://conjugateprior.org/software/reencoder/
Text-as-Data resource Content Analysis in Python Will Lowe N/A This page is currently not much more than an extended advertisement for doing content analysis in Python. In time it might expand to a full tutorial, should anyone express interest in reading one. In the meantime it’ll hopefully just whet your appetite. The scripts presented here are not intended to teach programming; I assume you have at least a vague idea about that already. Nor are they intended to exemplify fine coding style. The point is to show how easy things can be, if you pick the right tools. N/A http://conjugateprior.org/software/ca-in-python/
Text-as-Data tools VBPro Mark Miller Miller, M. (1997). VBPro [computer software]. VBPro is Mark Miller’s bundle of content analysis programs. Although development of VBPro has ceased, the most important algorithmic details of each program are in the process of being open sourced. This should help researchers replicate and extend analyses that use the software. In the meantime, the original VBPro distribution remains available for download. Thanks to Mark Miller for making this possible. N/A http://www.mariapinto.es/ciberabstracts/Articulos/VBPro.htm
Text-as-Data tools NVivo QSR International Castleberry, A. (2014). NVivo 10 [software program]. Version 10. QSR International; 2012. If you want to get an edge by better understanding the explosion of unstructured data in the world today, you need NVivo – powerful software for qualitative data analysis. Whether you are working individually or in a team, on Windows or Mac, are new to research or have years of experience, there’s an NVivo option to suit you. Can you afford to miss the insights your data is trying to show you? English, French, German, Japanese, Portuguese, Simplified Chinese, Spanish http://www.qsrinternational.com/product
Sentiment Tone Analyzer IBM N/A The IBM Watson™ Tone Analyzer service uses linguistic analysis to detect three types of tones from written text: emotions, social tendencies, and writing style. The service can be used to analyze conversations and communications; use the output to respond to customers appropriately and craft the perfect message. English https://www.ibm.com/watson/developercloud/tone-analyzer.html
NLP Natural Language Classifier IBM N/A The Natural Language Classifier service understands the intent behind text and returns a corresponding classification, complete with a confidence score. For example, “What is the weather like today?”, “Is it hot out?”, and “Is it going to be nice today?” are all ways of asking about “temperature”. Use NLC to answer questions in a contact center, create chatbots, categorize volumes of written content, and more. English, Arabic, French, German, Japanese, Italian, Portuguese, and Spanish https://www.ibm.com/watson/developercloud/nl-classifier.html
NLP WordNet The Global WordNet Association Fellbaum, C. (1998). WordNet. Blackwell Publishing Ltd. A free, public and non-commercial organization that provides a platform for discussing, sharing and connecting wordnets for all languages in the world. (A minimal sketch of querying WordNet from Python appears after the table.) Afrikaans, Albanian, Arabic, Hindi, Indonesian, Japanese, Lao, Mongolian, Burmese, Nepali, Sinhala, Thai, Vietnamese, Malaysian, Bantu languages, English, Spanish, Catalan, Basque, Italian, Bengali, Bulgarian, Czech, Greek, Romanian, Serbian, Turkish, Chinese (Traditional), Chinese (Simplified), Croatian, Danish, Dutch, Estonian, Finnish, French, German, Hebrew, Hungarian, Icelandic, Latin, Irish, Assamese, Bodo, Gujarati, Kannada, Kashmiri, Konkani, Malayalam, Meitei, Marathi, Sanskrit, Tamil, Telugu, Punjabi, Urdu, Oriya, Korean, Kurdish, Latvian, Macedonian, Maltese, Moldavian, Norwegian, Persian, Polish, Russian, Slovenian, Swedish http://globalwordnet.org/wordnets-in-the-world/
Visual Trelliscope Ryan Hafen Hafen, R. Interface, design, and computational considerations for D&R. Trelliscope is a powerful visualization tool for large data that allows users to rapidly render customized graphics that are both detailed (i.e., drilled-down) and interpretable across large datasets — and then filter and sample the results to focus on examples that share common traits or characteristics (e.g., display only plots with correlation above 95% or display plots where regression model fit was poor). Trelliscope is also very useful for visualizing small datasets. It is unique in that it is highly customizable and is based on trellis displays which guide users to focus on relatively small, rational subsets of data. This approach helps data scientists develop machine learning algorithms and predictive analytics that account for the scientific phenomenology displayed in the graphics — an essential requirement in many environments. Language Agnostic https://signatures.pnnl.gov/software/trelliscope.stm
Semantic, Syntactic Snowball M.F. Porter Porter, M. F. (2001). Snowball: A language for stemming algorithms. Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. This site describes Snowball, and presents several useful stemmers which have been implemented using it. The Snowball compiler translates a Snowball script into another language – currently ISO C, Java and Python are supported. (A minimal stemming sketch appears after the table.) English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish http://snowball.tartarus.org/texts/introduction.html
Visual Motion Chart Battista & Cheng Battista, V., & Cheng, E. (2011). Motion charts: Telling stories with statistics. In American Statistical Association Joint Statistical Meetings (Vol. 4473). A dynamic chart to explore several indicators over time. The chart is rendered within the browser using Flash. Language Agnostic https://developers.google.com/chart/interactive/docs/gallery/motionchart?csw=1
Translation Google Translate Google N/A No official description is given on the website; the service translates text between the listed languages. Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Catalan, Cebuano, Chichewa, Chinese, Corsican, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Frisian, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scots Gaelic, Serbian, Sesotho, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Xhosa, Yiddish, Yoruba, Zulu https://translate.google.com/
Personality AnalyzeWords Roger Booth & Jamie Pennebaker Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2001). Linguistic inquiry and word count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates. AnalyzeWords helps reveal your personality by looking at how you use words. It is based on good scientific research connecting word use to who people are. So go to town – enter your Twitter name or the handles of friends, lovers, or Hollywood celebrities to learn about their emotions, social styles, and the ways they think. English, Chinese, Arabic, Spanish, Dutch, French, German, Italian, Russian, and Turkish http://www.analyzewords.com/
NLP nltk Steven Bird Bird, S. (2006, July). NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69-72). Association for Computational Linguistics. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. (A minimal tokenization and tagging sketch appears after the table.) 50+ languages (Project Gutenberg sample) http://www.nltk.org/
NLP Stanford CoreNLP Christopher Manning Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60. A Java toolkit providing a pipeline of core NLP annotators, including tokenization, part-of-speech tagging, lemmatization, named entity recognition, parsing, and coreference resolution. English, Arabic, Chinese, French, German, Spanish http://stanfordnlp.github.io/CoreNLP/
NLP spaCy Matthew Honnibal Choi, J. D., Tetreault, J., & Stent, A. (2015). It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 387-396). spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython. Independent research has confirmed that spaCy is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using. spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. I like to think of spaCy as the Ruby on Rails of Natural Language Processing. spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, Keras, Scikit-Learn, Gensim and the rest of Python’s awesome AI ecosystem. spaCy helps you connect the statistical models trained by these libraries to the rest of your application. (A minimal pipeline sketch appears after the table.) English, German, Chinese, Spanish, Italian, French, Portuguese, Dutch, Swedish, Finnish, Hungarian, Bengali, Hebrew https://spacy.io/
Event Data ICEWS Phil Schrodt Boschee, E., Lautenschlager, J., O’Brien, S., Shellman, S., Starz, J., & Ward, M. (2015). ICEWS coded event data. Harvard Dataverse, 5. As one of the Assistant Secretary of Defense for Research and Engineering (ASD(R&E))’s Human Social, Culture and Behavior (HSCB) flagship programs, Integrated Crisis Early Warning System (ICEWS) has developed and is deploying a comprehensive, integrated, automated, generalizable, and validated system to monitor, assess, and forecast national and international crises in a way that supports decision-making on how to mitigate them. ICEWS provides Combatant Commanders (COCOMs), the IC, and various government agencies with a powerful, systematic capability to anticipate, track, and respond to stability challenges around the world. English http://www.lockheedmartin.com/us/products/W-ICEWS/W-ICEWS_Publications.html
Event Data GDELT Kalev Leetaru Leetaru, K., & Schrodt, P. A. (2013). GDELT: Global data on events, location, and tone, 1979–2012. Paper presented at the International Studies Association Annual Convention. Supported by Google Jigsaw, the GDELT Project monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world. N/A http://www.gdeltproject.org/
Event Data CAMEO Phil Schrodt Gerner, D. J., Schrodt, P. A., Yilmaz, O., & Abu-Jabr, R. (2002). Conflict and Mediation Event Observations (CAMEO): A new event data framework for the analysis of foreign policy interactions. International Studies Association, New Orleans. The Computational Event Data System is the current name for a series of projects beginning around 1998 that have focused on the machine coding of international event data using pattern recognition and simple grammatical parsing. These systems are designed to work with short news articles such as those found in wire service reports or chronologies. To date, the software has primarily been used to code events from Reuters and Agence France Presse wire service lead sentences but in principle it can be used for other event coding schemes. N/A http://eventdata.parusanalytics.com/data.dir/cameo.html
Event Data WEIS Phil Schrodt Schrodt, P. A., & Leibsohn, D. (1985). An Algorithm for the Classification of WEIS Event Code from WEIS Textual Descriptions. International Studies Association, Washington. Most of the codes that are used in the data sets produced by TABARI are the standard WEIS codes originally developed by Charles McClelland (see “World Event/Interaction Survey (WEIS) Project, 1966-1978”, ICPSR Study No. 5211). However, at various points we have experimented with introducing new codes into WEIS, borrowing most of these from the PANDA project. We assigned weights to the new codes that are comparable to the weights used in the Goldstein scale, and those weights are used in the aggregated data. The full list of these codes is available on the project site. N/A http://eventdata.parusanalytics.com/data.dir/weis.html
Event Data KEDS Phil Schrodt Schrodt, P. A., Davis, S. G., & Weddle, J. L. (1994). Political science: KEDS—a program for the machine coding of event data. Social Science Computer Review, 12(4), 561-587. While not as current as TABARI, the KEDS (Kansas Event Data System) program was our first major foray into the automation of events data. The level of documentation and development of the program surpasses that of TABARI, even though the latter is, in most respects, a more capable program. The KEDS program runs natively on Mac OS 6.0 or later and uses a limited amount of disk space. Additionally, KEDS supports a nifty GUI interface that is to date absent in TABARI. N/A http://eventdata.parusanalytics.com/software.dir/keds.html
Event Data Open Event Data Alliance Phil Schrodt, John Beieler, Patrick Brandt, Erin Simpson, & Andy Halterman N/A The prime objective of the OEDA is to provide reliable, open access, multi-sourced political event datasets that are updated at least weekly, are transparent and have documented source texts, and use one or more of the open coding ontologies supported by the organization. As an organization, OEDA will aggregate, rather than generate, such data—in particular we expect to be linking to multiple data sets. These datasets will share a common format or be supported by open software that will translate them into a common format. Data generated with open source coding engines and dictionaries are preferred, but the organization is open to proprietary coding methods, provided the resulting data are open access, documented, and clear of intellectual property issues. The OEDA does not seek to establish any definitive imprimatur but rather to provide guidance for voluntary solutions to coordination problems on issues and resources of common concern. N/A http://openeventdata.org/#about
Event Data Petrarch C. Norris Norris, C. (2016). Petrarch 2: Petrarcher. arXiv preprint arXiv:1602.07236. Code for the new Python Engine for Text Resolution And Related Coding Hierarchy (PETRARCH) event data coder. The coder now has all of the functions from the older TABARI coder and the new CAMEO.verbpatterns.140609.txt dictionary incorporates both parser-based matching and extensive synonym sets. The program coded 60,000 AFP sentences from the GigaWord corpus without crashing, using the included dictionaries. N/A http://openeventdata.org/
Event Data Phoenix Phil Schrodt Schrodt, P. A. (2014). Phoenix Event Data Set Documentation. The Phoenix dataset is a new, near real-time event dataset created using the next-generation event data coding software, PETRARCH. The data is generated using news content scraped from over 400 sources. This scraped content is run through a processing pipeline that produces coded event data as a final output. Our current settings produce roughly 3,000 coded events per day. These coded events are in the standard who-did-what-to-whom format typically associated with event data. Each event is coded along multiple dimensions, specifically source and target actors and event type. These dimensions are described in greater detail in the project documentation. N/A http://phoenixdata.org/
Event Data PLOVER Phil Schrodt N/A PLOVER (Political Language Ontology for Verifiable Event Records) is a next generation political event coding specification under development by the Open Event Data Alliance (http://openeventdata.org/) which is intended to replace the earlier CAMEO system (http://eventdata.parusanalytics.com/data.dir/cameo.html). N/A https://github.com/openeventdata/PLOVER
NLP Trint Adam Trent, Peter J. Kofman, & Puneet Kukkal Trent, A., Kaufman, P. J., & Kukkal, P. (1999). U.S. Patent No. 5,961,620. Washington, DC: U.S. Patent and Trademark Office. Automated speech-to-text transcription for audio and video files. North American English, British English, Australian English, French, Spanish, German, Italian, Portuguese, Russian, Polish, Finnish, Hungarian, Dutch, Romanian, Swedish https://trint.com/
Mixed Method Data Analysis QDA Miner Provalis Research Lewis, R. B., & Maas, S. M. (2007). QDA Miner 2.0: Mixed-model qualitative data analysis software. Field Methods, 19(1), 87-108. QDA Miner is an easy-to-use qualitative data analysis software package for coding, annotating, retrieving and analyzing small and large collections of documents and images. QDA Miner qualitative data analysis tool may be used to analyze interview or focus group transcripts, legal documents, journal articles, speeches, even entire books, as well as drawings, photographs, paintings, and other types of visual documents. Its seamless integration with SimStat, a statistical data analysis tool, and WordStat, a quantitative content analysis and text mining module, gives you unprecedented flexibility for analyzing text and relating its content to structured information including numerical and categorical data. English, French, Spanish https://provalisresearch.com/products/qualitative-data-analysis-software/
Text as Data Tools WordStat Provalis Research Péladeau, N. (2003). WordStat: Content analysis module for SIMSTAT. Montréal: Provalis Research. WordStat is a flexible and easy-to-use text analysis software – whether you need text mining tools for fast extraction of themes and trends, or careful and precise measurement with state-of-the-art quantitative content analysis tools. WordStat’s seamless integration with SimStat – our statistical data analysis tool – QDA Miner – our qualitative data analysis software – and Stata – the comprehensive statistical software from StataCorp – gives you unprecedented flexibility for analyzing text and relating its content to structured information, including numerical and categorical data. English, French, Italian, German, and Spanish https://provalisresearch.com/products/content-analysis-software/
Statistical Analysis SimStat Provalis Research Péladeau, N. (1996). SIMSTAT for Windows. Provalis Research, Montreal. SimStat goes beyond mere statistical analysis. It offers output management features not found in any other program, its own scripting language to automate statistical analysis and to write small applications, interactive tutorials with multimedia capabilities, and computer-assisted interviewing systems. English https://provalisresearch.com/products/simstat/
Mixed Method Data Analysis ProSuite Provalis Research Provalis Research (2014). ProSuite, Montréal. http://provalisresearch.com/products/prosuite/ ProSuite is an integrated collection of Provalis Research text analytics tools that allow one to explore, analyze and relate both structured and unstructured data. Provalis Research Text Analytics Tools allows one to perform advanced computer assisted qualitative coding on documents and images using QDA Miner, to apply the powerful content analysis and text mining features of WordStat on textual data, and to perform advanced statistical analysis on numerical and categorical data using SimStat. N/A https://provalisresearch.com/products/prosuite-text-analytics-tools/
Text as Data Tools Seance Kristopher Kyle Crossley, S. A., Kyle, K., & McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): An automatic tool for sentiment, social cognition, and social order analysis. Behavior Research Methods 49(3), pp. 803-821. doi:10.3758/s13428-016-0743-z. SEANCE is an easy to use tool that includes 254 core indices and 20 component indices based on recent advances in sentiment analysis. In addition to the core indices, SEANCE allows for a number of customized indices including filtering for particular parts of speech and controlling for instances of negation. SEANCE takes plain text files as input (it will process all plain text files in a particular folder) and produces a comma separated values (.csv) spreadsheet that is easily read by any spreadsheet software. English http://www.kristopherkyle.com/seance.html
Text as Data Tools CLA Kristopher Kyle Kyle, K., Crossley, S.A., & Kim, Y. J. (2015). Native Language Identification and Writing Proficiency. International Journal of Learner Corpus Research 1(2), pp. 187-209. doi: 10.1075/ijlcr.1.2.01kyl. CLA is a simple but powerful text analysis tool. One can use CLA to analyze texts using very large custom dictionaries. In addition to words, custom dictionaries can include n-grams and wildcards. English http://www.kristopherkyle.com/cla.html
Text as Data Tools CRAT Kristopher Kyle Crossley, S. A., Kyle, K., Davenport, J., & McNamara, D. S. (2016). Automatic assessment of constructed response data in a chemistry tutor. In T. Barnes, M. Chi, & M. Feng (Eds.), Proceedings of the 9th International Educational Data Mining (EDM) Society Conference (pp. 336-340). CRAT is an easy to use tool that includes over 700 indices related to lexical sophistication, cohesion and source text/summary text overlap. CRAT is particularly well suited for the exploration of writing quality as it relates to summary writing. English http://www.kristopherkyle.com/crat.html
Text as Data Tools SiNLP Kristopher Kyle Crossley, S. A., Allen, L. K., Kyle, K., & McNamara, D.S. (2014). Analyzing discourse processing using a simple natural language processing tool (SiNLP). Discourse Processes, 51(5-6), pp. 511-534, DOI: 10.1080/0163853X.2014.910723 SiNLP is a simple tool that allows users to analyze texts with regard to the number of words, number of types, TTR, letters per word, number of paragraphs, number of sentences, and number of words per sentence for each text. In addition, users can analyze texts with regard to their own custom dictionaries. English http://www.kristopherkyle.com/sinlp.html
Text as Data Tools TAACO Kristopher Kyle Crossley, S. A., Kyle, K., & McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods 48(4), pp. 1227-1237. doi:10.3758/s13428-015-0651-7 TAACO is an easy to use tool that calculates 150 indices of both local and global cohesion, including a number of type-token ratio indices, adjacent overlap indices, and connectives indices. English http://www.kristopherkyle.com/taaco.html
Text as Data Tools TAALES Kristopher Kyle Kyle, K. & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly 49(4), pp. 757-786. doi: 10.1002/tesq.194; Kyle, K., Crossley, S. A., & Berger, C. (in press). The tool for the analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods. TAALES is a tool that measures over 400 classic and new indices of lexical sophistication, and includes indices related to a wide range of sub-constructs. Included are indices for both single words and n-grams. Starting with version 2.2, TAALES also provides comprehensive index diagnostics. English http://www.kristopherkyle.com/taales.html
Text as Data Tools TAASSC Kristopher Kyle Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication (Doctoral Dissertation). Retrieved from http://scholarworks.gsu.edu/alesl_diss/35.; Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474-496. TAASSC is an advanced syntactic analysis tool that measures fine-grained indices of clausal and phrasal complexity, classic indices of syntactic complexity, and frequency-based verb argument construction indices. English http://www.kristopherkyle.com/taassc.html
Audio OpenSmile Florian Eyben, Martin Wöllmer, & Björn Schuller Eyben, F., Wöllmer, M., & Schuller, B. (2010, October). Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia (pp. 1459-1462). ACM. We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/. https://github.com/naxingyu/opensmile
Text-as-Data resource Wordfish Sven-Oliver Proksch & Jonathan B. Slapin Proksch, S. O., & Slapin, J. B. (2008). WORDFISH: Scaling software for estimating political positions from texts. Version, 1, 323-344. Wordfish is a computer program written in the R statistical language to extract political positions from text documents. Word frequencies are used to place documents onto a single dimension. Wordfish is a scaling technique and does not need any anchoring documents to perform the analysis. Instead, it relies on a statistical model of word counts. The current implementation assumes a Poisson distribution of word frequencies. Positions are estimated using an expectation-maximization algorithm. Confidence intervals for estimated positions can be generated from a parametric bootstrap. The name Wordfish pays tribute to the French meaning of the word “poisson”. (The underlying model is sketched after the table.) N/A http://www.wordfish.org/
Text-as-Data resource Wordshoal Benjamin E. Lauderdale & Alexander Herzog Lauderdale, B. E., & Herzog, A. (2016). Measuring political positions from legislative speech. Political Analysis, 24(3), 374-394. Existing approaches to measuring political disagreement from text data perform poorly except when applied to narrowly selected texts discussing the same issues and written in the same style. We demonstrate the first viable approach for estimating legislator-specific scores from the entire speech corpus of a legislature, while also producing extensive information about the evolution of speech polarization and politically loaded language. In the Irish Dáil, we show that the dominant dimension of speech variation is government-opposition, with ministers more extreme on this dimension than backbenchers, and a second dimension distinguishing between the establishment and anti-establishment opposition parties. In the U.S. Senate, we estimate a dimension that has moderate within-party correlations with scales based on roll-call votes and campaign donation patterns; however, we observe greater overlap across parties in speech positions than roll-call positions and partisan polarization in speeches varies more clearly in response to major political events. N/A https://github.com/kbenoit/wordshoal
NLP Apache OpenNLP Apache Software Foundation Apache Software Foundation OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution. Afrikaans, Arabic, Asturian, Azerbaijani, Bashkir, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Catalan, Cebuano, Czech, Chechen, Mandarin Chinese, Welsh, Danish, German, Standard Estonian, Greek(Modern), English, Esperanto, Estonian, Basque, Faroese, Persian, Finnish, French, Western Frisian, Irish, Galician, Swiss German, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian, Javanese, Japanese, Kannada, Georgian, Kazakh, Kirghiz, Korean, Latin, Latvian, Limburgan, Lithuanian, Luxembourgish, Standard Latvian, Malayalam, Marathi, Minangkabau, Macedonian, Maltese, Mongolian, Maori, Malay, Min Nan Chinese, Low German, Nepali, Dutch, Norwegian Nynorsk, Norwegian Bokmål, Occitan, Panjabi, Iranian Persian, Plateau Malagasy, Western Panjabi, Polish, Portuguese, Pushto, Romanian, Russian, Sanskrit, Sinhala, Slovak, Slovenian, Somali, Spanish, Albanian, Serbian, Sundanese, Swahili, Swedish, Tamil, Tatar, Telugu, Tajik, Tagalog, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Volapük, Waray, Zulu https://opennlp.apache.org/
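Topic modeling (the R ‘lda’ package, Stanford TMT, Mallet, and Sprout entries above) follows the same basic workflow regardless of implementation: build a vocabulary, convert each document to bag-of-words counts, then fit the model. The sketch below is a minimal, hypothetical illustration using the Python gensim library rather than any of the packages listed above; the toy documents and parameter values (2 topics, 10 passes) are assumptions for demonstration only.

# Minimal LDA sketch using gensim (illustrative only; not the R 'lda' package
# or Mallet from the table). Toy corpus and parameter values are assumptions.
from gensim import corpora, models

docs = [
    "students discussed the chemistry homework in the chat",
    "the senator gave a speech about the budget and taxes",
    "the chat covered lab reports and chemistry grades",
    "parliament debated the budget speech for hours",
]
tokenized = [d.lower().split() for d in docs]

dictionary = corpora.Dictionary(tokenized)                     # token -> integer id
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized]  # bag-of-words counts

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, keywords in lda.print_topics(num_words=5):
    print(topic_id, keywords)                                  # top words per topic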
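For the WordNet entry above: the Princeton English WordNet, the original member of the Global WordNet family, can be queried from Python through NLTK's corpus reader. A minimal sketch, assuming nltk is installed and the 'wordnet' data has been fetched with nltk.download:

# Querying the Princeton English WordNet through NLTK's corpus reader.
# Assumes nltk.download('wordnet') has already been run.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank"):                  # all senses of "bank"
    print(synset.name(), "-", synset.definition())

dog = wn.synset("dog.n.01")                        # one specific sense
print([lemma.name() for lemma in dog.lemmas()])    # synonyms in that synset
print([h.name() for h in dog.hypernyms()])         # more general concepts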
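For the Snowball entry above: NLTK bundles stemmers generated from Porter's Snowball scripts, so the stemmers can be tried without installing the Snowball compiler itself. A minimal sketch; the example words are illustrative.

# Snowball stemming via NLTK's bundled SnowballStemmer classes.
from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)      # languages with a bundled stemmer

english = SnowballStemmer("english")
german = SnowballStemmer("german")
print(english.stem("generously"))     # strips suffixes, e.g. -> 'generous'
print(german.stem("Katzen"))          # German stemmer applied to a German word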
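For the NLTK entry above, the canonical first steps are sentence splitting, word tokenization, and part-of-speech tagging. A minimal sketch, assuming the tokenizer and tagger models are downloaded as shown:

# Minimal NLTK pipeline: tokenization and part-of-speech tagging.
import nltk

nltk.download("punkt")                          # one-time model downloads
nltk.download("averaged_perceptron_tagger")

text = "NLTK provides interfaces to over 50 corpora. It also tags and parses text."
for sentence in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))                 # e.g. [('NLTK', 'NNP'), ...]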
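For the spaCy entry above: a loaded pipeline is a single callable object that tokenizes, tags, parses, and recognizes named entities in one pass. A minimal sketch, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`:

# Minimal spaCy pipeline: tagging, dependency parsing, and named entities.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

for ent in doc.ents:
    print(ent.text, ent.label_)      # e.g. 'Apple' ORG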
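The Wordfish entry above describes its model in prose; written out (following Slapin & Proksch 2008), the count of word j in document i is modeled as

y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \, \omega_i

where \alpha_i is a document fixed effect, \psi_j a word fixed effect, \beta_j a word-specific weight capturing how strongly word j discriminates between positions, and \omega_i the estimated position of document i. The parameters are estimated with the expectation-maximization algorithm mentioned in the table, and confidence intervals for \omega_i come from a parametric bootstrap.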