by Jarosław Jura & Kaja Kałużyńska
The increasing number of digital and digitized content sources (online versions of traditional media, news portals, websites on a myriad of topics, and, of course, social media) has started to influence empirical social research. Huge amounts of easily accessible and almost ready-to-analyze data seem to be a dream come true for social researchers, especially those who prefer to work with unobtrusively collected data.
Such large datasets call for mixed-methods analysis, to avoid wasting their potential by either choosing a sample or focusing on quantitatively obtained information only. This is where another set of tools that make the life of a contemporary researcher much more comfortable comes in: software solutions. Of course, in the ideal situation, one could just ‘feed’ all the data to AI and wait for the results, but there are many limitations to such an approach, like its usability in specific cases, its accessibility, and, of course, the researcher’s nightmare: a limited project budget. Moreover, in the case of smaller datasets consisting of heterogeneous data, the results of such analysis might prove unsatisfactory.
Our research project, an exploratory study on the image of China and the Chinese in Zambia and Angola, also included an analysis of textual media content, namely news articles published in these countries and mentioning China or the Chinese. We obtained a mid-sized dataset consisting of 2477 articles; the material was very heterogeneous, because of the wide scope of topics covered by the texts and the fact that we analysed content from both English- and Portuguese-language media.
In the course of the analysis, we realized that a new method would be needed to obtain the best possible results from the collected data. After a series of trial-and-error attempts, we developed MIHA, the Mixed Integrative Heuristic Approach. Applying this method allowed us to create an exhaustive, contextual and precise keyword dictionary for automated classification of text units, as well as a set of sentiment indexes.
We have to admit that even though we did our best to utilize all the possibilities of the software (Provalis QDA Miner and WordStat), creating the dictionary was time-consuming, since it involved reviewing every word that occurred 10 or more times in the whole database.
Our classification, similar to the initial conceptualization of theoretical categories within the grounded theory approach, aimed to explore the most frequent contexts in which China was depicted in African e-media. Each examined word was either added to an exclusion list (words irrelevant from the point of view of the research) or assigned to a chosen category, sometimes a newly created one, together with other words of the same root and all its synonyms.
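The dictionary-building steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, which relied on QDA Miner and WordStat; the function names, the toy documents, and the example categories are all hypothetical. Only the frequency threshold of 10 and the split between an exclusion list and categories come from the text.

```python
import re
from collections import Counter

MIN_FREQUENCY = 10  # threshold mentioned in the text; everything else is illustrative


def tokenize(text):
    """Lowercase the text and return runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def candidate_words(documents, min_frequency=MIN_FREQUENCY):
    """Return the words that occur at least `min_frequency` times in the corpus.

    These are the words a researcher would then review by hand.
    """
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {word: n for word, n in counts.items() if n >= min_frequency}


def classify_units(documents, dictionary, exclusions):
    """Count category hits per document using a word -> category dictionary."""
    results = []
    for doc in documents:
        hits = Counter()
        for token in tokenize(doc):
            if token in exclusions:
                continue  # word judged irrelevant during manual review
            category = dictionary.get(token)
            if category is not None:
                hits[category] += 1
        results.append(dict(hits))
    return results


# Toy corpus and hand-made dictionary (hypothetical, for illustration only).
documents = [
    "China invests in new road",
    "Chinese investment builds roads",
    "The road collapsed",
]
exclusions = {"the", "in", "new"}
dictionary = {
    "invests": "economy",
    "investment": "economy",
    "road": "infrastructure",
    "roads": "infrastructure",
    "collapsed": "negative",
}

# A lower threshold is used here only because the toy corpus is tiny.
review_list = candidate_words(documents, min_frequency=2)
coded_units = classify_units(documents, dictionary, exclusions)
```

In this sketch, word forms sharing a root ("road", "roads") are listed explicitly in the dictionary, mirroring the manual grouping described above rather than relying on automatic stemming.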
In the next step, we examined the already categorized keywords in their context to refine the categorization results, mainly by removing those keywords that appeared within the text in unexpected contexts. Most of the categories were re-coded, and some of the keywords were re-assigned in subsequent steps. This heuristic approach resulted in a set of categories, including ‘emotional’ ones, positive and negative, that were later used to design sentiment indexes. Our indexes are based on a comparison of the results of quantitative and qualitative analysis and coding. They could be used as a tool for improving dictionary-based sentiment analysis by comparing the results of sentiment analysis based on automated coding with manually coded samples.
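The idea of checking automated, dictionary-based sentiment against a manually coded sample can be sketched as follows. This is only an illustration under stated assumptions: the net sentiment score and the simple agreement rate below are generic measures chosen for the example, not the indexes defined in the paper, and all names and numbers are hypothetical.

```python
def net_sentiment(positive_hits, negative_hits):
    """Generic net sentiment score in [-1, 1]; an assumed measure, not the paper's index."""
    total = positive_hits + negative_hits
    if total == 0:
        return 0.0  # no emotional keywords found in the text unit
    return (positive_hits - negative_hits) / total


def label_from_score(score):
    """Turn a score into a coarse label comparable with manual codes."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def agreement_rate(automated_labels, manual_labels):
    """Share of text units where automated and manual coding agree."""
    if len(automated_labels) != len(manual_labels):
        raise ValueError("both codings must cover the same text units")
    if not manual_labels:
        raise ValueError("at least one coded unit is required")
    matches = sum(a == m for a, m in zip(automated_labels, manual_labels))
    return matches / len(manual_labels)


# Hypothetical (positive, negative) keyword counts for four text units.
keyword_hits = [(3, 1), (0, 2), (1, 1), (2, 0)]
automated = [label_from_score(net_sentiment(p, n)) for p, n in keyword_hits]

# Hypothetical manual codes for the same four units.
manual = ["positive", "negative", "negative", "positive"]

agreement = agreement_rate(automated, manual)
```

Units where the two codings disagree (the third unit in this toy sample) are the natural places to revisit the dictionary, for example by checking whether a keyword appears there in an unexpected context.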
We believe that MIHA constitutes a conceptual approach that researchers of various backgrounds can apply in projects investigating the general image presented in textual content, especially in the case of mid-sized, heterogeneous datasets. We do not overlook the fact that automated machine learning coding methods will soon constitute the main approach to text analysis. However, since such procedures are still imperfect and context-sensitive, we presume that MIHA, consisting of a contextualized dictionary, manual coding of chosen parts of the database and index measurements, could be useful for the analysis of datasets related to less common study areas (social groups, languages, geographical areas, subcultures, etc.), in which machine learning-based research would have low construct validity.
Both the dictionary-creation process and the indexes are described in detail in our paper.
Read the full article in the IJSRM here.