In an earlier blog post I played with the idea of visualizing Knowledge Forum discourse in a 2D space, to support one specific perspective on community knowledge: the semantic similarity (or distance) among contributions. I used Latent Semantic Analysis (LSA) for similarity computation and multidimensional scaling (MDS) for dimensionality reduction. The results were interesting, with notes on the same topic generally clustered together. However, as one colleague pointed out in a comment, it is hard to tell which topics the different clusters represent, and it would be nice to be able to inspect the clusters and somehow label them.
At the CSCL 2013 conference around ten days ago I attended an interactive presentation by Norma Ming. In her work, she conducted probabilistic LSA (pLSA) on four weeks of online discussion by college students in a biology class and extracted 100 topics from the dataset. After pLSA, the semantic space of the online discussion could be represented as a 100-dimensional space, with each post represented by a 100-dimensional vector. To visualize all posts in 2D, locally linear embedding (LLE) was then applied, reducing the 100-dimensional space to 2D while keeping posts that were close neighbors in the high-dimensional space close in the 2D space as well. With posts visualized on a 2D canvas, Norma further mapped students’ final exam scores as well as the week of discussion onto the visualization. Interestingly, they found that students achieving higher scores tended to contribute more posts in “outer space”, while low-achieving students’ posts only covered the central area. With the course week information, they found quite clear boundaries among the discussions of different weeks, although over-plotting might be an issue, as you can see in the following figure from their paper. These results are pretty powerful, especially if we think not only retrospectively but also prospectively. For example, is it possible to guide students to “outer space” by progressively representing their discussion in this way to encourage “metadiscourse” (namely, discourse about discourse)?
Coming back from CSCL, I resumed the work of representing the semantic space of Knowledge Forum discussion for metadiscourse. Rather than using pLSA, I used latent Dirichlet allocation (LDA), a more sophisticated technique for topic modeling. The key idea is similar to Norma’s work: represent the discourse in a high-dimensional space and then reduce it to 2D for visualization. I developed a Shiny app as a proof of concept. It has three main functionalities:
I. Visualization of semantic space. The visualization is implemented with a Google (“non-moving Motion”) Chart, which allows users to interact with the visualization. Users can easily change the x and y axes, as well as what the color and size of the dots represent, to help interpret the results. This is a potentially powerful way to interact with data that have a rich set of variables. Further, users can zoom in when an area is cluttered with too many dots, which alleviates the problem of over-plotting.
II. Summary of extracted topics. The top words of each extracted topic are listed to help users interpret the results. Users can choose to show more (or fewer) top words while making sense of a topic. Moreover, the number of notes under each topic is also provided, to help identify popular topics. This opens the door to tracking the popularity of topics over the course of a discussion.
III. Detailed content of notes. The full content of each note is also presented, so that users can make sense of the results and judge whether the topic modeling results are accurate.
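To make the topic summary in (II) more concrete, here is a minimal sketch in Python (the app itself is written in Shiny, so this is not the app's code). It assumes a topic model has already produced two things: a weight for every word in every topic, and a vector of topic proportions for every note. The toy data, the function names (top_words, dominant_topic, summarize), and the choice to count a note under its single highest-proportion topic are all assumptions made for illustration; the app may count notes differently.

```python
from collections import Counter

# Hypothetical toy output of a topic model (not data from the original app).
# topic_word[k][w] = weight of word w in topic k.
topic_word = [
    {"light": 0.30, "color": 0.25, "prism": 0.20, "rainbow": 0.15, "water": 0.10},
    {"plant": 0.35, "soil": 0.25, "water": 0.20, "root": 0.15, "light": 0.05},
]

# note_topic[note_id] = topic proportions for that note (each row sums to 1).
note_topic = {
    "note1": [0.9, 0.1],
    "note2": [0.2, 0.8],
    "note3": [0.7, 0.3],
}


def top_words(word_weights, n):
    """Return the n highest-weighted words of one topic, highest first."""
    # Sort by descending weight; break ties alphabetically for stable output.
    ranked = sorted(word_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def dominant_topic(proportions):
    """Return the index of the topic with the largest proportion in a note."""
    return max(range(len(proportions)), key=lambda k: proportions[k])


def summarize(topic_word, note_topic, n_words=3):
    """Build one summary row per topic: its top words and its note count."""
    # Assumption: each note is counted under its dominant topic only.
    counts = Counter(dominant_topic(p) for p in note_topic.values())
    return [
        {"topic": k, "top_words": top_words(words, n_words), "n_notes": counts[k]}
        for k, words in enumerate(topic_word)
    ]


summary = summarize(topic_word, note_topic, n_words=3)
for row in summary:
    print(row)
```

Changing n_words corresponds to showing more or fewer top words for a topic, and the n_notes field is the kind of count that could be used to spot popular topics.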
This application is a first step in my exploration of visualizing students’ online discourse to facilitate metadiscourse. I plan to introduce more variables, such as authorship, timestamps, and various types of coding results, into this visualization in the future. Please let me know if you have ideas you would like to explore.