FRAPT: An Unsupervised Approach for Discovering Relevant Tutorial Fragments for APIs
Developers increasingly rely on API tutorials to facilitate software development. However, it remains challenging for them to discover relevant API tutorial fragments that explain unfamiliar APIs. Existing supervised approaches suffer from the heavy burden of manually preparing corpus-specific annotated data and features. In this study, we propose a novel unsupervised approach, namely Fragments Recommender for APIs with PageRank and Topic model (FRAPT). FRAPT addresses the two main challenges of this task and effectively determines relevant tutorial fragments for APIs. In FRAPT, a Fragment Parser identifies APIs in tutorial fragments and replaces ambiguous pronouns and variables with related ontologies and API names, so as to address the pronoun and variable resolution challenge. Then, a Fragment Filter employs a set of non-explanatory detection rules to remove non-explanatory fragments, thus addressing the non-explanatory fragment identification challenge. Finally, two correlation scores are computed and aggregated to determine relevant fragments for APIs, by applying both a topic model and the PageRank algorithm to the retained fragments. Extensive experiments over two publicly available tutorial corpora show that FRAPT improves on the state-of-the-art approach by 8.77% and 12.32%, respectively, in terms of F-Measure. The effectiveness of the key components of FRAPT is also validated.
(2) Tutorial Corpora: Two publicly available tutorial corpora were constructed by other researchers: the McGill corpus, which consists of five tutorials, and the Android corpus, which is composed of four tutorials related to Android APIs. The fragments in these corpora have been manually annotated as relevant or irrelevant to the APIs they contain. They are therefore used as the ground truth for comparing different approaches.
The two tutorial corpora can be downloaded here: McGill Corpus and Android Corpus.
3.1 Coreference Resolution: Reconcile, an automatic coreference resolution tool, is leveraged to perform pronoun resolution. The .jar file is here: http://www.cs.utah.edu/nlp/reconcile/, and the command line is: $ java -jar reconcile-1.0.jar file1 file2 ...
3.2 PageRank Algorithm: The PageRank algorithm estimates the importance of each sentence through link analysis. It takes a set of linked entities as input and outputs a numerical weight for each entity that estimates its importance. The source code is here: PageRank.
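To make the link analysis concrete, the following is a minimal sketch of PageRank computed by power iteration. It is not the source code linked above. The class name `PageRankSketch`, the damping factor of 0.85, the iteration count, and the four-node example graph are all assumptions chosen for illustration. How FRAPT builds links between sentences is not shown here.

```java
import java.util.Arrays;

/** Illustrative PageRank by power iteration; a sketch, not FRAPT's implementation. */
final class PageRankSketch {

    /**
     * @param links   links[i] lists the nodes that node i points to
     * @param damping damping factor (0.85 is a common choice, assumed here)
     * @param iters   number of power-iteration steps
     * @return one weight per node; the weights sum to 1
     */
    static double[] pageRank(int[][] links, double damping, int iters) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // weight held by nodes without outgoing links

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                } else {
                    // a node splits its weight evenly among the nodes it links to
                    double share = rank[i] / links[i].length;
                    for (int j : links[i]) {
                        next[j] += share;
                    }
                }
            }
            for (int j = 0; j < n; j++) {
                // dangling weight is spread uniformly so the total stays 1
                next[j] = (1.0 - damping) / n + damping * (next[j] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical graph: node 0 -> 1, 2; node 1 -> 2; node 2 -> 0; node 3 -> 2
        int[][] links = { {1, 2}, {2}, {0}, {2} };
        double[] scores = pageRank(links, 0.85, 100);
        for (int i = 0; i < scores.length; i++) {
            System.out.printf("node %d: %.4f%n", i, scores[i]);
        }
    }
}
```

In this example, node 2 receives links from every other node and ends up with the largest weight, while node 3, which nothing links to, keeps only the baseline weight.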
3.3 Topic Model: Topic models are a popular way to analyze large document collections and can be used to find semantic relationships between documents and terms. A topic model extracts several topics from a document collection by mining co-occurring terms. In this study, one type of topic model, Latent Dirichlet Allocation (LDA), is leveraged, and the Stanford Topic Modeling Toolbox is used to perform LDA.
(4) The FRAPT tool: The Java Archive file is here: FRAPT. The dependency packages of FRAPT can be requested from Jingxuan Zhang by e-mail at jxzhang@nuaa.edu.cn.
If you find these materials useful, please cite our paper. Thank you!
If you have any questions about our paper, please contact Jingxuan Zhang by e-mail: jxzhang@nuaa.edu.cn.
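As a rough illustration of how LDA derives topics from co-occurring terms, the sketch below implements a very small collapsed Gibbs sampler. It does not use the Stanford Topic Modeling Toolbox and does not reflect its API. The class name `LdaSketch`, the hyperparameters `alpha` and `beta`, the sweep count, the seed, and the toy documents are all assumptions made for this example.

```java
import java.util.Random;

/** Tiny collapsed Gibbs sampler for LDA; illustrative only, not the toolbox used by FRAPT. */
final class LdaSketch {

    /**
     * @param docs  docs[d][i] is the vocabulary id of the i-th token of document d
     * @param vocab vocabulary size
     * @param k     number of topics
     * @param alpha symmetric prior on document-topic proportions
     * @param beta  symmetric prior on topic-word distributions
     * @return theta[d][t], the estimated share of topic t in document d (rows sum to 1)
     */
    static double[][] fit(int[][] docs, int vocab, int k,
                          double alpha, double beta, int sweeps, long seed) {
        Random rnd = new Random(seed);
        int[][] docTopic = new int[docs.length][k]; // tokens of doc d assigned to topic t
        int[][] topicWord = new int[k][vocab];      // times word w is assigned to topic t
        int[] topicTotal = new int[k];              // tokens assigned to topic t overall
        int[][] z = new int[docs.length][];         // current topic of every token

        // Random initial topic assignment.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int t = rnd.nextInt(k);
                z[d][i] = t;
                docTopic[d][t]++;
                topicWord[t][docs[d][i]]++;
                topicTotal[t]++;
            }
        }

        double[] p = new double[k];
        for (int s = 0; s < sweeps; s++) {
            for (int d = 0; d < docs.length; d++) {
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i];
                    int old = z[d][i];
                    // Remove the token from the counts, then resample its topic.
                    docTopic[d][old]--;
                    topicWord[old][w]--;
                    topicTotal[old]--;

                    double total = 0.0;
                    for (int t = 0; t < k; t++) {
                        p[t] = (docTopic[d][t] + alpha) * (topicWord[t][w] + beta)
                                / (topicTotal[t] + vocab * beta);
                        total += p[t];
                    }
                    // Draw a topic with probability proportional to p[t].
                    double u = rnd.nextDouble() * total;
                    int t = 0;
                    while (t < k - 1 && u >= p[t]) {
                        u -= p[t];
                        t++;
                    }
                    z[d][i] = t;
                    docTopic[d][t]++;
                    topicWord[t][w]++;
                    topicTotal[t]++;
                }
            }
        }

        // Smoothed document-topic proportions from the final assignment.
        double[][] theta = new double[docs.length][k];
        for (int d = 0; d < docs.length; d++) {
            for (int t = 0; t < k; t++) {
                theta[d][t] = (docTopic[d][t] + alpha) / (docs[d].length + k * alpha);
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: word ids 0 and 1 co-occur, as do word ids 2 and 3.
        int[][] docs = { {0, 1, 0, 1, 0}, {2, 3, 2, 3, 3}, {0, 1, 1, 0} };
        double[][] theta = fit(docs, 4, 2, 0.1, 0.01, 200, 42L);
        for (double[] row : theta) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Each row of the result can be read as a document's mixture over topics. Because the sampler is randomized, the topic indices and exact proportions depend on the seed.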