The ‘Peer Group Similarity Hypothesis’ Part I: The Basics

In the process of developing ‘Author Metrics’, a contract cheating detection feature, PlagScan experimentally discovered that the ‘Peer Group Similarity Hypothesis’ can be applied to detect ghostwriters.

From plagiarism detection to authorship verification

When I presented the system at the annual European Conference for Academic Integrity (ENAI) in Vilnius, Lithuania, last month, it aroused great interest. This is why I would like to share more details on it.

In comparison to the traditional plagiarism detection solution for safeguarding Academic Integrity, Author Metrics currently takes a bird's-eye view of a group of documents. When the input is a set of documents written by students in the same class who are working on the same assignment, our preliminary data suggests that the metrics we compute on the documents follow a Gaussian distribution. Most students cluster in the middle of the value range, typically within one standard deviation of the mean, with a few high- and low-scoring students at either end.

The Authorship Verification Problem: You have a set of 'known' documents by a given author, and an 'unknown' document of questioned authorship. The question is: Was the unknown document written by the same author as the set of 'known' documents?

We consider outliers at the low end to be low-performing students, and we do not flag them. High outliers are flagged for further inspection. High outlier performance across a series of measures can indicate either a prodigious student or a potential ghostwriter. Determining which of the two applies is the responsibility of the instructor. Our flag simply draws attention to student submissions that look unusual. We are fully aware that what we consider a high or low score on these metrics may be the result of culturally specific pedagogical assumptions, which we will address at a later date. For now, in order to mitigate these assumptions, we base our score on eight metrics in total and not just on a single one.
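To make the flagging idea more concrete, here is a minimal sketch in Python. It is not the Author Metrics implementation: the function name, the z-score threshold of 2.0, and the rule "high on at least two metrics" are assumptions chosen purely for illustration, and the metrics themselves are treated as opaque numbers.

```python
from statistics import mean, stdev


def flag_high_outliers(scores, z_threshold=2.0, min_metrics=2):
    """Return the students who are HIGH outliers on several metrics.

    scores      -- dict mapping a student id to a list of metric values;
                   every list has the same length and the same metric order
    z_threshold -- assumed cut-off: how many standard deviations above the
                   class mean counts as "high"
    min_metrics -- assumed cut-off: on how many metrics a student must be
                   high before the submission is flagged
    """
    students = list(scores)
    n_metrics = len(next(iter(scores.values())))
    high_counts = {student: 0 for student in students}

    for m in range(n_metrics):
        column = [scores[student][m] for student in students]
        mu, sigma = mean(column), stdev(column)
        if sigma == 0:
            continue  # everyone scored the same, so nobody stands out
        for student in students:
            z = (scores[student][m] - mu) / sigma
            # Only high outliers count; low scores are deliberately ignored.
            if z > z_threshold:
                high_counts[student] += 1

    return sorted(s for s in students if high_counts[s] >= min_metrics)
```

Requiring a student to stand out on several metrics at once mirrors the point above about not relying on a single measure. The result is only a list of submissions for the instructor to look at, not a verdict.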

Comparing assignments of one single author

In the long run, the plan for Author Metrics is to perform the validation by comparing one assignment with other assignments the student has previously uploaded. Participants of the yearly PAN competitions have already researched this problem heavily, so the task of Authorship Verification is somewhat old news. However, our method is more robust: it is not as prone to fluctuations caused by the different backgrounds of each student, nor is it as sensitive to genre or topical interference. Nonetheless, for this to work, the baseline data for each student has to be established first. This is why PlagScan is currently doing cross-classroom comparisons to collect data and cautiously flag documents.

The concept of “speech communities”

To situate our work in what has been done before, we turn to two fields. The first is dialectology, which takes quantitative approaches to identify ‘prototypical’ speakers and to map geographically the prevalence of linguistic variants and the language features that form dialect groups. The second is linguistic anthropology, a field concerned with the language variation that exists within communities and the social meanings that are constructed through different forms of communication. Both of these exciting fields share the research concept of “speech communities”, in which groups of speakers who regularly interact share patterns of language use that identify them as members of their respective communities.

Detecting ghostwritten works within groups of students

For our preliminary data, PlagScan introduced ghostwritten works into a set of documents written by groups of students in the same classroom who were all writing on the same topic. Using this dataset, we were able to identify the ghostwriter because they scored as a significantly different high outlier on almost all measures currently implemented in Author Metrics. Pedagogical research into linguistically diverse education, alternatively phrased as the multicultural classroom, may add important nuance to our understanding of what constitutes speech communities and peer groups, as we must recognize the reality of increasingly globalized and heterogeneous classrooms.

For more on this, stay tuned for the second part on the Peer Group Similarity Hypothesis!

In the meantime, we’re happy to hear your view on our hypothesis! Either comment below or reach out to us through

About Verena Gehrmann

Verena is our fierce defender of Academic Integrity. She pointed out that everybody should have the same chances to develop themselves, regardless of their backgrounds and that education plays an important role in equality. The open-minded native German branched out to another continent to explore a new culture when she studied in Tübingen, Beijing, and Taipei. She stayed in Taipei for four more years, developing marketing and branding strategies for different IT markets around the globe.
