Hybrid plagiarism detection methods

Summary:

Hybrid plagiarism detection methods integrate lexical, syntactic, semantic, and structural analyses to identify various plagiarism forms.
Semantic analysis using machine and deep learning models effectively detects paraphrasing and cross-language plagiarism.
Structural features, such as citation patterns and document formatting, help detect concealed academic plagiarism.
Hybrid methods consistently outperform single-feature approaches but face challenges in scalability, interpretability, and evolving plagiarism tactics.

Plagiarism detection has evolved into a multifaceted research area at the intersection of text analysis, information retrieval, and natural language processing. The challenge stems from the diverse ways in which plagiarists disguise copied content – ranging from verbatim copy-paste to subtle paraphrasing, structural reordering of text, cross-language translation, or even idea plagiarism that copies underlying concepts without obvious textual overlap.

Traditional plagiarism detection methods (such as simple string matching or n-gram overlap) are effective for catching exact text reuse, but they often fail to detect more sophisticated obfuscation like paraphrasing or summarisation. As a result, recent research has increasingly focused on hybrid plagiarism detection methods, which integrate multiple analytical approaches to improve coverage and accuracy (Sajid et al., 2025). These hybrid systems combine lexical, syntactic, semantic, and structural analyses with advanced machine learning (ML) or deep learning (DL) techniques to capture different facets of similarity between documents (Sahi and Gupta 2017; Ahuja et al., 2020). By leveraging a variety of features and algorithms, hybrid methods aim to overcome the limitations of any single approach and more reliably identify plagiarised content even when it is heavily disguised.

In an academic context, ensuring originality is crucial, and plagiarism detection tools must keep pace with increasingly complex plagiarism strategies. Hybrid methods have emerged as a promising direction because they balance the speed of simple text-matching techniques with the robustness of deeper semantic and structural analysis (Franco-Salvador et al., 2016). Moreover, the integration of ML/DL allows these systems to learn subtle patterns of plagiarism from data, improving their ability to flag rewritten or obfuscated passages that would evade naïve detectors. This article provides a detailed overview of hybrid plagiarism detection techniques, covering lexical features, syntactic and structural analysis, semantic methods, machine-learning-based integration, examples of hybrid systems, their evaluation, and future challenges.

Lexical Features in Plagiarism Detection

Lexical features refer to the surface-level characteristics of text, such as exact words and character sequences. Early plagiarism detection systems relied heavily on lexical similarity measures because verbatim copying is the most straightforward form of plagiarism. Common lexical approaches include string matching algorithms, word frequency comparisons, and n-gram overlap. For example, measuring the longest common subsequence or shared n-grams between documents can efficiently catch copy-paste plagiarism or lightly modified text (Stamatatos 2011). Lexical fingerprinting techniques represent documents by a set of substrings or hashed n-grams; detecting overlap between these fingerprints flags potentially plagiarised sections. These methods are computationally efficient and have been successfully used in large-scale systems and commercial tools to quickly identify obvious textual reuse.
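
As a rough illustration of the fingerprinting idea above, the following Python sketch hashes word trigrams and compares two fingerprints with Jaccard overlap. The function names, the trigram size, and the eight-character digest length are illustrative assumptions rather than details of any cited system.

```python
import hashlib
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def fingerprint(text, n=3):
    """Hash each n-gram to a short digest; the set of digests is the fingerprint.

    The 8-character truncation is an arbitrary choice made for this sketch.
    """
    return {hashlib.sha256(g.encode("utf-8")).hexdigest()[:8]
            for g in word_ngrams(text, n)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


source = "Hybrid methods combine lexical and semantic analysis to detect plagiarism."
copied = "Hybrid methods combine lexical and semantic analysis to find copied text."
reworded = "Blending surface and meaning-based checks helps uncover reused writing."

print(jaccard(fingerprint(source), fingerprint(copied)))    # high overlap
print(jaccard(fingerprint(source), fingerprint(reworded)))  # no shared trigrams
```

The lightly edited copy shares six of the eleven distinct trigrams with the source, whereas the reworded sentence shares none, which is exactly the weakness of purely lexical matching.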

However, purely lexical methods struggle when the plagiarist performs paraphrasing or synonym substitution, since significant wording changes can reduce direct text overlap even if the underlying content is stolen. In such cases, a detector focusing only on lexical similarity may miss the plagiarism (Sánchez-Vega et al., 2019). This limitation has motivated the incorporation of deeper linguistic features, as discussed below.

Despite their weaknesses with obfuscated plagiarism, lexical features remain an important component of hybrid systems. They often serve as a first-pass filter to narrow down candidate document pairs due to their speed and simplicity (Potthast et al., 2014). Moreover, lexical similarity scores (e.g., percentage of shared words or cosine similarity of TF–IDF vectors) can be used as input features for machine learning classifiers that decide if a document pair is plagiarised. In a hybrid framework, lexical metrics provide a baseline similarity signal that can be combined with syntactic and semantic signals. For instance, Ahuja et al. (2020) describe a hybrid plagiarism detection technique where initial lexical matching identifies text fragments with potential overlap, which are then subjected to more advanced analysis. Thus, lexical features act as a fundamental building block – capturing low-level text overlap – which, when integrated with other features, enhances overall detection performance.
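
The first-pass filtering role described above might look something like the sketch below, which ranks source documents by TF–IDF cosine similarity and keeps only those above a cut-off. The smoothed IDF formula, the 0.3 threshold, and all function names are assumptions chosen for illustration; production systems typically rely on an inverted index instead of comparing against every source.

```python
import math
import re
from collections import Counter


def tokens_of(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF for a list of texts."""
    tokenized = [tokens_of(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_sources(suspicious, sources, threshold=0.3):
    """Return (index, score) for sources whose cosine score meets the threshold.

    The threshold value is a hypothetical setting, not taken from the literature.
    """
    vectors = tfidf_vectors([suspicious] + sources)
    query, rest = vectors[0], vectors[1:]
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(rest)]
    return [(i, s) for i, s in scored if s >= threshold]


sources = ["the cat sat on the mat", "stock markets fell sharply today"]
print(candidate_sources("the cat sat on a mat", sources))
```

Only the pairs returned by `candidate_sources` would then be passed on to the slower syntactic or semantic stages.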

Syntactic Features and Structural Analysis

Syntactic features capture the grammatical structure of sentences, focusing on how words are arranged rather than their literal form. Two texts may use different words but share a similar syntax or sentence structure if one was derived from the other. By analysing patterns such as part-of-speech (POS) tag sequences, parse trees, or dependency relations, plagiarism detection systems can identify paraphrased content that preserves the original sentence structure. For example, a plagiarist might replace words with synonyms and alter some phrasing but leave the underlying grammatical framework intact. Syntactic similarity measures will still detect alignment in the sequence of grammatical roles or in the tree structure of sentences (Vani and Gupta 2017). Researchers have shown that incorporating syntactic features helps catch cases of plagiarism where simple word overlap is low because the sentence has been rewritten with different vocabulary while keeping its skeleton (Ahuja et al., 2020). Some approaches use syntax-sensitive fingerprints – e.g. patterns of POS tags – to represent texts; matching these can reveal plagiarism even if literal wording differs significantly.
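
To make the POS-sequence idea concrete, the sketch below compares two sentences by their tag sequences alone, using Python's standard `difflib.SequenceMatcher`. The tags are written by hand for the example; a real pipeline would obtain them from a POS tagger, and the coarse tag set shown here is an assumption.

```python
from difflib import SequenceMatcher


def pos_sequence_similarity(tags_a, tags_b):
    """Ratio in [0, 1] of matching POS-tag runs between two tag sequences."""
    return SequenceMatcher(None, tags_a, tags_b, autojunk=False).ratio()


# Hand-written tags (assumed output of some upstream tagger).
# "The student copied the essay quickly."
original = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADV"]
# "A pupil duplicated this paper rapidly."  (every word changed, same skeleton)
paraphrase = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADV"]
# "Quickly, copying happened."  (different structure)
restructured = ["ADV", "NOUN", "VERB"]

print(pos_sequence_similarity(original, paraphrase))
print(pos_sequence_similarity(original, restructured))
```

The synonym-substituted sentence has no words in common with the original yet yields an identical tag sequence, so a syntactic score remains high where a lexical score would drop to zero.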

Beyond individual sentences, structural features refer to higher-level organisation and non-textual elements of documents. In academic documents, this includes the arrangement of sections, the presence of specific formatting or formulas, and crucially, the pattern of citations and references. Plagiarised academic writing often betrays itself through similar citation structures or unusual overlaps in how sources are referenced (Gipp et al., 2014). Hybrid plagiarism detectors for research papers, such as HyPlag, combine textual analysis with citation-based analysis: they compare sequences of citations and bibliographic coupling between documents to discover reused ideas and text that standard text matching might overlook.

For instance, if two documents share a long sequence of the same citations in the same order, it strongly suggests one has copied the other’s literature review or background section (Meuschke et al., 2018). By integrating citation pattern matching alongside textual similarity, hybrid systems can catch sophisticated academic plagiarism that mixes copied text with paraphrased segments around the same references. Similarly, structural features in source code plagiarism (important to software developers and computer science educators) include the program’s structural logic – e.g., program dependency graphs or code flow structures. A code plagiarism detector might parse programs into abstract syntax trees or graphs and compare these for similarity, which can reveal copied code even when variable names are changed or code is reordered (Liu et al., 2015).
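
The shared-citation-order signal described above can be approximated with a longest common subsequence over citation keys, as in the following sketch. This is a simplified illustration and not the algorithm of HyPlag or any other cited system; the citation keys are invented.

```python
def longest_common_citation_sequence(cites_a, cites_b):
    """Return one longest common subsequence of two citation-key lists."""
    rows, cols = len(cites_a), len(cites_b)
    # table[i][j] holds the LCS length of cites_a[i:] and cites_b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if cites_a[i] == cites_b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    i = j = 0
    shared = []
    while i < rows and j < cols:
        if cites_a[i] == cites_b[j]:
            shared.append(cites_a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return shared


# Invented citation keys in order of first appearance in each document.
doc_a = ["Smith2010", "Lee2012", "Chen2015", "Garcia2018", "Kim2019"]
doc_b = ["Lee2012", "Chen2015", "Patel2016", "Garcia2018", "Kim2019"]

shared = longest_common_citation_sequence(doc_a, doc_b)
print(shared, len(shared) / min(len(doc_a), len(doc_b)))
```

A long shared sequence relative to the length of the shorter citation list would be one structural feature to weigh alongside textual similarity; on its own it is only circumstantial, since papers in the same subfield often cite the same works.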

In summary, syntactic and structural analyses add an essential layer to plagiarism detection: they enable the system to go beyond surface wording and consider how ideas are expressed and organised. Modern hybrid detectors explicitly integrate these aspects, using syntactic similarity scores or structural alignment features in combination with lexical and semantic information to improve overall detection coverage (Sahi and Gupta 2017).

Semantic Analysis for Detecting Paraphrasing and Idea Plagiarism

Semantic features aim to capture the meaning of text, rather than its form. This is critical for detecting disguised plagiarism where the plagiarised text is paraphrased or translated, preserving the original ideas but altering the wording. Traditional semantic approaches relied on thesauruses or knowledge bases – for example, using WordNet to identify synonyms and compute semantic similarity between words or phrases (Alzahrani et al., 2012). Early systems extended lexical overlap by counting not just exact word matches, but also matches of words with similar meanings. However, these knowledge-based methods had limited success for complex paraphrasing and often could not handle idiomatic rephrasings or multi-word substitutions.

Recent advances in semantic text representation have dramatically improved plagiarism detectors’ capabilities. Word embedding models like Word2Vec and GloVe (Pennington et al., 2014) and contextual embeddings from transformers (e.g. BERT) enable a numerical representation of text where semantic similarity correlates with vector similarity. In a hybrid system, sentences or passages can be converted into embedding vectors, and a high cosine similarity between vectors from two documents may indicate paraphrased plagiarism even if few words overlap. State-of-the-art hybrid methods often use deep learning models to capture semantics. For example, recent work has shown that fine-tuning large language models (such as BERT) for plagiarism detection allows the system to recognise subtly rewritten content with high accuracy. Similarly, transformers and sentence encoders (Reimers and Gurevych 2019) have been employed to generate embeddings for entire sentences or paragraphs; these can detect when one sentence is a reworded version of another by measuring distance in semantic space.
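
The embedding comparison described above reduces to a cosine similarity between vectors. The sketch below uses a tiny hand-made vocabulary of three-dimensional vectors, invented purely for illustration, and averages word vectors to represent a sentence; real systems would use learned embeddings such as those named above, with hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
TOY_VECTORS = {
    "car": (0.9, 0.1, 0.0), "automobile": (0.85, 0.15, 0.05),
    "fast": (0.1, 0.9, 0.0), "quick": (0.15, 0.85, 0.05),
    "banana": (0.0, 0.1, 0.9), "yellow": (0.05, 0.05, 0.9),
}


def sentence_vector(words):
    """Mean of the known word vectors; None if no word is in the vocabulary."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        return None
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


original = sentence_vector(["fast", "car"])
paraphrase = sentence_vector(["quick", "automobile"])   # no shared words
unrelated = sentence_vector(["yellow", "banana"])

print(cosine(original, paraphrase))  # close to 1
print(cosine(original, unrelated))   # close to 0
```

Averaging word vectors ignores word order and context, which is one reason contextual sentence encoders are generally preferred for this task.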

Another line of research by Franco-Salvador et al. (2016) integrates knowledge graphs with embedding models to tackle cross-language plagiarism – a particularly challenging scenario where content is translated to a different language. Their hybrid approach, combining structured semantic knowledge with continuous vector representations, achieved notable success in identifying plagiarised passages between languages like Spanish and English. Moreover, semantic analysis has been extended to detect idea plagiarism, which involves stealing the underlying argument or solution approach. Vani and Gupta (2017) proposed a method to detect idea plagiarism by extracting syntax–semantic concepts from text and optimising their matching using a genetic algorithm. By understanding the text’s meaning and the roles of its components (for instance, via semantic role labeling or concept extraction), such methods can flag cases where a passage conveys the same idea as another source, even if written in entirely different words.

Therefore, semantic features are indispensable in modern plagiarism detection: they fill the gap left by lexical methods, ensuring that content reuse through paraphrasing, summarising, or translating does not go unnoticed. When combined with lexical and syntactic features in a hybrid framework, semantic analysis greatly boosts the system’s ability to detect complex plagiarism (Ahuja et al., 2020).

Machine Learning Integration of Multi-Feature Analysis

A hallmark of hybrid plagiarism detection is the use of machine learning or deep learning models to integrate various features and make final decisions. Rather than using ad-hoc rules to combine similarity scores, many recent systems employ supervised learning: they treat plagiarism identification as a classification problem (plagiarised vs. non-plagiarised) and train a model on known examples. In these models, lexical, syntactic, semantic, and structural indicators can serve as input features. For instance, a classifier might take as input the percentage of common words (lexical), the similarity of POS tag sequences (syntactic), an embedding-based cosine similarity (semantic), and perhaps a citation overlap score (structural).
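
The four-feature input described above can be sketched as a small supervised classifier. To stay dependency-free, the example below trains a logistic regression by gradient descent as a stand-in; it is not an SVM, a decision tree, or any cited system, and the feature values and labels are invented toy data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_feats = len(features[0])
    weights, bias = [0.0] * n_feats, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feats, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def predict_proba(weights, bias, x):
    """Estimated probability that the document pair is plagiarised."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Each row: [shared-word ratio, POS-sequence similarity,
#            embedding cosine, citation overlap]
# Values and labels are invented toy data (1 = plagiarised, 0 = not).
X = [
    [0.90, 0.95, 0.97, 0.80],  # near-verbatim copy
    [0.20, 0.85, 0.90, 0.70],  # paraphrase: low lexical, high everything else
    [0.15, 0.40, 0.85, 0.90],  # restructured paraphrase around same references
    [0.25, 0.30, 0.20, 0.05],  # unrelated pair
    [0.10, 0.35, 0.15, 0.00],  # unrelated pair
    [0.30, 0.45, 0.30, 0.10],  # same topic, independent writing
]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X, y)
print(predict_proba(weights, bias, [0.18, 0.80, 0.88, 0.75]))  # paraphrase-like
print(predict_proba(weights, bias, [0.20, 0.30, 0.20, 0.05]))  # unrelated-like
```

Because the model learns the weights from labelled pairs, a low shared-word ratio does not by itself rule out a positive decision when the other signals are strong.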

The machine learning algorithm – be it a Support Vector Machine (SVM), decision tree, or neural network – learns how to weight and combine these disparate signals to accurately predict plagiarism. El-Rashidy et al. (2024) illustrate this approach in a two-path plagiarism detection system: they compute lexical, syntactic, and semantic features for each pair of sentences and feed these into an SVM classifier that determines whether the pair is plagiarised. Their system then applies post-processing rules to merge small detections into larger plagiarised segments. By learning from labeled training data, the SVM can capture interactions between features and detect plagiarism cases where, say, moderate semantic similarity and high syntactic similarity together indicate copying, even if lexical overlap alone is low (El-Rashidy et al., 2024).
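
The post-processing step of merging small detections into larger segments can be pictured as interval merging over character offsets. The sketch below is a generic illustration under that assumption; the actual rules used by El-Rashidy et al. are not described here, and the `max_gap` parameter is hypothetical.

```python
def merge_detections(spans, max_gap=50):
    """Merge (start, end) character spans that overlap or lie within max_gap.

    max_gap is a hypothetical tolerance, in characters, for bridging short
    unflagged stretches between neighbouring detections.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous segment instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


sentence_hits = [(0, 100), (120, 200), (600, 700), (180, 260)]
print(merge_detections(sentence_hits))  # [(0, 260), (600, 700)]
```

Merging in this way reduces the number of fragmented detections reported for a single plagiarised passage.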

Beyond traditional ML, deep learning models have opened new possibilities for hybrid detection. One approach is to design neural networks that directly process text and implicitly learn a combination of lexical and semantic patterns. For example, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) such as LSTMs have been applied to plagiarism detection by training them on pairs of text segments labeled as original or plagiarised. Van Son et al. (2021) developed a two-phase plagiarism detection system based on multi-layer LSTM networks, in which an initial neural model identifies potentially plagiarised passages and a second-stage model refines and verifies the matches. Such models can learn internal representations that capture semantic alignments and even some syntactic structure automatically.

However, a purely end-to-end neural approach requires large training corpora and can be computationally expensive. A hybrid strategy often proves more practical: combining neural components with hand-crafted features or rules. For instance, Moravvej et al. (2023) integrate transformer-based embeddings with attention mechanisms to compute semantic similarity between texts, while also employing traditional lexical matching techniques to capture exact matches and common phrases, thus reducing false positives. Deep learning can also be applied through contrastive learning, in which a model is trained to distinguish between genuine writing and plagiarised variations of the same source. This approach enhances sensitivity to the subtle differences introduced by paraphrasing, improving overall plagiarism detection performance.

Importantly, machine-learning-based integration allows a hybrid system to be adaptive: as new forms of plagiarism emerge (for example, plagiarism assisted by automatic text rewriters or AI language models), the system can be retrained or fine-tuned on examples of these, thereby learning new feature patterns that signal plagiarism. This adaptability is a major advantage of ML-driven hybrid methods over static rule-based systems.

Examples of Hybrid Plagiarism Detection Systems

To concretely illustrate hybrid methods, it is useful to consider some representative systems from the literature that explicitly combine multiple features and techniques. One example is the hybrid architecture by Glinos (2014) developed for the PAN plagiarism detection competition. This system employed two parallel components: one dedicated to detecting order-preserving plagiarism (e.g., copy-paste or lightly paraphrased text that keeps the original sequence of ideas) and another targeted at non-order-based plagiarism (such as mosaic plagiarism where sentences are reordered or summarised). The order-preserving module used text alignment algorithms robust to minor lexical changes, while the non-order-based module used a clustering approach to match a rearranged collection of ideas. By integrating their outputs, the hybrid architecture achieved high recall and precision on benchmark tests, demonstrating that addressing different plagiarism types with specialised techniques can greatly improve overall detection (Glinos 2014).

Another notable system is HyPlag, a hybrid plagiarism detector for academic documents (Meuschke et al., 2023). HyPlag combines text-based plagiarism detection with analyses of non-textual elements: it checks for similarity in images, mathematical expressions, and especially citation patterns between documents. In one case study, HyPlag revealed a suspected plagiarised article by showing that the sequence of citations – and even the content of figures – in the article closely matched those in previously published papers by other authors, even though the textual content had been paraphrased. This multi-modal approach is truly hybrid: it merges natural language processing for text, image processing for figures, and graph analysis for citation networks in a unified framework.

Similarly, for source code plagiarism, hybrid systems integrate textual and structural code analysis. For example, a system by Al-Khanjari et al. (2015) combined simple token matching with abstract syntax tree matching: first, it performed fast token-based filtering to find candidate similar code fragments; then it used a tree edit distance algorithm on the code’s parse trees to confirm plagiarism even if the code was reformatted or reordered. The hybridisation ensured efficiency (through the lexical token filter) and accuracy (through the deeper structural comparison) – a common theme in hybrid plagiarism detection.
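
A two-step token-then-tree comparison of this kind might be sketched for Python source code as follows, using the standard `tokenize` and `ast` modules. Note the substitution: instead of a tree edit distance, the second step compares flattened sequences of AST node types with `difflib`, which is considerably cruder. The threshold and function names are assumptions, and this is not the algorithm of the cited system.

```python
import ast
import io
import keyword
import tokenize
from difflib import SequenceMatcher

# Layout-only tokens that carry no information about program logic.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(code):
    """Token stream with identifiers and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)
    return out


def ast_node_types(code):
    """Node-type names of the parsed program in breadth-first order."""
    return [type(node).__name__ for node in ast.walk(ast.parse(code))]


def similarity(seq_a, seq_b):
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def compare_code(code_a, code_b, token_threshold=0.6):
    """Cheap token filter first; run the AST comparison only on candidates."""
    token_score = similarity(normalized_tokens(code_a), normalized_tokens(code_b))
    if token_score < token_threshold:  # hypothetical cut-off
        return {"token": token_score, "ast": None}
    ast_score = similarity(ast_node_types(code_a), ast_node_types(code_b))
    return {"token": token_score, "ast": ast_score}


original = "def total(values):\n    result = 0\n    for v in values:\n        result += v\n    return result\n"
renamed = "def add_all(items):\n    acc = 0\n    for x in items:\n        acc += x\n    return acc\n"
different = "import os\nprint(os.getcwd())\n"

print(compare_code(original, renamed))
print(compare_code(original, different))
```

Renaming every identifier leaves both the normalised token stream and the node-type sequence unchanged, so the renamed copy still scores as identical, while the unrelated snippet is rejected by the token filter before any parsing takes place.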

Furthermore, modern hybrid methods increasingly leverage ensemble techniques, where multiple different plagiarism detectors or feature-extraction modules run in parallel and their results are fused. Ahuja et al. (2020) proposed an ensemble where one component uses a knowledge-based approach (leveraging a thesaurus and semantic expansion of terms) and another uses a vector-space semantic approach; the combination outperformed either alone, especially on heavily obfuscated plagiarism cases. Sahi and Gupta (2017) similarly exploited multiple information sources: their technique integrated web search results, a thesaurus-based expansion of terms, and a classical string matching engine. By cross-verifying potential plagiarism through various sources and levels of analysis, they achieved higher precision in distinguishing plagiarised text from merely similar or topically related text.

Overall, these examples underscore that the term “hybrid” in plagiarism detection can refer to combining different kinds of features, different algorithmic strategies, or even different modalities of content. What they share is the goal of covering each other’s blind spots – for instance, a citation analysis component can catch cases that text analysis misses, or a semantic model can flag paraphrases that slip past lexical checks. The success of such systems in both research evaluations and practical usage attests to the efficacy of hybrid approaches in overcoming the challenges posed by complex plagiarism.

Effectiveness and Evaluation of Hybrid Methods

Hybrid plagiarism detection methods have shown superior performance in numerous studies when compared to single-technique approaches. By leveraging a combination of indicators, they generally achieve higher recall (catching more true plagiarism) without a proportional drop in precision. For example, Ahuja et al. (2020) reported that their hybrid system significantly outperformed baseline detectors on standard plagiarism corpora, especially in detecting disguised plagiarism cases like paraphrased or translated text. Similarly, the SVM-based hybrid system of El-Rashidy et al. (2024) reported state-of-the-art PlagDet scores (a composite metric that combines precision, recall, and granularity) on the PAN 2013 and 2014 benchmark datasets, ranking at the top when compared against the results of those challenge evaluations. These improvements are attributable to the complementary nature of features: if a plagiarised passage escapes detection by lexical similarity, it might still be caught by semantic or syntactic similarity. Moreover, structural features can mitigate false negatives in academic contexts by providing additional evidence of copying (for instance, matching reference lists or equation patterns).

A critical advantage of hybrid methods noted in the literature is their ability to balance accuracy and efficiency. Simple textual methods are fast but miss nuanced plagiarism; advanced semantic or neural methods are accurate but computationally heavy. Hybrid systems often adopt a multi-stage design where an efficient algorithm does initial screening, and only the most promising candidates are subjected to resource-intensive analysis (Franco-Salvador et al., 2016; Sajid et al., 2025). Therefore, they can be scaled to large document databases more feasibly than a purely deep learning solution that naively compares every document pair in a high-dimensional semantic space. Researchers have also observed that hybrid approaches are more robust against adversarial plagiarism techniques. For instance, some plagiarists use automatic paraphrasing tools to evade detection. While such automatically paraphrased text might fool a basic n-gram matching system, a hybrid detector that also employs semantic embeddings or detects unusual phrasing patterns (intrinsic style anomalies) can still flag the content. In other words, hybrid methods reduce the “blind spots” in detection – each type of feature can catch certain cases the others miss, and only when all fail does plagiarism go undetected.
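
The multi-stage design can be sketched as a simple cascade in which the expensive scorer runs only on pairs that pass a cheap screen. Both scorers below are stand-ins: the "costly" function is a character-level ratio used as a placeholder for a neural or semantic model, and the thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def word_jaccard(a, b):
    """Cheap stage: Jaccard overlap of the two texts' word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def costly_stand_in(a, b):
    """Placeholder for an expensive scorer; here just a character-level ratio."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cascade(pairs, cheap_score, costly_score,
            cheap_threshold=0.2, final_threshold=0.8):
    """Return (flagged pairs, number of costly evaluations performed)."""
    flagged, costly_calls = [], 0
    for pair in pairs:
        if cheap_score(*pair) < cheap_threshold:
            continue  # screened out without paying for the costly stage
        costly_calls += 1
        if costly_score(*pair) >= final_threshold:
            flagged.append(pair)
    return flagged, costly_calls


pairs = [("a b c d", "a b c d"), ("a b c d", "w x y z")]
flagged, calls = cascade(pairs, word_jaccard, costly_stand_in)
print(flagged, calls)
```

The trade-off is visible in the screening threshold: set too high, a lexical screen discards heavily paraphrased pairs before the deeper stage ever sees them, so the first stage is usually tuned for recall.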

When evaluating plagiarism detectors, common metrics include precision, recall, F₁-score, and the composite PlagDet score. Hybrid systems consistently push these metrics higher, but they also introduce complexity in evaluation. Fine-tuning the integration (for example, setting thresholds for lexical similarity before triggering semantic analysis, or deciding how to weight different features in an ML model) is often necessary to optimise performance. Cross-validation on diverse datasets is employed to avoid overfitting to specific plagiarism patterns. The literature also emphasises testing hybrid systems on varied types of plagiarism: e.g., separate evaluation on verbatim plagiarism, paraphrase plagiarism, summary plagiarism, cross-language plagiarism, and source code plagiarism if applicable. Hybrid systems tend to perform well across all these categories – a testament to their comprehensive design – whereas single-feature systems might excel in one scenario and fail in another (Sánchez-Vega et al., 2019; Gharavi et al., 2020). (For instance, a character-level comparison method might excel at catching copy-paste plagiarism but completely miss cleverly paraphrased passages that a semantic method would catch.)
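
For reference, the PlagDet score used in the PAN evaluations divides the F₁-score by a logarithmic penalty on granularity, i.e. the average number of separate detections reported per plagiarised passage. The sketch below takes already-computed precision, recall, and granularity values as inputs; how those are aggregated over characters and cases is left out.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """PlagDet = F1 / log2(1 + granularity), with granularity >= 1."""
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    return f1_score(precision, recall) / math.log2(1 + granularity)


# A detector that reports each case exactly once is not penalised ...
print(plagdet(0.9, 0.8, 1.0))
# ... while one that splits each case into three detections on average
# has its score halved, since log2(1 + 3) = 2.
print(plagdet(0.9, 0.8, 3.0))
```

The granularity term is what rewards systems that merge fragmented detections into a single contiguous segment.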

Nonetheless, one must acknowledge that no system is perfect: extremely well-disguised plagiarism or plagiarism of ideas (without textual similarity) remains challenging. Hybrid methods represent the best current approach to this problem, but their effectiveness also relies on the continual updating of their knowledge (for example, expanding semantic databases or retraining neural models on new examples).

Challenges and Future Directions

Despite the success of hybrid plagiarism detection methods, there are ongoing challenges and open research questions. One significant issue is computational scalability. Combining multiple analysis techniques can be computationally expensive, especially on large corpora or in real-time applications. Deep learning models require considerable processing power and memory, and coupling them with additional analysis steps (like syntactic parsing or knowledge graph queries) can strain resources. Future research is exploring optimisations such as efficient indexing, parallel processing, and approximate nearest-neighbor search in embedding spaces to make hybrid detection more scalable (Hussain and Suryani 2015).
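
One family of approximate nearest-neighbour techniques is locality-sensitive hashing. The sketch below shows the random-hyperplane variant in plain Python: vectors whose signs agree across a set of random hyperplanes land in the same bucket, so a query only needs to be compared against its bucket instead of the whole collection. The parameter values are arbitrary, and this is offered as an illustration of the general idea and not as the method of any cited work.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(h * x for h, x in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group item keys by signature; vectors maps key -> embedding."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(key)
    return buckets


def query(vector, buckets, hyperplanes):
    """Return the keys sharing the query's bucket (candidate neighbours only)."""
    return list(buckets.get(signature(vector, hyperplanes), []))


planes = make_hyperplanes(dim=3, n_bits=8)
passages = {
    "p1": [0.2, -0.5, 0.9],
    "p2": [0.4, -1.0, 1.8],   # same direction as p1, so same signature
}
index = build_index(passages, planes)
print(query([0.2, -0.5, 0.9], index, planes))
```

Vectors pointing in similar directions are likely, though not guaranteed, to share a bucket; practical systems use several hash tables to reduce missed neighbours and then re-rank the candidates with an exact similarity.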

Another challenge is keeping the systems up-to-date with the evolving nature of plagiarism. With the advent of AI-generated text and sophisticated automatic paraphrasers, plagiarised content is becoming harder to distinguish from original writing. Hybrid systems may need to integrate authorship analysis or stylometric features to detect when a segment of text diverges from an author’s usual style, indicating potential plagiarism (Hourrane and Benlahmer 2019). Additionally, detecting plagiarism in low-resource languages or across languages remains a difficult frontier – while hybrid methods like that of Franco-Salvador et al. (2016) made progress, many language pairs and multilingual plagiarism cases are under-studied. Building multilingual embeddings and cross-language knowledge bases will be important for extending hybrid detection globally.

Furthermore, as hybrid systems grow more complex, transparency and interpretability become concerns. An academic or educator using a plagiarism detector might justifiably ask: on what basis did the system flag this passage? Hybrid approaches that involve black-box neural networks or opaque weighting of features can be hard to interpret. There is a call for explainable plagiarism detection, where the system can highlight the specific features or evidence (e.g., “unusual similarity in citation order” or “semantically equivalent sentence found in source X”) that led to the plagiarism decision (Meuschke and Gipp 2013). Achieving this will likely involve designing models that not only make a binary decision but also output alignments or annotations – something hybrid systems are well-suited to do, since they often explicitly align pieces of text, references, or code between documents.

Another future direction is broadening the definition of plagiarism detection. As noted, modern plagiarism can span text, code, images, and even ideas. The most advanced hybrid systems are starting to handle multiple content modalities (text and non-text). Examples include integrating image similarity detection to catch plagiarised figures or charts, and checking programming assignments through combined analysis of code and documentation. Developing unified frameworks that treat all these modalities in a cohesive way is a challenging task but would greatly benefit academic integrity tools.

Finally, continued community efforts such as the PAN competitions and shared datasets are crucial for benchmarking hybrid approaches. They provide diverse test cases (from simple to highly obfuscated plagiarism) that push researchers to innovate and refine their methods. The consensus in recent surveys (e.g., Sajid et al., 2025) is that hybrid methods will continue to dominate future advances in plagiarism detection, combining insights from computational linguistics, artificial intelligence, and domain-specific knowledge to stay ahead of those attempting to game the system.

Conclusion

Hybrid plagiarism detection methods represent the state-of-the-art strategy for identifying unoriginal content in documents. By integrating lexical, syntactic, semantic, and structural analyses – and often coupling these with powerful machine learning or deep learning models – hybrid systems can detect a wide range of plagiarism forms, from blatant copy-paste to deeply concealed rewritings. They exemplify a holistic approach: simple text overlap catches the low-hanging fruit, while deeper linguistic and structural checks capture the trickier instances of plagiarism. The use of ML/DL allows these systems to adapt and improve by learning from new examples, making them robust against emerging tactics like AI-assisted paraphrasing. The research reviewed in this article shows that hybrid methods consistently outperform single-method approaches in terms of detection accuracy, albeit with increased complexity. They are better equipped to handle the nuances of language, the quirks of writing style, and the context of documents (such as citations or code structure) than earlier-generation tools.

Moving forward, the trend is clearly towards more integration – not just of features, but of modalities and techniques – to ensure that plagiarists have no easy hiding place. At the same time, developers of plagiarism detection software must consider efficiency and usability: the most sophisticated algorithm is of limited use if it cannot run on real-world data sizes or if its results are too opaque to trust. Ongoing efforts therefore strive to make hybrid detectors faster, more interpretable, and more universally applicable (across languages and disciplines). For academics, educators, and software developers, understanding hybrid plagiarism detection is key to both building better tools and using them effectively to uphold integrity. Hybrid methods have become indispensable in the fight against plagiarism, and they will continue to evolve. As plagiarism techniques grow more advanced, the detection methods that combine multiple angles of analysis – from the superficial to the deep semantic – will remain our best defence to ensure that original work is properly recognised and credited.

References

Ahuja, L., Gupta, V., & Kumar, R. (2020). A new hybrid technique for detection of plagiarism from text documents. Arabian Journal for Science and Engineering, 45(12), 9939–9952. DOI: 10.1007/s13369-020-04565-9.
Alzahrani, S. M., Salim, N., & Abraham, A. (2012). Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2), 133–149. DOI: 10.1109/TSMCC.2011.2134847.
Franco-Salvador, M., Girardi, C., & Rosso, P. (2016). Knowledge graph vs. text embeddings for cross-language plagiarism detection. Proceedings of the 39th International ACM SIGIR Conference (SIGIR 2016), 1105–1108. DOI: 10.1145/2911451.2914765.
Gipp, B., Meuschke, N., & Beel, J. (2014). Citation-based plagiarism detection: Practitioners’ perspectives on cross-language idea plagiarism. Proceedings of the 36th European Conference on Information Retrieval (ECIR 2014), 133–137. DOI: 10.1007/978-3-319-06028-6_13.
Glinos, D. (2014). A hybrid architecture for plagiarism detection. Working Notes of CLEF 2014 – PAN Competition on Text Alignment. (Notebook for PAN at CLEF 2014).
Hourrane, O., & Benlahmer, E. H. (2019). Rich style embedding for intrinsic plagiarism detection. International Journal of Advanced Computer Science and Applications, 10(11). DOI: 10.14569/IJACSA.2019.0101185.
Liu, C., Chen, C., Han, J., & Yu, P. S. (2015). GPLAG: detection of software plagiarism by program dependence graph analysis. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 872–881. DOI: 10.1145/2783258.2783273.
Meuschke, N., Gipp, B., & Breitinger, C. (2018). Analyzing mathematical content to detect academic plagiarism. In: Computational Approaches to Detect Plagiarism in Academic Writing (Chapter 6). Springer. DOI: 10.1007/978-3-658-20534-9_6.
Meuschke, N., Soni, S., Dähring, S., Gipp, B., & Seebacher, D. (2023). HyPlag: A hybrid plagiarism detection system for scientific documents. International Journal of Educational Technology in Higher Education, 20(1), 43. DOI: 10.1186/s41239-023-00374-3.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. DOI: 10.3115/v1/D14-1162.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 3982–3992. DOI: 10.18653/v1/D19-1410.
Sahi, M., & Gupta, V. (2017). A novel technique for detecting plagiarism in documents exploiting information sources. Cognitive Computation, 9(6), 852–867. DOI: 10.1007/s12559-017-9495-8.
Sajid, M., Sanaullah, M., Fuzail, M., Malik, T. S., & Shuhidan, S. M. (2025). Comparative analysis of text-based plagiarism detection techniques. PLoS ONE, 20(4), e0319551. DOI: 10.1371/journal.pone.0319551.
Sánchez-Vega, F., Villatoro-Tello, E., Montes-y-Gómez, M., Rosso, P., & Stamatatos, E. (2019). Paraphrase plagiarism identification with character-level features. Pattern Analysis and Applications, 22(2), 669–681. DOI: 10.1007/s10044-018-0733-1.
van Son, N., Huong, L. T., & Thanh, N. C. (2021). A two-phase plagiarism detection system based on multi-layer LSTM networks. IAES International Journal of Artificial Intelligence, 10(3), 636–648. DOI: 10.11591/ijai.v10.i3.pp636-648.
Vani, K., & Gupta, D. (2017). Detection of idea plagiarism using syntax–semantic concept extractions with genetic algorithm. Expert Systems with Applications, 73, 11–26. DOI: 10.1016/j.eswa.2016.12.020.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations (ICLR 2013), Workshop Track Proceedings, Scottsdale, Arizona, USA. https://arxiv.org/abs/1301.3781
Potthast, M., Hagen, M., Beyer, A., & Stein, B. (2014). Improving cloze test performance of language learners using web N-grams. In J. Tsujii & J. Hajic (Eds.), Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (pp. 962–973). Dublin City University and Association for Computational Linguistics. https://aclanthology.org/C14-1091/
El-Rashidy, M. A., Mohamed, R. G., El-Fishawy, N. A., et al. (2024). An effective text plagiarism detection system based on feature selection and SVM techniques. Multimedia Tools and Applications, 83, 2609–2646. https://doi.org/10.1007/s11042-023-15703-4
Moravvej, S. V., Habibi, M., & Rahgozar, M. (2023). Transformer-based language models and attention mechanisms for semantic text similarity in plagiarism detection. Information Retrieval Journal, 26(2), 97–123. https://doi.org/10.1007/s10791-021-09394-4
Al-Khanjari, Z., AlAjmi, R., & Al-Badi, A. (2015). Code plagiarism detection algorithm based on semantic role labeling and abstract syntax trees. International Journal of Software Engineering and Its Applications, 9(10), 315–326. https://doi.org/10.14257/ijseia.2015.9.10.31
Gharavi, E., Veisi, H., & Rosso, P. (2020). Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: No training phase. Neural Computing and Applications, 32(14), 10593–10607. https://doi.org/10.1007/s00521-019-04590-z
Hussain, S. F., & Suryani, A. S. (2015). On retrieving intelligently plagiarized documents using semantic similarity. Engineering Applications of Artificial Intelligence, 45, 246–258. https://doi.org/10.1016/j.engappai.2015.07.008
Meuschke, N., & Gipp, B. (2013). State-of-the-art in detecting academic plagiarism. International Journal for Educational Integrity, 9(1), 50–71. https://doi.org/10.1007/s40979-013-0008-5