A quantitative analysis of datasets and the patterns of their use in language models


Adam Jachimczyk 


Abstract

Purpose/Thesis: The growing popularity of language models has led to increased interest in the datasets used to train them. This analysis contributes to the line of research devoted to these collections, focusing on how frequently they are used and on the patterns of their use across different types of language models.

Approach/Research methods: A total of 56,762 language models from the Hugging Face repository were analyzed. The analysis identified the datasets involved and determined how frequently language models used them. The study also examined how much dataset use varies across model authors and how extensively datasets are used within different categories of language models.
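
As an illustration of this kind of measurement (a minimal sketch under assumed tooling, not a description of the author's actual pipeline), dataset usage declared by models on the Hugging Face Hub can be enumerated with the huggingface_hub Python library: models that declare their training data expose it as "dataset:<name>" tags, which can be counted across a listing of models.

```python
# Illustrative sketch only: counts datasets declared by Hugging Face models.
# Assumption: this is not the study's own pipeline; it relies on the public
# huggingface_hub API, where declared datasets appear as "dataset:<name>" tags.
from collections import Counter

from huggingface_hub import HfApi

api = HfApi()
dataset_counts = Counter()
models_with_datasets = 0

# Iterate over a bounded sample of models; a full audit would remove the limit.
for model in api.list_models(limit=1000):
    declared = [t.split("dataset:", 1)[1] for t in (model.tags or []) if t.startswith("dataset:")]
    if declared:
        models_with_datasets += 1
        dataset_counts.update(declared)

print(f"Models declaring at least one dataset: {models_with_datasets}")
for name, count in dataset_counts.most_common(10):
    print(f"{name}: {count}")
```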

Results and conclusions: Most models were trained on only a single dataset. It was also observed that categories containing fewer models exhibit greater diversity in the datasets used to train them.

Originality/Cognitive value: The analysis reveals certain patterns in how diverse dataset use is, depending on the author and the category of a language model.

Keywords

Hugging Face. Language model. Artificial intelligence. Dataset.




Published: 2025-12-01



Adam Jachimczyk  a.jachimczyk@uw.edu.pl






Creative Commons License

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).
