Pushshift Reddit Dataset (Hugging Face)

Pushshift is a social media data collection, analysis, and archiving platform that has collected Reddit data since 2015 and made it available to researchers. Pushshift's Reddit dataset was updated in real time up to 2023-03, before Reddit shut off its access, and includes historical data back to Reddit's inception. This repository explores the Pushshift Reddit Dataset, one of the most comprehensive, large-scale datasets available for analyzing online discourse, community behavior, and social trends on Reddit.

In practice, using Python with Pushshift simplifies data extraction from Reddit: it supports specific queries such as searching comments or submissions, filtering by subreddit, or excluding certain authors. With this API, you can quickly find the data you are interested in and look for correlations in it.

Widely employed by numerous LLMs [9; 79], these datasets contribute to the models' training by exposing them to a diverse array of textual genres and subject matter, fostering a more comprehensive understanding of [...].

The first step to retrain the full models is to generate the aforementioned 27GB Reddit dataset. [...] io/reddit and creating intermediate files, which overall require 700GB of local disk space.
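
The queries mentioned above (searching comments or submissions, filtering by subreddit, excluding an author) can be sketched in Python. This is a minimal illustration under stated assumptions, not a definitive client. The base URL and the `q`, `subreddit`, `author`, and `size` parameters follow the historical public Pushshift API, which may no longer be reachable. The `!` prefix for excluding an author is likewise an assumption about that API. The helper names `build_search_url` and `filter_comments` are hypothetical. Because live access was cut off, the second helper applies the same filters offline to newline-delimited JSON records, assuming each record carries the `subreddit`, `author`, and `body` fields.

```python
import json
from urllib.parse import urlencode

# Assumption: base URL of the historical public Pushshift API (may be offline).
PUSHSHIFT_BASE = "https://api.pushshift.io/reddit/search"


def build_search_url(kind, query=None, subreddit=None, exclude_author=None, size=25):
    """Build a search URL for comments or submissions without sending a request."""
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")
    params = {"size": size}
    if query:
        params["q"] = query
    if subreddit:
        params["subreddit"] = subreddit
    if exclude_author:
        # Assumption: a leading "!" negates the author filter.
        params["author"] = "!" + exclude_author
    return f"{PUSHSHIFT_BASE}/{kind}/?{urlencode(params)}"


def filter_comments(lines, subreddit=None, exclude_authors=(), keyword=None):
    """Yield comment records from newline-delimited JSON that pass all filters."""
    excluded = {name.lower() for name in exclude_authors}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if subreddit and record.get("subreddit", "").lower() != subreddit.lower():
            continue
        if record.get("author", "").lower() in excluded:
            continue
        if keyword and keyword.lower() not in record.get("body", "").lower():
            continue
        yield record


if __name__ == "__main__":
    print(build_search_url("comment", query="dataset", subreddit="science",
                           exclude_author="AutoModerator", size=10))
```

To use the offline filter, pass any iterable of JSON lines, for example an open file handle over a decompressed dump: `for c in filter_comments(fh, subreddit="science", exclude_authors=["AutoModerator"]): ...`. Keeping URL construction separate from any network call makes the query easy to inspect before deciding whether and where to send it.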