MegaWika 2

MegaWika 2 is an improved multilingual text dataset containing a structured view of Wikipedia articles, the web sources they cite, source text quality estimates, article text translations, and additional article enrichments.

Note: Web citations (sources) in the HuggingFace dataset do not include scraped source text; use rehydrate-citations.py to rehydrate them.

The initial data release is based on Wikipedia dumps from May 1, 2024. In total, the data contains about 77 million articles and 71 million scraped web citations. The English collection, the largest, contains about 10 million articles and 24 million scraped web citations.

In the future, we may release deltas: collections of articles that have been added or changed since the initial dump (or since the previous delta release). We expect only a fraction of the articles to change between dumps, so deltas should be significantly smaller than the initial collection.

Languages Covered

As in MegaWika 1, MegaWika 2 spans 50 languages, including English, each designated by its two-character ISO 639-1 language code:

  • af: Afrikaans
  • ar: Arabic
  • az: Azeri (Azerbaijani)
  • bn: Bengali
  • cs: Czech
  • de: German (Deutsch)
  • en: English
  • es: Spanish (Español)
  • et: Estonian
  • fa: Farsi (Persian)
  • fi: Finnish
  • fr: French
  • ga: Irish (Gaelic)
  • gl: Galician
  • gu: Gujarati
  • he: Hebrew
  • hi: Hindi
  • hr: Croatian
  • id: Indonesian
  • it: Italian
  • ja: Japanese
  • ka: Georgian (Kartvelian/Kartlian)
  • kk: Kazakh
  • km: Khmer
  • ko: Korean
  • lt: Lithuanian
  • lv: Latvian
  • mk: Macedonian (Makedonski)
  • ml: Malayalam
  • mn: Mongolian
  • mr: Marathi
  • my: Burmese (Myanmar language)
  • ne: Nepali
  • nl: Dutch (Nederlands)
  • pl: Polish
  • ps: Pashto
  • pt: Portuguese
  • ro: Romanian
  • ru: Russian
  • si: Sinhalese (Sri Lankan language)
  • sl: Slovenian
  • sv: Swedish (Svenska)
  • ta: Tamil
  • th: Thai
  • tr: Turkish
  • uk: Ukrainian
  • ur: Urdu
  • vi: Vietnamese
  • xh: Xhosa
  • zh: Chinese (Zhōngwén)

Dataset Structure

Directory Structure

The MegaWika 2 dataset consists of a list of directories, one for each language, designated by its language code.

Each language directory holds its articles as a list of chunk files in JSON-lines format. Each chunk contains up to 1,000 articles, and each line of a chunk file is a distinct JSON-encoded Wikipedia article:

─ en/
  ├─ data/
  │  ├─ 000000001.jsonl
  │  ├─ 000000002.jsonl
  │  └─ [...]
  └─ metrics.json

Within each language directory, the chunk files live in the data subdirectory, and metrics.json holds language-specific summary statistics.
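
The layout above can be read with a few lines of Python. This is a minimal sketch, assuming the dataset has been downloaded to a local directory; the function name iter_articles and the root argument are hypothetical and not part of the dataset tooling.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(root: str, lang: str) -> Iterator[dict]:
    """Yield one decoded article per JSON line for one language directory."""
    data_dir = Path(root) / lang / "data"
    # Sort so chunks are visited in a stable order (000000001.jsonl, ...).
    for chunk_path in sorted(data_dir.glob("*.jsonl")):
        with chunk_path.open(encoding="utf-8") as chunk:
            for line in chunk:
                if line.strip():  # tolerate blank lines
                    yield json.loads(line)
```

Because the function is a generator, articles are decoded one at a time, so a whole language collection never has to fit in memory.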

JSON Schema

The full data schema for MegaWika 2 is described in subsequent chapters (for example, MegaWika 2.0 Data Schema).

Each article object contains, among other things, the article title, the article's raw wikicode and parsed text, and a hierarchy of objects representing the article structure. That hierarchy is organized as follows:

  • The top level of this hierarchy is a list of headings, paragraphs, tables, infoboxes, and other block-level elements.
    • These block-level elements contain various sub-elements; for example, each paragraph contains a list of sentences.
      • Each sentence contains the sentence text, translated (English) sentence text, and a list of citations.
        • Each citation includes the raw wikicode content, the character index of the citation in the sentence text, an optional citation URL, and optional scraped citation source text.
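
The nesting described above (elements, then sentences, then citations) can be walked as in the sketch below. It uses the field names from the data schema chapters (elements, type, sentences, text, citations); the function name iter_sentence_citations is hypothetical.

```python
from typing import Iterator, Tuple


def iter_sentence_citations(article: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (sentence text, citation) pairs from paragraph elements."""
    for element in article.get("elements", []):
        if element.get("type") != "paragraph":
            continue  # headings, tables, infoboxes, etc. are skipped here
        for sentence in element.get("sentences", []):
            for citation in sentence.get("citations", []):
                yield sentence["text"], citation
```

Only paragraphs are visited in this sketch; headings also carry a citations list and could be handled the same way if needed.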

Statistics

The metrics files (for example, en/metrics.json) provide statistics describing the data collected for each language.

MegaWika 2 features greater coverage than MegaWika 1, including marked improvements in recall for the citation detection and source scraping/extraction processes:

Metric                                   Version 1     Version 2.0   Increase
Articles Collected                       2,072,726     9,841,417     375%
Web Citations Detected                   17,368,499    57,431,369    231%
Web Citations Successfully Scraped       5,623,386     23,544,500    319%
Web Citation Scrape/Extraction Recall    32%           41%           27% (relative)
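
The derived columns can be checked from the raw counts. The sketch below assumes that recall means successfully scraped citations divided by detected citations, which is an interpretation rather than a definition given here; the helper name percent_increase is hypothetical.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100


# Counts copied from the table above.
articles_v1, articles_v2 = 2_072_726, 9_841_417
detected_v1, detected_v2 = 17_368_499, 57_431_369
scraped_v1, scraped_v2 = 5_623_386, 23_544_500

# Assumption: recall = scraped citations / detected citations.
recall_v1 = scraped_v1 / detected_v1
recall_v2 = scraped_v2 / detected_v2

print(round(percent_increase(articles_v1, articles_v2)))
print(round(percent_increase(detected_v1, detected_v2)))
print(round(percent_increase(scraped_v1, scraped_v2)))
print(round(recall_v1 * 100), round(recall_v2 * 100))
print(round(percent_increase(recall_v1, recall_v2)))
```

Under that assumption the rounded results match the table. The relative recall increase is computed from the unrounded recall values; starting from the rounded 32% and 41% would give a slightly different figure.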

Changelog

These entries summarize differences between versions; see the data schema(s) in subsequent chapters for details.

2.0 (Differences from MegaWika 1)

MegaWika version 2 introduces a comprehensive redesign of the MegaWika data structure. MegaWika 2 captures not just passage/source pairs, but also the structure of the text (and of the sources cited in that text) and its relationship to the surrounding Wikipedia article. Specifically, each article contains a structured element list parsed from the original Wikitext; the Wikitext is also provided for reference. Paragraph elements in MegaWika 2 contain sentence-segmented text, further facilitating downstream research. In parallel, each article contains a list of excerpts (called passages in MegaWika 1), each with one or more citations attached; MegaWika 1 instead stored passage-citation pairs, which supported only one citation per passage. MegaWika 2.0 does not include the translation probabilities, "repetitious translation" annotations, source language ID, or generated question-answer pairs found in MegaWika 1, but it does add a large amount of other metadata, including article creation and last revision dates, cross-lingual links, short source/citation snippets provided by authors, and source text quality estimates.

Along the way, we have improved the recall of the citation extraction process by (among other changes):

  • Adding support for named citation resolution
  • Expanding the coverage of citation syntax understood by the citation detector
  • Including not just citations with scrapable URLs, but all citations, to support researchers who may want to study Wikipedia citation behavior in general, and across languages
  • Increasing the scraped source code size limit

Statistics characterizing the improved recall in citation detection are provided in the Statistics section. Additional statistics are provided in the metrics files (for example, en/metrics.json) in the dataset.

MegaWika 2 also introduces improvements to error handling, providing higher coverage across the board. Errors and metadata for source scraping and extraction are included in the data, enabling analysis of sources of missing data and potential biases in the data.

For additional details and analysis of the MegaWika 2.0 dataset and its construction, please see our whitepaper on arXiv.

MegaWika 2 Development Data Schema

MegaWika 2 is structured as a collection of JSON-lines "chunk" files organized by Wikipedia language. Each chunk file contains a collection of Article objects, one (JSON-encoded) Article per line. What follows is documentation for each type in the schema, starting with Article.

Article

Fields

  • id (string): Wikipedia article id
    • Ex: "39"
  • ns_key (string): Wikipedia article namespace key (id)
    • Ex: "0"
  • ns (string): Wikipedia article namespace name
    • Ex: ""
    • Ex: "Template"
  • title (string): Article title
    • Ex: "Les Hauts De Hurlevent"
  • wikicode (string): Wikimedia source code for article
    • Ex: "<div id=\"mp_header\" class=\"mp_outerbox\"> ..."
  • hash (string): Hash of title and content
    • Ex: "2c0c3bfb0493fb8ddd5661..."
  • last_revision (string): Datetime of last revision
    • Ex: "2023-12-03T10:50:40Z"
  • last_revision_id (string): Wikipedia id of last revision
    • Ex: "481"
  • last_revision_parent_id (string | null): Wikipedia id of last revision's parent, if it has one
    • Ex: "479"
    • Ex: null
  • first_revision (string | null): Datetime of initial article creation, if it could be retrieved.
    • Ex: "2023-09-04T08:19:40Z"
  • first_revision_following_redirects (string | null): Same as first_revision, but retrieved after following article redirects.
    • Ex: "2018-04-02T09:00:14Z"
  • first_revision_access_date (string | null): Datetime first revision was retrieved from Wikipedia Action API
    • Ex: "2023-12-03T10:55:40Z"
  • first_revision_following_redirects_access_date (string | null): Datetime first revision following redirects was retrieved from Wikipedia Action API
    • Ex: "2023-12-03T10:55:40Z"
  • cross_lingual_links (object[string, string] | null): A dictionary mapping this article onto articles on the same topic in other languages; keys represent language codes, values represent the title of the article in that language.
    • Ex: {"en": "Wuthering Heights", "es": "Cumbres Borrascosas"}
  • cross_lingual_links_following_redirects (object[string, string] | null): Same as cross_lingual_links, but retrieved after following article redirects.
    • Ex: {"en": "Wuthering Heights", "es": "Cumbres Borrascosas", "zh": "咆哮山莊"}
  • cross_lingual_links_access_date (string | null): Datetime cross-lingual links were retrieved from Wikipedia Action API
    • Ex: "2023-12-03T10:56:40Z"
  • cross_lingual_links_following_redirects_access_date (string | null): Datetime cross-lingual links following redirects were retrieved from Wikipedia Action API
    • Ex: "2023-12-03T10:56:40Z"
  • redirect (Redirect | null): Information about the article or article section this title redirects to
    • Ex: null
  • text (string): Natural-language text of article
    • Ex: "Les Hauts de Hurlevent est l'unique roman d'Emily Brontë ..."
  • elements (array[Heading | Table | Infobox | Paragraph | Math | Code | Preformatted]): Article structure: paragraphs, text and citation elements, etc.
  • excerpts_with_citations (array[ExcerptWithCitations]): A list of all citations from the article and the associated text excerpts they appear in. This data is a postprocessed subset of the data in the elements list and is provided for convenience.
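
The ns_key and redirect fields make it possible to separate ordinary content articles from templates and redirect pages. The sketch below relies on the MediaWiki convention that namespace 0 is the main article namespace; the function name is_content_article is hypothetical.

```python
def is_content_article(article: dict) -> bool:
    """True for main-namespace articles that are not redirects."""
    in_main_namespace = article.get("ns_key") == "0"  # ns_key is a string
    is_redirect = article.get("redirect") is not None
    return in_main_namespace and not is_redirect
```

Note that ns_key is compared against the string "0", since the schema stores it as a string rather than an integer.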

Citation

Fields

  • content (string): Citation content
    • Ex: "<ref>{{Citation |last=Thomas |first=Darcy |year=2013 ..."
  • char_index (integer): Character index of this citation in the enclosing sentence or excerpt
    • Ex: 39
  • name (string | null): Optional citation name
    • Ex: null
    • Ex: "Thomas2013"
  • url (string | null): Extracted URL, if web citation
    • Ex: "https://example.com/emily-bronte/..."
  • source_text (string | null): Extracted source text, if source download and extraction succeeded
    • Ex: "Emily Brontë avait deux sœurs ..."
  • source_code_content_type (string | null): Downloaded source code content type, if download succeeded and content-type header was received
    • Ex: "text/html"
    • Ex: "text/html; charset=ISO-8859-1"
  • source_code_num_bytes (integer | null): Not used
    • Ex: null
  • source_code_num_chars (integer | null): Size of downloaded source code in characters, if source download succeeded and code can be decoded as text
    • Ex: 100000
  • source_download_date (string | null): Datetime source code was downloaded from the web
    • Ex: "2023-12-03T10:50:40Z"
  • source_download_error (string | null): Source download error message, if there was an error
    • Ex: null
    • Ex: "Download is too large (2.4 MB)"
    • Ex: "ConnectTimeoutError: ..."
  • source_extract_error (string | null): Source extraction error message, if there was an error
    • Ex: null
    • Ex: "Text is too short (50 words)"
    • Ex: "Exception: ..."
  • source_snippet (string | null): A relevant snippet from the source document, excerpted manually by a Wikipedia editor; stored in the quote field of the relevant citation templates.
    • Ex: "Emily Brontë avait deux sœurs"
  • source_quality_label (integer | null): An integer between 1 and 5 representing the predicted relevance and quality of the text extracted from the source page: 1 is irrelevant content like 404 text and paywalls, 2 is likely irrelevant or unreadable content like a list of headlines or mangled table, 3 is potentially relevant content like a book abstract, 4 is likely relevant content but with some quality issues, and 5 is relevant content that is well-formatted.
    • Ex: 4
  • source_quality_raw_score (number | null): The raw score output by the source quality regression model, generally between 0 and 1. The source quality label is computed from the raw score; the mapping is monotonic but non-linear.
    • Ex: 0.8

Example JSON

{
  "content": "<ref>{{Citation |last=Thomas |first=Darcy |year=2013 ...",
  "char_index": 39,
  "name": null,
  "url": "https://example.com/emily-bronte/...",
  "source_text": "Emily Brontë avait deux sœurs ...",
  "source_code_content_type": "text/html",
  "source_code_num_bytes": null,
  "source_code_num_chars": 100000,
  "source_download_date": "2023-12-03T10:50:40Z",
  "source_download_error": null,
  "source_extract_error": null,
  "source_snippet": "Emily Brontë avait deux sœurs",
  "source_quality_label": 4,
  "source_quality_raw_score": 0.8
}
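
Since source_text and source_quality_label are both nullable, code that consumes citations usually needs a guard. The following is a minimal sketch of one possible filter; the function name usable_source_text and the default threshold of 4 are illustrative assumptions, not a recommendation from the dataset authors.

```python
from typing import Optional


def usable_source_text(citation: dict, min_label: int = 4) -> Optional[str]:
    """Return source text only if present and rated at least min_label."""
    text = citation.get("source_text")
    label = citation.get("source_quality_label")
    if text is None or label is None:
        return None  # not scraped, not rehydrated, or not scored
    if label < min_label:
        return None  # predicted to be irrelevant or low quality
    return text
```

Applied to the example object above, this returns the source text, because the label 4 meets the default threshold.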

CitationNeeded

Fields

  • type (const string = "citation-needed"): Used to differentiate from other element types
  • content (string): Citation-needed element content
    • Ex: "{{Citation needed|date=September 2015}}"
  • char_index (integer): Character index of this citation-needed in the enclosing sentence or excerpt
    • Ex: 39

Example JSON

{
  "type": "citation-needed",
  "content": "{{Citation needed|date=September 2015}}",
  "char_index": 39
}
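
The char_index field locates a citation or citation-needed element within its enclosing text. The sketch below makes those positions visible by inserting a marker at each one. It assumes char_index is a 0-based offset counted in Python string characters, which is not spelled out here; the function name mark_positions and the marker string are hypothetical.

```python
from typing import List


def mark_positions(text: str, items: List[dict], marker: str = "[?]") -> str:
    """Insert marker into text at each item's char_index."""
    # Work from the right so earlier offsets stay valid after each insertion.
    for item in sorted(items, key=lambda i: i["char_index"], reverse=True):
        index = item["char_index"]
        text = text[:index] + marker + text[index:]
    return text
```

The same function works for Citation objects, since they carry a char_index field with the same meaning.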

Code

Fields

  • type (const string = "code"): Used to differentiate from other element types
  • language (string | null): Code language (as used for syntax highlighting)
    • Ex: "cpp"
  • content (string): Code block content
    • Ex: "int main() { ..."

Example JSON

{
  "type": "code",
  "language": "cpp",
  "content": "int main() { ..."
}

ExcerptWithCitations

Fields

  • text (string): The text of three consecutive sentences from an article
    • Ex: "Les Hauts de Hurlevent est .... défis à la culture victorienne."
  • translated_text (string | null): English translation of the excerpt text, if not in English Wikipedia
    • Ex: "Wuthering Heights is .... challenges to Victorian culture."
  • citations (array[Citation]): Citation(s) appearing in the final sentence of this excerpt
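
Excerpts make it straightforward to rebuild passage/source pairs. The sketch below is one way to do so, preferring the English translation of the excerpt when one is present; the function name excerpt_source_pairs is hypothetical, and source texts are only available after citations have been rehydrated.

```python
from typing import List, Tuple


def excerpt_source_pairs(article: dict) -> List[Tuple[str, str]]:
    """Pair each excerpt with the source text of each of its citations."""
    pairs = []
    for excerpt in article.get("excerpts_with_citations", []):
        # Fall back to the original text when there is no translation.
        excerpt_text = excerpt.get("translated_text") or excerpt["text"]
        for citation in excerpt.get("citations", []):
            if citation.get("source_text"):
                pairs.append((excerpt_text, citation["source_text"]))
    return pairs
```

An excerpt with several citations yields several pairs, one per citation that has source text.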

Heading

Fields

  • type (const string = "heading"): Used to differentiate from other element types
  • text (string): Heading text
    • Ex: "Personnages"
  • translated_text (string | null): English translation of heading text, if not in English Wikipedia
    • Ex: "Characters"
  • level (integer): Heading level (1 being top-level/most general, 6 being bottom-level/most specific)
    • Ex: 2
  • citations (array[Citation]): Citations appearing in this heading
  • citations_needed (array[CitationNeeded]): Citation-needed elements appearing in this heading
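
The level field is enough to recover an article outline from the elements list. A minimal sketch, with the hypothetical function name outline, indents each heading by two spaces per level below the top:

```python
from typing import List


def outline(article: dict) -> List[str]:
    """Return heading texts indented according to their level."""
    lines = []
    for element in article.get("elements", []):
        if element.get("type") == "heading":
            indent = "  " * (element["level"] - 1)  # level 1 is not indented
            lines.append(indent + element["text"])
    return lines
```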

Infobox

Fields

  • type (const string = "infobox"): Used to differentiate from other element types
  • content (string): Infobox content
    • Ex: "{{Infobox Livre\n| auteur = Emily Brontë\n...\n}}"

Example JSON

{
  "type": "infobox",
  "content": "{{Infobox Livre\n| auteur = Emily Brontë\n...\n}}"
}

Math

Fields

  • type (const string = "math"): Used to differentiate from other element types
  • content (string): Math block content
    • Ex: "\\sin 2\\pi x + \\ln e ..."

Example JSON

{
  "type": "math",
  "content": "\\sin 2\\pi x + \\ln e ..."
}

Paragraph

Fields

  • type (const string = "paragraph"): Used to differentiate from other element types
  • sentences (array[Sentence]): List of sentences in this paragraph

Preformatted

Fields

  • type (const string = "preformatted"): Used to differentiate from other element types
  • content (string): Preformatted block content
    • Ex: "____\n|DD|____T_\n|_ |_____|<\n  @-@-@-oo\\\n"

Example JSON

{
  "type": "preformatted",
  "content": "____\n|DD|____T_\n|_ |_____|<\n  @-@-@-oo\\\n"
}

Redirect

Fields

  • title (string): The title of the article this article redirects to
    • Ex: "Les Hauts de Hurlevent"
  • heading (string | null): The text of the section heading this article redirects to, if it redirects to a specific section of an article
    • Ex: "Roman"
  • access_date (string): Datetime this redirect was retrieved from Wikipedia Action API
    • Ex: "2023-12-03T10:56:40Z"

Example JSON

{
  "title": "Les Hauts de Hurlevent",
  "heading": "Roman",
  "access_date": "2023-12-03T10:56:40Z"
}
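
A redirect can be turned into a single target string using the usual wiki link form, Title#Section. This is a sketch with a hypothetical function name, redirect_target; it returns None for articles that are not redirects.

```python
from typing import Optional


def redirect_target(article: dict) -> Optional[str]:
    """Return 'Title' or 'Title#Heading' for a redirect, else None."""
    redirect = article.get("redirect")
    if redirect is None:
        return None
    if redirect.get("heading"):
        return redirect["title"] + "#" + redirect["heading"]
    return redirect["title"]
```

For the example object above, the result would be "Les Hauts de Hurlevent#Roman".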

Sentence

Fields

  • text (string): Sentence text content
    • Ex: "Les Hauts de Hurlevent est l'unique roman d'Emily Brontë."
  • translated_text (string | null): English translation of sentence text content, if not in English Wikipedia
    • Ex: "Wuthering Heights is the only novel by Emily Brontë."
  • trailing_whitespace (string): If the sentence was originally followed by whitespace, this will be a space. If the sentence was not followed by whitespace (for example, if it was followed by a quotation mark), this will be the empty string.
    • Ex: " "
    • Ex: ""
  • citations (array[Citation]): Citations appearing in this sentence
  • citations_needed (array[CitationNeeded]): Citation-needed elements appearing in this sentence
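
The trailing_whitespace field exists so that paragraph text can be reassembled from its sentences. A minimal sketch, using the hypothetical function name paragraph_text:

```python
def paragraph_text(paragraph: dict) -> str:
    """Join sentence texts, restoring the whitespace that followed each one."""
    return "".join(
        sentence["text"] + sentence["trailing_whitespace"]
        for sentence in paragraph["sentences"]
    )
```

Joining with a fixed separator instead would add spaces where the original had none, for example before a closing quotation mark.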

Table

Fields

  • type (const string = "table"): Used to differentiate from other element types
  • content (string): Table content
    • Ex: "{| class=\"wikitable\"\n|+ Personnages\n|-\n! Nom !! ...\n...\n|}"

Example JSON

{
  "type": "table",
  "content": "{| class=\"wikitable\"\n|+ Personnages\n|-\n! Nom !! ...\n...\n|}"
}
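
Every element type except Sentence-level objects carries a type field, so the elements list can be summarized or dispatched on by that value. The sketch below counts elements per type; the function name element_type_counts is hypothetical.

```python
from collections import Counter


def element_type_counts(article: dict) -> Counter:
    """Count top-level elements by their type discriminator."""
    return Counter(element["type"] for element in article.get("elements", []))
```

The resulting keys are the constant strings documented above, such as "heading", "paragraph", "table", and "infobox".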

MegaWika 2.0 Data Schema

MegaWika 2 is structured as a collection of JSON-lines "chunk" files organized by Wikipedia language. Each chunk file contains a collection of Article objects, one (JSON-encoded) Article per line. What follows is documentation for each type in the schema, starting with Article.

Article

Fields

  • title (string): Article title
    • Ex: "Les Hauts de Hurlevent"
  • wikicode (string): Wikimedia source code for article
    • Ex: "<div id=\"mp_header\" class=\"mp_outerbox\"> ..."
  • hash (string): Hash of title and content
    • Ex: "2c0c3bfb0493fb8ddd5661..."
  • last_revision (string): Datetime of last revision
    • Ex: "2023-12-03T10:50:40Z"
  • first_revision (string | null): Datetime of initial article creation, if it could be retrieved.
    • Ex: "2023-09-04T08:19:40Z"
  • first_revision_access_date (string | null): Datetime first revision was retrieved from Wikipedia Action API
    • Ex: "2023-12-03T10:55:40Z"
  • cross_lingual_links (object[string, string] | null): A dictionary mapping this article onto articles on the same topic in other languages; keys represent language codes, values represent the title of the article in that language.
    • Ex: {"en": "Wuthering Heights", "es": "Cumbres Borrascosas"}
  • cross_lingual_links_access_date (string | null): Datetime cross-lingual links were retrieved from Wikipedia Action API
    • Ex: "2023-12-03T10:56:40Z"
  • text (string): Natural-language text of article
    • Ex: "Les Hauts de Hurlevent est l'unique roman d'Emily Brontë ..."
  • elements (array[Heading | Table | Infobox | Paragraph | Math | Code | Preformatted]): Article structure: paragraphs, text and citation elements, etc.
  • excerpts_with_citations (array[ExcerptWithCitations]): A list of all citations from the article and the associated text excerpts they appear in. This data is a postprocessed subset of the data in the elements list and is provided for convenience.
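
Because cross_lingual_links may be null, lookups need a guard. The sketch below returns the title of the corresponding article in another language, or None when no link is recorded; the function name title_in_language is hypothetical.

```python
from typing import Optional


def title_in_language(article: dict, lang: str) -> Optional[str]:
    """Look up the linked article title for a language code, if any."""
    links = article.get("cross_lingual_links") or {}  # field may be null
    return links.get(lang)
```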

Citation

Fields

  • content (string): Citation content
    • Ex: "<ref>{{Citation |last=Thomas |first=Darcy |year=2013 ..."
  • char_index (integer): Character index of this citation in the enclosing sentence or excerpt
    • Ex: 39
  • name (string | null): Optional citation name
    • Ex: null
    • Ex: "Thomas2013"
  • url (string | null): Extracted URL, if web citation
    • Ex: "https://example.com/emily-bronte/..."
  • source_text (string | null): Extracted source text, if source download and extraction succeeded
    • Ex: "Emily Brontë avait deux sœurs ..."
  • source_code_content_type (string | null): Downloaded source code content type, if download succeeded and content-type header was received
    • Ex: "text/html"
    • Ex: "text/html; charset=ISO-8859-1"
  • source_code_num_bytes (integer | null): Not used
    • Ex: null
  • source_code_num_chars (integer | null): Size of downloaded source code in characters, if source download succeeded and code can be decoded as text
    • Ex: 100000
  • source_download_date (string | null): Datetime source code was downloaded from the web
    • Ex: "2023-12-03T10:50:40Z"
  • source_download_error (string | null): Source download error message, if there was an error
    • Note: If the download failed, the value of this field will be "Download is empty" regardless of the nature of the error.
    • Ex: null
    • Ex: "Download is empty"
  • source_extract_error (string | null): Source extraction error message, if there was an error
    • Ex: null
    • Ex: "Text is too short (50 words)"
    • Ex: "Exception: ..."
  • source_snippet (string | null): A relevant snippet from the source document, excerpted manually by a Wikipedia editor; stored in the quote field of the relevant citation templates.
    • Ex: "Emily Brontë avait deux sœurs"
  • source_quality_label (integer | null): An integer between 1 and 5 representing the predicted relevance and quality of the text extracted from the source page: 1 is irrelevant content like 404 text and paywalls, 2 is likely irrelevant or unreadable content like a list of headlines or mangled table, 3 is potentially relevant content like a book abstract, 4 is likely relevant content but with some quality issues, and 5 is relevant content that is well-formatted.
    • Ex: 4
  • source_quality_raw_score (number | null): The raw score output by the source quality regression model, generally between 0 and 1. The source quality label is computed from the raw score; the mapping is monotonic but non-linear.
    • Ex: 0.8

Example JSON

{
  "content": "<ref>{{Citation |last=Thomas |first=Darcy |year=2013 ...",
  "char_index": 39,
  "name": null,
  "url": "https://example.com/emily-bronte/...",
  "source_text": "Emily Brontë avait deux sœurs ...",
  "source_code_content_type": "text/html",
  "source_code_num_bytes": null,
  "source_code_num_chars": 100000,
  "source_download_date": "2023-12-03T10:50:40Z",
  "source_download_error": null,
  "source_extract_error": null,
  "source_snippet": "Emily Brontë avait deux sœurs",
  "source_quality_label": 4,
  "source_quality_raw_score": 0.8
}
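
The error fields support a simple breakdown of why source text is missing. The following sketch assigns each citation to one outcome and tallies the result; the category names and the function names scrape_outcome and outcome_counts are hypothetical, and the order of the checks is a design choice rather than documented behavior.

```python
from collections import Counter
from typing import Iterable


def scrape_outcome(citation: dict) -> str:
    """Classify a citation by how far scraping and extraction got."""
    if citation.get("url") is None:
        return "no_url"  # not a web citation
    if citation.get("source_download_error") is not None:
        return "download_error"
    if citation.get("source_extract_error") is not None:
        return "extract_error"
    if citation.get("source_text") is not None:
        return "ok"
    return "unknown"  # e.g. source text not rehydrated


def outcome_counts(citations: Iterable[dict]) -> Counter:
    """Tally outcomes over any iterable of citation objects."""
    return Counter(scrape_outcome(citation) for citation in citations)
```

Dividing the "ok" count by the number of citations that have a URL gives a scrape/extraction recall for whatever subset of the data was tallied.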

CitationNeeded

Fields

  • type (const string = "citation-needed"): Used to differentiate from other element types
  • content (string): Citation-needed element content
    • Ex: "{{Citation needed|date=September 2015}}"
  • char_index (integer): Character index of this citation-needed in the enclosing sentence or excerpt
    • Ex: 39

Example JSON

{
  "type": "citation-needed",
  "content": "{{Citation needed|date=September 2015}}",
  "char_index": 39
}

Code

Fields

  • type (const string = "code"): Used to differentiate from other element types
  • language (string | null): Code language (as used for syntax highlighting)
    • Ex: "cpp"
  • content (string): Code block content
    • Ex: "int main() { ..."

Example JSON

{
  "type": "code",
  "language": "cpp",
  "content": "int main() { ..."
}

ExcerptWithCitations

Fields

  • text (string): The text of three consecutive sentences from an article
    • Ex: "Les Hauts de Hurlevent est .... défis à la culture victorienne."
  • translated_text (string | null): English translation of the excerpt text, if not in English Wikipedia
    • Ex: "Wuthering Heights is .... challenges to Victorian culture."
  • citations (array[Citation]): Citation(s) appearing in the final sentence of this excerpt

Heading

Fields

  • type (const string = "heading"): Used to differentiate from other element types
  • text (string): Heading text
    • Ex: "Personnages"
  • translated_text (string | null): English translation of heading text, if not in English Wikipedia
    • Ex: "Characters"
  • level (integer): Heading level (1 being top-level/most general, 6 being bottom-level/most specific)
    • Ex: 2
  • citations (array[Citation]): Citations appearing in this heading
  • citations_needed (array[CitationNeeded]): Citation-needed elements appearing in this heading

Infobox

Fields

  • type (const string = "infobox"): Used to differentiate from other element types
  • content (string): Infobox content
    • Ex: "{{Infobox Livre\n| auteur = Emily Brontë\n...\n}}"

Example JSON

{
  "type": "infobox",
  "content": "{{Infobox Livre\n| auteur = Emily Brontë\n...\n}}"
}

Math

Fields

  • type (const string = "math"): Used to differentiate from other element types
  • content (string): Math block content
    • Ex: "\\sin 2\\pi x + \\ln e ..."

Example JSON

{
  "type": "math",
  "content": "\\sin 2\\pi x + \\ln e ..."
}

Paragraph

Fields

  • type (const string = "paragraph"): Used to differentiate from other element types
  • sentences (array[Sentence]): List of sentences in this paragraph

Preformatted

Fields

  • type (const string = "preformatted"): Used to differentiate from other element types
  • content (string): Preformatted block content
    • Ex: "____\n|DD|____T_\n|_ |_____|<\n  @-@-@-oo\\\n"

Example JSON

{
  "type": "preformatted",
  "content": "____\n|DD|____T_\n|_ |_____|<\n  @-@-@-oo\\\n"
}

Sentence

Fields

  • text (string): Sentence text content
    • Ex: "Les Hauts de Hurlevent est l'unique roman d'Emily Brontë."
  • translated_text (string | null): English translation of sentence text content, if not in English Wikipedia
    • Ex: "Wuthering Heights is the only novel by Emily Brontë."
  • trailing_whitespace (string): If the sentence was originally followed by whitespace, this will be a space. If the sentence was not followed by whitespace (for example, if it was followed by a quotation mark), this will be the empty string.
    • Ex: " "
    • Ex: ""
  • citations (array[Citation]): Citations appearing in this sentence
  • citations_needed (array[CitationNeeded]): Citation-needed elements appearing in this sentence

Table

Fields

  • type (const string = "table"): Used to differentiate from other element types
  • content (string): Table content
    • Ex: "{| class=\"wikitable\"\n|+ Personnages\n|-\n! Nom !! ...\n...\n|}"

Example JSON

{
  "type": "table",
  "content": "{| class=\"wikitable\"\n|+ Personnages\n|-\n! Nom !! ...\n...\n|}"
}