Standard Ebooks

Metadata

Metadata in a Standard Ebook is stored in the ./src/epub/content.opf file. The file contains some boilerplate that an ebook producer won’t have to touch, and a lot of information that they will have to touch as an ebook is produced.

Follow the general structure of the content.opf file generated by se create-draft. Don’t rearrange the order of anything in there.

General URL rules

  1. URLs used in metadata are https where possible.

  2. URLs used in metadata do not contain query strings, or if a query string is required, only contain the minimum necessary query string to render the base resource.

  3. URLs used for Project Gutenberg transcriptions look like: https://www.gutenberg.org/ebooks/<BOOK-ID>.

  4. URLs used for HathiTrust page scans look like: https://catalog.hathitrust.org/Record/<RECORD-ID>.

  5. URLs used for Google Books page scans look like: https://books.google.com/books?id=<BOOK-ID>.

  6. URLs used for Internet Archive page scans look like: https://archive.org/details/<BOOK-ID>.
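The first two rules above can be checked mechanically. The following is a minimal sketch, not part of any SE tool: the function name `check_metadata_url` and the table `REQUIRED_QUERY_KEYS` are hypothetical, and the table assumes that Google Books is the only listed source whose base resource needs a query string. Because the rule says "https where possible", a "not https" result is a prompt to double-check rather than a hard error.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical table: host -> query keys the base resource needs.
# Assumption: of the patterns listed above, only Google Books requires one (`id`).
REQUIRED_QUERY_KEYS = {"books.google.com": {"id"}}


def check_metadata_url(url: str) -> list[str]:
    """Return a list of possible rule violations for a metadata URL."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme != "https":
        problems.append("not https")
    allowed = REQUIRED_QUERY_KEYS.get(parts.hostname or "", set())
    # Any query key that is not needed to render the base resource is flagged.
    extra = [key for key, _ in parse_qsl(parts.query, keep_blank_values=True)
             if key not in allowed]
    if extra:
        problems.append("unneeded query parameters: " + ", ".join(extra))
    return problems


print(check_metadata_url("https://archive.org/details/worksofedgaralla01poeeuoft"))
```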

The ebook identifier

  1. The <dc:identifier> element contains the unique identifier for the ebook. The identifier is the Standard Ebooks URL for the ebook, prefaced by url:.

    <dc:identifier id="uid">url:https://standardebooks.org/ebooks/anton-chekhov/short-fiction/constance-garnett</dc:identifier>

Forming the SE URL

The SE URL is formed by the following algorithm.

(Note: Strings can be made URL-safe using the se make-url-safe tool.)

  • Start with the URL-safe author of the work, as it appears on the titlepage. If there is more than one author, continue appending subsequent URL-safe authors, separated by an underscore. Do not alpha-sort the author name.

  • Append a forward slash, then the URL-safe title of the work. Do not alpha-sort the title.

  • If the work is translated, append a forward slash, then the URL-safe translator. If there is more than one translator, continue appending subsequent URL-safe translators, separated by an underscore. Do not alpha-sort translator names.

  • If the work is illustrated, append a forward slash, then the URL-safe illustrator. If there is more than one illustrator, continue appending subsequent URL-safe illustrators, separated by an underscore. Do not alpha-sort illustrator names.

  • Finally, do not append a trailing forward slash.
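The algorithm above can be sketched in a few lines of Python. This is an illustration only: `url_safe` is a rough, hypothetical stand-in for the se make-url-safe tool (the real tool may treat punctuation such as apostrophes differently), and `build_se_url` is a hypothetical helper name.

```python
import re
import unicodedata


def url_safe(text: str) -> str:
    """Rough stand-in for `se make-url-safe`; the real tool may differ in details."""
    # Strip diacritics, lowercase, and collapse runs of other characters into hyphens.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def build_se_url(authors, title, translators=(), illustrators=()):
    """Join URL-safe parts in the given order: authors/title[/translators][/illustrators]."""
    segments = ["_".join(url_safe(name) for name in authors), url_safe(title)]
    for group in (translators, illustrators):
        if group:  # a segment is only appended when the role applies
            segments.append("_".join(url_safe(name) for name in group))
    # No trailing forward slash is appended.
    return "https://standardebooks.org/ebooks/" + "/".join(segments)


print(build_se_url(["Anton Chekhov"], "Short Fiction", ["Constance Garnett"]))
```

Names are passed in the order they should appear, since the rules say not to alpha-sort them.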

Publication date and release identifiers

There are several elements in the metadata describing the publication date, updated date, and revision number of the ebook. Generally these are not updated by hand; instead, the se prepare-release tool updates them automatically.

  1. <dc:date> is a timestamp representing the first publication date of this ebook file. Once the ebook is released to the public, this value doesn’t change.

  2. <meta property="dcterms:modified"> is a timestamp representing the last time this ebook file was modified. This changes often.

Book titles

Books without subtitles

  1. The <dc:title id="title"> element contains the title.

  2. The <meta property="file-as" refines="#title"> element contains the alpha-sorted title, even if the alpha-sorted title is identical to the unsorted title.

<dc:title id="title">The Moon Pool</dc:title>
<meta property="file-as" refines="#title">Moon Pool, The</meta>

<dc:title id="title">Short Fiction</dc:title>
<meta property="file-as" refines="#title">Short Fiction</meta>
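For simple English titles, the alpha-sorted form moves a leading article to the end. The sketch below shows that transformation only; `file_as_title` is a hypothetical helper, and it assumes that only the leading articles "The", "An", and "A" need handling. Titles in other languages or with unusual structure still need a manual decision.

```python
import re

# Assumption: only leading English articles are moved; nothing else is reordered.
LEADING_ARTICLE = re.compile(r"^(The|An|A)\s+(.+)$")


def file_as_title(title: str) -> str:
    """Return a file-as value such as 'Moon Pool, The' for 'The Moon Pool'."""
    match = LEADING_ARTICLE.match(title)
    if match is None:
        return title  # already sorts correctly, so the value is identical
    article, rest = match.groups()
    return f"{rest}, {article}"


print(file_as_title("The Moon Pool"))
```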

Books with subtitles

  1. The <meta property="title-type" refines="#title">main</meta> element identifies the main part of the title.

  2. A second <dc:title id="subtitle"> element contains the subtitle, and is refined with <meta property="title-type" refines="#subtitle">subtitle</meta>.

  3. A third <dc:title id="fulltitle"> element contains the complete title on one line, with the main title and subtitle separated by a colon and space, and is refined with <meta property="title-type" refines="#fulltitle">extended</meta>.

  4. All three <dc:title> elements have an accompanying <meta property="file-as"> element, even if the file-as value is the same as the title.

<dc:title id="title">The Man Who Was Thursday</dc:title>
<meta property="file-as" refines="#title">Man Who Was Thursday, The</meta>
<meta property="title-type" refines="#title">main</meta>
<dc:title id="subtitle">A Nightmare</dc:title>
<meta property="file-as" refines="#subtitle">Nightmare, A</meta>
<meta property="title-type" refines="#subtitle">subtitle</meta>
<dc:title id="fulltitle">The Man Who Was Thursday: A Nightmare</dc:title>
<meta property="file-as" refines="#fulltitle">Man Who Was Thursday, The</meta>
<meta property="title-type" refines="#fulltitle">extended</meta>

Books with a more popular alternate title

Some books are commonly referred to by a shorter name than their actual title. For example, The Adventures of Huckleberry Finn is often simply known as Huck Finn.

  1. The <dc:title id="title-short"> element contains the common title. It is refined with <meta property="title-type" refines="#title-short">short</meta> and <meta property="file-as">.

Books published with multiple titles

Some books may have been published under more than one official title. This is not the same as a book being more commonly known by a popular title. For example, The Mark of Zorro was originally serialized as The Curse of Capistrano.

  1. The <meta property="dcterms:alternative" refines="#title"> element contains the alternate title. It is not refined with <meta property="file-as">.

    <dc:title id="title">The Mark of Zorro</dc:title>
    <meta property="file-as" refines="#title">Mark of Zorro, The</meta>
    <meta property="dcterms:alternative" refines="#title">The Curse of Capistrano</meta>

Books with numbers or abbreviations in the title

Books that contain numbers or abbreviations in their title may be difficult to find with a search query, because there can be different ways to search for numbers or abbreviations. For example, a reader may search for Around the World in Eighty Days by searching for “80” instead of “eighty”.

  1. If a book title contains numbers or abbreviations, a <meta property="dcterms:alternate" refines="#title"> element is placed after the main title block, containing the title with expanded or alternate spelling to facilitate possible search queries.

    <dc:title id="title">Around the World in Eighty Days</dc:title>
    <meta property="file-as" refines="#title">Around the World in Eighty Days</meta>
    <meta property="dcterms:alternate" refines="#title">Around the World in 80 Days</meta>

    <dc:title id="title">File No. 113</dc:title>
    <meta property="file-as" refines="#title">File No. 113</meta>
    <meta property="dcterms:alternate" refines="#title">File Number One Hundred and Thirteen</meta>

Book subjects

The <dc:subject> element

<dc:subject> elements describe the categories the ebook belongs to.

  1. Each <dc:subject> has the id attribute set to subject-#, where # is a number starting at 1, without leading zeros, that increments with each subject.

  2. The <dc:subject> elements are arranged sequentially in a single block.

  3. <dc:subject> values are sourced from Library of Congress Subject Headings.

  4. If the transcription for the ebook comes from Project Gutenberg, the values of the <dc:subject> elements come from the Project Gutenberg “bibrec” page for the ebook. Otherwise, the values come from the Library of Congress catalog listing for the book.

  5. After the block of <dc:subject> elements there is a block of <meta property="authority" refines="#subject-N"> and <meta property="term" refines="#subject-N"> element pairs.

    1. <meta property="authority" refines="#subject-N"> contains the source for the category. For Library of Congress categories, the value is LCSH.

    2. <meta property="term" refines="#subject-N"> contains the term ID for that subject heading.

      1. For subject headings that are proper names, the Library of Congress uses the NACOAF identifier in place of a regular subject heading. In such cases, the terms are set to a NACOAF identifier (a string starting with n), but the authority is still set to LCSH.

      2. Term IDs of subject headings that do not have LCSH identifiers or NACOAF identifiers in the Library of Congress system are set to Unknown.

Examples

This example shows how to mark up the subjects for A Voyage to Arcturus, by David Lindsay:

<dc:subject id="subject-1">Science fiction</dc:subject>
<dc:subject id="subject-2">Psychological fiction</dc:subject>
<dc:subject id="subject-3">Quests (Expeditions) -- Fiction</dc:subject>
<dc:subject id="subject-4">Life on other planets -- Fiction</dc:subject>
<meta property="authority" refines="#subject-1">LCSH</meta>
<meta property="term" refines="#subject-1">sh85118629</meta>
<meta property="authority" refines="#subject-2">LCSH</meta>
<meta property="term" refines="#subject-2">sh85108438</meta>
<meta property="authority" refines="#subject-3">LCSH</meta>
<meta property="term" refines="#subject-3">sh2008110314</meta>
<meta property="authority" refines="#subject-4">LCSH</meta>
<meta property="term" refines="#subject-4">sh2008106912</meta>
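The numbering and pairing rules for subjects lend themselves to an automated check. The following sketch uses Python's standard xml.etree.ElementTree module; `check_subjects` and `SAMPLE` are hypothetical names, and the sample is a shortened copy of the example above with one term element deliberately left out. It checks only that ids run subject-1, subject-2, and so on, and that each subject has both an authority and a term; it does not verify the values against the Library of Congress.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"


def check_subjects(metadata: ET.Element) -> list[str]:
    """Check subject id numbering and the authority/term pair for each subject."""
    problems = []
    subjects = metadata.findall(f"{{{DC}}}subject")
    metas = metadata.findall(f"{{{OPF}}}meta")
    for number, subject in enumerate(subjects, start=1):
        expected_id = f"subject-{number}"
        if subject.get("id") != expected_id:
            problems.append(f"expected id {expected_id}, found {subject.get('id')}")
            continue
        for prop in ("authority", "term"):
            has_prop = any(
                m.get("property") == prop and m.get("refines") == f"#{expected_id}"
                for m in metas
            )
            if not has_prop:
                problems.append(f"{expected_id} has no {prop}")
    return problems


# Shortened sample; the term for subject-2 is intentionally missing.
SAMPLE = f"""<metadata xmlns="{OPF}" xmlns:dc="{DC}">
  <dc:subject id="subject-1">Science fiction</dc:subject>
  <dc:subject id="subject-2">Psychological fiction</dc:subject>
  <meta property="authority" refines="#subject-1">LCSH</meta>
  <meta property="term" refines="#subject-1">sh85118629</meta>
  <meta property="authority" refines="#subject-2">LCSH</meta>
</metadata>"""

print(check_subjects(ET.fromstring(SAMPLE)))
```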

SE subjects

Along with the Library of Congress categories, a set of SE subjects is included in the ebook metadata. Unlike Library of Congress categories, SE subjects are purposefully broad. They’re more like the subject categories in a small bookstore, as opposed to the precise, detailed, hierarchical Library of Congress categories.

  1. SE subjects are included with one or more <meta property="se:subject"> elements.

    <meta property="se:subject">Fantasy</meta>
    <meta property="se:subject">Philosophy</meta>

  2. There is at least one SE subject.

  3. SE subjects are in alphabetical order.

All SE subjects

  • Adventure

  • Autobiography

  • Biography

  • Children’s

  • Comedy

  • Drama

  • Fantasy

  • Fiction

  • Horror

  • Memoir

  • Mystery

  • Nonfiction

  • Philosophy

  • Poetry

  • Satire

  • Science Fiction

  • Shorts

  • Spirituality

  • Travel

Required SE subjects for specific types of books

  1. Ebooks that are collections of short stories have the SE subject Shorts as one of the SE subjects.

  2. Ebooks that are young adult or children’s books have the SE subject Children’s as one of the SE subjects.
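The SE subject rules can be expressed as a small validator. This is a sketch under stated assumptions: `check_se_subjects` and its `is_short_story_collection` flag are hypothetical, the allowed set is copied from the list above, and alphabetical order is approximated with Python's default string ordering.

```python
SE_SUBJECTS = {
    "Adventure", "Autobiography", "Biography", "Children’s", "Comedy", "Drama",
    "Fantasy", "Fiction", "Horror", "Memoir", "Mystery", "Nonfiction",
    "Philosophy", "Poetry", "Satire", "Science Fiction", "Shorts",
    "Spirituality", "Travel",
}


def check_se_subjects(subjects: list[str], is_short_story_collection: bool = False) -> list[str]:
    """Return a list of problems with an ebook's se:subject values."""
    problems = []
    if not subjects:
        problems.append("at least one SE subject is required")
    unknown = [s for s in subjects if s not in SE_SUBJECTS]
    if unknown:
        problems.append("unknown subjects: " + ", ".join(unknown))
    if subjects != sorted(subjects):
        problems.append("subjects are not in alphabetical order")
    if is_short_story_collection and "Shorts" not in subjects:
        problems.append("short story collections need the Shorts subject")
    return problems


print(check_se_subjects(["Fantasy", "Philosophy"]))
```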

Book descriptions

An ebook has two kinds of descriptions: a short <dc:description> element, and a much longer <meta property="se:long-description"> element.

The short description

The <dc:description> element contains a short, single-sentence summary of the ebook.

  1. The description is a single complete sentence ending in a period, not a sentence fragment or restatement of the title.

  2. The description summarizes the main theme or plot thread in the book, in an active voice, without using proper names.

    Incorrect: Sally the witch curses Bob Smith. He is turned in to a frog. His career as a barber is put on hold.
    Incorrect: This is a book about love, loss, and recovering from tragedy.
    Correct: An evil witch transforms a garrulous barber into a frog, putting his career on hold as he comes to grips with his new station in life.

  3. For collections, compilations, and omnibuses, a sentence fragment is acceptable as a description.

  4. The description is typogrified, i.e. it contains Unicode curly quotes, em dashes, and the like.
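Some of these rules can be approximated with simple text heuristics. The sketch below is illustrative only: `check_short_description` is a hypothetical helper, the single-sentence test is a crude pattern that will misfire on abbreviations such as "Mr.", and it cannot detect proper names, passive voice, or a restated title. Those still need a human reader, as does the exception for collections.

```python
import re


def check_short_description(text: str) -> list[str]:
    """Flag likely problems in a short description using rough heuristics."""
    problems = []
    if not text.endswith("."):
        problems.append("does not end in a period")
    # Heuristic: sentence-ending punctuation followed by more text suggests several sentences.
    if re.search(r"[.!?]\s+\S", text):
        problems.append("looks like more than one sentence")
    # Straight quotes or double hyphens suggest the text has not been typogrified.
    if '"' in text or "'" in text or "--" in text:
        problems.append("contains untypogrified quotes or dashes")
    return problems


print(check_short_description(
    "Sally the witch curses Bob Smith. He is turned in to a frog. "
    "His career as a barber is put on hold."
))
```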

The long description

The <meta property="se:long-description"> element contains a much longer description of the ebook.

  1. The long description is a non-biased, encyclopedia-like description of the book, including any relevant publication history, backstory, or historical notes. It is as detailed as possible without giving away plot spoilers. It does not impart the producer’s opinions of the book, or include content warnings. Think along the lines of a Wikipedia-like summary of the book and its history, but under no circumstances can a producer copy and paste from Wikipedia! (Wikipedia licenses articles under a CC license which is incompatible with Standard Ebooks’ CC0 public domain dedication.)

  2. The long description is typogrified, i.e. it contains Unicode curly quotes, em dashes, and the like.

  3. The long description is in escaped HTML, with the HTML beginning on its own line after the <meta property="se:long-description"> element.

  4. Long description HTML follows the general code style conventions.

  5. The first occurrence of the author’s name is linked to the Standard Ebooks author page. For example, for Arthur Conan Doyle this would look like <a href="https://standardebooks.org/ebooks/arthur-conan-doyle">Arthur Conan Doyle</a>. If the long description references other authors, books, or story collections that already have pages on Standard Ebooks, then the first occurrence of each is linked as well.

  6. The long description does not contain external links other than links to other Standard Ebooks books or authors.

Book language

  1. The <dc:language> element follows the long description block. It contains the IETF language tag for the language that the work is in.

  2. If a book contains files that are in a variety of languages or dialects, then <dc:language> is set to the predominant language of the book.

Book transcription and page scan sources

  1. The <dc:source> elements represent URLs to sources for the transcription the ebook is based on, and page scans of the print sources used to correct the transcriptions.

  2. <dc:source> URLs are in https where possible.

  3. A book can contain more than one such element if multiple sources for page scans were used.

    1. If the ebook is a collection in which different parts appear across different page scan sources (like a short story or poetry collection), an XML comment is included above each <dc:source> element specifying which part of the ebook is included in the following source URL.

      <!--The Man of the Crowd, Eleanora, The Oval Portrait-->
      <dc:source>https://archive.org/details/worksofedgaralla01poeeuoft</dc:source>
      <!--The Gold-Bug, A Tale of the Rugged Mountains, Mesmeric Revelation-->
      <dc:source>https://archive.org/details/worksofedgaralla02poeeuoft</dc:source>

Additional book metadata

  1. <meta property="se:url.encyclopedia.wikipedia"> contains the English Wikipedia URL for the book. This element is not present if there is no English Wikipedia entry for the book.

  2. <meta property="se:url.vcs.github"> contains the SE GitHub URL for this ebook. This is calculated by taking the string https://github.com/standardebooks/ and appending the SE identifier, without https://standardebooks.org/ebooks/, and with forward slashes replaced by underscores.

  3. <meta property="belongs-to-collection" id="collection-N"> contains the name of the collection the ebook belongs to.

    1. The value for this element must be the same for all ebooks in the collection.

    2. The id attribute is collection-N where N is a positive integer starting at 1.

    3. The element is further refined by a <meta property="collection-type" refines="#collection-N"> element with the value of set or series. See the EPUB spec for more details.

  4. <meta property="se:is-a-collection">true</meta> is present if the ebook is a collection of items which a reader may wish to search by item title. For example, this would include ebooks like a collection of short stories, or a collection of short works by an author from antiquity.
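The GitHub URL calculation described for se:url.vcs.github can be sketched as a short function. `github_url` is a hypothetical helper name, and the code assumes the identifier is passed in the url:-prefixed form used in <dc:identifier>.

```python
SE_PREFIX = "https://standardebooks.org/ebooks/"
GITHUB_PREFIX = "https://github.com/standardebooks/"


def github_url(identifier: str) -> str:
    """Derive the SE GitHub URL from an SE identifier."""
    url = identifier.removeprefix("url:")
    if not url.startswith(SE_PREFIX):
        raise ValueError(f"not an SE identifier: {identifier}")
    # Drop the site prefix, then replace forward slashes with underscores.
    return GITHUB_PREFIX + url[len(SE_PREFIX):].replace("/", "_")


print(github_url(
    "url:https://standardebooks.org/ebooks/anton-chekhov/short-fiction/constance-garnett"
))
```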

Book production notes

  1. The <meta property="se:production-notes"> element contains any of the ebook producer’s production notes. For example, the producer might note that page scans were not available, so an editorial decision was made to add commas to sentences deemed to be transcription typos; or that certain archaic spellings were retained as a matter of prose style specific to this ebook.

  2. The <meta property="se:production-notes"> element is not present if there are no production notes.

Readability metadata

These two elements are automatically computed by the se prepare-release tool.

  1. The <meta property="se:word-count"> element contains an integer representing the ebook’s total word count, excluding some SE files like the colophon and Uncopyright.

  2. The <meta property="se:reading-ease.flesch"> element contains a decimal representing the computed Flesch reading ease for the book.
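For reference, the standard Flesch reading ease formula can be computed from three counts. This sketch shows the published formula only; how se prepare-release counts words, sentences, and syllables, and whether it rounds the result, is not described here, so the function `flesch_reading_ease` and its inputs are assumptions for illustration.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch reading ease score from raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical counts: 100 words in 5 sentences, with 150 syllables in total.
print(round(flesch_reading_ease(100, 5, 150), 3))
```

Higher scores indicate text that is easier to read.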

General contributor rules

The following apply to all contributors, including the author(s), translator(s), and illustrator(s).

  1. If there is exactly one contributor in a set (for example, only one author in a possible set of authors, or only one translator in a possible set of translators) then the <meta property="display-seq"> element is omitted for that contributor.

  2. If there is more than one contributor in a set (for example, multiple authors, or multiple translators) then the <meta property="display-seq"> element is specified for each contributor in that set, with a value equal to their position in the SE identifier.

  3. The EPUB spec specifies that in a set of contributors, if at least one has the display-seq property, then other contributors in the set without the display-seq value are ignored. For SE purposes, this also means they will be excluded from the SE identifier.

  4. By SE convention, contributors with <meta property="display-seq">0</meta> are excluded from the SE identifier.

  5. It is not uncommon for one contributor to have multiple roles; for example, an author (aut) who also illustrated (ill) the book. In these cases, additional roles are assigned using additional role properties.

Example

<dc:creator id="author">Jonathan Swift</dc:creator>
...
<meta property="role" refines="#author" scheme="marc:relators">aut</meta>
<meta property="role" refines="#author" scheme="marc:relators">ill</meta>
<meta property="role" refines="#author" scheme="marc:relators">win</meta>
<meta property="role" refines="#author" scheme="marc:relators">wpr</meta>
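The display-seq rules determine which contributors in a set appear in the SE identifier, and in what order. The sketch below models one set (for example, all translators) with a hypothetical `Contributor` class; the names used in the usage line are placeholders, not real contributors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contributor:
    name: str
    display_seq: Optional[int] = None  # None means the display-seq element is omitted


def identifier_contributors(group: list[Contributor]) -> list[str]:
    """Names from one contributor set that appear in the SE identifier, in order."""
    # With no display-seq in the set, nobody is excluded (normally a single contributor).
    if all(c.display_seq is None for c in group):
        return [c.name for c in group]
    # Otherwise, contributors without display-seq are ignored and those with 0 are excluded.
    ranked = [c for c in group if c.display_seq is not None and c.display_seq > 0]
    return [c.name for c in sorted(ranked, key=lambda c: c.display_seq)]


print(identifier_contributors([
    Contributor("Second Translator", 2),
    Contributor("First Translator", 1),
    Contributor("Byline Collaborator", 0),
]))
```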

The author metadata block

  1. <dc:creator id="author"> contains the author’s name as it appears on the cover.

  2. If there is more than one author, the first author’s id is author-1, the second author-2, and so on.

  3. <meta property="file-as" refines="#author"> contains the author’s name as filed alphabetically. This element is included even if it’s identical to <dc:creator>.

  4. <meta property="se:name.person.full-name" refines="#author"> contains the author’s full name, with any initials or middle names expanded, and including any titles. If the author uses a pseudonym, then this should be the full pseudonym, not the author’s real name. This element is not included if the value is identical to <dc:creator>.

  5. <meta property="alternate-script" refines="#author"> contains the author’s name as it appears on the cover, but transliterated into their native alphabet if applicable. For example, Anton Chekhov’s name would be contained here in the Cyrillic alphabet. This element is not included if not applicable.

  6. <meta property="se:url.encyclopedia.wikipedia" refines="#author"> contains the URL of the author’s English Wikipedia page. This element is not included if there is no English Wikipedia page.

  7. <meta property="se:url.authority.nacoaf" refines="#author"> contains the URI of the author’s Library of Congress Names Database page. It uses a plain http: prefix, and does not include the .html file extension. This element is not included if there is no LoC Names database entry.

  8. <meta property="role" refines="#author" scheme="marc:relators"> contains the MARC relator tag for the roles the author played in creating this book.

This example shows a complete author metadata block for Short Fiction, by Anton Chekhov:

<dc:creator id="author">Anton Chekhov</dc:creator>
<meta property="file-as" refines="#author">Chekhov, Anton</meta>
<meta property="se:name.person.full-name" refines="#author">Anton Pavlovich Chekhov</meta>
<meta property="alternate-script" refines="#author">Анто́н Па́влович Че́хов</meta>
<meta property="se:url.encyclopedia.wikipedia" refines="#author">https://en.wikipedia.org/wiki/Anton_Chekhov</meta>
<meta property="se:url.authority.nacoaf" refines="#author">http://id.loc.gov/authorities/names/n79130807</meta>
<meta property="role" refines="#author" scheme="marc:relators">aut</meta>
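Library of Congress URIs copied from a browser often carry an https: prefix and an .html extension, neither of which belongs in se:url.authority.nacoaf. A small normalizer might look like the following; `normalize_nacoaf_uri` is a hypothetical helper, not part of the SE toolset.

```python
def normalize_nacoaf_uri(uri: str) -> str:
    """Force the plain http: prefix and drop a trailing .html extension."""
    if uri.startswith("https://"):
        uri = "http://" + uri[len("https://"):]
    return uri.removesuffix(".html")


print(normalize_nacoaf_uri("https://id.loc.gov/authorities/names/n79130807.html"))
```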

The translator metadata block

  1. If the work is translated, the <dc:contributor id="translator"> metadata block follows the author metadata block.

  2. If there is more than one translator, then the first translator’s id is translator-1, the second translator-2, and so on.

  3. Each block is identical to the author metadata block, but with <dc:contributor id="translator"> instead of <dc:creator id="author">.

  4. The MARC relator tag is trl: <meta property="role" refines="#translator" scheme="marc:relators">trl</meta>.

  5. Translators often annotate the work; if this is the case, the additional MARC relator tag ann is included in a separate <meta property="role" refines="#translator" scheme="marc:relators"> element.

The illustrator metadata block

  1. If the work is illustrated by a person who is not the author, the illustrator metadata block follows.

  2. If there is more than one illustrator, the first illustrator’s id is illustrator-1, the second illustrator-2, and so on.

  3. Each block is identical to the author metadata block, but with <dc:contributor id="illustrator"> instead of <dc:creator id="author">.

  4. The MARC relator tag is ill: <meta property="role" refines="#illustrator" scheme="marc:relators">ill</meta>.

The cover artist metadata block

The “cover artist” is the artist who painted the art the producer selected for the Standard Ebook cover.

  1. The cover artist metadata block is identical to the author metadata block, but with <dc:contributor id="artist"> instead of <dc:creator id="author">.

  2. The MARC relator tag is art: <meta property="role" refines="#artist" scheme="marc:relators">art</meta>.

Metadata for additional contributors

Occasionally a book may have other contributors besides the author, translator, and illustrator; for example, a person who wrote a preface, an introduction, or who edited the work or added endnotes.

  1. Additional contributor blocks are identical to the author metadata block, but with <dc:contributor> instead of <dc:creator>.

  2. The id attribute of the <dc:contributor> is the lowercase, URL-safe, fully-spelled out version of the MARC relator tag. For example, if the MARC relator tag is wpr, the id attribute would be writer-of-preface.

  3. The MARC relator tag is one that is appropriate for the role of the additional contributor. Common roles for ebooks are: wpr, ann, and aui.

  4. If a contributor is a collaborator on part of the book, for example if they share a byline on a short story, the ctb MARC relator tag is used, and the contributor is given display-seq set to 0 to prevent them from appearing in the book’s overall byline.

Transcriber metadata

  1. If the ebook is based on a transcription by someone else, like Project Gutenberg, then transcriber blocks follow the general contributor metadata blocks.

  2. If the transcriber is anonymous, the value for the transcriber’s <dc:contributor> element is Anonymous.

  3. If there is more than one transcriber, the first transcriber is transcriber-1, the second transcriber-2, and so on.

  4. The <meta property="file-as" refines="#transcriber-1"> element contains an alpha-sorted representation of the transcriber’s name.

  5. The MARC relator tag is trc: <meta property="role" refines="#transcriber-1" scheme="marc:relators">trc</meta>.

  6. If the transcriber’s personal homepage is known, the element <meta property="se:url.homepage" refines="#transcriber-1"> is included, whose value is the URL of the transcriber’s homepage. The URL must link to a personal homepage only; no products, services, or other endorsements, commercial or otherwise.

Sponsor metadata

  1. If an ebook has a financial sponsor, then the sponsor block follows the transcriber block.

  2. The <meta property="file-as" refines="#sponsor"> element contains an alpha-sorted representation of the sponsor’s name.

  3. The MARC relator tag is spn: <meta property="role" refines="#sponsor" scheme="marc:relators">spn</meta>.

  4. If the sponsor’s homepage is known, the element <meta property="se:url.homepage" refines="#sponsor"> is included, whose value is the URL of the sponsor’s homepage. Since sponsors may be corporations, linking to a corporate homepage is permitted; however, linking to a specific product or service is disallowed.

Producer metadata

These elements describe the SE producer who produced the ebook for the Standard Ebooks project.

  1. Producer names must sound like complete real names, i.e., they must have at least a first initial and full last name. Anonymous producers are allowed, and if the producer is anonymous then the value for the producer’s <dc:contributor> element is Anonymous.

  2. If there is more than one producer, the first producer is producer-1, the second producer-2, and so on.

  3. The producer metadata block is identical to the author metadata block, but with <dc:contributor id="producer-1"> instead of <dc:creator id="author">.

  4. If the producer’s personal homepage is known, the element <meta property="se:url.homepage" refines="#producer-1"> is included, whose value is the URL of the producer’s homepage. The URL must link to a personal homepage only; no products, services, or other endorsements, commercial or otherwise.

  5. The MARC relator tags for the SE producer usually include all of the following:

    • bkp: The producer produced the ebook.

    • blw: The producer wrote the blurb (the long description).

    • cov: The producer selected the cover art.

    • mrk: The producer wrote the HTML markup for the ebook.

    • pfr: The producer proofread the ebook.

    • tyg: The producer reviewed the typography of the ebook.

The ebook manifest

The <manifest> element is a required part of the EPUB spec that defines a list of files within the ebook.

  1. The manifest is in alphabetical order.

  2. The id attribute is the basename of the href attribute.

  3. Files which contain SVG images have the additional properties attribute with the value svg in their manifest item.

  4. The manifest item for the table of contents file has the additional properties attribute with the value nav.

  5. The manifest item for the cover image has the additional properties attribute with the value cover-image.
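The first two manifest rules can be checked with a few lines of code. This sketch assumes that "alphabetical order" means ordering by the href attribute, which the rules above do not spell out; `check_manifest` is a hypothetical helper, and the file names in the usage example are placeholders.

```python
import posixpath


def check_manifest(items: list[dict]) -> list[str]:
    """Check manifest ordering (assumed to be by href) and id/basename agreement."""
    problems = []
    hrefs = [item["href"] for item in items]
    if hrefs != sorted(hrefs):
        problems.append("manifest is not in alphabetical order")
    for item in items:
        # The id attribute should equal the final path component of the href.
        if item["id"] != posixpath.basename(item["href"]):
            problems.append(f"id {item['id']} does not match href {item['href']}")
    return problems


print(check_manifest([
    {"id": "core.css", "href": "css/core.css"},
    {"id": "cover.svg", "href": "images/cover.svg"},
    {"id": "toc.xhtml", "href": "toc.xhtml"},
]))
```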

The ebook spine

The <spine> element is a required part of the EPUB spec that defines the reading order of the files in the ebook.

Accessibility metadata

Accessibility metadata is added to bring the final ebook into conformance with the EPUB Accessibility spec, with the following considerations.

  1. Accessibility metadata is arranged in the metadata file in groups by property, with items in each group ordered by their text values. The groups appear in this order:

    1. a11y:certifiedBy

    2. schema:accessMode

    3. schema:accessModeSufficient

    4. schema:accessibilityFeature

    5. schema:accessibilityHazard

    6. schema:accessibilitySummary

  2. If the ebook has images not including the cover, titlepage, and publisher logo, then the following metadata is included in addition to any boilerplate accessibility metadata:

    <meta property="schema:accessMode">visual</meta>
    <meta property="schema:accessibilityFeature">alternativeText</meta>

    The cover, titlepage, and publisher logo are ignored because they are present in most ebooks; if they were to be counted in accessibility metadata, that metadata would appear in most ebooks, making it meaningless.
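The grouping rule for accessibility metadata amounts to a two-level sort: first by the property's position in the group order, then by text value. A minimal sketch, assuming each entry is represented as a (property, text) pair and using the hypothetical names `A11Y_GROUP_ORDER` and `sort_accessibility_metadata`:

```python
A11Y_GROUP_ORDER = [
    "a11y:certifiedBy",
    "schema:accessMode",
    "schema:accessModeSufficient",
    "schema:accessibilityFeature",
    "schema:accessibilityHazard",
    "schema:accessibilitySummary",
]


def sort_accessibility_metadata(entries: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (property, text) pairs by group order, then by text value within a group."""
    # A property outside the known groups raises ValueError rather than being placed silently.
    return sorted(entries, key=lambda entry: (A11Y_GROUP_ORDER.index(entry[0]), entry[1]))


print(sort_accessibility_metadata([
    ("schema:accessibilityFeature", "alternativeText"),
    ("schema:accessMode", "visual"),
    ("schema:accessMode", "textual"),
]))
```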