Metadata
Metadata in a Standard Ebook is stored in the ./src/epub/content.opf file. The file contains some boilerplate that an ebook producer won’t have to touch, and a lot of information that they will have to touch as an ebook is produced.
Follow the general structure of the content.opf file generated by se create-draft. Don’t rearrange the order of anything in there.
General URL rules
URLs used in metadata are https where possible.
URLs used in metadata do not contain query strings, or if a query string is required, only contain the minimum necessary query string to render the base resource.
URLs used for Project Gutenberg page scans look like: https://www.gutenberg.org/ebooks/<BOOK-ID>.
URLs used for HathiTrust page scans look like: https://catalog.hathitrust.org/Record/<RECORD-ID>.
URLs used for Google Books page scans look like: https://books.google.com/books?id=<BOOK-ID>.
URLs used for Internet Archive page scans look like: https://archive.org/details/<BOOK-ID>.
The ebook identifier
The <dc:identifier> element contains the unique identifier for the ebook. The identifier is the Standard Ebooks URL for the ebook, prefaced by url:.
Forming the SE URL
The SE URL is formed by the following algorithm.
(Note: Strings can be made URL-safe using the se make-url-safe tool.)
Start with the URL-safe author of the work, as it appears on the titlepage. If there is more than one author, continue appending subsequent URL-safe authors, separated by an underscore. Do not alpha-sort the author name.
Append a forward slash, then the URL-safe title of the work. Do not alpha-sort the title.
If the work is translated, append a forward slash, then the URL-safe translator. If there is more than one translator, continue appending subsequent URL-safe translators, separated by an underscore. Do not alpha-sort translator names.
If the work is illustrated, append a forward slash, then the URL-safe illustrator. If there is more than one illustrator, continue appending subsequent URL-safe illustrators, separated by an underscore. Do not alpha-sort illustrator names.
Finally, do not append a trailing forward slash.
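The steps above can be sketched in Python. This is an illustration only: url_safe is a rough, hypothetical stand-in for the se make-url-safe tool (whose exact rules may differ), the base URL is the https://standardebooks.org/ebooks/ prefix mentioned elsewhere in this section, and the names in the usage line are placeholders rather than a real ebook.

```python
import re
import unicodedata

# Prefix of SE ebook URLs, as referenced elsewhere in this section.
SE_BASE = "https://standardebooks.org/ebooks/"


def url_safe(text):
    """Approximate a URL-safe string (assumption: not the real tool's rules)."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def se_url(authors, title, translators=(), illustrators=()):
    """Join the URL-safe parts in the order the steps above describe."""
    segments = [list(authors), [title], list(translators), list(illustrators)]
    # Names within one segment are joined by underscores, in the given order.
    parts = ["_".join(url_safe(name) for name in seg) for seg in segments if seg]
    return SE_BASE + "/".join(parts)  # no trailing forward slash


def se_identifier(authors, title, translators=(), illustrators=()):
    """The <dc:identifier> value is the SE URL prefaced by "url:"."""
    return "url:" + se_url(authors, title, translators, illustrators)


# Placeholder names, for illustration only.
print(se_identifier(["Jane Doe", "John Roe"], "An Example Title", translators=["A. N. Other"]))
```

Because contributors are never alpha-sorted, the caller passes each list in the order it should appear in the URL.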
Publication date and release identifiers
There are several elements in the metadata describing the publication date, updated date, and revision number of the ebook. Generally these are not updated by hand; instead, the se prepare-release tool updates them automatically.
<dc:date> is a timestamp representing the first publication date of this ebook file. Once the ebook is released to the public, this value doesn’t change.
<meta property="dcterms:modified"> is a timestamp representing the last time this ebook file was modified. This changes often.
Book titles
Books without subtitles
The <dc:title id="title"> element contains the title.
The <meta property="file-as" refines="#title"> element contains the alpha-sorted title, even if the alpha-sorted title is identical to the unsorted title.
Books with subtitles
The <meta property="title-type" refines="#title">main</meta> element identifies the main part of the title.
A second <dc:title id="subtitle"> element contains the subtitle, and is refined with <meta property="title-type" refines="#subtitle">subtitle</meta>.
A third <dc:title id="fulltitle"> element contains the complete title on one line, with the main title and subtitle separated by a colon and space, and is refined with <meta property="title-type" refines="#fulltitle">extended</meta>.
All three <dc:title> elements have an accompanying <meta property="file-as"> element, even if the file-as value is the same as the title.
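As a rough illustration of how the three title elements fit together, the following Python sketch prints a title block for a book with a subtitle. The titles are placeholders, the alpha-sorted (file-as) values are simply supplied by the caller, and the order of elements within each group is an assumption rather than something this section specifies.

```python
from xml.sax.saxutils import escape


def title_block(main, subtitle, main_sort, subtitle_sort, full_sort):
    """Return the title elements for a book with a subtitle.

    The *_sort arguments are the alpha-sorted (file-as) forms.
    """
    full = f"{main}: {subtitle}"  # main title, colon, space, subtitle
    rows = [
        ("title", main, "main", main_sort),
        ("subtitle", subtitle, "subtitle", subtitle_sort),
        ("fulltitle", full, "extended", full_sort),
    ]
    lines = []
    for elem_id, text, title_type, file_as in rows:
        lines.append(f'<dc:title id="{elem_id}">{escape(text)}</dc:title>')
        # file-as is always present, even when identical to the title.
        lines.append(f'<meta property="file-as" refines="#{elem_id}">{escape(file_as)}</meta>')
        lines.append(f'<meta property="title-type" refines="#{elem_id}">{title_type}</meta>')
    return lines


# Placeholder values, for illustration only.
for line in title_block(
    "Example Title",
    "Notes on a Hypothetical Book",
    "Example Title",
    "Notes on a Hypothetical Book",
    "Example Title: Notes on a Hypothetical Book",
):
    print(line)
```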
Books with a more popular alternate title
Some books are commonly referred to by a shorter name than their actual title. For example, The Adventures of Huckleberry Finn is often simply known as Huck Finn.
The <dc:title id="title-short"> element contains the common title. It is refined with <meta property="title-type" refines="#title-short">short</meta> and <meta property="file-as">.
Books published with multiple titles
Some books may have been published under more than one official title. This is not the same as a book being more commonly known by a popular title. For example, The Mark of Zorro was originally serialized as The Curse of Capistrano.
The <meta property="dcterms:alternative" refines="#title"> element contains the alternate title. It is not refined with <meta property="file-as">.
Books with numbers or abbreviations in the title
Books that contain numbers or abbreviations in their title may be difficult to find with a search query, because there can be different ways to search for numbers or abbreviations. For example, a reader may search for Around the World in Eighty Days by searching for “80” instead of “eighty”.
If a book title contains numbers or abbreviations, a <meta property="dcterms:alternate" refines="#title"> element is placed after the main title block, containing the title with expanded or alternate spelling to facilitate possible search queries.
Book subjects
The <dc:subject> element
<dc:subject> elements describe the categories the ebook belongs to.
Each <dc:subject> has the id attribute set to subject-#, where # is a number starting at 1, without leading zeros, that increments with each subject.
The <dc:subject> elements are arranged sequentially in a single block.
<dc:subject> values are sourced from Library of Congress Subject Headings.
If the transcription for the ebook comes from Project Gutenberg, the values of the <dc:subject> elements come from the Project Gutenberg “bibrec” page for the ebook. Otherwise, the values come from the Library of Congress catalog listing for the book.
After the block of <dc:subject> elements there is a block of <meta property="authority" refines="#subject-N"> and <meta property="term" refines="#subject-N"> element pairs.
<meta property="authority" refines="#subject-N"> contains the source for the category. For Library of Congress categories, the value is LCSH.
<meta property="term" refines="#subject-N"> contains the term ID for that subject heading.
For subject headings that are proper names, the Library of Congress uses the NACOAF identifier in place of a regular subject heading. In such cases, the terms are set to a NACOAF identifier (a string starting with n), but the authority is still set to LCSH.
Term IDs of subject headings that do not have LCSH identifiers or NACOAF identifiers in the Library of Congress system are set to Unknown.
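The numbering and pairing rules above can be sketched in Python. The headings and term IDs in the usage lines are made-up placeholders, not real Library of Congress data, and the sketch assumes every subject uses the LCSH authority.

```python
from xml.sax.saxutils import escape


def subject_block(subjects):
    """Build the subject block.

    subjects is a list of (heading, term_id) pairs; term_id is None when
    no LCSH or NACOAF identifier is known.
    """
    elements = []
    refinements = []
    for n, (heading, term_id) in enumerate(subjects, start=1):
        # IDs count up from 1 with no leading zeros.
        elements.append(f'<dc:subject id="subject-{n}">{escape(heading)}</dc:subject>')
        refinements.append(f'<meta property="authority" refines="#subject-{n}">LCSH</meta>')
        refinements.append(f'<meta property="term" refines="#subject-{n}">{term_id or "Unknown"}</meta>')
    # All <dc:subject> elements first, then the authority/term pairs.
    return elements + refinements


# Placeholder headings and IDs, for illustration only.
for line in subject_block([("Example heading", "sh00000000"), ("Heading without an ID", None)]):
    print(line)
```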
Examples
This example shows how to mark up the subjects for A Voyage to Arcturus, by David Lindsay:
SE subjects
Along with the Library of Congress categories, a set of SE subjects is included in the ebook metadata. Unlike Library of Congress categories, SE subjects are purposefully broad. They’re more like the subject categories in a small bookstore, as opposed to the precise, detailed, hierarchical Library of Congress categories.
SE subjects are included with one or more <meta property="se:subject"> elements.
There is at least one SE subject.
SE subjects are in alphabetical order.
All SE subjects
Adventure
Autobiography
Biography
Children’s
Comedy
Drama
Fantasy
Fiction
Horror
Memoir
Mystery
Nonfiction
Philosophy
Poetry
Satire
Science Fiction
Shorts
Spirituality
Travel
Required SE subjects for specific types of books
Ebooks that are collections of short stories have the SE subject Shorts as one of the SE subjects.
Ebooks that are young adult or children’s books have the SE subject Children’s as one of the SE subjects.
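The SE subject rules lend themselves to a small check. The following Python sketch is one possible way to express them; the function name and flags are hypothetical, and it assumes that “alphabetical order” means a plain string sort.

```python
# The full list of SE subjects from this section.
SE_SUBJECTS = {
    "Adventure", "Autobiography", "Biography", "Children’s", "Comedy",
    "Drama", "Fantasy", "Fiction", "Horror", "Memoir", "Mystery",
    "Nonfiction", "Philosophy", "Poetry", "Satire", "Science Fiction",
    "Shorts", "Spirituality", "Travel",
}


def check_se_subjects(subjects, is_short_story_collection=False, is_childrens=False):
    """Return a list of problems found; an empty list means the rules are met."""
    problems = []
    if not subjects:
        problems.append("at least one SE subject is required")
    for subject in subjects:
        if subject not in SE_SUBJECTS:
            problems.append(f"unknown SE subject: {subject}")
    if list(subjects) != sorted(subjects):
        problems.append("SE subjects are not in alphabetical order")
    if is_short_story_collection and "Shorts" not in subjects:
        problems.append("short story collections need the Shorts subject")
    if is_childrens and "Children’s" not in subjects:
        problems.append("young adult or children’s books need the Children’s subject")
    return problems


print(check_se_subjects(["Fiction", "Shorts"], is_short_story_collection=True))
print(check_se_subjects(["Shorts", "Fiction"]))
```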
Book descriptions
An ebook has two kinds of descriptions: a short <dc:description> element, and a much longer <meta property="se:long-description"> element.
The short description
The <dc:description> element contains a short, single-sentence summary of the ebook.
The description is a single complete sentence ending in a period, not a sentence fragment or restatement of the title.
The description summarizes the main theme or plot thread in the book, in an active voice, without using proper names.
For collections, compilations, and omnibuses, a sentence fragment is acceptable as a description.
The description is typogrified, i.e. it contains Unicode curly quotes, em dashes, and the like.
The long description
The <meta property="se:long-description"> element contains a much longer description of the ebook.
The long description is a non-biased, encyclopedia-like description of the book, including any relevant publication history, backstory, or historical notes. It is as detailed as possible without giving away plot spoilers. It does not impart the producer’s opinions of the book, or include content warnings. Think along the lines of a Wikipedia-like summary of the book and its history, but under no circumstances can a producer copy and paste from Wikipedia! (Wikipedia licenses articles under a CC license which is incompatible with Standard Ebooks’ CC0 public domain dedication.)
The long description is typogrified, i.e. it contains Unicode curly quotes, em dashes, and the like.
The long description is in escaped HTML, with the HTML beginning on its own line after the <meta property="se:long-description"> element.
Long description HTML follows the general code style conventions.
The first occurrence of the author’s name is linked to the Standard Ebooks author page. For example, for Arthur Conan Doyle this would look like <a href="https://standardebooks.org/ebooks/arthur-conan-doyle">Arthur Conan Doyle</a>. If the long description references other authors, books, or story collections that already have pages on Standard Ebooks, then the first occurrence of each is linked as well.
The long description does not contain external links other than links to other Standard Ebooks books or authors.
Book language
The <dc:language> element follows the long description block. It contains the IETF language tag for the language that the work is in.
If a book contains files that are in a variety of languages or dialects, then <dc:language> is set to the predominant language of the book.
Book transcription and page scan sources
The <dc:source> elements represent URLs to sources for the transcription the ebook is based on, and page scans of the print sources used to correct the transcriptions.
<dc:source> URLs are in https where possible.
A book can contain more than one such element if multiple sources for page scans were used.
If the ebook is a collection in which different parts appear across different page scan sources (like a short story or poetry collection), an XML comment is included above each <dc:source> element specifying which part of the ebook is included in the following source URL.
Additional book metadata
<meta property="se:url.encyclopedia.wikipedia"> contains the English Wikipedia URL for the book. This element is not present if there is no English Wikipedia entry for the book.
<meta property="se:url.vcs.github"> contains the SE GitHub URL for this ebook. This is calculated by taking the string https://github.com/standardebooks/ and appending the SE identifier, without https://standardebooks.org/ebooks/, and with forward slashes replaced by underscores.
<meta property="belongs-to-collection" id="collection-N"> contains the name of the collection the ebook belongs to.
The value for this element must be the same for all ebooks in the collection.
The id attribute is collection-N where N is a positive integer starting at 1.
The element is further refined by a <meta property="collection-type" refines="#collection-N"> element with the value of set or series. See the EPUB spec for more details.
<meta property="se:is-a-collection">true</meta> is present if the ebook is a collection of items which a reader may wish to search by item title. For example, this would include ebooks like a collection of short stories, or a collection of short works by an author from antiquity.
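The GitHub URL calculation described above is mechanical enough to sketch in Python. The function names are hypothetical, the identifier in the usage line is a placeholder, and the sketch assumes the identifier may or may not carry its url: prefix.

```python
SE_BASE = "https://standardebooks.org/ebooks/"
GITHUB_BASE = "https://github.com/standardebooks/"


def strip_prefix(text, prefix):
    """Remove prefix from text if present; otherwise return text unchanged."""
    return text[len(prefix):] if text.startswith(prefix) else text


def github_url(identifier):
    """Derive the se:url.vcs.github value from the SE identifier."""
    path = strip_prefix(strip_prefix(identifier, "url:"), SE_BASE)
    # Forward slashes in the remaining path become underscores.
    return GITHUB_BASE + path.replace("/", "_")


# Placeholder identifier, for illustration only.
print(github_url("url:https://standardebooks.org/ebooks/jane-doe/an-example-title/a-n-other"))
```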
Book production notes
The <meta property="se:production-notes"> element contains any of the ebook producer’s production notes. For example, the producer might note that page scans were not available, so an editorial decision was made to add commas to sentences deemed to be transcription typos; or that certain archaic spellings were retained as a matter of prose style specific to this ebook.
The <meta property="se:production-notes"> element is not present if there are no production notes.
Readability metadata
These two elements are automatically computed by the se prepare-release tool.
The <meta property="se:word-count"> element contains an integer representing the ebook’s total word count, excluding some SE files like the colophon and Uncopyright.
The <meta property="se:reading-ease.flesch"> element contains a decimal representing the computed Flesch reading ease for the book.
General contributor rules
The following apply to all contributors, including the author(s), translator(s), and illustrator(s).
If there is exactly one contributor in a set (for example, only one author in a possible set of authors, or only one translator in a possible set of translators) then the <meta property="display-seq"> element is omitted for that contributor.
If there is more than one contributor in a set (for example, multiple authors, or multiple translators) then the <meta property="display-seq"> element is specified for each contributor in that set, with a value equal to their position in the SE identifier.
The EPUB spec specifies that in a set of contributors, if at least one has the display-seq property, then other contributors in the set without the display-seq value are ignored. For SE purposes, this also means they will be excluded from the SE identifier.
By SE convention, contributors with <meta property="display-seq">0</meta> are excluded from the SE identifier.
It is not uncommon for one contributor to have multiple roles; for example, an author (aut) who also illustrated (ill) the book. In these cases, additional roles are assigned using additional role properties.
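The display-seq rules, together with the id numbering described in the contributor blocks later in this section, can be sketched as follows. The function name is hypothetical, the names are placeholders, and the sketch assumes positions are counted from 1 in the order the contributors appear in the SE identifier.

```python
def contributor_ids(role, names):
    """Return (id, name, display_seq) for each contributor in one set.

    display_seq is None when the element should be omitted.
    """
    if len(names) == 1:
        # A lone contributor gets a plain id and no display-seq.
        return [(role, names[0], None)]
    # Several contributors: numbered ids, display-seq equal to position.
    return [(f"{role}-{n}", name, n) for n, name in enumerate(names, start=1)]


# Placeholder names, for illustration only.
for elem_id, name, seq in contributor_ids("author", ["Jane Doe", "John Roe"]):
    print(f'<dc:creator id="{elem_id}">{name}</dc:creator>')
    if seq is not None:
        print(f'<meta property="display-seq" refines="#{elem_id}">{seq}</meta>')
```

A contributor who should stay out of the identifier would instead be given a display-seq of 0 by hand, which this sketch does not model.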
Example
The author metadata block
<dc:creator id="author"> contains the author’s name as it appears on the cover.
If there is more than one author, the first author’s id is author-1, the second author-2, and so on.
<meta property="file-as" refines="#author"> contains the author’s name as filed alphabetically. This element is included even if it’s identical to <dc:creator>.
<meta property="se:name.person.full-name" refines="#author"> contains the author’s full name, with any initials or middle names expanded, and including any titles. If the author uses a pseudonym, then this should be the full pseudonym, not the author’s real name. This element is not included if the value is identical to <dc:creator>.
<meta property="alternate-script" refines="#author"> contains the author’s name as it appears on the cover, but transliterated into their native alphabet if applicable. For example, Anton Chekhov’s name would be contained here in the Cyrillic alphabet. This element is not included if not applicable.
<meta property="se:url.encyclopedia.wikipedia" refines="#author"> contains the URL of the author’s English Wikipedia page. This element is not included if there is no English Wikipedia page.
<meta property="se:url.authority.nacoaf" refines="#author"> contains the URI of the author’s Library of Congress Names Database page. It uses a plain http: prefix, and does not include the .html file extension. This element is not included if there is no LoC Names database entry.
<meta property="role" refines="#author" scheme="marc:relators"> contains the MARC relator tag for the roles the author played in creating this book.
This example shows a complete author metadata block for Short Fiction, by Anton Chekhov:
The translator metadata block
If the work is translated, the <dc:contributor id="translator"> metadata block follows the author metadata block.
If there is more than one translator, then the first translator’s id is translator-1, the second translator-2, and so on.
Each block is identical to the author metadata block, but with <dc:contributor id="translator"> instead of <dc:creator id="author">.
The MARC relator tag is trl: <meta property="role" refines="#translator" scheme="marc:relators">trl</meta>.
Translators often annotate the work; if this is the case, the additional MARC relator tag ann is included in a separate <meta property="role" refines="#translator" scheme="marc:relators"> element.
The illustrator metadata block
If the work is illustrated by a person who is not the author, the illustrator metadata block follows.
If there is more than one illustrator, the first illustrator’s id is illustrator-1, the second illustrator-2, and so on.
Each block is identical to the author metadata block, but with <dc:contributor id="illustrator"> instead of <dc:creator id="author">.
The MARC relator tag is ill: <meta property="role" refines="#illustrator" scheme="marc:relators">ill</meta>.
The cover artist metadata block
The “cover artist” is the artist who painted the art the producer selected for the Standard Ebook cover.
The cover artist metadata block is identical to the author metadata block, but with <dc:contributor id="artist"> instead of <dc:creator id="author">.
The MARC relator tag is art: <meta property="role" refines="#artist" scheme="marc:relators">art</meta>.
Metadata for additional contributors
Occasionally a book may have other contributors besides the author, translator, and illustrator; for example, a person who wrote a preface, an introduction, or who edited the work or added endnotes.
Additional contributor blocks are identical to the author metadata block, but with <dc:contributor> instead of <dc:creator>.
The id attribute of the <dc:contributor> is the lowercase, URL-safe, fully-spelled out version of the MARC relator tag. For example, if the MARC relator tag is wpr, the id attribute would be writer-of-preface.
The MARC relator tag is one that is appropriate for the role of the additional contributor. Common roles for ebooks are: wpr, ann, and aui.
If a contributor is a collaborator on part of the book, for example if they share a byline on a short story, the ctb MARC relator tag is used, and the contributor is given display-seq set to 0 to prevent them from appearing in the book’s overall byline.
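The id rule for additional contributors can be sketched in Python. MARC_TERMS below is a small hand-written excerpt of MARC relator terms, included only for illustration, and the lowercasing and hyphenation is a simplified assumption about what “URL-safe” means here.

```python
import re

# A few MARC relator tags and their spelled-out terms (partial excerpt).
MARC_TERMS = {
    "ann": "Annotator",
    "art": "Artist",
    "ill": "Illustrator",
    "trl": "Translator",
    "wpr": "Writer of preface",
}


def contributor_id(relator_tag):
    """Turn a MARC relator tag into a lowercase, URL-safe id."""
    term = MARC_TERMS[relator_tag]  # raises KeyError for tags not in the excerpt
    return re.sub(r"[^a-z0-9]+", "-", term.lower()).strip("-")


print(contributor_id("wpr"))
```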
Transcriber metadata
If the ebook is based on a transcription by someone else, like Project Gutenberg, then transcriber blocks follow the general contributor metadata blocks.
If the transcriber is anonymous, the value for the transcriber’s <dc:contributor> element is Anonymous.
If there is more than one transcriber, the first transcriber is transcriber-1, the second transcriber-2, and so on.
The <meta property="file-as" refines="#transcriber-1"> element contains an alpha-sorted representation of the transcriber’s name.
The MARC relator tag is trc: <meta property="role" refines="#transcriber-1" scheme="marc:relators">trc</meta>.
If the transcriber’s personal homepage is known, the element <meta property="se:url.homepage" refines="#transcriber-1"> is included, whose value is the URL of the transcriber’s homepage. The URL must link to a personal homepage only; no products, services, or other endorsements, commercial or otherwise.
Sponsor metadata
If an ebook has a financial sponsor, then the sponsor block follows the transcriber block.
The <meta property="file-as" refines="#sponsor"> element contains an alpha-sorted representation of the sponsor’s name.
The MARC relator tag is spn: <meta property="role" refines="#sponsor" scheme="marc:relators">spn</meta>.
If the sponsor’s homepage is known, the element <meta property="se:url.homepage" refines="#sponsor"> is included, whose value is the URL of the sponsor’s homepage. Since sponsors may be corporations, linking to a corporate homepage is permitted; however, linking to a specific product or service is disallowed.
Producer metadata
These elements describe the SE producer who produced the ebook for the Standard Ebooks project.
Producer names must sound like complete real names, i.e., they must have at least a first initial and full last name. Anonymous producers are allowed, and if the producer is anonymous then the value for the producer’s <dc:contributor> element is Anonymous.
If there is more than one producer, the first producer is producer-1, the second producer-2, and so on.
The producer metadata block is identical to the author metadata block, but with <dc:contributor id="producer-1"> instead of <dc:creator id="author">.
If the producer’s personal homepage is known, the element <meta property="se:url.homepage" refines="#producer-1"> is included, whose value is the URL of the producer’s homepage. The URL must link to a personal homepage only; no products, services, or other endorsements, commercial or otherwise.
The MARC relator tags for the SE producer usually include all of the following:
bkp: The producer produced the ebook as role.
blw: The producer wrote the blurb (the long description) as role.
cov: The producer selected the cover art as role.
mrk: The producer wrote the HTML markup for the ebook as role.
pfr: The producer proofread the ebook as role.
tyg: The producer reviewed the typography of the ebook as role.
The ebook manifest
The <manifest> element is a required part of the EPUB spec that defines a list of files within the ebook.
The manifest is in alphabetical order.
The id attribute is the basename of the href attribute.
Files which contain SVG images have the additional properties attribute with the value svg in their manifest item.
The manifest item for the table of contents file has the additional properties attribute with the value nav.
The manifest item for the cover image has the additional properties attribute with the value cover-image.
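As a rough sketch of the manifest rules, the following Python builds <item> elements from a mapping of file paths to properties values. The file names are hypothetical, the media-type table is deliberately partial, and both the attribute order and sorting by href are assumptions about what “alphabetical order” means.

```python
import posixpath

# Media types for a few common extensions (a deliberately partial table).
MEDIA_TYPES = {
    ".css": "text/css",
    ".jpg": "image/jpeg",
    ".svg": "image/svg+xml",
    ".xhtml": "application/xhtml+xml",
}


def manifest_item(href, properties=None):
    """Build one <item>; the id is the basename of the href."""
    item_id = posixpath.basename(href)
    media_type = MEDIA_TYPES[posixpath.splitext(href)[1]]
    attrs = f'href="{href}" id="{item_id}" media-type="{media_type}"'
    if properties:
        attrs += f' properties="{properties}"'
    return f"<item {attrs}/>"


def manifest(entries):
    """entries maps href to a properties value or None; sorted by href (assumption)."""
    return [manifest_item(href, entries[href]) for href in sorted(entries)]


# Hypothetical file names, for illustration only.
for line in manifest({
    "toc.xhtml": "nav",
    "images/cover.jpg": "cover-image",
    "text/chapter-1.xhtml": None,
}):
    print(line)
```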
The ebook spine
The <spine> element is a required part of the EPUB spec that defines the reading order of the files in the ebook.
Accessibility metadata
Accessibility metadata is added to bring the final ebook into conformance with the EPUB Accessibility spec, with the following considerations.
Accessibility metadata is arranged in the metadata file in groups by property, with items in each group ordered by their text values. The groups appear in this order:
a11y:certifiedBy
schema:accessMode
schema:accessModeSufficient
schema:accessibilityFeature
schema:accessibilityHazard
schema:accessibilitySummary
If the ebook has images not including the cover, titlepage, and publisher logo, then the following metadata is included in addition to any boilerplate accessibility metadata:
The cover, titlepage, and publisher logo are ignored because they are present in most ebooks; if they were to be counted in accessibility metadata, that metadata would appear in most ebooks, making it meaningless.
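The grouping and ordering of accessibility metadata described above can be sketched as a sort key in Python. The function name is hypothetical and the property values in the usage lines are illustrative, not a recommendation for any particular ebook.

```python
# Group order as listed in this section.
GROUP_ORDER = [
    "a11y:certifiedBy",
    "schema:accessMode",
    "schema:accessModeSufficient",
    "schema:accessibilityFeature",
    "schema:accessibilityHazard",
    "schema:accessibilitySummary",
]


def sort_accessibility(entries):
    """Sort (property, text value) pairs by group, then by text value."""
    # GROUP_ORDER.index raises ValueError for a property outside the list.
    return sorted(entries, key=lambda entry: (GROUP_ORDER.index(entry[0]), entry[1]))


# Illustrative values only.
for prop, value in sort_accessibility([
    ("schema:accessibilityHazard", "none"),
    ("schema:accessMode", "visual"),
    ("schema:accessMode", "textual"),
]):
    print(f'<meta property="{prop}">{value}</meta>')
```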