diff --git a/Makefile b/Makefile index bc0a3c9..4b0fe8b 100644 --- a/Makefile +++ b/Makefile @@ -27,7 +27,7 @@ SOURCES = $(DOCNAME).tex # List of image files to be included in submitted package (anything that # can be rendered directly by common web browsers) -FIGURES =fig-ext-ids.pdf +FIGURES =fig-ext-ids.pdf voprov_example.png # List of PDF figures (figures that must be converted to pixel images to # work in web browsers). diff --git a/data-origin.bib b/data-origin.bib index 5fc8d79..76dc8c7 100644 --- a/data-origin.bib +++ b/data-origin.bib @@ -52,13 +52,13 @@ @misc{std:registry Year = {2014} } -@misc{std:2019ivoa.spec.1021O, +@misc{std:2025ivoa.spec.0116O, Author= {Francois Ochsenbein et al.}, Organization = {IVOA}, Title = {VOTable Format Definition}, - Version = {1.4}, + Version = {1.5}, Url = {https://www.ivoa.net/documents/VOTable/}, - Year={2019} + Year={2025} } @misc{std:2008ivoa.specQ0222P, diff --git a/data-origin.tex b/data-origin.tex index 52657ca..9c98803 100644 --- a/data-origin.tex +++ b/data-origin.tex @@ -3,6 +3,7 @@ \lstset{flexiblecolumns=true} \usepackage{todonotes} \usepackage{array} +\usepackage{float} \marginparwidth=4cm \title{Data Origin in the VO} @@ -10,13 +11,11 @@ % see ivoatexDoc for what group names to use here \ivoagroup{DCP} -%\author[????URL????]{G.Landais} + \author{G.Landais} \author{A.Muench} \author{M.Demleitner} \author{R.Savalle} -%\author{looking for contributors} -%\author{????Fred Offline????} \editor{G.Landais} @@ -28,7 +27,7 @@ \begin{document} \begin{abstract} -Data Origin in the VO specifies a set of metadata items that define basic +Data Origin in the VO identifies a set of metadata items that define basic provenance information, as well as their representation in documents produced by Virtual Observatory (VO) services. This will improve traceability for VO users, help them to understand result sets and facilitate data reuse and citation. 
@@ -49,7 +48,7 @@ \section{Introduction} The Virtual Observatory (VO) provides an advanced framework to search for, query, and consume astronomical data. The specification of Data Origin proposed here for VOTable output includes both metadata originating at the data producer (e.g, author, space agency, observatory) and at the data centre (publisher) hosting the resource. At this point, depending on the implementation, users can find the information conveyed in Data Origin in the data centre web pages (landing pages) or in the VO Registry. For citation, the ADS (NASA Astrophysics Data System) offers comprehensive bibliographic capabilities, including the production of BibTeX records for publications known to ADS. However, there are no VO standards to communicate this type of information yet. -%However, there are standards for how to locate these types of information, and often it is not available machine-readably. + A list of basic data origin metadata, reliably findable in a convenient location (i.e., the VOTable produced by a query) will help users to properly cite or @@ -129,9 +128,9 @@ \subsection{Workflow bibliography} \section{State of the Art} -Neither VOTable \citep{2019ivoa.spec.1021O} nor IVOA data access protocols at this point provide standard facilities for conveying Data Origin information. While protocols such as TAP \citep{2019ivoa.spec.0927D} have standard interfaces to retrieve table metadata (e.g., unit, type and description of columns) or metadata on service endpoints (``capabilities'') by virtue of providing VOSI \citep{2017ivoa.spec.0524G} endpoints, for basic metadata like authors or publication dates, clients have to consult the VO Registry. Even that may be difficult, because you cannot in general obtain its IVOA identifier from a service itself. +Neither VOTable \citep{2025ivoa.spec.0116O} nor IVOA data access protocols at this point provide standard facilities for conveying Data Origin information. 
While protocols such as TAP \citep{2019ivoa.spec.0927D} have standard interfaces to retrieve table metadata (e.g., unit, type and description of columns) or metadata on service endpoints (``capabilities'') by virtue of providing VOSI \citep{2017ivoa.spec.0524G} endpoints, for basic metadata like authors or publication dates, clients have to consult the VO Registry. Even that may be difficult, because a service's IVOA identifier cannot in general be obtained from the service itself. -HiPS \citep{2017ivoa.spec.0519F} is a more recent protocol which includes for each dataset a list of standardized metadata. HiPS metadata includes authors, publication year, data centre identifier or licenses. +HiPS \citep{2017ivoa.spec.0519F} is a protocol that provides a list of standardized metadata for each dataset. HiPS metadata includes authors, publication year, data centre identifier, and licenses. \begin{figure} \centering @@ -161,7 +160,7 @@ \subsection{Data Origin in IVOA Registry} The IVOA Registry uses a unique identifier, the IVOID \citep{2016ivoa.spec.0523D}, as the primary key for its resource -collection. By the above considerations, this IVOID is not suitable as a means of citation.%, because it is a technical identifier with no provisions for persistence today. remove 2025-11-03 +collection. By the above considerations, this IVOID is not suitable as a means of citation. Both the Registry's metadata schema and the DataCite \citep{std:DataCite40} metadata schema have been @@ -185,7 +184,7 @@ \subsection{Data Origin and Provenance} get the main entity from a resource that is the data of origin, typically in one step (an entity generated by an activity that used another entity as origin). 
-This mapping is illustrate in a VO Provenance Model \citep{2020ivoa.spec.0411S} in appendix +This mapping is illustrated in a VO Provenance Model \citep{2020ivoa.spec.0411S} in appendix %The Provenance Data Model \citep{2020ivoa.spec.0411S} is based on Entities, Agents and Activities as defined in the W3C Provenance model. The model's main focus is the detailed documentation of workflows. @@ -259,10 +258,10 @@ \subsection{Query information} particularly important. -\begin{table} +\begin{table}[h] \hbox to\textwidth{\hss \begin{tabular}{|l|>{\raggedright}p{7cm}|l|} \hline -\textbf{\vrule width0pt height 12pt depth 7pt Key} & \textbf{Description} & \textbf{Dublin Core}\\ \hline +\textbf{\vrule width0pt height 12pt depth 7pt Item} & \textbf{Description} & \textbf{Dublin Core}\\ \hline % removed ivoid & IVOID of underlying data collection & R & \\ \hline publisher & Data centre that produced the VOTable & publisher\\ \hline %rename 23-nov-2023 version & Software version (*) & & \\ \hline @@ -285,14 +284,12 @@ \subsection{Query information} query part of the URL here. More complex scenarios like UWS are not covered by this document.} \end{tabular}\hss} -\caption{\xmlel{INFO} names available for specifying the query that -generated a VOTable} +%\caption{\xmlel{INFO} names available for specifying the query that generated a VOTable} +\caption{\xmlel{INFO} names for specifying the query that generated a VOTable} \label{tab:query-names} \end{table} - - \subsection{Dataset Origin} \label{sec:dataset-origin} Dataset origin complements the query-related information to improve the @@ -302,14 +299,13 @@ \subsection{Dataset Origin} must be taken that the in-response metadata reflects the metadata available there at the time the response is produced. - Table~\ref{tab:origin-names} lists the origin-related metadata items defined here. 
-\begin{table} +\begin{table}[!h] \hbox to\textwidth{\hss \begin{tabular}{|l|>{\raggedright}p{7cm}|l|} \hline -\textbf{\vrule width0pt height 12pt depth 7pt Key} & \textbf{Description} & \textbf{Dublin Core}\\ \hline +\textbf{\vrule width0pt height 12pt depth 7pt Item} & \textbf{Description} & \textbf{Dublin Core}\\ \hline % removed 23-nov-2023 publication\_id & Dataset identifier that can be used for citation& M & identifier\\ \hline data\_ivoid & IVOID of underlying data collection & \\ \hline ivoid & (deprecated) use data\_ivoid & \\ \hline @@ -323,8 +319,6 @@ \subsection{Dataset Origin} resource\_version & Dataset version & \\ \hline %rename 23-nov-2023 rights & (*) Licence URI & R & rights\\ \hline rights\_uri & Licence URI (*) & rights\\ \hline -% removed 23-nov-2023 rights\_type & (*) Licence type (eg: CC-by, CC-0, private, public) & & \\ \hline -%rename 23-nov-2023 copyrights & Copyright text & & \\ \hline rights & Licence or Copyright text & rights\\ \hline creator & \raggedright The person(s) mainly involved in the creation of the resource; generally, the author(s) @@ -342,9 +336,8 @@ \subsection{Dataset Origin} cites & An Identifier (ivoid, DOI, bibcode) of a resource being in a ``cites'' (**) relationship to the originating resource & relation\\ \hline -is\_derived\_from & An Identifier (ivoid, DOI, bibcode) of a resource - being in an ``is\_derived\_from'' (**) relationship - to the originating resourcd & relation\\ \hline +is\_derived\_from & An Identifier (ivoid, DOI, bibcode) of a referenced resource + that was used to produce the current resource (**) & relation\\ \hline % remove 23-nov-2023 %publication\_date & Date of publication (DALI timestamp) & R & \\ \hline %resource\_date & Date of the original publication (DALI timestamp) & R & date\\ \hline @@ -372,22 +365,28 @@ \subsection{VOTable serialization} providers to describe individual tables. This is particularly suitable for protocols like Simple Cone Search. 
-The basic serialization uses INFO tags to populate Data Origin (see the example of a ConeSearch result in appendix \ref{sec:appendixA}). +%The basic serialization uses INFO tags to populate Data Origin (see the example of a ConeSearch result in appendix \ref{sec:appendixA}). +\paragraph{The basic serialization} uses INFO tags to populate Data Origin, with the \texttt{name} attribute set to one of the items listed in Table~\ref{tab:origin-names} or Table~\ref{tab:query-names} (see the example of a ConeSearch result in appendix \ref{sec:appendixA}). +%uses INFO tags to populate Data Origin items using the attribute 'name' (see the example of a ConeSearch result in appendix \ref{sec:appendixA}). INFO tags are allowed in VOTable under \xmlel{VOTABLE} or in \xmlel{RESOURCE} elements. It is expressly allowed to supply data origin in individual \xmlel{TABLE} or \xmlel{RESOURCE} elements in more complex VOTables. +As a best practice, the global items listed in Table~\ref{tab:query-names} should be placed directly at the root of the VOTable document. If a VOTable document contains several resources or tables, the items listed in Table~\ref{tab:origin-names} can be placed in their respective resources or tables. + +As a service to human readers, it is recommended to put descriptions, possibly derived from definitions provided in this document, into the bodies of the INFO elements. + This specification does not at this point constrain the multiplicities of individual INFO items, and clients should not fail hard if any given INFO item occurs multiple times. -Complex queries (for instance, resulting from ADQL JOIN-s) need an advanced output serialization to gather the full metadata of all contributing resources. + + +\paragraph{Complex queries} (for instance, those resulting from ADQL JOINs) need an advanced output serialization to gather the full metadata of all contributing resources. Mechanisms to manage this requirement are being developed in the IVOA (MIVOT). 
The mechanisms defined here are generally still applicable in these cases, but the authors acknowledge that they are certainly stretched to their limits in such cases. -As a service to human readers, it is recommended to put descriptions, possibly derived from definitions provided in this document, into the bodies of the INFO elements. - %\section{Data Origin in Registry} REMOVE 2025-11-03 %The VO registry schema, which contains most of the Data Origin information, is completed by metadata described in VOResource \citep{2018ivoa.spec.0625P}. @@ -401,8 +400,8 @@ \subsection{VOTable serialization} \section{Appendix, Cone search serialization}\label{sec:appendixA} Simple Conesearch with its VOTable serialization. Data Origin are specified using INFO. \begin{lstlisting}[basicstyle=\footnotesize\ttfamily] - +