Data curation at CDS

Last updated on 2023-11-06

Overview

Questions

  • What happens to your data after submission to VizieR?
  • What is data curation?

Objectives

  • Create FAIR tables, integrating “key” columns
  • Summarize all the steps happening behind the curtains once your data are submitted, and before their full integration into the VO

Overview


Once the data have been submitted to the CDS servers, the VizieR team checks that they comply with the standards. Once the data have been accepted, the CDS team also adds valuable and relevant information such as metadata and links to other catalogues. This can lead to interactions with the authors, although we try to keep them to a minimum.

Figure -- Summary data journey from a publication to VizieR and then EOSC: fourth step of the journey - step curation and verification of the data, right after - step data published in a refereed paper, step preparation of the data, step submission of the data
Figure: Journey from a publication to EOSC, step 4 “curation & verification”

Behind the scenes: verifications


In addition to the semi-automated verifications already done by the programs during the different steps of the ingestion, more in-depth verifications are done by the CDS team focusing on the reliability and the quality of the catalogues.

Important points to check

In the following, we present some corrections applied to real datasets.

Here are the 5 important points that would save us some time:

  • Units
  • Parameters description
  • Coordinates
  • Identifiers
  • Common key between tables

Verifications: Example 1 - Units

One key point is to check the units.

Units corrected

In the example below the original unit for a cylindrical volume of a region (column Size from the figure below) was wrongly set to cm-3.

Screenshot -- Table with wrong units as displayed in paper
Figure: Before – Units as written in original paper (screenshot)

Our team spotted the error, wrote to the author, and corrected both the description and the unit (field V in the figure below).

Screenshot -- VizieR table with units corrected
Figure: After – Units corrected in VizieR table (screenshot)
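A check of this kind can be sketched in a few lines. The snippet below is only an illustration, not the CDS verification code: the function names, the short list of length units and the rule "a volume must be a length unit to the power 3" are assumptions made for this example, and it only handles single units written as a symbol followed by an optional exponent (such as cm-3).

```python
import re

# Hypothetical helper: handles only "symbol + optional signed exponent",
# e.g. "cm-3" or "pc3". Compound units are out of scope for this sketch.
UNIT_PATTERN = re.compile(r"^([A-Za-z]+)([+-]?\d+)?$")

# Illustrative subset of length units, not an exhaustive list.
LENGTH_UNITS = {"cm", "m", "km", "pc", "kpc", "AU"}


def parse_simple_unit(text):
    """Split a simple unit string into (symbol, integer exponent)."""
    match = UNIT_PATTERN.match(text)
    if match is None:
        raise ValueError(f"cannot parse unit: {text!r}")
    symbol, exponent = match.groups()
    return symbol, int(exponent) if exponent else 1


def is_volume_unit(text):
    """Return True if the unit is a length unit raised to the power 3."""
    symbol, exponent = parse_simple_unit(text)
    return symbol in LENGTH_UNITS and exponent == 3


# "cm-3" is a number density, so it cannot describe a volume.
print(is_volume_unit("cm-3"))
# "pc3" is used here as a hypothetical example of a valid volume unit.
print(is_volume_unit("pc3"))
```

A column described as a volume but failing such a test would be flagged for a closer look and, if needed, a question to the author.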

Verifications: Example 2 - Coordinates

After the units, the coordinates are the most important data the VizieR team tries to gather and curate. Searching by position is indeed the most common way to look for data in VizieR.

Coordinates corrected

Here is an example of coordinates with discrepancies when the declination degree value is 0.

Screenshot -- Table with wrong coordinates as made available in paper
Figure: Before – Coordinates as written in original paper (screenshot). The following columns are displayed: Seq (catalog index number); BGPS source identifier; Hour, Minute and Second of Right Ascension (J2000); Sign, Degree, Arcminute, Arcsecond of the Declination (J2000).

Once our team detected the error (a missing minus sign for some declinations), the positions were updated, two years after the data ingestion in VizieR.

Screenshot -- VizieR table with coordinates corrected
Figure: After – Coordinates corrected in VizieR table (screenshot)
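The source does not say how the minus sign was lost, but one plausible cause (an assumption) is that the sign was carried by the integer degree field, where "-00" becomes a plain 0. The sketch below contrasts that fragile approach with a conversion that keeps the sign as a separate character, as in the Sign, Degree, Arcminute, Arcsecond columns described above. The function names and the sample value are hypothetical.

```python
def dec_to_degrees(sign, deg, arcmin, arcsec):
    """Convert a sexagesimal declination to decimal degrees.

    The sign is a separate character, so declinations between
    -01 and +00 degrees keep their minus sign.
    """
    if sign not in ("+", "-"):
        raise ValueError(f"unexpected declination sign: {sign!r}")
    magnitude = deg + arcmin / 60.0 + arcsec / 3600.0
    if magnitude > 90.0:
        raise ValueError("declination out of range")
    return -magnitude if sign == "-" else magnitude


def naive_dec_to_degrees(signed_deg, arcmin, arcsec):
    """Fragile variant: the sign is taken from the integer degree field."""
    factor = -1.0 if signed_deg < 0 else 1.0
    return factor * (abs(signed_deg) + arcmin / 60.0 + arcsec / 3600.0)


# Hypothetical declination of -00 30 00 (not taken from the catalogue).
correct = dec_to_degrees("-", 0, 30, 0.0)
# int("-00") evaluates to 0, so the minus sign is silently dropped.
wrong = naive_dec_to_degrees(int("-00"), 30, 0.0)
print(correct, wrong)
```

Keeping the sign in its own column, as in the table above, avoids this trap.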

When a table has no positions, they can be added from other catalogues or from SIMBAD if available. Alternatively, we ask the authors for them (and sometimes get an answer).

Coordinates added

In the following example, we can see that no coordinate was provided in the original table.

Screenshot -- Table without coordinates as available in paper
Figure: Before – Columns as written in original table (screenshot)

Using SIMBAD, or the HASH PN database when no corresponding SIMBAD match was found (SimbadName empty), we were able to complement this table with positions.

Screenshot -- Table with SimbadName and coordinate information added to the original columns
Figure: After – Coordinates added in VizieR table (screenshot). The 4 columns in color are computed by VizieR, and not part of the original data.

Verifications: Example 3 - Identifiers

The third important point for our team is the identifiers.

SIMBAD names added + misprint on names corrected

To retrieve coordinates and ease the cross-identification between SIMBAD and VizieR, a proper identification is needed.

Here is an example of truncated SDSS names. Such objects are impossible to retrieve except by coordinates, which we luckily have in this case.

Screenshot -- Table with truncated names as identifiers in paper
Figure: Before – Table as written in the original paper. The following columns are displayed: Type, Name, f_Name; Hour, Minute and Second of Right Ascension (J2000); Sign, Degree, Arcminute, Arcsecond of the Declination (J2000); Teff, log(g), l_[Fe/H], [Fe/H], l_[C/Fe], [C/Fe], l_[C/Fe]c, [C/Fe]c, l_A(C), A(C), l_[Ba/Fe], [Ba/Fe], f_[Ba/Fe], l_[Eu/Fe], [Eu/Fe], f_[Eu/Fe]; Class, Bin, f_Bin, Out, Ref.

The SimbadName column was therefore added after the SIMBAD processing, during which misprints in the coordinates were detected (flagged by the column f_Name set to o below for Name = SDSS J1137+2553, and highlighted in both Before/After figures). For this object, whose coordinates pointed to nothing, the correct ones were found thanks to the bibcode given in the table.

Screenshot -- VizieR table with SIMBAD-names added and misprint on names (in the declination) corrected
Figure: After – Example of names recognized by SIMBAD added to the original table submitted to VizieR (screenshot)
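When names are truncated, matching by position is the fallback. The following is a minimal sketch of such a positional match, not the procedure actually used for SIMBAD: the function names, the 2 arcsecond radius and the reference list are all hypothetical. It uses the haversine formula for the angular separation between two positions given in degrees.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def nearest_match(ra, dec, reference, radius_arcsec=2.0):
    """Return the name of the closest reference object within the radius.

    `reference` is a list of (name, ra_deg, dec_deg) tuples.
    Returns None when nothing lies within `radius_arcsec`.
    """
    best_name, best_sep = None, None
    for name, ref_ra, ref_dec in reference:
        sep = angular_separation_deg(ra, dec, ref_ra, ref_dec) * 3600.0
        if sep <= radius_arcsec and (best_sep is None or sep < best_sep):
            best_name, best_sep = name, sep
    return best_name


# Hypothetical reference positions, not real catalogue entries.
reference = [("OBJ-1", 174.25, 25.90), ("OBJ-2", 10.0, -5.0)]

print(nearest_match(174.2501, 25.9001, reference))  # close to OBJ-1
print(nearest_match(50.0, 50.0, reference))         # nothing nearby
```

A position that matches nothing, as for the object flagged above, is a hint that the coordinates themselves may contain a misprint.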

Verifications: Example 4 - Odd values

We add minimum and maximum values of numerical columns. This allows us to detect some oddities, and it also helps the astronomer who will validate the VizieR catalogue afterwards.

Min-max values added

Screenshot -- VizieR ReadMe file with minimum and maximum values added to the numerical fields
Figure: Example of minimum and maximum values (in brackets) added to a ReadMe file (screenshot)
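As an illustration of why these ranges are useful, here is a minimal sketch that computes the minimum and maximum of each numerical column and compares them with plausible limits. The rows, column names and limits are invented for the example, and the functions are hypothetical helpers rather than VizieR tools.

```python
def column_ranges(rows, columns):
    """Return {column: (min, max)}, ignoring missing (None) values."""
    ranges = {}
    for name in columns:
        values = [row[name] for row in rows if row.get(name) is not None]
        ranges[name] = (min(values), max(values)) if values else None
    return ranges


def out_of_bounds(ranges, limits):
    """List the columns whose observed range exceeds the expected limits."""
    suspicious = []
    for name, (low, high) in limits.items():
        observed = ranges.get(name)
        if observed is not None and (observed[0] < low or observed[1] > high):
            suspicious.append(name)
    return suspicious


# Hypothetical rows: 99.99 mimics a placeholder left in a magnitude column.
rows = [
    {"Vmag": 12.3, "Teff": 5200},
    {"Vmag": 14.1, "Teff": 4800},
    {"Vmag": 99.99, "Teff": None},
]
# Hypothetical plausibility limits chosen for this example only.
limits = {"Vmag": (-2.0, 30.0), "Teff": (2000, 60000)}

ranges = column_ranges(rows, ["Vmag", "Teff"])
print(ranges)
print(out_of_bounds(ranges, limits))
```

Here the maximum of Vmag stands out immediately, which is the kind of oddity the min-max values in the ReadMe are meant to reveal.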

We also add links between tables in VizieR. For instance, if an author said that magnitudes come from a certain survey, we actually point to that survey so we can verify the values. If a table contains galaxy clusters and another the membership, we can add the number of galaxy members per cluster, assuming the cluster names are the same in both tables.

Adding those links helps us to detect errors and missing data.
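The galaxy cluster case can be sketched as follows. This is only an illustration with invented names: it counts the members of each cluster through the shared cluster name and reports both clusters without members and members pointing to a cluster absent from the cluster table.

```python
from collections import Counter


def count_members(cluster_names, member_rows):
    """Count members per cluster and list unknown cluster names.

    `member_rows` is a list of (cluster_name, member_name) tuples.
    Returns (per_cluster_counts, orphan_cluster_names).
    """
    counts = Counter(cluster for cluster, _ in member_rows)
    per_cluster = {name: counts.get(name, 0) for name in cluster_names}
    orphans = sorted(set(counts) - set(cluster_names))
    return per_cluster, orphans


# Hypothetical tables: names are placeholders, not real objects.
clusters = ["CL-A", "CL-B", "CL-C"]
members = [("CL-A", "m1"), ("CL-A", "m2"), ("CL-B", "m3"), ("CL-D", "m4")]

per_cluster, orphans = count_members(clusters, members)
print(per_cluster)  # CL-C has no member: possibly missing data
print(orphans)      # CL-D is not in the cluster table: possibly a misprint
```

Both outputs point to rows worth checking, which is how a simple link between tables turns into a verification.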

Verifications: Example 5 - Missing common key

Last but not least, to add links between tables we need a common key (e.g. identifier, coordinates, …).

Cross-identification between tables

The two figures below show an example taken from a paper with two tables (Tables A and B) that share two similar first columns:

  • Name of the stellar system to which the star belongs
  • Name of the star

However, it is not obvious that Bel10018 (SimbadName: [BFO2002] UMi 10018) mentioned in Table A corresponds to COS 347 in Table B.

Screenshot -- Table A as displayed in paper
Figure: Before – Extract of Table A from paper (screenshot). The following columns are displayed: Name of stellar system, Star name, RA (J2000), DEC (J2000), Bmag, Vmag, Rmag, CoFe, e_CoFe, NiFe, e_NiFe.
Screenshot -- Table B as displayed in paper
Figure: Before – extract of Table B from paper (screenshot). The following columns are displayed: Name of the stellar system, Star name, Reference for HRS abundances, Cr-HRS, e_Cr-HRS, Co-HRS.

As there is no common identifier, and no coordinates are repeated in the second table, the only alternative would have been to go through the list of references cited (3rd column of Table B) to get the coordinates and identify the objects one by one. Therefore, the CDS team contacted the author to get the names and positions for Table B and create a better link between the two tables, as displayed below.

Screenshot -- Table B updated as available on VizieR
Figure: After – extract of Table B as available in VizieR (screenshot)

Errata


As said before, the VizieR database evolves every day, with new catalogues being added and old ones being updated.

Tables updated

In the example below, one table from the original catalogue was updated, to reflect the changes published in an erratum.

Screenshot -- Table from catalogue updated to be consistent with erratum publication
Figure: Example of a table updated following erratum publication (screenshot)

Data available to all


Once the data are public, they are accessible as plain files in FTP directories at CDS and other participating data centers (e.g. at CfA/Harvard (USA) or NAOJ/ADAC (Japan)), as well as through all VO-compatible services.

Figure -- Summary data journey from a publication to VizieR and then EOSC: fourth step of the journey - step curation and verification of the data, right after - step data published in a refereed paper, step preparation of the data, step submission of the data
Figure: Journey from a publication to EOSC, step 4 “curation & verification”

Summary: What happens to your data at CDS?


Key Points

Once the catalogues are submitted, a delay is needed for VizieR curation and validation before full ingestion!

The validation process involves:

  • Verifications leading to corrections: ~ 30% of the references
  • Main corrections: identifiers, coordinates, units …

You cannot Find, Access and Re-use data if the coordinates/identifiers are not right!

Next chapters


In the next chapters, you will learn what happens to your fully ingested data when they continue their journey into the Virtual Observatory and up to EOSC.