Data curation at CDS
Last updated on 2023-11-06
Overview
Questions
- What happens to your data after submission to VizieR?
- What is data curation?
Objectives
- Create FAIR tables, integrating “key” columns
- Summarize all the steps happening behind the curtains once your data are submitted, and before their full integration into the VO
Overview
Once the data have been submitted to the CDS servers, the VizieR team checks that they are compatible with the standards. Once the data have been accepted, the CDS team also adds valuable and relevant information such as metadata and links to other catalogues. This can lead to interactions with the authors, although we try to keep them to a minimum.
Behind the scenes: verifications
In addition to the semi-automated verifications already done by the programs during the different steps of the ingestion, more in-depth verifications are done by the CDS team focusing on the reliability and the quality of the catalogues.
Important points to check
In the following, we present some corrections applied to real datasets.
Here are the 5 important points that would save us some time:
- Units
- Parameters description
- Coordinates
- Identifiers
- Common key between tables
Verifications: Example 1 - Units
One key point is to check the units.
Units corrected
In the example below, the original unit for the cylindrical volume of a region (column Size in the figure below) was wrongly set to cm-3.
Our team noticed the problem, wrote to the author, and corrected both the description and the unit (field V in the figure below).
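Part of such a check can be automated. The sketch below is only an illustration and not the VizieR pipeline: it assumes simple units written with the exponent directly after the symbol (cm3, cm-3), and the helper names `split_unit` and `looks_like_volume`, as well as the list of length symbols, are hypothetical.

```python
import re

# Hypothetical helper: split a simple unit such as "cm-3" or "pc3" into its
# symbol and signed integer exponent. Compound units are not handled here.
_UNIT_RE = re.compile(r"^([A-Za-z]+)([+-]?\d+)?$")

# Assumed, non-exhaustive list of length symbols for this example.
LENGTH_SYMBOLS = {"cm", "m", "km", "pc", "kpc"}


def split_unit(unit):
    """Return (symbol, exponent) for a simple unit string."""
    match = _UNIT_RE.match(unit.strip())
    if match is None:
        raise ValueError(f"unsupported unit string: {unit!r}")
    symbol, exponent = match.groups()
    return symbol, int(exponent) if exponent else 1


def looks_like_volume(unit):
    """Return True when the unit is a length symbol raised to the power +3."""
    symbol, exponent = split_unit(unit)
    return symbol in LENGTH_SYMBOLS and exponent == 3


for candidate in ("cm-3", "cm3"):
    verdict = "ok" if looks_like_volume(candidate) else "suspicious for a volume"
    print(f"{candidate}: {verdict}")
```

A column described as a volume but carrying a unit that fails this test would then be flagged for a human to review.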
Verifications: Example 2 - Coordinates
After the units, the coordinates are the most important data the VizieR team tries to gather and curate. Searching by position is indeed the most common way to look for data in VizieR.
Coordinates corrected
Here is an example of coordinates with discrepancies when the declination is at 0 degrees.
Once our team detected the error (a missing minus sign for some declinations), the positions were updated, two years after the data were ingested into VizieR.
When a table contains no positions, they can be added from other catalogues or from SIMBAD if available. Alternatively, we ask the authors for them (and sometimes get an answer).
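The lesson does not say how the minus signs were lost. One common cause, assumed here purely for illustration, is reading the sign from the integer degrees of a sexagesimal declination: "-00" becomes 0 and the sign disappears. The function names below are hypothetical.

```python
def dec_to_degrees(sexagesimal):
    """Convert 'sDD MM SS.s' to decimal degrees, keeping the sign of '-00'."""
    text = sexagesimal.strip()
    # Read the sign from the string itself: int("-00") is 0 and carries no sign.
    sign = -1.0 if text.startswith("-") else 1.0
    degrees, minutes, seconds = (float(part) for part in text.lstrip("+-").split())
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)


def naive_dec_to_degrees(sexagesimal):
    """Buggy variant for contrast: the sign is taken from the integer degrees."""
    degrees, minutes, seconds = sexagesimal.split()
    d = int(degrees)
    sign = -1.0 if d < 0 else 1.0
    return sign * (abs(d) + float(minutes) / 60.0 + float(seconds) / 3600.0)


print(dec_to_degrees("-00 30 00"))        # -0.5
print(naive_dec_to_degrees("-00 30 00"))  # 0.5, the minus sign is lost
```

Only declinations between -1 and 0 degrees are affected by this kind of bug, which is why it can stay unnoticed for a long time.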
Coordinates added
In the following example, we can see that no coordinate was provided in the original table.
Using SIMBAD, or the HASH PN database when no corresponding SIMBAD match was found (SimbadName empty), we were able to complement this table with positions.
Verifications: Example 3 - Identifiers
The third important point for our team is the identifiers.
SIMBAD names added + misprint on names corrected
To retrieve coordinates and ease the cross-identification between SIMBAD and VizieR, a proper identification is needed.
Here is an example of truncated SDSS names… They are impossible to retrieve except by coordinates, which we luckily have in this case.
The SimbadName column was therefore added after the SIMBAD processing, during which misprints in the coordinates were detected (flagged by the column f_Name set to o below for Name = SDSS J1137+2553, and highlighted in both the Before and After figures). For this object, whose coordinates pointed to nothing, the correct ones were found thanks to the bibcode given in the table.
Verifications: Example 4 - Odd values
We add the minimum and maximum values of numerical columns. This allows us to detect some oddities, and it is also helpful for the astronomer who will validate the VizieR catalogue afterwards.
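The idea can be sketched in a few lines. This is an assumption-laden illustration, not the CDS tooling: the rows are made up, the column names and the physical limits are chosen for the example, and `column_ranges` is a hypothetical helper.

```python
def column_ranges(rows, columns):
    """Return {column: (min, max)} over rows, skipping missing (None) values."""
    ranges = {}
    for name in columns:
        values = [row[name] for row in rows if row.get(name) is not None]
        if values:
            ranges[name] = (min(values), max(values))
    return ranges


# Made-up rows for illustration; the last declination is deliberately wrong.
rows = [
    {"RAdeg": 10.68, "DEdeg": 41.27, "Vmag": 3.4},
    {"RAdeg": 83.82, "DEdeg": -5.39, "Vmag": None},
    {"RAdeg": 201.37, "DEdeg": 143.02, "Vmag": 12.1},
]

# Physical limits assumed for this example.
limits = {"RAdeg": (0.0, 360.0), "DEdeg": (-90.0, 90.0)}

ranges = column_ranges(rows, ["RAdeg", "DEdeg", "Vmag"])
for name, (low, high) in ranges.items():
    allowed = limits.get(name)
    odd = allowed is not None and (low < allowed[0] or high > allowed[1])
    flag = "  <-- outside expected range" if odd else ""
    print(f"{name}: min={low} max={high}{flag}")
```

Here the maximum of DEdeg exceeds 90 degrees, so a single glance at the min-max summary is enough to spot the problem.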
Min-max values added
Verifications: Example 5 - Link between tables
We also add links between tables in VizieR. For instance, if an author says that the magnitudes come from a certain survey, we actually point to that survey so that the values can be verified. If one table contains galaxy clusters and another their members, we can add the number of member galaxies per cluster, provided the cluster names are the same in both tables.
Adding those links helps us to detect errors and missing data.
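Counting members per cluster can be sketched as follows. The rows are made up (only the cluster name plckg004-19 appears in this lesson), and the column names Cluster, Galaxy and Nmembers are assumptions chosen for the example.

```python
from collections import Counter

# Made-up rows; column names are assumptions for this sketch.
clusters = [{"Cluster": "plckg004-19"}, {"Cluster": "example-cluster-b"}]
members = [
    {"Cluster": "plckg004-19", "Galaxy": 1},
    {"Cluster": "plckg004-19", "Galaxy": 2},
    {"Cluster": "example-cluster-c", "Galaxy": 3},
]

# Number of member rows per cluster name.
counts = Counter(row["Cluster"] for row in members)

# Add the derived column; clusters without any member row get 0.
for row in clusters:
    row["Nmembers"] = counts.get(row["Cluster"], 0)

# Member rows whose cluster is absent from the cluster table hint at an error.
known = {row["Cluster"] for row in clusters}
orphans = sorted(set(counts) - known)

print(clusters)
print("members pointing to unknown clusters:", orphans)
```

Both outcomes are informative: a cluster with zero members may indicate missing data, and an orphan member may indicate a misspelled cluster name.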
Link between tables added
In the example below, the column Nz (Number of high-z galaxies in a given cluster) has been added to the original Table 1 to create a link with the relics-z table.
By clicking on the value “14” from the column Nz for the cluster “plckg004-19”, one will get automatically the corresponding rows from the relics-z table, without any extra filtering, as illustrated below.
Verifications: Example 6 - Missing common key
Last but not least, to add links between tables we need a common key (e.g. an identifier, coordinates, …).
Cross-identification between tables
In the two figures below, we can see an example taken from a paper with two tables (Tables A and B) whose first two columns are similar:
- Name of the stellar system to which the star belongs
- Name of the star
However, it is not obvious that Bel10018 (SimbadName: [BFO2002] UMi 10018) mentioned in Table A corresponds to COS 347 in Table B.
As no common identifier or coordinates are repeated in the second table, the only alternative would have been to go through the list of references cited (3rd column of Table B) to get the coordinates and identify the objects one by one. Therefore, the CDS team contacted the author to get the names and positions for Table B and create a better link between the two tables, as displayed below.
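Apart from the figure, the linking step itself can be sketched in code once a cross-identification is available. The example below is a minimal sketch under stated assumptions: the column name Star, the `aliases` mapping, the `link_tables` helper and the star "example-star-x" are hypothetical; only the pair Bel10018 / COS 347 comes from the text above.

```python
def link_tables(table_a, table_b, aliases):
    """Pair rows of table_b with rows of table_a through an alias mapping.

    Returns (pairs, unmatched): the matched (row_a, row_b) tuples and the
    names from table_b that could not be linked.
    """
    by_name = {row["Star"]: row for row in table_a}
    pairs, unmatched = [], []
    for row in table_b:
        # Translate the Table B name to the Table A name when an alias exists.
        key = aliases.get(row["Star"], row["Star"])
        if key in by_name:
            pairs.append((by_name[key], row))
        else:
            unmatched.append(row["Star"])
    return pairs, unmatched


table_a = [{"Star": "Bel10018"}]
table_b = [{"Star": "COS 347"}, {"Star": "example-star-x"}]

# Cross-identification obtained separately, for example from the author.
aliases = {"COS 347": "Bel10018"}

pairs, unmatched = link_tables(table_a, table_b, aliases)
print(pairs)
print("not linked:", unmatched)
```

Without the alias mapping, nothing in the two name columns alone would connect the rows, which is why a shared key matters.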
Errata
The VizieR database evolves every day, with new catalogues being added and old ones being updated.
Tables updated
In the example below, one table from the original catalogue was updated, to reflect the changes published in an erratum.
Data available to all
Once the data are public, they are accessible as plain files in FTP directories at CDS and at other participating data centers (e.g. CfA/Harvard (USA) or NAOJ/ADAC (Japan)), as well as through all VO-compatible services.
Summary: What happens to your data at CDS?
Key Points
Once the catalogues are submitted, some time is needed for VizieR curation and validation before full ingestion!
The validation process involves:
- Verifications, which lead to corrections for about 30% of the references
- Main corrections: identifiers, coordinates, units, …
You cannot Find, Access and Re-use data if the coordinates/identifiers are not right!
Next chapters
In the next chapters, you will learn what happens to your fully ingested data as they continue their journey into the Virtual Observatory and up to EOSC.