Data curation at CDS
Last updated on 2023-11-06
Overview
Questions
- What happens to your data after submission to VizieR?
- What is data curation?
Objectives
- Create FAIR tables, integrating “key” columns
- Summarize all the steps that happen behind the scenes once your data are submitted and before they are fully integrated into the VO
Overview
Once the data have been submitted to the CDS servers, the VizieR team checks that they are compatible with the standards. Once the data have been accepted, the CDS team also adds valuable and relevant information, such as metadata and links to other catalogues. This can lead to interactions with the authors, although we try to keep them to a minimum.
Behind the scenes: verifications
In addition to the semi-automated verifications already performed by the programs during the different steps of the ingestion, the CDS team carries out more in-depth verifications, focusing on the reliability and quality of the catalogues.
Verifications: Example 2 - Coordinates
After the units, the coordinates are the most important data the VizieR team tries to gather and curate. They are indeed the most common way to search for data in VizieR.
When a table has no coordinates, positions can be added from other catalogues or from SIMBAD, if available. Otherwise, we ask the authors for them (and sometimes we get an answer).
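As a minimal sketch of this fallback, assume a table held as a list of dictionaries and a small in-memory dictionary standing in for a lookup in SIMBAD or another catalogue. The names `fill_missing_positions`, `rows` and `reference`, as well as all values, are hypothetical and do not describe the actual CDS pipeline.

```python
def fill_missing_positions(rows, reference):
    """Fill missing RA/Dec from a reference lookup keyed by object name.

    Returns the names that still have no position afterwards.
    """
    unresolved = []
    for row in rows:
        if row.get("ra") is not None and row.get("dec") is not None:
            continue  # position already provided by the author
        position = reference.get(row["name"])
        if position is None:
            unresolved.append(row["name"])  # the authors would have to be asked
        else:
            row["ra"], row["dec"] = position
    return unresolved


# Illustrative values only, in decimal degrees.
rows = [
    {"name": "Obj A", "ra": 10.5, "dec": -3.2},
    {"name": "Obj B", "ra": None, "dec": None},
    {"name": "Obj C", "ra": None, "dec": None},
]
# Hypothetical stand-in for positions found in another catalogue.
reference = {"Obj B": (150.25, 2.1)}

unresolved = fill_missing_positions(rows, reference)
```

Here `unresolved` collects the objects for which no position could be found, that is, the cases where the authors need to be contacted.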
Verifications: Example 3 - Identifiers
The third priority for our team is the identifiers.
SIMBAD names added and misprints in names corrected
To retrieve coordinates and ease the cross-identification between SIMBAD and VizieR, a proper identification is needed.
Here is an example of truncated SDSS names, which are impossible to retrieve except through the coordinates that, luckily, we have in this case.
The SimbadName column was therefore added after the processing for SIMBAD, during which misprints in the coordinates were detected. They are flagged by the column f_Name set to o (see Name = SDSS J1137+2553 below) and highlighted in both the Before and After figures. For this object, whose coordinates pointed to nothing, the correct ones were found thanks to the bibcode given in the table.
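Truncated names of this kind can be spotted automatically. The sketch below assumes that a complete SDSS designation has the form `SDSS JHHMMSS.ss+DDMMSS.s`; the function name `is_truncated_sdss_name` and the second test name are made up for illustration, and this is not the check used by the CDS team.

```python
import re

# Assumed full SDSS designation: "SDSS JHHMMSS.ss+DDMMSS.s".
FULL_SDSS_NAME = re.compile(r"^SDSS J\d{6}\.\d{2}[+-]\d{6}\.\d$")


def is_truncated_sdss_name(name):
    """Return True for an SDSS-style name that lacks the full designation."""
    return name.startswith("SDSS J") and FULL_SDSS_NAME.match(name) is None


names = [
    "SDSS J1137+2553",           # truncated name quoted in the example above
    "SDSS J113700.00+255300.0",  # made-up name, only to show the full format
]
flags = {name: is_truncated_sdss_name(name) for name in names}
```

A name flagged this way cannot be resolved by identifier alone, so the coordinates (when present) become the only way to recover the object.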
Verifications: Example 4 - Odd values
We add the minimum and maximum values of numerical columns. This allows us to detect some oddities, and it also helps the astronomer who will validate the VizieR catalogue afterwards.
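The idea can be sketched as follows, assuming a table stored as a list of dictionaries. The helper names `column_ranges` and `out_of_bounds`, the column names and the expected bounds are illustrative assumptions, not the actual VizieR tooling.

```python
def column_ranges(rows, columns):
    """Return {column: (min, max)} over the non-missing values of each column."""
    ranges = {}
    for col in columns:
        values = [row[col] for row in rows if row.get(col) is not None]
        if values:
            ranges[col] = (min(values), max(values))
    return ranges


def out_of_bounds(ranges, expected):
    """List the columns whose observed range exceeds the expected bounds."""
    odd = []
    for col, (low, high) in ranges.items():
        if col in expected:
            lower_bound, upper_bound = expected[col]
            if low < lower_bound or high > upper_bound:
                odd.append(col)
    return odd


# Illustrative rows: the second RA value is deliberately impossible.
rows = [
    {"ra": 12.3, "dec": -45.0, "vmag": 14.2},
    {"ra": 361.5, "dec": 10.0, "vmag": 15.1},
    {"ra": 200.0, "dec": 89.9, "vmag": None},
]
# Assumed physical bounds, in degrees.
expected = {"ra": (0.0, 360.0), "dec": (-90.0, 90.0)}

ranges = column_ranges(rows, ["ra", "dec", "vmag"])
odd_columns = out_of_bounds(ranges, expected)
```

Comparing the extreme values with the expected bounds is enough to flag the `ra` column here, without inspecting every row by eye.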
Verifications: Example 5 - Link between tables
We also add links between tables in VizieR. For instance, if an author states that the magnitudes come from a certain survey, we point to that survey so that the values can be verified. If one table contains galaxy clusters and another contains their members, we can add the number of member galaxies per cluster, provided the cluster names are the same in both tables.
Adding those links helps us to detect errors and missing data.
Link between tables added
In the example below, the column Nz (Number of high-z galaxies in a given cluster) has been added to the original Table 1 to create a link with the relics-z table.
Clicking on the value “14” in the column Nz for the cluster “plckg004-19” automatically retrieves the corresponding rows from the relics-z table, without any extra filtering, as illustrated below.
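A link of this kind boils down to counting, for each cluster, the rows of the second table that carry the same cluster name. Here is a minimal sketch, assuming both tables are lists of dictionaries sharing a `cluster` column; the function names, cluster names and redshifts are hypothetical and are not taken from the catalogue shown above.

```python
from collections import Counter


def add_member_counts(clusters, members, key="cluster", count_column="Nz"):
    """Add to each cluster row the number of member rows sharing its name."""
    counts = Counter(member[key] for member in members)
    for cluster in clusters:
        cluster[count_column] = counts.get(cluster[key], 0)
    return clusters


def rows_for(members, name, key="cluster"):
    """Return the member rows linked to one cluster (what the link retrieves)."""
    return [member for member in members if member[key] == name]


# Hypothetical tables, for illustration only.
clusters = [{"cluster": "cluster-a"}, {"cluster": "cluster-b"}]
members = [
    {"cluster": "cluster-a", "z": 6.1},
    {"cluster": "cluster-a", "z": 7.3},
    {"cluster": "cluster-b", "z": 6.8},
]

add_member_counts(clusters, members)
linked = rows_for(members, "cluster-a")
```

A count of zero, or a count that disagrees with the paper, is a quick hint that names differ between the two tables or that rows are missing.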
Verifications: Example 6 - Missing common key
Last but not least, to add links between tables we need a common key (e.g. an identifier or coordinates).
Cross-identification between tables
The two figures below show an example taken from a paper with two tables (Tables A and B) whose first two columns are similar:
- Name of the stellar system to which the star belongs
- Name of the star
However, it is not obvious that Bel10018 (SimbadName: [BFO2002] UMi 10018) mentioned in Table A corresponds to COS 347 in Table B.
As there is no common identifier and no coordinates in the second table, the only alternative would have been to go through the list of references cited (3rd column of Table B) to get the coordinates and identify the objects one by one. Therefore, the CDS team contacted the author to get the names and positions for Table B and create a better link between the two tables, as displayed below.
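The absence of a usable key can be detected by checking, column by column, which values occur in both tables. The sketch below uses the two star names quoted above; the column names, the function `shared_key_values` and the system name `UMi` (guessed from the SimbadName) are assumptions made for illustration.

```python
def shared_key_values(table_a, table_b, candidate_columns):
    """For each candidate column, return the values present in both tables."""
    shared = {}
    for col in candidate_columns:
        values_a = {row[col] for row in table_a if row.get(col)}
        values_b = {row[col] for row in table_b if row.get(col)}
        shared[col] = values_a & values_b
    return shared


# One row per table; "UMi" as the system name is an assumption.
table_a = [{"system": "UMi", "star": "Bel10018"}]
table_b = [{"system": "UMi", "star": "COS 347"}]

shared = shared_key_values(table_a, table_b, ["system", "star"])
```

The system name matches but identifies a whole stellar system rather than a star, and the `star` column has no value in common, so no row-to-row link can be built without extra information from the author.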
Errata
As mentioned before, the VizieR database evolves every day, with new catalogues being added or old ones being updated.
Data available to all
Once the data are public, they are accessible as plain files in FTP directories at CDS and at other participating data centers (e.g. CfA/Harvard in the USA or NAOJ/ADAC in Japan), as well as through all VO-compatible services.
Summary: What happens to your data at CDS?
Key Points
Once the catalogues are submitted, some time is needed for VizieR curation and validation before full ingestion!
The validation process involves:
- Verifications leading to corrections for about 30% of the references
- Main corrections: identifiers, coordinates, units …
You cannot Find, Access and Re-use data if the coordinates/identifiers are not right!
Next chapters
In the next chapters, you will learn what happens to your fully ingested data as they continue their journey into the Virtual Observatory and up to EOSC.