Long-format, wide-format, and wide-names

Dictionary v2.2.3 Documentation v2.1.0

Wide-format and long-format data

Wide- and long-formats are two broad categories for describing how to store or record data. For our purposes, long-format describes a situation with one row for each observation, measure, or data point. Wide-format describes a situation where all observations, measures, or data points of one day are recorded in one row.

For example, below is a long-format table for recording wastewater surveillance data:

date	measure	value	unit
dec 17	covid N1	32.1	gc/mL
dec 17	PMMoV	16.2	gc/mL
dec 17	pH	8.2	unitless
dec 20	covid N1	34.6	gc/mL
dec 20	PMMoV	19.6	gc/mL
dec 20	pH	7.9	unitless

The same table but in a wide-format is shown here:

date	covid N1 (gc/mL)	PMMoV (gc/mL)	pH
dec 17	32.1	16.2	8.2
dec 20	34.6	19.6	7.9

With each additional field for recording data points, the wide-format table becomes wider, and the long-format table become longer.

Issues with moving between formats

Data visualizations and analysis workflows commonly use wide-data, and it can be faster to input data in wide-format because there is less repetition. This is because date, sample ID, and other fields are not reentered across rows. Long-format is preferred for storage and aggregation.

The diversity in how wide-data is recorded can impact transparency and interoperability. Particularly, this creates problems when merging data from multiple sources. When data is merged between labs, wide-format tables can have hundreds of columns and missing data in many fields. When transforming data from one format to another, there may be differences in naming conventions. This can lead to mistakes in how the data is mapped, and create errors in the database where information is being stored.

PHES-ODM approach to wide- and long-formats

The PHES-ODM is designed as a series of long-format tables, but aims to provide robust infrastructure to support wide-formats. This support includes managing mapping between long- and wide-formats. To streamline mapping between formats, we have created a standardized approach to generating wide-format column names (or wide-names).

For long-formatted data in the ODM, names are all made using camel case (ex: exampleName). Each long format name is a ‘part’, which are the building blocks of the data model. All wide-format names are snake case (ex: example_name). Each section of the snake case wide name is an individual part that is combined to form the fully-specified wide-name. Using this approach, wide-names can be parsed automatically to move and transform data between formats without introducing errors.

Wide-names

The exact formula used to generate a wide-name in the ODM depends on the kind of data point you are recording. The three kinds of data and metadata for these column names are either attributes, measures, or protocol steps. An attribute column is metadata. These columns provide context about a measure, protocol, or sample. Measure columns report a specific measure, either from a site, sample, or population. Lastly, protocol columns report a specific protocol step or detail for a measure.

Attributes

The wide-name formula for attributes is

tableShortName_attribute

**Figure 2 - Wide- to Long-format for Attributes**

For example, to record the sample material in long-format you would navigate to the samples table and populate the saMaterial column. In wide-format, all data are recorded in a single table. The table and attribute are combined for the wide-name, creating the name

sm_saMaterial

for this attribute. Here sm is the short name for the samples table, and saMaterial is the column name for the sample matrix or material.

Measures

The wide-name formula for a measure or measure-value is

compartment_specimen_fraction_measure_unit_aggregation_index_attribute

where the attribute at the end is always either value, purpose, or qualityFlag.

**Figure 3 - Wide- to Long-format for Measures**

The ODM long-format specifies a measure column in the measures table, as well as a column for units and one for aggregations. The wide-format collapses all these columns into a single value column.

For example, the wide-name for mean chemical oxygen demand from a wastewater sampling site, recorded in milligrams per litre is

wat_si_NR_cod_mgL_m_NR_value

Here the measure is taken from wastewater/water (wat), it’s a measure of the site (si), the fraction analyzed is not reported (NR), it is a measure of chemical oxygen demand (cod), the units are mg/L (mgL), it’s a mean measure (m), the index is not reported (NR), and lastly this is the value for this measure (value).

Protocol steps

Like measures, protocol step wide-names record the value field. With protocols, the wide-name formula depends on whether the step uses a measure or a method. The formula if using a method is:

tableShortName_partTypeShortName_method_attribute

The formula if using a measure is:

tableShortName_partTypeShortName_measure_unit_aggregation_index_attribute

where in both instances, the table short name always specifies the protocolSteps table (ps), and the partType short name specifies whether this is a measure (mes) or a method (met). The attribute is always value.

**Figure 4 - Wide- to Long-format for protocol**

For example, the wide-name for the sequencing strategy protocol step (ie. a protocol step using a method) for a wastewater program would be

ps_met_seqStrat_value

The protocol steps table short name indicates the long-format table (ps), the protocol step uses a method part type (met), it reports the sequencing strategy (seqStrat), and the column records the value (value).

There are various pros and cons to each format, but generally it is ideal for data to be stored and analyzed in a long-format table, while it can be easier to record the data in a wide-format table. Differences in how naming conventions work across formats can lead to mistakes in how the data is mapped, stored and used.

Similarly, the wide-name for a storage temperature (i.e. a protocol step using a measure), would be

ps_mes_temp_cel_sin_NR_value

The protocol steps table short name indicates the long-format table (ps), the protocol step uses a measure part type (mes), it reports the temperature (temp) in degrees Celsius (cel), it is a single measure (sin, i.e. not aggregated), the index is not reported (NR), and the column records the value (value).

For combined long- and wide-formats

The PHES-ODM is designed as a series of long-format tables, but it is our aim to be as robust and easy to use as possible. To avoid issues of incongruity when moving between wide-format data recording templates and the final long-format data storage tables, we have designed a ~~specific series~~ of formulae to generate wide-format column names (or wide-names) using the part IDs from long-format tables, headers, and values. This may, wide-names can be parsed automatically to reliably ~~move~~ translate data between formats ~~without introducing errors~~.

Many environmental surveillance and research programs use combined long- and wide-format tables. To account for this, wide-names can contain a “see header” part for each piece of their formulas. These reference the header they point to, promoting data readability for machines and humans.

Let’s return to the first example tables from the start of this page:

date	compartment	specimen	fraction	measure	value	unit	aggregation
dec 17	water	sample	liquid	covid N1	32.1	gc/mL	single
dec 17	water	sample	liquid	PMMoV	16.2	gc/mL	single
dec 17	water	sample	liquid	pH	8.2	unitless	single
dec 20	water	sample	liquid	covid N1	34.6	gc/mL	single
dec 20	water	sample	liquid	PMMoV	19.6	gc/mL	single
dec 20	water	sample	liquid	pH	7.9	unitless	single

This is long-format data, and I’ve added some additional columns to record the fraction analyzed, the compartment, the specimen, etc. Now if we were to move to a wide-format, we could use the formulae described above and generate this kind of table:

date	wat_sa_ liq_covN1_ gcMl_sin_ NR	wat_sa_ liq_pmmov_ gcMl_sin_ NR	wat_sa_ liq_pH_ unitless_ sin_NR
dec 17	32.1	16.2	8.2
dec 20	34.6	19.6	7.9

Now let’s look at an example of a mixed long- and wide-format table. Something like this might arise where the initial table was developed for ODM long-format, but ultimately it was decided that the program wanted to record multiple measures in a single row. For this, we have what looks like standard ODM long-format for the first eight columns, but then the wide-format for the last two.

date	compartment	specimen	fraction	measure	value	unit	aggregation	PMMoV (gc/ML)	pH (unitless)
dec 17	water	sample	liquid	covid N1	32.1	gc/mL	single	16.2	8.2
dec 20	water	sample	liquid	covid N1	34.6	gc/mL	single	19.6	7.9

In this example, you can see that compartment, specimen, and fraction analyzed are applied also to the PMMoV and pH columns, as well as the measure column. In these instances a measure of PMMoV where the fraction analyzed in specified in a different column would use the wide-name:

wat_sa_hFr_covN1_gcml_m_value

Here the hFr indicates by its placement and the letters used that the fraction analyzed is specified elsewhere.

Another example is for data is reported in long-format for one measure, but wide-format for all others. In this situation the fraction analyzed, the measure, the units, and the aggregation are all specified in other columns. Rather than simply reporting value as you would an attribute, you would use the wide name:

wat_sa_hFr_hMr_hUn_hAg_NR_value

to better link all the metadata pieces in other columns to this specific measurement. So to recreate the above mixed-format table with the ODM coded wide-names, we would get:

mr_ report Date	mr_ compartment	mr_ specimen	mr_ fraction	mr_ measure	hCo_hSp_ hFr_hMr_ hUn_hAg_ NR_value	mr_ unit	mr_ aggregation	hCo_hSp_ hFr_pmmov _gcMl_sin_ NR_value	hCo_hSp_ hFr_pH_ unitless_ sin_NR_ value
dec 17	water	sample	liquid	covid N1	32.1	gc/mL	single	16.2	8.2
dec 20	water	sample	liquid	covid N1	34.6	gc/mL	single	19.6	7.9

Wide-name exceptions

There is a diversity of wide-templates across labs. Sometimes these headers can combine two ODM attributes, or two or more methods or measures. This complicates mapping to the ODM, but unique wide-names can be generated for these as well.

Attribute naming exceptions

For combined attributes, the regular attribute formula (tableShortName_attribute) becomes

tableShortName_n_aggregation_attribute1_..._attributeN

Where n is an integer indicating the number of attributes being combined together, the aggregation is a boolean aggregation of either OR or AND to indicate whether the column records all of the attribute, or one or another. Finally, the attributes are listed in snake case. This wide-names is still machine readable.

For example, a wide-column name where sample collection period and collection number are reported together would be:

sm_2_AND_collPer_collNum

For _AND_ style wide-names, the default separator must be a period (.). So for the is example, where we are recording the sample collection period and number together, a value of a 24-hour period with 12 collections would be 24.12.

Measure and method naming exceptions

For combined measures or methods, the ..._n_... approach is also used.

For measures, the formula is: compartment_specimen_fraction_n_aggregation(bool)_measure1_measure2_..._measureN_unit_aggregation_index_attribute, where all measures must share the same units and aggregation. For example, a column to measure the presence of “other” mutations, alleles, or variants uses the wide-name wat_sa_liq_3_OR_otherM_otherA_otherV_gcMl_m_value.

For Methods the use case is slightly more constrained. Protocol steps are rarely combined into a single field. Protocol steps using one or more measures which report a single value would be highly unlikely to exist. Protocol steps using methods combinations only work and make sense because the inputs are categorical and defined. The suggested structure for a protocol step using only methods (and no measures) would follow the formula tableShortName_partTypeShortName_n_aggregation_method1_..._methodN_attribute.

For example a column where one specifies the protocol step for either pcr or sequencing strategy in a wastewater program would be called ps_met_wat_sa_liq_2_OR_pcrmeth_seqStrat_value.

In instances of OR wide names, the values sets for the different measures or methods are distinct, and the sets information for the input can be used to determine which method or measure is being reported.

Tutorials

Here’s an example to walk you through moving from long-format to wide-format. Try to go through slowly and see if you can determine the correct answer before you read it.

Building a wide-format ODM headers

We’re going to transform an example long-format measure table to wide-format. In figure 5 we see the list of all the possible headers in the measures table. Not all of the headers are mandatory, so you may choose not to use them all. Our tutorial here is just to write out what the wide name would be for each of these headers.

**Figure 5 - Headers in the Measures Table**

Everything in this table is an attribute, except for (eventually) the value field. We know that attributes use the structure:

tableShortName_attribute

So with that in mind, see if you can type up or write out what you think the ODM wide name would be for all of these headers - skipping the value header. Keep in mind that for headers that are foreign keys (noted FK in figure 5), that the table short name used should be the table in which those attributes are a primary key. You may want to consult the ERD and the parts list to double check which tables host the primary keys and what the short names are for those tables.

Once you have tried coming up with the wide names, check your answers against the equivalent wide names listed below:

mr_measureRepID
pr_protocolID
sm_sampleID
pa_purpose
po_polygonID
st_siteID
ds_datasetID
ms_meaureSetRepID
mr_aDateStart
mr_aDateEnd
mr_reportDate
pa_specimen
pa_fraction
pa_group
pa_class
pa_measure
pa_unit
pa_aggregation
pa_nomenclature
mr_index
mr_mesureLic
mr_reportable
mr_organizationID
mr_contactID
mr_refLink
mr_lastEdited
mr_notes

Building a wide-format ODM table

Now lets try an example using a shortened version of the measures table, populated with some fake data.

date	compartment	specimen	fraction	measure	value	unit	aggregation	index
dec 17	water	sample	liquid	Mpox	32.1	gc/mL	single	1
dec 17	water	sample	mixed	PMMoV	16.2	gc/mL	single	2
dec 17	air	sample	NA	s2083i omicron-variant gene target	8.2	gc	single	NA
dec 20	water	site	liquid	Flow rate	34.6	ML/day	arith. mean	NA
dec 20	surface	sample	NA	Influenza virus A1	19.6	gc/mL	single	NA
dec 20	water	sample	liquid	pH	7.9	unitless	single	2

Everything in this mini-table is part of a measurement-value wide-name, except for date. We know that measurements use the structure:

compartment_specimen_fraction_measure_unit_aggregation_index_attribute

So with that in mind, see if you can type up or write out what you think the ODM wide name would be for all of these headers - including date as an attribute. You may want to consult the ERD and the parts list to double check which tables host the primary keys and what the short names are for those tables.

The answer key for the wide version of this mini-table is here:

mr_ reportDate	wat_sa_ liq_mpox _gcml_sin_ 1_value	wat_sa_ mix_pmmov _gcml_sin_ 2_value	air_sa_ NA_s2083i_ gc_sin_NA _value	wat_si_ liq_floRate _mld_me_ NA_value	surf_sa_ NA_fluA1_ gcml_sin_ NA_value	wat_sa_ liq_pH_ unitless_ sin_2_ value
dec 17	32.1	16.2	8.2	NA	NA	NA
dec 20	NA	NA	NA	34.6	19.6	7.9

How-To Guides

The wideNames table has been added for the v2.3 release, and includes the wide-names that are in current use of which we are aware. The table has also been set up with drop down menus, similar to the data input templates. This way, users can fill out the information for a given header, and have the ODM-formatted wide name generated automatically.

Wide-name Templates

Please check out the wide names table and tamplate here [LINK TBA], and don’t forget to let us know what wide names you are using, so that we can add them to the list.