Today, billions of people around the world use digital networks. In doing so, they constantly generate large amounts of metadata. The term “transparent citizen” is sometimes used to describe the resulting data protection risk.
By analyzing collected metadata with artificial intelligence, today’s information systems can predict the behavior of individuals. This poses a serious threat to the privacy of citizens and to democracy. Still, metadata in itself isn’t a bad thing. In this article, we’ll look at what metadata is.
What is the difference between metadata and data?
Metadata: This term refers to information that complements actual data. Often, metadata provides more detail about the context of the content or gives instructions on how to handle the data. Thus, metadata plays a major role both in IT and in traditional data processing (such as library catalogs or the postal system).
To better understand the term metadata, let’s take a concrete example: you send a letter by post. The document contained in the envelope then corresponds to the actual primary data. These data are private and legally protected against access by third parties. The secrecy of correspondence applies.
The envelope contains the metadata of the letter. This is additional data that accompanies the primary data:
- Address and sender
- Stamp, postmark
- If necessary, additional markings such as bar codes
As you can see, all of this data is what makes it possible to send the letter in the first place. The letter’s metadata is visible to any outsider. It is thus not protected by the secrecy of correspondence, even though it is part of the same correspondence.
How dangerous is metadata?
In itself, it is not a problem if a single piece of metadata is readable.
For example, if a third party has had access to one of your envelopes, this is usually not a cause for concern. However, things get complicated if this person has access to all of your envelopes and can store and study them. Patterns then emerge that say a lot about an individual’s behavior: who communicated with whom, and when? Networks and communication chains can thus be identified.
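To illustrate how such patterns emerge, here is a minimal sketch in Python. The envelope records and the function names `contact_frequencies` and `contacts_of` are made up for this example. Note that no letter content is involved at any point.

```python
from collections import Counter

# Hypothetical envelope metadata: (sender, recipient, date).
envelopes = [
    ("alice", "bob", "2021-02-01"),
    ("alice", "bob", "2021-02-03"),
    ("bob", "carol", "2021-02-03"),
    ("alice", "bob", "2021-02-08"),
]


def contact_frequencies(records):
    """Count how often each sender wrote to each recipient."""
    return Counter((sender, recipient) for sender, recipient, _date in records)


def contacts_of(person, records):
    """Return everyone a person exchanged mail with, in either direction."""
    partners = set()
    for sender, recipient, _date in records:
        if sender == person:
            partners.add(recipient)
        elif recipient == person:
            partners.add(sender)
    return partners


# The most frequent sender/recipient pair and one person's network.
print(contact_frequencies(envelopes).most_common(1))
print(contacts_of("bob", envelopes))
```

A single record reveals little, but a few lines of counting over many records already expose who is in regular contact with whom.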
The distinction between data and metadata is fluid. The classification depends on the context and the respective perspective. Here is another example. A book contains primary data, such as the title of the book and its content. In addition, a series of metadata is available when publishing a book:
- Author
- Publishing house
- Date and place of publication
- Edition
- ISBN
Imagine that the metadata of many publications is gathered in a database. Within that database, the information about each publication would be primary data. Additionally, there would be a new set of metadata for each entry. For example, the database could record when each entry was added and by which user.
What types of metadata exist and how are they used?
Metadata is present in all areas of data storage and processing. Its uses cannot be listed exhaustively. Here we mention three main areas of use:
1. Provide the context of the information
Metadata often describes the process that led to the creation of the information. For example, think about the geographic coordinates with which digital photographs are labeled. Because this context may be impossible to reconstruct once it is lost, it is saved along with the data.
2. Keep information available that would otherwise have to be computed in a complex way
Think about the playing time of a video. It is embedded as time information in the video file. If we didn’t record it, we would have to calculate it. A realistic approach would be to count the number of frames and divide it by the frame rate, which is a relatively large effort.
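The calculation itself is simple once both values are known. The expensive part is obtaining the frame count, which requires reading through the file. A small sketch follows, with the function name `playing_time_seconds` and the sample numbers chosen for illustration:

```python
def playing_time_seconds(frame_count: int, frames_per_second: float) -> float:
    """Playing time = number of frames divided by the frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return frame_count / frames_per_second


# Hypothetical clip: 4,500 frames at 25 frames per second.
print(playing_time_seconds(4500, 25.0))  # 180.0 seconds, i.e. 3 minutes
```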
3. Link information together to facilitate research and discovery
The main objective here is to support human-readable information with machine-readable data. The goal is to make connections between pieces of information using automated processes. In particular, structured data is used and linked together to form a “semantic web”.
Metadata describing digital images
Images taken with digital cameras and smartphones contain a variety of metadata. On the one hand, there is technical data, such as the dimensions of the image, the camera used, the focal length, etc. It is defined in the EXIF standard and is created automatically by the camera. On the other hand, the IPTC standard defines metadata that describes the content of the photo and is entered by the user.
Standard | Image metadata | Creation |
---|---|---|
EXIF | Image information, such as dimensions, color space, color channels, etc.; photographic information, such as exposure time, aperture, ISO, etc. | Automatic during recording |
IPTC | Keywords, copyright, location and time information, content descriptions, etc. | Manual by user |
Care should be taken when sharing digital images: the metadata of a photo can reveal private information about its author, such as where it was taken. Many apps and social networks automatically strip metadata from images on upload. However, you should not rely on this; it is better to remove the information yourself with a dedicated tool before sharing.
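To give an idea of what such a tool does, the following sketch removes the EXIF block from a JPEG file using only the Python standard library. It relies on the fact that JPEG files store EXIF data in an APP1 segment whose payload starts with `Exif\0\0`. The function name `strip_exif` is an assumption for this example, and the code is deliberately simplified: it ignores rare marker layouts and leaves other metadata (for example IPTC or XMP) untouched. For real images, a maintained tool is the safer choice.

```python
import struct

SOI = b"\xff\xd8"          # start-of-image marker
APP1 = 0xE1                # segment type that carries EXIF data
SOS = 0xDA                 # start of scan: compressed image data follows
EXIF_PREFIX = b"Exif\x00\x00"


def strip_exif(jpeg: bytes) -> bytes:
    """Return a copy of the JPEG bytes without EXIF APP1 segments."""
    if not jpeg.startswith(SOI):
        raise ValueError("not a JPEG file")
    output = bytearray(SOI)
    position = 2
    while position + 4 <= len(jpeg):
        if jpeg[position] != 0xFF:
            raise ValueError("unexpected byte where a marker should be")
        marker = jpeg[position + 1]
        if marker == SOS:
            # Everything from here on is image data; copy it unchanged.
            output += jpeg[position:]
            return bytes(output)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[position + 2:position + 4])
        if length < 2:
            raise ValueError("invalid segment length")
        segment = jpeg[position:position + 2 + length]
        is_exif = marker == APP1 and segment[4:10] == EXIF_PREFIX
        if not is_exif:
            output += segment
        position += 2 + length
    raise ValueError("no image data found")
```

A typical use would be to read a file in binary mode, pass its bytes to `strip_exif`, and write the result to a new file before sharing it.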
Metadata embedded in digital videos
A video file usually consists of a container that contains various data. The primary data of a video is the encoded video and audio content. In addition, there is some metadata:
- Video playback time
- Data rate and image dimensions
- Information on the audio and video codec used
- Subtitles, possibly in several languages
Metadata associated with files
In a digital system, a file consists of two main pieces of data: the contents of the file and its name. Each file is also associated with a series of metadata. This file metadata is managed by the operating system and is also called “file attributes”. Here’s an overview of common file metadata:
File metadata | Description |
---|---|
Timestamps | Creation, modification, and last access |
Storage location | The file path in the file system |
Ownership | Owner and group |
File access rights | Read, write, execute; each for the owner, the group, the others |
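Most of these attributes can be read with a few lines of Python. The sketch below uses `os.stat` from the standard library; the function name `file_metadata` and the dictionary keys are chosen for this example. Creation time is left out because it is not available in the same way on all operating systems, and the owner and group IDs are only meaningful on Unix-like systems.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Collect common file attributes reported by the operating system."""
    info = os.stat(path)

    def as_utc(timestamp):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()

    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified": as_utc(info.st_mtime),
        "last_accessed": as_utc(info.st_atime),
        "owner_id": info.st_uid,
        "group_id": info.st_gid,
        # Read/write/execute flags for owner, group, and others.
        "permissions": stat.filemode(info.st_mode),
    }


# Inspect a throwaway file so the example does not depend on existing files.
with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
    handle.write(b"hello")
    handle.flush()
    for key, value in file_metadata(handle.name).items():
        print(f"{key}: {value}")
```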
In addition to file attributes, some file types include specific metadata. These are managed by the respective application program. Even with this metadata, there is a risk of disclosure of confidential information during transmission.
Metadata generated when sending emails
Just like the classic postal letter, an email has two main elements:
- The body of the email
- The header of the email
Here, the body contains the actual message, which corresponds to the letter contained in the envelope. The header contains the addresses of the sender and recipient, like the envelope. Some of the information in the header can easily be tampered with so that the email appears to the recipient to come from another sender, a trick often used in spoofing attacks.
The email header usually contains a lot of other metadata like:
- Various timestamps
- Information on the formatting and encoding of the message
- Stations through which the email passed during transmission
- Email evaluation by spam filters
- Information indicating whether the email has been scanned by an antivirus
Header metadata is written and read by the server software and application programs. The information gathered during this process reveals a lot about an email and the path it has taken through the Internet.
Among other things, this information can be used to assess the authenticity of an email and whether it was transmitted confidentially. Additionally, the header can contain the hostname of the user’s device and thus reveal where the email was sent from.
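Header fields can be read with Python’s standard `email` package. In the following sketch, the sample message, its host names, and the function name `header_summary` are invented for illustration. Each `Received` line stands for one station the message passed through.

```python
from email import message_from_string

# Invented sample message; addresses and host names are placeholders.
raw_email = """\
Received: from mail.example.org (mail.example.org [192.0.2.10])
\tby mx.example.com with ESMTP; Mon, 01 Feb 2021 12:13:40 +0000
Received: from laptop.local (unknown [198.51.100.7])
\tby mail.example.org with ESMTPSA; Mon, 01 Feb 2021 12:13:36 +0000
From: Alice <alice@example.org>
To: Bob <bob@example.com>
Subject: Lunch?
Date: Mon, 01 Feb 2021 12:13:34 +0000
Content-Type: text/plain; charset="utf-8"

See you at noon.
"""


def header_summary(raw):
    """Extract a few metadata fields from the header of a raw email."""
    message = message_from_string(raw)
    return {
        "from": message["From"],
        "to": message["To"],
        "date": message["Date"],
        # One entry per server that handled the message, newest first.
        "hops": message.get_all("Received", []),
        "content_type": message.get_content_type(),
    }


summary = header_summary(raw_email)
print(summary["from"], "->", summary["to"])
for hop in summary["hops"]:
    print("hop:", " ".join(hop.split()))
```

In this sample, the oldest `Received` entry contains the name of the sending device (`laptop.local`), which is the kind of detail that can reveal where an email was sent from.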
Metadata generated when visiting a website
From a technical point of view, visiting a website involves fetching an HTML document. The user’s browser retrieves the document from a server at the specified address. The HTTP or HTTPS protocol is used for this purpose.
In addition to the actual HTML document that is displayed in the browser, metadata called the HTTP header is transferred. The HTTP headers are comparable to the fields of the header of the email. They contain information about the encoding, transmission, encryption, and compression of the HTTP connection.
In addition, metadata is generated during the transfer and accumulates on the server. This includes the log files in which accesses to the server are recorded and which form the basis for log file analysis. For each access, another line is written to the log file. The browser also usually triggers queries to a DNS server. Here, too, metadata is generated and possibly stored and evaluated by the operator of that server.
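What such a log line contains depends on the server configuration. As an illustration, the sketch below assumes the widely used Common Log Format; the sample line, the pattern, and the function name `parse_log_line` are assumptions for this example and not taken from a specific server.

```python
import re

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return the fields of one access log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    '192.0.2.44 - - [01/Feb/2021:12:13:34 +0000] '
    '"GET /index.html HTTP/1.1" 200 148'
)
print(parse_log_line(sample))
```

Even this short line records who requested which document and when, which is why access logs count as personal data in many contexts.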
Confusingly, in addition to the HTTP header already mentioned, there is also the HTML header. While the former refers to the connection, the latter contains metadata describing the content of the document.
Here is an example of a response from an HTTP server. The first lines are the HTTP header. After a blank line comes the HTML source code with the HTML head and body elements:
HTTP/1.1 200 OK
Date: Mon, 01 Feb 2021 12:13:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 148
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Accept-Ranges: bytes
Connection: close

<html>
<head>
<title>page example</title>
</head>
<body>
<p>human readable text in the documents body</p>
</body>
</html>
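The separation between connection metadata and document can be reproduced in a few lines. The following Python sketch splits a raw response at the blank line that ends the HTTP header. The shortened sample response and the function name `split_response` are assumptions for this example.

```python
# Shortened sample response; header lines end with CRLF as in HTTP/1.1.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<html><head><title>page example</title></head>"
    "<body><p>human readable text</p></body></html>"
)


def split_response(raw):
    """Split a raw HTTP response into status line, header fields, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


status, headers, body = split_response(raw_response)
print(status)
print(headers["content-type"])
print(body)
```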
Importance of website metadata for SEO
In this section, we’ll focus on the metadata that is embedded in an HTML document. We leave aside the HTTP metadata already mentioned, as well as the metadata generated by the server, such as log files. Usually, HTML metadata is embedded in the header of the HTML document.
Many elements found in the HTML header are directly used for search engine optimization. Search engine robots crawl the content of an HTML document. The human-readable part present in the body of the HTML is extracted and indexed. In addition, there is special metadata that is intended exclusively for bots.
In what follows, we distinguish between “classic” and “modern” variants.
Mapping website metadata with classic HTML header elements
Classic HTML header elements include the title and a handful of meta tags. The title is also visible to the user in different forms. For example, it is displayed in bookmarks or in the header of the browser tab. The other classic “<meta>” tags are used exclusively for search engine optimization. Here is an overview of the main classic elements of the HTML header:
Meta | Description | Importance |
---|---|---|
<title> | Document title, displayed in search results | Very high |
<meta name="description"> | Document description, displayed in search results | Very high |
<meta name="keywords"> | Document keywords, not displayed in search results | Low |
<meta name="robots"> | Instructions telling search engine robots how to process the document | Very high |
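To see roughly what a crawler reads from these elements, the sketch below collects the title and the named meta tags of a page with Python’s built-in `html.parser` module. The class `HeadMetadataParser`, the helper `extract_head_metadata`, and the sample page are made up for illustration; real crawlers are considerably more tolerant of broken markup.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and all <meta name="..."> entries."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html_source):
    """Return (title, meta dictionary) for an HTML document."""
    parser = HeadMetadataParser()
    parser.feed(html_source)
    return parser.title.strip(), parser.meta


page = """<html><head>
<title>page example</title>
<meta name="description" content="A short summary shown in search results.">
<meta name="robots" content="noindex, follow">
</head><body><p>human readable text</p></body></html>"""

title, meta = extract_head_metadata(page)
print(title)
print(meta)
```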
Mapping website metadata with modern HTML header elements
Besides the classic HTML header elements, there are various other elements that are used today to include metadata on a website. Search engine operators and big tech companies are constantly defining new metadata. The “<meta>” and “<link>” elements are ideal for this, as they can be expanded.
Here’s a look at commonly used modern website metadata:
Meta | Description | Importance |
---|---|---|
<link rel="canonical"> | Canonical tag to avoid duplicate content | Critical if there is duplicate content |
<link rel="alternate" hreflang="fr"> | Specifies alternate language versions of the same document via hreflang | Optional |
<meta property="og:…"> | Open Graph for social media posts | Optional |
With the ‘<meta>’ element, the specific type of metadata is determined via the ‘name’ attribute. The ‘rel’ attribute is used in the same way for the ‘<link>’ element. Depending on the metadata standard used, there are two alternate spellings for the ‘<meta>’ element. We summarize them here:
Spelling | Metadata standard |
---|---|
<meta name=""> | HTML5 |
<meta property=""> | RDFa |
<meta itemprop=""> | HTML Microdata |
Define website metadata with Open Graph
Open Graph is a protocol developed by Facebook to enrich a web document with metadata.
Open Graph data provides information that displays as a preview when sharing the document on social media. In this way, it is possible to specify optimized images, titles, and descriptive texts.
This makes sense because, depending on the platform, specific restrictions apply in terms of text length, image dimensions, etc. This protocol is widely used by Facebook and Twitter. Here’s a look at the essential Open Graph metadata:
Open Graph metadata | Explanation |
---|---|
<meta property="og:title"> | Object title |
<meta property="og:type"> | The type of object, such as an image, web document, video, etc. |
<meta property="og:image"> | An image representing the object |
<meta property="og:url"> | The canonical URL of the object |
Tip: If you run into errors when sharing your web content on Facebook, the problem is often caused by incorrect Open Graph markup. In this case, a simple trick can help: log into your Facebook account and use the Sharing Debugger to make Facebook re-read the Open Graph information.
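One way to avoid malformed Open Graph markup is to generate the tags from data instead of writing them by hand. The sketch below is one possible approach in Python; the function name `open_graph_tags` and the example URLs are assumptions. Values are escaped so that quotation marks or ampersands in a title cannot break the attribute.

```python
from html import escape


def open_graph_tags(properties):
    """Render a dictionary such as {"title": "..."} as og: meta elements."""
    lines = []
    for name, value in properties.items():
        lines.append(
            f'<meta property="og:{escape(name)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


# Placeholder values for the four essential properties.
article = {
    "title": "What is metadata?",
    "type": "article",
    "image": "https://example.com/preview.png",
    "url": "https://example.com/metadata",
}
print(open_graph_tags(article))
```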
Define website metadata with rich cards
Besides Open Graph, there are Rich Cards, a metadata format introduced by Google. Rich Cards enrich a web document with structured metadata. For example, a restaurant’s website can be supplemented with information on location, prices, opening hours, etc. Rich Card information can be placed in the header or in the body of the HTML document.
Technically, Rich Cards are based on the Schema.org metadata vocabulary. Different formats can be used to mark up the metadata. Alongside the older formats RDFa and Microdata, the more recent JSON-LD is the most suitable, and its use is officially recommended by Google.
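As an illustration of the restaurant example, the following Python sketch builds a JSON-LD block from a dictionary. The restaurant data is invented, and the function name `json_ld_script` is an assumption. The property names (`name`, `address`, `priceRange`, `openingHours`) come from the Schema.org vocabulary; which properties a search engine actually evaluates should be checked against its current documentation.

```python
import json


def json_ld_script(data):
    """Wrap structured data in a JSON-LD script element for an HTML page."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so that no value can close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Invented example data for a restaurant.
restaurant = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Example Bistro",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "1 Example Street",
        "addressLocality": "Exampletown",
    },
    "priceRange": "$$",
    "openingHours": "Mo-Sa 11:00-22:00",
}
print(json_ld_script(restaurant))
```

Unlike RDFa and Microdata, which are woven into the visible markup, the JSON-LD block sits in one place, which makes it easier to generate and maintain.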