Data integrity assessment for maritime anomaly detection

Clément Iphar, Cyril Ray, Aldo Napoli

Journal Pre-proof

PII: S0957-4174(20)30045-2
DOI: https://doi.org/10.1016/j.eswa.2020.113219
Reference: ESWA 113219
To appear in: Expert Systems With Applications
Received date: 2 April 2019
Revised date: 28 December 2019
Accepted date: 17 January 2020

Please cite this article as: Clément Iphar, Cyril Ray, Aldo Napoli, Data integrity assessment for maritime anomaly detection, Expert Systems With Applications (2020), doi: https://doi.org/10.1016/j.eswa.2020.113219

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier Ltd.

Highlights
• AIS data assessment through data integrity and veracity methods is efficient
• A proper maritime situation awareness requires a variety of flags to be assessed
• Maritime traffic anomaly visualisation is important for efficient decision-making

Clément Iphar
Centre for research on risks and crises (CRC). MINES ParisTech - PSL Research University. Sophia Antipolis, France

Cyril Ray
French Naval Academy Research Institute (IRENav). Brest, France

Aldo Napoli
Centre for research on risks and crises (CRC). MINES ParisTech - PSL Research University.
Sophia Antipolis, France

Abstract

In recent years, systems broadcasting mobility data have faced a rise in cyberthreats, jeopardising their normal use and putting both users and their environment at risk. In this respect, anomaly detection methods are needed to assess such systems. In this article, we propose a rule-based method for data integrity assessment, with rules built from the system's technical specifications and by domain experts, and formalised in a logic-based framework, resulting in the triggering of situation-specific alerts. A use case is proposed on the Automatic Identification System, a worldwide localisation system for vessels, whose poor level of security allows errors, falsifications and spoofing scenarios. The discovery of abnormal reporting cases aims to assist marine traffic surveillance, preserve human life at sea and mitigate hazardous behaviours against ports, offshore structures and the environment.

Keywords: Data falsification, integrity assessment, AIS.

∗ Corresponding author: Clément Iphar. Phone: +39 0187 527 391. 1 rue Claude Daunesse, 06560 Sophia Antipolis, France
Email addresses: clement.iphar@mines-paristech.fr (Clément Iphar), cyril.ray@ecole-navale.fr (Cyril Ray), aldo.napoli@mines-paristech.fr (Aldo Napoli)
Preprint submitted to Expert Systems with Applications, January 18, 2020

1. Introduction

Volume, Velocity, Variety and Veracity are the four traditional challenges associated with Big Data. Volume refers to the total amount of data to be processed, which is ever increasing, with companies collecting petabytes of data each day and the total amount of data created exceeding the exabyte (McAfee & Brynjolfsson, 2012). Velocity concerns the ability to handle, gather and exploit data, and becomes more and more important as the volume of data increases.
The Variety challenge covers the various formats that data can take (images, text messages, signals, amongst others): the rise of digital information, at the origin of the explosion of data volumes, has generated data of various types that have to be handled efficiently. The Veracity challenge is linked to the inner value of data, i.e. the fact that a piece of information is truthful and correctly depicts the phenomena measured or represented in the way it is expected to.

Nowadays, with the flows of data being created, the will to extract meaningful information through data science methods has risen. However, the quality of this information is tightly linked to the quality of the data it is extracted from, and data quality assessment has become a key feature in the design of information systems. Trust and confidence in data are the cornerstone of the user's trust in the outcomes of any analysis. Therefore, information systems must incorporate a layer of data integrity analysis that gives the user a knowledgeable understanding of the pieces of information eventually presented to him or her.

The development of cybersystems creates the need for means of protecting those systems and responding to an attack, as well as assessing the potential issues resulting from a variety of attacks. Those means, as a whole, constitute cybersecurity. A cyberattack usually consists in the access, the change, the diffusion or the destruction of potentially sensitive information (Toumsi & Rais, 2018). Money extortion, intelligence or the interruption of usual business processes are usually the main reasons for cyberattacks, although in some cases the attribution of the attacks is not clear (Rid & Buchanan, 2015), despite its importance in implementing further security layers.
Such attacks can be ideologically motivated (Holt et al., 2019), and it is difficult to measure the extent to which cyberattacks occur, as firms tend to under-report such attacks and only make them public when investors already suspect their existence with a high likelihood (Amir et al., 2018). The attacks can target the critical infrastructures of countries (Maglaras et al., 2018) or the daily life of citizens, with cyberthreats existing in various areas such as domotics (Arabo, 2015) or street furniture (Comert et al., 2018). Cyberattacks on mobile objects have been thoroughly studied for cars (Petit & Shladover, 2015), airplanes (Waheed & Cheng, 2017) and vessels (Costé, 2018).

Grounded in the theory of Situation Awareness developed by Endsley (1995), which is based on a descriptive view of decision making, the detection and classification of abnormal behaviours is a key task of any situational awareness system, for several reasons such as the extraction of relevant contextual information and the proper monitoring of both self-reporting systems and non-cooperative systems. The eventual purpose of this data processing is to design a decision-making system that provides an operator, who is in charge of monitoring the system, with qualitative information in a quantitatively measured fashion. The qualitative factor represents the usefulness of each piece of information and the quantitative factor is the amount of information presented to the operator. The operator must therefore receive information of sufficient quality both to make a decision and to understand the underlying meaning of the data handled, through evaluation criteria; at the same time, the amount of information presented must be sufficient to take an informed decision yet reasonable with respect to the cognitive capacities of a person.
Cooperative mobile data have recently been on the rise in several fields, such as pedestrians, goods transportation, cars, vessels and airplanes. These data are subject to anomalies, misuses and falsification, and anomaly detection methods can in this respect be used to assess them. Data streams from sensors have various qualities, and assessing this data quality with respect to the nature of the sensor is necessary in order to construct analysis frames that take all available information into consideration, so that falsification issues and attack issues can be clearly discriminated. Since falsifications and attacks address different issues originating from different sources, it is particularly important to differentiate them as soon as possible so that the relevant methods can be applied for data analysis.

In general, machine learning techniques are widely used for data analysis (Kotsiantis et al., 2006). Those techniques include regression, classification, clustering, deep learning, image processing and natural language processing. Since few usable data are available on maritime cybersecurity issues for building a model and training an algorithm, this field of study lends itself to alternative methods that do not require such a training dataset. In this work, a base of rules in description logics is used to assess data. The approach suits cases where the understanding of the situation must be contextualised in an inference-based system. Description logics, by their nature and their wide use in ontology building (Baader et al., 2004), enable a formal and unambiguous description of expert rules. This base of rules offers better interpretability and a better understanding of the results compared with machine learning techniques, as it is possible to directly link one rule to an actual natural language situation.
In this paper, a base of rules has been built with the help of several military experts.

With the multiplication of low-cost sensors, surveillance systems are on the rise, particularly collaborative systems that require little equipment. In the maritime domain, the Automatic Identification System (AIS) is a legally enforced system put in place by the International Maritime Organization (IMO, 2003). As a large source of data on maritime navigation, this system is widely used for the understanding of the maritime situation. Its high rate of transmission and vast network of receiving antennas allow a large harvest of AIS messages, which enables a precise tracking of vessels on both short and large geographic and temporal scales. However, this system is very weakly secured and is therefore prone to issues and attacks such as erroneous information, data falsification and data spoofing. In spite of those issues, its data are widely used as a basis for maritime studies, with little questioning of their quality. In this respect, a data integrity analysis would put the blind use of AIS information into perspective and highlight the main issues that the system faces, so that action can be taken to mitigate the risks linked to an improper use of such maritime information and, on a larger scale, so that the type of issue faced by the system can be assessed and a user can take targeted action. Research has demonstrated that AIS is vulnerable: it is prone to spoofing (Bhatti & Humphreys, 2017), with missing (Lecornu et al., 2013), collided (Last et al., 2015), erroneous (Harati-Mokhtari et al., 2007) and falsified (Katsilieris et al., 2013) messages. Although few cases are reported (for example in gCaptain (2018) in the East China Sea, in Wired (2017) off the Russian coast and in Llyodslist (2019) in the Strait of Hormuz), this vulnerability has a concrete impact on maritime navigation.
The research question addressed by this paper is how to design an information system for decision support that integrates anomaly detection and falsification-discovery mechanisms based on data quality, in order to alert the user that the pieces of information displayed are possibly non-genuine. This research has been applied to an AIS dataset made up of messages received by our antenna and parsed with an in-house parser.

In the remainder of this paper, Section 2 introduces the added value of anomaly detection mechanisms in decision-support systems, addressing the issues of trust in information, data quality dimensions and methods for the detection of anomalous events. Section 3 presents the AIS (Automatic Identification System), which is the most important source of information on vessels at sea. As our study is based upon this system, its issues in relation to data quality are presented. Section 4 presents the proposed methodology for the design of an information system which assesses data, extracts relevant pieces of information and presents them to an operator as a decision support tool, taking into consideration the specificities of the system studied, going from the assessment of data fields to the evaluation of some selected scenarios. Section 5 explains the implementation of the methodology and the way data are processed and results are presented from an architectural point of view. Section 6 illustrates, before concluding remarks, the results of some data analyses conducted with a 6-month AIS dataset, and discusses the design, the results and the application of this system in a real-world case.

2. Anomaly detection for enhancing decision-support systems

In order to be efficient, an anomaly detection process must assess data within a predetermined frame which allows a classification of issues under a clear terminological framework.
This section presents the basic definitions of trust, which lead to the selection of a subset of dimensions relevant for anomaly detection, presented and contextualised so that they can fit the frame of anomalous events.

2.1. Trust in information

The notion of trust in the source of information is important, as users pay close attention to the trustworthiness of the different sources. Those sources can be written documents, humans or machines, and can be easy or hard to access. The pieces of information can also be first-hand or second-hand. There is however no clear and straightforward definition of trust, and it tends to vary between people and between domains (Blomqvist, 1997).

Simple access to the source is not sufficient to assess trustworthiness; ideally, the way in which people access sources must enable them to form an opinion about the source, and therefore to assess its trustworthiness (Hertzum et al., 2002). When a user cannot collect information about the source, an absence of trust or even distrust can appear. With the development of computers and services, there is a tendency to rely more and more on Internet data and applications in order to find information. An abundance of information can be found; however, across the wide spectrum of data sources, many pieces of information may present contradictory opinions about the same topic. Users therefore have to look for hints when assessing the trustworthiness of online information.

Trust is fundamentally a social relation. As demonstrated by Denize & Young (2007), trust is thoroughly embedded in the processes of information exchange, communication and decision-making. Machines and sensors, which support these processes, should not be trusted.
However, with the development of digital technologies, users behave towards them in a way close (albeit not identical) to the way they would behave towards another human being, and people rely more and more on information given by electronic devices (McKnight, 2005). The use of technology is thus directly affected by the trust that the user has in it (Kelton et al., 2008). As technology combines physical components and encoded programs, both the digital (software) and the physical (hardware) parts can be assessed. As the digital part is composed of information, trust in the technology corresponds to trust in information (Kelton et al., 2008). Trust in that context can be expressed through dimensions, representing data quality at large, which are presented in Section 2.2. For instance, in Costé et al. (2016), two dimensions emerged as being important: trustworthiness, which is the degree to which the sensor will be truthful, and competence, which is the level of expertise of a sensor or system component in the subject at hand.

2.2. Data quality and its dimensions

The quality of data can be divided into two parts: external quality (quality from the point of view of the user) and internal quality (quality from the point of view of the supplier).

Internal quality generally relies on concision, clarity, generality, cohesion and simplicity (Devillers, 2004). Metadata are often used to transmit data quality information. Their use and understanding are however not easy, even for experts. Internal quality can be described by answering the question: how can I measure the quality of my data and how can I signify it? Internal quality is an absolute technical quality.
External quality covers, amongst others, ease of use, reliability, accuracy, conformity to expectations, robustness and openness. External quality can thus be considered as fitness for use, which amounts to answering the question: what are the needs of the users regarding data quality and information quality, and how can I convey it in order to prevent them from misusing the data? Because needs are multiple and varied, external quality is more difficult to assess, as it implies linking data and their use, the expectations of data users and the concerns of data producers (Vasseur et al., 2005). External quality is a relative quality of use, measuring the ability to fulfil a particular need. Agumya & Hunter (1998) demonstrated that there is a strong link between fitness for use, acceptable risk and risk response. Pierkot et al. (2011) define external quality as “the suitability of the specifications to the user’s requirements. It is measured by the difference between the resource wished for by the user and the resource which has actually been produced”.

Data quality has been separated into twenty dimensions by Wang & Strong (1996), organised in four categories:
• the accessibility of data: accessibility degree, access security and cost-effectiveness
• the accuracy of data: accuracy degree, believability, completeness, objectivity, reputation, traceability, variety of data sources
• the relevancy of data: appropriate amount of data, ease of operation, flexibility, relevancy degree, timeliness, value-added
• the representation of data: conciseness and consistency, ease of understanding, interpretability

For some activities, poor data quality can be a risk-worsening factor and endanger them. As decisions are based on the information available, poor-quality data can lead to poor decisions.
In their use of information, decision-makers can be influenced by several variables: their experience level, information overload and time constraints. Information overload happens when the amount of information is too large for the time available to respond. Consequently, the global quality decreases when there is not enough time to process the incoming data. In this scope, it is particularly important to reduce the information load of the decision-maker, in order to draw attention to and focus on the important features that need human operators.

2.3. Data integrity

A distinction may be drawn between the integrity assessment and the veracity assessment of information. In the case of AIS messages, the integrity assessment represents the value associated with the trust we have that the information within an AIS message accurately depicts the behaviour of the vessel with respect to the other messages that we receive and process, whereas the veracity assessment represents the intrinsic trustworthiness that we associate with the fact that the message is genuine and its pieces of information are true. In this respect, integrity relates to the nature of a piece of information with respect to a reference, whereas veracity is mainly linked to the relation of data to the world. Veracity represents the fact that a datum is truthful, i.e. that it correctly depicts the world in the way it is expected to (Iphar et al., 2019). Consequently, the evaluation of integrity through data assessment techniques is a means of understanding the overall problem of data veracity. Nevertheless, due to the semantic proximity of those terms, and given that this distinction is not the main focus of this contribution, both integrity and veracity assessments will be referred to as integrity assessments in the remainder of this paper.

2.4. Anomalous events and anomaly detection

Anomaly detection is an important part of data-related studies and is often based on the aforementioned data quality dimensions. For any study, a normality must be established, as the assessment of an anomaly is relative, and a distance measure must be chosen. In addition, threshold triggering criteria must be put in place, enabling an actual discrimination of anomalies.

Several kinds of anomalies are distinguishable: point, contextual and collective anomalies. In a point anomaly, an individual instance is considered anomalous with respect to the rest of the data. In a contextual anomaly, an instance is not anomalous in a general assessment but becomes anomalous once the context is taken into account. In a collective anomaly, the data considered separately are not anomalous by themselves, but their occurrence together makes an anomalous collection (Chandola et al., 2009).

In anomaly assessment, pattern discovery is crucial, as a pattern is by definition constructed from recurring elements, the repetition of which is predictable (Martineau & Roy, 2011). The terms anomaly, non-standard, outlier or unusual can be used for each piece of information that falls outside the frame, i.e. that does not belong, or seems not to belong, to one of the clusters formed by the pattern analysis. The patterns can be a statistical distribution, a succession of events as a sequence, or a cluster. If the pattern evolves over time it follows a dynamic model; if it does not, it is said to be static. Machine learning, statistical methods and neural networks are amongst the usable methods for pattern discovery.

3. Use and weaknesses of a maritime identification system

The application case of this paper relies on vessels and maritime data. More particularly, the data analysed are sent by a specific system, the Automatic Identification System, implemented by the International Maritime Organization and with enforced use worldwide.
This section aims at presenting this system with its uses and its misuses. The section ends with a positioning of the system with respect to the anomaly detection features developed in Section 2.

3.1. A system for maritime data broadcasting

The Automatic Identification System is an information system through which vessels transmit information about their position, kinematics, physical characteristics and identity, as well as information related to the safety of navigation. Today, besides its initial purpose of collision avoidance, it is in widespread use (Fournier et al., 2018). The AIS helps mariners to better know their environment; it is used by coastal authorities to be aware of the traffic off their coast, by countries to know the location of the vessels flying their flag, by companies to monitor their fleet, and by analysts and researchers as a useful tool for understanding maritime traffic and its various hazards.

The Automatic Identification System was put in place by the Safety Of Life At Sea (SOLAS) convention, and some ships from the signatory countries are concerned by the deployment of this system. The SOLAS convention states that “all ships of 300 gross tonnage and upwards engaged on international voyages and cargo ships of 500 gross tonnage and upwards not engaged on international voyages and passenger ships irrespective of size shall be fitted with an automatic identification system” (IMO, 2004). Following this definition, not all seagoing vessels are obliged to carry the AIS; relying only on this system therefore provides a partial view of maritime traffic. However, vessels may carry the system even though it is not compulsory for them.

The transmission of AIS data takes place in the Very High Frequency (VHF) band, on two dedicated worldwide frequencies: 161.975 MHz and 162.025 MHz.
In order to transmit and receive AIS signals, dedicated devices have been put in place since the introduction of the system. Four main kinds of devices can be distinguished: class A transceivers (on the vessels for which AIS is compulsory), class B transceivers (on the vessels for which AIS is not compulsory), multi-channel receivers and radio scanner receivers (Iphar, 2017).

At first, the system was only terrestrial, with transmission occurring from one vessel to another, or between a shore station and a vessel, within a range limited by the curvature of the Earth (circa 40 nautical miles in optimal conditions (ESA, 2012) for class A vessels) or by the transmission power (5 to 10 nautical miles (Serry & Lévêque, 2015) for class B vessels). More recently, the development of low-orbit satellites has made it possible to receive messages even far from the coastline: the satellite stores the messages it receives, then downloads the information as soon as a coastline and a shore station are reached.

The development of the Internet brought an even more important step forward in the knowledge of the maritime situation, as websites display AIS information from all over the world¹. Where ships previously disappeared beyond the horizon from a terrestrial point of view, they can now be tracked across the whole world by anyone with access to the Internet.

The rate of transmission, or reporting interval, of AIS messages varies largely according to the type of vessel, its speed and the type of message sent; for a class A vessel, it ranges from 2 seconds to 3 minutes for position report messages. In one day, the European Maritime Safety Agency (EMSA) receives about 9 million terrestrial AIS and 7 million satellite AIS messages, from over 96,000 vessels detected by more than one source (EMSA, 2019), and Natale et al. (2015) estimate that, in a month, 130,000 vessels of all categories are sending those messages.
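As an illustration of how the reporting intervals quoted above could be exploited, the following Python sketch flags unusually long silences in the position reports of a single vessel. It is a minimal sketch under stated assumptions and not the method of this paper: the function name `find_reporting_gaps` and the `tolerance` factor are hypothetical, and the 3-minute bound is simply the longest class A interval mentioned in the text.

```python
from datetime import datetime, timedelta

# Assumption: the longest class A position-report interval quoted in the text.
MAX_CLASS_A_INTERVAL = timedelta(minutes=3)


def find_reporting_gaps(timestamps, max_interval=MAX_CLASS_A_INTERVAL, tolerance=2.0):
    """Return (previous, next) timestamp pairs separated by an unusually long silence.

    `tolerance` is a hypothetical safety factor: a gap is reported only when it
    exceeds tolerance * max_interval, so that a few lost messages are not flagged.
    """
    ordered = sorted(timestamps)          # reports may arrive out of order
    limit = max_interval * tolerance      # longest silence still accepted
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 12, 0, 0)
    reports = [t0, t0 + timedelta(seconds=10), t0 + timedelta(minutes=20)]
    for start, end in find_reporting_gaps(reports):
        print(f"silence from {start} to {end} ({end - start})")
```

A flagged gap is only a candidate for further analysis: it may result from a coverage hole or a message collision just as well as from a deliberate switch-off of the transponder.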
AIS messages have been designed in various types, each one carrying a given type of information. In this respect, 27 different messages have been designed, each one having its own layout of data fields according to the type of information it is supposed to carry. The study of Tunaley (2013) proposes a separation of messages into six categories, namely standard, aid to navigation, timing, safety, binary and others.

The data inside AIS messages can basically be divided into three main categories: static, dynamic and voyage-related (Lundkvist et al., 2008). Static data are data fields which are not intended to change, or at least to change seldom, such as the call sign, the name of the vessel, its length and beam, or the type of ship. Dynamic data are the pieces of information contained in the data fields which are expected to change over time, reflecting a physical motion, such as the latitude, longitude, course over ground or speed over ground. Voyage-related data are pieces of information that are expected to change often, at each new voyage, such as the draught, the destination, the estimated time of arrival or the hazardous nature of the cargo.

¹ e.g. marinetraffic.com, aishub.net, amongst others

3.2. The weaknesses of AIS

The AIS is an open system, designed and promoted by international authorities so that it could be used by the greatest possible number of users. However, this openness led to a lack of control over the system, and there are several ways in which the AIS fails to transmit genuine data: (1) issues due to the intrinsic weaknesses of the system, (2) errors in the messages, (3) falsified data in the messages (Ray et al., 2015) and (4) AIS signal spoofing (Balduzzi et al., 2014b). Those four ways are presented in this subsection.

3.2.1. The AIS has intrinsic weaknesses

Those weaknesses are linked to the system itself, without implying human interaction.
The two main families of those intrinsic issues are missing data and message collision (Iphar, 2017).

The system itself can fail to transmit information. Some transponders fail to meet all the requirements set by the International Telecommunications Union, and some ships display large blank areas. These missing data, as shown in Lecornu et al. (2013), weaken the exploitation of AIS data by decreasing its reliability, but do not prevent it. In addition to problems such as limited bandwidth and range, the AIS has some critical shortfalls: limited retransmit capabilities for a few messages and no retransmit capabilities for the majority (McGillivary et al., 2009).

Message collision is another weakness of AIS. A message collision occurs when a message overlaps another one, partially or completely. Not all AIS signals are received by the receivers, as there is a loss percentage, particularly in the case of satellite transmission (Eriksen et al., 2006). When the installation is correct, with good hardware and good weather, most loss is due to VHF transmission. About 2% of messages are lost due to channel overload (Last et al., 2015). But the biggest reason for message loss is shadowing due to obstacles (Last et al., 2015), either on board the vessel (masks) or other vessels hiding more distant ones.

3.2.2. The system broadcasts errors

Part of the information contained in AIS messages is entered manually by the crew, both at the initialisation of the system for permanent data and at every new journey for journey-related data (Iphar, 2017). According to the study of Harati-Mokhtari et al. (2007), both static and dynamic data are subject to errors, as is each human-filled field: in static data, such as the identification number of the ship or the name of the vessel, as well as in dynamic data, such as the navigation status, the estimated time of arrival or the destination.
Thus, the Maritime Mobile Service Identity (MMSI) number (the main ship identifier used by the AIS) is false in an estimated 2% of cases (Harati-Mokhtari et al., 2007). Also, the type of the vessel is often unclear: 6% of vessels do not define a type at all, and 3% define their type simply as vessel (Windward, 2014). The name of the vessel is another issue, as 0.5% of vessels do not have a registered name, and some names exceed the space allocated to the field, which is 20 characters. Globally, only 41% of ships report their destination (Windward, 2014).

3.2.3. The system presents falsification cases

Intentional falsification of the AIS signal can be carried out, for instance, by crews on board ships who modify or stop the messages they send with the specific purpose of misleading the outside world (Iphar, 2017).

At sea, only vessels, buoys and relevant aids to navigation must broadcast AIS messages. However, cases of fishing vessels putting AIS transceivers on fishing nets have been demonstrated (gCaptain, 2018), in order to force other vessels to alter their course away from those nets.

Identity theft also exists in the maritime domain (Windward, 2014). It consists in navigating with an MMSI number which is not the vessel's real, allocated and internationally recognised one, but that of another vessel which actually exists somewhere else.

Destination masking is also sometimes a falsification (Windward, 2014). While it can sometimes be considered an error, other cases involve a deliberate withholding of information, in order to evade the monitoring of global ship flows.

Disappearances are also a kind of falsification, as ships turn off their AIS transponder in order to hide some of their activities, such as fishing in an unauthorised area or trading illegal goods (Katsilieris et al., 2013) with other ships or on coasts.
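One observable consequence of the identity theft described above is that the same MMSI may be reported from two positions too far apart to be reached by a single vessel in the elapsed time. The following Python sketch illustrates such a check. It is only an illustration under stated assumptions, not the rule base of this paper: the function names, the tuple layout of a report and the 60-knot speed bound are hypothetical choices.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius expressed in nautical miles


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def implausible_jumps(reports, max_speed_kn=60.0):
    """Flag consecutive reports of one MMSI that imply an unrealistic speed.

    `reports` is an iterable of (mmsi, unix_time_s, lat_deg, lon_deg) tuples.
    `max_speed_kn` is a hypothetical bound, not a value taken from the paper.
    Returns a list of (mmsi, previous_time, current_time, implied_speed_kn).
    """
    last_seen = {}
    flagged = []
    for mmsi, t, lat, lon in sorted(reports, key=lambda r: r[1]):
        if mmsi in last_seen:
            t0, lat0, lon0 = last_seen[mmsi]
            hours = (t - t0) / 3600.0
            dist = haversine_nm(lat0, lon0, lat, lon)
            # Simultaneous reports from distinct places imply an infinite speed.
            speed = dist / hours if hours > 0 else (math.inf if dist > 0 else 0.0)
            if speed > max_speed_kn:
                flagged.append((mmsi, t0, t, speed))
        last_seen[mmsi] = (t, lat, lon)
    return flagged
```

When two vessels share an MMSI, their interleaved reports would be flagged at every alternation, so the number of flags says little by itself; a single flag may equally come from a positioning error rather than from falsification.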
In this respect, five main issues are developed by Windward (2014): identity fraud, the concealing of destination, the voluntary stopping of the broadcast, GNSS manipulation and the spoofing of the system, as the ability of an attacker to control a vessel under autopilot by spoofing the GNSS signal has been analysed and demonstrated in Bhatti & Humphreys (2017).

3.2.4. The system undergoes spoofing

The spoofing of messages is done by an external actor through the creation ex nihilo of false messages and their broadcast on the AIS frequencies (Balduzzi et al., 2014a). Those spoofing activities are done in order to mislead both the outer world and the crews at sea, by the creation of ghost vessels, a false closest point of approach trigger, a false emergency message or even a false heading (in the case of a spoofed vessel). In the scope of spoofing capabilities, several threats can be taken into consideration: ship spoofing, aid to navigation spoofing, collision spoofing, weather forecasting, AIS hijacking and availability disruption threats (Balduzzi et al., 2014a).

The cases presented in this section have been implemented by Balduzzi et al. (2014a) with dedicated software and a self-built transmitter sending crafted AIS frames, the resulting trajectory of which is presented in Figure 1. In Figure 1, the result was received and displayed on the website marinetraffic.com, as a station of this network received the signal. Other kinds of attacks or tests have also appeared on this platform, where fake data (e.g. ships) are persistent. An attacker would be able to counterfeit information in order to blame someone else for an event, for instance a voluntary oil spill in the open sea, or the intrusion of an enemy vessel in the waters of another nation.

Figure 1: Example of a spoofed ship following a programmed path, from Balduzzi et al. (2014a)
Availability disruption threats are of three kinds: slot starvation, frequency hopping and timing attacks. Slot starvation consists in impersonating the maritime authority to reserve all the slots, so that stations within coverage have no slot available for reservation and emission. Frequency hopping consists in instructing the AIS transceivers to change their transmission frequency, which the protocol specification allows for given areas of the world. In timing attacks, the malicious user instructs transceivers to delay their transmission; done repeatedly, this prevents the system from functioning normally. Conversely, the attacker can command transceivers to send updates at a very high rate, thus overloading the channel.

3.3. Anomaly detection for AIS

As stated in Section 3.2, the Automatic Identification System transmits information in an unsecured way, which allows error, falsification and spoofing cases. As this system is widely used for navigation, security and the evaluation of maritime domain activities such as fishing (Hu et al., 2016), vessel noise (Erbe et al., 2012), vessel emissions (Goldsworthy & Goldsworthy, 2015), traffic modelling (Chen et al., 2015), emergency response (Schwehr & McGillivary, 2007) or animal collision (Wiley et al., 2011), one must ensure that the data used for such evaluations are genuine and actually represent what they stand for. However, as AIS is open and multiple errors and misuses are possible, it is difficult to trust data transmitted by AIS.
Several methods have been implemented for the anomaly detection of maritime traffic using maritime communicating sensors, such as clustering and classification, in which different behaviours are discriminated into different classes (Zissis, 2016), Bayesian networks, in which vessel behaviours are categorised following Bayes' statistical theory (Hadzagic & Jousselme, 2016), data-driven path-finding algorithms for the computation of vessel estimated time of arrival (Alessandrini et al., 2018), event calculus for pattern discovery (Pitsikalis et al., 2018), hidden Markov models, in which this probabilistic model is used to discriminate various vessel routes (Zouaoui-Elloumi, 2012; Yaghoubi Shahir et al., 2014), unsupervised route extraction, in which routes are extracted from raw data based on vessel trajectories (Pallotta et al., 2013), genetic algorithms (Chen et al., 2014) or low-likelihood behaviour detection, which is based on a measure of how expected the behaviour of a vessel is (Alessandrini et al., 2016). The development of those methods was facilitated by the rise of open data available from sea-going vessels (Kazemi et al., 2013).

In addition, in the specific case of AIS messages, the maritime environment is a complex environment of study, with a large number of elements and agents, the capabilities of which are restricted. As an example, vessel tracking is an essential and relatively well-developed task for the understanding of the maritime environment. This tracking is in general based on the fusion of data from various sensors such as AIS signals, imaging devices or radar signals, but every single device has a limited coverage area that varies (because of masks or weather), which limits the global knowledge of the situation.
The perception of some elements that can be hazardous is however limited (cargo, identities of passengers, identities of mariners for instance), which limits the detection of anomalies, because a hypothetically perfect analysis would require perfect knowledge of the various components serving as information sources, within a perfectly known interpretative framework.

In this perspective, the determination of quality dimensions as defined in Section 2.2 is necessary, so as to ensure a proper assessment of AIS data. The data quality dimensions of accuracy, currentness, completeness, precision, consistency (Huh et al., 1990), integrity (Fox et al., 1994) and reliability (Brodie, 1980) have been highlighted as particularly important in the analysis of AIS issues by Iphar et al. (2015), and represent the cornerstone of the methodology presented in Section 4.

4. A methodology for integrity assessment of maritime data

As shown in Section 3.2, AIS messages present vulnerabilities in their structure and data, such as falsification, and those vulnerabilities can increase or lead to the creation of maritime risks. In this section, a method for assessing the integrity of AIS messages is presented. In this method, a thorough examination of AIS messages leads to the identification of 935 integrity items, which are elements on which AIS data may disagree. In the complex AIS structure, such a disagreement is an indicator of an integrity issue. A system of flags, based on the one hand on integrity items and on the other hand on non-AIS data (contextual data such as fleet registers), has been developed, the goal of which is to highlight humanly understandable anomalies in the AIS, in the frame of some specified scenarios. Those flags are raised when a given combination of integrity assessment item results is obtained. In parallel, the conjunction of some given flags will trigger some specific scenarios.
The final purpose is to deliver, in near-real-time, information with added value to maritime authorities and rescue centres.

4.1. Integrity assessment of messages

4.1.1. A variety of message and data types

As mentioned in Section 3.1, AIS messages are varied in nature; they can therefore be split into several families, each one gathering similar kinds of messages, which will undergo similar integrity assessments as they present similar data fields. Figure 2 presents several ways to classify AIS messages.

Figure 2: Variety of AIS messages

The left-hand column of Figure 2 displays the different possible kinds of message senders. Indeed, some of the messages are only sent by base stations (which are shore-based stations or other non-vessel stations), some others are only sent by mobile stations, while a large number of the messages can be sent by both base and mobile stations. Given this distribution, a single station (identified by its MMSI number) is not expected to send messages which do not match its category. In addition, the same column shows the messages sent specifically by class A stations (i.e. violet and blue ovals, not circled) and those sent only by class B stations (circled ovals). As vessels are not expected to change their class, no MMSI is expected to send both class A and class B messages.

The central column displays the variety of AIS messages, as several kinds of AIS messages exist, and all messages belonging to the same family will tend to undergo similar studies. Moreover, when it comes to assessments involving several messages, any two pairs of similar messages will tend to give rise to similar items, as the same data fields involved in the comparison of the messages will be found in both pairs of messages.
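The expectation that a given MMSI never mixes class A and class B messages lends itself to a very small check. The following is a minimal sketch only: the message numbers assigned to each class (1, 2, 3 and 5 for class A; 18 and 19 for class B) are an assumption taken from the AIS specification rather than read from Figure 2, and the function name and input layout are hypothetical.

```python
from collections import defaultdict

# Assumption: class-specific message numbers taken from the AIS specification,
# not from Figure 2 of the article.
CLASS_A_TYPES = {1, 2, 3, 5}   # class A position reports and static/voyage data
CLASS_B_TYPES = {18, 19}       # class B position reports

def mixed_class_senders(messages):
    """Return the MMSI numbers seen sending both class A and class B messages.

    `messages` is an iterable of (mmsi, message_type) pairs (hypothetical layout).
    Message types belonging to neither set are ignored.
    """
    classes_seen = defaultdict(set)
    for mmsi, message_type in messages:
        if message_type in CLASS_A_TYPES:
            classes_seen[mmsi].add("A")
        elif message_type in CLASS_B_TYPES:
            classes_seen[mmsi].add("B")
    return {mmsi for mmsi, seen in classes_seen.items() if seen == {"A", "B"}}
```

Such a check needs the message history of each station, so it would fall among the assessments that compare several messages rather than those that look at a single one.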
The right-hand column of Figure 2 shows three of the main message families: the messages in which static data is provided, the messages in which positioning is involved and the messages in which a communication between two vessels is involved. For a message, having positioning data (i.e. latitude and longitude fields) enables all position-related assessments. Similarly, having static data enables identity-related assessments, and having communication data (i.e. source and destination MMSI numbers) enables all kinds of analyses linked to the identities and locations of those vessels. The messages shown in grey in Figure 2 do not belong to any of those three message families.

Not only is there diversity among AIS messages, but the data within them can take several forms. Amongst the data fields, the diversity can be illustrated by message number 5 (static and voyage related data message). The fields of this message are presented in Table 1, alongside the type of each datum and its nomenclature value, the meaning of which is explained in Section 4.1.2.

Six data types are then distinguished: numeric representing an identifier (such as identification numbers of the vessel), numeric representing a physical quantity (dimensions of the vessel, or speed in another message type), numeric representing a choice (in a list of choices, such as the navigational status, where amongst others “0” stands for under way using engine and “1” stands for at anchor), textual, date and binary.
Field | Data type | Nomenclature
------|-----------|-------------
Message ID | Numeric representing an identifier | 05A
Repeat Indicator | Numeric representing a quantity | 05B
User ID | Numeric representing an identifier | 05C
AIS version indicator | Numeric representing a choice | 05D
IMO Number | Numeric representing an identifier | 05E
Call Sign | Textual | 05F
Name | Textual | 05G
Type of ship and cargo type | Numeric representing a choice | 05H
Overall dimension / reference for position | Numeric representing a quantity | 05I
Type of electronic position fixing device | Numeric representing a choice | 05J
ETA | Date | 05K
Maximum Present Static Draught | Numeric representing a quantity | 05L
Destination | Textual | 05M
DTE | Binary | 05N
Spare | Binary | 05O

Table 1: Different data types in AIS Message 5

Those data types are described by the AIS specification, and can be found during normal use conditions. However, two additional cases must be taken into consideration: empty fields and default values. Empty fields occur when a field has no value allocated, which constitutes an issue of data completeness. Default values exist in AIS messages and are also described by the system specifications: any field with no allocated value will display its default value. For instance, in the case of message number 1, “181” is the default value for the longitude field, and “511” for the true heading field (Raymond, 2016).

4.1.2. Integrity assessment items

As displayed in Figure 3, four ways to assess the inner integrity of the data within the fields of the 27 AIS messages can be distinguished. The first level consists of the assessment of the integrity of each field of each message taken individually. The second level is found at the scale of one single message, and assesses, in this very message, the integrity of all the fields with respect to one another. Given that messages of the same type have the same fields, it is possible to assess their integrity by comparing them, which makes the third level.
Finally, the fourth level consists in the comparison and the integrity assessment of the fields of different messages. Although pieces of information can come from different messages, it is indeed possible to assess their integrity, because some fields are either the same, linked or comparable. Those four ways will, in the following, be respectively referred to as first-order, second-order, third-order and fourth-order assessments. The first-order and second-order assessments rely on one single message, and are therefore independent of the environment, whereas the third-order and fourth-order assessments need several messages in the data history to be assessed (at least one other, up to an entire time series for one vessel), and the outcome of those assessments can vary according to the environment (which includes the sample size or the location of the message within the sample).

Figure 3: The four-order assessment

The assessment of data integrity is performed through integrity items, which are simple and unambiguous statements involving one or several data fields. Each statement involves one field or several fields, either in the same message or in several messages, in which data could be in discordance with the specifications or in which several pieces of information within the fields could disagree, i.e. display two or more pieces of information that are not expected to appear together when the system functions as expected.

In order to avoid any confusion about which data field is treated (as some fields are similar or identical in several message types) or which item is assessed (as some items, dealing with those similar or identical fields, will look alike), a nomenclature has been set to uniquely identify each data field from each message type, and each item from each order of assessment.
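As an illustration of the field part of this nomenclature, the sketch below builds identifiers of the form seen in Table 1 (05A to 05O): a two-digit message number followed by a letter giving the rank of the field in the message. The function name is hypothetical, the limit of 26 fields per message is a simplifying assumption of this sketch, and the naming of items (as opposed to fields) is not covered.

```python
import string

def field_code(message_number: int, field_rank: int) -> str:
    """Build a field identifier such as '05A'.

    message_number: AIS message number (1 to 27).
    field_rank: 1-based position of the field within the message.
    Assumption of this sketch: no message has more than 26 fields.
    """
    if not 1 <= message_number <= 27:
        raise ValueError("AIS message numbers range from 1 to 27")
    if not 1 <= field_rank <= 26:
        raise ValueError("this sketch only covers field ranks A to Z")
    # Two-digit, zero-padded message number, then the rank as an uppercase letter.
    return f"{message_number:02d}{string.ascii_uppercase[field_rank - 1]}"
```

For instance, `field_code(5, 1)` gives the identifier of the first field of message 5, and `field_code(5, 15)` that of its fifteenth and last field.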
Table 2 presents message number 1 (scheduled class A position report) with all its data fields, their size given as the number of bits allocated, and their associated nomenclature (message number concatenated with a letter corresponding to the rank of the field in the message).

Nomenclature | No. of bits | Field name
-------------|-------------|-----------
01A | 6 | Message ID
01B | 2 | Repeat Indicator
01C | 30 | User ID
01D | 4 | Navigational Status
01E | 8 | Rate of turn
01F | 10 | Speed over ground
01G | 1 | Position Accuracy
01H | 28 | Longitude
01I | 27 | Latitude
01J | 12 | Course over ground
01K | 9 | True heading
01L | 6 | Time stamp
01M | 2 | Special manoeuvre indicator
01N | 3 | Spare
01O | 1 | RAIM-flag
01P | 19 | Communication state

Table 2: Nomenclature of data fields of message 1

4.1.3. Assessment classification

Two main families of assessments can be distinguished: those that assess conformity, i.e. the conformity of the AIS message to the AIS specifications, and those that assess coherence between different data fields in one or several messages. The integrity assessment of AIS messages uses both coherence and conformity items, as they are complementary for the understanding of the maritime situation.

The conformity items encompass all the first-order items and a marginal part of the second-order items (e.g. for message number 24, which is sent in two separate transmissions, so that one can be received and not the other). In the first-order items, the presence of a default value in any field does not constitute a conformity issue. However, the presence of an empty field where a value is expected does.

The coherence items encompass all the remaining second-order items and all the third-order and fourth-order items. Within all coherence items, eleven families of items have been distinguished. Those families are presented in Table 3, with the orders to which those items can belong and a short description of their nature.

4.1.4.
A logic-based formalism for integrity assessment

After the determination of the item list, each item must be rigorously assessed in order to check the conformity or the coherence of the fields within. A Boolean value, True or False, is associated with the item for the message assessed; this value is assigned as an answer to the question: does the statement expressed in the item demonstrate an AIS-data integrity violation?

Therefore, if the application of the item to a message demonstrates an integrity issue, then the value True is allocated to this item for this message; otherwise it is False.

In some cases, for several reasons, the item cannot be assessed. Should it happen, as no violation of the integrity of the system has been demonstrated, the value associated with this item is False. For instance, third-order algorithms require at least one former message of the same type from the same sender; if the message is the first one received from this station, this does not constitute an integrity violation, in spite of the fact that the item cannot be assessed.
The same reasoning applies to fourth-order items involving rare messages: for instance, as the reception of a message number 13 is quite rare, the items involving message 13 data fields will seldom be assessed, and as a consequence, each time no message 13 shows up, the value False will be assigned.

Families | Marks in columns O1 to O4 | Description
---------|--------------------------|------------
Conformity issues | X X | Non-compliance with the specifications
Inconsistent field values | X X X | Inconsistencies between two or more values are found, from the same message or from different messages
Data field evolution | X X | The evolution of the value of a data field over several messages is not coherent
Motion evolution | X X | Consecutive motion values between several data fields are not coherent
Unusual values | X X | The value of one given field is not in accordance with the usual values this field takes when sent by this vessel in other messages
Overabundant reporting | X | The vessel sends a number of messages which is too high with respect to its kinematic values and the specification-defined transmission rate, in the absence of any message 23
Overabundant communication | X | Two stations communicate too often with each other
Remote communication | X | A communication takes place between two stations which are supposed to be too far away from one another
Unexpected data field change | X X X | The value of one given field has unexpectedly changed with respect to the former message sent by this vessel
Position fixing device issue | X X X | The vessel displays whereabouts which are not compatible with the declared position fixing device
Unexpected country location | X X X | The station is fixed and has whereabouts which are not in accordance with its country
Inconsistent response | X | Either the data field is part of a response message but the message that triggered this response is nowhere to be found, or the data field is an inquiry and the response is nowhere to be found

Table 3: Families of items and the assessment order(s) in which they are found
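The Boolean convention described above (True only when a violation is demonstrated, False both when the data are coherent and when the item cannot be assessed) can be sketched as follows for a third-order item. This is an illustrative sketch rather than the implementation used in this work: the dictionary layout of a message, the key names `mmsi`, `type` and `t`, and both function names are hypothetical.

```python
from typing import Callable, Optional, Sequence

Message = dict  # hypothetical layout: decoded AIS message as field name -> value

def latest_from_same_sender(current: Message,
                            history: Sequence[Message]) -> Optional[Message]:
    """Most recent earlier message of the same type from the same MMSI, if any."""
    candidates = [m for m in history
                  if m["mmsi"] == current["mmsi"]
                  and m["type"] == current["type"]
                  and m["t"] < current["t"]]
    return max(candidates, key=lambda m: m["t"], default=None)

def assess_third_order(current: Message,
                       history: Sequence[Message],
                       violates: Callable[[Message, Message], bool]) -> bool:
    """Return True only if an integrity violation is demonstrated.

    When no former message exists, the item cannot be assessed and the
    result is False, since no violation has been shown.
    """
    previous = latest_from_same_sender(current, history)
    if previous is None:
        return False
    return violates(current, previous)
```

The `violates` argument stands for the item-specific comparison, for example a test on whether an identity field has changed between the two messages.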
Predicate logic presents, in a formal way, the actions that lead to the integrity determination of an item in a rigorous and unambiguous way. A logic-based formalism based on predicate logic has therefore been chosen for item assessment, relying on three main elements: the data field values, the syntax and the expert knowledge values.

The data field values consist of the fields needed for the assessment of the item. According to Section 4.1.1, various data types can be involved, and their number depends on the assessed item, as it can require either few or several fields.

The syntax is the set of logical elements that make the statements understandable and unambiguous. In this case, the selected elements are: ∃, the existential quantifier; !, the uniqueness indicator for the existential quantifier; ∀, the universal quantifier; ⊢, the implication; ¬, the negation; ←, the attribution; ∈, the affiliation; ∪, the union; ∩, the intersection; ⊤, the True statement; and ⊥, the False statement.

The expert knowledge consists of a set of values that have been set for each item in which it is necessary. Some items are straightforward, such as the ones assessing conformity, because with respect to the technical specifications, the data value is either in accordance or in disagreement. However, for the items in which continuous data such as speed or location are used or for which distances are computed, a threshold value between the True and the False value must be determined. In this perspective, the knowledge of an expert in maritime navigation is used for the establishment of those thresholds.
From this point on, $M_x$ stands for the set of all messages number $x$, $m$ stands for a single message, $R_m^z$ stands for the result of the assessment of item $z$ on message $m$, $D$ stands for the set of data field values (a list of fields, set in accordance with the need), $T_R$ is a time interval representing the chosen assessment reference time ($T_R$ standing for $T_{Reference}$), and $T_A$ is a time interval representing the current assessment time ($T_A$ standing for $T_{Assessment}$), i.e. in the analysis, all messages received during $T_A$ are assessed, using all the messages received during $T_R$ as the archived message database. An in-depth explanation of this mechanism is presented in Section 5.2.3.

Two examples are provided here, one very simple and one more complex. In the simple one, the purpose is to check whether the latitude field (01I, as defined by the nomenclature, cf. Section 4.1.2) is within [−90, 90] ∪ {91}, which is its expected range of values (because latitude values range between −90 and 90 and the default value is 91). In the other one, the purpose is to check whether the whereabouts, represented by the longitude (01H) and the latitude (01I), are in accordance with the kinematic values of the messages, which are the course over ground (01J), the speed over ground (01F) and the rate of turn (01E). This item uses additional functions, named f and g, for trajectory planning (the description of which is not the purpose of this section).
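Before the formal statements, the two checks can be sketched in code. The first function follows the range test on the latitude field 01I (bounds −90 to 90, default value 91). The second is only a stand-in for the kinematic check: since f and g are not described in this section, a great-circle distance and a simple speed-times-elapsed-time bound are assumed in their place, and the tolerance factor, the margin, the key names and all function names are hypothetical choices made for this sketch.

```python
import math

LAT_DEFAULT = 91.0  # default latitude value stated in the text

def item_01S05(lat: float) -> bool:
    """True (violation) when lat is neither within [-90, 90] nor the default 91."""
    return not (-90.0 <= lat <= 90.0 or lat == LAT_DEFAULT)

EARTH_RADIUS_NM = 6371.0 / 1.852  # mean Earth radius converted from km to nautical miles

def great_circle_nm(lon1, lat1, lon2, lat2):
    """Haversine distance in nautical miles (assumed stand-in for f)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))

def reachable_nm(speed_knots, t_prev, t, tolerance=1.5, margin_nm=0.1):
    """Upper bound on the distance covered between two timestamps in seconds
    (assumed stand-in for g; tolerance and margin are illustrative values)."""
    hours = (t - t_prev) / 3600.0
    return speed_knots * hours * tolerance + margin_nm

def item_01I05(current, previous):
    """True (violation) when the position change exceeds what the speed allows.

    `current` and `previous` are dicts with keys lon, lat, t (and speed for
    `current`); `previous` is None when no former message is available, in
    which case the item cannot be assessed and the result is False.
    """
    if previous is None:
        return False
    lam = great_circle_nm(current["lon"], current["lat"], previous["lon"], previous["lat"])
    omega = reachable_nm(current["speed"], previous["t"], current["t"])
    return not (lam < omega)
```

Unlike the functions f and g of the formal statement, this stand-in ignores the rate of turn and the course over ground; a fuller version would use them to predict the expected position rather than only bounding the distance travelled.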
Example 1: Item 01S05: Value of the field 01I is less than −90 or greater than 90 and not equal to 91

\[
\begin{aligned}
&\forall m(D, t) \in M_1,\ D = \{\mathit{id}, \mathit{lat}\},\ t \in T_A \\
&\quad \big(\mathit{lat} \in [-90, 90] \cup \{91\}\big) \vdash R_m^{01S05} \leftarrow \bot \\
&\quad \neg\big(\mathit{lat} \in [-90, 90] \cup \{91\}\big) \vdash R_m^{01S05} \leftarrow \top
\end{aligned}
\]

Example 2: Item 01I05: The evolution of the 01H and 01I positional field values is not consistent with the kinematic values in 01F, 01E, 01J and time

\[
\begin{aligned}
&\exists f : [-180, 180] \times [-90, 90] \times [0, 102.2] \times [0, 4.21] \times [0, 360] \times [-180, 180] \times [-90, 90] \to \mathbb{R}^{+} \\
&\exists g : [0, 102.2] \times [0, 4.21] \times \mathbb{N}^{*} \times \mathbb{N}^{*} \to \mathbb{R}^{+} \\
&\forall m(D, t) \in M_1,\ D = \{\mathit{id}, \mathit{lon}, \mathit{lat}, \mathit{speed}, \mathit{rateturn}, \mathit{course}\},\ t \in T_A \\
&\Big(\big(\exists!\, m'(D', t') \in M_1,\ t' \in T_R,\ t' < t,\ D' = (\mathit{id}', \mathit{mmsi}', \mathit{lon}', \mathit{lat}'),\ \mathit{mmsi} = \mathit{mmsi}',\ \min_{\forall t' \in T_R}(t' - t)\big) \vdash \\
&\quad \big(\Lambda = f(\mathit{lon}, \mathit{lat}, \mathit{speed}, \mathit{rateturn}, \mathit{course}, \mathit{lon}', \mathit{lat}'),\ \Omega = g(\mathit{speed}, \mathit{rateturn}, t', t) : \\
&\qquad (\Lambda < \Omega \vdash R_m^{01I05} \leftarrow \bot), \\
&\qquad (\neg(\Lambda < \Omega) \vdash R_m^{01I05} \leftarrow \top)\big)\Big) \\
&\Big(\neg\big(\exists!\, m'(D', t') \in M_1,\ t' \in T_R,\ t' < t,\ D' = (\mathit{id}', \mathit{mmsi}', \mathit{lon}', \mathit{lat}'),\ \mathit{mmsi} = \mathit{mmsi}',\ \min_{\forall t' \in T_R}(t' - t)\big) \vdash R_m^{01I05} \leftarrow \bot\Big)
\end{aligned}
\]

Although only two examples are shown in this section, all 935 items from all 27 messages have been formalised under this logic-based formalism.

4.2. Falsification scenarios

4.2.1. Integration of contextual information

A sole focus on data coming from the AIS itself is interesting for the integrity evaluation of the AIS, and would be sufficient if AIS were an isolated system. However, as a vessel carrying this system operates in a changing environment, it is always useful to rely on additional data, allowing a more accurate study. A complete understanding of a situation sometimes needs several points of view, and one sensor might not be sufficient to discriminate a situation considered as normal from a situation considered as abnormal. Indeed, a situation considered as abnormal with respect to the data of one given system might be explained by another source and, vice versa, a situation that is expected from the point of view of the AIS can be highlighted as abnormal in light of external data.
Basically, every source having a data field in common with the AIS can be used. As the system covers a wide range of information, those complementary sources can be varied and come from different domains. In general, the establishment of an exhaustive list of such usable sources is not possible, for several reasons: the sources evolve, appear and go out of date in an unpredictable pattern, it is not possible to be aware of all available sources on a given subject, and the need for a given source varies largely according to the type of study conducted.

Such contextual information can be split into vessel-oriented and navigation-oriented data (cf. Section 5.1). Vessel-oriented information typically contains reference information about ships (name, size, owner, ...) that enables a comparison with AIS data, while navigation-oriented data focuses on geographic features that help to understand the ship’s navigation. This includes, for instance, traffic separation schemes, aids to navigation such as fairways or navigational lines, but also the coastline or the location of ports.

4.2.2. Selected falsification scenarios

When an integrity assessment of the system is performed, falsification scenarios can be considered. In general, systems can be falsified; therefore, pointing out the different cases in which such a falsification can happen is an important task. A falsification being the fact either of transmitting erroneous data or of tricking the system by making it behave in a way it is not supposed to, a falsification scenario can take several forms: one particular falsification, one particular way to change data, the ingestion of false data or forcing the system to behave the wrong way.

A variety of scenarios are possible in the case of AIS falsification and spoofing. A selection of representative falsification scenarios is presented in this section [2].
The scenarios are presented in Table 4 with a short description of each of them.

Case # | Scenario name | Description
-------|---------------|------------
1.1 | MMSI | Station has an irregular MMSI number
1.2 | Identity issue | Vessel displays an identity incompatible with complementary data sources
1.3 | Identity change | Vessel has changed one of its identity data fields
1.4 | Ubiquity issue | Station displays various whereabouts at the same time
2.1 | Wrong position | Vessel displays an impossible location
2.2 | Kinematic inaccuracies | Vessel positional values are in disagreement with kinematic values
2.3 | Disappearing/Reappearing vessel | Vessel has unexpectedly disappeared for an unusual time
2.4 | Spontaneous unexpected appearing | Vessel has appeared in an unexpected area
3.1 | Message 22 alert | Station broadcasts a message number 22
3.2 | Message 23 alert | Station broadcasts a message number 23

Table 4: Considered falsification scenarios

[2] A few others have been implemented in the frame of the DéAIS project (Ray et al., 2015), in which this work is included, including scenarios linked to the number of messages received by one station per unit of time, the analysis of the signal or the number of messages received from one given MMSI number (saturation).

The first category of cases deals with static information and identity data of the vessels (i.e. scenarios 1.x of Table 4). In this category, we gathered the issues related to the MMSI number, the identity change (that might be normal but can be suspicious) and the ubiquity issues (which consist in receiving positions that are too remote from one another, from one single MMSI, in a short timeframe). The second category gathers analyses upon all spatio-temporal information of AIS messages (i.e. scenarios 2.x of Table 4), and the scenarios selected deal with the wrong position of a vessel (e.g. inland reporting), the fact of disappearing and reappearing in an unexpected location (e.g.
in the case of a voluntary switch-off of the system), kinematic inaccuracies (position values in consecutive messages not in accordance with speed, course and turn values) or the fact of spontaneously appearing in an unexpected location. The third category considered in this paper concerns two AIS management messages which are amongst the most peculiar messages of the system: message number 22 (channel management) and message number 23 (group assignment command). Those messages, only sent by base stations, send operational parameters of paramount importance to mobile stations: they assign and can change the frequency of transmission (more particularly the transmission channel) in the case of message 22, and impose a transmission interval or a quiet time on mobile stations in the case of message 23. Those messages can be sent to specified vessels (assigned mode) or to all vessels in coverage (broadcast mode). In the latter case, several vessels can be affected by a single management message, which can heavily hinder the ability of the system to operate properly.

4.2.3. Definition of flags

Boolean flags based on expert-based inference rules are used to highlight anomalies. Each flag stands for a fundamental explicit case of integrity breach in the data assessed, and takes the value True if a problem is spotted according to the relevant associated items and False (default value) if no problem is spotted.

In the scope of the study of the AIS, four kinds of flags have been defined. Two belong to the family of flags linked to integrity assessment items and system data (for which the number of flags does not vary): the integrity assessment item flags and the vessel type flags. The other two belong to the family of flags directly linked to contextual data (for which the number of flags varies with the available data): the scenario-specific flags and the maritime situational indicator flags.
Two of those four classes of flags are presented in the following section.

4.2.4. Flag assessment

Flags linked to the integrity assessment items. In Section 4.1, a method for determining the integrity status of every single assessment item was defined. This method treated data fields separately, therefore it was not possible to easily extract any information from it. However, as it was shown in Section 4.1.3 that items can be gathered into categories, it is of interest to extract, from each set of items (corresponding to each message type), issues that are easily understandable by humans, which are the flags presented in Section 4.2.3.

Each flag stands for a specific issue in the analysis of AIS messages, and for each of the scenarios, a list of corresponding integrity assessment items has been established, the results of which must be evaluated in order to get the outcome of the flag computation. The list of integrity items for each flag is fixed, and the selection of flags which directly use integrity item results is fixed for each scenario. As a consequence, the list of integrity items needed for each scenario can be easily deduced by gathering all items of every single flag of the given scenario.

For example, in the case of the remoteness flag (excessive communication distance), there are 17 different items corresponding to this flag in the case of message type number 1. If even one of those items displays a True value, then the remoteness flag is set to True.

Scenario-specific flags. Those flags are totally dependent on the available external datasets, and each flag is tied to the content of the database itself. Therefore it is impossible to set a fixed list of those scenario-specific flags, as their number and nature vary according to the available databases.
The use of such contextual information is particularly important in order to be aware of the environment of the system, and the assessments provided are as varied as the data coming from the system allow.

Each flag is associated with one particular assessment type involving both AIS data and contextual information (i.e. it is necessary to query both system and non-system data before assessing the item); the result is then computed by a specially designed algorithm. As a consequence, it is not possible to write a general assessment program: the program must be adjusted to the data structure and type of each contextual dataset. An example of such a scenario-specific flag, involving a fleet register, is presented in the remainder of this section.

Example: f_fr_consistency

This example assesses the conformity of AIS data with a given fleet register, in our case the European Union Fishing Vessel Fleet Register³, which is publicly available and contains the list of EU fishing vessels. In this database, the fields in common with AIS are the call sign (which serves as a foreign key, usable for a join), the vessel name and the vessel dimensions (which are the values to be compared).

Let $B$ be the EU fishing vessel database, $b$ be an element of $B$, $\epsilon$ be a Boolean stating whether $B$ is exhaustive ($\top$ = exhaustive), $\mathrm{Dist}^{\alpha}$ be a semantic distance (here an edit distance), $\mathrm{Dist}^{\beta}$ be a Minkowski distance (here a Manhattan distance), and $\Xi$ and $\Upsilon$ be the respective expert-defined thresholds on the semantic and Minkowski distances for data compliance.
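Read operationally, the rule built on these definitions joins each static message to the register on the call sign, then compares names and dimensions; when no unique register entry is found, the outcome depends on whether the register is exhaustive. The sketch below is only an illustration under several assumptions: the class, function and field names are hypothetical, the connective between the two distance tests is read as "both must hold", the dimensions are reduced to a tuple of numbers, and the thresholds are placeholder values rather than those set by the experts.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Vessel:
    """Hypothetical record holding the fields shared by AIS and the register."""
    callsign: str
    name: str
    dimensions: Tuple[float, ...]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as the semantic distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def manhattan(u: Sequence[float], v: Sequence[float]) -> float:
    """Manhattan distance (Minkowski distance of order 1)."""
    return sum(abs(x - y) for x, y in zip(u, v))


def fr_consistency_flag(msg: Vessel, register: Iterable[Vessel], exhaustive: bool,
                        name_threshold: int = 3, dim_threshold: float = 10.0) -> bool:
    """Return True when the flag is raised. Thresholds are placeholders."""
    # A call sign distance of 0 is the same as string equality.
    matches = [b for b in register if b.callsign == msg.callsign]
    if len(matches) != 1:
        # No unique entry: suspicious only if the register is exhaustive.
        return exhaustive
    b = matches[0]
    # Assumption: name AND dimensions must both be within their thresholds.
    consistent = (edit_distance(msg.name, b.name) < name_threshold
                  and manhattan(msg.dimensions, b.dimensions) < dim_threshold)
    return not consistent
```

With a register declared non-exhaustive, an unknown call sign leaves the flag at its default value False, which mirrors the last branch of the rule.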
$$
\begin{aligned}
&\forall m(D, t) \in M_5,\; D = \{\mathit{id}, \mathit{callsign}, \mathit{name}, \mathit{dimensions}\},\; t \in T_A \\
&\Big( \big(\exists!\, b(D_b) \in B,\; D_b = \{\mathit{callsign}_b, \mathit{name}_b, \mathit{dimensions}_b\},\; \mathrm{Dist}^{\alpha}(\mathit{callsign}, \mathit{callsign}_b) = 0\big) \vdash \\
&\qquad \big( (\mathrm{Dist}^{\alpha}(\mathit{name}, \mathit{name}_b) < \Xi \,\cup\, \mathrm{Dist}^{\beta}(\mathit{dimensions}, \mathit{dimensions}_b) < \Upsilon) \vdash (f_{\mathit{fr\_consistency}} \leftarrow \bot), \\
&\qquad\;\; (\neg(\mathrm{Dist}^{\alpha}(\mathit{name}, \mathit{name}_b) < \Xi \,\cup\, \mathrm{Dist}^{\beta}(\mathit{dimensions}, \mathit{dimensions}_b) < \Upsilon)) \vdash (f_{\mathit{fr\_consistency}} \leftarrow \top) \big), \\
&\quad \big(\neg(\exists!\, b(D_b) \in B,\; D_b = \{\mathit{callsign}_b, \mathit{name}_b, \mathit{dimensions}_b\},\; \mathrm{Dist}^{\alpha}(\mathit{callsign}, \mathit{callsign}_b) = 0) \,\cup\, \epsilon = \top\big) \vdash f_{\mathit{fr\_consistency}} \leftarrow \top, \\
&\quad \big(\neg(\exists!\, b(D_b) \in B,\; D_b = \{\mathit{callsign}_b, \mathit{name}_b, \mathit{dimensions}_b\},\; \mathrm{Dist}^{\alpha}(\mathit{callsign}, \mathit{callsign}_b) = 0) \,\cup\, \epsilon = \bot\big) \vdash f_{\mathit{fr\_consistency}} \leftarrow \bot \Big)
\end{aligned}
$$

³ http://ec.europa.eu/fisheries/fleet/index.cfm

Other flags. In our analysis, not all vessels should be treated alike, as a fishing vessel is very different from a cargo vessel. Vessel type flags therefore make it possible to discriminate between vessels: several vessel types have been set, and each type has a flag which is True if the vessel is of the type in question and False otherwise. As the vessel type is part of the AIS static message information, it can be assessed easily. In addition, flags have been set to describe maritime situations occurring at the time of the message. Those flags, based on Maritime Situational Indicators (MSIs, defined in Jousselme et al. (2016) as descriptive patterns of maritime activity such as “vessel in under way” or “vessel loitering”), make it possible to take into consideration the environment of the vessel, its location and its surroundings, in order to obtain a more comprehensive analysis of the situation assessed.

5. Implementation

This section presents the implementation of the methodology introduced in Section 4. The first part of this section describes the reference dataset designed for the experiments. Then the architecture of the system is described. Finally, the data processing method and the workflow of data within the developed information system are presented.

5.1. Data

The dataset contains three categories of data: AIS data, vessel-oriented data and geographic data (mainly linked to navigation). It provides ship messages issued from the Celtic Sea, the North Atlantic Ocean, the English Channel and the Bay of Biscay (France).

AIS data. The core of the data used for the experiments consists of the 27 AIS message types received by a terrestrial station located in the Brest roadstead (France). The receiving station (VHF antenna, AIS receiver, Linux computer) collects AIS messages from a large part of the roadstead, from the traffic entering and exiting it, and from the traffic passing by in the Ushant Traffic Separation Scheme (TSS). Figure 4 shows, on the left, the location of the receiver (yellow star) and its theoretical range (blue polygon). The right part of the figure shows the real spatial extent of localised AIS messages over a time span of six months. The data (all messages) received by this antenna from October 1st, 2015 to March 31st, 2016 are used for our study.

Figure 4: A view of the location of the geolocalised points in our AIS dataset messages

The dataset consists of circa 24 million messages, 94% of them being geolocalised messages and 5% being static information messages, as shown in Table 5, which also displays the number and percentage of messages per type of emitting station and per family of messages. Message number 1 is by far the most used, representing 62% of all messages, ahead of messages number 3 (13%), number 4 (12%), number 18 (4%) and number 5 (4%), which are the only messages with a frequency greater than 3%. In our dataset, 71% of the messages number 1 are within 10 km of the reception antenna.
| Grouping | Message family | Number | % |
|---|---|---|---|
| Total | | 24,033,893 | 100 |
| Per content | Geospatial | 22,493,074 | 93.6 |
| Per content | Management | 2,798 | 0.01 |
| Per content | Static | 1,084,275 | 4.5 |
| Per emitter | Mobile station only | 20,369,720 | 84.8 |
| Per emitter | Base station only | 2,803,972 | 11.7 |
| Per emitter | Mobile and base stations | 860,201 | 3.6 |
| Per message type | Standard | 20,570,972 | 85.6 |
| Per message type | AToN | 505,764 | 2.1 |
| Per message type | Timing | 2,807,055 | 11.7 |
| Per message type | Safety | 46 | <0.01 |
| Per message type | Binary | 150,044 | 0.6 |
| Per message type | Other | 12 | <0.01 |

Table 5: Number of messages received by a terrestrial receiver

Falsified AIS data. Genuine AIS messages of the dataset natively contain errors and misconfigurations. They also contain several falsifications (cf. Section 6). However, some behaviours involve messages that are rarely or never received, and others require a condition on the data which is rare (for instance a weird-looking trajectory involving AIS locations on shore, such as the one presented in Figure 1). In order to test, evaluate and validate algorithms and specific scenario cases against reference data, a controlled degradation of the data has also been performed (Iphar et al., 2019). Our approach relies on two degradations: first, original AIS data have been manually or automatically modified; second, some AIS frames or sequences of AIS frames have been created intentionally and injected into the dataset, as described in Figure 6. Figure 5 shows a typical fake trajectory specially designed to activate many items and flags at once (wrong speed, heading, ubiquity, ...). The building and use of those fictitious frames makes it possible to generate any falsification scenario.

Figure 5: A fake trajectory (visualisation based on OpenCPN (opencpn.org) and OpenStreetMap (openstreetmap.org))

Thanks to an emitter platform based on a Software Defined Radio (SDR), which we designed similarly to Balduzzi et al. (2014a), false messages can also be broadcast live together with the real AIS flow (Alincourt et al., 2016).
During experiments, because of their potential threat to navigation, all falsified messages were either broadcast within a laboratory platform at very low power or piped directly into our reference database (in the middle of real historical messages).

Vessel-oriented and geographic data. As stated in Section 4.2.1, two kinds of complementary data are distinguished: vessel-oriented and navigation-oriented. In our study, because we use data from a Brittany-based station, some datasets have a limited spatial extent around our point of interest, whereas others are at a larger scale, even worldwide. All contextual data prepared for the dataset have been temporally (when applicable) and spatially aligned with the extent of the AIS data assessed. First, receiver-specific information has been added (cf. Figure 4), such as the coverage areas of the receiver (the theoretical one, derived from local topography models and the Earth's curvature, and the real one, derived from data reception, which may vary with the meteorological conditions, the season or the time of day), as well as the location of the AIS receiver itself. Additional data prepared for our study include, for instance, Natura 2000 protected areas, anchorage and restricted areas, polygons of Brest port and roadstead, two fleet registers, the location of the ports of Brittany, the coastline and the Ushant traffic separation scheme. An extended release of this dataset, which also includes local weather conditions and sea state, is available (Ray et al., 2019).

5.2. Architectural principles

Based on the methodology presented in Section 4, an information system has been developed for the detection of AIS falsifications. The system is designed to handle both real-time asynchronous and offline analysis of messages, and works with both streamed and historical data.
Although only one receiver was used in the experiments, the information system has been designed to cope with multiple AIS receivers and one central database. Figure 6 shows the different components of the system.

Figure 6: Information system

5.2.1. Data integration

Most AIS messages come without any timing information, because the AIS was initially designed as an anti-collision system to be used in real time. Therefore, as a first step, the receiving station timestamps the messages in UTC immediately upon reception. In a second step, the parser reads the AIS frames, extracts the parameters of each message and stores the parsed information in the central database. Additionally, the parser exports the timestamped messages, in the raw, unparsed format in which they were received, to the central server, which also stores all the messages (i.e. all the 27 message types described in the ITU-R M.1371-4 or NMEA 4.0 specifications) in text files (one file per day). This parser, written in Java, extends and adapts the CC-BY-NC-SA 3.0 parser aismessages⁴. The additional functionalities developed include: the connection to the database; the ingestion and output of UDP (User Datagram Protocol) streams and files (used by the AIS); automatic folder import; a data logger for unparsed messages; data export in the standardised TAG Block format; and data analytics on the incoming flow of messages.

5.2.2. Data management

For data storage and manipulation, a database management system was used, because such systems can find, write, sort, modify or transform data in complex databases while guaranteeing the user a level of robustness of the analysis by avoiding partial assessment or information loss. The widespread, open-source relational database management system PostgreSQL was chosen, queried with SQL and extended with PostGIS for the handling of spatial features.
The main contents of the database are presented in Figure 7.

The database gathers several main elements, organised by database schema.

The AIS messages schema contains 27 tables, one per message type, and a table for all unparsed messages (error table).

⁴ https://github.com/tbsalling/aismessages

The contextual data schema consists of all vessel-oriented and geographic data useful for the analysis of scenario cases, as described in Section 4.2.1. Amongst the data in this schema, the receptor data table stores all information about the receiver used, such as the type of equipment, the receiver location and its theoretical coverage. While the theoretical coverage of a receiver is fixed, its practical coverage changes every day, especially because of weather conditions. A specific practical coverage called black hole has been proposed to highlight daily uncovered areas from which no AIS messages are expected (Salmon et al., 2016).

Figure 7: Database content (AIS message tables combine both real AIS data and falsified AIS data for assessment)

A third schema (analysis data) contains several “working tables” that gather all information related to data treatment (especially the temporal parameters described below), statistics about the message flow and the results of the analysis. Vessel tracks maintain an exhaustive list of seen vessels through a quadruplet of nominative information (i.e. MMSI, name, IMO and callsign), especially for the flag f_quadruplet (cf. Section 6.1.1). It was chosen to store every intermediate result in the database immediately after each assessment. This enables systematic database querying by the software, particularly for comparisons between assessments. In total, the database stores the item computation results (Boolean values) and the flags after scenario computation (Boolean values).
Tables related to risk levels and risk results (integer values), unrelated to the content of this article, are also present, with a view to a further analysis of the risk to maritime navigation.

5.2.3. Asynchronous batch data processing

Message analysis and falsification detection constitute the core component of the information system. The Python programming language was chosen for its development. Beyond the availability of several libraries (e.g. maths libraries for statistical computations) that support our data analysis, Python is easy to handle and enables database querying with embedded SQL. The program works with an incremental and a sliding temporal window, in which items and scenarios are assessed. The processing follows the four-order assessment model described in Figure 3 and is organised in two steps, further detailed in this section. In a first step, the program queries the database to fetch all the needed information and messages in a given time window. In a second step, it executes a batch processing of the data to assess integrity issues and detect falsification cases by computing flags. Those two steps constitute a processing loop: the process is triggered again after a waiting time, and a new first step takes place. The corresponding data analysis workflow is described in Section 5.2.4.

AIS messages are collected continuously (the average rate of our receiver is about 77 messages per minute), parsed and stored in a database together with analytic values of the flow. While it would be possible to stream and analyse every single message on the fly, it has been chosen to perform an asynchronous batch processing within temporal windows working on the database. This choice is suited to a centralised database architecture, where duplicates can occur when the same message is received by different stations.
It also enables the creation of temporal series of messages from the same emitter, easing series-based analyses that involve comparison with historical data. The need for this approach also stems directly from the data treatment process, in which a group of messages with timestamps between given bounds (for each of the 27 message types) is consecutively assessed for the items, the scenarios and the flags.

In such batch processing, newly arriving messages are collected into a group of AIS messages. The whole group is then processed at a future time. The time at which each group is processed can be determined in different ways. For instance, it can be based on a fixed time interval (e.g. every minute) or on some triggering condition (e.g. processing the group according to the frequency of received messages, i.e. once a given number of messages has arrived).

Each processing loop is characterised by three timestamps, defining two temporal windows: the incremental and the sliding temporal windows. The lower bound of the incremental temporal window is the lower temporal bound for queries on historical data (which are required by some items). The lower bound of the sliding window is the lower temporal bound of the timespan considered for data processing during the current loop. The upper bound, common to both the incremental and sliding temporal windows, is the upper temporal bound of the timespan considered for data processing during the current loop. At each new loop, the bounds of the sliding window are renewed in such a fashion that consecutive loops have consecutive temporal windows, in order to ensure a thorough assessment of all received messages.

The system depends on a series of variables stored in configuration files, which are to be set beforehand (cf. Figure 6). Those parameters are mainly the item list, the scenario list and the temporal parameters (e.g. waiting time).
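The three timestamps of a processing loop and their renewal can be sketched as follows. This is a minimal illustration rather than the system's actual code: the names are hypothetical, the lower bound of the incremental window is assumed to stay fixed from one loop to the next, and the intervals are assumed to be half-open so that a message falls into exactly one sliding window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LoopWindows:
    """The three timestamps characterising one processing loop."""
    history_start: datetime  # lower bound of the incremental window
    slide_start: datetime    # lower bound of the sliding window
    end: datetime            # upper bound shared by both windows


def next_windows(current: LoopWindows, step: timedelta) -> LoopWindows:
    """Windows of the following loop: the sliding window starts where the
    previous one ended, so consecutive loops cover consecutive timespans."""
    return LoopWindows(current.history_start, current.end, current.end + step)


def in_sliding(w: LoopWindows, ts: datetime) -> bool:
    """Is the message processed during this loop? (half-open interval)"""
    return w.slide_start <= ts < w.end


def in_incremental(w: LoopWindows, ts: datetime) -> bool:
    """Is the message available as historical data for this loop?"""
    return w.history_start <= ts < w.end
```

The step passed to `next_windows` would come from the temporal parameters of the configuration file; a count-based trigger would instead set the new upper bound to the timestamp of the last message collected.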
As the AIS itself is not fixed, the program is evolutive and was designed to be easily enhanced. Indeed, the list of items is not fixed over time: as the AIS still evolves, new items and analyses may come up and be included in the program.

5.2.4. Data Analysis Workflow

The message analysis and falsification detection component loops over several steps: first it updates the temporal windows, then it computes the item assessment and the flag assessment, which are detailed in this section.

Item assessment. Amongst the formalised integrity items (935 in total), 666 have been successfully implemented in our system. These items are spread amongst the different levels of the four-order assessment model. Several elements of the database are involved in this assessment process, namely the AIS messages, the configuration tables (e.g. temporal window parameters) and the results tables.

Each assessment starts by reading a configuration file which lists the set of items to be computed (up to 666 valid ones). The reading of each item triggers one of the four following cases, depending on the item level:

1. The read item is of order 1 or 2, so the database query involves only one message;
2. The read item is of order 3 or 4, so several messages, possibly of different types, are queried;
3. The read item has a bad format (non-existent item). In this case, the algorithm reports that this item has not been treated;
4. The read item has no value. This ends the loop (the program stops, or another loop starts with the next temporal window).

In both cases (1) and (2), a table of results for the item is created in the database. Then the relevant AIS message table is queried for the data fields of interest for this item and for the temporal span defined by the working window.
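The four cases above can be sketched as a small dispatch function. The item identifiers and the mapping from identifier to order are hypothetical; only the case distinction itself is taken from the description.

```python
from enum import Enum
from typing import Dict, Optional


class ItemCase(Enum):
    SINGLE_MESSAGE = 1  # order 1 or 2: the query involves one message
    MULTI_MESSAGE = 2   # order 3 or 4: several messages are queried
    UNKNOWN_ITEM = 3    # bad format or non-existent item: reported as untreated
    END_OF_LIST = 4     # no value: the current loop ends


def classify_item(entry: Optional[str], item_orders: Dict[str, int]) -> ItemCase:
    """Decide which of the four cases a configuration entry triggers.

    `item_orders` maps each implemented item identifier to its order (1 to 4).
    """
    if entry is None or entry.strip() == "":
        return ItemCase.END_OF_LIST
    order = item_orders.get(entry.strip())
    if order in (1, 2):
        return ItemCase.SINGLE_MESSAGE
    if order in (3, 4):
        return ItemCase.MULTI_MESSAGE
    return ItemCase.UNKNOWN_ITEM
```

Only the first two cases go on to create a result table and to query the AIS message tables for the working window.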
Once the values are returned, the program loops over them, treating all the messages within the working temporal window one by one. From this point on, the processing varies with the order of the algorithm.

In case (1), the values are directly assessed by the corresponding algorithm, and the results are written to the item result table once all messages have been treated.

In case (2), another query to the AIS message tables is necessary in order to get all the necessary pieces of information from other messages, which can come either from the same message type (for an order 3 item) or from another message type (for an order 4 item). Once the result of the query is stored, all necessary data are present and the assessment can take place, followed by the writing of all results to the corresponding database table once all the messages have been treated.

Flag assessment. A flag assessment is based on the analysis of AIS messages (item assessment as described above) together with contextual data sources (geographic data, navigation-related data, etc.). This assessment is driven by temporal windows and by the configuration file describing the list of algorithms to run for each scenario. This file also specifies the data sources to involve in the computations (AIS messages, external information, various results of the item algorithms). The flag assessment results are stored in dedicated database tables.

First, the temporal extent of the working window is obtained so that the messages to be assessed are known. An assessment is then performed on all falsification scenarios selected by the user.

For each of the scenarios, a flag table corresponding to the scenario in question is created in the database, and the relevant item results are queried and stored in the program, enabling further flag studies.
Then, for each of the messages in the working window, and for each of the various assessments that lead to the determination of a flag, the same process is repeated. In each of those processes, the scenario function calls the algorithm that corresponds to the given assessment, and this algorithm in turn queries the relevant AIS and contextual data sources in order to perform the computation that leads to the assessment of the flag, which is raised if the conditions are met. The flag results are then returned to the scenario function and, once all assessments for each message have been performed, all the computed flags are stored in the database, in a purposely created table.

5.2.5. Visualisation interface

Geographic views and visual analytic capabilities have been shown to offer a solution for displaying relevant information to the user amongst the ever-growing amount of data that systems have to process (e.g. Varga et al. (2017)). As cybersecurity issues are growing, visualisation emerges as a solution, and it is particularly important to use it appropriately, so that the relevant data are presented to the user unambiguously while minimising false positive information.

The visualisation interface introduced in Figure 6 is designed to highlight the integrity issues detected by the system. The web-based interface (Figure 8) provides a dual display, showing a map of the maritime traffic as well as the list of AIS messages with detected anomalies and associated risks. The interface includes a cartographic layer, a few data analytics and a text-based listing of detected features.

Figure 8: Web-based interface. Ships are coloured depending on the analysis of their AIS messages

The map relies on two layers: the cartographic layer and the data layer.
The cartographic layer constitutes the background of the interface, consisting of OpenStreetMap (OSM) tiles enhanced by OpenSeaMap (OSeaM) features. The data layer is made of points representing the vessels that the program has selected as deserving particular attention. This interface takes as input the results of the item and flag assessments described in Section 5.2.4.

6. Results

This section presents some results of computations performed on a subset of the real dataset presented in Section 5.1. First, a selection of the flags that have been implemented is presented. The result of the analysis on those subsets is then presented, with the number of corresponding flags raised, together with the visual feature that has been implemented in order to visualise the alerts raised by the detection system.

6.1. On Various Scenarios

6.1.1. Selected flags

As it is not possible to display the results of all the flags, a subset of flags has been selected in order to cover the diversity of the scenarios put in place. In total, 23 flags have been implemented and 8 have been chosen for testing and validation of results. Table 6 summarises the characteristics of those chosen flags. In addition to those flags, the flags linked to the vessel type have also been computed. In Table 6, the scenario column displays the number of the scenario described in Table 4.

| Flag name | Sce. | Description of anomaly |
|---|---|---|
| f_country | 1.1 | MMSI has an invalid country code |
| f_fr_consistency | 1.2 | AIS inconsistent with available fleet register data |
| f_quadruplet | 1.3 | One element of the identity quadruplet (MMSI number, IMO number, Callsign, Name) has changed |
| f_ubiquity | 1.4 | Vessel displays two distinct locations at the same time |
| f_outOfScope | 2.1 | Invalid vessel location coordinates |
| f_nextposition | 2.2 | Position not compatible with the positional and kinematic (speed, course, turn) information of the former AIS message |
| f_disapreap | 2.3 | Unexpected disappearance and reappearance after an unexpectedly long time |
| f_suddenapp | 2.4 | First appearance of a vessel in a location where a vessel is not expected to appear for the first time |

Table 6: Description of the selected list of flags

6.1.2. Selected data

Time slices of 6 hours have been chosen for data analysis. Data from seven consecutive days, covering a full week, have been randomly chosen, from Wednesday 14th October to Tuesday 20th October, 2015, for each day between 06:00 and 12:00. For each of those seven periods, the corresponding historical data preceding the beginning of the analysed time bracket are added to the dataset. In order to avoid a time-demanding computation during the assessment phase, historical data have been limited to two full days (48 hours). The required AIS and contextual data are also extracted from the dataset presented in Section 5.1.

6.1.3. Data computation and discussion

The results of the computation are presented in Table 7, the rows representing the flags of the scenarios, the columns the days, and the values the number of flags raised. Table 8 shows the number of vessels of each type for each time section studied.
| Flag | Oct 14th | Oct 15th | Oct 16th | Oct 17th | Oct 18th | Oct 19th | Oct 20th |
|---|---|---|---|---|---|---|---|
| Message number | 25,340 | 23,294 | 24,316 | 14,749 | 17,063 | 22,564 | 20,537 |
| f_country | 3 | 10 | 849 | 6 | 5 | 33 | 484 |
| f_fr_consistency | 44 | 62 | 38 | 13 | 57 | 47 | 51 |
| f_quadruplet | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| f_ubiquity | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| f_outOfScope | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| f_nextposition | 55 | 30 | 31 | 20 | 37 | 72 | 83 |
| f_disapreap | 3 | 4 | 4 | 5 | 3 | 2 | 2 |
| f_suddenapp | 2 | 0 | 0 | 0 | 0 | 0 | 0 |

Table 7: Number of flags raised by session

| Flag | Oct 14th | Oct 15th | Oct 16th | Oct 17th | Oct 18th | Oct 19th | Oct 20th |
|---|---|---|---|---|---|---|---|
| Vessel number | 2,410 | 2,968 | 2,445 | 1,801 | 1,905 | 1,644 | 1,602 |
| Cargo | 522 | 583 | 552 | 409 | 406 | 294 | 313 |
| Hazardous cargo | 127 | 221 | 133 | 90 | 97 | 45 | 54 |
| Passenger | 412 | 529 | 384 | 278 | 221 | 253 | 361 |
| Pl/f/s | 1,029 | 1,049 | 1,016 | 756 | 928 | 778 | 651 |
| Other | 194 | 309 | 179 | 120 | 98 | 196 | 105 |
| Incorrect | 126 | 277 | 181 | 148 | 155 | 78 | 118 |

Table 8: Number of vessel type flags raised by session, pl/f/s = pleasure, fishing or service vessels

Although only a fraction of all flags is presented, some considerations can be drawn from the results. Some of the flags (in our case f_quadruplet, f_ubiquity and f_outOfScope) have no occurrence during those seven days, which means that no vessel changed identity, no vessel was present in two locations at once (this flag would be more useful with a worldwide network of stations, but two stations displaying the same identity at the same time in the Brest roadstead or off the Brittany coasts would have been detected) and no message came with out-of-bounds coordinate values. The flag f_country presents large discrepancies from one day to another, with the Oct 16th and Oct 20th values being outstandingly high with respect to the other days. This is due to the presence of military vessels in the Brest bay, cruising under the MMSI number “777777777”. The other flags remain in the same order of magnitude throughout the seven days of the study. As the flag counter is based on the messages received, the number of flags can sometimes be quite high while involving only a few vessels, possibly only one.
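The difference between counting flagged messages and counting flagged vessels can be made concrete with a short sketch. The country-code test used here is a deliberately simplified stand-in, not the rule implemented in the system: it assumes every identity is a nine-digit ship MMSI whose first three digits (the Maritime Identification Digits) must lie between 201 and 775, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def invalid_country_code(mmsi: str) -> bool:
    """Simplified test: True unless the MMSI has nine digits and its first
    three digits fall within 201-775 (assumed range of valid country codes)."""
    return not (len(mmsi) == 9 and mmsi.isdigit() and 201 <= int(mmsi[:3]) <= 775)


def flag_summary(mmsis: Iterable[str]) -> Tuple[int, int]:
    """Return (number of flagged messages, number of distinct flagged vessels)."""
    per_vessel = Counter(m for m in mmsis if invalid_country_code(m))
    return sum(per_vessel.values()), len(per_vessel)
```

A stream in which a single vessel repeatedly reports the MMSI 777777777 thus produces as many raised flags as messages, while only one vessel is involved, which is the pattern behind the peaks of f_country in Table 7.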
From Table 8 we can see that the proportions of vessels of each type remain consistent throughout the week, and that, for each day, the statistical mode is the pleasure, fishing and service class, probably because this class encompasses a large number of vessels.

6.1.4. Results visualisation

The result of the flag analysis, stored in a dedicated table in the database, is the input for the visualisation interface presented in Section 5.2.5. This feature displays on a map all the vessels for which at least one flag has been raised (Figure 9). The vessels are shown in different colours according to their vessel type, and the user has several options, being able to display all the vessels in the neighbourhood of the selected vessel or the elements relative to the vessel itself. The user is also able to discard a vessel if they judge that the raising of the flag does not correspond to a situation that needs attention. The corresponding entry in the database is not erased from the table, but is tagged as discarded and is no longer shown on the screen.

Figure 9: Detection and visualisation of an alert

This interface aims at offering the people in charge of maritime monitoring a comprehensive overview of the maritime situation in their area of watch.

6.2. Discussion

This information system has been built as a decision-support tool for the user, which could be a private ship-owning company, but more probably a state-established civilian or military facility in charge of monitoring the maritime traffic off the coasts of the country and in inland waters. The purpose is to bring the people in charge of maritime traffic supervision to a concrete understanding of the situation, and a good means to achieve this goal is to present the results of the computation in visual form, as shown in Section 5.2.5.
The proposed software does not take decisions; its purpose is to be a tool in the hands of traffic monitoring personnel, designed to draw their attention to a specific amount of data that they have to handle and process. As maritime traffic keeps growing, the people in charge of monitoring face an ever-increasing amount of data to assess. By allowing the graphical display of suspicious vessels and their environment, this tool helps the personnel to take decisions based only on useful information, and thus decreases their cognitive load. As the program is built on thresholds for item and flag raising, expert knowledge is necessary to set them, and those thresholds can be adapted to local situations in case of an in situ use of the software. As the flags stand for possible maritime issues, the program helps to present those issues in an optimised way to the people in charge; however, it remains the duty of the personnel to determine the normality or abnormality of a situation, and to consider a case as particularly hazardous or to discard it.

The use of description logics enables an inference system to assess data quality. The computation of items is straightforward and follows rules set by domain experts; computation time is thus reduced, allowing a real-time use of the system. Maritime experts were also involved in the description of the scenarios and the modelling of the interface, bringing their expertise to the generation of several use cases. Experiments based on a dataset collected by our own means showed the pertinence and the efficiency of the approach for solving simulated falsification cases, generated from reported real cases.

A limitation of this logic-based approach is its deterministic nature. Indeed, the outcomes of item assessment are logical True or False values, which allows fast data processing and triggering of alerts but does not take into consideration the uncertain nature of some pieces of information.
Beyond fuzzy logic, the use of probabilistic models would enable a more precise modelling of the understanding of the maritime picture. However, the implementation of such a probabilistic approach is not straightforward and would require further research work in this field, involving a large number of domain experts. The main obstacle to such a system is the very nature of cyberthreats, for which few cases are reported; the construction of a knowledge base using machine learning methods is therefore hardly achievable. In this respect, although the handling of uncertainty would undoubtedly enhance the understanding of maritime situations, a rule-based deterministic approach remains the most reliable solution, as long as it is designed jointly with experts.

7. Conclusions

The work presented in this paper is part of the research in the fields of data integrity assessment, knowledge discovery and data science, with a domain exemplification in maritime situational awareness and maritime safety. The operational issue is a consequence of research questions raised after the demonstration that cyber systems are prone to attacks, and that a global understanding of the data that those systems provide must be achieved. In our use case, a global maritime location system, which is intended to provide additional safety to navigation as well as useful information to the surrounding vessels and coastal stations, was easily falsified. The objective was then to propose a methodology to point out cases of non-genuine data and to provide a risk assessment of those cases.

In order to do so, an approach based on the data quality dimensions was studied. Indeed, as information systems are data-based, they natively have data quality dimensions available to assess them.
More precisely, among the diversity of data quality dimensions, integrity was singled out as particularly important for a reliable assessment of data-based systems, and the assessment methodology is based on the development of integrity-based features assessing data veracity.

As such an integrity-based assessment requires a profound understanding of the mechanisms that rule the system in question, a thorough analysis of the system has been carried out, taking into consideration the primary purpose of the system and the uses that appeared later, in order to understand the intentions of the people who wrote the specifications. The technical part of the system was studied as it provides valuable information about its inner construction, and the data part of the system was scrutinised in order to find any combination of pieces of information that could result in an integrity breach.

From the results of this integrity study, and with the addition of non-system data such as fleet register data or navigation zones, flags were created with the purpose of pointing out problematic data through explicit statements. These flags enable the display of the corresponding vessels on an interactive map, allowing the user to concentrate on those vessels and use visual analytics tools to find a proper solution to the problem displayed.

In the frame of this work, expert knowledge from civil activities such as the merchant navy and from military activities has been involved, with the collaboration of officers of the French navy and cadets of the French naval academy, and with the collaboration of Cerema, a French cluster of public experts. This heterogeneous group of experts elaborated falsification cases which have been implemented and presented in this paper.
Although the approach has been designed in an iterative way with professional domain experts, a limitation of this work is that no tests with operational personnel were performed, which would be necessary for an operational validation. However, this paper validated the approach in terms of performance or response quality: all inventoried falsification cases are linked to their corresponding detectors, enabling the assessment of the scenarios presented in Section 6.1.

The next step of this study will consist in enhancing this analysis with the notion of risk, and both the database and the program have been designed in anticipation of this extension. Indeed, the various problems pointed out by the system will lead to different levels of risk, and thus to different levels of alert to be set and presented to the operators. These levels will vary with the type of vessel, the type of cargo, and the location and kinematics of the vessel and of the surrounding vessels, amongst others. This would be another step forward in the support of operators for decision-making at sea.

Acknowledgments

This research has been supported by The French National Research Agency (ANR) and co-funded by DGA (Directorate General of Armaments) under reference ANR-14-CE28-0028, in the frame of the DéAIS project, labelled by French clusters Pôle Mer Bretagne Atlantique and Pôle Mer Méditerranée.
Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Credit author Statement

Clément IPHAR: Conceptualization; Data curation; Formal analysis; Methodology; Software; Validation; Visualization; Writing - original draft
Cyril RAY: Data curation; Funding acquisition; Project administration; Supervision; Writing - review & editing
Aldo NAPOLI: Funding acquisition; Supervision; Writing - review & editing