Tracking and identification for football video analysis using deep learning

We describe the techniques used to train and customize deep learning models to detect, track, and identify soccer players recorded during soccer games using custom camera settings. The player detection model is customized to allow the detection of person-class objects from video input. Two newly developed filters, a spatial feature filter and a bounding box location filter, are described that help in classifying players and audience. A new tracking paradigm is illustrated that generates tracks of soccer players with fewer swaps, thereby reducing the effort of human annotators in later stages. A new method of identifying every player by detecting player t-shirt numbers has been developed and illustrated. This method provides tracks with high confidence and assigns an identity to most of the players corresponding to their individual t-shirt numbers. Finally, we provide a unique result assessment technique to judge the performance of the complete model.


INTRODUCTION
Object detection and tracking have emerged as an important area of Computer Vision (CV) that has found application in many interesting areas such as autonomous driving, robotics, and medical surgery. The sports industry is also adopting new developments for recording and analyzing the performance of sports personnel. There have been recent advancements in the way sports events are captured using single, multiple, and custom camera setups. Traditionally, the captured sports event data was post-processed by human annotators to pin-point important actions, track players, and provide key information, using which coaches could analyze the team's strategy and rate individual players on their performance during the event. But the traditional approach is very time consuming, since the process is very repetitive and the annotation needs to be performed for almost every frame. With advancements in CV and Artificial Intelligence (AI), some of the tasks performed by human annotators can be automated, thus speeding up the process. Performance analysis of sporting events such as football has been carried out for many years, and various new techniques are employed to generate quick and accurate results. Some of the main actions in a football match that need to be annotated are player tracking, ball tracking, and action recognition. These tasks are challenging due to the nature of the game, and the camera(s) used to capture the match raise the difficulty level further.
Extensive research has been carried out to improve multi-object detection across different shapes and resolutions. New developments in AI have provided many trackers based on deep learning techniques. Most tracking algorithms use spatial/temporal feature [1]-[3] matching scores as the key criterion, along with additional criteria such as Intersection over Union, Kalman filter prediction, and inter-frame distance to generate tracking results. Fewer methods make use of recurrent neural networks (RNNs) [4], [5] to gain an advantage over temporal features and also generate attention [6], [7]. These methods work well when objects are spatially differentiable. Multi-object tracking becomes challenging when the tracked objects are quite similar in the spatial domain and their locations overlap or lie very close to one another. Due to the nature of the football game, the players to be tracked wear the same jersey, and overlaps between players (of the same or the opposite team) are very frequent and can continue for many frames. We make use of the overlap between all detected bounding boxes and apply feature matching only to the candidates of each individual track, generating fewer swaps and switches and maintaining high confidence in the short term.
Our entire work is divided into four parts: player detection, player-audience classification, player tracking, and number tracking. Since we are using a custom camera setup, the object detection models must be optimized to detect person-class objects in all the frames. Using static bounding box locations and spatial features, the detected person-class objects are classified into players and non-players (audience). This helps to reduce the number of objects to be tracked. Using a custom tracker, the objects are tracked in every frame to produce online tracking results. This result is later used in detecting player t-shirt numbers and in joining any gaps in the tracks.

PLAYER DETECTION
Player detection is one of the challenging tasks, due to the nature of the camera settings used to capture the entire football match. For successful detection of players using any deep learning-based model, the ground truth annotations play a vital role. In each football match recorded by the Statmetrix camera settings, there are usually 22 players from 2 teams, who are the primary targets to be detected. Apart from players, there are many other person-class objects on the pitch such as the audience, non-playing players, and the referee. Annotating all objects in the videos would require huge man-hours, and hence an alternative solution based on existing techniques must be used or customized. Yolo [8] is one of the well-known deep learning models used to detect person-class objects in images. The frame dimensions of Statmetrix videos are in the range of 1K-2K pixels in height and 4K-6K pixels in width, covering the entire field in one frame. Resizing the full-frame image to the Yolo network size (different network sizes in multiples of 32) and detecting person-class objects results in problems such as a single bounding box covering multiple person objects and irregular bounding boxes, as represented in Figure 1. This is due to the uneven resizing of the original image, with a ratio of about 1:3 (h:w), into square images during inference. To overcome the problems generated while using full-size images, we split the images into sizes with a ratio close to 1:1.5 (h:w). Along with splitting, increasing the network size allows person-class objects that are far away from the camera to be detected. This is because the anchors used in the Yolo model cannot detect the features when they are resized to such low resolutions. At a higher network size, the feature dimensions are preserved and detected by the trained anchors, thus increasing the overall Average Precision (AP).
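The splitting step described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the functions `split_frame` and `to_frame_coords`, the number of splits, and the overlap fraction are all assumptions chosen so that each tile is close to the 1:1.5 (h:w) ratio and players on a tile seam are not cut in half.

```python
# Sketch of splitting a wide frame into tiles for detection.
# Tile count and overlap are illustrative assumptions.

def split_frame(frame_w, frame_h, n_splits, overlap=0.1):
    """Return (x0, y0, x1, y1) windows covering the frame horizontally.

    Each tile keeps the full frame height; tile widths are chosen so
    neighbouring tiles overlap by `overlap` of a tile width.
    """
    tile_w = frame_w / (n_splits - overlap * (n_splits - 1))
    step = tile_w * (1 - overlap)
    tiles = []
    for i in range(n_splits):
        x0 = int(round(i * step))
        x1 = min(frame_w, int(round(x0 + tile_w)))
        tiles.append((x0, 0, x1, frame_h))
    return tiles

def to_frame_coords(box, tile):
    """Map a tile-local detection (x1, y1, x2, y2) back to frame coordinates."""
    tx0, ty0, _, _ = tile
    x1, y1, x2, y2 = box
    return (x1 + tx0, y1 + ty0, x2 + tx0, y2 + ty0)

# A 2000 x 6000 (h x w) frame split into two overlapping tiles,
# each close to the 1:1.5 (h:w) ratio.
tiles = split_frame(6000, 2000, n_splits=2)
```

Detections from each tile are then mapped back with `to_frame_coords` and merged (e.g. by non-maximum suppression) to form the per-frame result.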
Using different splits provides better results compared to the full-size image and improves AP over the ground truth annotation (only player bounding boxes are considered, and the rest are neglected while calculating AP during inference testing), as shown in Table 1, where AC1 and AC5 are the football match videos used for testing, each of 5 minutes duration. The network size and the number of splits used in the inference that provide high AP in the best possible time are highlighted in Table 1.

PLAYER-AUDIENCE CLASSIFICATION
The overall aim is to detect the 11 players of each team for the successful generation of tracking results. Processing unwanted boxes costs processing time, and unwanted errors are introduced while tracking players. This emphasizes the need for an algorithm to filter out non-players and audience from the detected person bounding boxes for each frame. We have employed two strategies to remove non-players, using spatial features and bounding box locations.

Spatial Feature Filter (SFF)
In the spatial feature filter method, we make use of the spatial differences between players and audience. We collected over 10K images of players and non-players to train a deep learning model based on the ResNet-50 architecture to classify the images into 2 categories: player and audience. This filter method accurately identifies audience members who are close to the camera, while producing low classification confidence for images that are far away from the camera. This is mainly due to the resolution of the images, as far-away images have a resolution of about 15x40 pixels. Also, the spatial difference between player and audience tends to reduce far away from the camera, and hence the spatial feature filter alone cannot be employed for the task, as shown in Figure 2.
Figure 3 (flow chart of audience removal using the bounding box technique) illustrates the outline of the algorithm used to filter out non-players and audience per frame in a football match video using the bounding box filter method. The main idea behind the algorithm is the nature of the game and how the actions (movements) of players and non-players differ during a football match. Audience members tend to keep a constant location compared to players during a football match, and this location displacement is considered the key to distinguishing players from all detected person-class objects.

Bounding Box Filter (BBF)
In the first stage, all the frames recorded during a football game are processed to generate the bounding boxes for the person class. The detections are in the format (x1, y1, x2, y2), and these detections are later converted to the center location of each bounding box in the format (cx, cy). The bounding box locations of the entire match are overlapped to generate an activity map, or heat map, of the entire game. This map indicates the activity of persons on the football ground in terms of displacement, as shown in Figure 4. Depending upon the user requirement, a threshold can be set to filter the bright spots. A default value of 0.5 can be used, which indicates that a bounding box has the same location during 50% of the entire game and thus the location can be considered a non-player bounding box. A mask of non-players and audience is generated by selecting all pixels that have a value greater than the threshold set by the user. Using the (cx, cy) information of the bounding boxes, the non-players are filtered out: all (cx, cy) that fall within the mask region are considered non-players, and the rest of the detections are considered players. This drastically reduces the number of detections per frame to be processed to get the necessary tracking information. The resulting mask used to remove the audience from the detections made by the player detection model, and the performance of the spatial feature filter, the bounding box filter, and their combination, are represented in Table 2; the overall result is shown in Figure 5, where the extra bounding boxes consist of referees and audience.
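The activity-map idea above can be sketched in a few lines. This is a simplified stand-in for the paper's heat map: the grid cell size, the helper names `build_activity_map` and `is_non_player`, and the per-cell occupancy fraction are illustrative assumptions; the 0.5 threshold follows the default described in the text.

```python
from collections import defaultdict

def build_activity_map(centers_per_frame, cell=8):
    """Accumulate bounding-box centres on a coarse grid over the match.

    `centers_per_frame` is a list (one entry per frame) of lists of
    (cx, cy) centres. Returns {cell: fraction of frames in which a
    detection occupied that cell}.
    """
    counts = defaultdict(int)
    for frame in centers_per_frame:
        # Count each cell at most once per frame.
        seen = {(int(cx) // cell, int(cy) // cell) for cx, cy in frame}
        for c in seen:
            counts[c] += 1
    n = len(centers_per_frame)
    return {c: k / n for c, k in counts.items()}

def is_non_player(center, activity, threshold=0.5, cell=8):
    """A centre in a cell occupied in more than 50% of frames is treated as audience."""
    cx, cy = center
    return activity.get((int(cx) // cell, int(cy) // cell), 0.0) > threshold

# Toy usage: a static audience point and a moving player over 10 frames.
frames = [[(100, 100), (200 + 10 * i, 50)] for i in range(10)]
activity = build_activity_map(frames)
```

Cells that stay bright across the match form the non-player mask; detections whose centres land in those cells are dropped before tracking.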

SPLIT FEATURE EXTRACTION AND MATCHING
Tracking players is one of the most crucial tasks in our work, as all statistical data of players, such as running, scoring goals, and kicking, are associated with proper tracking. The video recorded by the Statmetrix camera incorporates a unique Field of View (FOV), where the players are recorded at an angle, facilitating two different overlapping scenarios. Firstly, an overlapping player completely occludes another player, so the extracted feature represents only one player, which is similar to other tracking problems. On the contrary, due to the camera angle, overlapping players tend to occlude part of a player's body rather than the entire player, and this occlusion happens very frequently. The features extracted under this scenario do not provide high confidence for any single player. We employ a split-feature matching system to retain typical full-body feature extraction and gain an advantage in overlapping scenarios.
A split feature matching model is trained to provide player matching results in terms of percentage, as shown in Figure 6. The model is trained with a ResNet-50 backbone [9] to extract spatial features, and a Siamese architecture style is used to match the temporal features. The model consists of a splitter that splits the test and target images into two parts, top and bottom, whose features are extracted and matched individually to provide a value between 0 and 1, which is the matching percentage between the test and the target image. Since two matching results are available after matching, the feature matching score (F_m) is obtained by assigning equal weightage to both feature matching scores. In the split feature matching technique, variable weightage can be assigned, which is not possible for full-body matching.
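The combination step of split feature matching can be sketched as below. This is a minimal sketch under stated assumptions: the half-embeddings are stand-ins for what the trained ResNet-50 Siamese branches would produce, cosine similarity stands in for the learned matching head, and the function name `split_feature_score` is hypothetical. The equal default weights mirror the equal weightage described above; shifting them is how variable weightage would be applied.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (a stand-in
    for the trained Siamese matching head)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def split_feature_score(test_feats, target_feats, w_top=0.5, w_bottom=0.5):
    """Match top and bottom halves separately, then combine into F_m.

    `*_feats` are (top_vector, bottom_vector) pairs. With equal weights
    this mirrors the paper's default; the weights can be shifted when,
    for example, a player's lower body is occluded.
    """
    s_top = cosine(test_feats[0], target_feats[0])
    s_bottom = cosine(test_feats[1], target_feats[1])
    return w_top * s_top + w_bottom * s_bottom
```

For instance, two images whose top halves match but whose bottom halves are occluded would still score 0.5 with equal weights, rather than the near-zero score a single full-body comparison might give.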

TRACKING
The primary task of our work is to track the football players accurately so that football teams can analyze the performance of individual players. Due to the nature of the football game, tracking becomes highly difficult in crowded scenarios and in regions that are far away from the camera. In such situations, human annotators resolve the tracking issues during post-processing. Hence the main goal of tracking is to produce tracks of high confidence and reduce swaps as far as possible. MOT scores are considered the standard metrics to compare different tracking models and judge the best tracker for a given challenge. But in the MOT metrics [10], events such as swaps and renames/switches are given the same weightage.

Figure 6 Split Feature matching
These metrics become unacceptable in scenarios where the main requirement is to generate tracks with a smaller number of Id swaps and an acceptable number of Id renames. To resolve an Id rename, a human annotator needs to focus only on the two frames where the new Id is created, while in the case of an Id swap the annotator needs to inspect every individual frame to locate the swap frame and then correct the Id. Thus, generating confident short tracks is better than generating longer tracks with swaps. Some of the key terms used in our work are highlighted below.
• Id swap: The Ids of two players are exchanged.
• Id Rename/Switch: Player Id is changed to a new Id.
• Id Copy: A single Id is assigned to two different players.

Assignment
Due to the nature of the game, the direction of players changes rapidly, and the actions performed by the players change the bounding box dimensions, which sometimes causes errors in the Kalman filter prediction (K_p). Therefore, considering only the Kalman prediction for assignment is not feasible. In our method, more weightage is given to the previous frame detection, i.e., the bounding box of the available track, compared to the Kalman filter prediction. Hence, for assignment, we introduce a gating distance (G_d) that provides candidates (C_t) for every track that have a high probability of being assigned to the track, where C_t is a subset of the set of detections D.
G_d is the Euclidean distance between the centers (cx, cy) of the bounding box of a given track (t) and a detection (a). The bounding boxes are represented by (x, y, w, h). The gating threshold is the maximum of the width and height of a track's bounding box. The detections assigned as candidates for a track are given by Equation 3, which provides a list of candidates with a high matching probability to a given track within G_d. To select the matching candidate for a track, we further generate a feature matching score (F_m) and a track overlap score (T_o).
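The gating step can be sketched as follows. This is an interpretation of Equation 3 from the surrounding text, not the paper's exact formula: the function name `gating_candidates` is hypothetical, and the gate is taken as max(w, h) of the track's bounding box as described above.

```python
import math

def gating_candidates(track_box, detections):
    """Select candidate detections for a track (a sketch of Equation 3).

    Boxes are (x, y, w, h) with (x, y) the top-left corner. A detection
    is a candidate when the Euclidean distance between box centres is
    below the gating distance, taken as max(w, h) of the track's box.
    Returns the indices of the candidate detections.
    """
    tx, ty, tw, th = track_box
    tcx, tcy = tx + tw / 2, ty + th / 2
    gate = max(tw, th)
    candidates = []
    for i, (x, y, w, h) in enumerate(detections):
        cx, cy = x + w / 2, y + h / 2
        if math.hypot(cx - tcx, cy - tcy) < gate:
            candidates.append(i)
    return candidates
```

Only the detections returned here are passed on to the feature matching and overlap scoring stages, which keeps per-track matching cheap and local.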

Matching
We calculate the spatial similarity of two bounding boxes, the track (t) and the candidate detection (a), using the split feature matching method (Section 4). Additionally, when two or more players overlap, the bounding box dimensions differ from those when there is just one player in the bounding box. To include this feature in the tracking, we calculate the overlap area between the track and the candidate detection to measure the similarity of the bounding box area, the area confidence (A_c). Equation 6 is used as the criterion for a candidate detection to be considered for tracking by the user.
M(t, a) = F_m * A_c. Track confidence (T_c) is introduced as a helper function to A_c, to identify candidates allocated to a track that does not have any bounding box nearby. There are two main reasons for the feature matching model to generate low matching scores. By default, spatially different images end up with very low matching scores. Also, when the detection of a player is not perfectly aligned with the bounding box details stored in the existing track, the feature matching score is low even though the detection should theoretically have been a good match to the given track. This also helps to neglect candidate bounding boxes that overlap more than the threshold (0.3) with other bounding boxes and have a high probability of introducing errors in the tracking process.
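The matching criterion above can be sketched as follows. This is a sketch under stated assumptions: area confidence is approximated here by the IoU between the track's last box and the candidate, the combination of scores by multiplication is an interpretation of Equation 6, and the names `iou` and `match_score` are hypothetical; the 0.3 overlap veto follows the threshold mentioned in the text.

```python
def iou(a, b):
    """Intersection over Union for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_score(feature_score, track_box, det_box, other_boxes, overlap_thr=0.3):
    """Combine feature matching with area confidence (a sketch of Equation 6).

    Candidates overlapping any other detection by more than `overlap_thr`
    are vetoed, since they are likely to introduce tracking errors; the
    surviving candidates score feature similarity weighted by box overlap
    with the track.
    """
    for other in other_boxes:
        if iou(det_box, other) > overlap_thr:
            return 0.0
    return feature_score * iou(track_box, det_box)
```

Vetoing heavily overlapped candidates trades track length for confidence: a new (short) track is preferred over risking a swap, matching the paper's stated goal.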

LONG-TERM PLAYER TRACKING BY DIGIT DETECTION
In a football match, the key features that can be used to differentiate individual players are the temporal features and the player locations. Apart from these, the t-shirt number of each player is an important feature that can help to identify each player. But, due to the nature of the video recording, the detection of t-shirt numbers is difficult, since the resolution of the numbers is low (around 15x15 pixels). Also, due to the nature of the game, t-shirt deformation directly impacts the visibility of the number. Considering all these challenges, a number detection model is trained with Yolo-V3 as the backbone on a custom dataset.

Digits to Numbers
The dataset (image data) used for the digit detection model is a subset of the football data used for feature extraction. The true bounding boxes for the players are reused to generate an initial database consisting of cropped images of players from the football match videos. This initial database contains players in different orientations and at various locations within the football field. The images used for the digit detection model were handpicked by human annotators from the initial database and later annotated for the digits ranging from 0-9 (10 classes). The trained digit model predicts the digits in the player images and provides the bounding box details during inference. Using this information, we can identify the digits on a player's t-shirt and locate exactly where each digit is in the image, but there is no information regarding the order of the digits. For example, when the model predicts digits such as 1, 7, and 3, we are not sure of the order; it could be any of 13, 17, 73, 37, and so on. Therefore, it becomes necessary to have an algorithm to sort these digits into numbers. For a given w x h image, let P be all the predictions from the digit detection model, with bounding box locations (x1, y1) and (x2, y2) and (cx, cy) as their corresponding centers. Let (w_d, h_d) be the width and height of each prediction in P. Let E be the Euclidean distances between all pairs of predictions within P. Let a and b be the two detections under consideration, with E(a, b) being their Euclidean distance. Using the algorithm shown in Table 3, the numbers with the highest probability for consideration are at the top of the E(a, b) list.
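The digit-ordering idea can be sketched as below. This is a simplified stand-in for the algorithm of Table 3 (which is not reproduced here): the function name `digits_to_number` and the rule of grouping digits whose centre distance is within a factor of the mean digit width are illustrative assumptions; reading grouped digits left to right by centre x is the ordering step the text motivates.

```python
import math

def digits_to_number(predictions, gap_factor=1.5):
    """Combine digit detections on one shirt into a number.

    `predictions` is a list of (digit, cx, cy, w, h). Digits whose
    centre-to-centre distance is within `gap_factor` times the mean
    digit width are treated as part of one number and read left to
    right; `gap_factor` is an illustrative assumption, not the
    paper's exact rule.
    """
    if not predictions:
        return None
    preds = sorted(predictions, key=lambda p: p[1])  # left to right by cx
    mean_w = sum(p[3] for p in preds) / len(preds)
    digits = [preds[0][0]]
    for prev, cur in zip(preds, preds[1:]):
        dist = math.hypot(cur[1] - prev[1], cur[2] - prev[2])
        if dist <= gap_factor * mean_w:
            digits.append(cur[0])
    return int("".join(str(d) for d in digits))
```

For example, a "1" detected at centre x = 18 and a "7" at centre x = 30 with 12-pixel-wide boxes would be read as 17 rather than 71.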

Linking and Tracking Numbers
Since the digits are identified on the t-shirts of players who are in constant motion, the digits do not always stay parallel to the camera. Also, the orientation of the players keeps changing, so there is a great chance that the model will see only one of the two digits on the t-shirt. Hence the number detected with high confidence does not always match the true number on the t-shirt: there is a high probability of the model predicting a wrong number with high confidence (a false positive) due to the deformation of the t-shirt during gameplay. Additional data can be generated by considering the previous and current bounding boxes, to determine the relationship between detections and use the result when assigning a number to the corresponding track. Since changes in player orientation are very frequent during football gameplay, wrong classifications of the detected number (false positives) are to be expected. Such an error is hard to identify from a single detection, and hence information from previous detections of the same player is necessary to provide some degree of confidence in the classification of detected numbers.
The distance between two corresponding detections of the same player, i.e., between the previous and current detection locations, is calculated and used as a criterion to assign a t-shirt number to a track as well as to reduce false positives. For a given short track, all digits are converted to numbers using the method explained earlier.
where (f_n+1, f_n) are the consecutive frames of a track. The strength of an individual track for a specific number i of track t_0 is given by the count of consistent detections of that number, with the threshold being a minimum value set by the user.
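The strength criterion can be sketched as a vote over the numbers read along a short track. This is a sketch, not the paper's exact formula: the function name `assign_track_number` is hypothetical, the consecutive-frame distance check described above is omitted for brevity, and the default threshold value is an illustrative assumption standing in for the user-set minimum.

```python
from collections import Counter

def assign_track_number(numbers_per_frame, threshold=3):
    """Assign a t-shirt number to a short track by voting over frames.

    `numbers_per_frame` holds the number read in each frame (None when
    nothing was detected). The strength of a number is how often it was
    seen; the track keeps the strongest number only when its strength
    reaches the user-set minimum, which suppresses one-off false
    positives caused by t-shirt deformation.
    """
    votes = Counter(n for n in numbers_per_frame if n is not None)
    if not votes:
        return None
    number, strength = votes.most_common(1)[0]
    return number if strength >= threshold else None
```

A track that reads [10, 10, None, 10, 7] keeps number 10, while a track with only two consistent readings stays unassigned until more evidence accumulates.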

RESULTS
The standard tracking metrics, MOT, do not readily provide the individual components/events that are essential in this work, such as swaps, switches, and Id copies. Hence, we have come up with simple metrics that illustrate the key components needed to judge the performance of the tracker as well as of the other methods used to provide a complete solution. The ground truth (GT) detections are matched with the test (T) detections using the techniques followed in the standard MOT benchmark [10], [11]. Additionally, we keep track of swaps, switches, and Id copies along with the age of each event, i.e., how long the event lasts, to measure the tracking quality. For example, if an Id swap happens at the 45th frame and continues until the 50th frame, the age of the swap event is 5 frames. Similarly, all events are recorded along with their ages for all the GT tracks. We have set age limits of [1, 5, 10, 30] frames to understand the solution generated by the models compared to the ground truth annotations. For example, 25 swaps at the limit of 5 frames means the tracks have 25 swap events lasting more than 5 frames, which need to be resolved by the operator at later stages. The tracking results are represented in Table 4. The tracking result is further processed using the number tracking method described in Section 6. With the t-shirt Id as a key, the tracks are joined to further increase the track length. Using this technique, the average track length was increased from 869 frames to 1091 frames by joining 24 out of 365 tracks for the AC1 video. Similarly, the average track length of the AC5 video was increased from 572 frames to 785 frames by joining 31 out of 338 tracks.
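The age-limit reporting described above can be sketched as follows. The function name `event_counts_over_limits` is hypothetical; the [1, 5, 10, 30] frame limits and the age definition (a swap lasting from frame 45 to frame 50 has age 5) follow the text.

```python
def event_counts_over_limits(event_ages, limits=(1, 5, 10, 30)):
    """Count tracking events (e.g. swaps) lasting longer than each limit.

    `event_ages` lists the age in frames of each recorded event.
    Returns a dict mapping each limit to the number of events that
    exceed it, so e.g. 25 at limit 5 means 25 events longer than
    5 frames remain for the operator to resolve.
    """
    return {lim: sum(1 for age in event_ages if age > lim) for lim in limits}
```

Reporting events per age limit separates fleeting errors, which the operator can often ignore, from long-lived swaps that genuinely cost correction effort.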
There are a few drawbacks to the methods used in this work. Firstly, the SFF+BBF method sometimes fails, especially for the goalkeeper, who tends to keep a constant location during the football match, and for the far-away audience. Secondly, the tracking method sometimes generates new tracks (switches), as we have introduced the area confidence and track overlap score terms. These parameters override a high feature matching score and introduce a new track in order to provide a high-confidence short track.

CONCLUSION
In our work, we have described different deep learning models that were designed and trained to accomplish the tasks of detecting players, detecting digits, classification, and tracking players. Due to the camera settings and the resolution of players, the Yolo-v3 model was customized to detect most of the person-class objects in the video. The technique of image splitting was adopted, which detects most of the bounding boxes, as demonstrated in Table 1. The classification of audience and players using SFF and BBF is over 80% accurate, and the clear advantage of using both methods together is shown in Table 2. The tracking results on both videos show that the number of clicks the operators need to perform to resolve the switches is under 350. Also, the number of swaps is under 25 in both videos, requiring only a few more clicks. The additional number tracking model helps to join the tracks and increase the average track length; in both videos it was able to join more than 20 tracks and add around 200 frames' worth of tracking data. In our work, a few of the deep learning models are customized for the Statmetrix video inputs to provide results that reduce operator effort, and they are therefore not compared with other tracking methods. However, these models can be used on any other sporting video to generate tracking data and provide identities corresponding to the individuals' t-shirt numbers. As future work, we plan to use split-feature matching for other tracking problems to evaluate its advantage over traditional full-body matching.