Merging clips with AVFoundation creates single video in black


I am using AVFoundation to merge two videos into one. The result is a single video whose length equals the sum of all the clips, but it displays only a black screen.

Here is my code:

        public void mergeclips()
        {
            AVMutableComposition mixComposition = new AVMutableComposition();
            CMTime previous_asset_duration = CMTime.Zero;
            CMTime AllAssetDurations = CMTime.Zero;
            AVMutableVideoCompositionLayerInstruction[] Instruction_Array = new AVMutableVideoCompositionLayerInstruction[Clips.Count];
            

            foreach (string clip in Clips)
            {
                #region HoldVideoTrack
                AVAsset asset = AVAsset.FromUrl(NSUrl.FromFilename(clip));

                AVMutableCompositionTrack Track = mixComposition.AddMutableTrack(AVMediaType.Video, 0);

                CMTimeRange range = new CMTimeRange()
                {
                    Start = new CMTime(0, 0),
                    Duration = asset.Duration
                };

                AVAssetTrack track = asset.TracksWithMediaType(AVMediaType.Video)[0];
                Track.InsertTimeRange(range, track, previous_asset_duration, out NSError error);
                #endregion

                #region Instructions
                // 7
                var Instruction = AVMutableVideoCompositionLayerInstruction.FromAssetTrack(Track);

                Instruction.SetOpacity(0, asset.Duration);

                // 8
                Instruction_Array[Clips.IndexOf(clip)] = Instruction;
                #endregion

                previous_asset_duration = asset.Duration;
                AllAssetDurations = asset.Duration;
            }

            // 6
            var mainInstruction = new List<AVMutableVideoCompositionInstruction>();

            CMTimeRange rangeIns = new CMTimeRange()
            {
                Start = new CMTime(0, 0),
                Duration = AllAssetDurations
            };

            mainInstruction[0].TimeRange = rangeIns;
            
            mainInstruction[0].LayerInstructions = Instruction_Array;

            var mainComposition = new AVMutableVideoComposition();
            mainComposition.Instructions = mainInstruction.ToArray();
            mainComposition.FrameDuration = new CMTime(1, 30);
            mainComposition.RenderSize = new CoreGraphics.CGSize(UIScreen.MainScreen.Bounds.Width, UIScreen.MainScreen.Bounds.Height);


            //... export video ...

            AVAssetExportSession exportSession = new AVAssetExportSession(mixComposition, AVAssetExportSessionPreset.MediumQuality)
            {
                OutputUrl = NSUrl.FromFilename(Path.Combine(Path.GetTempPath(), "temporaryClip/Whole.mov")),
                OutputFileType = AVFileType.QuickTimeMovie,
                ShouldOptimizeForNetworkUse = true,
                //APP crashes here
                VideoComposition = mainComposition
            };
            exportSession.ExportAsynchronously(_OnExportDone);
        }

        private static void _OnExportDone()
        {
            var library = new ALAssetsLibrary();
            library.WriteVideoToSavedPhotosAlbum(NSUrl.FromFilename(Path.Combine(Path.GetTempPath(), "temporaryClip/Whole.mov")), (path, e2) =>
            {
                if (e2 != null)
                {
                    new UIAlertView("Error", e2.ToString(), null, "OK", null).Show();
                }
                else
                {
                }
            });
        }

EDIT: I added more code; specifically, I added ShouldOptimizeForNetworkUse and a VideoComposition to the AVAssetExportSession. I am using a List instead of a single AVMutableVideoCompositionInstruction because AVMutableVideoComposition.Instructions requires an AVVideoCompositionInstruction[]. With the previous code the app crashes at the line "VideoComposition = mainComposition".

EDIT: After including transformations for the instructions and making the corrections that Shawn pointed out, I can merge two or more videos and save the combined video to a file. Unfortunately, the root problem remains: the final video displays only the BackgroundColor of the AVMutableVideoCompositionInstruction, not the clips themselves. The audio of these videos is also ignored; I don't know whether it has to be added separately, but knowing that might also be helpful.

Here is my code:

        public void mergeclips()
        {
            AVMutableComposition mixComposition = new AVMutableComposition();
            AVMutableVideoCompositionLayerInstruction[] Instruction_Array = new AVMutableVideoCompositionLayerInstruction[Clips.Count];

            foreach (string clip in Clips)
            {
                #region HoldVideoTrack
                AVAsset asset = AVAsset.FromUrl(NSUrl.FromFilename(clip));

                AVMutableCompositionTrack Track = mixComposition.AddMutableTrack(AVMediaType.Video, 0);

                CMTimeRange range = new CMTimeRange()
                {
                    Start = new CMTime(0, 0),
                    Duration = asset.Duration
                };

                AVAssetTrack track = asset.TracksWithMediaType(AVMediaType.Video)[0];
                Track.InsertTimeRange(range, track, mixComposition.Duration, out NSError error);
                #endregion

                #region Instructions
                Instruction_Array[Clips.IndexOf(clip)] = SetInstruction(asset, mixComposition.Duration, Track);
                #endregion
            }

            // 6
            var mainInstruction = new AVMutableVideoCompositionInstruction();

            CMTimeRange rangeIns = new CMTimeRange()
            {
                Start = new CMTime(0, 0),
                Duration = mixComposition.Duration
            };

            mainInstruction.BackgroundColor = UIColor.FromRGBA(1f, 1f, 1f, 1.000f).CGColor;
            mainInstruction.TimeRange = rangeIns;
            mainInstruction.LayerInstructions = Instruction_Array;

            var mainComposition = new AVMutableVideoComposition()
            {
                Instructions = new AVVideoCompositionInstruction[1] { mainInstruction },
                FrameDuration = new CMTime(1, 30),
                RenderSize = new CoreGraphics.CGSize(UIScreen.MainScreen.Bounds.Width, UIScreen.MainScreen.Bounds.Height)
            };

            //... export video ...

            AVAssetExportSession exportSession = new AVAssetExportSession(mixComposition, AVAssetExportSessionPreset.MediumQuality)
            {
                OutputUrl = NSUrl.FromFilename(Path.Combine(Path.GetTempPath(), "temporaryClip/Whole.mov")),
                OutputFileType = AVFileType.QuickTimeMovie,
                ShouldOptimizeForNetworkUse = true,
                VideoComposition = mainComposition
            };
            exportSession.ExportAsynchronously(_OnExportDone);
        }

        private AVMutableVideoCompositionLayerInstruction SetInstruction(AVAsset asset, CMTime currentTime, AVMutableCompositionTrack assetTrack)
        {
            var instruction = AVMutableVideoCompositionLayerInstruction.FromAssetTrack(assetTrack);

            var transform = assetTrack.PreferredTransform;
            var transformSize = assetTrack.NaturalSize; //for export session
            var newAssetSize = new CoreGraphics.CGSize(transformSize.Width, transformSize.Height); // for export session

            if (newAssetSize.Width > newAssetSize.Height)//portrait
            {
                //From here on, newAssetSize has its height and width inverted: height should be width and vice versa
                var scaleRatio = UIScreen.MainScreen.Bounds.Height / newAssetSize.Width;
                var _coreGraphic = new CoreGraphics.CGAffineTransform(0, 0, 0, 0, 0, 0);
                _coreGraphic.Scale(scaleRatio, scaleRatio);
                var tx = UIScreen.MainScreen.Bounds.Width / 2 - newAssetSize.Height * scaleRatio / 2;
                var ty = UIScreen.MainScreen.Bounds.Height / 2 - newAssetSize.Width * scaleRatio / 2;
                _coreGraphic.Translate(tx, ty);

                instruction.SetTransform(_coreGraphic, currentTime);

            }

            var endTime = CMTime.Add(currentTime, asset.Duration);
            instruction.SetOpacity(0, endTime);


            return instruction;
        }

EDIT: Several mistakes in the code were corrected thanks to Shawn's help. The problem remains: the resulting video still shows no image.

Here is my code:

        public void mergeclips()
        {
            //microphone
            AVCaptureDevice microphone = AVCaptureDevice.DefaultDeviceWithMediaType(AVMediaType.Audio);

            AVMutableComposition mixComposition = new AVMutableComposition();
            AVMutableVideoCompositionLayerInstruction[] Instruction_Array = new AVMutableVideoCompositionLayerInstruction[Clips.Count];

            foreach (string clip in Clips)
            {
                #region HoldVideoTrack

                AVAsset asset = AVAsset.FromUrl(NSUrl.FromFilename(clip));

                CMTimeRange range = new CMTimeRange()
                {
                    Start = new CMTime(0, 0),
                    Duration = asset.Duration
                };

                AVMutableCompositionTrack videoTrack = mixComposition.AddMutableTrack(AVMediaType.Video, 0);
                AVAssetTrack assetVideoTrack = asset.TracksWithMediaType(AVMediaType.Video)[0];
                videoTrack.InsertTimeRange(range, assetVideoTrack, mixComposition.Duration, out NSError error);

                if (microphone != null)
                {
                    AVMutableCompositionTrack audioTrack = mixComposition.AddMutableTrack(AVMediaType.Audio, 0);
                    AVAssetTrack assetAudioTrack = asset.TracksWithMediaType(AVMediaType.Audio)[0];
                    audioTrack.InsertTimeRange(range, assetAudioTrack, mixComposition.Duration, out NSError error2);
                }
                #endregion


                #region Instructions
                Instruction_Array[Clips.IndexOf(clip)] = SetInstruction(asset, mixComposition.Duration, videoTrack);
                #endregion
            }

            // 6
            var mainInstruction = new AVMutableVideoCompositionInstruction();

            CMTimeRange rangeIns = new CMTimeRange()
            {
                Start = new CMTime(0, 0),
                Duration = mixComposition.Duration
            };

            mainInstruction.BackgroundColor = UIColor.FromRGBA(1f, 1f, 1f, 1.000f).CGColor;
            mainInstruction.TimeRange = rangeIns;
            mainInstruction.LayerInstructions = Instruction_Array;

            var mainComposition = new AVMutableVideoComposition()
            {
                Instructions = new AVVideoCompositionInstruction[1] { mainInstruction },
                FrameDuration = new CMTime(1, 30),
                RenderSize = new CoreGraphics.CGSize(UIScreen.MainScreen.Bounds.Width, UIScreen.MainScreen.Bounds.Height)
            };

            //... export video ...

            AVAssetExportSession exportSession = new AVAssetExportSession(mixComposition, AVAssetExportSessionPreset.MediumQuality)
            {
                OutputUrl = NSUrl.FromFilename(Path.Combine(Path.GetTempPath(), "temporaryClip/Whole.mov")),
                OutputFileType = AVFileType.QuickTimeMovie,
                ShouldOptimizeForNetworkUse = true,
                VideoComposition = mainComposition
            };
            exportSession.ExportAsynchronously(_OnExportDone);
        }

        private AVMutableVideoCompositionLayerInstruction SetInstruction(AVAsset asset, CMTime currentTime, AVMutableCompositionTrack mixComposition_video_Track)
        {
            //The following code triggers when a device has no camera or no microphone (for instance an emulator)
            var instruction = AVMutableVideoCompositionLayerInstruction.FromAssetTrack(mixComposition_video_Track);

            //Get the individual AVAsset's track to use for transform
            AVAssetTrack assetTrack = asset.TracksWithMediaType(AVMediaType.Video)[0];

            //Set transform to the preferredTransform of the AVAssetTrack, not the AVMutableCompositionTrack
            CGAffineTransform transform = assetTrack.PreferredTransform;
            //Set the transformSize to be the asset natural size AFTER applying preferredTransform.
            CGSize transformSize = transform.TransformSize(assetTrack.NaturalSize);

            //Handle any negative values resulted from applying transform by using the absolute value
            CGSize newAssetSize = new CoreGraphics.CGSize(Math.Abs(transformSize.Width), Math.Abs(transformSize.Height));

            //change back to less than
            if (newAssetSize.Width < newAssetSize.Height)//portrait
            {
                /*newAssetSize should no longer be inverted since preferredTransform handles this. Remember that the asset was never 
                 * actually transformed yet. newAssetSize just represents the size the video is going to be after you call 
                 * instruction.setTransform(transform). Since transform is the first transform in concatenation, this is the size that 
                 * the scale and translate transforms will be using, which is why we needed to reference newAssetSize after applying 
                 * transform. Also you should concatenate in this order: transform -> scale -> translate, otherwise you won't get 
                 * desired results*/
                nfloat scaleRatio = UIScreen.MainScreen.Bounds.Height / newAssetSize.Height;

                //Apply scale to transform. Transform is never actually applied unless you do this.
                transform.Scale(scaleRatio, scaleRatio); 
                nfloat tx = UIScreen.MainScreen.Bounds.Width / 2 - newAssetSize.Width * scaleRatio / 2;
                nfloat ty = UIScreen.MainScreen.Bounds.Height / 2 - newAssetSize.Height * scaleRatio / 2;
                transform.Translate(tx, ty);

                instruction.SetTransform(transform, currentTime);
            }

            var endTime = CMTime.Add(currentTime, asset.Duration);
            instruction.SetOpacity(0, endTime);

            return instruction;
        }

CodePudding user response:

You are not inserting each time range at the end of everything already in the composition; you are inserting at just the previous asset's duration. Also, are you using a videoComposition when you export?
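
Each clip needs to go in at the composition's current duration, which grows as you append. Something like this in your C# (untested sketch; I don't write C#, so the exact Xamarin calls are taken from your own snippets):

    // Sketch only: append each clip after everything already in the composition.
    foreach (string clip in Clips)
    {
        AVAsset asset = AVAsset.FromUrl(NSUrl.FromFilename(clip));
        AVMutableCompositionTrack compTrack = mixComposition.AddMutableTrack(AVMediaType.Video, 0);

        CMTimeRange range = new CMTimeRange
        {
            Start = CMTime.Zero,
            Duration = asset.Duration
        };

        AVAssetTrack sourceTrack = asset.TracksWithMediaType(AVMediaType.Video)[0];

        // mixComposition.Duration is the end of what has been inserted so far.
        compTrack.InsertTimeRange(range, sourceTrack, mixComposition.Duration, out NSError error);
    }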

UPDATE: A long time ago I switched to playing the merged video back inside the app, so I no longer export. When I first started I exported first and then passed the exported video as an AVAsset into AVPlayer, but I changed that because exporting just for in-app playback was inefficient and a waste of time. My code works perfectly in terms of merging assets together, and it worked back when I was exporting, but I have also changed my merge function a lot since then, so no guarantees this will work with an export session.

func mergeVideos(mixComposition: Binding<AVMutableComposition>, videoComposition: Binding<AVMutableVideoComposition>, mainInstruction: Binding<AVMutableVideoCompositionInstruction>) -> AVPlayerItem {

    guard let documentDirectory = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first else {
        return AVPlayerItem(asset: mixComposition.wrappedValue)
    }
    
    //Remove all existing videos, tracks and instructions
    self.assets.removeAll()
    
    for track in mixComposition.wrappedValue.tracks {
        mixComposition.wrappedValue.removeTrack(track)
    }
    
    //Add all videos to asset array
    for video in videos {
        let url = documentDirectory.appendingPathComponent(video.videoURL)
            let asset = AVURLAsset(url: url, options: [AVURLAssetPreferPreciseDurationAndTimingKey : true])
            self.assets.append(asset)
    }
    
    //add instructions and assets to mixComposition
    assets.forEach { asset in
        self.addTrackToComposition(asset: asset, mixComposition: mixComposition, videoComposition: videoComposition, mainInstruction: mainInstruction)
    }//forEach
    
    //create playerITem with videoComposition
    videoComposition.wrappedValue.instructions = [mainInstruction.wrappedValue]
    videoComposition.wrappedValue.frameDuration = CMTimeMake(value: 1, timescale: 30)
    videoComposition.wrappedValue.renderSize = renderSize
    
    let item = AVPlayerItem(asset: mixComposition.wrappedValue)
    item.seekingWaitsForVideoCompositionRendering = true
    item.videoComposition = videoComposition.wrappedValue
    
    return item
}//mergeVideo

func addTrackToComposition(asset: AVAsset, mixComposition: Binding<AVMutableComposition>, videoComposition: Binding<AVMutableVideoComposition>, mainInstruction: Binding<AVMutableVideoCompositionInstruction>) {
    
    let currentTime = mixComposition.wrappedValue.duration
            
    guard let assetVideoTrack = mixComposition.wrappedValue.addMutableTrack(withMediaType: .video, preferredTrackID: Int32(kCMPersistentTrackID_Invalid)) else {return}
    
    guard let assetAudioTrack = mixComposition.wrappedValue.addMutableTrack(withMediaType: .audio, preferredTrackID: Int32(kCMPersistentTrackID_Invalid)) else {return}
    
    do {
        let timeRange = CMTimeRangeMake(start: .zero, duration: asset.duration)
        // Insert video to Mutable Composition at right time.
        try assetVideoTrack.insertTimeRange(timeRange, of: asset.tracks(withMediaType: .video)[0], at: currentTime)
        try assetAudioTrack.insertTimeRange(timeRange, of: asset.tracks(withMediaType: .audio)[0], at: currentTime)
        
        let videoInstruction = videoCompositionInstruction(track: assetVideoTrack, asset: asset, currentTime: currentTime)
        
        mainInstruction.wrappedValue.layerInstructions.append(videoInstruction)
        mainInstruction.wrappedValue.timeRange = CMTimeRange(start: .zero, duration: mixComposition.wrappedValue.duration)
    } catch let error {
        print(error.localizedDescription)
    }

}//addTrackToComposition

func videoCompositionInstruction(track: AVCompositionTrack, asset: AVAsset, currentTime: CMTime) -> AVMutableVideoCompositionLayerInstruction {
    let instruction = AVMutableVideoCompositionLayerInstruction(assetTrack: track)
    guard let assetTrack = asset.tracks(withMediaType: .video).first else { return instruction }

    let transform = assetTrack.preferredTransform
    let transformSize = assetTrack.naturalSize.applying(transform) //for export session
    let newAssetSize = CGSize(width: abs(transformSize.width), height: abs(transformSize.height)) // for export session
    
    if newAssetSize.width < newAssetSize.height { //portrait

        let scaleRatio = renderSize.height / newAssetSize.height
        let scale = CGAffineTransform(scaleX: scaleRatio, y: scaleRatio)

        let tx = renderSize.width / 2 - newAssetSize.width * scaleRatio / 2
        let ty = renderSize.height / 2 - newAssetSize.height * scaleRatio / 2
        let translate = CGAffineTransform(translationX: tx, y: ty)

        let concatenation = transform.concatenating(scale).concatenating(translate)

        instruction.setTransform(concatenation, at: currentTime)

    } else if newAssetSize.width > newAssetSize.height { //landscape

        let scaleRatio = renderSize.width / newAssetSize.width
        let scale = CGAffineTransform(scaleX: scaleRatio, y: scaleRatio)

        let tx = renderSize.width / 2 - newAssetSize.width * scaleRatio / 2
        let ty = renderSize.height / 2 - newAssetSize.height * scaleRatio / 2
        let translate = CGAffineTransform(translationX: tx, y: ty)

        let concatenation = transform.concatenating(scale).concatenating(translate)

        instruction.setTransform(concatenation, at: currentTime)

    } else if newAssetSize.width == newAssetSize.height {
        //if landscape, fill height first, if portrait fill width first, if square doesnt matter just scale either width or height
        if renderSize.width > renderSize.height { //landscape
            let scaleRatio = renderSize.height / newAssetSize.height
            let scale = CGAffineTransform(scaleX: scaleRatio, y: scaleRatio)

            let tx = renderSize.width / 2 - newAssetSize.width * scaleRatio / 2
            let ty = renderSize.height / 2 - newAssetSize.height * scaleRatio / 2
            let translate = CGAffineTransform(translationX: tx, y: ty)

            let concatenation = transform.concatenating(scale).concatenating(translate)

            instruction.setTransform(concatenation, at: currentTime)
        } else { //portrait and square
            let scaleRatio = renderSize.width / newAssetSize.width
            let scale = CGAffineTransform(scaleX: scaleRatio, y: scaleRatio)

            let tx = renderSize.width / 2 - newAssetSize.width * scaleRatio / 2
            let ty = renderSize.height / 2 - newAssetSize.height * scaleRatio / 2
            let translate = CGAffineTransform(translationX: tx, y: ty)

            let concatenation = transform.concatenating(scale).concatenating(translate)

            instruction.setTransform(concatenation, at: currentTime)
        }
    }
    
    let endTime = CMTimeAdd(currentTime, asset.duration)
    instruction.setOpacity(0, at: endTime)
    
    return instruction
}//videoCompositionInstruction

I'll briefly explain what I am doing here.

You don't need to pass in Bindings for the AVMutableComposition, AVMutableVideoComposition or the AVMutableVideoCompositionInstruction. I only do that for certain functionality in my app. You can instantiate all of these within the function before you do anything else.

I have an array in the class that holds all my assets, so that's what self.assets is. "videos" references a Realm model I use to store the last path component of the videos the user picked from their photo library. You probably don't need to remove all existing videos, tracks, and instructions since you are not passing in references to the compositions and instructions; I do so because I make changes to these objects throughout my app. There is no need for you to use any wrappedValues either, since those are just for the Bindings.

Once I populate my assets array, I iterate through it, calling addTrackToComposition and passing in each asset. This function adds an audio AND a video track to the mixComposition for each asset. Then, inside a do-catch block, it tries to insert the asset's audio and video tracks into the empty mutable tracks just created on the mixComposition. So the mixComposition is going to have two tracks for every asset (one audio and one video). I do this so that I have more control over my instructions and can apply different transforms to each asset rather than to the entire mixComposition as a whole. You can also create the empty mutable tracks for the mixComposition outside of the for loop and insert every asset's track into that one track (actually two tracks, audio and video), as sketched below. I know this sounds confusing, but try to break it down slowly. It is important to note that in my do-catch block the timeRange I pass is the asset's time range, but I insert it at the end of the mixComposition (currentTime = mixComposition.duration). This is why timeRange starts at kCMTimeZero (.zero), but I pass currentTime for the at: parameter.
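
If you go with the single pair of tracks instead, the C# equivalent would be roughly this (untested sketch; the names are taken from your own code):

    // Sketch: one video track and one audio track for the whole composition,
    // with every asset appended at the composition's current duration.
    AVMutableCompositionTrack videoTrack = mixComposition.AddMutableTrack(AVMediaType.Video, 0);
    AVMutableCompositionTrack audioTrack = mixComposition.AddMutableTrack(AVMediaType.Audio, 0);

    foreach (string clip in Clips)
    {
        AVAsset asset = AVAsset.FromUrl(NSUrl.FromFilename(clip));
        CMTime insertAt = mixComposition.Duration; // end of everything added so far

        CMTimeRange range = new CMTimeRange { Start = CMTime.Zero, Duration = asset.Duration };

        videoTrack.InsertTimeRange(range, asset.TracksWithMediaType(AVMediaType.Video)[0], insertAt, out NSError videoError);
        audioTrack.InsertTimeRange(range, asset.TracksWithMediaType(AVMediaType.Audio)[0], insertAt, out NSError audioError);
    }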

Then I use a function that creates the layer instruction for each asset. This scales and positions each asset so that it is displayed properly in my custom video player. It also sets the opacity to 0 at the end of the asset. Here my renderSize is declared in my Realm model and is a CGSize(width: 1280, height: 720). Now, I am not sure whether these transforms will work for your use case, but I know you will definitely need transforms, otherwise your assets will export with the wrong orientation and/or size/position. At the very least you need to apply the asset track's preferredTransform. It is important to use the AVAssetTrack's preferredTransform and not the AVCompositionTrack's preferredTransform. This handles orientation for you, but it will not handle scale and position. Play around with it; this code took me one to two weeks to get working for my use case.

Then I append the layer instruction to mainInstruction and set mainInstruction's timeRange to equal mixComposition's time range. I set the timeRange on every iteration of the for loop, but I could just as well do it once, after all instructions and tracks are added, rather than on every iteration.

Lastly, I set videoComposition's instructions to be an array containing just mainInstruction, and I set the frame rate and renderSize. Hopefully all of this works for you when you pass it into an export session.

Looking at the way you tried to implement it, I would say you don't need an array for the layer instructions. Just create one AVMutableVideoCompositionInstruction object (mainInstruction) and append the layer instructions to that.

Also, there is a problem with using the previous asset's duration. You need to pass in mixComposition's duration when you insert the new asset's time range. What you are doing is inserting at just the previous asset's duration, so you end up with a bunch of overlapping assets. You want to insert after the combined duration of all previous assets, which is mixComposition's current duration.

Also, mainInstruction should not be a List. It should just be a single AVMutableVideoCompositionInstruction. AVMutableVideoCompositionInstruction has a layerInstructions property that is an array of layer instructions, and you can add directly to that. There should not be more than one mainInstruction; there should only be multiple layerInstructions.
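
In the Xamarin bindings, LayerInstructions appears to be an array property rather than something you append to, so the C# version of "one mainInstruction, many layer instructions" would look something like this (sketch, not tested):

    // Collect one layer instruction per clip, then hand them all to a single mainInstruction.
    var layerInstructions = new List<AVVideoCompositionLayerInstruction>();

    foreach (string clip in Clips)
    {
        // ... insert the clip's tracks into mixComposition as above ...
        // layerInstructions.Add(SetInstruction(asset, mixComposition.Duration, videoTrack));
    }

    var mainInstruction = new AVMutableVideoCompositionInstruction
    {
        TimeRange = new CMTimeRange { Start = CMTime.Zero, Duration = mixComposition.Duration },
        LayerInstructions = layerInstructions.ToArray()
    };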

Be patient with this. It took me a very long time to figure this out myself, coming from no AVFoundation experience. I honestly still don't know enough to be sure of what's wrong with your current code; all I know is that this works for me. Hopefully it works for you too. I've probably changed this function 20 times since I started this app a couple of months ago.

UPDATE: So you are on the right path, but there are still a few problems that may be the cause of your issue.

1.) I forgot to mention this last time, but when I was faced with the same problem, multiple people told me that I HAD to handle the audio track separately. Apparently even the video won't work without doing this. I never actually tested to see if this was true, but it's worth a shot. You can refer to my code again to see how I handled the audio track. It is essentially the same thing as the video track but you don't apply any instructions to it.

2.) In your instructions function there are a few problems. Your instruction property is correct, but your transform, transformSize, and newAssetSize are not. Currently you set transform to assetTrack.preferredTransform, but that is actually the mixComposition's transform; what you want to use is the original AVAsset's preferredTransform.

After you initialize your instruction using assetTrack (the mixComposition's track), you need to declare a new property to get the AVAsset's track. Refer back to my code; I actually use the name "assetTrack" for this, so don't be confused by our variable names. Your "assetTrack" is my "track", which I passed in as a parameter. My "assetTrack" is what you need to add, but obviously you can use whatever name you want.
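
In your C# that split looks something like this (variable names here are just illustrative):

    // The layer instruction is built from the composition's track...
    var instruction = AVMutableVideoCompositionLayerInstruction.FromAssetTrack(compositionVideoTrack);

    // ...but the transform has to come from the original asset's own video track.
    AVAssetTrack sourceVideoTrack = asset.TracksWithMediaType(AVMediaType.Video)[0];
    CGAffineTransform transform = sourceVideoTrack.PreferredTransform;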

Videos are a little strange when recorded on our devices. A video recorded in portrait orientation is actually stored landscape. Each asset, however, comes with data that tells the device how it should be displayed (i.e. rotate the video so it displays the same way it was recorded). That is what preferredTransform is: it transforms the asset so it displays in the correct orientation. This is why you need to make sure you are using each individual asset's preferredTransform and not the mixComposition's preferredTransform that you used in your code. The mixComposition's preferredTransform will just be an identity matrix, which effectively does nothing at all. This is also why your asset's natural size looks "inverted". It is not inverted; that is just the way Apple stores all videos and pictures. The metadata in preferredTransform handles the correct orientation, and applying it gives you the "correct" width and height.

So now that you have the correct transform stored in your transform property, your transformSize property needs to reflect it; however, you forgot to apply the transform to the size ("applying(transform)" in my code). This is important. The transformSize you currently have is just the naturalSize, whereas you want the size after applying the transform to the asset, so that width and height actually reflect the correct orientation of the video.

newAssetSize is then meant to handle any negative values that result from applying the transform. When you create newAssetSize, make sure you use the absolute value of transformSize.width and transformSize.height; that is why I have "abs(transformSize.width)" in my code. This is also crucial.
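
In C# those two lines would be roughly the following (TransformSize is the call your own edit already uses, so I'm assuming it is the right binding):

    // Size of the video AFTER the orientation fix; take absolute values because the transform can flip signs.
    CGSize transformSize = transform.TransformSize(sourceVideoTrack.NaturalSize);
    CGSize newAssetSize = new CGSize(Math.Abs(transformSize.Width), Math.Abs(transformSize.Height));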

3.) You never applied the preferredTransform to the video; instead you apply a scale transform to an all-zero matrix, which is never going to work. At the very least you need to concatenate the scale and translate onto the identity matrix, although you should really be concatenating them onto transform instead. If you don't change this part your video will never display, no matter what you do. Any transform you concatenate onto a zero matrix has no effect; you still end up with a zero matrix, which means your video will not display at all.
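
For reference, that concatenation order (transform -> scale -> translate) could be written in C# with the static CGAffineTransform helpers; I don't write C#, so double-check the names, but the idea is:

    // Start from the asset track's preferredTransform (orientation), then scale, then translate.
    CGAffineTransform preferred = sourceVideoTrack.PreferredTransform;
    CGAffineTransform scale = CGAffineTransform.MakeScale(scaleRatio, scaleRatio);
    CGAffineTransform translate = CGAffineTransform.MakeTranslation(tx, ty);

    // Same order as transform.concatenating(scale).concatenating(translate) in my Swift code.
    CGAffineTransform combined = CGAffineTransform.Multiply(CGAffineTransform.Multiply(preferred, scale), translate);

    instruction.SetTransform(combined, currentTime);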

Try to make these changes, especially the changes in the instruction function. I believe you will also have to redo your transform logic after you change those properties as it looks like you tried to compensate for the fact that width and height were inverted.

Your code should look something like this (keep in mind that I am not familiar with C# at all):

  public void mergeclips()
        {
            AVMutableComposition mixComposition = new AVMutableComposition();
            AVMutableVideoCompositionLayerInstruction[] Instruction_Array = new AVMutableVideoCompositionLayerInstruction[Clips.Count];

            foreach (string clip in Clips)
            {
                #region HoldVideoTrack
                AVAsset asset = AVAsset.FromUrl(NSUrl.FromFilename(clip));

                AVMutableCompositionTrack videoTrack = mixComposition.AddMutableTrack(AVMediaType.Video, 0);
                AVMutableCompositionTrack audioTrack = mixComposition.AddMutableTrack(AVMediaType.Audio, 0);

                CMTimeRange range = new CMTimeRange()
                {
                    Start = new CMTime(0, 0),
                    Duration = asset.Duration
                };

                AVAssetTrack assetVideoTrack = asset.TracksWithMediaType(AVMediaType.Video)[0];
                videoTrack.InsertTimeRange(range, assetVideoTrack, mixComposition.Duration, out NSError error);
                
                AVAssetTrack assetAudioTrack = asset.TracksWithMediaType(AVMediaType.Audio)[0];
                audioTrack.InsertTimeRange(range, assetAudioTrack, mixComposition.Duration, out NSError audioError);
                #endregion

                #region Instructions
                Instruction_Array[Clips.IndexOf(clip)] = SetInstruction(asset, mixComposition.Duration, videoTrack);
                #endregion
            }

            // 6
            var mainInstruction = new AVMutableVideoCompositionInstruction();

            CMTimeRange rangeIns = new CMTimeRange()
            {
                Start = new CMTime(0, 0),
                Duration = mixComposition.Duration
            };

            mainInstruction.BackgroundColor = UIColor.FromRGBA(1f, 1f, 1f, 1.000f).CGColor;
            mainInstruction.TimeRange = rangeIns;
            mainInstruction.LayerInstructions = Instruction_Array;

            var mainComposition = new AVMutableVideoComposition()
            {
                Instructions = new AVVideoCompositionInstruction[1] { mainInstruction },
                FrameDuration = new CMTime(1, 30),
                RenderSize = new CoreGraphics.CGSize(UIScreen.MainScreen.Bounds.Width, UIScreen.MainScreen.Bounds.Height)
            };

            //... export video ...

            AVAssetExportSession exportSession = new AVAssetExportSession(mixComposition, AVAssetExportSessionPreset.MediumQuality)
            {
                OutputUrl = NSUrl.FromFilename(Path.Combine(Path.GetTempPath(), "temporaryClip/Whole.mov")),
                OutputFileType = AVFileType.QuickTimeMovie,
                ShouldOptimizeForNetworkUse = true,
                VideoComposition = mainComposition
            };
            exportSession.ExportAsynchronously(_OnExportDone);
        }

        private AVMutableVideoCompositionLayerInstruction SetInstruction(AVAsset asset, CMTime currentTime, AVMutableCompositionTrack assetTrack)
        {
            var instruction = AVMutableVideoCompositionLayerInstruction.FromAssetTrack(assetTrack);

    //Get the individual AVAsset's track to use for transform
            AVAssetTrack avAssetTrack = asset.TracksWithMediaType(AVMediaType.Video)[0];

    //Set transform to the preferredTransform of the AVAssetTrack, not the AVMutableCompositionTrack
            var transform = avAssetTrack.PreferredTransform;

    //Set the transformSize to be the asset natural size AFTER applying preferredTransform.

            var transformSize = transform.TransformSize(avAssetTrack.NaturalSize);

    //Handle any negative values resulted from applying transform by using the absolute value
            var newAssetSize = new CoreGraphics.CGSize(Math.Abs(transformSize.Width), Math.Abs(transformSize.Height)); // for export session
            //change back to less than
            if (newAssetSize.Width < newAssetSize.Height)//portrait
            {
                //newAssetSize should no longer be inverted since preferredTransform handles this. Remember that the asset
                //was never actually transformed yet; newAssetSize just represents the size the video is going to be after
                //you call instruction.SetTransform(transform). Since transform is the first transform in the concatenation,
                //this is the size that the scale and translate transforms will be using, which is why we needed to reference
                //newAssetSize after applying transform. Also, you should concatenate in this order: transform -> scale ->
                //translate, otherwise you won't get the desired results.

                //Change back to Height. Keep in mind that this scaleRatio will fill the height of the screen first and the
                //width will probably exceed the screen bounds. I had it set like this because I was displaying my video in a
                //view that is much smaller than the screen size. If you want to display the video centered on the phone
                //screen, try scaleRatio = UIScreen.MainScreen.Bounds.Width / newAssetSize.Width. That will scale the video to
                //fit the width of the screen exactly, and the height will follow from the video's aspect ratio.
                var scaleRatio = UIScreen.MainScreen.Bounds.Height / newAssetSize.Height;
                
                //Apply scale to transform. Transform is never actually applied unless you do this.
                transform.Scale(scaleRatio, scaleRatio);
                var tx = UIScreen.MainScreen.Bounds.Width / 2 - newAssetSize.Width * scaleRatio / 2;
                var ty = UIScreen.MainScreen.Bounds.Height / 2 - newAssetSize.Height * scaleRatio / 2;
                transform.Translate(tx, ty);

                instruction.SetTransform(transform, currentTime);

            }

            var endTime = CMTime.Add(currentTime, asset.Duration);
            instruction.SetOpacity(0, endTime);


            return instruction;
        }
 

CodePudding user response:

Ok, so thanks to Shawn's help I have accomplished what I was trying to do. There were two main mistakes in my code that caused this problem. The first was how the Start of the CMTimeRange used to insert each asset into the video track was set: Start = new CMTime(0, 0) instead of Start = CMTime.Zero. I still don't know exactly what difference it makes (presumably new CMTime(0, 0) has a timescale of 0, which does not describe a well-defined time, while CMTime.Zero is zero seconds with a valid timescale), but it prevented the code from displaying the video and audio of each asset, leaving a video with the combined length of all the clips and only the background of the AVMutableVideoCompositionInstruction. The second mistake was how I set the instructions; the configuration that worked for me can be found in the following code.

Here is the final function, working correctly:

public void MergeClips()
        {
            //microphone
            AVCaptureDevice microphone = AVCaptureDevice.DefaultDeviceWithMediaType(AVMediaType.Audio);

            AVMutableComposition mixComposition = AVMutableComposition.Create();
            AVVideoCompositionLayerInstruction[] Instruction_Array = new AVVideoCompositionLayerInstruction[Clips.Count];

            foreach (string clip in Clips)
            {
                var asset = AVUrlAsset.FromUrl(new NSUrl(clip, false)) as AVUrlAsset;
                #region HoldVideoTrack

                //This range applies to the video, not to the mixComposition
                CMTimeRange range = new CMTimeRange()
                {
                    Start = CMTime.Zero,
                    Duration = asset.Duration
                };

                var duration = mixComposition.Duration;
                NSError error;

                AVMutableCompositionTrack videoTrack = mixComposition.AddMutableTrack(AVMediaType.Video, 0);
                AVAssetTrack assetVideoTrack = asset.TracksWithMediaType(AVMediaType.Video)[0];
                videoTrack.InsertTimeRange(range, assetVideoTrack, duration, out error);
                videoTrack.PreferredTransform = assetVideoTrack.PreferredTransform;

                if (microphone != null)
                {
                    AVMutableCompositionTrack audioTrack = mixComposition.AddMutableTrack(AVMediaType.Audio, 0);
                    AVAssetTrack assetAudioTrack = asset.TracksWithMediaType(AVMediaType.Audio)[0];
                    audioTrack.InsertTimeRange(range, assetAudioTrack, duration, out error);
                }
                #endregion

                #region Instructions
                int counter = Clips.IndexOf(clip);
                Instruction_Array[counter] = SetInstruction(asset, mixComposition.Duration, videoTrack);
                #endregion
            }

            // 6
            AVMutableVideoCompositionInstruction mainInstruction = AVMutableVideoCompositionInstruction.Create() as AVMutableVideoCompositionInstruction;

            CMTimeRange rangeIns = new CMTimeRange()
            {
                Start = new CMTime(0, 0),
                Duration = mixComposition.Duration
            };
            mainInstruction.TimeRange = rangeIns;
            mainInstruction.LayerInstructions = Instruction_Array;
            
            var mainComposition = AVMutableVideoComposition.Create();
            mainComposition.Instructions = new AVVideoCompositionInstruction[1] { mainInstruction };
            mainComposition.FrameDuration = new CMTime(1, 30);
            mainComposition.RenderSize = new CGSize(mixComposition.NaturalSize.Height, mixComposition.NaturalSize.Width);
            
            finalVideo_path = NSUrl.FromFilename(Path.Combine(Path.GetTempPath(), "Whole2.mov"));
            if (File.Exists(Path.GetTempPath() + "Whole2.mov"))
            {
                File.Delete(Path.GetTempPath() + "Whole2.mov");
            }

            //... export video ...
            AVAssetExportSession exportSession = new AVAssetExportSession(mixComposition, AVAssetExportSessionPreset.HighestQuality)
            {
                OutputUrl = NSUrl.FromFilename(Path.Combine(Path.GetTempPath(), "Whole2.mov")),
                OutputFileType = AVFileType.QuickTimeMovie,
                ShouldOptimizeForNetworkUse = true,
                VideoComposition = mainComposition
            };
            exportSession.ExportAsynchronously(_OnExportDone);
        }

        private AVMutableVideoCompositionLayerInstruction SetInstruction(AVAsset asset, CMTime currentTime, AVAssetTrack mixComposition_video_Track)
        {
            var instruction = AVMutableVideoCompositionLayerInstruction.FromAssetTrack(mixComposition_video_Track);

            var startTime = CMTime.Subtract(currentTime, asset.Duration);

            //NaturalSize.Height is passed as a width parameter because iOS stores the video recording horizontally
            CGAffineTransform translateToCenter = CGAffineTransform.MakeTranslation(mixComposition_video_Track.NaturalSize.Height, 0);
            //Angle in radians, not in degrees
            CGAffineTransform rotate = CGAffineTransform.Rotate(translateToCenter, (nfloat)(Math.PI / 2));

            instruction.SetTransform(rotate, startTime);

            instruction.SetOpacity(1, startTime);
            instruction.SetOpacity(0, currentTime);

            return instruction;
        }

As I said, I solved my problem thanks to Shawn's help, and most of this code was translated to C# from his answers, so if you were planning to upvote this answer, please upvote Shawn's instead, or both.
