I'm assuming all I need to do is encode 2^64 as base64 to get a 11 character Youtube identifier. I created a Go program https://play.golang.org/p/2nuA3JxVMd0
package main
import (
"crypto/rand"
"encoding/base64"
"encoding/binary"
"fmt"
"math"
"math/big"
"strings"
)
func main() {
// For example Youtube uses 11 characters of base64.
// How many base64 characters would it require to express a 2^64 number? 2^6^x = 2^64 .. so x = 64/6 = 10.666666666 … i.e. eleven rounded up.
// Generate a 64 bit number
val, _ := randint64()
fmt.Println(val)
// Encode the 64 bit number
b := make([]byte, 8)
binary.LittleEndian.PutUint64(b, uint64(val))
encoded := base64.StdEncoding.EncodeToString([]byte(b))
fmt.Println(encoded, len(encoded))
// https://youtu.be/gocwRvLhDf8?t=75
ytid := strings.ReplaceAll(encoded, " ", "-")
ytid = strings.ReplaceAll(ytid, "/", "_")
fmt.Println("Youtube ID from 64 bit number:", ytid)
}
func randint64() (int64, error) {
val, err := rand.Int(rand.Reader, big.NewInt(int64(math.MaxInt64)))
if err != nil {
return 0, err
}
return val.Int64(), nil
}
But it has two issues:
- The identifier is 12 characters instead of the expected 11
- The encoded base64 suffix is "=" which means that it didn't have enough to encode?
So where am I going wrong?
CodePudding user response:
tl;dr
An 8-byte int64
(no matter what value) will always encode to 11
base64 bytes followed by a single padded byte =
, so you can reliably do this to get your 11
character YouTubeID
:
var replacer = strings.NewReplacer(
" ", "-",
"/", "_",
)
ytid := replacer.Replace(encoded[:11])
or (H/T @Crowman & @Peter) one can encode without padding & without replacing
and /
with base64.RawURLEncoding:
//encoded := base64.StdEncoding.EncodeToString(b) // may include or /
ytid := base64.RawURLEncoding.EncodeToString(b) // produces URL-friendly - and _
https://play.golang.org/p/AjlvtfR7RWD
One byte
(i.e. 8-bits) of Base64 output conveys 6-bits of input. So the formula to determine the number of output bytes given a certain inputs is:
out = in * 8 / 6
or
out = in * 4 / 3
With a devisor of 3
this will lead to partial use of output bytes in some cases. If the input bytes length is:
- divisible by 3 - the final byte lands on a byte boundary
- not divisible by 3 - the final byte is not on a byte-boundary and requires padding
In the case of 8 bytes of input:
out = 8 * 4 / 3 = 10 2/3
will utilize 10
fully utilized output base64 bytes - and one partial byte (for the 2/3
) - so 11
base64 bytes plus padding to indicate how many wasted bits.
Padding is indicated via the =
character and the number of =
indicates the number of "wasted" bits:
waste padding
===== =======
0
1/3 =
2/3 ==
Since the output produces 10 2/3
used bytes - then 1/3
bytes were "wasted" so the padding is a single =
So base64 encoding 8
input bytes will always produce 11
base64 bytes followed by a single =
padding character to produce 12
bytes in total.
CodePudding user response:
=
in base64 is padding, but in 64-bit numbers, this padding is extra and does not require 12 characters, but why?
see Encoding.Encode
function source:
func (enc *Encoding) Encode(dst, src []byte) {
if len(src) == 0 {
return
}
// enc is a pointer receiver, so the use of enc.encode within the hot
// loop below means a nil check at every operation. Lift that nil check
// outside of the loop to speed up the encoder.
_ = enc.encode
di, si := 0, 0
n := (len(src) / 3) * 3
//https://golang.org/src/encoding/base64/base64.go
in this (len(src) / 3) * 3
part , used 3
instead of 6
so output of this function always is string with even length, if your input is always 64-bit, you can delete =
after encoding and add it again for decoding.
for i := 8; i <= 18; i {
b := make([]byte, i)
binary.LittleEndian.PutUint64(b, uint64(0))
encoded := base64.StdEncoding.EncodeToString(b)
fmt.Println(encoded)
}
AAAAAAAAAAA=
AAAAAAAAAAAA
AAAAAAAAAAAAAA==
AAAAAAAAAAAAAAA=
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAA==
AAAAAAAAAAAAAAAAAAA=
AAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAA==
AAAAAAAAAAAAAAAAAAAAAAA=
AAAAAAAAAAAAAAAAAAAAAAAA
What do I mean by 6 (or 3)?
base64 use 64 character, each character map to one value (from 000000 to 111111)
example:
a 64bit value (uint64):
11154013587666973726
binary representation:
1001101011001011000001000100001011110000110001010011010000011110
split each six digit:
001001,101011,001011,000001,000100,001011,110000,110001,010011,010000,011110
J, r, L, B, E, L, w, x, T, Q, e