Python 2 and 3 compatible pickle save and load


Python's pickle module is convenient, but Python 2 and Python 3 handle strings and Unicode differently, so a pickle written by one version often fails to load in the other.

Compatibility Issue

Save a pickle with Python 2:

import pickle

with open('test.pickle', 'wb') as f:
    pickle.dump(my_object, f)

and then load it in Python 3:

import pickle

with open('test.pickle', 'rb') as f:
    pickle.load(f)

This may fail with a UnicodeDecodeError:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)

Solution

Use the encoding argument of pickle.load, see:

https://docs.python.org/3/library/pickle.html#pickle.Unpickler

import pickle


def load_pickle(pickle_file):
    try:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f)
    except UnicodeDecodeError:
        # Python 2 byte strings: retry, mapping every byte to one code point
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f, encoding='latin1')
    except Exception as e:
        print('Unable to load data ', pickle_file, ':', e)
        raise
    return pickle_data
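
To see why the fallback helps, here is a minimal sketch that feeds Python 3 a hand-assembled byte string mimicking what Python 2 writes for the one-byte str '\xfe' at protocol 2. The payload and the variable names are assumptions made for illustration; they do not come from a real file.

```python
import pickle

# Hand-assembled protocol 2 pickle of the Python 2 byte string '\xfe':
# PROTO 2, SHORT_BINSTRING (length 1), BINPUT 0, STOP.
py2_payload = b'\x80\x02U\x01\xfeq\x00.'

try:
    pickle.loads(py2_payload)  # default encoding is ASCII
    default_error = None
except UnicodeDecodeError as e:
    default_error = e

# latin1 maps every byte to exactly one code point, so decoding cannot fail.
as_text = pickle.loads(py2_payload, encoding='latin1')
# 'bytes' leaves Python 2 byte strings as bytes objects.
as_bytes = pickle.loads(py2_payload, encoding='bytes')

print(repr(default_error), repr(as_text), repr(as_bytes))
```

Text that was really UTF-8 will come out garbled under latin1 and would need re-decoding; encoding='bytes' is the alternative when the byte strings hold binary data.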

A Humble Introduction to the Basics of Unicode


Before Unicode: Single-Byte Encoding

ASCII

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
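  • 7-bit encoding
  • 33 control characters (0x00-0x1F and 0x7F)
  • 95 printable characters
  • Nearly universal, but frequently replaced or extended by variants
  • Well suited to English, but not to other languages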

http://en.wikipedia.org/wiki/ASCII

Extended ASCII

  • IBM introduced the concept of code pages
  • CP437 (Latin US)
  • CP737 (Greek)
  • Apple: Mac OS Roman
  • DEC: Multinational Character Set
  • ISO/IEC 8859
  • … there are countless extended ASCII variants

http://en.wikipedia.org/wiki/Extended_ASCII

Code Page 437

  • The IBM PC (MS-DOS) encoding CP437:
  • Spanish, French, Portuguese, German
  • Dutch currency florin (ƒ), Spanish currency peseta (₧)

CP437

ISO/IEC 8859

ISO-8859

http://en.wikipedia.org/wiki/ISO/IEC_8859

ISO/IEC 8859-1 (Latin-1)

¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

ISO/IEC 8859-5 (Latin/Cyrillic)

ЁЂЃЄЅІЇЈЉЊЋЎЏАБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмноп
рстуфхцчшщъыьэюя№ёђѓєѕіїјљњћќ§ўџ

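Single-Byte Encoding, Good or Bad?

Pros

  • Space-efficient
  • Fast to process
  • Simple to process: plain iteration works, and random access is possible

Cons

Suppose the string Алло\0 is stored and later read back with a different encoding: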

ISO/IEC 8859-5 (Latin/Cyrillic)

А    л    л    о    \0
↓    ↓    ↓    ↓    ↓
B0   DB   DB   DE   00

ISO/IEC 8859-5 (Latin/Cyrillic)

B0   DB   DB   DE   00
↓    ↓    ↓    ↓    ↓
А    л    л    о    \0

ISO/IEC 8859-1 (Latin-1)

Decoding the same bytes as ISO/IEC 8859-1 (Latin-1) gives:

B0   DB   DB   DE   00
↓    ↓    ↓    ↓    ↓
°    Û    Û    Þ    \0

Gibberish!!!!
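
The round trip above can be reproduced with Python's built-in codecs. This is a small sketch; the variable names are illustrative only.

```python
original = "Алло\0"

# Store the text as ISO/IEC 8859-5 bytes.
stored = original.encode("iso8859_5")

# Reading with the same encoding restores the text...
same_encoding = stored.decode("iso8859_5")
# ...while reading the same bytes as Latin-1 silently yields other characters.
wrong_encoding = stored.decode("latin-1")

print(stored.hex(" "), repr(same_encoding), repr(wrong_encoding))
```

No error is raised in the Latin-1 case, because every byte value is valid in a single-byte encoding; the damage only shows up when someone reads the text.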

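Before Unicode: Variable-Length Encodings

Code Page 936 (ISO/IEC 2022)

  • Some characters use a single-byte encoding, the others a double-byte encoding
  • The ASCII range (0x00-0x7F) keeps its one-byte ASCII encoding; the extended part uses two bytes

The ASCII range:

CP936 Standard Part

Code table for lead byte 0x81, continuing from the table above: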

CP936 Extended Part

http://en.wikipedia.org/wiki/Code_page_936

http://msdn.microsoft.com/en-US/goglobal/cc305153

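Variable-Length Encoding, Good or Bad?

Pros

  • Space-efficient
  • Supports more characters
  • Fast to process

Cons

Overlapping encodings

Overlap means the same byte value is decoded differently depending on its context. For example: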

65 -> e
84 65 -> 别
84 84 -> 剟
84 65    65  65    84 84    84 65
↓        ↓   ↓     ↓        ↓
别       e   e     剟       别

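No random access

Suppose you want random access, for example with a pointer at the following position:

… 65 65 65 …… 65
     ↑
     pointer p

You cannot tell what the pointer is actually pointing at: the byte 65 may be the letter e, or the second byte of a two-byte character.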
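
The overlap can be observed with Python's gbk codec, used here as a stand-in for Code Page 936 (an assumption of this sketch). Decoding the same byte string from two different offsets gives two different readings of the byte 65.

```python
# The byte sequence from the example above.
data = bytes.fromhex("84 65 65 65 84 84 84 65")

# Decoding from the start: 84 65 | 65 | 65 | 84 84 | 84 65
decoded = data.decode("gbk")

# Decoding from offset 1, as a pointer dropped into the middle would:
# 65 | 65 | 65 | 84 84 | 84 65
shifted = data[1:].decode("gbk")

# In the first reading, data[1] is the trail byte of a two-byte character;
# in the second reading, the very same byte is the letter "e".
print(len(decoded), len(shifted), shifted[0])
```

Both decodes succeed without any error, which is exactly why a misaligned read goes unnoticed.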

Unicode 1.0

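Unicode Design Credo

One encoding to rule them all

Unicode 1.0 in Practice (i.e. UTF-16)

  • Every character is encoded in 16 bits
  • At most 65,536 characters in theory (fewer in practice, of course)
  • Fully compatible with Latin-1 (ISO/IEC 8859-1)
  • Uses the UCS-2 encoding, the predecessor of UTF-16

Fully compatible with Latin-1: (H->48, e->65, ...)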

H      e      l      l      o      !      \0
↓      ↓      ↓      ↓      ↓      ↓      ↓
U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000

UTF-16 example:

A      α      Ж      あ     羽     ❤ 
↓      ↓      ↓      ↓      ↓      ↓
U+0041 U+03B1 U+0416 U+3042 U+7FBD U+2764

Unicode 1.0, Good or Bad?

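Pros

  • See the "Unicode 1.0 in Practice" section above
  • Strings support index access with no extra overhead

Cons

  • No random access
  • Overlapping encodings
  • Endianness (big-endian vs. little-endian)

The Endianness Problem

Stored as big-endian: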

00 48  00 65  00 6C  00 6C  00 6F  00 21  00 00
↓      ↓      ↓      ↓      ↓      ↓      ↓
U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
↓      ↓      ↓      ↓      ↓      ↓      ↓
H      e      l      l      o      !      \0

Read back as little-endian:

48 00  65 00  6C 00  6C 00  6F 00  21 00  00 00
↓      ↓      ↓      ↓      ↓      ↓      ↓
U+4800 U+6500 U+6C00 U+6C00 U+6F00 U+2100 U+0000
↓      ↓      ↓      ↓      ↓      ↓      ↓
䠀     攀      氀     氀     漀      ℀     \0

The BOM (Byte Order Mark), Introduced to Solve the Endianness Problem

Stored as big-endian:

FE FF  00 48  00 65  00 6C  00 6C  00 6F  00 21  00 00
↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓
U+FEFF U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
       ↓      ↓      ↓      ↓      ↓      ↓      ↓
(BOM)  H      e      l      l      o      !      \0

Stored and read back as little-endian:

FF FE  48 00  65 00  6C 00  6C 00  6F 00  21 00  00 00
↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓
U+FEFF U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
       ↓      ↓      ↓      ↓      ↓      ↓      ↓
(BOM)  H      e      l      l      o      !      \0
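
Both situations can be sketched with Python's codecs: first a big-endian byte string read with the wrong byte order, then the same text with a BOM in front, which lets the generic "utf-16" codec pick the right order on its own. Variable names are illustrative.

```python
import codecs

text = "Hello!"

big = text.encode("utf-16-be")        # 00 48 00 65 ...
little = text.encode("utf-16-le")     # 48 00 65 00 ...

# Without a BOM, reading big-endian bytes as little-endian swaps every pair.
garbled = big.decode("utf-16-le")

# With a BOM in front, the "utf-16" codec detects the byte order itself
# and drops the BOM from the result.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")

print([hex(ord(ch)) for ch in garbled], from_big, from_little)
```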

The BOM in turn splits Unicode text into two flavors:

  1. Unicode without BOM
  2. Unicode with BOM

Code Space Usage in Unicode 1.0

Unicode 1.0 Introduction:

"With over 30,000 unallocated character positions, the Unicode character encoding provides sufficient space for foreseeable future expansion."

Unicode 1.0 code space usage:

Unicode 1.0 Space Usage

Growth of Unicode Code Space Usage

Unicode Space Usage Increase Info

Modern Unicode

We need more capacity!

UTF-32

00 00 00 48  00 00 00 69  00 00 00 21
↓            ↓            ↓
U+0048       U+0069       U+0021
↓            ↓            ↓
H            i            !

The same with a BOM:

00 00 FE FF  00 00 00 48  00 00 00 69  00 00 00 21
↓            ↓            ↓            ↓
U+FEFF       U+0048       U+0069       U+0021
             ↓            ↓            ↓
(BOM)        H            i            !

FF FE 00 00  48 00 00 00  69 00 00 00  21 00 00 00
↓            ↓            ↓            ↓
U+FEFF       U+0048       U+0069       U+0021
             ↓            ↓            ↓
(BOM)        H            i            !

UTF-32 can also hold emoji that do not fit in a single 16-bit code unit:

00 00 FE FF  00 00 00 48  00 00 00 69  00 01 F4 30
↓            ↓            ↓            ↓
U+FEFF       U+0048       U+0069       U+1F430
             ↓            ↓            ↓
(BOM)        H            i            🐰
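
As a quick check of the byte layout above, this sketch encodes the same text with Python's utf-32 codecs. It also shows, for comparison, that modern UTF-16 needs two 16-bit units (a surrogate pair) for the emoji. Variable names are illustrative.

```python
import codecs

text = "Hi\U0001F430"  # H, i, rabbit face (U+1F430)

# Every code point takes exactly four bytes in UTF-32.
utf32_be = text.encode("utf-32-be")

# A BOM in front lets the generic "utf-32" codec detect the byte order.
round_trip = (codecs.BOM_UTF32_BE + utf32_be).decode("utf-32")

# The emoji lies outside the 16-bit range, so UTF-16 spends two units on it.
emoji_utf16 = "\U0001F430".encode("utf-16-be")

print(utf32_be.hex(" ", 4), round_trip == text, emoji_utf16.hex(" ", 2))
```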

UTF-32, Good or Bad?

Pros

  • More capacity

Cons

See the cons of Unicode 1.0

UTF-8

How UTF-8 encodes:

UTF-8

For example:

你 (U+4F60)

UTF-8:
  hex: 0xE4 0xBD 0xA0
  oct: 0344 0275 0240
  dec: 14990752
  bin: 11100100 10111101 10100000

∵ U+4F60 ∈ [U+0800, U+FFFF]
∴ "你" needs 3 bytes, of the form: 1110xxxx 10xxxxxx 10xxxxxx

The UTF-8 encoding of "你" is: 11100100 10111101 10100000

Form:         1110xxxx 10xxxxxx 10xxxxxx
Encoded:      11100100 10111101 10100000
Unicode bits: ----0100 --111101 --100000

Put together, the remaining bits are its Unicode code point: 0100 111101 100000, i.e. 0x4F60 (U+4F60)

Likewise:

🐰 (U+1F430)

Form:         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Encoded:      11110000 10011111 10010000 10110000
Unicode bits: -----000 --011111 --010000 --110000

hex("000 011111 010000 110000") ==> 0x1F430 (U+1F430)
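
The bit-packing walked through above can be written out by hand. The function below is a minimal sketch (the name utf8_encode is made up here, and it skips validation such as rejecting surrogates); it is compared against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack one code point into UTF-8 bytes, following the bit patterns above."""
    if code_point < 0x80:          # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


ni = utf8_encode(0x4F60)       # 你
rabbit = utf8_encode(0x1F430)  # rabbit face emoji
print(ni.hex(" "), rabbit.hex(" "))
```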

Encoded strings look like this:

48     65     6C     6C     6F     21     00
↓      ↓      ↓      ↓      ↓      ↓      ↓
U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
↓      ↓      ↓      ↓      ↓      ↓      ↓
H      e      l      l      o      !      \0

A      α      Ж      あ        羽        ❤         🐰
↓      ↓      ↓      ↓         ↓         ↓         ↓
U+0041 U+03B1 U+0416 U+3042    U+7FBD    U+2764    U+1F430
↓      ↓      ↓      ↓         ↓         ↓         ↓
41     CE B1  D0 96  E3 81 82  E7 BE BD  E2 9D A4  F0 9F 90 B0

UTF-8 BOM?

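Because UTF-8 is byte-oriented, it has no endianness problem and needs no BOM.

UTF-8, Good or Bad?

Pros

  • No BOM needed
  • The first 128 characters are identical to ASCII
  • Random access is possible at a small cost

Cons

  • Index access is slow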
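
The "small cost" of random access comes from the fact that continuation bytes always look like 10xxxxxx, so a reader dropped at an arbitrary byte can step back to the start of the character. A minimal sketch, assuming well-formed UTF-8 input; the helper names are made up for illustration.

```python
def char_start(data: bytes, pos: int) -> int:
    """Step back from pos to the lead byte of the character that contains it."""
    while pos > 0 and (data[pos] & 0b11000000) == 0b10000000:
        pos -= 1
    return pos


def char_length(lead: int) -> int:
    """Number of bytes in a character, read off its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    return 4


def char_at(data: bytes, pos: int) -> str:
    """Decode the single character that covers byte offset pos."""
    start = char_start(data, pos)
    return data[start:start + char_length(data[start])].decode("utf-8")


data = "A\u4f60\U0001F430".encode("utf-8")  # 41 | E4 BD A0 | F0 9F 90 B0
print(char_at(data, 0), char_at(data, 3), char_at(data, 7))
```

Finding the n-th character, by contrast, still requires scanning from the beginning, which is why index access is listed as a weakness.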

Composing Unicode Characters

Base character: A
Combined characters: ÀÁÂÃÄÅ
A        +  ̈        →  Ä
U+0041      U+0308
e       +   ̃        +   ̽        +   ̪        →  ẽ̪̽
U+0065      U+0303      U+033D      U+032A
e       +   ̽        +   ̃        +   ̪        →  e̪̽̃
U+0065      U+033D      U+0303      U+032A
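
Python's unicodedata module can show both effects: a base letter plus a combining mark can be normalized into one precomposed code point, and the order of combining marks matters when they stack on the same side of the letter. This is a small sketch with illustrative variable names.

```python
import unicodedata

# A + COMBINING DIAERESIS is two code points...
decomposed = "A\u0308"
# ...which NFC normalization folds into the single code point U+00C4.
composed = unicodedata.normalize("NFC", decomposed)

# The two "e" sequences from above differ only in the order of the two
# marks placed above the letter (U+0303 and U+033D).
first = "e\u0303\u033d\u032a"
second = "e\u033d\u0303\u032a"
same_after_normalizing = (
    unicodedata.normalize("NFD", first) == unicodedata.normalize("NFD", second)
)

print(len(decomposed), len(composed), same_after_normalizing)
```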

Unicode String Length?

String: 1 Ä 🍸

1      A      ¨      🍸
U+0031 U+0041 U+0308 U+1F378

String length: string().length()

| Encoding | 1        | A        | ¨        | 🍸          | Length |
|----------|----------|----------|----------|-------------|--------|
| UTF-8    | 31       | 41       | CC 88    | F0 9F 8D B8 | ?      |
| UTF-16   | 0031     | 0041     | 0308     | D83C DF78   | ?      |
| UTF-32   | 00000031 | 00000041 | 00000308 | 0001F378    | ?      |

String length: number of bytes

| Encoding | 1        | A        | ¨        | 🍸          | Length |
|----------|----------|----------|----------|-------------|--------|
| UTF-8    | 31       | 41       | CC 88    | F0 9F 8D B8 | 8      |
| UTF-16   | 0031     | 0041     | 0308     | D83C DF78   | 10     |
| UTF-32   | 00000031 | 00000041 | 00000308 | 0001F378    | 16     |

String length: number of code units

A UTF-8 code unit is 1 byte, a UTF-16 code unit is 2 bytes, and a UTF-32 code unit is 4 bytes.

| Encoding | 1        | A        | ¨        | 🍸          | Length |
|----------|----------|----------|----------|-------------|--------|
| UTF-8    | 31       | 41       | CC 88    | F0 9F 8D B8 | 8      |
| UTF-16   | 0031     | 0041     | 0308     | D83C DF78   | 5      |
| UTF-32   | 00000031 | 00000041 | 00000308 | 0001F378    | 4      |

String length: number of code points

| Encoding | 1        | A        | ¨        | 🍸          | Length |
|----------|----------|----------|----------|-------------|--------|
| UTF-8    | 31       | 41       | CC 88    | F0 9F 8D B8 | 4      |
| UTF-16   | 0031     | 0041     | 0308     | D83C DF78   | 4      |
| UTF-32   | 00000031 | 00000041 | 00000308 | 0001F378    | 4      |

String length: number of characters (A and the combining ¨ form the single character Ä)

| Encoding | 1        | A        | ¨        | 🍸          | Length |
|----------|----------|----------|----------|-------------|--------|
| UTF-8    | 31       | 41       | CC 88    | F0 9F 8D B8 | 3      |
| UTF-16   | 0031     | 0041     | 0308     | D83C DF78   | 3      |
| UTF-32   | 00000031 | 00000041 | 00000308 | 0001F378    | 3      |
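
The different notions of length can be computed for the example string in a few lines of Python. This is a sketch: the character count below simply skips combining marks, which is a rough stand-in for real grapheme segmentation, and the variable names are illustrative.

```python
import unicodedata

s = "1A\u0308\U0001F378"  # 1, A, combining diaeresis, cocktail glass

unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}  # bytes per code unit

# Length in bytes for each encoding (the -le variants add no BOM).
byte_length = {enc: len(s.encode(enc)) for enc in unit_size}
# Length in code units: bytes divided by the code unit size.
code_units = {enc: byte_length[enc] // unit_size[enc] for enc in unit_size}
# Python 3 strings are sequences of code points.
code_points = len(s)
# Approximate character count: do not count combining marks on their own.
characters = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(byte_length, code_units, code_points, characters)
```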

EFI Boot Windows from Local Hard Drive (No need for USB)


If you have no USB drive available, or the USB drive will not boot for some reason (e.g. a motherboard or USB issue), you may need to boot the Windows installer from a local hard drive.

Prepare an empty partition and format it as NTFS

It only needs to hold the installation files, so 8 GB will suffice.

# Using GParted will be fine
# e.g. /dev/sda4 is formatted as NTFS with label "WIN10"
# The paths are unquoted so that ~ expands; mount needs root privileges
mkdir -p ~/mounted/WIN10
mount /dev/sda4 ~/mounted/WIN10

Mount your disk ISO on any directory:

mount -o loop [path to your iso] [dest directory]
# e.g. mount -o loop ~/isos/win10.iso /mnt

Copy everything over to the new partition:

# The glob must stay unquoted so that the shell expands it
cp -r /mnt/* ~/mounted/WIN10

Add a menuentry to your grub.cfg

search command:

search --no-floppy --set=root --label [partition label] --hint [your partition number]

Your partition number, e.g. hd0,msdos4 ==> /dev/sda4, where:

  • hd0 means the first drive
  • msdos means the partition table type (msdos or gpt)
  • msdos4 means the 4th partition of the drive (GRUB 2 numbers partitions from 1), which corresponds to /dev/sda4

Example menuentry here:

# grub.cfg should be located at '/boot/grub/grub.cfg'
# Append the following menuentry after your last menuentry

menuentry "Start Windows Installation" {
    insmod ntfs
    insmod search_label
    insmod chain
    insmod part_gpt

    set root=hd0,msdos4
    search --no-floppy --set=root --label "WIN10" --hint ($root)
    chainloader ($root)/boot/efi/bootx64.efi
}

Q&A:

  1. Do we need to run update-grub after adding the menuentry?
    • No. update-grub regenerates grub.cfg and would overwrite your manual edit.
  2. I'm booted into grub rescue. What can I do?
    • Don't worry. Landing in grub rescue means the menuentry contains an error. You can type the menuentry commands manually at the rescue prompt.

Reference:

  • http://onetransistor.blogspot.com/2014/09/make-bootable-windows-usb-from-ubuntu.html
  • http://onetransistor.blogspot.com/2015/09/uefi-ntfs-bootable-windows-usb-linux.html

Using MPI in Python


Environment Setup

# For Ubuntu
apt-get install python3 python3-dev libopenmpi-dev openmpi-bin mpi-default-bin
pip3 install mpi4py numpy matplotlib
# For CentOS
yum-config-manager --disable source
yum install python python-devel python-pip python-matplotlib python-numpy mpich mpich-devel
echo 'export PATH="/usr/lib64/mpich/bin/:$PATH"' >> ~/.bashrc
echo 'export C_INCLUDE_PATH="/usr/include/mpich-x86_64/:$C_INCLUDE_PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/lib64/mpich/lib/:$LD_LIBRARY_PATH"' >> ~/.bashrc
# CentOS install matplotlib
yum-builddep python-matplotlib
pip install matplotlib

MPI Code

# file: mandelbrot-seq.py
import numpy as np
from matplotlib import pyplot as plt


def mandelbrot(c, maxit):
    z = c
    for n in range(maxit):
        if abs(z) > 2:
            return n
        z = z*z + c
    return 0


xmin,xmax = -2.0, 1.0
ymin,ymax = -1.0, 1.0
width,height = 320,200

maxit = 127

xlin = np.linspace(xmin, xmax, width)
ylin = np.linspace(ymin, ymax, height)
C = np.empty((width,height), np.int64)

for w in range(width):
    for h in range(height):
        C[w, h] = mandelbrot(xlin[w] + 1j*ylin[h], maxit)

# The 'spectral' colormap is now named 'nipy_spectral'; pass it by name
plt.imshow(C, aspect='equal', cmap='nipy_spectral')
plt.show()

# file: mandelbrot-mpi.py
import numpy as np
from mpi4py import MPI
from matplotlib import pyplot as plt

def mandelbrot(c, maxit):
    z = c
    for n in range(maxit):
        if abs(z) > 2:
            return n
        z = z*z + c
    return 0


xmin,xmax = -2.0, 1.0
ymin,ymax = -1.0, 1.0
width,height = 320,200

maxit = 127

comm = MPI.COMM_WORLD
comm_size = comm.Get_size()
comm_rank = comm.Get_rank()

ncols = width // comm_size + (width % comm_size > comm_rank)
col_start = comm.scan(ncols) - ncols

xlin = np.linspace(xmin, xmax, width)
ylin = np.linspace(ymin, ymax, height)

C_local = np.empty((ncols,height), np.int64)

for w in range(ncols):
    for h in range(height):
        C_local[w, h] = mandelbrot(xlin[w+col_start] + 1j*ylin[h], maxit)

# Gather results on rank 0
comm_gather_num = comm.gather(ncols, root=0)
C = None
if comm_rank == 0:
    C = np.empty((width,height), np.int64)
# One element of rowtype is `height` contiguous int64 values
rowtype = MPI.INT64_T.Create_contiguous(height)
rowtype.Commit()

comm.Gatherv(sendbuf=[C_local, MPI.INT64_T],
        recvbuf=[C, (comm_gather_num, None), rowtype],
        root=0)
rowtype.Free()

if comm_rank == 0:
    # The 'spectral' colormap is now named 'nipy_spectral'; pass it by name
    plt.imshow(C, aspect='equal', cmap='nipy_spectral')
    plt.show()
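
The column split used in mandelbrot-mpi.py can be checked without MPI. The sketch below replays the ncols formula for every rank and replaces comm.scan with a running sum; the function name column_ranges is made up for illustration.

```python
def column_ranges(width, comm_size):
    """Return (col_start, ncols) per rank, mirroring the formula in the script."""
    ranges = []
    col_start = 0
    for comm_rank in range(comm_size):
        # The first (width % comm_size) ranks take one extra column.
        ncols = width // comm_size + (width % comm_size > comm_rank)
        ranges.append((col_start, ncols))
        # comm.scan(ncols) - ncols in the script is this running sum.
        col_start += ncols
    return ranges


ranges = column_ranges(320, 3)
print(ranges)
```

The blocks are contiguous and cover every column exactly once, which is what lets Gatherv place each rank's block directly into C on rank 0.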

Set up your hosts and have fun!