最近使用这个脚本时,发现它默认假定从 vol 上下载的漫画页面都以 001.html 之类的文件名开头。我手里有部分 epub 并不符合这个规则,所以重写了这个脚本,并增加了一些功能。
- 解析了epub的spine部分,依据spine的顺序对图片进行命名,确保顺序不错。
- 递归地转换,可以把所有子文件夹下的 epub 都转成 cbz。
- 尝试通过epub和文件名生成comicInfo信息,方便kavita和komga索引。
- 重命名以满足komga默认的字典序规则,最大程度保证komga索引顺序。
- 将图片转换成 webp 格式,在 SSIM 保持 99.9 的情况下减少了大概 20% 的体积。
- 添加进度条,更加友好。
环境为python3,最好版本新一点。
依赖第三方库 pillow (转化webp),tqdm(进度条),ebookmeta(解析epub meta),ebooklib(解析epub)
下面为脚本
# -*- coding: utf-8 -*-
# 使用方法,将本文件放置到和待转换文件的同级目录
# import sys, time
import zipfile
import os
import ebookmeta
import ebooklib
import tqdm
from ebooklib import epub
from io import BytesIO
from PIL import Image
import xml.etree.ElementTree as ET
import pathlib
import re
from typing import Tuple, Optional, List, Union
class PageInfo:
    """One <Page> entry for the <Pages> list of a ComicInfo.xml document."""

    def __init__(self, idx: int):
        # 1-based index of the image inside the cbz archive
        self.image = idx
        self.Type = ""          # ComicPageType, e.g. "FrontCover"
        self.double_page = ""   # True / False / "" (unknown, attribute omitted)
        self.image_size = ""    # stored image size in bytes, as str
        self.key = ""           # spine ref id the image came from
        self.book_mark = ""
        self.image_width = ""
        self.image_height = ""

    def to_xml_ele(self):
        """Serialize to a <Page> Element; empty/unknown fields are omitted."""
        ele = ET.Element("Page")
        ele.set("Image", str(self.image))
        if self.Type:
            # BUG fix: previously read self.type, which does not exist and
            # raised AttributeError whenever Type was set.
            ele.set("Type", self.Type)
        # double_page is tri-state; only emit the attribute when it was
        # explicitly set to a bool.
        if self.double_page is True:
            ele.set("DoublePage", "true")
        elif self.double_page is False:
            ele.set("DoublePage", "false")
        if self.image_size:
            ele.set("ImageSize", self.image_size)
        if self.key:
            ele.set("Key", self.key)
        if self.book_mark:
            ele.set("Bookmark", self.book_mark)
        if self.image_width:
            ele.set("ImageWidth", self.image_width)
        if self.image_height:
            ele.set("ImageHeight", self.image_height)
        return ele
class ComicInfo:
    """In-memory model of a ComicInfo.xml document (read by komga/kavita).

    All scalar fields are kept as strings; an empty string means
    "unknown — omit this element from the XML".
    """

    def __init__(self):
        self.series = ""
        self.series_sort = ""
        self.writer = ""
        self.publisher = ""
        self.title = ""
        self.number = ""             # chapter number
        self.volume = ""
        self.language_iso = "zh-CN"  # default; overridden by epub metadata
        self.year = ""
        self.month = ""
        self.day = ""
        self.GTIN = ""               # identifier / ISBN
        self.tags = ""
        self.notes = ""
        self.summary = ""
        self.locations = ""
        self.pages = []              # PageInfo entries, in reading order

    def add_page(self, page: "PageInfo"):
        """Record one page for the <Pages> section."""
        self.pages.append(page)

    def merge_with_epub_info(self, meta):
        """Overlay metadata parsed from the epub (an ebookmeta object).

        Later sources win: epub-level fields override name-derived values,
        and ``meta.publish_info`` fields override the epub-level ones.
        """
        if meta.identifier:
            self.GTIN = meta.identifier
        if len(meta.author_list):
            self.writer = ",".join(meta.author_list)
        if meta.series:
            self.series = meta.series
            self.series_sort = meta.series
        if meta.series_index:
            self.volume = str(int(float(meta.series_index)))
        if len(meta.tag_list):
            self.tags = ",".join(meta.tag_list)
        if meta.description:
            self.summary = meta.description
        if meta.lang:
            self.language_iso = meta.lang
        if meta.title:
            self.title = meta.title
        self.notes = str(meta)  # keep the raw metadata dump for reference
        pub_info = meta.publish_info
        if pub_info.title:
            self.title = pub_info.title
        if pub_info.publisher:
            self.publisher = pub_info.publisher
        if pub_info.year:
            self.year = pub_info.year
        if pub_info.city:
            self.locations = pub_info.city
        if pub_info.series:
            self.series = pub_info.series
        if pub_info.series_index:
            self.volume = str(int(float(pub_info.series_index)))
        if pub_info.isbn:
            self.GTIN = pub_info.isbn

    def merge_with_name_info(self, series, vol, chapter, publisher):
        """Overlay fields extracted from the file name (empty values skipped)."""
        if series:
            self.series = series
            self.series_sort = series
        if vol:
            self.volume = str(vol)
        if chapter:
            self.number = chapter
        if publisher:
            self.publisher = publisher

    def build_comic_info_xml(self):
        """Serialize to a ComicInfo.xml string.

        Returns ``(ok, xml_text, error_msg)``; on failure ``xml_text`` is ""
        and ``error_msg`` describes the problem.
        """
        try:
            root = ET.Element("ComicInfo")
            # BUG fix: the W3C schema namespace URIs use the http:// scheme;
            # https:// here is a different (wrong) namespace identifier and
            # strict ComicInfo consumers compare it literally.
            root.attrib["xmlns:xsi"] = "http://www.w3.org/2001/XMLSchema-instance"
            root.attrib["xmlns:xsd"] = "http://www.w3.org/2001/XMLSchema"

            def assign(cix_entry: str, md_entry: Optional[Union[str, int]]) -> None:
                # Create/overwrite the child element when a value is present,
                # drop any stale element when the value is empty.
                if md_entry is not None and md_entry:
                    et_entry = root.find(cix_entry)
                    if et_entry is None:
                        et_entry = ET.SubElement(root, cix_entry)
                    et_entry.text = str(md_entry)
                else:
                    et_entry = root.find(cix_entry)
                    if et_entry is not None:
                        root.remove(et_entry)

            assign("Title", self.title)
            assign("Series", self.series)
            assign("SeriesSort", self.series_sort)
            assign("Writer", self.writer)
            assign("Publisher", self.publisher)
            assign("Number", self.number)
            assign("Volume", self.volume)
            assign("LanguageISO", self.language_iso)
            assign("Year", self.year)
            assign("Month", self.month)
            assign("Day", self.day)
            assign("GTIN", self.GTIN)
            assign("Tags", self.tags)
            assign("Notes", self.notes)
            assign("Summary", self.summary)
            assign("Locations", self.locations)
            if len(self.pages):
                pages_node = root.find("Pages")
                if pages_node is not None:
                    pages_node.clear()
                else:
                    pages_node = ET.SubElement(root, "Pages")
                for p in self.pages:
                    pages_node.append(p.to_xml_ele())
            ET.indent(root)  # pretty-print; requires Python >= 3.9
            tree = ET.ElementTree(root)
            return True, ET.tostring(tree.getroot(), encoding="utf-8", xml_declaration=True).decode(), ""
        except Exception as e:
            m = f"convert comic info xml failed with {e}"
            print(m)
            return False, "", m
# File-name parsing rules. Each entry is (compiled_regex, group_index_tuple)
# where group_index_tuple = (vol, chapter, series, publisher, sub_name) and
# -1 marks "this field is not captured by the pattern".

# name_Vol.01_Ch.001-002_[publisher].epub
VOL_CH_RE_PAIR = (re.compile(r"([^_]+)_Vol\.(\d+)_Ch\.([^_]+)_\[([^\]]+)\]\."),
                  (2, 3, 1, 4, -1))  # series:1 vol:2 ch:3 publish:4,subname:-1
# name_Vol.01_[publisher].epub # series:1 vol:2 ch:-1 publish:3,subname:-1
VOL_RE_PAIR = (re.compile(r"([^_]+)_Vol\.(\d+)_\[([^\]]+)\]\."), (2, -1, 1, 3, -1))
# [publisher][series]sub_name第01卷.kepub.epub
MOE_SUBNAME_RE = (re.compile(r"\[([^\[]+)\](\[[^\[]+\])(.+)第(\d+)卷"), (4, -1, 2, 1, 3))
# [publisher][series]卷01.kepub.epub # publisher:1 series:2 vol:4,ch:-1,subname:3
# BUG fix: this pair was previously also assigned to MOE_SUBNAME_RE, which
# shadowed the rule above and left MOE_VOL_RE_PAIR (referenced by NAME_RULE)
# undefined, so importing the script raised NameError.
MOE_VOL_RE_PAIR = (re.compile(r"\[([^\[]+)\]\[([^\[]+)\](.+)第(\d+)卷"), (4, -1, 2, 1, 3))
# [publisher][series]話01-002.kepub.epub # publisher:1 series:2 vol:-1,ch:3,subname:-1
MOE_CH_RE_PAIR = (re.compile(r"\[([^\[]+)\]\[([^\[]+)\]話([\d-]+)"), (-1, 3, 2, 1, -1))
# Tried in order; first matching rule wins.
NAME_RULE = [
    VOL_CH_RE_PAIR,
    VOL_RE_PAIR,
    MOE_CH_RE_PAIR,
    MOE_SUBNAME_RE,
    MOE_VOL_RE_PAIR,
]
class Converter:
    """Converts a single epub file into a cbz archive.

    Workflow (see :meth:`process`): derive metadata from the file name and
    the epub itself, walk the spine to collect images in reading order,
    re-encode each image as webp, then pack the images plus a generated
    ComicInfo.xml into a zip archive with the ``.cbz`` extension.

    ``error_msg`` accumulates warnings/errors for the file currently being
    processed; it is reset at the start of each :meth:`process` call.
    """

    def __init__(self):
        # Accumulated warning/error text for the current conversion.
        self.error_msg = ""

    def produce_metda_data_name(self, path) -> Tuple["ComicInfo", str]:
        """Build a ComicInfo plus a normalized cbz file name for ``path``.

        Returns ``(comic_info, new_name)``; ``new_name`` is "" when the
        file name matches none of the rules in NAME_RULE.
        """
        cm = ComicInfo()
        base_name = pathlib.Path(path).name
        matched = False
        vol = chapter = series = publisher = None
        for rule in NAME_RULE:
            matched, vol, chapter, series, publisher = \
                self.extract_base_info_from_name(base_name, rule)
            if matched:
                cm.merge_with_name_info(series, vol, chapter, publisher)
                break
        if not matched:
            m = f"filename {path} not support"
            self.error_msg += m + "\n"
            print(m)
        try:
            # epub metadata wins over name-derived fields where present.
            metadata = ebookmeta.get_metadata(path)
            cm.merge_with_epub_info(metadata)
        except Exception as e:
            m = f"parse metadata from epub failed with {e}"
            self.error_msg += m + "\n"
            print(m)
        if matched:
            _, new_name = self.produce_new_name(series, vol, chapter, publisher)
        else:
            new_name = ""
        return cm, new_name

    def convert_to_webp(self, img_bytes) -> Tuple[bool, bytes, Tuple[int, int]]:
        """Re-encode raw image bytes as lossy webp (quality 80).

        Returns ``(ok, data, (width, height))``; on failure the original
        bytes and a (-1, -1) size are returned so the caller can fall back
        to storing the image unchanged.
        """
        try:
            img = Image.open(BytesIO(img_bytes))
            out = BytesIO()
            img.save(out, format="webp", quality=80)
            return True, out.getvalue(), img.size
        except Exception as e:
            m = f"convert to webp failed with {e}"
            self.error_msg += m + "\n"
            print(m)
            return False, img_bytes, (-1, -1)

    def extract_base_info_from_name(self, name, re_pair) -> Tuple[bool, int, str, str, str]:
        """Extract (vol, chapter, series, publisher) from a file name.

        ``re_pair`` is ``(compiled_regex, group_indices)`` with
        ``group_indices = (vol, chapter, series, publisher, sub_name)``;
        each entry is a match-group number, or -1 for "not captured".
        Unknown volume defaults to 1000. Returns
        ``(matched, vol, chapter, series, publisher)``.
        """
        pattern = re_pair[0]
        group_index = re_pair[1]
        if len(group_index) != 5:
            # BUG fix: this branch used to return a 7-tuple while callers
            # unpack 5 values; keep the return arity consistent.
            return False, 1, "", "", ""
        res = pattern.search(name)
        if not res:
            return False, 1, "", "", ""
        try:
            vol = 1000
            chapter = series = publisher = ""
            vol_idx, chapter_idx, series_idx, publisher_idx, sub_name_idx = group_index
            if vol_idx != -1:
                vol = int(float(res.group(vol_idx)))
            if chapter_idx != -1:
                chapter = res.group(chapter_idx)
            if series_idx != -1:
                series = res.group(series_idx)
            if publisher_idx != -1:
                publisher = res.group(publisher_idx)
            if sub_name_idx != -1:
                sub_name = res.group(sub_name_idx)
                if sub_name:
                    # fold the sub-title into the series name
                    series = f"{series}_{sub_name}"
            return True, vol, chapter, series, publisher
        except Exception as e:
            m = f"extract info from {name} use {pattern.pattern} Failed for{e}"
            self.error_msg += m + "\n"
            print(m)
            return False, 1, "", "", ""

    def produce_new_name(self, series, vol: int, chapter: str, publisher) -> Tuple[bool, str]:
        """Build a komga-friendly cbz name with zero-padded numbers.

        Volume and chapter (or each end of a chapter range like "1-3") are
        padded to 4 digits so lexicographic order matches numeric order.
        Returns ``(ok, name)``.
        """
        try:
            if not publisher:
                publisher = "ericma"  # fallback tag when no publisher known
            if "-" in chapter:
                chapter = "-".join(f"{int(float(p)):04}" for p in chapter.split("-"))
            elif chapter:
                chapter = f"{int(float(chapter)):04}"
            if chapter:
                return True, f"{series}_[{publisher}]_Vol.{vol:04}_Ch.{chapter}.cbz"
            return True, f"{series}_[{publisher}]_Vol.{vol:04}.cbz"
        except Exception as e:
            m = f"build name on ({series},{vol, chapter, publisher}) failed for {e}"
            self.error_msg += m + "\n"
            print(m)
            return False, ""

    def resolve_path_on_any_platform(self, root_path, rel_path):
        """Join an epub-internal relative href onto its document's directory.

        Collapses ".." and "." components manually and always returns a
        posix-style path, since paths inside epub archives use "/" regardless
        of the host OS.
        """
        resolved = pathlib.PurePosixPath(root_path)
        for part in pathlib.PurePosixPath(rel_path).parts:
            if part == "..":
                resolved = resolved.parent
            elif part != '.':
                resolved = resolved / part
        return resolved.as_posix()

    def process(self, path):
        """Convert one epub at ``path`` into a cbz next to it.

        Returns ``(ok, messages)``. Skips files whose target cbz already
        exists; on failure any partially written cbz is deleted (after the
        zip handle is closed, so removal also works on Windows).
        """
        new_name = None
        try:
            print(f"process {path}")
            self.error_msg = ""
            cm, new_name = self.produce_metda_data_name(path)
            old_name = pathlib.Path(path).name
            if not new_name:
                # no naming rule matched: keep the original stem
                new_name = path.replace(".epub", ".cbz")
            else:
                new_name = path.replace(old_name, new_name)
            if os.path.exists(new_name):
                print(f"cbz {new_name} already exists")
                return True, ""
            aborted = False
            with zipfile.ZipFile(new_name, 'w') as zwrite:
                ebook = ebooklib.epub.read_epub(path, options={"ignore_ncx": True})
                # Walk the spine so images get numbered in reading order,
                # independent of their file names inside the epub.
                idx = 1
                img_list = []
                for ref_id, _is_show in ebook.spine:
                    page_item = ebook.get_item_with_id(ref_id)
                    if isinstance(page_item, ebooklib.epub.EpubHtml):
                        root_path = str(pathlib.PurePosixPath(page_item.file_name).parent)
                        ele = ET.fromstring(page_item.content)
                        for item in ele.findall(".//"):
                            if "img" in item.tag and "src" in item.attrib:
                                src = item.attrib["src"]
                                abs_path = self.resolve_path_on_any_platform(root_path, src)
                                img_list.append((idx, abs_path, ref_id, item.attrib))
                                idx += 1
                paddinglen = len(str(len(img_list)))
                img_suffixes = {".jpg", ".png", ".jpeg"}
                img_media_types = {"image/jpeg", "image/png"}
                for idx, abs_path, ref_id, attr_dict in tqdm.tqdm(img_list):
                    try:
                        img_block = ebook.get_item_with_href(abs_path)
                        s = pathlib.Path(abs_path).suffix
                        if s in img_suffixes or img_block.media_type in img_media_types:
                            res, img_d, shape = self.convert_to_webp(img_block.content)
                            if res:
                                newname = f"{str(idx).rjust(paddinglen, '0')}-{ref_id}.webp"
                            else:
                                # webp conversion failed: store original bytes
                                newname = f"{str(idx).rjust(paddinglen, '0')}-{ref_id}{s}"
                            page = PageInfo(idx)
                            if attr_dict.get("class") == "singlePage":
                                page.double_page = False
                            elif attr_dict.get("class") == "twoPage":
                                page.double_page = True
                            page.image_size = str(len(img_d))
                            page.key = ref_id
                            page.image_width = str(shape[0])
                            page.image_height = str(shape[1])
                            cm.add_page(page)
                            zwrite.writestr(newname, img_d)
                    except Exception as e:
                        m = f"process image on {ref_id} name {abs_path} failed with {e} "
                        self.error_msg += m + "\n"
                        # BUG fix: previously the partial cbz was os.remove()d
                        # while the ZipFile handle was still open, which fails
                        # on Windows; defer cleanup until after the `with`.
                        aborted = True
                        break
                if not aborted:
                    res, data, msg = cm.build_comic_info_xml()
                    if msg:
                        self.error_msg += msg + "\n"
                    if data:
                        zwrite.writestr("ComicInfo.xml", data, zipfile.ZIP_DEFLATED)
            if aborted:
                if new_name and os.path.exists(new_name):
                    os.remove(new_name)
                return False, self.error_msg
            return True, self.error_msg
        except Exception as e:
            m = f"process {path} failed with {e}"
            self.error_msg += m + '\n'
            print(e)
            if new_name and os.path.exists(new_name):
                os.remove(new_name)
            return False, self.error_msg
if __name__ == '__main__':
    c = Converter()
    now = os.getcwd()

    def fn(file_dir):
        """Yield every *.epub under ``file_dir`` (recursively), relative to cwd."""
        for root, _dirs, files in os.walk(file_dir):
            for f in files:
                if os.path.splitext(f)[1] == '.epub':  # only epub files
                    yield os.path.relpath(os.path.join(root, f), now)

    res_warning_dict = dict()  # converted, but with warnings
    res_failed_dict = dict()   # conversion failed
    for filename in fn(now):   # all epubs in cwd and its subdirectories
        res, msg = c.process(filename)
        if res:
            # BUG fix: these prints used the literal "(unknown)" instead of
            # interpolating the file actually processed.
            print(f"process {filename} succeed")
            if msg:
                res_warning_dict[filename] = msg
        else:
            print(f"process {filename} failed")
            res_failed_dict[filename] = msg
    print("==============below is convert with some warning ==============")
    for k, v in res_warning_dict.items():
        print(f"> {k}\n {v}\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
    print("==============below is convert failed ==============")
    for k, v in res_failed_dict.items():
        print(f"> {k}\n {v}\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")