2024-10-27-Telegram: batch auto-translation appended below the original text, and automatic recognition of the Apple ID registration CAPTCHA

Nov 3, 2024 · Diary · 37 min read (about 5,554 characters)

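There is a Telegram plugin, 40 per month, that translates any language in real time. From a quick look, it seems to send the text to Volcengine Translate (火山翻译) and then use JS to attach an HTML element that shows the translation under the original. These days I end up spending a fair amount of time every day scrolling Zhihu and RT (Russia Today) to see whether there is any exciting news. The information there is limited, and in the long run it is also an easy way to end up in an information cocoon. So I wanted to look for widely followed channels on Telegram and use ChatGPT to reverse-engineer this translation plugin. The program uses Beautiful Soup to parse the HTML and extract the text, then sends it to Volcengine Translate in segments. A few points to note: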

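1. translate_tab = tab.get_tab(url='volcengine.com'): get_tab fetches a tab that is already open. At first I opened Telegram and then navigated to Volcengine Translate, jumping back and forth. That meant setting a page-load wait on every jump. Also, Volcengine Translate defaults to translating into English every time it is opened. If the page stays open without being refreshed, Chinese can be set once and there is no need to keep switching the target language, which would also cost time.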

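2. Sometimes the target language inexplicably switches to English. The script checks the target language and, if it has changed for whatever reason, switches it back to Chinese.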

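3. Filter out HTML elements whose text is not worth translating, along with certain strings: like counts, dates, http/www links, @usernames and so on. Translating these is pointless and only wastes time.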

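4. At first the translations came out in the wrong order: multiple translated paragraphs were listed in reverse, because each later translation was inserted directly after the original div. I figured it was better to merge all the original text inside one div and submit it together. That cuts the number of submissions and fixes the ordering, since the text under one div is translated in one go. It also avoids the misplacement that happens when repeated content gets matched to the first div that contains it.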

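5. The misplacement caused by repeated content being matched to the first div can be handled by adding a marker attribute (data-translated in the script below) so that already translated divs are skipped. With point 4 in place this trick may no longer have much to do, but other projects could borrow it.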
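Points 3 and 4 can be sketched in isolation. This is only a minimal illustration under assumptions, not the script's exact logic: worth_translating and merge_block are hypothetical names, and the noise patterns (counts such as 1.2K, simple dates, links, lone @usernames) are my guess at what counts as untranslatable. The real script further down also skips short pure-English words.

```python
import re

# Assumed noise patterns; adjust them to what the channel actually contains.
NOISE_PATTERNS = [
    re.compile(r"^\d[\d\s.,:/-]*[kKmM]?$"),   # counts, times, dates: "1.2K", "27.10.2024"
    re.compile(r"https?://|www\.", re.I),     # anything containing a link
    re.compile(r"^@\w+$"),                    # a lone @username
]

def worth_translating(text: str) -> bool:
    """Return True if a text node should be sent to the translator."""
    text = text.strip()
    if not text:
        return False
    if re.search(r"[\u4e00-\u9fff]", text):   # already contains Chinese
        return False
    return not any(p.search(text) for p in NOISE_PATTERNS)

def merge_block(text_nodes: list[str]) -> str:
    """Join the translatable text nodes of one div into a single request."""
    kept = [t.strip() for t in text_nodes if worth_translating(t)]
    return "\n".join(kept)
```

Because one div becomes one request, the translation comes back as one block and can be inserted once after that div, which is what removes the ordering problem.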

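In testing so far, 40-odd messages take a little over 20 minutes to translate. When I have time I will find a few more popular multilingual channels and test them. So far I have found 5 Russian channels, 1 Indian, 3 Middle Eastern, 2 Brazilian and 2 Argentine. Later on, maybe I could open a section on zhudian.xyz that automatically compiles a translated digest of each country's news every day?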

Update 2024-11-1:

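I tried it. The idea is to download images by taking the blob link and decoding it to base64 with JS, but for some reason images are often lost. I do not know the cause yet: even a check for whether the file exists in the local folder does not catch it, and although the file is clearly missing, nothing unusual gets printed. This needs more investigation. Since all the divs load at once, images and videos tend to fail to render, so I added JS lazy loading for them: first change src to data-src, then switch it back when the image scrolls into view. So in the soup HTML, images and videos with class full-media get the lazy class added and src removed. Videos are downloaded by locating them with XPath and then right-clicking, and videos seem less likely to go missing. I originally wanted to handle images the same way as videos, but it went wrong: one div can contain several images (a div can also contain several videos, though that did not seem to cause problems), so this method downloads the first image repeatedly. I tried indexing each image under the div with enumerate, but it failed at runtime.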
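One way to narrow down the silently missing images is to make the save step itself report failure instead of assuming the write worked. The sketch below is an assumption-laden illustration: save_base64_image is a hypothetical helper, and the base64 string is assumed to come from a FileReader call run in the page.

```python
import base64
import os

def save_base64_image(b64_payload: str, path: str) -> bool:
    """Decode a base64 payload, write it to path, and confirm it reached the disk."""
    if not b64_payload:
        return False                     # the page returned nothing
    try:
        data = base64.b64decode(b64_payload, validate=True)
    except (ValueError, TypeError):
        return False                     # the payload was not valid base64
    if not data:
        return False
    with open(path, "wb") as fh:
        fh.write(data)
    # Compare the size on disk with the number of bytes we meant to write.
    return os.path.exists(path) and os.path.getsize(path) == len(data)
```

Returning False for an empty or malformed payload separates "the browser gave us nothing" from "the file was written and later went missing", which the plain existence check cannot tell apart.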

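Using a Chrome CSS extension, I pieced together a CSS file that gets Telegram's message bubbles, images, videos and assorted odd symbols laid out properly. soup's decompose removes comments, likes and similar elements so they do not complicate the translation, and the same removal is applied when writing the output HTML to throw away everything unnecessary.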

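On the Hexo upload side, remember to run hexo clean every time to clear the cache, otherwise both the HTML and the CSS will have problems. Normally, about 5 minutes after hexo d succeeds, the page can be seen on the remote zhudian.xyz. Since online video compressors all charge money (I do not understand how they can tell that my trial has expired when I have deleted my history and am not logged in), I still use Moo0 for compression. I wrote two programs that run on double-click. The first moves videos larger than 25 MB to a designated directory. Moo0 opens that directory by default, so I select everything and compress, with the 25 MB target also being the default. After compression, double-clicking the second program automatically renames the files and moves them back to the telegram directory under Hexo.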
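The first of those two helper programs could look roughly like this. It is a sketch under assumptions rather than the program I actually use: move_large_videos is a hypothetical name, 25 MB is taken to mean 25 * 1024 * 1024 bytes, and only .mp4 and .mov files are treated as videos.

```python
import os
import shutil

SIZE_LIMIT = 25 * 1024 * 1024       # 25 MB, assumed to mean MiB
VIDEO_EXTS = {".mp4", ".mov"}       # assumed set of video extensions

def move_large_videos(src_dir: str, dst_dir: str, limit: int = SIZE_LIMIT) -> list[str]:
    """Move videos bigger than limit bytes from src_dir to dst_dir; return their names."""
    os.makedirs(dst_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        ext = os.path.splitext(name)[1].lower()
        if os.path.isfile(path) and ext in VIDEO_EXTS and os.path.getsize(path) > limit:
            shutil.move(path, os.path.join(dst_dir, name))
            moved.append(name)
    return moved
```

The second program would be the mirror image: after compression, move everything in dst_dir back under the original names.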

hexo d takes 2 minutes at most. If the deploy times out and hangs, press Ctrl+C to stop it and run it again.

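In the evening I tried creating a new folder under Hexo's source directory to hold the Telegram pages, pasted in a few messages by hand, used a Chrome extension to inspect the CSS, hand-edited some of it, and checked the result in a local test. To stop Hexo from rendering the page, use skip_render in the config to skip this folder, then run hexo clean, otherwise skip_render does not take effect.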

Update 2024-11-3:

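I split img downloads into two methods depending on whether the class contains full-media: images with full-media use right-click download, images without it use base64. full-media has one more case, multiple images in one message, and base64 is used there too. In Telegram, right-click download on a multi-image full-media message downloads everything at once with no way to rename each file, so only the base64 download solves the renaming problem. In yesterday's image download test everything displayed, with almost nothing lost.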
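The decision rule in the paragraph above fits in a few lines. This is just the rule written out, with hypothetical names (choose_download_method and its parameters); the actual script makes the same decision inline with XPath lookups.

```python
def choose_download_method(img_classes: list[str], full_media_blobs_in_message: int) -> str:
    """Return 'right_click' or 'base64' for one blob image.

    img_classes: the class list of the img tag.
    full_media_blobs_in_message: how many full-media blob images the message holds.
    """
    if "full-media" not in img_classes:
        return "base64"            # non full-media images always go through base64
    if full_media_blobs_in_message > 1:
        return "base64"            # albums: right-click would save all files at once
    return "right_click"           # a single full-media image
```

For example, choose_download_method(["full-media"], 3) picks base64 so that each image in the album can be saved under its own name.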

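The catch with video download is that each target div has to be scrolled into view one by one so that its video tag is loaded into the HTML that soup parses, and only then can it be downloaded. Downloaded videos sometimes have an uppercase MP4 extension, which a program then batch-renames to lowercase mp4. For some reason identical video file names came up in testing and made the program exit with an error, so I changed the program to avoid exiting on file name conflicts. Also, video needs the muted attribute, otherwise autoplay will not start in the web page.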

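When compressing with Moo0, choose the x264 option, otherwise the video will not display on the web page because of the codec. In testing so far, 25x6=150 messages take about 1.5 hours for image download, video download and translation. At that rate, leaving it running overnight for 1.5x4 hours, about 600 messages a night is no problem, which means collecting messages from 24 channels.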
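The arithmetic behind that estimate, written out so the numbers can be changed easily. The variable names are mine; the inputs (25 messages per channel, 6 channels per batch, 1.5 hours per batch, 4 batches per night) are the figures from the paragraph above.

```python
messages_per_channel = 25
channels_per_batch = 6
hours_per_batch = 1.5
batches_per_night = 4

messages_per_batch = messages_per_channel * channels_per_batch       # 150
hours_per_night = hours_per_batch * batches_per_night                # 6.0
messages_per_night = messages_per_batch * batches_per_night          # 600
channels_per_night = messages_per_night // messages_per_channel      # 24
minutes_per_message = hours_per_batch * 60 / messages_per_batch      # 0.6

print(messages_per_night, channels_per_night, minutes_per_message)
```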

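On the Suno side, both Discord and Google accounts now require phone number verification, so I can no longer register, and Discord sign-up also requires a fairly complex verification. Today I searched around and found ddddocr, a project that uses AI to recognize CAPTCHAs. Apple ID currently allows registering accounts repeatedly with a Chinese phone number, except that there is an image CAPTCHA containing letters and digits. I tested ddddocr and it recognizes the Apple ID registration CAPTCHA successfully, so batch-registering Apple IDs should not be a problem. For composing on Suno, alternating daily between an Apple ID and a Microsoft account is enough. Batch Apple ID registration still needs a program that automatically forwards the phone verification code to the computer; I will work on that this week when I have time.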

Update 2024-11-1:

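No need to bother for now. I found that after a Suno create error, the song can still be extended, and after the extend the error no longer occurs. So I am back to composing in bulk. I even found that multiple Chrome accounts all work: log out and pick another one. That means there is no need to think further about Apple ID. In theory I can keep 4 Chrome accounts for easy switching plus 1 Discord and 1 Microsoft account, 6 accounts composing per day.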

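Batch auto-translate a given number of messages from given channels, download the corresponding images and videos, and output laid-out HTML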

 

from DrissionPage import ChromiumPage, ChromiumOptions
from DrissionPage.common import By
import time
import re
from bs4 import BeautifulSoup
import os
import requests
import json
import base64

# Set Chromium options
do1 = ChromiumOptions().set_paths(local_port=9111, user_data_path=r'C:/Users/A/AppData/Local/Google/Chrome/User Data')
tab = ChromiumPage(addr_or_opts=do1)

def download_images(top_limit_message_divs, download_folder, max_retries=6):
    if not os.path.exists(download_folder):
        os.makedirs(download_folder)  # Create the download directory

    total_blob_images = 0  # Running total of blob images

    for div in top_limit_message_divs:
        # Find all img tags
        img_tags = div.find_all('img')

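        # Replace a leading ./ with the full URL
        for img in img_tags:
            img_url = img.get('src')
            if img_url and img_url.startswith('./'):
                # Slice off the prefix; lstrip('./') would strip any run of leading dots and slashes
                img['src'] = 'https://web.telegram.org/a/' + img_url[2:]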

        # Keep only img tags that are not siblings of a video tag and do not start with ./
        filtered_img_tags = [
            img for img in img_tags
            if not (img.find_previous_sibling('video') or img.find_next_sibling('video'))
            and not img.get('src', '').startswith('./')
        ]

        blob_img_count = sum(1 for img in filtered_img_tags if img.get('src') and img.get('src').startswith('blob:'))
        total_blob_images += blob_img_count
        print(f"Found {blob_img_count} blob images in current div after filtering.")

        for i, img in enumerate(filtered_img_tags):
            img_url = img.get('src')
            if img_url and img_url.startswith('blob:') and 'full-media' in img.get('class', []):
                message_div_id = div['id']
                imgxpath = (By.XPATH, f'(//div[@id="{message_div_id}"]//img[contains(@class, "full-media") and starts-with(@src, "blob:")])')
                
                img_element = tab.ele(imgxpath)
                # Check whether more than one img tag matches
                imgMorethan1xpath = (By.XPATH, f'(//div[@id="{message_div_id}"]//img[contains(@class, "full-media") and starts-with(@src, "blob:")])[2]')
                img_element2 = tab.ele(imgMorethan1xpath)

                if img_element2:  # Several blob images: download via base64
                    print(f"Preparing to download multi imgs")
                    file_name = img_url.split('/')[-1] + '.jpg'
                    img_path = os.path.join(download_folder, file_name)
                    
                    for attempt in range(max_retries):
                        result = tab.run_js(f"""
                            return fetch('{img_url}')
                                .then(response => response.blob())
                                .then(blob => {{
                                    return new Promise((resolve, reject) => {{
                                        const reader = new FileReader();
                                        reader.onloadend = () => resolve(reader.result.split(',')[1]);
                                        reader.onerror = reject;
                                        reader.readAsDataURL(blob);
                                    }});
                                }});
                        """)

                        img_data = base64.b64decode(result)  # Convert to binary data
                        with open(img_path, 'wb') as img_file:
                            img_file.write(img_data)

                        time.sleep(1)
                        if os.path.exists(img_path) and os.path.getsize(img_path) > 0:
                            print(f"Saved image to: {img_path}")
                            break
                        else:
                            print(f"Retry {attempt + 1}/{max_retries} for image {img_url} failed.")
                            time.sleep(1)
                    else:
                        print(f"Failed to download image {img_url} after {max_retries} attempts.")

                else:  # Only one matching image: simulate right-click download
                    print(f"Preparing to download single img: {img_url}")
                    img_element.scroll.to_see()
                    time.sleep(1)
                    tab.actions.r_click(img_element)
                    # img_element.click.right()
                    time.sleep(1)

                    downloadxpath = (By.XPATH, f'//div[@id="{message_div_id}"]//div[@class="MenuItem compact" and normalize-space(.) = "Download"]')
                    download = tab.ele(downloadxpath)

                    # Set the download path and file name
                    tab.set.download_path(download_folder)
                    tab.set.download_file_name(img_url.split('/')[-1].strip())
                    download.click()
                    time.sleep(1)

    print(f"Total blob images found: {total_blob_images}")

def mute_autoplay_videos(message_divs):
    for div in message_divs:
        # Find all video tags
        video_tags = div.find_all('video')
        
        for video in video_tags:
            # If the autoplay attribute is present, also add muted
            if 'autoplay' in video.attrs:
                video['muted'] = ''  # Add the muted attribute
                print(f"Updated video tag: {video}")  # Print the updated video tag

def rename_file_extensions(directory):
    for filename in os.listdir(directory):
        # Split the file name from its extension
        base, ext = os.path.splitext(filename)

        # Check whether the extension is .MP4
        if ext == '.MP4':
            # Build the old and new file paths
            old_file = os.path.join(directory, filename)
            new_file = os.path.join(directory, base + '.mp4')

            # If the file already exists, append a number to the file name
            count = 1
            while os.path.exists(new_file):
                new_file = os.path.join(directory, f"{base}_{count}.mp4")
                count += 1

            # Rename the file
            os.rename(old_file, new_file)
            print(f"Renamed {filename} to {new_file}")

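# Decide whether the text is non-Chinese and not a link
def is_non_chinese_and_non_link(text):
    # True if the text has no Chinese and no link, and is not just 1-30 English letters (with or without a leading @)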
    return (
        text and
        not re.search(r'[\u4e00-\u9fff]', text) and
        'http' not in text and
        'www' not in text and
        not re.search(r'^[a-zA-Z]{1,30}$', text) and  # Skip text that is only 1 to 30 English letters
        not re.search(r'^@[a-zA-Z]{1,30}$', text)    # Skip @ followed by 1 to 30 English letters
    )

# Run the translation and return the result
def translate_text(text):
    translate_tab = tab.get_tab(url='volcengine.com')

    languageEn = (By.XPATH, "//div[@class='reverse']/following-sibling::div[@class='sc-ipEyDJ dqurTv']/div[@class='lang' and text()='英语']")
    if translate_tab.ele(languageEn):
        print("Target language changed to English for an unknown reason")
        translate_tab.ele(languageEn).click()
        time.sleep(2)
        language2 = (By.XPATH, "//div[@class='lang-search-recently']/div[@data-lang='zh']")
        language_option = translate_tab.ele(language2)
        language_option.click()
        print("Forced the target language back to Chinese")

    result1 = (By.XPATH, "//div[@class='slate-editor' and @contenteditable='false']")
    translated_text = translate_tab.ele(result1)
    input1 = (By.XPATH, '//div[@role="textbox" and @aria-multiline="true" and contains(@class, "slate-editor")]')
    input_box = translate_tab.ele(input1)
    input_box.clear()
    input_box.input(text)

    time.sleep(5)
    # print("translated_text为：" + translated_text.text)
    return translated_text.text

# Process the web page content
def process_webpage(url_base, message_limit):
    # Output file path
    output_file_path = r'D:\hexoblog\source\telegram\telegram_translated_messages.html'

    # Create the file and write the HTML content
    with open(output_file_path, 'w', encoding='utf-8') as file:
        # Write the HTML template
        file.write("""<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Telegram Styled Message</title>
    <link rel="stylesheet" href="telegram2.css">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css">
</head>

<script type="text/javascript">
    document.addEventListener("DOMContentLoaded", function() {
    const lazyElements = document.querySelectorAll('.lazy');

    const observer = new IntersectionObserver((entries, observer) => {
        entries.forEach(entry => {
            if (entry.isIntersecting) {
                const el = entry.target;
                const src = el.getAttribute('data-src');
                if (src) {
                    if (el.tagName === 'IMG' || el.tagName === 'VIDEO') {
                        el.src = src;
                        el.removeAttribute('data-src'); // Remove data-src
                    }
                    observer.unobserve(el); // Stop observing elements that have loaded
                }
            }
        });
    }, { rootMargin: "0px 0px 200px 0px" }); // Start loading 200px early

    lazyElements.forEach(el => observer.observe(el));
});

</script>
<body>
""")  # Write the opening part of the HTML template

    # Loop over the list of chat ids
    for chat_id in chat_ids:
        # Build the URL and the menu XPath
        url = url_base + chat_id
        tab.get(url)
        time.sleep(3)  # Wait for the page to load
        menu1 = (By.XPATH, f'//a[@href="#-{chat_id}"]')
        menu_list = tab.ele(menu1)
        time.sleep(1)
        menu_list.click()

        button1 = (By.XPATH, '//div[@class="Y2NKrpKj u62x81QI"]/button')
        buttondown = tab.ele(button1)
        buttondown.click()
        time.sleep(3)
        container1 = (By.XPATH, '//div[@class="messages-container"]')
        messages_container = tab.ele(container1)
        time.sleep(2)

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(messages_container.html, 'html.parser')

        # Remove unwanted elements
        for meta_span in soup.find_all('span', class_='MessageMeta'):
            meta_span.decompose()
        for message_title in soup.find_all('span', class_='message-title-name'):
            message_title.decompose()
        for video_duration in soup.find_all('div', class_='message-media-duration'):
            video_duration.decompose()
        for button_react in soup.find_all('button', class_='message-reaction'):
            button_react.decompose()
        for reply in soup.find_all('div', class_='CommentButton'):
            reply.decompose()
        for reply2 in soup.find_all('div', class_='recent-repliers'):
            reply2.decompose()

        # Extract the messages in message-list-item
        message_divs = soup.find_all('div', class_='message-list-item', id=lambda x: x and x.startswith('message-'))

        # Sort by the numeric part of the id
        message_divs_sorted = sorted(
            message_divs,
            key=lambda div: int(div['id'].split('-')[1]),
            reverse=True
        )

        # Keep only the requested number of messages
        top_limit_message_divs = message_divs_sorted[:message_limit]

        for message_div in top_limit_message_divs:
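            # Use double quotes inside the f-string; reusing single quotes is a syntax error before Python 3.12
            element = tab.ele(f'@id={message_div["id"]}')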
            element.scroll.to_see()
            time.sleep(1)

        buttondown.click()
        time.sleep(3)
        container1 = (By.XPATH, '//div[@class="messages-container"]')
        messages_container = tab.ele(container1)
        time.sleep(2)

        # Parse the HTML again with BeautifulSoup, now that scrolling has loaded the messages
        soup = BeautifulSoup(messages_container.html, 'html.parser')

        # Remove unwanted elements
        for meta_span in soup.find_all('span', class_='MessageMeta'):
            meta_span.decompose()
        for message_title in soup.find_all('span', class_='message-title-name'):
            message_title.decompose()
        for video_duration in soup.find_all('div', class_='message-media-duration'):
            video_duration.decompose()
        for button_react in soup.find_all('button', class_='message-reaction'):
            button_react.decompose()
        for reply in soup.find_all('div', class_='CommentButton'):
            reply.decompose()
        for reply2 in soup.find_all('div', class_='recent-repliers'):
            reply2.decompose()

        # Extract the messages in message-list-item
        message_divs = soup.find_all('div', class_='message-list-item', id=lambda x: x and x.startswith('message-'))

        # Sort by the numeric part of the id
        message_divs_sorted = sorted(
            message_divs,
            key=lambda div: int(div['id'].split('-')[1]),
            reverse=True
        )

        # Keep only the requested number of messages
        top_limit_message_divs = message_divs_sorted[:message_limit]
        # min_message_div = min(top_limit_message_divs, key=lambda div: int(div['id'].split('-')[1]))

        total_videos = 0
        total_imgs = 0

        for message_div in top_limit_message_divs:

            # Handle text-content first
            print("Processing text-content")
            text_content_divs = message_div.find_all('div', class_='text-content')
            print(f"Current message_div id: {message_div['id']}")

            for div in text_content_divs:
                text_content_list = []
                for text_content in div.find_all(string=True):  # string= replaces the deprecated text= argument
                    stripped_text = text_content.strip()  # Strip surrounding whitespace
                    if is_non_chinese_and_non_link(stripped_text):
                        text_content_list.append(stripped_text)

                combined_text = '\n'.join(text_content_list)
                if combined_text:
                    # print(f"Processing text-content combined_text: {combined_text}")
                    translated_text = translate_text(combined_text)

                    # Escape translated_text into a JavaScript string literal
                    safe_translated_text = json.dumps(f'<p style="color: purple;">{translated_text}</p>')

                    # Find the target element in the browser with an XPath that matches the original text
                    target_div = tab.ele((By.XPATH, f"//div[contains(@class, 'text-content') and contains(., '{text_content_list[0]}')][not(@data-translated)]"))

                    if target_div:
                        target_div.run_js("this.setAttribute('data-translated', 'true');")
                        target_div.run_js(f"""
                            function insertAfter(newElement, targetElement) {{
                                var parentElement = targetElement.parentNode;
                                if (parentElement.lastChild === targetElement) {{
                                    parentElement.appendChild(newElement);
                                }} else {{
                                    parentElement.insertBefore(newElement, targetElement.nextSibling);
                                }}
                            }}
                            
                            var transDiv = document.createElement('div');
                            transDiv.className = 'translated_text';
                            transDiv.innerHTML = {safe_translated_text};
                            
                            insertAfter(transDiv, this);
                        """)
                    else:
                        print("No element found for text-content")

            # Handle WebPage-text
            print("Processing WebPage-text")
            webpage_text_divs = message_div.find_all('div', class_='WebPage-text')
            print(f"Current message_div id: {message_div['id']}")

            for div in webpage_text_divs:
                text_content_list = []
                for text_content in div.find_all(string=True):  # string= replaces the deprecated text= argument
                    stripped_text = text_content.strip()  # Strip surrounding whitespace
                    if is_non_chinese_and_non_link(stripped_text):
                        text_content_list.append(stripped_text)

                combined_text = '\n'.join(text_content_list)
                if combined_text:
                    # print(f"Processing WebPage-text combined_text: {combined_text}")
                    translated_text = translate_text(combined_text)

                    # Escape translated_text into a JavaScript string literal
                    safe_translated_text = json.dumps(f'<p style="color: purple;">{translated_text}</p>')

                    # Find the target element in the browser with an XPath that matches the original text

                    if len(text_content_list) > 1:
                        # Prefer text_content_list[1] when it exists
                        target_div = tab.ele((By.XPATH, f"//div[contains(@class, 'WebPage-text') and contains(., '{text_content_list[1]}')][not(@data-translated)]"))
                    else:
                        # Otherwise fall back to text_content_list[0]
                        target_div = tab.ele((By.XPATH, f"//div[contains(@class, 'WebPage-text') and contains(., '{text_content_list[0]}')][not(@data-translated)]"))


                    if target_div:
                        target_div.run_js("this.setAttribute('data-translated', 'true');")
                        target_div.run_js(f"""
                            function insertAfter(newElement, targetElement) {{
                                var parentElement = targetElement.parentNode;
                                if (parentElement.lastChild === targetElement) {{
                                    parentElement.appendChild(newElement);
                                }} else {{
                                    parentElement.insertBefore(newElement, targetElement.nextSibling);
                                }}
                            }}
                            
                            var transDiv = document.createElement('div');
                            transDiv.className = 'translated_text';
                            transDiv.innerHTML = {safe_translated_text};
                            
                            insertAfter(transDiv, this);
                        """)
                    else:
                        print("No element found for WebPage-text")


        for message_div in top_limit_message_divs:
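            # Use double quotes inside the f-string; reusing single quotes is a syntax error before Python 3.12
            element = tab.ele(f'@id={message_div["id"]}')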
            element.scroll.to_see()
            time.sleep(1)

        soup = BeautifulSoup(messages_container.html, 'html.parser')

        # Remove unwanted elements
        for meta_span in soup.find_all('span', class_='MessageMeta'):
            meta_span.decompose()
        for message_title in soup.find_all('span', class_='message-title-name'):
            message_title.decompose()
        for video_duration in soup.find_all('div', class_='message-media-duration'):
            video_duration.decompose()
        for button_react in soup.find_all('button', class_='message-reaction'):
            button_react.decompose()
        for reply in soup.find_all('div', class_='CommentButton'):
            reply.decompose()
        for reply2 in soup.find_all('div', class_='recent-repliers'):
            reply2.decompose()

        # Extract the messages in message-list-item
        message_divs = soup.find_all('div', class_='message-list-item', id=lambda x: x and x.startswith('message-'))

        # Sort by the numeric part of the id
        message_divs_sorted = sorted(
            message_divs,
            key=lambda div: int(div['id'].split('-')[1]),
            reverse=True
        )

        # Keep only the requested number of messages
        top_limit_message_divs = message_divs_sorted[:message_limit]
        
        download_folder = r'D:\hexoblog\source\telegram'
        download_images(top_limit_message_divs, download_folder)

        for div in top_limit_message_divs:
            # Find all img tags
            img_tags = div.find_all('img')

            for img in img_tags:
                img_url = img.get('src')
                if img_url and img_url.startswith('blob:'):
                    # Take the image file name and add an extension
                    new_img_url = img_url.split('/')[-1] + ".jpg"
                    # Store the local file name in data-src for lazy loading
                    img['data-src'] = new_img_url
                    # Remove the original src attribute
                    del img['src']
                    if 'full-media' in img['class']:
                        # Add 'lazy' if it is not already in the class list
                        if 'lazy' not in img['class']:
                            img['class'].append('lazy')

                if img_url and img_url.startswith('./'):
                    # Replace the leading './' with the full URL (slice, since lstrip('./') strips characters, not a prefix)
                    img_url = 'https://web.telegram.org/a/' + img_url[2:]
                    img['src'] = img_url

        max_retries = 25
        for message_div in top_limit_message_divs:
            # Get the video tags of the current message_div
            video_tags = message_div.find_all('video')
            print("Video tags found:", video_tags)

            video_count = sum(1 for video in video_tags if video.get('src'))
            total_videos += video_count
            print(f"Found {video_count} videos in current div.")

            if video_count > 0:  # Make sure the current div contains a video
                for video in video_tags:  # Loop over each video
                    video_src = video.get('src')
                    if video_src and video_src.startswith('./progressive/document'):
                        # Extract the file name
                        file_name = video_src.replace('./progressive/document', '').strip()
                        print(f"Preparing to download video with filename: {file_name}")

                        # XPath for the video tag
                        message_div_id = message_div['id']
                        videoxpath = (By.XPATH, f'//div[@id="{message_div_id}"]//video')
                        video_element = tab.ele(videoxpath)  # Get the video element

                        # Simulate a right-click
                        tab.actions.r_click(video_element)
                        time.sleep(1)  # Wait for the context menu to appear

                        # XPath for the Download menu item
                        downloadxpath = (By.XPATH, f'//div[@id="{message_div_id}"]//div[@class="MenuItem compact" and normalize-space(.) = "Download"]')
                        download = tab.ele(downloadxpath)  # Get the Download button

                        retries = 0
                        while retries < max_retries:
                            try:
                                # Set the download path and file name
                                tab.set.download_path(r'D:\hexoblog\source\telegram')  # Directory to save into
                                tab.set.download_file_name(file_name)  # File name to save as
                                time.sleep(1)
                                download.click()
                                
                                # Update the video tag attributes
                                video_src = file_name + ".mp4"
                                video['data-src'] = video_src
                                del video['src']
                                video['class'] = 'full-media lazy'
                                
                                print(f"Downloaded {file_name} successfully.")
                                break  # Leave the retry loop on success
                            except Exception as e:
                                retries += 1
                                print(f"Error: {e}. Retrying ({retries}/{max_retries})...")
                                time.sleep(3)  # Wait 3 seconds before retrying
                        else:
                            print(f"Failed to download {file_name} after {max_retries} attempts.")
                            
                        # Optional: wait for the download to finish
                        time.sleep(15)  # Adjust to the expected download time

        mute_autoplay_videos(top_limit_message_divs)

        directory_path = "D:/hexoblog/source/telegram"

        for filename in os.listdir(directory_path):
            # Check whether the file name starts with 'video'
            if filename.startswith("video"):
                # Build the new file name (strip the 'video' prefix)
                new_filename = filename[len("video"):]
                
                # Build the old and new file paths
                old_file = os.path.join(directory_path, filename)
                new_file = os.path.join(directory_path, new_filename)

                # If the new file name already exists, add a numeric suffix
                count = 1
                while os.path.exists(new_file):
                    new_file = os.path.join(directory_path, f"{new_filename}_{count}")
                    count += 1

                # Rename the file
                os.rename(old_file, new_file)
                print(f"Renamed {filename} to {new_file}")

        rename_file_extensions(directory_path)

        for filename in os.listdir(directory_path):
            # Check whether the file name ends with '_1' right before the extension
            if filename.endswith('_1.mp4') or filename.endswith('_1.MOV'):
                # Remove the first '_1' from the file name
                new_filename = filename.replace('_1', '', 1)
                
                # Build the full old and new file paths
                old_file = os.path.join(directory_path, filename)
                new_file = os.path.join(directory_path, new_filename)
                
                # If a file with that name already exists, add a '_new' suffix with a number
                count = 1
                while os.path.exists(new_file):
                    name, ext = os.path.splitext(new_filename)
                    new_file = os.path.join(directory_path, f"{name}_new{count}{ext}")
                    count += 1

                # Rename the file
                os.rename(old_file, new_file)
                print(f"Renamed {filename} to {new_file}")
            else:
                print(f"Skipping {filename}, does not match pattern.")

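        # Serialize the selected message divs and append them to the output file in one write
        # (iterating over the joined string would reopen the file once per character)
        translated_text_divs = ''.join(str(div) for div in top_limit_message_divs)

        with open(output_file_path, 'a', encoding='utf-8') as file:
            file.write(translated_text_divs)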
            
    with open(output_file_path, 'a', encoding='utf-8') as file:
            file.write("\n</body>\n</html>")

    print(f"Translated messages saved to {output_file_path}")


chat_ids = ['1001036240821', '1001374600389', '1001576917998', '1001375124677', '1001001746107', '1001394050290']
url_base = 'https://web.telegram.org/a/#-'
message_limit = 25

process_webpage(url_base, message_limit)

#python #telegram #media-news
