Getting a List of the Top 1000 Trending YouTube Channels

Oct 3, 2022

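I recently came across a site that lists the most popular YouTube channels right now:
https://hypeauditor.com/top-youtube/
The site has no export feature, but I found that it does not block scrapers and is fairly easy to crawl, so here are my notes on scraping it with Python.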

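Besides the overall ranking, the site also offers more specific rankings:
* Top by category: e.g. animation, movies, music
* Top by country: e.g. Japan, Egypt, United States (no Taiwan, sadly)
Besides YouTube, the site also analyzes other platforms (e.g. Instagram, TikTok).
The basic features above can be viewed for free.
The platform also appears to have paid features; the link below describes them in detail:
https://influencermarketinghub.com/hypeauditor/
To keep things simple, we only scrape the overall ranking here, not the sub-rankings.
The results below were run in Colab.
The code is listed at the end.
The code was written in October 2022; if HypeAuditor changes its page layout in the future, the scraper may break.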

0. First, install the packages

!pip install beautifulsoup4 requests pandas

1. Import the packages

import requests
import bs4
from IPython import display
import pandas as pd

2. Fetch the page with requests

resp = requests.get("https://hypeauditor.com/top-youtube/")
display.HTML(resp.text)

The main body of the page renders correctly, so we can start parsing the HTML.

3. Go back to the original site and inspect the HTML

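# If you don't feel like reading the page source, you can skip this step
# Use the browser's element picker on an element under Rank to find its position in the HTML source
1. Open the element picker
(mac) cmd + option + c
(windows) ctrl + shift + c
2. Left-click the first element under Rank
3. Get the selector: right-click the matching element in the right-hand panel and choose Copy > Copy selector
4. Paste it back into Colab for parsing
#__layout > div > div.page__content.page__content--with-hello-bar > div > div.main.ranking-content > div > div.tab_3MsJF > div > div:nth-child(2) > div.row__top > div.row-cell.rank > span
5. The site is structured quite cleanly, so the short string below is enough as a selector
div.row-cell.rank > span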

To learn how selectors work:
https://developer.mozilla.org/zh-TW/docs/Web/CSS/CSS_Selectors
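
As a small illustration of why the short selector is enough, here is a sketch that runs it against a hand-written HTML fragment. The fragment is a made-up stand-in loosely modeled on the class names above, not the site's real markup, and `SAMPLE_HTML`, `sample_soup`, and `rank_texts` are names chosen only for this example.

```python
import bs4

# Hypothetical markup, loosely modeled on the class names used in this post.
SAMPLE_HTML = """
<div class="row">
  <div class="row__top">
    <div class="row-cell rank"><span>1</span></div>
    <div class="row-cell other"><span>ignored</span></div>
  </div>
</div>
<div class="row">
  <div class="row__top">
    <div class="row-cell rank"><span>2</span></div>
  </div>
</div>
"""

sample_soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.row-cell.rank" matches a <div> that carries both classes;
# "> span" keeps only <span> elements that are direct children of it.
matched = sample_soup.select("div.row-cell.rank > span")
rank_texts = [elem.get_text() for elem in matched]
print(rank_texts)
```

The long selector copied from the browser pins down one specific element through its full ancestry, while the short one matches every rank cell regardless of where it sits in the page.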

4. Get the rank data

soup = bs4.BeautifulSoup(resp.text, "html.parser")
rank_selector = "div.row-cell.rank > span"
ranks_raw = soup.select(rank_selector)
print(ranks_raw)
# output:
# [<span data-v-81473954="">1</span>,
#  <span data-v-81473954="">2</span>,
#  ....,
#  <span data-v-81473954="">50</span>]

# Extract the inner content of each matched element
def select_and_innerHTML(selector):
  return [
    elem.encode_contents().decode()
    for elem in soup.select(selector)
  ]
ranks = select_and_innerHTML(rank_selector)
print(ranks)
# output:
# ["1","2", ..., "50"]
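
One thing to keep in mind: `encode_contents()` returns the inner HTML, so any nested tags or surrounding whitespace come along with the text. The sketch below compares it with `get_text(strip=True)` on a made-up cell (the markup is an assumption for illustration, not copied from the site).

```python
import bs4

# Hypothetical cell with a nested tag and padding whitespace.
cell = bs4.BeautifulSoup(
    '<div class="cell"> <span>12.3M</span> </div>', "html.parser"
).div

# Inner HTML: keeps the nested <span> and the surrounding spaces.
inner_html = cell.encode_contents().decode()

# Text only: tags are dropped and whitespace is stripped.
text_only = cell.get_text(strip=True)

print(repr(inner_html))
print(repr(text_only))
```

For the rank column the two give the same result because the `<span>` contains only a number, but for cells with nested markup `get_text` may be the more convenient choice.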

5. The same approach gives the channel name, follower (subscriber) count, country, views, and likes

names = select_and_innerHTML("div.contributor__name > div")
followers = select_and_innerHTML("div.row-cell.subscribers")
country = select_and_innerHTML("div.row-cell.audience")
views = select_and_innerHTML("div.row-cell.avg-views")
likes = select_and_innerHTML("div.row-cell.avg-likes")
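
All of these values come back as strings. If you want to sort or filter by them later, something like the sketch below could convert them to numbers. It assumes the counts are abbreviated in the style of "12.3M" or "450K", which is my guess at the format and not something verified against the page; `parse_count` and `SUFFIXES` are hypothetical helper names.

```python
# Multipliers for the assumed abbreviation suffixes.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert a string such as '12.3M' to an int; return None if it cannot be parsed."""
    cleaned = text.strip().replace(",", "")
    if not cleaned:
        return None
    multiplier = 1
    suffix = cleaned[-1].upper()
    if suffix in SUFFIXES:
        multiplier = SUFFIXES[suffix]
        cleaned = cleaned[:-1]
    try:
        return round(float(cleaned) * multiplier)
    except ValueError:
        # Anything unexpected (e.g. "N/A") is reported as missing.
        return None


print(parse_count("12.3M"))
print(parse_count("450K"))
```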

6. Links and tags need slightly different handling

6.1 Channel links

links = [elem["href"] for elem in soup.select("a.contributor-link")]

6.2 Channel tags

categories = [
  [
    elem.encode_contents().decode()
    for elem in midHTML.select("div.tag__content")
  ]
  for midHTML in soup.select("div.row-cell.category")
]
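
The nested comprehension matters because a channel can have several tags, or possibly none. The sketch below, run on made-up markup that only borrows the class names from above, shows how selecting the tags directly would flatten them and lose track of which channel each one belongs to.

```python
import bs4

# Hypothetical category cells: two tags, no tags, one tag.
SAMPLE_ROWS = """
<div class="row-cell category"><div class="tag__content">Music</div><div class="tag__content">Movies</div></div>
<div class="row-cell category"></div>
<div class="row-cell category"><div class="tag__content">Animation</div></div>
"""
sample_soup = bs4.BeautifulSoup(SAMPLE_ROWS, "html.parser")

# Flat selection: one list of all tags, with no link back to the row.
flat_tags = [
    tag.get_text(strip=True) for tag in sample_soup.select("div.tag__content")
]

# Nested selection: one inner list per category cell, so rows stay aligned.
nested_tags = [
    [tag.get_text(strip=True) for tag in cell.select("div.tag__content")]
    for cell in sample_soup.select("div.row-cell.category")
]

print(flat_tags)
print(nested_tags)
```

With the nested form, the outer list has exactly one entry per channel, which is what the DataFrame step needs.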

7. Assemble everything into a pandas DataFrame

df = pd.DataFrame(
    dict(
      ranks=ranks,
      names=names,
      links=links,
      categories=categories,
      followers=followers,
      country=country,
      views=views,
      likes=likes
    )
).set_index("ranks")
df
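
If one selector matches more or fewer elements than the others, `pd.DataFrame` raises a `ValueError` because the columns have different lengths. A small check like the sketch below can report which column is off before building the table. `column_lengths` and `assert_aligned` are hypothetical helpers written for this post, and the sample data is invented.

```python
def column_lengths(columns):
    """Return the number of scraped values per column name."""
    return {name: len(values) for name, values in columns.items()}


def assert_aligned(columns):
    """Raise ValueError naming every column length if they are not all equal."""
    lengths = column_lengths(columns)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return True


# Invented example data standing in for the scraped lists.
sample_columns = {
    "ranks": ["1", "2", "3"],
    "names": ["A", "B", "C"],
    "links": ["/a", "/b", "/c"],
}
print(column_lengths(sample_columns))
print(assert_aligned(sample_columns))
```

In the notebook you would pass the same dict that goes into `pd.DataFrame`, e.g. `assert_aligned(dict(ranks=ranks, names=names, links=links))`.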