Eurolygos ir Eurocup krepšinio turnyrų duomenų palyginimas¶
1. Duomenų surinkimas¶
1.1 Reikiamų bibliotekų importavimas¶
In [58]:
!pip install kaggle
!pip install pandas mysql-connector-python
Requirement already satisfied: kaggle in c:\users\sauli\anaconda3\lib\site-packages (1.6.17) Requirement already satisfied: six>=1.10 in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (1.16.0) Requirement already satisfied: certifi>=2023.7.22 in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (2024.12.14) Requirement already satisfied: python-dateutil in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (2.9.0.post0) Requirement already satisfied: requests in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (2.32.3) Requirement already satisfied: tqdm in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (4.66.5) Requirement already satisfied: python-slugify in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (5.0.2) Requirement already satisfied: urllib3 in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (2.2.3) Requirement already satisfied: bleach in c:\users\sauli\anaconda3\lib\site-packages (from kaggle) (4.1.0) Requirement already satisfied: packaging in c:\users\sauli\anaconda3\lib\site-packages (from bleach->kaggle) (24.1) Requirement already satisfied: webencodings in c:\users\sauli\anaconda3\lib\site-packages (from bleach->kaggle) (0.5.1) Requirement already satisfied: text-unidecode>=1.3 in c:\users\sauli\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\sauli\anaconda3\lib\site-packages (from requests->kaggle) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in c:\users\sauli\anaconda3\lib\site-packages (from requests->kaggle) (3.7) Requirement already satisfied: colorama in c:\users\sauli\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.6) Requirement already satisfied: pandas in c:\users\sauli\anaconda3\lib\site-packages (2.2.2) Requirement already satisfied: mysql-connector-python in c:\users\sauli\anaconda3\lib\site-packages (9.1.0) Requirement already satisfied: numpy>=1.26.0 in c:\users\sauli\anaconda3\lib\site-packages (from pandas) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\sauli\anaconda3\lib\site-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in c:\users\sauli\anaconda3\lib\site-packages (from pandas) (2024.1) Requirement already satisfied: tzdata>=2022.7 in c:\users\sauli\anaconda3\lib\site-packages (from pandas) (2023.3) Requirement already satisfied: six>=1.5 in c:\users\sauli\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
In [60]:
import zipfile
import os
import shutil
import pandas as pd
import kaggle
import mysql.connector
from sqlalchemy import create_engine, inspect, text
import glob
import seaborn as sns
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_sco
--------------------------------------------------------------------------- ImportError Traceback (most recent call last) Cell In[60], line 14 12 from sklearn.linear_model import LinearRegression 13 from sklearn.model_selection import train_test_split ---> 14 from sklearn.metrics import mean_squared_error, r2_sco ImportError: cannot import name 'r2_sco' from 'sklearn.metrics' (C:\Users\sauli\anaconda3\Lib\site-packages\sklearn\metrics\__init__.py)
In [ ]:
1.2 Atsiunčiami duomenys iš kaggle ir išskleidžiame parsiųstą archyvą¶
In [7]:
# susikurtas ir parsisiųstas kaggel API raktas kaggle.json {"username":"sauliuspnis","key":"282826122d40ea01d255b6b053343ca9"}, rkatas išsaugoas atitinkamae aplanke.
In [7]:
# Atsisiųsti duomenis į nurodytą aplanką
!kaggle datasets download --force -d babissamothrakis/euroleague-datasets
Dataset URL: https://www.kaggle.com/datasets/babissamothrakis/euroleague-datasets License(s): unknown Downloading euroleague-datasets.zip to C:\Users\sauli\OneDrive\Dokumentai\Saulius\DataAcademy\Baigiamasis darbas\eurolyga_eurocup
0%| | 0.00/102M [00:00<?, ?B/s] 1%| | 1.00M/102M [00:00<01:05, 1.62MB/s] 2%|1 | 2.00M/102M [00:00<00:37, 2.79MB/s] 3%|2 | 3.00M/102M [00:00<00:26, 3.87MB/s] 4%|3 | 4.00M/102M [00:01<00:20, 5.13MB/s] 7%|6 | 7.00M/102M [00:01<00:09, 10.2MB/s] 9%|8 | 9.00M/102M [00:01<00:08, 11.5MB/s] 12%|#1 | 12.0M/102M [00:01<00:06, 14.5MB/s] 17%|#6 | 17.0M/102M [00:01<00:06, 14.8MB/s] 19%|#8 | 19.0M/102M [00:02<00:06, 14.2MB/s] 21%|## | 21.0M/102M [00:02<00:05, 14.6MB/s] 24%|##4 | 25.0M/102M [00:02<00:04, 19.2MB/s] 28%|##8 | 29.0M/102M [00:02<00:04, 16.1MB/s] 32%|###2 | 33.0M/102M [00:02<00:04, 15.1MB/s] 34%|###4 | 35.0M/102M [00:03<00:05, 12.7MB/s] 36%|###6 | 37.0M/102M [00:03<00:05, 13.2MB/s] 38%|###8 | 39.0M/102M [00:03<00:05, 12.6MB/s] 42%|####2 | 43.0M/102M [00:03<00:03, 17.1MB/s] 45%|####4 | 46.0M/102M [00:03<00:03, 18.0MB/s] 48%|####7 | 49.0M/102M [00:03<00:02, 20.4MB/s] 51%|##### | 52.0M/102M [00:04<00:03, 17.0MB/s] 58%|#####7 | 59.0M/102M [00:04<00:01, 27.4MB/s] 62%|######1 | 63.0M/102M [00:04<00:02, 20.3MB/s] 65%|######5 | 67.0M/102M [00:04<00:01, 23.8MB/s] 71%|#######1 | 73.0M/102M [00:04<00:01, 29.4MB/s] 75%|#######5 | 77.0M/102M [00:04<00:00, 27.7MB/s] 79%|#######9 | 81.0M/102M [00:05<00:00, 30.0MB/s] 83%|########3 | 85.0M/102M [00:05<00:00, 26.9MB/s] 87%|########6 | 89.0M/102M [00:05<00:00, 30.0MB/s] 91%|######### | 93.0M/102M [00:05<00:00, 27.7MB/s] 94%|#########3| 96.0M/102M [00:05<00:00, 20.1MB/s] 97%|#########6| 99.0M/102M [00:06<00:00, 19.5MB/s] 100%|##########| 102M/102M [00:06<00:00, 17.5MB/s]
In [9]:
# Išskleidžiamas ZIP failas
with zipfile.ZipFile("euroleague-datasets.zip", 'r') as zip_ref: zip_ref.extractall()
print("Failai išskleisti.")
Failai išskleisti.
1.3 Nuskaitome duomenų failus ir įkeliame duomenis į mySQL¶
In [62]:
# prisijungimas į mysql duomenų bazę euro_league_cup
engine = create_engine('mysql+mysqlconnector://euro:euro@localhost:3306/euro_league_cup')
In [20]:
# pašalinam play_by_play csv failus
trinti = ['eurocup_play_by_play.csv', 'euroleague_play_by_play.csv','eurocup_points.csv', 'euroleague_points.csv']
for failas in trinti:
try:
os.remove(failas) # Ištrina failą
except FileNotFoundError:
print(f"Failas '{failas}' nerastas.")
In [14]:
# eurocup_players.csv faile pakeisti -inf į 0.0
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#
In [24]:
lenteles = glob.glob(f"*.csv")
lenteles
Out[24]:
['eurocup_box_score.csv', 'eurocup_comparison.csv', 'eurocup_header.csv', 'eurocup_players.csv', 'eurocup_teams.csv', 'euroleague_box_score.csv', 'euroleague_comparison.csv', 'euroleague_header.csv', 'euroleague_players.csv', 'euroleague_teams.csv']
In [26]:
for lentele in lenteles:
name = os.path.splitext(os.path.basename(lentele))[0]
print(f'nuskaitoma lentelė {name}')
data = pd.read_csv(lentele, low_memory=False)
data.to_sql(name=name, con=engine, index=False, if_exists='replace')
print(f'Įkelta {data.shape[0]} eilučių į lentelę "{name}"\n')
nuskaitoma lentelė eurocup_box_score Įkelta 91289 eilučių į lentelę "eurocup_box_score" nuskaitoma lentelė eurocup_comparison Įkelta 3675 eilučių į lentelę "eurocup_comparison" nuskaitoma lentelė eurocup_header Įkelta 3675 eilučių į lentelę "eurocup_header" nuskaitoma lentelė eurocup_players Įkelta 8195 eilučių į lentelę "eurocup_players" nuskaitoma lentelė eurocup_teams Įkelta 554 eilučių į lentelę "eurocup_teams" nuskaitoma lentelė euroleague_box_score Įkelta 115626 eilučių į lentelę "euroleague_box_score" nuskaitoma lentelė euroleague_comparison Įkelta 4513 eilučių į lentelę "euroleague_comparison" nuskaitoma lentelė euroleague_header Įkelta 4513 eilučių į lentelę "euroleague_header" nuskaitoma lentelė euroleague_players Įkelta 5929 eilučių į lentelę "euroleague_players" nuskaitoma lentelė euroleague_teams Įkelta 372 eilučių į lentelę "euroleague_teams"
1.4 Sukuriame Views mysql serveryje sujungiant turnyrų duomenis¶
In [28]:
v_lenteles = ['box_score', 'comparison', 'header', 'players', 'teams']
with engine.connect() as conn:
for table_suffix in v_lenteles:
view_name = f'combined_{table_suffix}'
# SQL užklausa „View“ sukūrimui
query = f"""
CREATE OR REPLACE VIEW {view_name} AS
SELECT *,'Euroleague' AS tournament FROM euroleague_{table_suffix}
UNION ALL
SELECT *,'Eurocup' AS tournament FROM eurocup_{table_suffix}
"""
# print(text(query))
# Vykdome užklausą
conn.execute(text(query))
print(f"View '{view_name}' sukurtas.")
View 'combined_box_score' sukurtas. View 'combined_comparison' sukurtas. View 'combined_header' sukurtas. View 'combined_players' sukurtas. View 'combined_teams' sukurtas.
In [85]:
# views = ['combined_box_score', 'combined_comparison', 'combined_header', 'combined_players', 'combined_teams']
# # Tikriname kiekvieno „view“ stulpelius
# with engine.connect() as conn:
# for view_name in views:
# query = text(f"DESCRIBE {view_name};")
# result = conn.execute(query)
# : print(f"\nStulpeliai „{view_name}“:")
# columns = [row[0] for row in result]
# print(columns)
2. Darbas powerBI¶
2.1 Įkeliame combined views į powerBI kaip faktų lenteles¶
2.2 Sukuriame DAX dimensijų lenteles¶
In [ ]:
# dim_metai =
# DISTINCT(
# SELECTCOLUMNS(
# 'fact_header',
# "Metai", RIGHT('fact_header'[season_code], 4)
# )
# )
In [ ]:
# dim_seasons = ADDCOLUMNS(
# DISTINCT(fact_header[season_code]),
# "Metai",
# VALUE(RIGHT(fact_header[season_code],4)
# ))
In [ ]:
# dim_teams = SELECTCOLUMNS(
# DISTINCT(
# 'fact_header'[team_id_a]
# ),
# "TeamID", 'fact_header'[team_id_a],
# "TeamName", CALCULATE(MAX('fact_header'[team_a])),
# "Tournament", CALCULATE(MAX('fact_header'[tournament]))
# )
In [ ]:
# dim_players = SELECTCOLUMNS(
# DISTINCT(
# 'fact_box_score'[player_id]
# ),
# "PlayerId", 'fact_box_score'[player_id] ,
# "PlayerName", CALCULATE(MAX('fact_box_score'[player])),
# "Tournament", CALCULATE(MAX('fact_box_score'[tournament]))
# )
2.3 Sujungiame lenteles ryšiais.¶
3. Analizė su Python¶
3.1 Komandų palyginimas panadojant seaborn¶
In [64]:
# pasiimamae duomenis iš csv_failo
# Views nuskaitymas į DataFrame
view_name = 'combined_teams' # Pakeiskite pagal savo view pavadinimą
query = f"SELECT * FROM {view_name}"
teams = pd.read_sql(query, con=engine)
teams.head()
Out[64]:
season_team_id | season_code | team_id | games_played | minutes | points | two_points_made | two_points_attempted | three_points_made | three_points_attempted | ... | total_rebounds_per_game | assists_per_game | steals_per_game | turnovers_per_game | blocks_favour_per_game | blocks_against_per_game | fouls_committed_per_game | fouls_received_per_game | valuation_per_game | tournament | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E2007_ARI | E2007 | ARI | 20.0 | 805.0 | 1502 | 381 | 773 | 142 | 395 | ... | 30.10 | 10.60 | 9.15 | 13.35 | 2.60 | 3.05 | 21.05 | 22.00 | 73.75 | Euroleague |
1 | E2007_BAM | E2007 | BAM | 14.0 | 559.1 | 879 | 210 | 421 | 100 | 292 | ... | 27.57 | 11.50 | 5.71 | 15.71 | 1.79 | 2.57 | 22.14 | 19.36 | 53.29 | Euroleague |
2 | E2007_BAR | E2007 | BAR | 23.0 | 919.8 | 1708 | 483 | 895 | 150 | 514 | ... | 33.87 | 14.26 | 9.39 | 12.74 | 3.78 | 2.87 | 19.65 | 19.52 | 81.57 | Euroleague |
3 | E2007_BAS | E2007 | BAS | 25.0 | 1005.0 | 2066 | 528 | 937 | 204 | 534 | ... | 32.32 | 13.32 | 7.96 | 12.64 | 2.60 | 2.68 | 20.20 | 21.44 | 89.16 | Euroleague |
4 | E2007_CIB | E2007 | CIB | 14.0 | 560.0 | 1080 | 281 | 524 | 106 | 295 | ... | 27.07 | 10.00 | 9.29 | 14.43 | 1.43 | 3.50 | 23.29 | 19.71 | 68.29 | Euroleague |
5 rows × 46 columns
In [1]:
# print("Stulpeliai:", teams.columns)
In [94]:
# import pandas as pd
# import seaborn as sns
# import matplotlib.pyplot as plt
# # Importuojame duomenis iš Power BI
# df = teams
# # Grupavimas ir duomenų paruošimas
# performance_data = df[['tournament', 'TeamName', 'three_points_made','three_points_made','fouls_received']]
# # Grupavimas pagal turnyrą ir komandą
# aggregated_data = performance_data.groupby(['tournament', 'TeamName']).sum().reset_index()
# # TOP 8 komandos pagal tritaškius
# top_teams = aggregated_data.nlargest(4, 'three_points_made')
# # Vizualizacija 1: TOP 8 komandų tritaškiai
# plt.figure(figsize=(12, 6))
# sns.barplot(x='TeamName', y='three_points_made', hue='tournament', data=top_teams, palette='viridis')
# plt.title('TOP 8 Komandų tritaškiai pagal turnyrą')
# plt.xlabel('Komanda')
# plt.ylabel('Tritaškiai')
# plt.xticks(rotation=10)
# plt.legend(title='Turnyras')
# plt.show()
# # Vizualizacija 2: TOP 8 komandų baudos
# plt.figure(figsize=(12, 6))
# sns.barplot(x='TeamName', y='fouls_received', hue='tournament', data=top_teams, palette='coolwarm')
# plt.title('TOP 8 Komandų baudos pagal turnyrą')
# plt.xlabel('Komanda')
# plt.ylabel('Baudos')
# plt.xticks(rotation=45)
# plt.legend(title='Turnyras')
# plt.show()
# # Vizualizacija 3: TOP 8 komandų atkovoti kamuoliai
# plt.figure(figsize=(12, 6))
# sns.barplot(x='TeamName', y='total_rebounds', hue='tournament', data=top_teams, palette='Blues')
# plt.title('TOP 8 Komandų atkovoti kamuoliai pagal turnyrą')
# plt.xlabel('Komanda')
# plt.ylabel('Atkovoti kamuoliai')
# plt.xticks(rotation=45)
# plt.legend(title='Turnyras')
# plt.show()
In [ ]:
# Dėl prastos kokybės atsisakyta pateikti plt šiame lape. :(
In [66]:
view_name = 'combined_box_score' # Pakeiskite pagal savo view pavadinimą
query = f"SELECT * FROM {view_name}"
box_score = pd.read_sql(query, con=engine)
box_score.head()
Out[66]:
game_player_id | game_id | game | round | phase | season_code | player_id | is_starter | is_playing | team_id | ... | assists | steals | turnovers | blocks_favour | blocks_against | fouls_committed | fouls_received | valuation | plus_minus | tournament | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E2007_191_PADS | E2007_191 | BAR-ROM | 17 | TOP 16 | E2007 | PADS | 1.0 | 1.0 | BAR | ... | 0 | 1 | 0 | 0 | 0 | 1 | 4 | 18 | 18 | Euroleague |
1 | E2007_191_PAZU | E2007_191 | BAR-ROM | 17 | TOP 16 | E2007 | PAZU | 0.0 | 1.0 | BAR | ... | 3 | 2 | 2 | 0 | 0 | 1 | 0 | 8 | 17 | Euroleague |
2 | E2007_191_PJJF | E2007_191 | BAR-ROM | 17 | TOP 16 | E2007 | PJJF | 1.0 | 1.0 | BAR | ... | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 14 | 0 | Euroleague |
3 | E2007_191_PAHW | E2007_191 | BAR-ROM | 17 | TOP 16 | E2007 | PAHW | 0.0 | 1.0 | BAR | ... | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 6 | -5 | Euroleague |
4 | E2007_191_PATA | E2007_191 | BAR-ROM | 17 | TOP 16 | E2007 | PATA | 1.0 | 1.0 | BAR | ... | 3 | 1 | 4 | 0 | 0 | 1 | 4 | 18 | 12 | Euroleague |
5 rows × 33 columns
In [72]:
print(box_score[['minutes', 'three_points_made', 'total_rebounds', 'assists', 'points']].isna().sum())
minutes 13 three_points_made 0 total_rebounds 0 assists 0 points 0 dtype: int64
3.2 Žaidėjų paplyginimas panaudojant seaborn¶
In [ ]:
# Sukurti 4 slicer po 2 Eurolygos ir Eurocup turnyrams ,pasirinkti komandosm ir žaidėjams
# Turi būti parinkit bent du žaidėjai ( multiselect) kiekvienam turnyrui.
# Jei neparinkta nieko - rodo du pirmus žaidėjus pagal plaeyrs sarašą iš fact_box_score lentelės :(
In [ ]:
# Įdedami grafikai kiekvienam turnųrui pagal parinktus filtrus. :)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Imame duomenis iš Power BI (kintamasis dataset)
df = dataset
# pyplot nustatymas
plt.figure(figsize=(17, 12))
# Patikriname, ar duomenų yra
if not df.empty:
# Unikalūs žaidėjų vardai iš slicer'ų
players = df["player"].unique()
if len(players) >= 2:
player1 = players[0]
player2 = players[1]
# Filtruojame kiekvieno žaidėjo duomenis
player1_data = df[df["player"] == player1]
player2_data = df[df["player"] == player2]
# Vizualizacija
metrics = ["points", "three_points_made", "total_rebounds", "fouls_received"]
for i, metric in enumerate(metrics):
plt.subplot(2, 2, i + 1)
sns.histplot(player1_data[metric], color="royalblue", label=f"{player1}", kde=True, alpha=0.5)
sns.histplot(player2_data[metric], color="orange", label=f"{player2}", kde=True, alpha=0.5)
plt.title(f"{metric}")
plt.legend()
plt.xlabel(metric.capitalize())
plt.ylabel("Frequency")
plt.gca().set_facecolor((1, 1, 1, 0.2))
plt.tight_layout()
plt.gcf().patch.set_alpha(0)
plt.show()
else:
print("Pasirinkite bent du skirtingus žaidėjus filtruose.")
text_output = "Pasirinkite bent du skirtingus žaidėjus filtruose."
lt.text(0.5, 0.5, text_output, fontsize=14, ha='center', va='center')
plt.axis('off') # Pašalina ašis
plt.show()
else:
print("Pasirinkite žaidėjus filtruose.")
text_output = "Pasirinkite bent du skirtingus žaidėjus filtruose."
lt.text(0.5, 0.5, text_output, fontsize=14, ha='center', va='center')
plt.axis('off') # Pašalina ašis
plt.show()
In [ ]:
# Atspausdina 4 dažnumo grafikus pagal parinktus žaidėjus. :)
4. Baigiamojo darbo santrauka¶
4.1 Įvadas¶
• Darbo tema: Eurolygos ir Eurocup krepšinio čempionatų palyginimas.¶
• Darbo tikslas: Išanalizuoti Eurolygos ir Eurocup krepšinio varžybų statistinius duomenis, siekiant patvirtinti ar paneigti suformuluotas hipotezes, naudojant SQL, Power BI ir Python.¶
4.2 Hipotezės¶
4.3.1 Eurolygos komandos meta daugiau tritaškių ir daugiau jų pataiko nei Eurocup komandos.¶
4.3.2 Eurolygos žaidejai daugiua buaduojasi todėl jų gynyba geresnė ir priešinikai įmeta mažiau taškų nei Eurocup žaidejai.¶
4.3.3 Eurolygos žaidėjia atkovoja daugiau kamuolių nei Eurocup žaidėjai.¶
4.3 Rezultatai¶
4.3.1 Eurolygos ir Eurocup komandos meta panašiai tritaškių ir panašiai jų pataiko. Atskitais sezonais pastebimas nežymu vieno ar kito turnyro rezultatų pranašumas.¶
4.3.2 Eurolygos ir Eurocup žaidėjai priklausomai nuo priešininko ginasi aktyviau ir atlieka panašiai pražangų stabdant varžovus.¶
4.3.3 Eurolygos žaidėjai meta daugiau metimų ir atkovoja daugiau kamuolių nei Eurocup turnyre.¶
In [ ]: