
Machine Learning Algorithms: Clustering

冰镇火锅聊AI


What Is Clustering?

Clustering is an unsupervised machine-learning technique that groups a set of data points into clusters based on similarity. The goal is for data points within the same cluster to be more similar to one another than to data points in other clusters. Similarity can be measured with various distance metrics, such as Euclidean distance or cosine similarity, depending on the nature of the data.
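As a quick illustration of how the choice of metric matters, here is a minimal NumPy sketch (the vectors are made-up toy values): Euclidean distance treats b as far from a because of its magnitude, while cosine similarity treats them as identical because they point in the same direction.

import numpy as np
from numpy.linalg import norm

# Made-up toy vectors for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude
c = np.array([1.1, 1.9, 3.2])   # close to a in absolute position

# Euclidean distance is sensitive to magnitude
print(norm(a - b))   # ~3.74: far in absolute terms
print(norm(a - c))   # ~0.24: near in absolute terms

# Cosine similarity is sensitive to direction only
cos = lambda x, y: x @ y / (norm(x) * norm(y))
print(cos(a, b))     # 1.0: identical direction
print(cos(a, c))     # ~0.998: nearly identical direction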

1. K-Means Clustering

K-Means partitions the data into k clusters by minimizing the distance between each data point and the centroid of its assigned cluster.
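To make "minimizing the distance to centroids" concrete, here is a minimal sketch of the two alternating steps K-Means repeats (Lloyd's algorithm); the toy data, the random initialization, and the fixed iteration count are assumptions for illustration, and the scikit-learn example below is what you would use in practice.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))  # toy data
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centroids

for _ in range(10):  # a fixed number of iterations keeps the sketch simple
    # Assignment step: attach each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points
    # (keeping the old centroid if a cluster ends up empty)
    centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

print(centroids)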

Use case:

Customer segmentation: e-commerce platforms use K-Means to group customers by purchasing behavior for targeted marketing.

Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset - Dataset can be downloaded from Kaggle
path = r"Clustering\archive\Mall_Customers.csv"
data = pd.read_csv(path)

# Display the first few rows of the dataset
print(data.head())

# Select the features 'Annual Income' and 'Spending Score'
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize the data to normalize it
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_scaled)

# Add the cluster labels to the original dataset
data['Cluster'] = kmeans.labels_

# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis', marker='o', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='x', label='Centroids')
plt.title("K-Means Clustering (Customer Segmentation)")
plt.xlabel("Annual Income (scaled)")
plt.ylabel("Spending Score (scaled)")
plt.legend()
plt.grid(True)
plt.show()
Visualization:

2. K-Medians Clustering

K-Medians computes cluster centers using the median rather than the mean, which makes it more robust to outliers.
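A two-line illustration of that robustness (the income figures are made up): one extreme value drags the mean far away but barely moves the median.

import numpy as np

incomes = np.array([40_000, 42_000, 45_000, 47_000, 1_000_000])  # one extreme outlier
print(np.mean(incomes))    # 234800.0 - pulled far upward by the outlier
print(np.median(incomes))  # 45000.0 - essentially unaffected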

Use case:

Clustering cities by median income: K-Medians can group cities by median income, which is useful in economic research.

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic dataset of cities with median income and population
data = {
    'City': ['City A', 'City B', 'City C', 'City D', 'City E', 'City F', 'City G', 'City H', 'City I', 'City J'],
    'Median Income': [50000, 62000, 58000, 30000, 95000, 42000, 74000, 36000, 87000, 40000],
    'Population': [100000, 150000, 120000, 80000, 250000, 90000, 130000, 70000, 200000, 110000]
}

# Convert to a DataFrame
data = pd.DataFrame(data)

# Display the dataset
print(data)

# Select the relevant features: 'Median Income' and 'Population'
X = data[['Median Income', 'Population']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering as a base
kmeans = KMeans(n_clusters=3, random_state=42).fit(X_scaled)

# Function to compute medians for clusters
def compute_medians(X, labels):
    unique_labels = np.unique(labels)
    medians = np.zeros((len(unique_labels), X.shape[1]))
    for label in unique_labels:
        cluster_points = X[labels == label]
        medians[label] = np.median(cluster_points, axis=0)
    return medians

# Iterate to approximate K-Medians
for _ in range(5):  # Run iterations to refine medians
    labels = kmeans.labels_
    medians = compute_medians(X_scaled, labels)
    # Update KMeans with the computed medians as initial centroids
    kmeans = KMeans(n_clusters=3, init=medians, n_init=1, random_state=42)
    kmeans.fit(X_scaled)

# Add cluster labels to the original dataset
data['Cluster'] = kmeans.labels_

# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis', marker='o', s=100)
plt.scatter(medians[:, 0], medians[:, 1], s=300, c='red', marker='x', label='Medians')

# Annotate the points with city names
for i, city in enumerate(data['City']):
    plt.text(X_scaled[i, 0], X_scaled[i, 1], city, fontsize=9, ha='right', color='black')

plt.title("Approximate K-Medians Clustering (Synthetic Data: Median Income and Population)")
plt.xlabel("Median Income (scaled)")
plt.ylabel("Population (scaled)")
plt.legend()
plt.grid(True)
plt.show()
Visualization:

3. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

BIRCH clusters incoming data incrementally, which makes it highly efficient on large datasets.
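The incremental behavior can be sketched with scikit-learn's Birch.partial_fit, which folds each arriving batch into BIRCH's CF-tree; the synthetic blobs and the ten-batch split are assumptions for illustration.

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Simulate a data stream arriving in batches
X, _ = make_blobs(n_samples=3000, centers=3, random_state=42)
birch = Birch(n_clusters=3)

for batch in np.array_split(X, 10):  # feed the stream 300 points at a time
    birch.partial_fit(batch)         # the CF-tree is updated incrementally

labels = birch.predict(X)            # final cluster assignments
print(np.bincount(labels))           # points per cluster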

Use case:

Real-time data clustering: BIRCH is well suited to clustering data in real-time applications such as IoT systems.

Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

# Create synthetic dataset representing cities, population, and sensor data (e.g., temperature or energy consumption)
data = {
    'City': ['City A', 'City B', 'City C', 'City D', 'City E', 'City F', 'City G', 'City H', 'City I', 'City J'],
    'Population': [100000, 150000, 120000, 80000, 250000, 90000, 130000, 70000, 200000, 110000],
    'Average Temperature (°C)': [22.4, 25.8, 21.3, 18.9, 27.5, 19.6, 23.1, 15.7, 26.3, 20.8],  # Synthetic sensor data
    'Energy Consumption (MWh)': [1200, 1500, 1100, 800, 2000, 900, 1300, 600, 1900, 950]  # Another form of sensor data
}

# Convert to a DataFrame
data = pd.DataFrame(data)

# Display the synthetic dataset
print(data)

# Select the relevant features: 'Population', 'Average Temperature', and 'Energy Consumption'
X = data[['Population', 'Average Temperature (°C)', 'Energy Consumption (MWh)']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply BIRCH clustering
birch_model = Birch(n_clusters=3)  # Using n_clusters=3 as an example
birch_model.fit(X_scaled)

# Add cluster labels to the original dataset
data['Cluster'] = birch_model.labels_

# Plot the clusters using Population vs Average Temperature for simplicity
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=birch_model.labels_, cmap='viridis', marker='o', s=100)
plt.title("BIRCH Clustering (Synthetic IoT Data: Population, Temperature, and Energy Usage)")

# Annotate the points with city names
for i, city in enumerate(data['City']):
    plt.text(X_scaled[i, 0], X_scaled[i, 1], city, fontsize=9, ha='right', color='black')

plt.xlabel("Population (scaled)")
plt.ylabel("Average Temperature (°C) (scaled)")
plt.grid(True)
plt.show()

# Display the dataset with cluster labels
print(data)
Visualization:

4. Fuzzy C-Means

Fuzzy C-Means assigns each data point to multiple clusters with varying degrees of membership, rather than making a hard assignment.
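The membership degrees live in the matrix u returned by skfuzzy: one row per cluster, one column per sample, and each column sums to 1. A minimal sketch on made-up 1-D data (the cluster count and the fuzzifier m=2 are assumptions):

import numpy as np
import skfuzzy as fuzz

# Made-up 1-D data: two loose groups plus a point in between,
# shaped (features, samples) as skfuzzy expects
x = np.array([[1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 3.0]])

cntr, u, _, _, _, _, _ = fuzz.cluster.cmeans(
    x, c=2, m=2, error=0.005, maxiter=1000, seed=42)

print(np.round(u, 2))  # membership degree of each sample in each cluster
print(u.sum(axis=0))   # every column sums to 1
# The in-between point (3.0) receives split memberships (roughly 0.5/0.5)
# instead of a hard label.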

Use case:

Image segmentation: Fuzzy C-Means can be used to segment medical images, where a pixel may belong to more than one tissue type.

Code:

# pip install scikit-fuzzy scikit-image numpy matplotlib
import numpy as np
import matplotlib.pyplot as plt
import skfuzzy as fuzz
from skimage import data
from skimage.color import rgb2gray
from sklearn.preprocessing import StandardScaler

# Load a sample image from skimage (replace this with actual medical image data if available)
# Convert the image to grayscale to simulate pixel intensities for segmentation
image = rgb2gray(data.astronaut())  # Using 'astronaut' image for demonstration; replace with medical image
image = image[100:200, 100:200]  # Crop part of the image for simplicity

# Display the original grayscale image
plt.figure(figsize=(5, 5))
plt.imshow(image, cmap='gray')
plt.title("Original Grayscale Image")
plt.show()

# Reshape the image into a 1D array (each pixel as a data point)
pixels = image.reshape(-1, 1)

# Standardize the pixel intensities
scaler = StandardScaler()
pixels_scaled = scaler.fit_transform(pixels)

# Apply Fuzzy C-Means clustering with 3 clusters (simulating 3 tissue types)
n_clusters = 3
cntr, u, _, _, _, _, _ = fuzz.cluster.cmeans(pixels_scaled.T, c=n_clusters, m=2, error=0.005, maxiter=1000)

# Assign each pixel to the cluster with the highest membership
cluster_labels = np.argmax(u, axis=0)

# Reshape the cluster labels back into the original image dimensions
segmented_image = cluster_labels.reshape(image.shape)

# Display the segmented image
plt.figure(figsize=(5, 5))
plt.imshow(segmented_image, cmap='viridis')
plt.title("Fuzzy C-Means Segmentation")
plt.show()
Visualization:

5. Mini-Batch K-Means

Mini-Batch K-Means is a faster variant of K-Means that updates the centroids from random subsets of the data, making it scalable to large datasets.
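The same idea supports out-of-core training: MiniBatchKMeans.partial_fit lets you update the centroids chunk by chunk when the data does not fit in memory (a minimal sketch on synthetic blobs; the sizes are assumptions).

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=4, random_state=42)
mbk = MiniBatchKMeans(n_clusters=4, random_state=42)

# Feed the data chunk by chunk, as if streaming it from disk
for chunk in np.array_split(X, 100):
    mbk.partial_fit(chunk)  # centroids are nudged by each mini-batch

print(mbk.cluster_centers_)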

Use case:

Document clustering: Mini-Batch K-Means works well for clustering large text corpora, for example grouping news articles by topic.

Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# Create a synthetic dataset of documents (simulating news articles)
documents = [
    "The economy is growing steadily with the GDP increasing by 3%.",
    "New advancements in technology are driving growth in various sectors.",
    "The recent election has resulted in significant changes in policies.",
    "Healthcare improvements have led to better outcomes for patients.",
    "The stock market is volatile with mixed results from major companies.",
    "Climate change continues to be a pressing issue around the world.",
    "Sports events are being affected by ongoing global health concerns.",
    "Travel restrictions are being lifted as vaccination rates rise.",
    "New trends in fashion are emerging from various parts of the globe.",
    "Education reform is essential for preparing future generations."
]

# Convert documents to a DataFrame
df = pd.DataFrame(documents, columns=['Document'])

# Display the first few documents
print(df)

# Convert text documents to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['Document'])

# Apply Mini Batch K-Means clustering
n_clusters = 3  # Adjust the number of clusters as needed
mbkmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=2, random_state=42)
mbkmeans.fit(X)

# Add cluster labels to the original dataset
df['Cluster'] = mbkmeans.labels_

# Display the documents with their corresponding cluster labels
print(df)

# Plotting the cluster assignments
# For visualization, we'll use the first two dimensions from the TF-IDF features
X_dense = X.toarray()  # Convert sparse matrix to dense

plt.figure(figsize=(10, 6))

# Scatter plot based on the first two features
plt.scatter(X_dense[:, 0], X_dense[:, 1], c=mbkmeans.labels_, cmap='viridis', marker='o', s=100)

# Annotate the points with document indices
for i, doc in enumerate(df['Document']):
    plt.text(X_dense[i, 0], X_dense[i, 1], str(i), fontsize=12, ha='right')

plt.title("Mini Batch K-Means Clustering (Synthetic Document Data)")
plt.xlabel("Feature 1 (TF-IDF)")
plt.ylabel("Feature 2 (TF-IDF)")
plt.grid(True)
plt.show()
6. DBSCAN

DBSCAN clusters data by density: it identifies clusters in high-density regions and treats points in low-density regions as noise.

Use case:

Geospatial clustering: DBSCAN is used to cluster GPS coordinates to identify areas of heavy activity, such as traffic congestion.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Simulate a dataset of GPS coordinates (latitude and longitude)
data = {
    'Latitude': [37.7749, 37.775, 37.7748, 37.7751, 37.7752, 37.758, 37.759, 37.760, 37.780, 37.781],
    'Longitude': [-122.4194, -122.4195, -122.4193, -122.4196, -122.4197, -122.428, -122.429, -122.430, -122.400, -122.401]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the dataset
print(df)

# Extract latitude and longitude for DBSCAN
X = df[['Latitude', 'Longitude']].values

# Apply DBSCAN for geospatial clustering
dbscan = DBSCAN(eps=0.01, min_samples=2)  # eps is the maximum distance between points in the same cluster
df['Cluster'] = dbscan.fit_predict(X)

# Plot the clusters
plt.figure(figsize=(10, 8))

# Assign different colors to different clusters; outliers are labeled -1
unique_labels = np.unique(df['Cluster'])
for cluster_label in unique_labels:
    if cluster_label == -1:
        # Plot noise points (outliers)
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'Longitude'],
                    df.loc[df['Cluster'] == cluster_label, 'Latitude'],
                    color='red', label='Noise', marker='x', s=100)
    else:
        # Plot clustered points
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'Longitude'],
                    df.loc[df['Cluster'] == cluster_label, 'Latitude'],
                    label=f'Cluster {cluster_label}', s=100)

# Add titles and labels
plt.title("DBSCAN Geospatial Clustering of GPS Coordinates", fontsize=16)
plt.xlabel("Longitude", fontsize=14)
plt.ylabel("Latitude", fontsize=14)
plt.legend()
plt.grid(True)

# Show plot
plt.show()

# Print the resulting DataFrame with cluster labels
print(df)
Visualization:

7. OPTICS

OPTICS is a density-based clustering algorithm similar to DBSCAN, but it can detect clusters of varying density.
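OPTICS's ability to handle varying densities is easiest to see in its reachability plot: each valley is a cluster, and deeper valleys are denser clusters. A minimal sketch (the two blobs with different spreads are assumptions for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# One tight blob and one diffuse blob
X, _ = make_blobs(n_samples=[200, 200], centers=[[0, 0], [10, 10]],
                  cluster_std=[0.5, 3.0], random_state=42)

optics = OPTICS(min_samples=10).fit(X)

# Reachability distances in cluster order: each valley is one cluster
reachability = optics.reachability_[optics.ordering_]
plt.plot(reachability)
plt.title("OPTICS Reachability Plot")
plt.xlabel("Points (cluster order)")
plt.ylabel("Reachability distance")
plt.show()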

Use case:

Astronomical data clustering: OPTICS is used to cluster astronomical data, for example identifying galaxy clusters of differing densities.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

# Simulate a dataset of astronomical data (e.g., positions of galaxies)
np.random.seed(42)

# Cluster 1 (dense cluster of galaxies)
galaxies_1 = np.random.normal(loc=[100, 200], scale=5, size=(150, 2))

# Cluster 2 (less dense cluster of galaxies)
galaxies_2 = np.random.normal(loc=[300, 100], scale=15, size=(100, 2))

# Cluster 3 (another dense cluster)
galaxies_3 = np.random.normal(loc=[400, 400], scale=5, size=(200, 2))

# Outliers (random points)
outliers = np.random.uniform(low=[50, 50], high=[450, 450], size=(50, 2))

# Combine all data points into one dataset
astronomical_data = np.vstack([galaxies_1, galaxies_2, galaxies_3, outliers])

# Create a DataFrame for the data
df = pd.DataFrame(astronomical_data, columns=['X', 'Y'])

# Apply OPTICS clustering
optics = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.1)
df['Cluster'] = optics.fit_predict(astronomical_data)

# Plot the clusters
plt.figure(figsize=(10, 8))

# Assign different colors to different clusters; outliers are labeled -1
unique_labels = np.unique(df['Cluster'])
for cluster_label in unique_labels:
    if cluster_label == -1:
        # Plot noise points (outliers)
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'X'],
                    df.loc[df['Cluster'] == cluster_label, 'Y'],
                    color='red', label='Noise', marker='x', s=50)
    else:
        # Plot clustered points
        plt.scatter(df.loc[df['Cluster'] == cluster_label, 'X'],
                    df.loc[df['Cluster'] == cluster_label, 'Y'],
                    label=f'Cluster {cluster_label}', s=50)

# Add titles and labels
plt.title("OPTICS Clustering of Astronomical Data (Simulated Galaxies)", fontsize=16)
plt.xlabel("X Coordinate", fontsize=14)
plt.ylabel("Y Coordinate", fontsize=14)
plt.legend()
plt.grid(True)

# Show plot
plt.show()

# Print the resulting DataFrame with cluster labels
print(df)
Visualization:

8. Fuzzy K-Modes

Fuzzy K-Modes clusters categorical data with fuzzy assignments similar to Fuzzy C-Means, but uses modes instead of means.
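The "mode" here is simply the most frequent category per attribute, so a cluster center is a row of modes rather than of means; a quick pandas illustration on made-up responses:

import pandas as pd

# Made-up categorical responses belonging to one cluster
cluster = pd.DataFrame({
    'Preferred Product': ['A', 'A', 'C', 'A'],
    'Payment Method': ['Cash', 'Credit Card', 'Cash', 'Cash'],
})

# The cluster's "centroid" is the per-column mode
print(cluster.mode().iloc[0])
# Preferred Product       A
# Payment Method       Cash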

Use case:

Categorical data clustering: Fuzzy K-Modes is useful for clustering survey responses or customer preferences.

Code:

# pip install kmodes
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

# Create a synthetic dataset representing survey responses (categorical data)
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'Preferred Product': ['A', 'B', 'A', 'C', 'C', 'B', 'A', 'C', 'B', 'C'],
    'Payment Method': ['Credit Card', 'Debit Card', 'Cash', 'Credit Card', 'Cash', 'Cash', 'Debit Card', 'Credit Card', 'Cash', 'Credit Card'],
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South']
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the dataset
print(df)

# Convert the categorical data to numeric using label encoding
# (KModes handles categorical data internally, but this helps with visualization)
df_numeric = df.apply(lambda x: pd.factorize(x)[0])

# Apply Fuzzy K-Modes (we're using standard K-Modes here, as Fuzzy K-Modes isn't readily available in libraries)
kmodes = KModes(n_clusters=3, init='Huang', n_init=5, verbose=1)

# Fit the model and predict clusters
clusters = kmodes.fit_predict(df_numeric)

# Add cluster labels to the original dataset
df['Cluster'] = clusters

# Display the DataFrame with cluster labels
print(df)

# Set up markers and colors for each cluster
markers = ['o', 's', 'D']  # Circle, square, and diamond for the three clusters
colors = ['blue', 'green', 'orange']

# Create the plot
plt.figure(figsize=(10, 6))

# Plot each cluster with a different color and marker
for i, cluster_label in enumerate(np.unique(clusters)):
    clustered_data = df[df['Cluster'] == cluster_label]
    plt.scatter(clustered_data.index, [i] * len(clustered_data), color=colors[i], marker=markers[i], s=150, label=f'Cluster {cluster_label}')

    # Annotate each point with its categorical values (Gender, Product, Payment Method)
    for j in clustered_data.index:
        plt.text(j, i, f"{df.loc[j, 'Gender']}, {df.loc[j, 'Preferred Product']}, {df.loc[j, 'Payment Method']}",
                 fontsize=9, ha='left', color='black')

# Add titles and labels
plt.title("Fuzzy K-Modes Clustering of Survey Responses", fontsize=16)
plt.xlabel("Survey Respondent Index", fontsize=14)
plt.ylabel("Cluster", fontsize=14)
plt.yticks([0, 1, 2], ['Cluster 0', 'Cluster 1', 'Cluster 2'])
plt.legend()

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()
Visualization:

9. Expectation-Maximization (EM)

The EM algorithm estimates the parameters of a probabilistic model; for clustering it is typically used with Gaussian Mixture Models (GMMs).
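The soft assignments computed in the E-step are exposed by scikit-learn as per-component posterior probabilities via predict_proba (a minimal sketch on made-up 1-D data; the means and sample counts are assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Toy 1-D data drawn from two overlapping Gaussians
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

# Responsibility of each component for a few query points
print(np.round(gmm.predict_proba([[0.0], [2.0], [4.0]]), 3))
# The point at 2.0, between the two means, gets split probabilities,
# while points near a mean belong almost entirely to that component.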

Use case:

Anomaly detection in finance: by fitting a GMM, EM can be used to detect outliers in financial transaction data.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Simulate a dataset representing financial transaction data
# Columns: 'Amount' (transaction amount), 'Time' (time of transaction)
np.random.seed(42)

# Cluster 1 (normal transactions, low amounts)
normal_1 = np.random.normal(loc=[100, 50], scale=[10, 5], size=(200, 2))

# Cluster 2 (normal transactions, high amounts)
normal_2 = np.random.normal(loc=[1000, 100], scale=[50, 10], size=(100, 2))

# Anomalous transactions (outliers)
anomalies = np.random.uniform(low=[5000, 150], high=[10000, 200], size=(10, 2))

# Combine the normal transactions and anomalies
transaction_data = np.vstack([normal_1, normal_2, anomalies])

# Create a DataFrame for the data
df = pd.DataFrame(transaction_data, columns=['Amount', 'Time'])

# Fit a Gaussian Mixture Model (GMM) with 2 components (normal transaction clusters)
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42)
gmm.fit(transaction_data)

# Compute the log-likelihood of each point under the fitted mixture
probs = gmm.score_samples(transaction_data)

# Define a threshold for anomalies (points with low likelihood are flagged)
threshold = np.percentile(probs, 2)  # The lowest 2% of points are flagged as anomalies
df['Anomaly'] = probs < threshold

# Plot the transactions and anomalies
plt.figure(figsize=(10, 8))

# Plot normal transactions
normal_transactions = df[df['Anomaly'] == False]
plt.scatter(normal_transactions['Amount'], normal_transactions['Time'], c='green', label='Normal Transactions', s=50)

# Plot anomalies
anomalous_transactions = df[df['Anomaly'] == True]
plt.scatter(anomalous_transactions['Amount'], anomalous_transactions['Time'], c='red', label='Anomalies', s=100, marker='x')

# Add titles and labels
plt.title("Anomaly Detection in Financial Transactions Using GMM", fontsize=16)
plt.xlabel("Transaction Amount", fontsize=14)
plt.ylabel("Transaction Time", fontsize=14)
plt.legend()
plt.grid(True)

# Show plot
plt.show()

# Display the first few rows of the dataset with the anomaly flag
print(df.head())
Visualization:

10. Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters, either by starting with each data point as its own cluster and merging (agglomerative), or by starting from one large cluster and splitting it (divisive). It does not require the number of clusters to be specified in advance.
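Even so, a flat partition can be extracted afterwards by cutting the tree at a chosen level, e.g., with scipy's fcluster (a minimal sketch; the toy data and the 'maxclust' criterion are assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two obvious groups of points
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])

Z = linkage(X, method='ward')  # build the full merge hierarchy

# Cut the tree so that exactly 2 flat clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g., [1 1 1 1 1 2 2 2 2 2]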

Use case:

Taxonomy construction: this algorithm is often used to build biological taxonomies, such as classifying species by genetic similarity.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# Simulate a dataset representing species genetic similarities (e.g., based on genetic markers)
np.random.seed(42)

# Genetic features for 6 species (simulating genetic markers or characteristics)
data = {
    'Species': ['Species A', 'Species B', 'Species C', 'Species D', 'Species E', 'Species F'],
    'Genetic Marker 1': [0.1, 0.3, 0.2, 0.5, 0.7, 0.9],
    'Genetic Marker 2': [0.2, 0.4, 0.2, 0.6, 0.8, 1.0],
    'Genetic Marker 3': [0.3, 0.1, 0.5, 0.4, 0.9, 0.8],
    'Genetic Marker 4': [0.4, 0.3, 0.6, 0.3, 0.6, 0.7]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Display the dataset
print(df)

# Extract only the genetic markers for clustering
genetic_data = df.iloc[:, 1:].values  # Exclude species names

# Standardize the data (important for genetic similarity clustering)
scaler = StandardScaler()
genetic_data_scaled = scaler.fit_transform(genetic_data)

# Apply hierarchical clustering using the 'ward' method (agglomerative)
linked = linkage(genetic_data_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linked, labels=df['Species'].values, leaf_rotation=90, leaf_font_size=12)
plt.title('Hierarchical Clustering Dendrogram (Genetic Similarities)', fontsize=16)
plt.xlabel('Species', fontsize=14)
plt.ylabel('Distance (Genetic Similarity)', fontsize=14)
plt.grid(True)
plt.tight_layout()
plt.show()
Visualization:

11. Minimum Spanning Tree (MST)

The Minimum Spanning Tree (MST) algorithm builds a tree that connects all data points with the minimum total edge weight. It is commonly used to minimize the cost of connecting nodes in a network, for example when laying cable.

MST is not a clustering algorithm in the traditional sense of K-Means or DBSCAN, but it can serve as a tool to assist the clustering process, as the sketch below shows.
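A minimal sketch of that idea, assuming toy 2-D points and k=2: build the MST, delete its k-1 longest edges, and read the surviving connected components as clusters.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial import distance_matrix

# Toy points forming two well-separated groups
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])

# MST over the complete pairwise-distance graph
mst = minimum_spanning_tree(distance_matrix(X, X)).toarray()

# Delete the (k - 1) longest MST edges to split the tree into k = 2 pieces
k = 2
cut = np.sort(mst[mst > 0])[-(k - 1):]  # weights of the edges to remove
mst[np.isin(mst, cut)] = 0

# Each remaining connected component is one cluster
n_clusters, labels = connected_components(mst, directed=False)
print(labels)  # e.g., [0 0 0 1 1 1]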

Use case:

Network design: MST can be used to design efficient network structures, such as power grids or pipelines, where the goal is to connect all nodes at minimum cost.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from scipy.spatial import distance_matrix

# Simulate a dataset representing cities (nodes) with x, y coordinates (positions)
np.random.seed(42)

# Generate random coordinates for 10 cities (nodes)
cities = pd.DataFrame({
    'City': ['City A', 'City B', 'City C', 'City D', 'City E', 'City F', 'City G', 'City H', 'City I', 'City J'],
    'X': np.random.uniform(0, 100, 10),
    'Y': np.random.uniform(0, 100, 10)
})

# Display the dataset
print(cities)

# Create a distance matrix representing the cost between each pair of cities
distance_matrix_df = pd.DataFrame(distance_matrix(cities[['X', 'Y']], cities[['X', 'Y']]),
                                  index=cities['City'], columns=cities['City'])

# Display the distance matrix
print("\nDistance Matrix:\n", distance_matrix_df)

# Create a graph from the distance matrix
G = nx.Graph()

# Add nodes (cities) to the graph
for i, city in cities.iterrows():
    G.add_node(city['City'], pos=(city['X'], city['Y']))

# Add edges (distances between cities) to the graph
for i in range(len(cities)):
    for j in range(i + 1, len(cities)):
        G.add_edge(cities['City'][i], cities['City'][j], weight=distance_matrix_df.iloc[i, j])

# Compute the Minimum Spanning Tree (MST) using Kruskal's algorithm
mst = nx.minimum_spanning_tree(G, algorithm='kruskal')

# Plot the cities and the MST
plt.figure(figsize=(10, 8))

# Get positions for the cities
pos = {city['City']: (city['X'], city['Y']) for i, city in cities.iterrows()}

# Draw the nodes (cities)
nx.draw_networkx_nodes(G, pos, node_size=500, node_color='lightblue', alpha=0.9, label='Cities')

# Draw the MST edges
nx.draw_networkx_edges(mst, pos, edgelist=mst.edges(), edge_color='green', width=2, label='MST Edges')

# Draw the edge labels (distances)
edge_labels = nx.get_edge_attributes(mst, 'weight')
nx.draw_networkx_edge_labels(mst, pos, edge_labels=edge_labels, font_size=8)

# Draw the city labels
nx.draw_networkx_labels(G, pos, font_size=10, font_family="sans-serif")

# Add title and labels
plt.title("Minimum Spanning Tree for Network Design (Simulated Cities)", fontsize=16)
plt.legend(loc='upper left')
plt.grid(True)
plt.show()

# Print the total weight (cost) of the MST
mst_total_cost = sum(nx.get_edge_attributes(mst, 'weight').values())
print(f"\nTotal cost of the Minimum Spanning Tree: {mst_total_cost:.2f}")
Visualization:
