Clustering and Unsupervised Machine Learning
Dealing with non-numeric data
- get the unique values of the column
- assign each unique value an integer id
- use the id as the numeric value in that column
import numpy as np

def handle_non_numerical_data(df):
    columns = df.columns.values
    for column in columns:
        # maps each unique value in this column to an integer id
        text_digit_vals = {}

        def convert_to_int(val):
            return text_digit_vals[val]

        # only convert columns that are not already numeric
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1
            df[column] = list(map(convert_to_int, df[column]))
    return df

set() -> get the unique values
map() -> apply the function to each element in the list
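The same idea can be tried on a single list, without a DataFrame. This is only a minimal sketch: the `colors` list is made-up sample data, and the unique values are sorted before ids are assigned (an assumption not in the notes above) so that the result is the same on every run, since iterating over a set has no guaranteed order.

```python
colors = ["red", "green", "red", "blue", "green"]  # made-up sample column

# set() keeps one copy of each distinct value; sorting makes the ids reproducible.
unique_values = sorted(set(colors))

# Assign each unique value an integer id.
value_to_id = {value: idx for idx, value in enumerate(unique_values)}

# map() applies the lookup to every element of the list.
encoded = list(map(value_to_id.get, colors))

print(value_to_id)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)      # [2, 1, 2, 0, 1]
```

Note that the ids carry no real ordering: "red" is not "greater" than "blue", it just happens to get a larger number.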
K-Means
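X = preprocessing.scale(X)
is necessary! K-Means works on distances, so unscaled features with large ranges would dominate the clustering.
df.drop(['ticket'], axis=1, inplace=True)
Do not forget the parameters: 'axis=1' (drop a column, not a row) and 'inplace=True' (modify df itself). Older pandas accepted the axis as a bare positional '1'; recent versions require the keyword.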
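A small sketch of both notes, assuming a toy DataFrame whose column names and values are invented for illustration. To keep it free of extra dependencies, the scaling is written out by hand with NumPy (subtract the column mean, divide by the column standard deviation), which is the kind of standardization `preprocessing.scale` performs; it is not a replacement for the library call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "ticket": ["A1", "B2", "C3", "D4"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# axis=1 drops a column (axis=0 would drop a row label);
# inplace=True modifies df itself instead of returning a new DataFrame.
df.drop(["ticket"], axis=1, inplace=True)

X = df.values.astype(float)

# Standardize each column: zero mean and unit standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(list(df.columns))          # ['age', 'fare']
print(X_scaled.std(axis=0))      # each column now has standard deviation 1
```

Without the scaling step, "fare" would outweigh "age" in every distance simply because its numbers are larger.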
np.linalg.norm() -> computes a vector norm; applied to the difference of two points it gives the distance between them. It has several options (for example ord and axis), so use them well.
np.average -> axis=0 averages down the rows, giving one value per column. A 2-dimensional array has two axes: the first runs vertically downwards across rows (axis 0), and the second runs horizontally across columns (axis 1).
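A quick check of both functions on hand-picked numbers (the arrays are illustrative only), showing how `ord` changes the distance and how `axis` changes the direction of the average:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean (L2) distance: the default norm of the difference vector.
euclidean = np.linalg.norm(a - b)          # sqrt(3^2 + 4^2) = 5.0
# ord=1 gives the Manhattan (L1) distance instead.
manhattan = np.linalg.norm(a - b, ord=1)   # 3 + 4 = 7.0

points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 9.0]])

# axis=0 averages down the rows: one value per column, i.e. the centroid.
centroid = np.average(points, axis=0)      # [3.0, 5.0]
# axis=1 averages across the columns: one value per row.
row_means = np.average(points, axis=1)     # [1.5, 3.5, 7.0]

print(euclidean, manhattan)
print(centroid, row_means)
```

In K-Means, the axis=0 average is the one that produces a new centroid from the points assigned to a cluster.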
For a 3D figure, see mplot3d:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
...
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
Mean Shift
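1. Initialize one unmarked point as the centroid; set the bandwidth and a new class label.
2. Find all points inside the circle (centroid, bandwidth), mark them as visited, and give each one a vote for this class: class(point) += 1.
3. Calculate the vectors from the centroid to all the points in the circle; shift = mean(vectors). (Use the mean rather than the raw sum, so the centroid moves to the average position of those points.)
4. center = center + shift
5. Repeat steps 2-4 until the centroid converges. If the converged centroid lies inside the circle of an existing centroid, merge the two classes; otherwise create a new class.
6. Repeat steps 1-5 until all the points are marked.
7. Classify: for each point, find the index of max(class(point)) and use that index as its class.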
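The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation. It uses a flat kernel, shifts by the mean of the vectors (the usual mean-shift update), and treats two centroids as the same class when they are closer than one bandwidth. The function name `mean_shift`, the `tol` and `max_iter` parameters, and the sample points are all assumptions made for the example.

```python
import numpy as np

def mean_shift(points, bandwidth, tol=1e-4, max_iter=100):
    """Flat-kernel mean shift following the steps above (illustrative sketch)."""
    n = len(points)
    visited = np.zeros(n, dtype=bool)
    centroids = []   # one converged centroid per class
    votes = []       # votes[k][i]: how often point i fell inside class k's circle

    while not visited.all():
        # Step 1: start from a point that has not been marked yet.
        center = points[np.flatnonzero(~visited)[0]].copy()
        count = np.zeros(n, dtype=int)

        for _ in range(max_iter):
            # Step 2: every point inside the circle gets a vote and is marked.
            in_circle = np.linalg.norm(points - center, axis=1) <= bandwidth
            if not in_circle.any():
                break
            count[in_circle] += 1
            visited |= in_circle
            # Steps 3-4: move the center by the mean of the vectors to those points.
            shift = (points[in_circle] - center).mean(axis=0)
            center = center + shift
            # Step 5: stop once the center no longer moves.
            if np.linalg.norm(shift) < tol:
                break

        # Step 5 (continued): merge with a nearby existing class, else add a new one.
        for k, existing in enumerate(centroids):
            if np.linalg.norm(center - existing) < bandwidth:
                votes[k] += count
                break
        else:
            centroids.append(center)
            votes.append(count)
        # Step 6: the while loop repeats until every point is marked.

    # Step 7: each point takes the class that voted for it most often.
    labels = np.argmax(np.array(votes), axis=0)
    return np.array(centroids), labels

# Made-up sample data: two well-separated groups of three points each.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
centroids, labels = mean_shift(points, bandwidth=2.0)
print(centroids)
print(labels)   # [0 0 0 1 1 1]
```

The bandwidth is the only parameter that really matters here: too small and every point becomes its own class, too large and everything merges into one.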
print(df.describe())
-> shows a concise statistical summary of each numeric column (count, mean, std, min, quartiles, max)