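Machine learning is inseparable from data science, so here I try using Python to present data visually through tables and charts.
This post contains no predictions or data analysis; it only uses numpy, pandas and matplotlib to extract key information from a large amount of data.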

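Compared with SPSS, processing data in Python is more flexible and the resulting figures are more polished; linear algebra work such as matrix handling is especially convenient.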

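The material is basic and not hard, but it is indispensable. Beyond the commonly used functions there are many more, which I will add here as I come to use them.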

Experimenting with the covid_19_data dataset

Data source:

Displaying the covid_19_data data

There are too many rows to show, so head() displays only the first five.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
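from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed from sklearn.preprocessing in scikit-learn 0.22 (not used below)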
df = pd.read_csv('covid_19_data.csv')
df.head()
   SNo  ObservationDate  Province/State  Country/Region      Last Update  Confirmed  Deaths  Recovered
0    1       01/22/2020           Anhui  Mainland China  1/22/2020 17:00          1       0          0
1    2       01/22/2020         Beijing  Mainland China  1/22/2020 17:00         14       0          0
2    3       01/22/2020       Chongqing  Mainland China  1/22/2020 17:00          6       0          0
3    4       01/22/2020          Fujian  Mainland China  1/22/2020 17:00          1       0          0
4    5       01/22/2020           Gansu  Mainland China  1/22/2020 17:00          0       0          0

Dropping columns that are not needed

df.drop(['SNo','Last Update'],axis=1,inplace=True)
df.head()
ObservationDate Province/State Country/Region Confirmed Deaths Recovered
0 01/22/2020 Anhui Mainland China 1 0 0
1 01/22/2020 Beijing Mainland China 14 0 0
2 01/22/2020 Chongqing Mainland China 6 0 0
3 01/22/2020 Fujian Mainland China 1 0 0
4 01/22/2020 Gansu Mainland China 0 0 0
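
As a small illustration of the column removal above, here is a minimal sketch on a made-up two-row frame (the values are copied from the head() output; the names `toy` and `trimmed` are hypothetical). It uses the non-inplace form of drop, which returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Toy frame with some of the column names of covid_19_data.csv (illustrative only)
toy = pd.DataFrame({
    "SNo": [1, 2],
    "ObservationDate": ["01/22/2020", "01/22/2020"],
    "Province/State": ["Anhui", "Beijing"],
    "Last Update": ["1/22/2020 17:00", "1/22/2020 17:00"],
    "Confirmed": [1, 14],
})

# drop(columns=...) returns a new DataFrame; `toy` keeps all of its columns
trimmed = toy.drop(columns=["SNo", "Last Update"])
print(list(trimmed.columns))
```

Compared with inplace=True, this form makes it easy to keep the raw frame around for later checks.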

Renaming columns

df.rename(columns={'ObservationDate':'Date','Province/State':'Province','Country/Region':'Country'},inplace=True)
df.head()
Date Province Country Confirmed Deaths Recovered
0 01/22/2020 Anhui Mainland China 1 0 0
1 01/22/2020 Beijing Mainland China 14 0 0
2 01/22/2020 Chongqing Mainland China 6 0 0
3 01/22/2020 Fujian Mainland China 1 0 0
4 01/22/2020 Gansu Mainland China 0 0 0

Converting the Date column to a datetime type

df['Date'] = pd.to_datetime(df['Date'])
df.head()
Date Province Country Confirmed Deaths Recovered
0 2020-01-22 Anhui Mainland China 1 0 0
1 2020-01-22 Beijing Mainland China 14 0 0
2 2020-01-22 Chongqing Mainland China 6 0 0
3 2020-01-22 Fujian Mainland China 1 0 0
4 2020-01-22 Gansu Mainland China 0 0 0
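
The dates in the output above are written month/day/year. A minimal sketch, assuming that format holds for the whole column, passes it explicitly so pandas does not have to guess; the sample strings and variable names are made up for illustration.

```python
import pandas as pd

# Two sample strings in the month/day/year layout seen in the data
dates = pd.Series(["01/22/2020", "02/01/2020"])

# An explicit format avoids ambiguity between month-first and day-first parsing
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

print(parsed.dt.month.tolist())  # [1, 2]
print(parsed.dt.day.tolist())    # [22, 1]
```

Once the column is a datetime type, the .dt accessor gives access to parts such as month and day.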

describe() conveniently shows summary statistics for the numeric columns

df.describe()
Confirmed Deaths Recovered
count 53927.000000 53927.000000 53927.000000
mean 8629.160625 509.275558 3567.531812
std 28030.847686 2438.846476 20210.864799
min 0.000000 0.000000 0.000000
25% 76.000000 1.000000 0.000000
50% 712.000000 11.000000 75.000000
75% 4039.500000 124.000000 985.000000
max 405843.000000 41128.000000 720631.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53927 entries, 0 to 53926
Data columns (total 6 columns):
Date         53927 non-null datetime64[ns]
Province     32870 non-null object
Country      53927 non-null object
Confirmed    53927 non-null int64
Deaths       53927 non-null int64
Recovered    53927 non-null int64
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 2.5+ MB

Filling missing values with the string 'NA'

df=df.fillna('NA')
df.head()
Date Province Country Confirmed Deaths Recovered
0 2020-01-22 Anhui Mainland China 1 0 0
1 2020-01-22 Beijing Mainland China 14 0 0
2 2020-01-22 Chongqing Mainland China 6 0 0
3 2020-01-22 Fujian Mainland China 1 0 0
4 2020-01-22 Gansu Mainland China 0 0 0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53927 entries, 0 to 53926
Data columns (total 6 columns):
Date         53927 non-null datetime64[ns]
Province     53927 non-null object
Country      53927 non-null object
Confirmed    53927 non-null int64
Deaths       53927 non-null int64
Recovered    53927 non-null int64
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 2.5+ MB
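
To see what fillna changes, it can help to count missing values before and after. The sketch below does this on a made-up three-row frame (hypothetical names and values) and fills only the text column, so numeric columns are not touched by a string placeholder.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing province (illustrative values)
toy = pd.DataFrame({
    "Province": ["Anhui", np.nan, "Beijing"],
    "Country": ["Mainland China", "Japan", "Mainland China"],
    "Confirmed": [1, 2, 14],
})

missing_before = toy.isna().sum()                  # per-column count of missing values
toy["Province"] = toy["Province"].fillna("NA")     # fill only the text column
missing_after = toy.isna().sum()

print(missing_before["Province"], missing_after["Province"])  # 1 0
```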

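This command groups the rows by country and sums each group. groupby sorts the group keys (lexicographically for strings), and reset_index() then turns the Country key back into an ordinary column with a fresh integer index. The grouping key is left out of the column selection because groupby already carries it. Note that the counts appear to be cumulative per date, so these sums over all dates are not case totals.

df2 = df.groupby('Country')[['Confirmed','Deaths','Recovered']].sum().reset_index()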
df2.head()
Country Confirmed Deaths Recovered
0 Azerbaijan 1 0 0
1 ('St. Martin',) 2 0 0
2 Afghanistan 1005001 20790 211850
3 Albania 97617 3015 62645
4 Algeria 623538 50022 367869

Next, group by both country and date, which adds up the provinces of each country for each day

df2 = df.groupby(['Country','Date'])[['Confirmed','Deaths','Recovered']].sum().reset_index()
df2.head()
Country Date Confirmed Deaths Recovered
0 Azerbaijan 2020-02-28 1 0 0
1 ('St. Martin',) 2020-03-10 2 0 0
2 Afghanistan 2020-02-24 1 0 0
3 Afghanistan 2020-02-25 1 0 0
4 Afghanistan 2020-02-26 1 0 0
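
The counts in this dataset appear to be cumulative (an assumption based on how the values only grow over time). Under that assumption, one way to get a per-country figure is to sum the provinces for each date and then keep the latest date. The sketch below uses made-up countries X and Y; all names and numbers are hypothetical.

```python
import pandas as pd

# Toy cumulative counts: country X has two provinces, country Y has none
toy = pd.DataFrame({
    "Country": ["X", "X", "X", "X", "Y", "Y"],
    "Province": ["P1", "P2", "P1", "P2", "NA", "NA"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02",
                            "2020-03-02", "2020-03-01", "2020-03-02"]),
    "Confirmed": [10, 5, 12, 6, 3, 4],
})

# Step 1: one row per country and date (provinces added together)
daily = toy.groupby(["Country", "Date"], as_index=False)["Confirmed"].sum()

# Step 2: keep only the most recent date for each country
latest = daily.sort_values("Date").drop_duplicates("Country", keep="last")

print(latest.set_index("Country")["Confirmed"].to_dict())
```

Summing over every date instead would add the same cases many times over.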

Keeping only the rows with more than 100 confirmed cases

df3 = df2[df2['Confirmed']>100]
df3.head()
Country Date Confirmed Deaths Recovered
34 Afghanistan 2020-03-27 110 4 2
35 Afghanistan 2020-03-28 110 4 2
36 Afghanistan 2020-03-29 120 4 2
37 Afghanistan 2020-03-30 170 4 2
38 Afghanistan 2020-03-31 174 4 5

Plotting with matplotlib

countries = df3['Country'].unique()
len(countries)
170
# One figure per country: a line for confirmed cases, scatter points for recovered and deaths
for country in countries:
    C = df3[df3['Country']==country].reset_index()
    days = np.arange(len(C))  # x axis: row number within this country's filtered data
    plt.plot(days,C['Confirmed'],color='blue',label='Confirmed')
    plt.scatter(days,C['Recovered'],color='green',label='Recovered')
    plt.scatter(days,C['Deaths'],color='red',label='Deaths')
    plt.title(country)
    plt.xlabel('Days')
    plt.ylabel('Numbers')
    plt.legend()
    plt.show()

Resulting plot for the first country
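
As a variation on the plotting loop, the x axis can show the actual dates instead of a day counter. This is only a sketch: it uses three rows copied from the Afghanistan output above, the object-oriented matplotlib interface, and the non-interactive Agg backend (an assumption made so it runs without a display).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Three rows copied from the filtered output for Afghanistan
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-27", "2020-03-28", "2020-03-29"]),
    "Confirmed": [110, 110, 120],
    "Deaths": [4, 4, 4],
})

fig, ax = plt.subplots()
ax.plot(toy["Date"], toy["Confirmed"], color="blue", label="Confirmed")
ax.plot(toy["Date"], toy["Deaths"], color="red", label="Deaths")
ax.set_title("Afghanistan")
ax.set_xlabel("Date")
ax.set_ylabel("Numbers")
ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

In a notebook, plt.show() or fig.savefig(...) would follow to display or store the figure.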

Commonly used numpy features

A roundup of the handful of functions I have used recently, so I can come back and copy them when I forget.

import numpy as np

Basic operations on numpy arrays and their format

A numpy array (its counterpart of a list) is written like this

a = np.array([1,2,3,5,7])
b = np.array((2,3,5),dtype='f')
print(a)

[1 2 3 5 7]

type(a)

numpy.ndarray

a.dtype

dtype('int32')

Showing the number of dimensions

a = np.array([[1,2,3],[4,5,6]])
a.ndim

2

Getting the element at a given index

a[0,2]

3

c = np.array([[[1,2,3],[4,5,6],[0,0,-1]],[[-1,-2,-3],[-4,-5,-6],[0,0,-1]]])
c.ndim

3

c[1,0,2]

-3

Inspecting the shape, element count and size in bytes

c.shape

(2, 3, 3)

c.size

18

c.nbytes

72

Generating ranges with a step, and random values

A = np.arange(20,100,3)
print(A)

[20 23 26 29 32 35 38 41 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98]

print(list(range(10)))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

A = np.random.permutation(np.arange(10))
print(A)

[2 6 8 7 3 4 9 1 5 0]

v = np.random.randint(20,30)
type(v)

int

A = np.random.rand(1000)
B = np.random.randn(100000)
C = np.random.rand(2,3)
C

array([[0.19953212, 0.07064521, 0.8146567 ], [0.2638225 , 0.67922568, 0.15299028]])

C = np.random.rand(2,3,4,2)
D = np.arange(100).reshape(4,5,5)
D.shape

(4, 5, 5)

A = np.arange(100)
b = A[3:10]
print(b)

[3 4 5 6 7 8 9]

b[0] = -1200
b

array([-1200, 4, 5, 6, 7, 8, 9])

Here the original A changes as well, because the slice b is a view of A rather than a copy

A

array([ 0, 1, 2, -1200, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98,
99])

b = A[3:10].copy()
b

array([-1200, 4, 5, 6, 7, 8, 9])

Modifying an element

b[2] = -111111
b

array([ -1200, 4, -111111, 6, 7, 8, 9])

The difference: b was created with copy(), so modifying it leaves the original A unchanged

A

array([ 0, 1, 2, -1200, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98,
99])
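
The view-versus-copy behaviour shown above can also be checked directly with np.shares_memory. A minimal sketch, with hypothetical variable names:

```python
import numpy as np

A = np.arange(10)
view = A[3:6]            # basic slicing returns a view onto A's memory
copied = A[3:6].copy()   # copy() allocates separate memory

view[0] = -1             # writes through to A[3]
copied[1] = -2           # only changes the copy

print(np.shares_memory(A, view))    # True
print(np.shares_memory(A, copied))  # False
print(A[3], A[4])                   # -1 4
```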

Slicing with a step (every second element, forwards and backwards)

A[::2]

array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98])

A[::-2]

array([ 99, 97, 95, 93, 91, 89, 87, 85, 83,
81, 79, 77, 75, 73, 71, 69, 67, 65,
63, 61, 59, 57, 55, 53, 51, 49, 47,
45, 43, 41, 39, 37, 35, 33, 31, 29,
27, 25, 23, 21, 19, 17, 15, 13, 11,
9, 7, 5, -1200, 1])

Extracting a range of values with a boolean mask

B = A[(A<40)&(A>30)]
B

array([31, 32, 33, 34, 35, 36, 37, 38, 39])
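
The filtering expression above works in two steps: the comparisons build a boolean array, and indexing with that array keeps the elements where it is True. The sketch below spells the steps out; the parentheses are needed because & binds more tightly than < and >.

```python
import numpy as np

A = np.arange(100)

# Each comparison yields a boolean array; & combines them element by element
mask = (A > 30) & (A < 40)

selected = A[mask]                 # values where the mask is True
positions = np.flatnonzero(mask)   # indices where the mask is True

print(mask.sum())    # 9
print(selected)      # [31 32 33 34 35 36 37 38 39]
```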

Matrix handling is convenient

A = np.round(10*np.random.rand(5,3))
A

array([[ 4., 7., 0.],[10., 2., 8.],[ 3., 7., 9.],[ 1., 8., 1.],[ 2., 2., 9.]])

A = np.round(100*np.random.rand(5,3))
A

array([[13., 5., 54.],
[93., 34., 42.],
[74., 97., 66.],
[20., 77., 86.],
[75., 28., 94.]])

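Adding 3 to every element. Note that A + 3 returns a new array and A itself is unchanged, which is why the output of A below still shows the original values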

A + 3
A

array([[13., 5., 54.],
[93., 34., 42.],
[74., 97., 66.],
[20., 77., 86.],
[75., 28., 94.]])

A+(np.arange(5).reshape(5,1))

array([[13., 5., 54.],
[94., 35., 43.],
[76., 99., 68.],
[23., 80., 89.],
[79., 32., 98.]])
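
Adding a (5, 1) column to a (5, 3) matrix, as above, relies on numpy broadcasting: the column is stretched across the three columns of the matrix. A minimal sketch with a zero matrix, so the added offsets are easy to read (variable names are hypothetical):

```python
import numpy as np

A = np.zeros((5, 3))
col = np.arange(5).reshape(5, 1)   # shape (5, 1): one offset per row
row = np.arange(3)                 # shape (3,): one offset per column

by_row = A + col   # row i gets i added to each of its elements
by_col = A + row   # column j gets j added to each of its elements

print(by_row[4])   # [4. 4. 4.]
print(by_col[0])   # [0. 1. 2.]
```

Neither addition modifies A; an in-place update would be written as A += col.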

B = np.round(10*np.random.rand(5,2))
B

array([[ 9., 9.],
[ 5., 7.],
[ 7., 2.],
[ 3., 9.],
[ 6., 10.]])

Joining matrices side by side

C = np.hstack((A,B))
C
array([[13.,  5., 54.,  9.,  9.],
       [93., 34., 42.,  5.,  7.],
       [74., 97., 66.,  7.,  2.],
       [20., 77., 86.,  3.,  9.],
       [75., 28., 94.,  6., 10.]])
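
hstack joins arrays along columns, so the row counts must match; vstack joins along rows, and np.concatenate covers both through its axis argument. A minimal sketch with hypothetical arrays of ones and zeros:

```python
import numpy as np

A = np.ones((5, 3))
B = np.zeros((5, 2))
C = np.zeros((2, 3))

side_by_side = np.hstack((A, B))                 # (5, 3) and (5, 2) -> (5, 5)
stacked = np.vstack((A, C))                      # (5, 3) and (2, 3) -> (7, 3)
same_as_hstack = np.concatenate((A, B), axis=1)  # axis=1 joins along columns

print(side_by_side.shape, stacked.shape)  # (5, 5) (7, 3)
```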

A = np.random.permutation(np.arange(10))
A

array([6, 5, 4, 2, 9, 3, 1, 8, 7, 0])

Sorting

A.sort()
A

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

A[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

Sorting strings

A = np.array(["abc","how are you","akas"])
A.sort()
A

array(['abc', 'akas', 'how are you'], dtype='<U11')
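
A.sort() sorts the array in place. When the original order is still needed, np.sort returns a sorted copy and np.argsort returns the indices that would sort the array. A minimal sketch with a made-up array:

```python
import numpy as np

A = np.array([6, 5, 4, 2, 9])

sorted_copy = np.sort(A)        # sorted copy; A keeps its order
order = np.argsort(A)           # indices that would sort A
descending = np.sort(A)[::-1]   # reverse the sorted copy for descending order

print(sorted_copy)  # [2 4 5 6 9]
print(order)        # [3 2 1 0 4]
print(A)            # [6 5 4 2 9]
```

argsort is handy for reordering a second array in the same way, for example names ordered by their scores.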

pandas notes

Mainly used for building tables

import pandas as pd

Key-value pairs (Series)

A = pd.Series([2,3,4,5],index = ['a','b','c','d'])
A.index

Index(['a', 'b', 'c', 'd'], dtype='object')

A['a']

2

A['a':'c']

a 2
b 3
c 4
dtype: int64

Example: grade points

grads_dict = {'A':4,'B':3.5,'C':3,'D':2.5}
grads = pd.Series(grads_dict)
grads.values

array([4. , 3.5, 3. , 2.5])

Example: selecting data

marks_dic = {'A':85,'B':75,'C':65,'D':55}
marks = pd.Series(marks_dic)
marks

A 85
B 75
C 65
D 55
dtype: int64

marks['B' : 'D']

B 75
C 65
D 55
dtype: int64

marks[0:2]

A 85
B 75
dtype: int64

marks

A 85
B 75
C 65
D 55
dtype: int64

grads

A 4.0
B 3.5
C 3.0
D 2.5
dtype: float64

Building a table (DataFrame)

D = pd.DataFrame({'Marks':marks,'Grades':grads})
D
Marks Grades
A 85 4.0
B 75 3.5
C 65 3.0
D 55 2.5
D.T
A B C D
Marks 85.0 75.0 65.0 55.0
Grades 4.0 3.5 3.0 2.5
D.values

array([[85. , 4. ],
[75. , 3.5],
[65. , 3. ],
[55. , 2.5]])

D.values[2,0]

65.0

D.columns

Index(['Marks', 'Grades'], dtype='object')

D['ScalesMarks'] = 100*D['Marks']/90
D
Marks Grades ScalesMarks
A 85 4.0 94.444444
B 75 3.5 83.333333
C 65 3.0 72.222222
D 55 2.5 61.111111
del D['ScalesMarks']
D
Marks Grades
A 85 4.0
B 75 3.5
C 65 3.0
D 55 2.5
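
Adding and deleting a column, as above, changes D in place. A sketch of a non-mutating alternative uses assign and drop, which both return new frames. The column name ScaledMarks and the variable names are chosen here for illustration.

```python
import pandas as pd

D = pd.DataFrame(
    {"Marks": [85, 75, 65, 55], "Grades": [4.0, 3.5, 3.0, 2.5]},
    index=["A", "B", "C", "D"],
)

# assign returns a new frame with the extra column; D is unchanged
scaled = D.assign(ScaledMarks=100 * D["Marks"] / 90)

# drop(columns=...) returns a new frame without that column
trimmed = scaled.drop(columns="ScaledMarks")

print(list(D.columns))       # ['Marks', 'Grades']
print(list(scaled.columns))  # ['Marks', 'Grades', 'ScaledMarks']
```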

Filtering rows

G = D[D['Marks']>70]
G
Marks Grades
A 85 4.0
B 75 3.5
A = pd.DataFrame([{'a':1,'b':4},{'b':-3,'c':9}])
A
a b c
0 1.0 4 NaN
1 NaN -3 9.0

Filling missing values

A.fillna(0)
a b c
0 1.0 4 0.0
1 0.0 -3 9.0
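
fillna also accepts a dictionary, so each column can get its own replacement value. The sketch below reuses the two-row frame from above; the fill values 0 and -1 are arbitrary choices for illustration.

```python
import pandas as pd

A = pd.DataFrame([{"a": 1, "b": 4}, {"b": -3, "c": 9}])

# Columns with a missing entry are stored as float, since NaN is a float value
print(A.dtypes.astype(str).to_dict())

# A dict maps column name -> fill value; the result is a new frame
filled = A.fillna({"a": 0, "c": -1})

print(filled.loc[1, "a"], filled.loc[0, "c"])  # 0.0 -1.0
```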
A = pd.Series(['a','b','c'],index=[1,3,5])
A[1]

'a'

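Mind the indexing: with an integer index, A[1] looks up the label 1, the plain slice A[1:3] selects by position (end excluded), and A.loc[1:3] selects by label (end included)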

A[1:3]

3 b
5 c
dtype: object

A.loc[1:3]

1 a
3 b
dtype: object
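
To avoid the ambiguity between labels and positions on an integer index, the two accessors can be used explicitly: loc for labels and iloc for positions. A minimal sketch on the same Series:

```python
import pandas as pd

A = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = A.loc[1:3]       # labels 1 through 3, end label included
by_position = A.iloc[1:3]   # positions 1 and 2, end position excluded

print(by_label.tolist())      # ['a', 'b']
print(by_position.tolist())   # ['b', 'c']
```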

Last modification:July 5th, 2020 at 05:37 pm