Data Cleaning and Preparation
  In data analysis and modeling, a large share of the time (often 80% or more) goes into data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases is not the right format for a particular task.
  This chapter discusses missing data, duplicate data, string manipulation, and some other analytical data transformations.
   Handling Missing Data
  Missing numeric values are represented by the floating-point value NaN (Not a Number).
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]: string_data
Out[4]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]: string_data.isnull()
Out[5]:
0    False
1    False
2     True
3    False
dtype: bool

In [6]: string_data[0] = None

In [7]: string_data.isnull()
Out[7]:
0     True
1    False
2     True
3    False
dtype: bool

NA Handling Methods
Missing data is referred to as NA (not available); the built-in Python value None is also treated as NA.
| Argument | Description |
| --- | --- |
| dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
| fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
| isnull | Return boolean values indicating which values are missing/NA. |
| notnull | Negation of isnull. |
In [8]: from numpy import nan as NA

In [9]: data = pd.Series([1, NA, 3.5, NA, 7])

In [10]: data.dropna()
Out[10]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [11]: data[data.notnull()]
Out[11]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....: [NA, NA, NA], [NA, 6.5, 3.]])

In [13]: cleaned = data.dropna()

In [14]: data
Out[14]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [15]: cleaned
Out[15]:
     0    1    2
0  1.0  6.5  3.0

In [16]: data.dropna(how='all')
Out[16]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

In [17]: data[4] = NA

In [18]: data
Out[18]:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [19]: data.dropna(axis='columns', how='all')
Out[19]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
With how='all', a row (or column) is dropped only when all of its values are NaN. The thresh parameter keeps only rows that contain at least that many non-NA values.
In [21]: df = pd.DataFrame(np.random.randn(7, 3))

In [22]: df.iloc[:4, 1] = NA

In [23]: df.iloc[:2, 2] = NA

In [24]: df
Out[24]:
          0         1         2
0 -0.843340       NaN       NaN
1 -1.305941       NaN       NaN
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [25]: df.dropna()
Out[25]:
          0         1         2
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [26]: df.dropna(thresh=2)
Out[26]:
          0         1         2
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058
fillna fills in missing values. You can fill each column with a different value, fill forward with the previous row's value, fill with the mean, and so on.
In [27]: df.fillna(0)
Out[27]:
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]:
          0         1         2
0 -0.843340  0.500000  0.000000
1 -1.305941  0.500000  0.000000
2  1.026378  0.500000  2.176567
3  0.048885  0.500000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [29]: _ = df.fillna(0, inplace=True)

In [30]: df
Out[30]:
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [31]: df = pd.DataFrame(np.random.randn(6, 3))

In [32]: df.iloc[2:, 1] = NA

In [33]: df.iloc[4:, 2] = NA

In [34]: df
Out[34]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679       NaN  1.272047
3  0.293133       NaN  0.273399
4  0.235116       NaN       NaN
5  1.365186       NaN       NaN

In [35]: df.fillna(method='ffill')
Out[35]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116  0.977842  0.273399
5  1.365186  0.977842  0.273399

In [36]: df.fillna(method='ffill', limit=2)
Out[36]:
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116       NaN  0.273399
5  1.365186       NaN  0.273399

In [37]: data = pd.Series([1., NA, 3.5, NA, 7])

In [38]: data.fillna(data.mean())
Out[38]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
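Mean filling works for DataFrames too: a minimal sketch (with made-up data), where each column is filled with its own mean rather than one global value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 2.0, 4.0]})

# frame.mean() computes each column's mean while skipping NaN;
# fillna aligns that Series on the column labels, so every
# column is filled with its own mean
filled = frame.fillna(frame.mean())
print(filled)
```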
| Argument | Description |
| --- | --- |
| value | Scalar value or dict-like object to use to fill missing values |
| method | Interpolation; by default 'ffill' if function called with no other arguments |
| axis | Axis to fill on; default axis=0 |
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |
 Data Transformation
 Removing Duplicates
In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....: 'k2': [1, 1, 2, 3, 3, 4, 4]})

In [40]: data
Out[40]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [41]: data.duplicated()
Out[41]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [42]: data.drop_duplicates()
Out[42]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

In [43]: data['v1'] = range(7)

In [44]: data.drop_duplicates(['k1'])
Out[44]:
    k1  k2  v1
0  one   1   0
1  two   1   1

In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]:
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Code for this article: https://github.com/xurongzhong/mobile_data/
Latest version of this article: http://t.cn/R8tJ9JH
 Transforming Data Using a Function or Mapping
import numpy as np

import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data
Out[5]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

lowercased = data['food'].str.lower()

lowercased
Out[8]:
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

data['animal'] = lowercased.map(meat_to_animal)

data
Out[10]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

 Replacing Values
In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [5]: data
Out[5]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [6]: data.replace(-999, np.nan)
Out[6]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [7]: data.replace([-999, -1000], np.nan)
Out[7]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

 Renaming Axis Indexes and Columns
In [2]: import pandas as pd

In [3]: import numpy as np

In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....: index=['Ohio', 'Colorado', 'New York'],
   ....: columns=['one', 'two', 'three', 'four'])

In [11]: data
Out[11]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [5]: data.replace(4, 40)
Out[5]:
          one  two  three  four
Ohio        0    1      2     3
Colorado   40    5      6     7
New York    8    9     10    11

In [12]: transform = lambda x: x[:4].upper()

In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [14]: data
Out[14]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [15]: data.index = data.index.map(transform)

In [16]: data
Out[16]:
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]:
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
Out[18]:
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [20]: data
Out[20]:
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11

 Discretization and Binning
The details are omitted here for now.
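Although the details are skipped, the core idea can be sketched with pd.cut, which slices continuous values into interval bins (the ages and bin edges below are made-up illustration data):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]  # made-up data
bins = [18, 25, 35, 60, 100]                              # chosen bin edges

# pd.cut returns a Categorical whose categories are the
# half-open intervals (18, 25], (25, 35], (35, 60], (60, 100]
cats = pd.cut(ages, bins)
print(cats.categories)
print(pd.Series(cats).value_counts())  # how many ages fall in each bin
```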
 String Manipulation
In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
   ...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [8]: data = pd.Series(data)

In [9]: data
Out[9]:
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [10]: data.isnull()
Out[10]:
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

In [11]: data.str.contains('gmail')
Out[11]:
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [14]: import re

In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]:
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)

In [17]: matches
Out[17]:
Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

In [18]: matches.str.get(1)
Out[18]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [19]: matches.str[0]
Out[19]:
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [20]: data.str[:5]
Out[20]:
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object

| Method | Description |
| --- | --- |
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return boolean array if each string contains pattern/regex |
| count | Count occurrences of pattern |
| extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
| endswith | Equivalent to x.endswith(pattern) for each element |
| startswith | Equivalent to x.startswith(pattern) for each element |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element (retrieve i-th element) |
| isalnum | Equivalent to built-in str.isalnum |
| isalpha | Equivalent to built-in str.isalpha |
| isdecimal | Equivalent to built-in str.isdecimal |
| isdigit | Equivalent to built-in str.isdigit |
| islower | Equivalent to built-in str.islower |
| isnumeric | Equivalent to built-in str.isnumeric |
| isupper | Equivalent to built-in str.isupper |
| join | Join strings in each element of the Series with passed separator |
| len | Compute length of each string |
| lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
| match | Use re.match with the passed regular expression on each element, returning matched groups as list |
| pad | Add whitespace to left, right, or both sides of strings |
| center | Equivalent to pad(side='both') |
| repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
| replace | Replace occurrences of pattern/regex with some other string |
| slice | Slice each string in the Series |
| split | Split strings on delimiter or regular expression |
| strip | Trim whitespace from both sides, including newlines |
| rstrip | Trim whitespace on right side |
| lstrip | Trim whitespace on left side |
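Several of the methods above compose naturally. As a small illustration (the messy strings are made up), splitting, element indexing, and stripping can be chained, with missing values propagating through each step:

```python
import pandas as pd

# Made-up messy strings; None stands in for a missing value
s = pd.Series(["  a, b", "c,  d", None])

# split -> list per element, get(0) -> first piece, strip -> trim spaces;
# the missing value stays NaN through the whole chain
parts = s.str.split(",").str.get(0).str.strip()
print(parts)
```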